Conversation
|
Also an example of the implemented G-Eval evaluation for an LLM reply to a request.
Tests config example:

```yaml
prompts:
  - >-
    How can I learn another language if I have little free time daily and I
    have poor memory?
providers:
  - id: openai:gpt-4o
    config:
      organization: ''
      temperature: 0.5
      max_tokens: 1024
      top_p: 1
      frequency_penalty: 0
      presence_penalty: 0
scenarios: []
tests:
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Coherence - the collective quality of all sentences. We align this
          dimension with the DUC quality question of structure and coherence,
          whereby "the reply should be well-structured and well-organized. The
          reply should not just be a heap of related information, but should
          build from sentence to sentence to a coherent body of information
          about a topic."
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Consistency - the factual alignment between the reply and the
          source. A factually consistent reply contains only statements that
          are entailed by the source document. Annotators were also asked to
          penalize replies that contained hallucinated facts.
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Fluency - the quality of the reply in terms of grammar, spelling,
          punctuation, word choice, and sentence structure.
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Relevance - selection of important content from the source. The
          reply should include only important information from the source
          document. Annotators were instructed to penalize replies which
          contained redundancies and excess information.
```
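To try the example locally, the config can be saved as e.g. `promptfooconfig.yaml` (the file name here is just an assumption; any path works) and run with promptfoo's CLI via `npx promptfoo@latest eval -c promptfooconfig.yaml`.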
|
Hi @typpo, could you please help with a review? |
(force-pushed from 7ab2523 to 7108008)
|
This is a great addition @schipiga! I'm noticing ~half the time the grader thinks that
Might be worth a look? |
|
Hi @typpo, thanks for the report. Yep, indeed, it looks like the LLM sometimes can't extract the required parts. I will try to add a clarification to the prompt. |
(force-pushed from 7108008 to 40e2b3d)
|
@typpo it looks like the minor prompt tweaks helped. I ran it a dozen times and didn't hit the issue. It may still happen occasionally, since we're interacting with an LLM, but I think it's rarer than before the update. Could you please review and check as well? |
(force-pushed from d1a9c8a to 4aa3739)
|
Hi @typpo, I think the prompt now works correctly. I also updated the test example to use multiple values (see the sketch below). Can you please review the PR?
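For reference, a minimal sketch of what the multi-value form might look like (assuming, as an illustration, that the g-eval `value` accepts a list of criteria and that an optional `threshold` applies to the resulting score):

```yaml
tests:
  - vars: {}
    assert:
      - type: g-eval
        # Assumption: several criteria can be listed and graded together.
        value:
          - Coherence - the reply should build sentence by sentence into a well-organized body of information.
          - Fluency - the reply should use correct grammar, spelling, punctuation, and word choice.
        # Assumption: the assertion passes when the grader's score meets this threshold.
        threshold: 0.7
```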
|
will take a look shortly!
|
(force-pushed from 4aa3739 to 500631d)
|
awesome change! thanks @schipiga |

Hi!
This is an attempt to implement the G-Eval LLM self-assessment approach in promptfoo: the model itself acts as the grader, turning a plain-language criterion into chain-of-thought evaluation steps and then scoring the reply against them.
It required a bit of code polish (that's done already), and I'm interested in whether the maintainers would like to include it in promptfoo, because as far as I know there was already an issue requesting this. It's inspired by:
This is an example of G-Eval in promptfoo.
I used the following criteria (the same four as in the config above): Coherence, Consistency, Fluency, and Relevance.
This gave the following G-Eval responses:
(The UI-rendered result isn't so informative for passed tests.)

More details about the prompts are in the matchers.ts code. Could you please review it?
Regards, Sergei