Conversation
Pull Request Overview
Adds a new gh models eval command to run evaluators defined in .prompt.yml files, along with built-in LLM evaluators and sample fixtures for testing.
- Introduces prompt parsing and templating (`LoadFromFile`, `TemplateString`) in `pkg/prompt`
- Implements the `eval` subcommand under `cmd/eval` and registers it in the root command
- Provides built-in LLM evaluators (`builtins.go`) and dozens of YAML fixtures plus comprehensive tests
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/prompt/prompt.go | Adds File, LoadFromFile, and simple TemplateString logic |
| pkg/prompt/prompt_test.go | Tests loading/parsing, templating, and error cases |
| cmd/root.go | Registers the new eval subcommand |
| cmd/root_test.go | Updates root help tests to include eval |
| cmd/run/run.go | Switches run command to use the shared prompt loader and templater |
| cmd/eval/… | Implements the eval command handler, tests, and built-in evaluators |
| fixtures/*.yml | Sample and test data for string and plugin evaluators |
Comments suppressed due to low confidence (3)
cmd/eval/builtins.go:1
- [nitpick] The builtins.go file contains very large inline prompt definitions, which can be hard to navigate and maintain. Consider extracting these templates into separate resource files or moving each evaluator into its own file to improve readability and manageability.
package eval
pkg/prompt/prompt.go:82
- The doc comment for TemplateString is minimal. It would help to explain edge-case behavior (e.g., handling of non-map data or unresolved placeholders) so future maintainers know exactly what to expect.
// TemplateString templates a string with the given data using simple {{variable}} replacement
pkg/prompt/prompt_test.go:11
- We don't have a test verifying how `ModelParameters.TopP` is handled when it's omitted from the YAML. Adding a case for missing optional fields would ensure consistent behavior.
func TestPromptFile(t *testing.T) {
Can you support a machine-friendly output like JSON?
cheshire137
left a comment
Thanks for all the test coverage!
```go
var role azuremodels.ChatMessageRole
switch strings.ToLower(msg.Role) {
case "system":
	role = azuremodels.ChatMessageRoleSystem
case "user":
	role = azuremodels.ChatMessageRoleUser
case "assistant":
	role = azuremodels.ChatMessageRoleAssistant
default:
	return nil, fmt.Errorf("unknown message role: %s", msg.Role)
}
```
Could put this into a function like getAzureChatMessageRole(msg.Role).
```go
req := azuremodels.ChatCompletionOptions{
	Messages: messages,
	Model:    h.evalFile.Model,
	Stream:   false,
}

// Apply model parameters
if h.evalFile.ModelParameters.MaxTokens != nil {
	req.MaxTokens = h.evalFile.ModelParameters.MaxTokens
}
if h.evalFile.ModelParameters.Temperature != nil {
	req.Temperature = h.evalFile.ModelParameters.Temperature
}
if h.evalFile.ModelParameters.TopP != nil {
	req.TopP = h.evalFile.ModelParameters.TopP
}
```
You use `h.evalFile` a lot here. What about adding a method on the `EvalFile` type like `BuildChatCompletionOptions`? Then this could be:

```go
req := h.evalFile.BuildChatCompletionOptions(messages, false)
```
Deborah-Digges
left a comment
🚢 This looks great, thanks for all the hard work here! 🙇‍♀️
Thank you for the reviews! @cheshire137, I'm going to address your points in a new PR, given the size (and the fact that I'm going to immediately follow up with a new PR to do JSON mode).
Adds a `gh models eval` command that runs evaluators based on the test data and evaluators in a `.prompt.yml` file (https://docs.github.com/en/github-models/use-github-models/storing-prompts-in-github-repositories). It supports string evaluators and the set of built-in evaluators supported in the Models prompting UI.
As part of this change, I've extracted the prompt-yml-parsing code into a common package.
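For reference, a minimal sketch of the kind of `.prompt.yml` file the command consumes. The field layout follows the GitHub Models prompt-file format linked above, but the specific model, test data, and evaluator here are invented for illustration:

```yaml
name: Sample evaluation
description: Checks that the model's answer mentions the expected keyword
model: openai/gpt-4o-mini
testData:
  - input: "What is the capital of France?"
    expected: "Paris"
messages:
  - role: system
    content: You are a concise assistant.
  - role: user
    content: "{{input}}"
evaluators:
  - name: mentions-expected
    string:
      contains: "{{expected}}"
```

Running `gh models eval` against a file like this would template each `testData` row into the messages, call the model, and score the response with the listed evaluators.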
Here's an example GitHub Action that runs this on new PRs: