Inspiration

For many subjective measures of LLM performance, it is difficult for both humans and machines to produce meaningful scalar scores. For humans, expressing a preference between two options is often much easier than giving each option a 1-5 rating.
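Pairwise preferences can still be aggregated into per-model scalar measures. A minimal sketch (the model names and win-rate aggregation here are illustrative, not part of the project):

```python
from collections import defaultdict

def win_rates(preferences):
    """Aggregate pairwise preferences into per-option win rates.

    `preferences` is a list of (option_a, option_b, winner) tuples,
    where winner is one of the two options.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for a, b, winner in preferences:
        games[a] += 1
        games[b] += 1
        wins[winner] += 1
    return {option: wins[option] / games[option] for option in games}

# Hypothetical example: three human judgments over two model outputs
prefs = [
    ("bot_v1", "bot_v2", "bot_v2"),
    ("bot_v1", "bot_v2", "bot_v2"),
    ("bot_v2", "bot_v1", "bot_v1"),
]
print(win_rates(prefs))  # bot_v1 wins 1 of 3, bot_v2 wins 2 of 3
```

More principled aggregations (e.g. Bradley-Terry, as used by Chatbot Arena) exist, but even a raw win rate shows how comparisons become scores.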

What it does

This is a proof-of-concept starting point for exploring how to develop LLM judges using Weave and LabelStudio.

Challenges we ran into

  • Started by trying to develop a pairwise judge for chatbot arena data
    • Ran into several bugs in DSPy when trying to develop an LLM eval
    • Learned that optimizers for TypedPredictors have little actual effect, so no meaningful change was produced between the baseline and the optimized judge.
  • Re-oriented on the toy problem of a joke bot that writes jokes for topics
    • Ran into bugs in W&B SDK/docs when trying to re-use models to produce output datasets
  • Ran out of time for DSPy optimization of a pairwise judge using labeled comparison data
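The pairwise judge described above can be sketched without any framework; `call_llm` is a placeholder for whatever client actually answers the prompt (e.g. an OpenRouter-backed chat model), and the prompt wording is an assumption for illustration:

```python
JUDGE_PROMPT = """You are comparing two jokes about the same topic.
Topic: {topic}
Joke A: {joke_a}
Joke B: {joke_b}
Which joke is funnier? Answer with exactly "A" or "B"."""

def pairwise_judge(topic, joke_a, joke_b, call_llm):
    """Ask an LLM which of two jokes is better; returns 'A' or 'B'.

    `call_llm` is a stand-in: any callable mapping a prompt string
    to a completion string.
    """
    prompt = JUDGE_PROMPT.format(topic=topic, joke_a=joke_a, joke_b=joke_b)
    verdict = call_llm(prompt).strip().upper()
    if verdict not in ("A", "B"):
        raise ValueError(f"Judge gave an unparseable verdict: {verdict!r}")
    return verdict

# Stubbed model for demonstration: always prefers joke A.
stub = lambda prompt: " a "
print(pairwise_judge("cats", "Why did the cat...", "A cat walks into...", stub))
```

Frameworks like DSPy wrap this pattern in typed signatures; the value they were expected to add here was optimizing the prompt against labeled comparisons, which is where the issues above were hit.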

What's next for Pairwise Model Tests

  • More comparable JokeBot models
  • Fully labeled data
  • Optimize an LLM-as-Judge on the training evaluations and measure judge performance on a test set of human joke preferences
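The last step above amounts to measuring how often the judge agrees with held-out human labels. A minimal sketch, with `judge` standing in for the (hypothetical) optimized LLM judge:

```python
def judge_agreement(judge, test_set):
    """Fraction of held-out comparisons where the judge matches the human label.

    `test_set` is a list of (joke_a, joke_b, human_choice) tuples with
    human_choice in {"A", "B"}; `judge` maps (joke_a, joke_b) -> "A" or "B".
    """
    matches = sum(judge(a, b) == human for a, b, human in test_set)
    return matches / len(test_set)

# Toy check with a stub judge that always picks "A"
test_set = [("j1", "j2", "A"), ("j3", "j4", "B"), ("j5", "j6", "A")]
always_a = lambda a, b: "A"
print(judge_agreement(always_a, test_set))  # 2 of 3 comparisons match
```

A trivial baseline like `always_a` is worth reporting alongside the real judge, since class imbalance in the preference data can make raw agreement look better than it is.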

Built With

  • labelstudio
  • openrouter
  • weave