Inspiration
For many subjective measures of LLM performance, it is difficult for humans and machines to produce meaningful scalar scores. For humans, expressing a preference between two options is often much easier than assigning each option a 1-5 rating.
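The pairwise idea can be sketched in plain Python: collect (winner, loser) preference judgments and derive a ranking from win rates instead of asking for absolute scores. The model names and judgments below are hypothetical, purely for illustration.

```python
from collections import defaultdict

def rank_by_win_rate(comparisons):
    """Rank models from pairwise preference judgments.

    comparisons: list of (winner, loser) name pairs.
    Returns model names sorted by win rate, best first.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return sorted(total, key=lambda m: wins[m] / total[m], reverse=True)

# Hypothetical preference labels: (preferred model, other model)
judgments = [("joke-bot-a", "joke-bot-b"),
             ("joke-bot-a", "joke-bot-c"),
             ("joke-bot-c", "joke-bot-b")]
print(rank_by_win_rate(judgments))  # → ['joke-bot-a', 'joke-bot-c', 'joke-bot-b']
```

A simple win rate is the crudest aggregation; with enough comparisons, a Bradley-Terry or Elo-style model gives better-calibrated rankings from the same pairwise data.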
What it does
This is a proof-of-concept starting point for exploring how to develop LLM judges using Weave and LabelStudio.
Challenges we ran into
- Started by trying to develop a pairwise judge for chatbot arena data
- Ran into several bugs in DSPy when trying to develop an LLM eval
- Learned that DSPy optimizers have little effect on TypedPredictors, so no meaningful change was produced between the baseline and the optimized judge
- Re-oriented on the toy problem of a joke bot that writes jokes for topics
- Ran into bugs in W&B SDK/docs when trying to re-use models to produce output datasets
- Ran out of time for DSPy optimization of a pairwise judge using labelled comparison data
What's next for Pairwise Model Tests
- More comparable JokeBot models
- Fully labeled data
- Optimize an LLM-as-Judge off of the training evaluations and evaluate judge performance on test set of human joke preferences
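The last step above, evaluating judge performance on a held-out test set, reduces to measuring how often the judge's pairwise pick matches the human's. A minimal sketch, with hypothetical "A"/"B" labels standing in for which joke was preferred in each pair:

```python
def judge_agreement(judge_picks, human_picks):
    """Fraction of pairwise comparisons where the LLM judge
    picked the same winner as the human annotator."""
    assert len(judge_picks) == len(human_picks) and judge_picks
    matches = sum(j == h for j, h in zip(judge_picks, human_picks))
    return matches / len(judge_picks)

# Hypothetical test-set labels: "A" or "B" = which joke won each pair
judge = ["A", "B", "A", "A"]
human = ["A", "B", "B", "A"]
print(judge_agreement(judge, human))  # → 0.75
```

Comparing this agreement rate before and after optimization is one way to tell whether the optimized judge actually tracks human joke preferences better than the baseline.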
Built With
- labelstudio
- openrouter
- weave