Inspiration

For many subjective measures of LLM performance, it is difficult for both humans and machines to produce meaningful scalar scores. For humans, expressing a preference between two options is often much easier than giving each option a 1-5 rating.
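Pairwise preferences can still be aggregated into per-model scalar measures. A minimal sketch (the model names and win-rate aggregation here are illustrative, not part of the project):

```python
from collections import defaultdict

def win_rates(preferences):
    """Aggregate pairwise preferences into per-option win rates.

    `preferences` is a list of (option_a, option_b, winner) tuples,
    where winner is one of the two options.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for a, b, winner in preferences:
        games[a] += 1
        games[b] += 1
        wins[winner] += 1
    return {option: wins[option] / games[option] for option in games}

# Hypothetical example: three human judgments over two model outputs
prefs = [
    ("bot_v1", "bot_v2", "bot_v2"),
    ("bot_v1", "bot_v2", "bot_v2"),
    ("bot_v2", "bot_v1", "bot_v1"),
]
print(win_rates(prefs))  # bot_v1 wins 1 of 3, bot_v2 wins 2 of 3
```

More principled aggregations (e.g. Bradley-Terry, as used by Chatbot Arena) exist, but even a raw win rate shows how comparisons become scores.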

What it does

This is a proof-of-concept starting point for exploring how to develop LLM judges using Weave and LabelStudio.

Challenges we ran into

  • Started by trying to develop a pairwise judge for chatbot arena data
    • Ran into several bugs in DSPy when trying to develop an LLM eval
    • Learned that optimizers for TypedPredictors have little actual effect, so no meaningful change was produced between the baseline and the optimized judge.
  • Re-oriented on the toy problem of a joke bot that writes jokes for topics
    • Ran into bugs in W&B SDK/docs when trying to re-use models to produce output datasets
  • Ran out of time for DSPy optimization of a pairwise judge using labeled comparison data
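The pairwise judge described above can be sketched without any framework; `call_llm` is a placeholder for whatever client actually answers the prompt (e.g. an OpenRouter-backed chat model), and the prompt wording is an assumption for illustration:

```python
JUDGE_PROMPT = """You are comparing two jokes about the same topic.
Topic: {topic}
Joke A: {joke_a}
Joke B: {joke_b}
Which joke is funnier? Answer with exactly "A" or "B"."""

def pairwise_judge(topic, joke_a, joke_b, call_llm):
    """Ask an LLM which of two jokes is better; returns 'A' or 'B'.

    `call_llm` is a stand-in: any callable mapping a prompt string
    to a completion string.
    """
    prompt = JUDGE_PROMPT.format(topic=topic, joke_a=joke_a, joke_b=joke_b)
    verdict = call_llm(prompt).strip().upper()
    if verdict not in ("A", "B"):
        raise ValueError(f"Judge gave an unparseable verdict: {verdict!r}")
    return verdict

# Stubbed model for demonstration: always prefers joke A.
stub = lambda prompt: " a "
print(pairwise_judge("cats", "Why did the cat...", "A cat walks into...", stub))
```

Frameworks like DSPy wrap this pattern in typed signatures; the value they were expected to add here was optimizing the prompt against labeled comparisons, which is where the issues above were hit.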

What's next for Pairwise Model Tests

  • More comparable JokeBot models
  • Fully labeled data
  • Optimize an LLM-as-Judge on the training evaluations and measure judge performance on a test set of human joke preferences
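The last step above amounts to measuring how often the judge agrees with held-out human labels. A minimal sketch, with `judge` standing in for the (hypothetical) optimized LLM judge:

```python
def judge_agreement(judge, test_set):
    """Fraction of held-out comparisons where the judge matches the human label.

    `test_set` is a list of (joke_a, joke_b, human_choice) tuples with
    human_choice in {"A", "B"}; `judge` maps (joke_a, joke_b) -> "A" or "B".
    """
    matches = sum(judge(a, b) == human for a, b, human in test_set)
    return matches / len(test_set)

# Toy check with a stub judge that always picks "A"
test_set = [("j1", "j2", "A"), ("j3", "j4", "B"), ("j5", "j6", "A")]
always_a = lambda a, b: "A"
print(judge_agreement(always_a, test_set))  # 2 of 3 comparisons match
```

A trivial baseline like `always_a` is worth reporting alongside the real judge, since class imbalance in the preference data can make raw agreement look better than it is.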

Built With

  • labelstudio
  • openrouter
  • weave