LLM Idol

Inspiration

The project is an implementation of the paper "Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text": https://www.arxiv.org/abs/2408.09235

What it does

We use reference-guided evaluation with multiple LLMs to test whether LLM judges are useful for grading a candidate LLM. To do this, we take a majority vote across the judges and calculate kappa statistics to assess the inter-rater reliability between the user-provided answers and the LLM judges' answers.

How we built it

I use OpenRouter for both the candidate LLM and the judge LLMs, with a simple Python script and a specific prompt for the judges. The script loops through a random sample of 30 questions from the HotpotQA dataset. Each LLM judge is given the question, the candidate LLM's response, and the reference (correct) answer, and decides whether the candidate is correct.
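As a rough sketch of the sampling and judging steps described above, the helpers below draw a random sample of questions, build a judge prompt, and turn a judge's reply into a boolean verdict. The prompt wording, the function names, and the CORRECT/INCORRECT reply format are assumptions made for illustration, not the project's actual prompt or code, and the OpenRouter call itself is left out.

```python
import random

# Hypothetical prompt wording; the project's real judge prompt is not shown.
JUDGE_TEMPLATE = (
    "You are grading an answer to a trivia question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def sample_questions(dataset, k=30, seed=0):
    """Draw k random items from a list of questions (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(dataset, k)


def build_judge_prompt(question, reference, candidate):
    """Fill the judge prompt with the question, reference answer, and candidate answer."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )


def parse_verdict(reply):
    """Map a judge's reply to True (correct) or False (incorrect or unclear)."""
    text = reply.strip().upper()
    # Check INCORRECT first, because the string "INCORRECT" contains "CORRECT".
    if "INCORRECT" in text:
        return False
    return "CORRECT" in text
```

Each judge model would receive the output of `build_judge_prompt`, and its reply would go through `parse_verdict`. Treating an unclear reply as incorrect is a design choice in this sketch; it could also be logged and retried.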

I then take a majority vote across the judge LLMs to decide whether the candidate LLM is correct. Next, I implement a kappa statistics function and calculate Cohen's kappa, which measures how closely the judges' verdicts agree with the user-provided answers beyond what chance alone would produce.
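The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and the kappa follows the standard two-rater definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most common verdict among the judges.

    An odd number of judges avoids ties on binary verdicts.
    """
    return Counter(verdicts).most_common(1)[0][0]


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)

    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, multiply the raters' label rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n) for label in labels)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch treats that case as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

For example, with user-provided labels `[True, True, False, False]` and majority-vote labels `[True, True, False, True]`, the observed agreement is 0.75, the chance agreement is 0.5, and `cohen_kappa` returns 0.5.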

Challenges we ran into

Unfortunately, I kept getting rate-limited by OpenRouter because everyone in the hackathon was using the same models, so I couldn't use the models from the paper or replicate its exact results.

I also didn't have time to use multiple datasets, such as TriviaQA and the more complex TruthfulQA. Ideally, I would also use a sample larger than 30 to assess the judge LLMs.

Accomplishments that we're proud of

I've never implemented a paper before so that was cool.

What we learned

Rate limits suck! Also, kappa statistics are very useful for assessing LLMs as evaluators. Reference-guided evaluation is one way to get around human-in-the-loop evaluation, but the scores leave something to be desired if one were to rely on LLMs as evaluators.

What's next for LLM Idol

I'll probably write a blog post on my experience. Shout out to the W&B team for the cool workspace, food, and awesome connections I made!

Built With

python

Updates

Faris Habib started this project — Sep 22, 2024 04:51 PM EDT
