Inspiration
Recently I watched a video about how betting markets often outperform field experts. One story in particular stood out to me: a contest at a fair where people guessed the weight of an ox. Around 800 people guessed, and although the individual guesses varied wildly, the average of all the guesses was just one pound off from the ox's actual weight. This story sat in the back of my mind, and coming into the datathon I knew I wanted to experiment with using multiple models with varying levels of knowledge to mimic that effect. Then I came across the LMSYS dataset, and it clicked. In the chatbot arena, users from all sorts of backgrounds evaluate language model responses. They each bring different perspectives and intuitions, so although no single user's judgement is perfect, together their preferences create an emergent truth. This felt very similar to the ox story: individual judgements may be noisy, but when aggregated, they can reveal surprising accuracy. So I decided to design my model the same way.
What it does
The ensemble is composed of 50 XGBoost classifiers. Each model is trained differently: on a different slice of the dataset, with a different subset of features, and with different hyperparameters. This gives each one a unique perspective; some focus on structure, some on content overlap, and some on similarity to the prompt. Together, they form a system that simulates human judgement. To demonstrate this judgement, I decided to flip the idea of a chatbot arena: instead of humans judging model outputs, models now judge human-written ones. This visualization helps show how the system reaches a decision, with each classifier acting as an independent voter.
How I built it
I used pandas, NumPy, NLTK, and PyTorch with sentence-transformers to generate features for the dataset, XGBoost for the model itself, and scikit-learn to split the data. The backend is a Python Flask API, and the frontend is built with React and Next.js, styled with Tailwind CSS.
Challenges I ran into
Like all data science projects, this one came down to two main challenges: time and accuracy. I spent the first 8 hours of the hackathon iterating through numerous features and models, from logistic regression to tiny LLMs. However, every solution was either only slightly better than a coin flip or needed over a day to train. In fact, to honor this struggle, I have decided to keep my model graveyard in the repository: a folder containing most of my failed ideas (a few made me so mad that I simply deleted them).
Accomplishments that I'm proud of
I'm proud that I was able to push through the initial discouragement of constantly waiting 20 minutes for a model to train, only for it to perform 2% better than random chance. I'm also proud of the accuracy I was able to achieve, especially since I only used discriminative models. Finally, I'm proud that I trusted myself to enter this competition solo; win or lose, I know this experience will greatly benefit my future endeavors.
What I learned
The biggest thing this project taught me was how to think holistically about a problem. When I first opened the dataset, I immediately started combing through it for patterns (correlations, word counts, overlaps) without fully considering what the labels actually represented. I had completely forgotten to ask a fundamental question: how is a winner actually chosen? A response isn't preferred just because it's longer or has more keywords. There's a lot of nuance to it: factors such as helpfulness, tone, and clarity. If I had paused earlier to think about how users evaluate chatbot responses, I would have arrived at the idea of using LLMs sooner. Of course, LLMs ended up not working out due to extreme training times, but I wish I had stumbled on that conclusion earlier than I did. Machine learning isn't just about crunching data; it's about grasping the human context behind it.
What's next for modelcitizens
Honestly, I think the name is way too cool to just end here. This definitely has me thinking about building my own machine learning library, something ensemble-based. Maybe an army of LLMs is next...