Weights & Biases Judgement Day Hackathon: Come together to make the best LLM Judge - Devpost

- Log in
- Sign up

Join a hackathon

Devpost

Participate in our public hackathons

Hackathons Projects

Devpost for Teams

Access your company's private hackathons

Host a hackathon

Devpost

Grow your developer ecosystem and promote your platform

Host a public hackathon

Devpost for Teams

Drive innovation, collaboration, and retention within your organization

Host an internal hackathon

By use case

AI hackathons Customer hackathons Employee hackathons Public hackathons

Blog

Insights into hackathon planning and participation

Customer stories

Inspiration from peers and other industry leaders

Planning guides

Best practices for planning online and in-person hackathons

Webinars & events

Upcoming events and on-demand recordings

Help desk

Common questions and support documentation

Join a hackathon
- Devpost
  
  Participate in our public hackathons
  
  Hackathons Projects
  
  Devpost for Teams
  
  Access your company's private hackathons
  
  Login
Host a hackathon
- Devpost
  
  Grow your developer ecosystem and promote your platform
  
  Host a public hackathon
  
  Devpost for Teams
  
  Drive innovation, collaboration, and retention within your organization
  
  Host an internal hackathon
  
  By use case
  AI hackathons Customer hackathons Employee hackathons Public hackathons
Resources
- Blog
  
  Insights into hackathon planning and participation
  
  Customer stories
  
  Inspiration from peers and other industry leaders
  
  Planning guides
  
  Best practices for planning online and in-person hackathons
  
  Webinars & events
  
  Upcoming events and on-demand recordings
  
  Help desk
  
  Common questions and support documentation

Log in
Sign up

Descend

Overview
My projects
Participants (28)
Resources
Rules
Project gallery
Updates
Discussions

Connect with the participants – support your favorite projects by liking, sharing, and commenting on them.

Sort

kg-evaluator — kg-evaluator

evaluate how well LLM builds a knowledge graph

OptoPrompt — OptoPrompt

One-click prompt alignment for LLM judges via Weave and DSPy. Focused on easy user interface.

Let's Get Creative! — Let's Get Creative!

Exploratory data analysis looking at how creative LLMs are. Evaluating generated lists of names/words/ideas against each other

Pairwise Model Tests — Pairwise Model Tests

Weave has good support for scalar metrics, and comparing models by metrics. However, in some cases pairwise comparison of outputs is a better measure approach to model comparison.

Attack_prompt — Attack_prompt

Used DSPy and other tools to automate prompts for attack prompts, adversarial prompts, red teaming, and generating harmful intent prompts and prompt responses that get around LLM guardrails.

Judge Tuner — Judge Tuner

Mechanism for Concurrently Tuning Example Data and Metrics for Evaluation of a Model

CycleVal — CycleVal

Evalute LLM judgements with open success criteria

LBM — LBM

Large Bagging Model Optimize the best debate agent. Creating a pipeline to generate agents and recursively improve them based on human feedback and LLM evalutors.

LLM Idol — LLM Idol

Are you and your buddies tired of annotating and evaluating LLMs manually? Use LLMs as judges instead!

EvalGuard — EvalGuard

EvalGuard simplifies evaluation with auto fact-checking and domain-specific synthetic data generation with plug and play JSON. With Weave tracing, it ensures transparency—no more black-box processes

Lazy Evals — Lazy Evals

Lazy Evals: Don't wait—dominate! Instant AI evaluations blending speed and quality. Cut latency without sacrificing modularity. Supercharge your AI development. Get lazy, get ahead with Lazy Evals!

LLMonPy — LLMonPy

Build on Self taught evaluators paper (https://arxiv.org/abs/2408.02666) to generate a lot of training data for using an LLM to write an evaluation prompt

Pro-Bias: Self-Improving Human-Aligned Subject LLM Evals — Pro-Bias: Self-Improving Human-Aligned Subject LLM Evals

Creating human-aligned LLM judges through enhanced evaluation frameworks and self-improving automated grading systems.

Enhancing Preference Learning for Monte Carlo Tree Search — Enhancing Preference Learning for Monte Carlo Tree Search

Leveraging AI as a judge within Monte Carlo Tree Search, we automate reasoning evaluation, enabling faster and more accurate preference learning to optimize decision-making without human bias.

1 – 14 of 14

Devpost

About
Careers
Contact
Help

Hackathons

Browse hackathons
Explore projects
Host a hackathon
Hackathon guides

Portfolio

Your projects
Your hackathons
Settings

Connect

Twitter
Discord
Facebook
LinkedIn

© 2026 Devpost, Inc. All rights reserved.

Community guidelines
Security
CA notice
Privacy policy
Terms of service