This code is janky and mostly AI-written, and the entire benchmark started as a joke.
The J.E.D.I. Benchmark is a multi-stage evaluation designed to test how Large Language Models (LLMs) reason about justice system design, moral dilemmas, historical bias, and adaptability.
It runs five structured stages:
- Definition of Justice — clarity, universality, ethical grounding.
- System Design — inclusivity, checks & balances, human rights compliance, practicality.
- Moral Dilemma — consistency, moral courage, reasoning depth.
- Historical Bias — bias awareness, source balancing, critical thinking.
- Adaptability — flexibility, procedural fairness, ethical continuity.
The benchmark produces per-stage scores and can also calculate:
- JQS — Justice Quality Score (overall ethical alignment)
- BRS — Bias Risk Score (likelihood of replicating injustice)
- AI — Adaptability Index (ability to evolve over time)
1. Clone the repository

   ```bash
   git clone https://github.com/alhaymex/jedi-benchmark.git
   cd jedi-benchmark
   ```

2. Install dependencies

   ```bash
   bun install
   ```

3. Set your OpenRouter API key

   You must have an OpenRouter account and API key. Create a `.env` file in the project root:

   ```
   OPENROUTER_API_KEY=your_api_key_here
   ```
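Bun loads `.env` into `process.env` automatically, so a fail-fast check at startup can save you from a half-finished run. A minimal sketch, assuming nothing about the repo's actual code (`requireApiKey` is a hypothetical helper):

```typescript
// Hypothetical fail-fast check (not part of the repo): verify the key exists
// before spending any tokens. Bun loads .env into process.env automatically.
export function requireApiKey(env: Record<string, string | undefined>): string {
  const key = env.OPENROUTER_API_KEY;
  if (!key || key.trim() === "") {
    throw new Error("OPENROUTER_API_KEY is not set; add it to your .env file");
  }
  return key;
}

// Usage at startup: const apiKey = requireApiKey(process.env);
```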
Running the benchmark will incur API costs based on:
- The models you have configured in `configs/modelConfig.ts`
- The evaluation model set in `configs/constants.ts`
- The `maxOutputTokens` value and the number of stages

💡 Tip: If you want to reduce costs:
- Remove expensive models from `configs/modelConfig.ts`
- Lower `maxOutputTokens` in `evaluateAllModels`
- Use a cheaper evaluation model in `configs/constants.ts`
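As a rough upper bound, the output-token bill scales with models × stages × `maxOutputTokens`. A back-of-the-envelope sketch; the price argument is illustrative, so look up real per-model rates on openrouter.ai:

```typescript
// Rough, back-of-the-envelope estimate; all numbers are illustrative.
// Assumes every model emits close to maxOutputTokens at every stage.
export function estimateOutputCostUSD(
  modelCount: number,
  stages: number,
  maxOutputTokens: number,
  pricePerMillionTokens: number, // check real per-model rates on openrouter.ai
): number {
  const totalTokens = modelCount * stages * maxOutputTokens;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}
```

For example, 4 models × 5 stages × 2000 tokens at $10 per million output tokens comes to roughly $0.40, before input-token and grading costs.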
Run the benchmark:

```bash
bun run test:benchmark
```

This will run all configured models through all 5 stages and save their responses in `data/evaluations/`.

Grade the responses:

```bash
bun run grade:benchmark
```

This will grade all saved responses using the evaluation model set in `configs/constants.ts` and save the scores in `data/grades/`.

Calculate the scores:

```bash
bun run scores
```

After grading, this calculates the Justice Quality Score, Bias Risk Score, and Adaptability Index for each model.

Score Definitions:
- JQS (Justice Quality Score) — Average of all ethical alignment metrics across all stages.
- BRS (Bias Risk Score) — Inverse score based on bias-related metrics (lower is better).
- AI (Adaptability Index) — Average of adaptability-related metrics (Stage 5 + relevant Stage 2/3 metrics).
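Based on the definitions above, the aggregation can be sketched as follows. This is a sketch, not the repo's actual scoring code: the metric keys are illustrative placeholders (the real keys come from the grader output in `data/grades/`), and grades are assumed to be on a 0–10 scale:

```typescript
// Illustrative aggregation of per-metric grades (assumed 0-10 scale) into the
// three headline scores. Metric key names here are placeholders.
type MetricScores = Record<string, number>;

const avg = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

export function computeHeadlineScores(
  metrics: MetricScores,
  ethicalKeys: string[],
  biasKeys: string[],
  adaptabilityKeys: string[],
) {
  const jqs = avg(ethicalKeys.map((k) => metrics[k])); // higher is better
  const brs = avg(biasKeys.map((k) => 10 - metrics[k])); // inverted, so lower is better
  const ai = avg(adaptabilityKeys.map((k) => metrics[k]));
  return { jqs, brs, ai };
}
```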
Edit:

```ts
// configs/modelConfig.ts
export const modelsToTest = [
  { name: "gpt-4o", model: "openai/gpt-4o" },
  { name: "gemini-25-flash-lite", model: "google/gemini-2.5-flash-lite" },
  // Add or remove models here
];
```

Edit:

```ts
// configs/constants.ts
export const EVALUATION_MODEL = "openai/gpt-5-chat"; // or any other model from OpenRouter
```

Output directories:
- `data/evaluations/` — Raw model responses for each stage.
- `data/grades/` — JSON score breakdowns for each model and stage.
- Prompting — Each stage has a carefully designed prompt in `prompts/` that forces the model to reason deeply.
- Evaluation — The grader model scores each response using a strict JSON-only output format.
- Context-Aware Grading — The grader sees previous stage answers to check for consistency.
- Scoring — Final scores are aggregated into JQS, BRS, and AI.
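Even with a strict JSON-only contract, models occasionally wrap their reply in a markdown fence, so defensive parsing helps. A hypothetical sketch, not the repo's actual parser:

```typescript
// Hypothetical defensive parser for JSON-only grader replies: strips a stray
// markdown code fence if the model added one, then parses and checks that
// every metric value is numeric.
export function parseGraderReply(raw: string): Record<string, number> {
  const cleaned = raw
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, "") // leading fence, e.g. "```json"
    .replace(/\s*`{3}$/, ""); // trailing fence
  const parsed = JSON.parse(cleaned) as Record<string, unknown>;
  for (const [key, value] of Object.entries(parsed)) {
    if (typeof value !== "number") {
      throw new Error(`Grader output malformed: metric "${key}" is not numeric`);
    }
  }
  return parsed as Record<string, number>;
}
```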
- Long, detailed answers may be cut off if `maxOutputTokens` is too low. Increase it in `evaluateAllModels` if needed.
- The grader will penalize truncated answers and inconsistencies between stages.
- Costs can grow quickly if you run many models with high token limits — monitor your OpenRouter usage.