This code is janky and mostly AI-written, and the entire benchmark started as a joke.
The J.E.D.I. Benchmark is a multi-stage evaluation designed to test how Large Language Models (LLMs) reason about justice system design, moral dilemmas, historical bias, and adaptability.
It runs five structured stages:
- Definition of Justice — clarity, universality, ethical grounding.
- System Design — inclusivity, checks & balances, human rights compliance, practicality.
- Moral Dilemma — consistency, moral courage, reasoning depth.
- Historical Bias — bias awareness, source balancing, critical thinking.
- Adaptability — flexibility, procedural fairness, ethical continuity.
The benchmark produces per-stage scores and can also calculate:
- JQS — Justice Quality Score (overall ethical alignment)
- BRS — Bias Risk Score (likelihood of replicating injustice)
- AI — Adaptability Index (ability to evolve over time)
1. Clone the repository

   ```bash
   git clone https://github.com/alhaymex/jedi-benchmark.git
   cd jedi-benchmark
   ```

2. Install dependencies

   ```bash
   bun install
   ```

3. Set your OpenRouter API key

   You must have an OpenRouter account and API key. Create a `.env` file in the project root:

   ```
   OPENROUTER_API_KEY=your_api_key_here
   ```
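Bun loads `.env` into `process.env` automatically, so a fail-fast check at startup can save you from a half-finished run. A minimal sketch, assuming nothing about the repo's actual code (`requireApiKey` is a hypothetical helper):

```typescript
// Hypothetical fail-fast check (not part of the repo): verify the key exists
// before spending any tokens. Bun loads .env into process.env automatically.
export function requireApiKey(env: Record<string, string | undefined>): string {
  const key = env.OPENROUTER_API_KEY;
  if (!key || key.trim() === "") {
    throw new Error("OPENROUTER_API_KEY is not set; add it to your .env file");
  }
  return key;
}

// Usage at startup: const apiKey = requireApiKey(process.env);
```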
Running the benchmark will incur API costs based on:
- The models you have configured in `configs/modelConfig.ts`
- The evaluation model set in `configs/constants.ts`
- The `maxOutputTokens` value and the number of stages

💡 Tip: If you want to reduce costs:
- Remove expensive models from `configs/modelConfig.ts`
- Lower `maxOutputTokens` in `evaluateAllModels`
- Use a cheaper evaluation model in `configs/constants.ts`
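As a rough upper bound, the output-token bill scales with models × stages × `maxOutputTokens`. A back-of-the-envelope sketch; the price argument is illustrative, so look up real per-model rates on openrouter.ai:

```typescript
// Rough, back-of-the-envelope estimate; all numbers are illustrative.
// Assumes every model emits close to maxOutputTokens at every stage.
export function estimateOutputCostUSD(
  modelCount: number,
  stages: number,
  maxOutputTokens: number,
  pricePerMillionTokens: number, // check real per-model rates on openrouter.ai
): number {
  const totalTokens = modelCount * stages * maxOutputTokens;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}
```

For example, 4 models × 5 stages × 2000 tokens at $10 per million output tokens comes to roughly $0.40, before input-token and grading costs.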
Run the benchmark:

```bash
bun run test:benchmark
```

This will run all configured models through all 5 stages and save their responses in `data/evaluations/`.

Grade the responses:

```bash
bun run grade:benchmark
```

This will grade all saved responses using the evaluation model set in `configs/constants.ts` and save the scores in `data/grades/`.

Calculate the scores:

```bash
bun run scores
```

After grading, this calculates the Justice Quality Score, Bias Risk Score, and Adaptability Index for each model.

Score Definitions:
- JQS (Justice Quality Score) — Average of all ethical alignment metrics across all stages.
- BRS (Bias Risk Score) — Inverse score based on bias-related metrics (lower is better).
- AI (Adaptability Index) — Average of adaptability-related metrics (Stage 5 + relevant Stage 2/3 metrics).
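Based on the definitions above, the aggregation can be sketched as follows. This is a sketch, not the repo's actual scoring code: the metric keys are illustrative placeholders (the real keys come from the grader output in `data/grades/`), and grades are assumed to be on a 0–10 scale:

```typescript
// Illustrative aggregation of per-metric grades (assumed 0-10 scale) into the
// three headline scores. Metric key names here are placeholders.
type MetricScores = Record<string, number>;

const avg = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

export function computeHeadlineScores(
  metrics: MetricScores,
  ethicalKeys: string[],
  biasKeys: string[],
  adaptabilityKeys: string[],
) {
  const jqs = avg(ethicalKeys.map((k) => metrics[k])); // higher is better
  const brs = avg(biasKeys.map((k) => 10 - metrics[k])); // inverted, so lower is better
  const ai = avg(adaptabilityKeys.map((k) => metrics[k]));
  return { jqs, brs, ai };
}
```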
Edit:

```ts
// configs/modelConfig.ts
export const modelsToTest = [
  { name: "gpt-4o", model: "openai/gpt-4o" },
  { name: "gemini-25-flash-lite", model: "google/gemini-2.5-flash-lite" },
  // Add or remove models here
];
```

Edit:

```ts
// configs/constants.ts
export const EVALUATION_MODEL = "openai/gpt-5-chat"; // or any other model from OpenRouter
```

Output directories:
- `data/evaluations/` — Raw model responses for each stage.
- `data/grades/` — JSON score breakdowns for each model and stage.
- Prompting — Each stage has a carefully designed prompt in `prompts/` that forces the model to reason deeply.
- Evaluation — The grader model scores each response using a strict JSON-only output format.
- Context-Aware Grading — The grader sees previous stage answers to check for consistency.
- Scoring — Final scores are aggregated into JQS, BRS, and AI.
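Even with a strict JSON-only contract, models occasionally wrap their reply in a markdown fence, so defensive parsing helps. A hypothetical sketch, not the repo's actual parser:

```typescript
// Hypothetical defensive parser for JSON-only grader replies: strips a stray
// markdown code fence if the model added one, then parses and checks that
// every metric value is numeric.
export function parseGraderReply(raw: string): Record<string, number> {
  const cleaned = raw
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, "") // leading fence, e.g. "```json"
    .replace(/\s*`{3}$/, ""); // trailing fence
  const parsed = JSON.parse(cleaned) as Record<string, unknown>;
  for (const [key, value] of Object.entries(parsed)) {
    if (typeof value !== "number") {
      throw new Error(`Grader output malformed: metric "${key}" is not numeric`);
    }
  }
  return parsed as Record<string, number>;
}
```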
- Long, detailed answers may be cut off if `maxOutputTokens` is too low. Increase it in `evaluateAllModels` if needed.
- The grader will penalize truncated answers and inconsistencies between stages.
- Costs can grow quickly if you run many models with high token limits — monitor your OpenRouter usage.