evals-test

Local workspace for running and analysing next-evals-oss, Vercel's eval framework for benchmarking AI coding agents on Next.js tasks.

Repository layout

agent-eval/          ← @vercel/agent-eval source (forked / patched locally)
  packages/
    agent-eval/      ← core library: runner, parsers, CLI
    playground/      ← web viewer for results
next-evals-oss/      ← eval definitions, experiments, scripts, results

DeepSeek and Qwen native API support ← KEY ADDITION

One of the main things we added is support for DeepSeek and Qwen models using their own native API keys — no Vercel AI Gateway required.

Both agents use the OpenCode CLI as the coding agent backend. OpenCode is configured at runtime with a provider-specific opencode.json that points to the respective OpenAI-compatible endpoint and injects the API key from the environment.

DeepSeek

| Item | Value |
| --- | --- |
| Agent name | `deepseek` |
| Implementation | `agent-eval/packages/agent-eval/src/lib/agents/deepseek.ts` |
| API endpoint | `https://api.deepseek.com` |
| Env var | `DEEPSEEK_API_KEY` |
| Default model | `deepseek/deepseek-chat` |
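Concretely, the agent writes an `opencode.json` along these lines. This is a hedged sketch following OpenCode's provider-config format; the actual file generated by `deepseek.ts` may contain additional fields:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "deepseek": {
      "options": {
        "baseURL": "https://api.deepseek.com",
        "apiKey": "{env:DEEPSEEK_API_KEY}"
      }
    }
  }
}
```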

Qwen (Alibaba Cloud DashScope)

DashScope is not a built-in OpenCode provider. We register it as a custom OpenAI-compatible provider via @ai-sdk/openai-compatible, using the international DashScope endpoint.

| Item | Value |
| --- | --- |
| Agent name | `qwen` |
| Implementation | `agent-eval/packages/agent-eval/src/lib/agents/qwen.ts` |
| API endpoint | `https://dashscope-intl.aliyuncs.com/compatible-mode/v1` |
| Env var | `DASHSCOPE_API_KEY` |
| Default model | `dashscope/qwen3.5-plus` |

Supported Qwen models (passed via --model):

| Model ID | Name |
| --- | --- |
| `dashscope/qwen3.5-plus` | Qwen3.5 Plus |
| `dashscope/qwen3.5-flash` | Qwen3.5 Flash |
| `dashscope/qwen3.5-397b-a17b` | Qwen3.5 397B A17B |
| `dashscope/qwen3.5-122b-a10b` | Qwen3.5 122B A10B |
| `dashscope/qwen3.5-27b` | Qwen3.5 27B |
| `dashscope/qwen3.5-35b-a3b` | Qwen3.5 35B A3B |
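Since DashScope is not built in, the generated `opencode.json` also carries the `@ai-sdk/openai-compatible` wiring. A hedged sketch in OpenCode's custom-provider format (the exact contents written by `qwen.ts` may differ, and only two of the models are shown):

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "dashscope": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "DashScope",
      "options": {
        "baseURL": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        "apiKey": "{env:DASHSCOPE_API_KEY}"
      },
      "models": {
        "qwen3.5-plus": { "name": "Qwen3.5 Plus" },
        "qwen3.5-flash": { "name": "Qwen3.5 Flash" }
      }
    }
  }
}
```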

Required .env keys

Create a .env file in next-evals-oss/ (or export variables in your shell) before running any experiments:

# Required for DeepSeek experiments
DEEPSEEK_API_KEY=sk-...

# Required for Qwen experiments
DASHSCOPE_API_KEY=sk-...

⚠️ Without the correct API key the eval run will fail immediately when OpenCode tries to authenticate with the provider.

How it works

  1. The `deepseek` / `qwen` agent spins up a sandbox and installs `opencode-ai` globally inside it.
  2. It writes an `opencode.json` to the sandbox with the provider config and `{env:DEEPSEEK_API_KEY}` / `{env:DASHSCOPE_API_KEY}` placeholders.
  3. It calls `opencode run <prompt> --model <model> --format json`, passing the API key in the environment (sketched below this list).
  4. The JSON transcript (JSONL) is captured from stdout and parsed by the existing OpenCode parser to extract tool calls, messages, and token usage.
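In code, the flow amounts to roughly the following. This is an illustrative sketch, not the actual `deepseek.ts` / `qwen.ts` source; the `Sandbox` interface and function name are hypothetical stand-ins for the real agent-eval sandbox API:

```ts
// Illustrative sketch of the four steps above. `Sandbox` and its methods
// are hypothetical stand-ins for the real agent-eval sandbox API.
interface Sandbox {
  exec(cmd: string, opts?: { env?: Record<string, string> }): Promise<{ stdout: string }>;
  writeFile(path: string, contents: string): Promise<void>;
}

async function runOpenCodeAgent(
  sandbox: Sandbox,
  prompt: string,
  model: string,          // e.g. "deepseek/deepseek-chat"
  providerConfig: object, // the opencode.json content shown earlier
  envVar: string,         // "DEEPSEEK_API_KEY" or "DASHSCOPE_API_KEY"
): Promise<string> {
  // 1. Install the OpenCode CLI globally inside the sandbox
  await sandbox.exec("npm install -g opencode-ai");

  // 2. Write the provider config; the {env:...} placeholder is resolved
  //    by OpenCode at runtime
  await sandbox.writeFile("opencode.json", JSON.stringify(providerConfig, null, 2));

  // 3. Run the prompt, passing the API key through the environment
  //    (naive shell quoting, for illustration only)
  const { stdout } = await sandbox.exec(
    `opencode run ${JSON.stringify(prompt)} --model ${model} --format json`,
    { env: { [envVar]: process.env[envVar] ?? "" } },
  );

  // 4. stdout is the JSONL transcript consumed by the OpenCode parser
  return stdout;
}
```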

⚠️ IMPORTANT — Docker only, Vercel cloud untested

All DeepSeek and Qwen evaluations were run with the Docker sandbox (`sandbox: "docker"` in the experiment config, or with `VERCEL_SANDBOX_TOKEN` left empty or unset so the runner falls back to Docker automatically).

There is no guarantee that these agents work correctly when the Vercel cloud sandbox is used instead. Specifically:

  • The Vercel sandbox may not allow outbound HTTPS connections to api.deepseek.com or dashscope-intl.aliyuncs.com.
  • The OpenCode CLI global install step (npm install -g opencode-ai) may behave differently in the Vercel runtime.

Changes made to agent-eval

Token tracking (packages/agent-eval/src/lib/o11y/)

The upstream package had no token/cost tracking. The following files were modified to add it:

| File | Change |
| --- | --- |
| `types.ts` | Added `TokenUsage` interface (`inputTokens`, `outputTokens`, `reasoningTokens`, `cacheReadTokens`, `cacheWriteTokens`, `totalTokens`, `cost?`); added `token_usage` event type; added `tokenUsage?: TokenUsage` to `TranscriptEvent` and `TranscriptSummary` |
| `parsers/opencode.ts` | Extracts `step_finish` events; covers Qwen, DeepSeek, Devstral, Minimax (all OpenCode-based agents) |
| `parsers/claude-code.ts` | Extracts `message.usage` fields from Claude Code transcripts |
| `parsers/gemini.ts` | Extracts `step_finish` metadata; covers the Gemini CLI agent |
| `parsers/index.ts` | `generateSummary()` now aggregates all `token_usage` events into a single `tokenUsage` object on the summary |
| `index.ts` | Exports the `TokenUsage` type |
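Reconstructed from the field list above, the added interface looks approximately like this (see `types.ts` for the authoritative declaration):

```ts
// Approximate shape of the added interface; reconstructed from the table
// above. See types.ts for the authoritative declaration.
export interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  reasoningTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
  cost?: number; // optional; Qwen always reports 0 (see the note below)
}
```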

After parsing, each result.json file automatically gets o11y.tokenUsage populated for all new runs.

Token availability by agent:

| Agent | Input | Output | Reasoning | Cache Read | Cache Write | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| OpenCode (Qwen, DeepSeek, Devstral, Minimax) | | | | | | ✅* |
| Claude Code | | | | | | |
| Gemini CLI | | | | | | |
| Codex / GPT | | | | | | |
| Cursor | | | | | | |

\* Qwen always reports `cost: 0`.

Build

cd agent-eval/packages/agent-eval
npm run build        # outputs to dist/

Changes made to next-evals-oss

Problem: installed package vs local source

next-evals-oss runs evals via the installed node_modules/@vercel/agent-eval (v0.7.0), not the local source. After any agent-eval code change, the dist must be rebuilt and copied over.

This is now automated via postinstall.


New / modified scripts

`scripts/postinstall.ts` (NEW)

Builds agent-eval and copies dist/ to node_modules/@vercel/agent-eval/dist/.

Runs automatically on npm install. Can also be triggered manually:

npx tsx scripts/postinstall.ts

Registered in package.json:

"postinstall": "tsx scripts/postinstall.ts"

⚠️ Every npm install re-runs this automatically. If you skip this step after modifying agent-eval sources, the changes will not be active during eval runs.
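Internally the script boils down to a build plus a copy, roughly as follows. This is a sketch assuming the sibling-directory layout shown under "Repository layout"; the real script may differ:

```ts
// Sketch of scripts/postinstall.ts, assuming agent-eval and next-evals-oss
// are sibling directories and the script runs from next-evals-oss/.
import { execSync } from "node:child_process";
import { cpSync, rmSync } from "node:fs";
import { join } from "node:path";

const pkgDir = join(process.cwd(), "..", "agent-eval", "packages", "agent-eval");
const target = join(process.cwd(), "node_modules", "@vercel", "agent-eval", "dist");

// Rebuild the local agent-eval sources...
execSync("npm run build", { cwd: pkgDir, stdio: "inherit" });

// ...then replace the installed package's dist/ with the fresh build.
rmSync(target, { recursive: true, force: true });
cpSync(join(pkgDir, "dist"), target, { recursive: true });
```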


`scripts/retro-extract-tokens.ts` (NEW)

Retroactively extracts token usage from existing `transcript-raw.jsonl` files and writes `tokenUsage` into `result.json` and `transcript.json`. Use it for runs made before token tracking was added.

# All experiments
npx tsx scripts/retro-extract-tokens.ts

# Specific experiment(s)
npx tsx scripts/retro-extract-tokens.ts qwen35
npx tsx scripts/retro-extract-tokens.ts qwen35 deepseek-v3.2

# Dry-run (no writes)
npx tsx scripts/retro-extract-tokens.ts --dry-run

The experiment → agent mapping is defined in `EXPERIMENT_AGENT_MAP` at the top of the file; add new experiments there when needed.
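For the experiments mentioned in this README, its entries would look something like this (illustrative, not the file's actual contents):

```ts
// Example entries only; the real map lives at the top of
// scripts/retro-extract-tokens.ts.
const EXPERIMENT_AGENT_MAP: Record<string, string> = {
  "qwen35": "qwen",
  "qwen35-flash": "qwen",
  "deepseek-v3.2": "deepseek",
};
```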

ℹ️ Runs that have no transcript-raw.jsonl (e.g. very old runs or Cursor) will be skipped silently.


`scripts/export-results.ts` (MODIFIED)

Reads results from results/<experiment>/ and produces agent-results.json.

Changes from upstream:

  • Added `tokenUsage: TokenUsage | "n/a"` field to each `AgentResult`
  • Reads `o11y.tokenUsage` from `run-1/result.json`
  • `MODEL_NAMES` map updated: `gemini-3-pro-preview-gemini-cli` → "Gemini 3.0 Pro Preview (Gemini CLI)", to distinguish it from the OpenCode-based variant

npx tsx scripts/export-results.ts              # all experiments
npx tsx scripts/export-results.ts qwen35       # specific

Output: agent-results.json
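The token lookup described above amounts to reading one nested field per run. A hedged sketch (the `results/<experiment>/<eval>/run-1` path layout is inferred from this README, not confirmed against the script):

```ts
// Hedged sketch of the per-run token lookup in export-results.ts.
// TokenUsage is exported from the package (see the token-tracking changes).
import { readFileSync } from "node:fs";
import type { TokenUsage } from "@vercel/agent-eval";

function readTokenUsage(resultJsonPath: string): TokenUsage | "n/a" {
  try {
    const result = JSON.parse(readFileSync(resultJsonPath, "utf8"));
    // Falls back to "n/a" when o11y.tokenUsage was never populated
    return result.o11y?.tokenUsage ?? "n/a";
  } catch {
    return "n/a"; // missing or unreadable result.json
  }
}

// e.g. readTokenUsage("results/qwen35/agent-028/run-1/result.json")
```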


`scripts/run-batched.ts` (NEW)

Runs evals in small batches by rewriting the experiment config's `evals` array and calling `npm run eval -- run-all <experiment>` per batch (sketched below the examples).

# Run all evals in batches of 3
npx tsx scripts/run-batched.ts qwen35-flash

# Resume from a specific eval
npx tsx scripts/run-batched.ts qwen35 --start agent-028

# Custom batch size
npx tsx scripts/run-batched.ts deepseek-v3.2 --batch-size 1

# Preview without running
npx tsx scripts/run-batched.ts qwen35 --dry
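Under the hood the batching is a plain loop, roughly like this (a sketch; `rewriteEvalsArray` is a hypothetical stand-in for the script's actual config-rewriting step):

```ts
// Hedged sketch of the run-batched.ts loop; not the actual implementation.
import { execSync } from "node:child_process";

// Hypothetical helper: patches experiments/<experiment>.ts so that its
// `evals` array contains only the current batch.
function rewriteEvalsArray(experiment: string, evals: string[]): void {
  /* config rewriting omitted in this sketch */
}

function runBatched(experiment: string, allEvals: string[], batchSize = 3): void {
  for (let i = 0; i < allEvals.length; i += batchSize) {
    rewriteEvalsArray(experiment, allEvals.slice(i, i + batchSize));
    execSync(`npm run eval -- run-all ${experiment}`, { stdio: "inherit" });
  }
}
```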

`scripts/generate-report.ts` (NEW)

Generates a self-contained interactive HTML report from agent-results.json.

npx tsx scripts/generate-report.ts
# → report.html

Features:

  • Summary view — one row per model: Passed, Tests, Rate (%), Avg Duration, all 6 token columns (n/a if any eval is missing token data for that model)
  • Detailed view — one row per model × eval: Score, Duration, all 6 token columns
  • Toggle between Summary / Detailed with a button group
  • Filter by model and eval (multi-select dropdowns with Select all / Clear all)
  • Sortable columns in both tables
  • Top stats bar (total runs, pass rate, avg duration, avg tokens)
  • Dark theme, no external dependencies

Typical workflow for a new experiment

cd next-evals-oss

# 1. Run evals in batches
npx tsx scripts/run-batched.ts <experiment> --batch-size 3

# 2. (Optional) Backfill tokens for runs made with old package
npx tsx scripts/retro-extract-tokens.ts <experiment>

# 3. Export to agent-results.json
npx tsx scripts/export-results.ts

# 4. Generate HTML report
npx tsx scripts/generate-report.ts
# Open report.html in browser

Adding a new experiment

  1. Create `experiments/<name>.ts` with an `ExperimentConfig` (sketched below)
  2. Add `"<name>": "<agent-type>"` to `EXPERIMENT_AGENT_MAP` in `scripts/retro-extract-tokens.ts`
  3. Add `"<name>": "Display Name"` to `MODEL_NAMES` in `scripts/export-results.ts`
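A hedged sketch of step 1, assuming `ExperimentConfig` is exported by `@vercel/agent-eval` and exposes the fields this README mentions (agent, model, sandbox, evals); the real type may name them differently:

```ts
// experiments/qwen35-example.ts: illustrative only. Field names are
// assumptions drawn from this README, not the actual ExperimentConfig type.
import type { ExperimentConfig } from "@vercel/agent-eval";

const experiment: ExperimentConfig = {
  agent: "qwen",                     // agent name (see the Qwen table above)
  model: "dashscope/qwen3.5-plus",   // forwarded as --model to OpenCode
  sandbox: "docker",                 // Docker sandbox; Vercel cloud is untested
  evals: ["agent-001", "agent-028"], // the array run-batched.ts rewrites per batch
};

export default experiment;
```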
