Local workspace for running and analysing next-evals-oss, Vercel's eval framework that benchmarks AI coding agents on Next.js tasks.
```
agent-eval/          ← @vercel/agent-eval source (forked / patched locally)
  packages/
    agent-eval/      ← core library: runner, parsers, CLI
    playground/      ← web viewer for results
next-evals-oss/      ← eval definitions, experiments, scripts, results
```
One of the main additions is support for DeepSeek and Qwen models via their own native API keys; no Vercel AI Gateway is required.
Both agents use the OpenCode CLI as the coding agent backend. OpenCode is configured at runtime with a provider-specific opencode.json that points to the respective OpenAI-compatible endpoint and injects the API key from the environment.
| Item | Value |
|---|---|
| Agent name | deepseek |
| Implementation | agent-eval/packages/agent-eval/src/lib/agents/deepseek.ts |
| API endpoint | https://api.deepseek.com |
| Env var | DEEPSEEK_API_KEY |
| Default model | deepseek/deepseek-chat |
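For illustration, the `opencode.json` the deepseek agent writes might look roughly like this (shape assumed from OpenCode's provider config format; see the agent implementation for the exact file):

```json
{
  "provider": {
    "deepseek": {
      "options": {
        "baseURL": "https://api.deepseek.com",
        "apiKey": "{env:DEEPSEEK_API_KEY}"
      }
    }
  }
}
```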
DashScope is not a built-in OpenCode provider. We register it as a custom OpenAI-compatible provider via @ai-sdk/openai-compatible, using the international DashScope endpoint.
| Item | Value |
|---|---|
| Agent name | qwen |
| Implementation | agent-eval/packages/agent-eval/src/lib/agents/qwen.ts |
| API endpoint | https://dashscope-intl.aliyuncs.com/compatible-mode/v1 |
| Env var | DASHSCOPE_API_KEY |
| Default model | dashscope/qwen3.5-plus |
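Illustratively, the custom provider entry in the generated `opencode.json` might look like this (key names assumed from OpenCode's custom-provider convention; the real config is produced by the qwen agent):

```json
{
  "provider": {
    "dashscope": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        "apiKey": "{env:DASHSCOPE_API_KEY}"
      },
      "models": {
        "qwen3.5-plus": {}
      }
    }
  }
}
```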
Supported Qwen models (passed via `--model`):

| Model ID | Name |
|---|---|
| `dashscope/qwen3.5-plus` | Qwen3.5 Plus |
| `dashscope/qwen3.5-flash` | Qwen3.5 Flash |
| `dashscope/qwen3.5-397b-a17b` | Qwen3.5 397B A17B |
| `dashscope/qwen3.5-122b-a10b` | Qwen3.5 122B A10B |
| `dashscope/qwen3.5-27b` | Qwen3.5 27B |
| `dashscope/qwen3.5-35b-a3b` | Qwen3.5 35B A3B |
Create a `.env` file in `next-evals-oss/` (or export the variables in your shell) before running any experiments:

```bash
# Required for DeepSeek experiments
DEEPSEEK_API_KEY=sk-...

# Required for Qwen experiments
DASHSCOPE_API_KEY=sk-...
```
> ⚠️ Without the correct API key, the eval run will fail immediately when OpenCode tries to authenticate with the provider.
Runtime flow:

- The deepseek/qwen agent spins up a sandbox and installs `opencode-ai` globally inside it.
- It writes an `opencode.json` to the sandbox with the provider config and `{env:DEEPSEEK_API_KEY}` / `{env:DASHSCOPE_API_KEY}` placeholders.
- It calls `opencode run <prompt> --model <model> --format json`, passing the API key in the environment (see the sketch after this list).
- The JSON transcript (JSONL) is captured from stdout and parsed by the existing OpenCode parser to extract tool calls, messages, and token usage.
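A minimal sketch of the last two steps, assuming a direct CLI invocation (the real runner executes this inside the sandbox; the prompt and model here are placeholders):

```ts
import { spawnSync } from "node:child_process";

const prompt = "Fix the failing build"; // placeholder task prompt

// Run the OpenCode CLI; the provider reads DEEPSEEK_API_KEY from the environment.
const result = spawnSync(
  "opencode",
  ["run", prompt, "--model", "deepseek/deepseek-chat", "--format", "json"],
  { env: process.env, encoding: "utf8" },
);

// stdout is JSONL: one JSON event per line, fed to the OpenCode parser.
const events = result.stdout.split("\n").filter(Boolean).map((line) => JSON.parse(line));
console.log(`parsed ${events.length} transcript events`);
```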
> ⚠️ IMPORTANT: Docker only, Vercel cloud untested
>
> All DeepSeek and Qwen evaluations were run with the Docker sandbox (`sandbox: "docker"` in the experiment config, or with `VERCEL_SANDBOX_TOKEN` left empty/unset so the runner falls back to Docker automatically). There is no guarantee that these agents work correctly when the Vercel cloud sandbox is used instead. Specifically:
>
> - The Vercel sandbox may not allow outbound HTTPS connections to `api.deepseek.com` or `dashscope-intl.aliyuncs.com`.
> - The OpenCode CLI global install step (`npm install -g opencode-ai`) may behave differently in the Vercel runtime.
The upstream package had no token/cost tracking. The following files were modified to add it:
| File | Change |
|---|---|
| `types.ts` | Added the `TokenUsage` interface (`inputTokens`, `outputTokens`, `reasoningTokens`, `cacheReadTokens`, `cacheWriteTokens`, `totalTokens`, `cost?`); added a `token_usage` event type; added `tokenUsage?: TokenUsage` to `TranscriptEvent` and `TranscriptSummary` |
| `parsers/opencode.ts` | Extracts `step_finish` events; covers Qwen, DeepSeek, Devstral, Minimax (all OpenCode-based agents) |
| `parsers/claude-code.ts` | Extracts `message.usage` fields from Claude Code transcripts |
| `parsers/gemini.ts` | Extracts `step_finish` metadata; covers the Gemini CLI agent |
| `parsers/index.ts` | `generateSummary()` now aggregates all `token_usage` events into a single `tokenUsage` object on the summary |
| `index.ts` | Exports the `TokenUsage` type |
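For reference, the added interface reconstructed from the field list above (the field types are an assumption; `types.ts` is authoritative):

```ts
// Reconstructed from the change table; see types.ts for the authoritative definition.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  reasoningTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
  cost?: number; // optional, since not every provider reports cost
}
```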
After parsing, each result.json file automatically gets o11y.tokenUsage populated for all new runs.
Token availability by agent:
| Agent | Input | Output | Reasoning | Cache Read | Cache Write | Cost |
|---|---|---|---|---|---|---|
| OpenCode (Qwen, DeepSeek, Devstral, Minimax) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅* |
| Claude Code | ✅ | ✅ | — | ✅ | ✅ | — |
| Gemini CLI | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Codex / GPT | — | — | — | — | — | — |
| Cursor | — | — | — | — | — | — |
* Qwen always reports cost: 0.
```bash
cd agent-eval/packages/agent-eval
npm run build   # outputs to dist/
```

next-evals-oss runs evals via the installed `node_modules/@vercel/agent-eval` (v0.7.0), not the local source, so after any agent-eval code change the dist must be rebuilt and copied over. This is now automated via postinstall.
Builds agent-eval and copies `dist/` to `node_modules/@vercel/agent-eval/dist/`. Runs automatically on `npm install`; it can also be triggered manually:

```bash
npx tsx scripts/postinstall.ts
```

Registered in `package.json`:

```json
"postinstall": "tsx scripts/postinstall.ts"
```

> ⚠️ Every `npm install` re-runs this automatically. If you skip the rebuild after modifying agent-eval sources, the changes will not be active during eval runs.
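In essence the script does something like the following sketch (paths assumed from the repo layout above; see `scripts/postinstall.ts` for the real logic):

```ts
import { execSync } from "node:child_process";
import { cpSync, rmSync } from "node:fs";
import { join } from "node:path";

// Assumed layout: run from next-evals-oss/, with agent-eval/ as a sibling directory.
const src = join(process.cwd(), "..", "agent-eval", "packages", "agent-eval");
const dest = join(process.cwd(), "node_modules", "@vercel", "agent-eval", "dist");

execSync("npm run build", { cwd: src, stdio: "inherit" }); // rebuild dist/
rmSync(dest, { recursive: true, force: true });            // drop the stale copy
cpSync(join(src, "dist"), dest, { recursive: true });      // install the fresh build
```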
Retroactively extracts token usage from existing `transcript-raw.jsonl` files and writes `tokenUsage` into `result.json` and `transcript.json`. Use this for runs made before token tracking was added.
```bash
# All experiments
npx tsx scripts/retro-extract-tokens.ts

# Specific experiment(s)
npx tsx scripts/retro-extract-tokens.ts qwen35
npx tsx scripts/retro-extract-tokens.ts qwen35 deepseek-v3.2

# Dry-run (no writes)
npx tsx scripts/retro-extract-tokens.ts --dry-run
```

The experiment → agent mapping is defined in `EXPERIMENT_AGENT_MAP` at the top of the file. Add new experiments there when needed.
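The map is a plain object keyed by experiment name; the entries below are illustrative, using experiment and agent names that appear elsewhere in this README:

```ts
// Illustrative shape of EXPERIMENT_AGENT_MAP in scripts/retro-extract-tokens.ts.
const EXPERIMENT_AGENT_MAP: Record<string, string> = {
  "qwen35": "qwen",
  "deepseek-v3.2": "deepseek",
};
```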
> ℹ️ Runs that have no `transcript-raw.jsonl` (e.g. very old runs or Cursor) are skipped silently.
Reads results from `results/<experiment>/` and produces `agent-results.json`.
Changes from upstream:
- Added a `tokenUsage: TokenUsage | "n/a"` field to each `AgentResult`
- Reads `o11y.tokenUsage` from `run-1/result.json`
- Updated the `MODEL_NAMES` map: `gemini-3-pro-preview-gemini-cli` → `"Gemini 3.0 Pro Preview (Gemini CLI)"`, to distinguish it from the OpenCode-based variant
```bash
npx tsx scripts/export-results.ts          # all experiments
npx tsx scripts/export-results.ts qwen35   # specific experiment
```

Output: `agent-results.json`
Runs evals in small batches by rewriting the experiment config's `evals` array and calling `npm run eval -- run-all <experiment>` per batch.
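Conceptually the loop looks like this sketch (names and the rewrite helper are hypothetical stand-ins for the real implementation):

```ts
import { execSync } from "node:child_process";

const experiment = "qwen35"; // taken from argv in the real script
const allEvals = ["agent-001", "agent-002", "agent-003", "agent-004"];
const batchSize = 3;

// Hypothetical stand-in: the real script patches the `evals` array inside the
// experiment config; this stub only logs, to keep the sketch self-contained.
function rewriteExperimentEvals(name: string, evals: string[]): void {
  console.log(`would rewrite experiments/${name}.ts with`, evals);
}

for (let i = 0; i < allEvals.length; i += batchSize) {
  rewriteExperimentEvals(experiment, allEvals.slice(i, i + batchSize));
  execSync(`npm run eval -- run-all ${experiment}`, { stdio: "inherit" });
}
```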
```bash
# Run all evals in batches of 3
npx tsx scripts/run-batched.ts qwen35-flash

# Resume from a specific eval
npx tsx scripts/run-batched.ts qwen35 --start agent-028

# Custom batch size
npx tsx scripts/run-batched.ts deepseek-v3.2 --batch-size 1

# Preview without running
npx tsx scripts/run-batched.ts qwen35 --dry
```

Generates a self-contained interactive HTML report from `agent-results.json`.
```bash
npx tsx scripts/generate-report.ts
# → report.html
```

Features:
- Summary view — one row per model: Passed, Tests, Rate (%), Avg Duration, all 6 token columns (n/a if any eval is missing token data for that model)
- Detailed view — one row per model × eval: Score, Duration, all 6 token columns
- Toggle between Summary / Detailed with a button group
- Filter by model and eval (multi-select dropdowns with Select all / Clear all)
- Sortable columns in both tables
- Top stats bar (total runs, pass rate, avg duration, avg tokens)
- Dark theme, no external dependencies
```bash
cd next-evals-oss

# 1. Run evals in batches
npx tsx scripts/run-batched.ts <experiment> --batch-size 3

# 2. (Optional) Backfill tokens for runs made with the old package
npx tsx scripts/retro-extract-tokens.ts <experiment>

# 3. Export to agent-results.json
npx tsx scripts/export-results.ts

# 4. Generate HTML report
npx tsx scripts/generate-report.ts

# Open report.html in a browser
```

To add a new experiment:

- Create `experiments/<name>.ts` with an `ExperimentConfig`
- Add `"<name>": "<agent-type>"` to `EXPERIMENT_AGENT_MAP` in `scripts/retro-extract-tokens.ts`
- Add `"<name>": "Display Name"` to `MODEL_NAMES` in `scripts/export-results.ts`