evals-test

Local workspace for running and analysing next-evals-oss, Vercel's eval framework for benchmarking AI coding agents on Next.js tasks.

Repository layout

agent-eval/          ← @vercel/agent-eval source (forked / patched locally)
  packages/
    agent-eval/      ← core library: runner, parsers, CLI
    playground/      ← web viewer for results
next-evals-oss/      ← eval definitions, experiments, scripts, results

DeepSeek and Qwen native API support ← KEY ADDITION

One of the main things we added is support for DeepSeek and Qwen models using their own native API keys — no Vercel AI Gateway required.

Both agents use the OpenCode CLI as the coding agent backend. OpenCode is configured at runtime with a provider-specific opencode.json that points to the respective OpenAI-compatible endpoint and injects the API key from the environment.

DeepSeek

| Item | Value |
| --- | --- |
| Agent name | `deepseek` |
| Implementation | `agent-eval/packages/agent-eval/src/lib/agents/deepseek.ts` |
| API endpoint | `https://api.deepseek.com` |
| Env var | `DEEPSEEK_API_KEY` |
| Default model | `deepseek/deepseek-chat` |
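Concretely, the agent writes an `opencode.json` along these lines. This is a hedged sketch following OpenCode's provider-config format; the actual file generated by `deepseek.ts` may contain additional fields:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "deepseek": {
      "options": {
        "baseURL": "https://api.deepseek.com",
        "apiKey": "{env:DEEPSEEK_API_KEY}"
      }
    }
  }
}
```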

Qwen (Alibaba Cloud DashScope)

DashScope is not a built-in OpenCode provider. We register it as a custom OpenAI-compatible provider via @ai-sdk/openai-compatible, using the international DashScope endpoint.

| Item | Value |
| --- | --- |
| Agent name | `qwen` |
| Implementation | `agent-eval/packages/agent-eval/src/lib/agents/qwen.ts` |
| API endpoint | `https://dashscope-intl.aliyuncs.com/compatible-mode/v1` |
| Env var | `DASHSCOPE_API_KEY` |
| Default model | `dashscope/qwen3.5-plus` |

Supported Qwen models (passed via --model):

| Model ID | Name |
| --- | --- |
| `dashscope/qwen3.5-plus` | Qwen3.5 Plus |
| `dashscope/qwen3.5-flash` | Qwen3.5 Flash |
| `dashscope/qwen3.5-397b-a17b` | Qwen3.5 397B A17B |
| `dashscope/qwen3.5-122b-a10b` | Qwen3.5 122B A10B |
| `dashscope/qwen3.5-27b` | Qwen3.5 27B |
| `dashscope/qwen3.5-35b-a3b` | Qwen3.5 35B A3B |
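Since DashScope is not built in, the generated `opencode.json` also carries the `@ai-sdk/openai-compatible` wiring. A hedged sketch in OpenCode's custom-provider format (the exact contents written by `qwen.ts` may differ, and only two of the models are shown):

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "dashscope": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "DashScope",
      "options": {
        "baseURL": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        "apiKey": "{env:DASHSCOPE_API_KEY}"
      },
      "models": {
        "qwen3.5-plus": { "name": "Qwen3.5 Plus" },
        "qwen3.5-flash": { "name": "Qwen3.5 Flash" }
      }
    }
  }
}
```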

Required .env keys

Create a .env file in next-evals-oss/ (or export variables in your shell) before running any experiments:

# Required for DeepSeek experiments
DEEPSEEK_API_KEY=sk-...

# Required for Qwen experiments
DASHSCOPE_API_KEY=sk-...

⚠️ Without the correct API key the eval run will fail immediately when OpenCode tries to authenticate with the provider.

How it works

  1. The `deepseek` / `qwen` agent spins up a sandbox and installs `opencode-ai` globally inside it.
  2. It writes an `opencode.json` to the sandbox with the provider config and `{env:DEEPSEEK_API_KEY}` / `{env:DASHSCOPE_API_KEY}` placeholders.
  3. It calls `opencode run <prompt> --model <model> --format json`, passing the API key in the environment (sketched below this list).
  4. The JSON transcript (JSONL) is captured from stdout and parsed by the existing OpenCode parser to extract tool calls, messages, and token usage.
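In code, the flow amounts to roughly the following. This is an illustrative sketch, not the actual `deepseek.ts` / `qwen.ts` source; the `Sandbox` interface and function name are hypothetical stand-ins for the real agent-eval sandbox API:

```ts
// Illustrative sketch of the four steps above. `Sandbox` and its methods
// are hypothetical stand-ins for the real agent-eval sandbox API.
interface Sandbox {
  exec(cmd: string, opts?: { env?: Record<string, string> }): Promise<{ stdout: string }>;
  writeFile(path: string, contents: string): Promise<void>;
}

async function runOpenCodeAgent(
  sandbox: Sandbox,
  prompt: string,
  model: string,          // e.g. "deepseek/deepseek-chat"
  providerConfig: object, // the opencode.json content shown earlier
  envVar: string,         // "DEEPSEEK_API_KEY" or "DASHSCOPE_API_KEY"
): Promise<string> {
  // 1. Install the OpenCode CLI globally inside the sandbox
  await sandbox.exec("npm install -g opencode-ai");

  // 2. Write the provider config; the {env:...} placeholder is resolved
  //    by OpenCode at runtime
  await sandbox.writeFile("opencode.json", JSON.stringify(providerConfig, null, 2));

  // 3. Run the prompt, passing the API key through the environment
  //    (naive shell quoting, for illustration only)
  const { stdout } = await sandbox.exec(
    `opencode run ${JSON.stringify(prompt)} --model ${model} --format json`,
    { env: { [envVar]: process.env[envVar] ?? "" } },
  );

  // 4. stdout is the JSONL transcript consumed by the OpenCode parser
  return stdout;
}
```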

⚠️ IMPORTANT — Docker only, Vercel cloud untested

All DeepSeek and Qwen evaluations were run with the Docker sandbox (`sandbox: "docker"` in the experiment config, or with `VERCEL_SANDBOX_TOKEN` left empty or unset so the runner falls back to Docker automatically).

There is no guarantee that these agents work correctly when the Vercel cloud sandbox is used instead. Specifically:

  • The Vercel sandbox may not allow outbound HTTPS connections to api.deepseek.com or dashscope-intl.aliyuncs.com.
  • The OpenCode CLI global install step (npm install -g opencode-ai) may behave differently in the Vercel runtime.

Changes made to agent-eval

Token tracking (packages/agent-eval/src/lib/o11y/)

The upstream package had no token/cost tracking. The following files were modified to add it:

| File | Change |
| --- | --- |
| `types.ts` | Added `TokenUsage` interface (`inputTokens`, `outputTokens`, `reasoningTokens`, `cacheReadTokens`, `cacheWriteTokens`, `totalTokens`, `cost?`); added `token_usage` event type; added `tokenUsage?: TokenUsage` to `TranscriptEvent` and `TranscriptSummary` |
| `parsers/opencode.ts` | Extracts `step_finish` events; covers Qwen, DeepSeek, Devstral, Minimax (all OpenCode-based agents) |
| `parsers/claude-code.ts` | Extracts `message.usage` fields from Claude Code transcripts |
| `parsers/gemini.ts` | Extracts `step_finish` metadata; covers the Gemini CLI agent |
| `parsers/index.ts` | `generateSummary()` now aggregates all `token_usage` events into a single `tokenUsage` object on the summary |
| `index.ts` | Exports the `TokenUsage` type |
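Reconstructed from the field list above, the added interface looks approximately like this (see `types.ts` for the authoritative declaration):

```ts
// Approximate shape of the added interface; reconstructed from the table
// above. See types.ts for the authoritative declaration.
export interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  reasoningTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
  cost?: number; // optional; Qwen always reports 0 (see the note below)
}
```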

After parsing, each result.json file automatically gets o11y.tokenUsage populated for all new runs.

Token availability by agent:

| Agent | Input | Output | Reasoning | Cache Read | Cache Write | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| OpenCode (Qwen, DeepSeek, Devstral, Minimax) | | | | | | ✅* |
| Claude Code | | | | | | |
| Gemini CLI | | | | | | |
| Codex / GPT | | | | | | |
| Cursor | | | | | | |

\* Qwen always reports `cost: 0`.

Build

cd agent-eval/packages/agent-eval
npm run build        # outputs to dist/

Changes made to next-evals-oss

Problem: installed package vs local source

next-evals-oss runs evals via the installed node_modules/@vercel/agent-eval (v0.7.0), not the local source. After any agent-eval code change, the dist must be rebuilt and copied over.

This is now automated via postinstall.


New / modified scripts

`scripts/postinstall.ts` (NEW)

Builds agent-eval and copies dist/ to node_modules/@vercel/agent-eval/dist/.

Runs automatically on npm install. Can also be triggered manually:

npx tsx scripts/postinstall.ts

Registered in package.json:

"postinstall": "tsx scripts/postinstall.ts"

⚠️ Every npm install re-runs this automatically. If you skip this step after modifying agent-eval sources, the changes will not be active during eval runs.
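Internally the script boils down to a build plus a copy, roughly as follows. This is a sketch assuming the sibling-directory layout shown under "Repository layout"; the real script may differ:

```ts
// Sketch of scripts/postinstall.ts, assuming agent-eval and next-evals-oss
// are sibling directories and the script runs from next-evals-oss/.
import { execSync } from "node:child_process";
import { cpSync, rmSync } from "node:fs";
import { join } from "node:path";

const pkgDir = join(process.cwd(), "..", "agent-eval", "packages", "agent-eval");
const target = join(process.cwd(), "node_modules", "@vercel", "agent-eval", "dist");

// Rebuild the local agent-eval sources...
execSync("npm run build", { cwd: pkgDir, stdio: "inherit" });

// ...then replace the installed package's dist/ with the fresh build.
rmSync(target, { recursive: true, force: true });
cpSync(join(pkgDir, "dist"), target, { recursive: true });
```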


`scripts/retro-extract-tokens.ts` (NEW)

Retroactively extracts token usage from existing `transcript-raw.jsonl` files and writes `tokenUsage` into `result.json` and `transcript.json`. Use it for runs made before token tracking was added.

# All experiments
npx tsx scripts/retro-extract-tokens.ts

# Specific experiment(s)
npx tsx scripts/retro-extract-tokens.ts qwen35
npx tsx scripts/retro-extract-tokens.ts qwen35 deepseek-v3.2

# Dry-run (no writes)
npx tsx scripts/retro-extract-tokens.ts --dry-run

The experiment → agent mapping is defined in `EXPERIMENT_AGENT_MAP` at the top of the file; add new experiments there when needed.
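For the experiments mentioned in this README, its entries would look something like this (illustrative, not the file's actual contents):

```ts
// Example entries only; the real map lives at the top of
// scripts/retro-extract-tokens.ts.
const EXPERIMENT_AGENT_MAP: Record<string, string> = {
  "qwen35": "qwen",
  "qwen35-flash": "qwen",
  "deepseek-v3.2": "deepseek",
};
```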

ℹ️ Runs that have no transcript-raw.jsonl (e.g. very old runs or Cursor) will be skipped silently.


`scripts/export-results.ts` (MODIFIED)

Reads results from results/<experiment>/ and produces agent-results.json.

Changes from upstream:

  • Added `tokenUsage: TokenUsage | "n/a"` field to each `AgentResult`
  • Reads `o11y.tokenUsage` from `run-1/result.json`
  • `MODEL_NAMES` map updated: `gemini-3-pro-preview-gemini-cli` → "Gemini 3.0 Pro Preview (Gemini CLI)", to distinguish it from the OpenCode-based variant

npx tsx scripts/export-results.ts              # all experiments
npx tsx scripts/export-results.ts qwen35       # specific

Output: agent-results.json
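The token lookup described above amounts to reading one nested field per run. A hedged sketch (the `results/<experiment>/<eval>/run-1` path layout is inferred from this README, not confirmed against the script):

```ts
// Hedged sketch of the per-run token lookup in export-results.ts.
// TokenUsage is exported from the package (see the token-tracking changes).
import { readFileSync } from "node:fs";
import type { TokenUsage } from "@vercel/agent-eval";

function readTokenUsage(resultJsonPath: string): TokenUsage | "n/a" {
  try {
    const result = JSON.parse(readFileSync(resultJsonPath, "utf8"));
    // Falls back to "n/a" when o11y.tokenUsage was never populated
    return result.o11y?.tokenUsage ?? "n/a";
  } catch {
    return "n/a"; // missing or unreadable result.json
  }
}

// e.g. readTokenUsage("results/qwen35/agent-028/run-1/result.json")
```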


`scripts/run-batched.ts` (NEW)

Runs evals in small batches by rewriting the experiment config's `evals` array and calling `npm run eval -- run-all <experiment>` per batch (sketched below the examples).

# Run all evals in batches of 3
npx tsx scripts/run-batched.ts qwen35-flash

# Resume from a specific eval
npx tsx scripts/run-batched.ts qwen35 --start agent-028

# Custom batch size
npx tsx scripts/run-batched.ts deepseek-v3.2 --batch-size 1

# Preview without running
npx tsx scripts/run-batched.ts qwen35 --dry
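Under the hood the batching is a plain loop, roughly like this (a sketch; `rewriteEvalsArray` is a hypothetical stand-in for the script's actual config-rewriting step):

```ts
// Hedged sketch of the run-batched.ts loop; not the actual implementation.
import { execSync } from "node:child_process";

// Hypothetical helper: patches experiments/<experiment>.ts so that its
// `evals` array contains only the current batch.
function rewriteEvalsArray(experiment: string, evals: string[]): void {
  /* config rewriting omitted in this sketch */
}

function runBatched(experiment: string, allEvals: string[], batchSize = 3): void {
  for (let i = 0; i < allEvals.length; i += batchSize) {
    rewriteEvalsArray(experiment, allEvals.slice(i, i + batchSize));
    execSync(`npm run eval -- run-all ${experiment}`, { stdio: "inherit" });
  }
}
```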

`scripts/generate-report.ts` (NEW)

Generates a self-contained interactive HTML report from agent-results.json.

npx tsx scripts/generate-report.ts
# → report.html

Features:

  • Summary view — one row per model: Passed, Tests, Rate (%), Avg Duration, all 6 token columns (n/a if any eval is missing token data for that model)
  • Detailed view — one row per model × eval: Score, Duration, all 6 token columns
  • Toggle between Summary / Detailed with a button group
  • Filter by model and eval (multi-select dropdowns with Select all / Clear all)
  • Sortable columns in both tables
  • Top stats bar (total runs, pass rate, avg duration, avg tokens)
  • Dark theme, no external dependencies

Typical workflow for a new experiment

cd next-evals-oss

# 1. Run evals in batches
npx tsx scripts/run-batched.ts <experiment> --batch-size 3

# 2. (Optional) Backfill tokens for runs made with old package
npx tsx scripts/retro-extract-tokens.ts <experiment>

# 3. Export to agent-results.json
npx tsx scripts/export-results.ts

# 4. Generate HTML report
npx tsx scripts/generate-report.ts
# Open report.html in browser

Adding a new experiment

  1. Create `experiments/<name>.ts` with an `ExperimentConfig` (sketched below)
  2. Add `"<name>": "<agent-type>"` to `EXPERIMENT_AGENT_MAP` in `scripts/retro-extract-tokens.ts`
  3. Add `"<name>": "Display Name"` to `MODEL_NAMES` in `scripts/export-results.ts`
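A hedged sketch of step 1, assuming `ExperimentConfig` is exported by `@vercel/agent-eval` and exposes the fields this README mentions (agent, model, sandbox, evals); the real type may name them differently:

```ts
// experiments/qwen35-example.ts: illustrative only. Field names are
// assumptions drawn from this README, not the actual ExperimentConfig type.
import type { ExperimentConfig } from "@vercel/agent-eval";

const experiment: ExperimentConfig = {
  agent: "qwen",                     // agent name (see the Qwen table above)
  model: "dashscope/qwen3.5-plus",   // forwarded as --model to OpenCode
  sandbox: "docker",                 // Docker sandbox; Vercel cloud is untested
  evals: ["agent-001", "agent-028"], // the array run-batched.ts rewrites per batch
};

export default experiment;
```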
