Qwen3 benchmark results

May 08, 2025

As previously discussed when Qwen2.5 was released, details matter when working with open source models for AI coding. Proprietary models are served by their creators or trusted providers with stable inference settings. Open source models are wonderful because anyone can serve them, but API providers can use very different inference settings, quantizations, etc.

Below is a collection of aider polyglot benchmark results for the new Qwen3 models. Results are presented using both “diff” and “whole” edit formats, with various model settings, against various API providers.

See details on the model settings used after the results table.

This article is being updated as new results become available. Also, some results were submitted by aider users and have not been verified.

Qwen3 results on the aider polyglot benchmark

| Model | Percent correct | Cost | Command | Correct edit format | Edit format |
|---|---|---|---|---|---|
| Qwen3-235B-A22B whole with VLLM, bfloat16, recommended /no_think settings | 65.3% | | aider --model openai/Qwen3-235B-A22B | 100.0% | whole |
| Qwen3 235B A22B whole, no think, via official Alibaba API | 61.8% | | aider --model openai/qwen3-235b-a22b | 100.0% | whole |
| Qwen3-235B-A22B diff with VLLM, bfloat16, recommended /no_think settings | 61.3% | | aider --model openai/Qwen3-235B-A22B | 94.7% | diff |
| Qwen3 235B A22B diff, no think, via official Alibaba API | 59.6% | | aider --model openai/qwen3-235b-a22b | 92.9% | diff |
| Qwen3-235B-A22B whole with llama.cpp, Q5_K_M (unsloth), recommended /no_think settings | 59.1% | | aider --model openai/Qwen3-235B-A22B-Q5_K_M | 100.0% | whole |
| Qwen3 235B A22B diff on OpenRouter, TogetherAI only, recommended /no_think settings | 54.7% | $0.64 | aider --model openrouter/qwen/qwen3-235b-a22b | 90.7% | diff |
| Qwen3 235B A22B diff on OpenRouter, all providers, default settings (thinking) | 49.8% | $1.80 | aider --model openrouter/qwen/qwen3-235b-a22b | 91.6% | diff |
| Qwen3-32B whole with VLLM, bfloat16, recommended /no_think settings | 45.8% | | aider --model openai/Qwen3-32B | 100.0% | whole |
| Qwen3-32B diff with VLLM, bfloat16, recommended /no_think settings | 41.3% | | aider --model openai/Qwen3-32B | 94.2% | diff |
| Qwen3 32B diff on OpenRouter, all providers, default settings (thinking) | 40.0% | $0.76 | aider --model openrouter/qwen/qwen3-32b | 83.6% | diff |

No think, via official Alibaba API

These results were obtained running against https://dashscope.aliyuncs.com/compatible-mode/v1 with no thinking.

export OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
export OPENAI_API_KEY=<key>
- name: openai/qwen3-235b-a22b
  use_temperature: 0.7
  streaming: false
  extra_params:
    stream: false
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    temperature: 0.7
    enable_thinking: false
    extra_body:
      enable_thinking: false
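
With the environment variables and model settings above in place, aider is then run with the same command shown in the results table:

```
aider --model openai/qwen3-235b-a22b
```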

OpenRouter, TogetherAI only, recommended /no_think settings

These results were obtained with the recommended non-thinking model settings in .aider.model.settings.yml:

- name: openrouter/qwen/qwen3-235b-a22b
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
    extra_body:
      provider:
        order: ["Together"]

And then running aider:

aider --model openrouter/qwen/qwen3-235b-a22b

OpenRouter, all providers, default settings (thinking)

These results were obtained by simply running aider as shown below, without any model specific settings. This should have enabled thinking, assuming upstream API providers honor that convention for Qwen3.

aider --model openrouter/qwen/qwen3-xxx

VLLM and llama.cpp, recommended /no_think settings

These benchmark results were obtained by GitHub user AlongWY with the recommended non-thinking model settings in .aider.model.settings.yml:

- name: openai/<model-name>
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7        

And then running aider:

aider --model openai/<model-name> --openai-api-base <url>
Gemini 2.5 Pro Preview 03-25 benchmark cost

May 07, 2025

Summary

The $6.32 cost reported to run the aider polyglot benchmark on Gemini 2.5 Pro Preview 03-25 was incorrect. The true cost was higher, possibly significantly so. The incorrect cost has been removed from the leaderboard.

An investigation determined the primary cause was that the litellm package (used by aider for LLM API connections) was not properly including reasoning tokens in the token counts it reported. While an incorrect price-per-token entry for the model also existed in litellm’s cost database at that time, this was found not to be a contributing factor. Aider’s own internal, correct pricing data was utilized during the benchmark.

Resolution

Litellm began correctly including reasoning tokens in the reported counts on April 21, 2025 in commit a7db0df. This change was released in litellm v1.67.1. Aider picked up this change April 28, 2025 when it upgraded its litellm dependency from v1.65.7 to v1.67.4.post1 in commit 9351f37. That dependency change shipped on May 5, 2025 in aider v0.82.3.

Unfortunately the 03-25 version of Gemini 2.5 Pro Preview is no longer available, so it is not possible to re-run the benchmark to obtain an accurate cost. As a possibly relevant comparison, the newer 05-06 version of Gemini 2.5 Pro Preview completed the benchmark at a cost of about $37.

Investigation detail

The version of litellm available at the time of the benchmark appears to have been excluding reasoning tokens from the token counts it reported. So even though aider had correct per-token pricing, it did not have the correct token counts during the benchmark. This resulted in an underestimate of the benchmark costs.
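
To illustrate the mechanism with made-up token counts (only the per-token price below is real, from aider's model metadata): when reasoning tokens are dropped from the reported output count, the computed cost underestimates the true cost even though the price is correct.

```python
# Hypothetical illustration of the undercounting bug.
# Only the per-token price is real (aider's metadata for
# gemini-2.5-pro-preview-03-25); the token counts are made up.
output_cost_per_token = 0.000010

visible_output_tokens = 2_000   # final answer tokens (hypothetical)
reasoning_tokens = 10_000       # hidden thinking tokens (hypothetical)

# Cost computed from litellm's report, which omitted reasoning tokens:
reported_cost = visible_output_tokens * output_cost_per_token

# True cost, with reasoning tokens included:
true_cost = (visible_output_tokens + reasoning_tokens) * output_cost_per_token

print(f"reported ${reported_cost:.2f} vs true ${true_cost:.2f}")
# reported $0.02 vs true $0.12
```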

The incorrect litellm database entry does not appear to have affected the aider benchmark costs. Aider maintains and uses its own database of costs for some models, and it contained the correct pricing at the time of the benchmark. Aider appears to have loaded the correct cost data from its database and made use of it during the benchmark.

Every aider benchmark report contains the git commit hash of the aider repository state used to run the benchmark. The benchmark run in question was built from commit 0282574.

Additional runs of the benchmark from that build verified that the error in litellm’s model cost database appears not to have been a factor:

  • Aider’s internal model database correctly overrides the litellm database, which contained an incorrect token cost at the time.
  • The correct pricing is loaded from aider’s internal model database and produces similar (incorrect) costs as the original run.
  • Updating aider’s internal model database with an absurdly high token cost resulted in an appropriately high benchmark cost report, demonstrating that the internal database costs were in effect.

This specific build of aider was then updated with various versions of litellm, using git bisect to identify the first litellm commit where reasoning token counts were correctly reported.

Timeline

Below is the full timeline of git commits related to this issue in the aider and litellm repositories. Each entry has a UTC timestamp, followed by the original literal timestamp obtained from the relevant source.

  • 2025-04-04 19:54:45 UTC (Sat Apr 5 08:54:45 2025 +1300)
    • Correct value "output_cost_per_token": 0.000010 for gemini/gemini-2.5-pro-preview-03-25 added to aider/resources/model-metadata.json
    • Commit eda796d in aider.
  • 2025-04-05 16:20:01 UTC (Sun Apr 6 00:20:01 2025 +0800)
    • First litellm commit of gemini/gemini-2.5-pro-preview-03-25 metadata, with incorrect price "output_cost_per_token": 0.0000010
    • Commit cd0a1e6 in litellm.
  • 2025-04-10 01:48:43 UTC (Wed Apr 9 18:48:43 2025 -0700)
    • litellm commit updates gemini/gemini-2.5-pro-preview-03-25 metadata, but not price
    • Commit ac4f32f in litellm.
  • 2025-04-12 04:55:50 UTC (2025-04-12-04-55-50 UTC)
    • Benchmark performed.
    • Aider repo hash 0282574 recorded in benchmark results, without a “dirty” annotation, indicating that the benchmark was run on a clean checkout of the aider repo at commit 0282574.
    • Correct value "output_cost_per_token": 0.000010 is in aider/resources/model-metadata.json at this commit 0282574.
  • 2025-04-12 15:06:39 UTC (Apr 12 08:06:39 2025 -0700)
    • Benchmark results added to aider repo.
    • Commit 7fbeafa in aider.
  • 2025-04-12 15:20:04 UTC (Sat Apr 12 19:20:04 2025 +0400)
    • litellm commit fixes gemini/gemini-2.5-pro-preview-03-25 price metadata to "output_cost_per_token": 0.00001
    • Commit 93037ea in litellm.
  • 2025-04-22 05:48:00 UTC (Mon Apr 21 22:48:00 2025 -0700)
    • Litellm started including reasoning tokens in token count reporting.
    • Commit a7db0df in litellm.
    • This fix was released in litellm v1.67.1.
  • 2025-04-28 14:53:20 UTC (Mon Apr 28 07:53:20 2025 -0700)
    • Aider upgraded its litellm dependency from v1.65.7 to v1.67.4.post1, which included the reasoning token count fix.
    • Commit 9351f37 in aider.
    • This dependency change shipped on May 5, 2025 in aider v0.82.3.
Alternative DeepSeek V3 providers

January 28, 2025

DeepSeek’s API has been experiencing significant reliability issues for the past 24-48+ hours, with many users reporting downtime and overload problems. Their status page notes an ongoing incident.

If you’re affected by these issues, several alternative providers offer access to DeepSeek V3. This article compares their performance on aider’s polyglot benchmark to help you choose a reliable alternative.

Providers

OpenRouter

OpenRouter offers many DeepSeek providers through their unified API. You can use aider with OpenRouter like this:

# Set your API key using environment variables
export OPENROUTER_API_KEY=<your-key>
aider --model openrouter/deepseek/deepseek-chat

# Or use the --api-key command line option
aider --model openrouter/deepseek/deepseek-chat --api-key openrouter=<your-key>

# Or add it to .aider.conf.yml in your home directory or project root:
api-key:
  - openrouter=<your-key>

OpenRouter automatically monitors their providers and routes requests to stable APIs and away from those experiencing unreliable performance.

But not all providers serve the same version of open source models, and not all have the same privacy guarantees. You can control which OpenRouter providers are used to serve the model via aider’s model settings. Create a .aider.model.settings.yml file in your home directory or git project root with settings like this:

- name: openrouter/deepseek/deepseek-chat
  extra_params:
    extra_body:
      provider:
        # Only use these providers, in this order
        order: ["Novita"]
        # Don't fall back to other providers
        allow_fallbacks: false

See OpenRouter’s provider routing docs for more details.

Fireworks

# Set your API key using environment variables
export FIREWORKS_API_KEY=<your-key>
aider --model fireworks_ai/accounts/fireworks/models/deepseek-chat

# Or use the --api-key command line option
aider --model fireworks_ai/accounts/fireworks/models/deepseek-chat --api-key fireworks=<your-key>

# Or add it to .aider.conf.yml in your home directory or project root:
api-key:
  - fireworks=<your-key>

Create a .aider.model.settings.yml file in your home directory or git project root with settings like this:

- name: fireworks_ai/accounts/fireworks/models/deepseek-chat
  edit_format: diff
  weak_model_name: null
  use_repo_map: true
  send_undo_reply: false
  lazy: false
  reminder: sys
  examples_as_sys_msg: true
  extra_params:
    max_tokens: 8192
  cache_control: false
  caches_by_default: true
  use_system_prompt: true
  use_temperature: true
  streaming: true

Hyperbolic

You can use Hyperbolic’s API as an OpenAI-compatible provider:

# Set your API key using environment variables
export OPENAI_API_BASE=https://api.hyperbolic.xyz/v1/
export OPENAI_API_KEY=<your-key>
aider --model openai/deepseek-ai/DeepSeek-V3

# Or use the --api-key command line option
aider --model openai/deepseek-ai/DeepSeek-V3 --api-key openai=<your-key>

# Or add it to .aider.conf.yml in your home directory or project root:
api-key:
  - openai=<your-key>

Create a .aider.model.settings.yml file in your home directory or git project root with settings like this:

- name: openai/deepseek-ai/DeepSeek-V3
  edit_format: diff
  weak_model_name: null
  use_repo_map: true
  send_undo_reply: false
  lazy: false
  reminder: sys
  examples_as_sys_msg: true
  cache_control: false
  caches_by_default: true
  use_system_prompt: true
  use_temperature: true
  streaming: true
  editor_model_name: null
  editor_edit_format: null
  extra_params:
    max_tokens: 65536

Ollama

You can run DeepSeek V3 via Ollama.

# Pull the model
ollama pull deepseek-v3

# Start your ollama server
ollama serve

# In another terminal window...
export OLLAMA_API_BASE=http://127.0.0.1:11434 # Mac/Linux
setx   OLLAMA_API_BASE http://127.0.0.1:11434 # Windows, restart shell after setx

aider --model ollama/deepseek-v3

It’s important to provide model settings, especially the num_ctx parameter to set the context window. Ollama uses a 2k context window by default, which is very small for working with aider. A larger context window will allow you to work with more code, but will use more memory and increase latency.

Unlike most other LLM servers, Ollama does not throw an error if you submit a request that exceeds the context window. Instead, it just silently truncates the request by discarding the “oldest” messages in the chat to make it fit within the context window.

So if your context window is too small, you won’t get an explicit error. The biggest symptom will be that aider says it can’t see (some of) the files you added to the chat. That’s because ollama is silently discarding them because they exceed the context window.
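
A rough back-of-the-envelope check can make this failure mode concrete. The sketch below is a crude heuristic (assuming roughly 4 characters per token, which is only an approximation for English text and code), not part of aider; it just shows why a handful of source files can quietly overflow a 2k window.

```python
def rough_token_count(text: str) -> int:
    """Crude token estimate: assume ~4 characters per token."""
    return len(text) // 4

num_ctx = 2048                  # Ollama's default context window
files_chars = 3 * 10_000        # e.g. three 10 KB source files added to the chat

estimated_tokens = rough_token_count("x" * files_chars)
print(estimated_tokens)         # 7500: well beyond the 2k default window
```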

Create a .aider.model.settings.yml file in your home directory or git project root with settings like this:

- name: ollama/deepseek-v3
  edit_format: diff
  weak_model_name: null
  use_repo_map: true
  send_undo_reply: false
  lazy: false
  reminder: sys
  examples_as_sys_msg: true
  cache_control: false
  caches_by_default: true
  use_system_prompt: true
  use_temperature: true
  streaming: true
  extra_params:
    num_ctx: 8192 # How large a context window?

Other providers

You will need to properly configure aider to work with DeepSeek V3 when served via other providers:

  • Determine the --model name to use.
  • Provide your API key to aider.
  • Add model settings to .aider.model.settings.yml.

Adapt the .aider.model.settings.yml shown above for Fireworks. You will need to change the name field to match your chosen provider’s model naming scheme.
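
As a sketch, an adaptation for a hypothetical OpenAI-compatible provider might look like this. The model name and max_tokens value below are placeholders, not real provider settings; substitute the values documented by your provider:

```yaml
- name: openai/your-provider/DeepSeek-V3   # placeholder: your provider's model name
  edit_format: diff
  weak_model_name: null
  use_repo_map: true
  examples_as_sys_msg: true
  extra_params:
    max_tokens: 8192   # placeholder: match your provider's output token limit
```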

See Advanced model settings for details about all aider model settings.

Results

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| Hyperbolic | 48.4% | 97.3% | OPENAI_API_BASE=https://api.hyperbolic.xyz/v1/ aider --model openai/deepseek-ai/DeepSeek-V3 | diff |
| Fireworks | 48.4% | 96.9% | aider --model fireworks_ai/accounts/fireworks/models/deepseek-v3 | diff |
| DeepSeek | 48.4% | 98.7% | aider --model deepseek/deepseek-chat | diff |
| OpenRouter: DeepInfra | 48.0% | 99.5% | aider --model openrouter/deepseek/deepseek-chat | diff |
| OpenRouter: Novita | 42.7% | 84.0% | aider --model openrouter/deepseek/deepseek-chat | diff |
R1+Sonnet set SOTA on aider’s polyglot benchmark

January 24, 2025

Aider supports using a pair of models for coding:

  • An Architect model is asked to describe how to solve the coding problem. Thinking/reasoning models often work well in this role.
  • An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.

R1 as architect with Sonnet as editor has set a new SOTA of 64.0% on the aider polyglot benchmark. They achieve this at 14X less cost compared to the previous o1 SOTA result.
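
The 14X figure follows directly from the total costs in the results table at the end of this article:

```python
# Total benchmark costs from the results table.
r1_sonnet_cost = 13.29   # R1 as architect + Sonnet as editor
o1_cost = 186.50         # previous o1 SOTA run

ratio = o1_cost / r1_sonnet_cost
print(round(ratio))      # 14
```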

o1 paired with Sonnet didn’t produce better results than just using o1 alone. Using various other models as editor didn’t seem to improve o1 or R1 versus their solo scores. This is in contrast to the first wave of thinking models like o1-preview and o1-mini, which improved when paired with many different editor models.

o1 was set with reasoning effort high for these tests.

Try it

Once you install aider, you can use aider, R1 and Sonnet like this:

export DEEPSEEK_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>

aider --architect --model r1 --editor-model sonnet

Or if you have an OpenRouter account:

export OPENROUTER_API_KEY=<your-key>

aider --architect --model openrouter/deepseek/deepseek-r1 --editor-model openrouter/anthropic/claude-3.5-sonnet

Thinking output

There has been some recent discussion about extracting the <think> tokens from R1’s responses and feeding them to Sonnet. That was an interesting experiment, for sure.

To be clear, the results above are not using R1’s thinking tokens, just the normal final output. R1 is configured in aider’s standard architect role with Sonnet as editor. The benchmark results that used the thinking tokens appear to be worse than the architect/editor results shared here.

Results

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format | Total cost |
|---|---|---|---|---|---|
| R1+Sonnet | 64.0% | 100.0% | aider --architect --model r1 --editor-model sonnet | architect | $13.29 |
| o1 | 61.7% | 91.5% | aider --model o1 | diff | $186.50 |
| R1 | 56.9% | 96.9% | aider --model r1 | diff | $5.42 |
| Sonnet | 51.6% | 99.6% | aider --model sonnet | diff | $14.41 |
| DeepSeek V3 | 48.4% | 98.7% | aider --model deepseek | diff | $0.34 |
Using uv as an installer

January 15, 2025

It’s hard to reliably package and distribute python command line tools to end users. Users frequently encounter challenges: dependency version conflicts, virtual environment management, needing to install python or a specific version of python, etc.

Aider employs uv in a couple of novel ways to streamline the installation process:

  1. Install aider with curl https://aider.chat/install.sh | sh even if python isn’t already installed.

  2. Users who have python 3.8+ installed can pip install aider-install && aider-install.

Both methods use uv to globally install the aider command line program, with all of its dependencies in an isolated environment. They ensure that aider will run with python 3.12, and install that version if it is not already available.

These uv install methods are especially helpful for aider, because it has a large set of very specific dependencies. Since not all of aider’s dependencies are available on all python versions, it requires python 3.9-3.12.

Most users don’t want to worry about these details – they just want a quick way to install and run aider.

One-liners

Users can install aider with a shell one-liner, without even having python previously installed:

curl -LsSf https://aider.chat/install.sh | sh

This installs uv, then uses it to install python 3.12, install the aider command line tool and update the user’s shell path. Under the hood, it is simply a copy of uv’s own install script https://astral.sh/uv/install.sh with one line added, to install aider as a tool:

ensure "${_install_dir}/uv" tool install --force --python python3.12 aider-chat@latest

aider-install

The aider-install python package allows quick global installation of aider for users who already have python 3.8+ installed. It simply provides the aider-install command line program, which users just need to run once.

pip install aider-install
aider-install

Running pip install aider-install installs only two packages: aider-install and the uv python package. This ensures that uv is available in the user’s environment. Everything else is installed in a stand-alone environment created by uv.

When the user runs aider-install, it runs uv to install aider as a tool and update the user’s shell path if needed:

uv tool install --force --python python3.12 aider-chat
uv tool update-shell

Benefits

These uv install methods have been popular with users, providing a hassle-free way to install aider and get started quickly. Installs are also extremely fast, much faster than pip or pipx installs, even when uv is also installing python 3.12!

There are also a number of benefits from the perspective of the tool developer/publisher. Since providing these install methods, far fewer users report dependency problems and version conflicts as compared to users who pip install aider-chat. There is also less pressure to rapidly support the newest python versions, since aider always installs with python 3.12.

o1 tops aider’s new polyglot leaderboard

December 21, 2024

OpenAI’s new o1 model with “high” reasoning effort gets the top score on the new aider polyglot leaderboard, significantly ahead of other top LLMs. The new polyglot benchmark uses many popular coding languages and was designed to be much more challenging than aider’s original code editing benchmark. This more clearly distinguishes the performance of today’s strongest coding models and leaves headroom for future LLMs.

See the main aider leaderboard for benchmark results from more models. This article only contains a snapshot of results at the time of publication.

The polyglot benchmark

Like aider’s original code editing benchmark, the new polyglot benchmark is based on Exercism coding exercises.

The new polyglot benchmark:

  • Contains coding problems in C++, Go, Java, JavaScript, Python and Rust. The old benchmark was solely based on Python exercises.
  • Focuses on the most difficult 225 exercises out of the 697 that Exercism provides for those languages. The old benchmark simply included all 133 Python exercises, regardless of difficulty.

Motivation and goals

Aider’s original code editing benchmark was saturating as the top scores approached and then surpassed 80%. Sonnet’s score of 84.2% was based on solving 112 of the 133 exercises, leaving only 21 unsolved exercises. New champions were advancing the top score by solving just 1-2 more problems than the previous record. This made it hard to clearly measure the difference in code editing skill between these top models.

Part of the problem is that many of the original 133 Python problems are very easy and provide little challenge to today’s frontier LLMs. Models as old as GPT 3.5 Turbo were able to solve half of the 133 problems. Such easy problems simply inflate the benchmark scores of modern LLMs without providing any data about which models are better or worse.

The main goal for a new benchmark was to re-calibrate the scale so that today’s top coding LLMs would occupy a wide range of scores between about 5% and 50%. This should leave headroom for future LLMs and make it possible to more clearly compare the relative performance of top models.

Designing the polyglot benchmark

The new benchmark:

  • Tests LLMs with more coding languages, to increase diversity and source a larger pool of problems.
  • Includes just the most challenging coding problems and excludes easy problems that are solvable by most of today’s top coding LLMs.
  • Includes more total coding problems, to enable more granularity of comparison.

The new benchmark is based on Exercism coding problems from 6 of the most popular programming languages:

  • C++
  • Go
  • Java
  • JavaScript
  • Python
  • Rust

Exercism provides a total of 697 coding problems in those 6 languages. A set of 7 of today’s top coding models each attempted all 697 of the Exercism problems:

  • Sonnet
  • Haiku
  • o1 Mini
  • DeepSeek
  • GPT-4o
  • Qwen 32B Coder Instruct
  • GPT-4o Mini

Depending on the difficulty of the problems, a different number of solutions were found by the collection of 7 models:

| Solutions found | Number of problems | Cumulative number of problems |
|---|---|---|
| 0 | 66 | 66 |
| 1 | 61 | 127 |
| 2 | 50 | 177 |
| 3 | 48 | 225 |
| 4 | 53 | 278 |
| 5 | 71 | 349 |
| 6 | 90 | 439 |
| 7 | 258 | 697 |

In the table above, you can see that 258 of the problems were solved by all 7 LLMs. These problems are far too easy, and wouldn’t be good choices for the new benchmark. Instead, we need hard problems like the 66 that none of the 7 models were able to solve.

The new benchmark uses the 225 problems that were solved by 3 or fewer models. This achieves a balance between hard and moderate problems, and provides a large but not excessive total pool of problems. It also represents a good diversity of coding languages:

| Language | Problems |
|---|---|
| C++ | 26 |
| Go | 39 |
| Java | 47 |
| JavaScript | 49 |
| Python | 34 |
| Rust | 30 |
| Total | 225 |

o1

OpenAI’s new o1 model established a very strong top score of 62% on the new benchmark. This still leaves 86 problems of headroom for future models to solve. Given the incredible pace of recent advancements, it will be interesting to see how long it will take for this new benchmark to saturate.

Benchmark problems

The 225 coding problems are available in the aider polyglot benchmark repo on GitHub.

Results

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| o1-2024-12-17 (high) | 61.7% | 91.5% | aider --model openrouter/openai/o1 | diff |
| claude-3-5-sonnet-20241022 | 45.3% | 100.0% | aider --model claude-3-5-sonnet-20241022 | diff |
| gemini-exp-1206 | 38.2% | 98.2% | aider --model gemini/gemini-exp-1206 | whole |
| o1-mini-2024-09-12 | 32.9% | 96.9% | aider --model o1-mini | whole |
| claude-3-5-haiku-20241022 | 28.0% | 91.1% | aider --model claude-3-5-haiku-20241022 | diff |
| gemini-2.0-flash-exp | 22.2% | 100.0% | aider --model gemini/gemini-2.0-flash-exp | whole |
| DeepSeek Chat V2.5 | 17.8% | 92.9% | aider --model deepseek/deepseek-chat | diff |
| gpt-4o-2024-11-20 | 15.1% | 96.0% | aider --model gpt-4o-2024-11-20 | diff |
| Qwen2.5-Coder-32B-Instruct | 8.0% | 71.6% | aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct # via hyperbolic | diff |
| gpt-4o-mini-2024-07-18 | 3.6% | 100.0% | aider --model gpt-4o-mini-2024-07-18 | whole |
QwQ is a code architect, not an editor

December 03, 2024

QwQ 32B Preview is a “reasoning” model, which spends a lot of tokens thinking before rendering a final response. This is similar to OpenAI’s o1 models, which are most effective with aider when paired as an architect with a traditional LLM as an editor. In this mode, the reasoning model acts as an “architect” to propose a solution to the coding problem without regard for how to actually make edits to the source files. The “editor” model receives that proposal, and focuses solely on how to edit the existing source code to implement it.

Used alone without being paired with an editor, QwQ was unable to comply with even the simplest editing format. It was not able to reliably edit source code files. As a result, QwQ’s solo score on the benchmark was quite underwhelming (and far worse than the o1 models performing solo).

QwQ is based on Qwen 2.5 Coder 32B Instruct, and does better when paired with it as an architect + editor combo. However, this pairing provides only a modest benchmark improvement over using Qwen alone, and it comes with a fairly high latency cost: each request must wait for QwQ to return all of its thinking text and final solution proposal, and then for Qwen to turn that large response into actual file edits.

Pairing QwQ with other sensible editor models performed the same or worse than just using Qwen 2.5 Coder 32B Instruct alone.

QwQ+Qwen seems to be the best way to use QwQ, achieving a score of 74%. That is well below the SOTA results for this benchmark: Sonnet alone scores 84%, and o1-preview + o1-mini as architect + editor scores 85%.
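
The top-scoring QwQ + Qwen pairing corresponds to this command from the results table (an OPENROUTER_API_KEY is assumed to be set):

```
aider --model openrouter/qwen/qwq-32b-preview --editor-model openrouter/qwen/qwen-2.5-coder-32b-instruct --editor-edit-format editor-whole
```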

QwQ specific editing formats

I spent some time experimenting with a variety of custom editing formats for QwQ. In particular, I tried to parse the QwQ response and discard the long sections of “thinking” and retain only the “final” solution. None of this custom work seemed to translate into any significant improvement in the benchmark results.

Results

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| o1-preview | 79.7% | 93.2% | aider --model o1-preview | diff |
| QwQ + Qwen2.5 Coder 32B-I | 73.6% | 100.0% | aider --model openrouter/qwen/qwq-32b-preview --editor-model openrouter/qwen/qwen-2.5-coder-32b-instruct --editor-edit-format editor-whole | architect |
| Qwen2.5 Coder 32B-I | 71.4% | 94.7% | aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1 (via GLHF) | diff |
| QwQ + Haiku | 71.4% | 100.0% | aider --model openrouter/qwen/qwq-32b-preview --editor-model claude-3-5-haiku-20241022 --edit-format editor-whole | architect |
| o1-mini | 70.7% | 90.0% | aider --model o1-mini | whole |
| QwQ + DeepSeek V2.5 | 67.7% | 100.0% | aider --model openrouter/qwen/qwq-32b-preview --editor-model deepseek/deepseek-chat --edit-format editor-whole | architect |
| QwQ | 42.1% | 91.0% | aider --model openrouter/qwen/qwq-32b-preview | whole |

Open source model caveats

As discussed in a recent blog post, details matter with open source models. For clarity, new benchmark runs for this article were performed against OpenRouter’s endpoints for QwQ 32B Preview and Qwen 2.5 Coder 32B Instruct. For the other models, the benchmark was direct to their providers’ APIs.

Having recently done extensive testing of OpenRouter’s Qwen 2.5 Coder 32B Instruct endpoint, it seems reliable. The provider Mancer was blocked due to the small context window it provides.

For QwQ 32B Preview, Fireworks was blocked because of its small context window.

Details matter with open source models

November 21, 2024

Open source models like Qwen 2.5 32B Instruct are performing very well on aider’s code editing benchmark, rivaling closed source frontier models.

But pay attention to how your model is being served and quantized, as it can impact code editing skill. Open source models are often available at a variety of quantizations, and can be served with different token limits. These details matter when working with code.

The graph above and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model, served both locally and by a variety of cloud providers.

Pitfalls and details

This benchmarking effort highlighted a number of pitfalls and details specific to open source models which can have a significant impact on their ability to correctly edit code:

  • Quantization – Open source models are often available at dozens of different quantizations. Most seem to only modestly decrease code editing skill, but stronger quantizations do have a real impact.
  • Context window – Cloud providers can decide how large a context window to accept, and they often choose differently. Ollama’s local API server defaults to a tiny 2k context window, and silently discards data that exceeds it. Such a small window has catastrophic effects on performance, without throwing obvious hard errors.
  • Output token limits – Open source models are often served with wildly differing output token limits. This has a direct impact on how much code the model can write or edit in a response.
  • Buggy cloud providers – While benchmarking Qwen 2.5 Coder 32B Instruct and DeepSeek V2.5, I discovered multiple cloud providers with broken or buggy API endpoints. They seemed to be returning results different from expected based on the advertised quantization and context sizes. The harm caused to the code editing benchmark varied from serious to catastrophic. One provider scored 0.5% on the benchmark with DeepSeek V2.5, a highly capable model.

Closed source, proprietary models don’t typically have these issues. They are owned and operated by the organization that created them, and typically served with specific, predictable context window and output token limits. Their quantization level is usually unknown, but fixed and unchanging for all users.

Conclusions

When served competently, the best versions of the Qwen model rival GPT-4o, while the worst performing quantization is closer to the older GPT-4 Turbo. Even an otherwise excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance if run with Ollama’s default 2k context window.


Benchmark results

These are results from single benchmark runs, so expect normal variance of +/- 1-2%.

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| Fireworks: unknown | 72.2% | 94.0% | aider --model fireworks_ai/accounts/fireworks/models/qwen2p5-coder-32b-instruct | diff |
| Deepinfra: BF16 | 72.2% | 94.7% | aider --model deepinfra/Qwen/Qwen2.5-Coder-32B-Instruct | diff |
| mlx-community: 8bit | 72.2% | 92.5% | aider --model openai/mlx-community/Qwen2.5-Coder-32B-Instruct-8bit | diff |
| mlx-community: 4bit | 72.2% | 88.7% | aider --model openai/mlx-community/Qwen2.5-Coder-32B-Instruct-4bit | diff |
| Ollama: fp16 | 71.4% | 90.2% | aider --model ollama/qwen2.5-coder:32b-instruct-fp16 | diff |
| HuggingFace via GLHF: BF16 | 71.4% | 94.7% | aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1 | diff |
| Deepinfra via OpenRouter: BF16 | 69.9% | 89.5% | aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct | diff |
| Hyperbolic: BF16 | 69.2% | 91.7% | aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://api.hyperbolic.xyz/v1/ | diff |
| Hyperbolic via OpenRouter: BF16 | 68.4% | 89.5% | aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct | diff |
| Fireworks via OpenRouter: unknown | 67.7% | 94.0% | aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct | diff |
| OpenRouter: multiple | 67.7% | 95.5% | aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct | diff |
| Ollama: q4_K_M | 66.9% | 94.0% | aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M | diff |
| Ollama: q2_K | 61.7% | 91.7% | aider --model ollama/qwen2.5-coder:32b-instruct-q2_K | diff |
| Ollama: fp16, 2k ctx | 51.9% | 46.2% | aider --model ollama/qwen2.5-coder:32b-instruct-fp16 # num_ctx: 2048 | diff |

Setting Ollama’s context window size

Ollama uses a 2k context window by default, which is very small for working with aider. Unlike most other LLM servers, Ollama does not throw an error if you submit a request that exceeds the context window. Instead, it just silently truncates the request by discarding the “oldest” messages in the chat to make it fit within the context window.

Except for the single 2k context result, all of the Ollama results above were collected with at least an 8k context window. An 8k window is large enough to attempt all the coding problems in the benchmark. Aider sets Ollama’s context window to 8k by default, starting in aider v0.65.0.

You can change the Ollama server’s context window with a .aider.model.settings.yml file like this:

- name: ollama/qwen2.5-coder:32b-instruct-fp16
  extra_params:
    num_ctx: 8192
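For reference, aider’s `num_ctx` setting corresponds to the `options.num_ctx` field in Ollama’s REST API, which can be set per request. A minimal sketch of building such a request body (the helper name is ours, not aider’s):

```python
import json

def ollama_chat_payload(model: str, messages: list, num_ctx: int = 8192) -> str:
    """Build a JSON body for POST http://localhost:11434/api/chat,
    overriding Ollama's small default context window for this request."""
    return json.dumps({
        "model": model,
        "messages": messages,
        "options": {"num_ctx": num_ctx},
    })
```

Sending this body to a local Ollama server has the same effect as the `extra_params` setting above, but only for that one request.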

Choosing providers with OpenRouter

OpenRouter allows you to ignore specific providers in your preferences. This can be used to limit your OpenRouter requests to be served by only your preferred providers.
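These preferences can also be expressed per request: OpenRouter accepts a `provider` object in the chat completion body, including an `ignore` list of providers to skip. The exact field names here follow OpenRouter’s provider-routing options as documented at the time of writing and may change; the helper itself is a hypothetical sketch.

```python
import json

def openrouter_request_body(model: str, messages: list, ignore=()) -> str:
    """Build an OpenRouter chat completion body that skips the given
    providers via the provider-preferences `ignore` list."""
    body = {"model": model, "messages": messages}
    if ignore:
        body["provider"] = {"ignore": list(ignore)}
    return json.dumps(body)
```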

Notes

This article went through many revisions as I received feedback from numerous members of the community. Here are some of the noteworthy learnings and changes:

  • The first version of this article included incorrect Ollama models.
  • Earlier Ollama results used the too small default 2k context window, artificially harming the benchmark results.
  • The benchmark results appear to have uncovered a problem in the way OpenRouter was communicating with Hyperbolic. They fixed the issue 11/24/24, shortly after it was pointed out.
Separating code reasoning and editing

September 26, 2024 · https://aider.chat/2024/09/26/architect

Aider now has experimental support for using two models to complete each coding task:

  • An Architect model is asked to describe how to solve the coding problem.
  • An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.

Splitting up “code reasoning” and “code editing” in this manner has produced SOTA results on aider’s code editing benchmark. Using o1-preview as the Architect with either DeepSeek or o1-mini as the Editor produced the SOTA score of 85%. Using the Architect/Editor approach also significantly improved the benchmark scores of many models, compared to their previous “solo” baseline scores (striped bars).

Motivation

This approach was motivated by the release of OpenAI’s o1 models. They are strong at reasoning, but often fail to output properly formatted code editing instructions. It helps to instead let them describe the solution however they prefer and then pass that output to a more traditional LLM. This second Editor LLM can then interpret the solution description and produce the code editing instructions needed to update the existing source code.

This approach has recently become attractive for aider due to rapid improvements in the speed and costs of frontier models. In particular, chaining older LLMs would have been quite slow and incompatible with aider’s goal of providing an interactive, pair programming AI coding experience.

Code reasoning and code editing

Normally aider asks the model to solve a coding problem in one prompt, asking the LLM to explain the solution and return a well formatted series of file edits. All of aider’s editing formats require the LLM to return source code edits in a specific text format, so that aider can process the edits and apply them to the local source files.

Because this all happens in a single prompt/response round trip to the LLM, the model has to split its attention between solving the coding problem and conforming to the edit format.

The Architect/Editor approach splits this into two inference steps, possibly using two different LLMs:

  1. Solve the coding problem (Architect).
  2. Turn the proposed solution into a series of well formed code edits (Editor).

The Architect/Editor approach allows the Architect to focus on solving the coding problem and describe the solution however comes naturally to it. Similarly, the Editor can focus all of its attention on properly formatting the edits without needing to reason much about how to solve the coding problem.
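The two-step flow can be sketched as a tiny pipeline. This is an illustration of the idea, not aider’s actual implementation; `architect` and `editor` are any callables that map a prompt string to a completion.

```python
def architect_editor(task: str, architect, editor) -> str:
    """Two inference steps: free-form reasoning, then strict formatting."""
    # Step 1: the Architect describes a solution however it likes.
    plan = architect(f"Describe how to solve this coding task:\n{task}")
    # Step 2: the Editor converts the plan into well formed code edits.
    return editor(f"Turn this plan into code edits in the required format:\n{plan}")
```

Because the steps are decoupled, each role can be filled by whichever model suits it best.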

We can assign the Architect and Editor roles to LLMs well suited to each. Strong reasoning models like o1-preview make excellent Architects, while the Editor role can be assigned to an appropriate model based on cost, speed and code editing skill.

Results

The graph above and the table below show aider’s code editing benchmark scores for various combinations of Architect and Editor models.

Some noteworthy observations:

  • Pairing o1-preview as Architect with either DeepSeek or o1-mini as Editor sets a SOTA significantly above the previous best score. This result is obtained with the “whole” editing format, requiring the Editor to output a full updated copy of each edited source file. Both of these steps are therefore quite slow, so probably not practical for interactive use with aider.
  • Pairing OpenAI’s o1-preview with Anthropic’s Sonnet as the Editor produces the second best result. This is an entirely practical configuration for users able to work with both providers.
  • Pairing many models with themselves in the Architect/Editor configuration can provide significant benefits. Sonnet, GPT-4o and GPT-4o-mini all scored higher when used as an Architect/Editor pair.
  • Deepseek is surprisingly effective as an Editor model. It seems remarkably capable at turning proposed coding solutions into new, updated versions of the source files. Using the efficient “diff” editing format, Deepseek helps all the Architect models except for Sonnet.

Try it!

The development version of aider has built in defaults to support Architect/Editor coding with o1-preview, o1-mini, GPT-4o and Claude 3.5 Sonnet. Run aider with --architect or get started quickly like this:

pip install -U aider-chat

# Change directory into a git repo
cd /to/your/git/repo

# Work with Claude 3.5 Sonnet as the Architect and Editor
export ANTHROPIC_API_KEY=your-key-goes-here
aider --sonnet --architect

# Work with OpenAI models, using gpt-4o as the Editor
export OPENAI_API_KEY=your-key-goes-here
aider --4o --architect
aider --o1-mini --architect
aider --o1-preview --architect

More info

Aider has a number of “chat modes”, and “architect” is available as a new chat mode. The --architect switch is a shortcut for --chat-mode architect. For more details, see documentation on aider’s chat modes.

Full results

Below are the benchmark results using various models as the Architect, paired with various models as the Editor. Each section includes a “baseline” result, where the model works by itself in aider’s normal “code” editing mode (not as part of an Architect/Editor configuration). This “solo” baseline represents the performance previously available when using this model with aider.

| Architect | Editor | Edit Format | Pass Rate |
|---|---|---|---|
| o1-preview | o1-mini | whole | 85.0% |
| o1-preview | deepseek | whole | 85.0% |
| o1-preview | claude-3-5-sonnet | diff | 82.7% |
| o1-preview | deepseek | diff | 80.5% |
| o1-preview | gpt-4o | diff | 80.5% |
| o1-preview | Baseline | diff | 79.7% |
| claude-3.5-sonnet | claude-3.5-sonnet | diff | 80.5% |
| claude-3.5-sonnet | deepseek | diff | 78.9% |
| claude-3.5-sonnet | deepseek | whole | 78.9% |
| claude-3.5-sonnet | Baseline | diff | 77.4% |
| gpt-4o | gpt-4o | diff | 75.2% |
| gpt-4o | deepseek | diff | 74.4% |
| gpt-4o | deepseek | whole | 73.7% |
| gpt-4o | Baseline | diff | 71.4% |
| o1-mini | deepseek | whole | 71.4% |
| o1-mini | gpt-4o | diff | 70.7% |
| o1-mini | deepseek | diff | 69.2% |
| o1-mini | Baseline | diff | 61.1% |
| gpt-4o-mini | gpt-4o-mini | whole | 60.2% |
| gpt-4o-mini | Baseline | whole | 55.6% |
OpenAI o1-preview is SOTA on the aider leaderboard

September 12, 2024 · https://aider.chat/2024/09/12/o1

o1-preview

OpenAI o1-preview scored 79.7% on aider’s code editing benchmark, a state of the art result. It achieved this result with the “whole” edit format, where the LLM returns a full copy of the source code file with changes.

It is much more practical to use aider’s “diff” edit format, which allows the LLM to return search/replace blocks to efficiently edit the source code. This saves significant time and token costs.
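To illustrate the mechanics, here is a toy applier for search/replace blocks in the spirit of aider’s “diff” format. The marker strings and single-pass parsing are simplified assumptions for illustration; aider’s real parser is more forgiving and more robust.

```python
import re

# One SEARCH/REPLACE block: the text to find, then the text to substitute.
BLOCK = re.compile(
    r"<<<<<<< SEARCH\n(.*?)=======\n(.*?)>>>>>>> REPLACE",
    re.DOTALL,
)

def apply_search_replace(source: str, reply: str) -> str:
    """Apply each SEARCH/REPLACE block from an LLM reply to `source`."""
    for search, replace in BLOCK.findall(reply):
        source = source.replace(search, replace, 1)
    return source
```

Because only the changed region travels over the wire, this format uses far fewer output tokens than re-emitting whole files.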

Using the diff edit format the o1-preview model had a strong benchmark score of 75.2%. This likely places o1-preview between Sonnet and GPT-4o for practical use, but at significantly higher cost.

o1-mini

OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet, but scored below those models. It also works best with the whole edit format.

Future work

The o1-preview model had trouble conforming to aider’s diff edit format. The o1-mini model had trouble conforming to both the whole and diff edit formats. Aider is extremely permissive and tries hard to accept anything close to the correct formats.

It is surprising that such strong models had trouble with the syntactic requirements of simple text output formats. It seems likely that aider could optimize its prompts and edit formats to better harness the o1 models.

Using aider with o1

OpenAI’s new o1 models are supported in v0.57.0 of aider:

aider --model o1-mini
aider --model o1-preview

These are initial benchmark results for the o1 models, based on aider v0.56.1-dev. See the aider leaderboards for up-to-date results based on the latest aider releases.

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| o1-preview (whole) | 79.7% | 100.0% | aider --model o1-preview | whole |
| claude-3.5-sonnet (diff) | 77.4% | 99.2% | aider --sonnet | diff |
| o1-preview (diff) | 75.2% | 84.2% | aider --model o1-preview | diff |
| claude-3.5-sonnet (whole) | 75.2% | 100.0% | aider --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole | whole |
| gpt-4o-2024-08-06 (diff) | 71.4% | 98.5% | aider --model openai/gpt-4o-2024-08-06 | diff |
| o1-mini (whole) | 70.7% | 90.0% | aider --model o1-mini | whole |
| o1-mini (diff) | 62.4% | 85.7% | aider --model o1-mini --edit-format diff | diff |
| gpt-4o-mini (whole) | 55.6% | 100.0% | aider --model gpt-4o-mini | whole |