Anjiang-Wei/CodeARC

CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis (COLM'25)


Quick Start

Setting Up the Environment

  1. Create and activate a Conda environment:

    conda create -y -n CodeARC python=3.10.12
    conda activate CodeARC
  2. Install dependencies:

    pip install -r requirements.txt
  3. Set API keys for the model providers you plan to use:

    export OPENAI_API_KEY=<your_openai_api_key>
    export ANTHROPIC_API_KEY=<your_anthropic_api_key>
    export TOGETHER_API_KEY=<your_together_api_key>
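If runs fail with authentication errors, a quick stdlib check can confirm the environment is set up. This is a sketch, not part of the repository; the key names match the exports above:

```python
import os

# Keys the evaluation expects (per the exports above).
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "TOGETHER_API_KEY"]

def missing_keys(env=os.environ):
    """Return the names of required API keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys are set.")
```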

Running Main Evaluation

python3 run.py --model_name openai/gpt-4o-mini --total_idx 20

We support OpenAI models (e.g., openai/gpt-4o), Anthropic models (e.g., anthropic/claude-3-7-sonnet-20250219), and models served by Together AI (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo). For testing purposes, you can pass --total_idx 20 to limit evaluation to 20 problems instead of the full dataset (1114 problems). See run.py for additional configuration options.
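To evaluate several models in one go, a small driver script can shell out to run.py. This is a sketch under the assumption that run.py accepts exactly the flags shown above (its full CLI may offer more); the model identifiers are the examples from this section:

```python
import subprocess
import sys

# Example model identifiers from the providers listed above.
MODELS = [
    "openai/gpt-4o-mini",
    "anthropic/claude-3-7-sonnet-20250219",
    "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
]

def build_command(model_name, total_idx=20):
    """Build the run.py invocation for one model."""
    return [
        sys.executable, "run.py",
        "--model_name", model_name,
        "--total_idx", str(total_idx),
    ]

for model in MODELS:
    cmd = build_command(model)
    # subprocess.run(cmd, check=True)  # uncomment to actually launch the runs
    print(" ".join(cmd))
```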

To summarize results:

python3 src/compute_metrics.py
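At its core, summarizing results means aggregating per-problem outcomes into a pass rate. A minimal sketch of that computation (the actual compute_metrics.py may report more detailed statistics):

```python
def pass_rate(results):
    """Fraction of problems solved, given per-problem booleans."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# e.g. 3 of 4 problems solved
print(f"pass rate: {pass_rate([True, True, False, True]):.2%}")
```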

HuggingFace Dataset

The CodeARC datasets, CodeARC-Problems and CodeARC-Invocations, are hosted on HuggingFace.

Setting up HuggingFace Account

  1. Obtain an access token from your HuggingFace account settings (Settings → Access Tokens):

  2. Log in using the token:

    Option A: Use the command line:

    huggingface-cli login
    huggingface-cli whoami

    Option B: Add the token to the environment variable:

    export HF_TOKEN=<your_huggingface_token>
    
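Either option makes the token available to Python code. A stdlib-only sketch of how a script might resolve it, preferring the environment variable and falling back to the token cached by huggingface-cli login (the cache path below is an assumption matching recent versions of the CLI):

```python
import os
from pathlib import Path

def resolve_hf_token(env=os.environ):
    """Prefer HF_TOKEN from the environment, else fall back to the
    token cached by `huggingface-cli login` (assumed default path)."""
    token = env.get("HF_TOKEN")
    if token:
        return token
    cached = Path.home() / ".cache" / "huggingface" / "token"
    if cached.is_file():
        return cached.read_text().strip()
    return None
```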

Accessing Datasets via the HuggingFace datasets Library

You can directly load the datasets using the HuggingFace datasets library:

from datasets import load_dataset

# Define dataset paths
hf_problems_path = "anjiangwei/CodeARC-Problems"
hf_invocations_path = "anjiangwei/CodeARC-Invocations"

# Load datasets
problems_dataset = load_dataset(hf_problems_path)
invocations_dataset = load_dataset(hf_invocations_path)

# Example: Access the first training sample
print(problems_dataset["train"][0])
print(invocations_dataset["train"][0])
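Each problem typically comes with multiple input-output invocations, so grouping invocations by a shared problem identifier is a common first step. A toy sketch with illustrative records (the field names here are assumptions — inspect `invocations_dataset["train"].features` for the actual schema):

```python
from collections import defaultdict

# Toy records standing in for dataset rows; real field names may
# differ -- check the dataset's .features for the actual schema.
invocations = [
    {"problem_id": 0, "input": "[1, 2]", "output": "3"},
    {"problem_id": 0, "input": "[5, 5]", "output": "10"},
    {"problem_id": 1, "input": "'abc'", "output": "'cba'"},
]

def group_by_problem(rows, key="problem_id"):
    """Group invocation rows by their problem identifier."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

grouped = group_by_problem(invocations)
print({pid: len(rows) for pid, rows in grouped.items()})  # {0: 2, 1: 1}
```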

Citation

If you find our work useful, please cite our paper:

@inproceedings{wei2025codearc,
  title={Code{ARC}: Benchmarking Reasoning Capabilities of {LLM} Agents for Inductive Program Synthesis},
  author={Anjiang Wei and Tarun Suresh and Jiannan Cao and Naveen Kannan and Yuheng Wu and Kai Yan and Thiago S. F. X. Teixeira and Ke Wang and Alex Aiken},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=Q5pVZCrrKr}
}

License

This project is licensed under the Apache 2.0 License.
