Infinite-RL is a reward functions toolbox for LLM Reinforcement Learning. It provides modular reward functions for evaluating programming puzzles, mathematical problems, language detection, and auxiliary metrics like length and repetition penalties. The toolbox is designed to integrate with fine-tuning frameworks like Tunix for model training and optimization.
The package includes pre-built datasets for math tasks (math.json compiled from OpenAI's GSM8K) and programming puzzles (puzzles.json compiled from Microsoft's Python Programming Puzzles), along with WASM runtimes for secure JavaScript execution.
git clone https://github.com/hon9kon9ize/infinite-rl.git
cd infinite-rl
pip install .

# Or install directly from GitHub:
pip install git+https://github.com/hon9kon9ize/infinite-rl.git
Install dependencies and language runtimes:
The installation process will automatically attempt to install required language runtimes:
- macOS: Uses Homebrew to install Node.js and ts-node
- Linux: Uses apt-get to install Node.js and ts-node
- Windows: Provides links for manual installation, ts-node installation via npm if available
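To sanity-check that the language runtimes landed on your PATH after installation, a quick check (not part of the package) is:

```python
import shutil

def runtime_available(name: str) -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(name) is not None

for tool in ("node", "ts-node"):
    print(f"{tool}: {'found' if runtime_available(tool) else 'missing'}")
```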
(Optional) Activate the Python virtual environment before using the CLI:
source .venv/bin/activate
Runtimes (WASM)
- The JS runtime is built by `build_src/build_wasm.sh`.
- A GitHub Actions workflow (`.github/workflows/build_and_release_runtimes.yml`) runs the build and uploads `puzzle_js.wasm` to a GitHub Release.
- During installation, `setup.py` will try to download these runtimes automatically from the latest release (or set the `RUNTIME_RELEASE_TAG` environment variable to pin a release). If you prefer to build locally, run `./build_src/build_wasm.sh`; the generated files will be placed in `infinite_rl/runtimes/`.
Infinite-RL provides modular reward functions for evaluating model responses across different task types. Use the package to integrate reward evaluation into fine-tuning frameworks like Tunix.
Infinite-RL includes a comprehensive test suite for reward functions and evaluation.
Use pytest to run the unit tests for reward functions and the parser:
# Run all tests
python -m pytest tests -v
# Run specific reward function tests
python -m pytest tests/test_reward_functions.py -v
# Run puzzle reward function tests
python -m pytest tests/test_puzzle_reward_function.py -v
# Run tests with coverage report
python -m pytest tests --cov=infinite_rl --cov-report=html --cov-report=term-missing
# Or use the convenience script
python tests/run_coverage.py run # Run tests with coverage
python tests/run_coverage.py view # Opens HTML report in browser
python tests/run_coverage.py badge # Update coverage badge in README
python tests/run_coverage.py all # Run tests and update badge
# View coverage report in browser
open htmlcov/index.html

The project maintains >80% code coverage. Coverage reports are generated automatically in CI and can be viewed locally:
- HTML Report: `htmlcov/index.html` - Interactive coverage report
- Terminal Report: Shows missing lines in terminal output
- XML Report: `coverage.xml` - For CI integration
- GitHub Status Checks: Coverage percentage shown in PR/commit status
- Auto-updating Badges: Coverage badges update automatically in README
- Coverage Enforcement: Builds fail if coverage drops below 80%
- Status Checks: GitHub shows coverage percentage for each commit
- Badge Updates: README coverage badges update automatically on main branch
- Multiple Report Formats: HTML, XML, and terminal reports generated
Evaluates LLM-generated solutions to programming puzzles across multiple languages with automated verification.
Supported Languages:
- Python (executed locally via subprocess)
- JavaScript (executed via WASM runtime)
Features:
- Puzzle solution validation using predefined sat functions
- Support for various puzzle types (algebra, basic math, etc.)
- Secure execution environments (WASM for JS, local subprocess for Python)
- Detailed error reporting
- Difficulty ratings: Each programming puzzle has been rated for difficulty (1-5 scale) using the Gemini 3 Flash model
- Math tasks are drawn from GSM8K, filtered for easy mathematical problems (level 0)
Example:
from infinite_rl import get_reward_functions
# Initialize with custom timeout
reward_fns = get_reward_functions(timeout=10)
puzzle_fn = reward_fns["puzzle"]
# Evaluate Python puzzle solution
result = puzzle_fn.compute_reward(
model_output="<answer>\n```python\ndef sol(s):\n return \"19\"\n```\n</answer>",
expected_output={"puzzle": "SumOfDigits", "inputs": {"s": 10}, "language": "python"}
)
print(f"Score: {result.score}")

Getting Puzzle Prompts: You can access the puzzle data programmatically to understand what problems are available or to inspect puzzle specifications:
from infinite_rl.puzzles import get_puzzle_data, get_available_puzzles
# Get all available JavaScript puzzles
js_puzzles = get_available_puzzles("javascript")
print(f"Available JS puzzles: {len(js_puzzles)}")
print(f"First few: {js_puzzles[:5]}")
# Get all available Python puzzles
py_puzzles = get_available_puzzles("python")
print(f"Available Python puzzles: {len(py_puzzles)}")
# Get a specific puzzle's data (includes sat, sol, docstring, example, etc.)
puzzle_data = get_puzzle_data("QuadraticRoot", "javascript")
if puzzle_data:
    print("QuadraticRoot puzzle data:")
    print(f"Docstring: {puzzle_data['docstring']}")
    print(f"Example inputs: {puzzle_data.get('example', {})}")
else:
    print("Puzzle not found")

Evaluates mathematical problem-solving using symbolic computation.
Example:
from infinite_rl import get_reward_functions
reward_fns = get_reward_functions()
math_fn = reward_fns["math"]
result = math_fn.compute_reward(
model_output="<answer>x^2 + 2x + 1</answer>",
expected_output="(x+1)^2"
)
print(f"Correctness: {result.score}")

Conversation-based quality evaluation using an LLM Judge as the primary evaluator. Supports multilingual quality assessment (Cantonese, Chinese, English) from the truthy-dpo and yue-truthy datasets.
Features:
- LLM Judge (Skywork Reward Model) provides continuous quality score (0.0-1.0)
- Conversation format: system prompt + user prompt + model response
- Distributed across all difficulty levels (not rating-limited)
- Configurable weight in task selection during training (default: 10%)
- Multilingual support (yue, zh, en)
- Format gate on judge: If the response format is invalid (missing `<answer>` tag), the judge reward is gated to zero
Requirements:
- Running sglang server with Skywork Reward Model
- `use_llm_judge=True` with `api_host`, `api_port`, and `model_name` configured
Example:
from infinite_rl import CurriculumLearning
# Initialize with truthy task support
cl = CurriculumLearning(
use_llm_judge=True,
llm_judge_kwargs={
"api_host": "localhost",
"api_port": 8000,
"model_name": "Skywork/Skywork-Reward-Llama-3.1-8B"
}
)
task = cl.get_prompt()
if task.task_type == "truthy":
    print(f"System: {task.expected_answer['conversation'][0]['content']}")
    print(f"User: {task.prompt}")

A small encouragement reward that detects explicit chain-of-thought style reasoning placed inside a <think>...</think> block. The ReasoningStepsRewardFunction looks for common reasoning indicators (e.g., "first", "second", "finally", "therefore") and awards a modest bonus when multiple indicators are present.
Behavior:
- If no `<think>` block is found, no bonus is awarded.
- If 1–2 unique indicators appear in the `<think>` block, a small bonus (0.1) is returned.
- If 3+ unique indicators appear, a larger encouragement bonus (0.2) is returned.
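The thresholds above can be sketched as follows. The indicator list and regex here are illustrative assumptions, not the library's actual implementation:

```python
import re

# Illustrative indicator list; the real ReasoningStepsRewardFunction
# may look for a different set of phrases.
INDICATORS = ["first", "second", "third", "finally", "therefore", "thus"]

def reasoning_bonus(text: str) -> float:
    """Return 0.0, 0.1, or 0.2 based on unique indicators in <think>...</think>."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not match:
        return 0.0  # no <think> block, no bonus
    body = match.group(1).lower()
    unique = {word for word in INDICATORS if word in body}
    if len(unique) >= 3:
        return 0.2  # strong evidence of step-by-step reasoning
    if len(unique) >= 1:
        return 0.1  # some reasoning indicators present
    return 0.0

print(reasoning_bonus("<think>First we add. Second we check. Finally done.</think>"))  # 0.2
```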
Example:
from infinite_rl import get_reward_functions
reward_fns = get_reward_functions()
reason_fn = reward_fns["reasoning_steps"]
model_out = "<think>First, we compute the sum. Second, we verify the result. Finally, we present it.</think>"
score = reason_fn.compute_reward(model_out, expected_output=None)
print(f"Reasoning bonus: {score.score}")

Uses a remote LLM-based reward model to evaluate response quality continuously. The LLMJudgeRewardFunction integrates with an sglang server running the Skywork Reward Model (V2-Qwen3-4B) to score responses on a continuous scale.
Requirements:
- Running sglang server with Skywork Reward Model
- Network access to the sglang API endpoint
Features:
- Continuous quality scoring (not binary correct/incorrect)
- Flexible score normalization (raw or tanh-based [0, 1] mapping)
- Configurable API endpoint, timeout, and score thresholds
- Graceful error handling when API unavailable
Example:
from infinite_rl.reward_functions import LLMJudgeRewardFunction
judge = LLMJudgeRewardFunction(
api_host="localhost",
api_port=8000,
normalize=True # Normalize to [0, 1]
)
judge.initialize()
result = judge.compute_reward(task)
print(f"Quality score: {result.score:.4f}")

Setup Instructions:
See docs/LLM_JUDGE_REWARD_FUNCTION.md for detailed setup, configuration, and integration guide.
The CurriculumLearning class provides adaptive task difficulty progression based on model performance using a sliding window success rate mechanism. It starts with math tasks (level 0) and progressively includes programming puzzles with increasing difficulty (levels 1-5) as the model demonstrates competence.
Features:
- Sliding Window Tracking: Tracks the last N episodes (default: 50) of success/failure per task type
- Dual-Criterion Advancement: Advances difficulty only when both conditions are met:
  - Average success rate > 80% (configurable via `success_rate_threshold`)
  - Variance < 0.05 (configurable via `variance_threshold`)
- Dual-Criterion Demotion: Demotes difficulty when both conditions are met:
  - Average success rate < 40% (configurable via `demote_threshold`)
  - Variance < 0.05 (configurable via `variance_threshold`)
- Level Change Cooldown: Prevents rapid level fluctuations by enforcing a minimum of 5 steps (configurable via `level_change_cooldown`) between changes
- Per-Task-Type Windows: Maintains independent sliding windows for math and puzzle tasks
- Simplified Task Management: Clean separation between GRPO batches with automatic diversity weighting
- Inverse Task Weighting: Uses inverse weighting based on task distribution (weight = 1.0 / num_tasks_at_level) to ensure balanced sampling across difficulty levels, giving higher priority to underrepresented levels. The current level receives an additional 2x weight multiplier to focus training on the active difficulty level.
- Multi-Task Support: Works with math problems and programming puzzles
- Automatic Judge Score Computation: `get_judge_scores()` computes missing LLM Judge scores on-demand for accurate statistics during training
This ensures the model has truly mastered a difficulty level rather than just "catching up" with lucky guesses.
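The dual-criterion checks can be sketched as below, assuming the window stores binary success/failure outcomes and variance is the population variance of those outcomes. This is a sketch of the described behavior, not the library's implementation; note that for binary outcomes the variance criterion effectively demands a very consistent window:

```python
from collections import deque

def window_stats(window):
    """Mean success rate and population variance of a sliding window."""
    mean = sum(window) / len(window)
    variance = sum((x - mean) ** 2 for x in window) / len(window)
    return mean, variance

def should_advance(window, success_rate_threshold=0.8, variance_threshold=0.05):
    if len(window) < window.maxlen:
        return False  # window not full: not enough evidence yet
    mean, variance = window_stats(window)
    return mean > success_rate_threshold and variance < variance_threshold

def should_demote(window, demote_threshold=0.4, variance_threshold=0.05):
    if len(window) < window.maxlen:
        return False
    mean, variance = window_stats(window)
    return mean < demote_threshold and variance < variance_threshold

window = deque(maxlen=20)      # smaller than the default window_size=50, for brevity
window.extend([1] * 19 + [0])  # 95% success, consistent
print(should_advance(window))  # True
print(should_demote(window))   # False
```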
Configuration:
from infinite_rl import CurriculumLearning
# Initialize with default settings
cl = CurriculumLearning()
# Or customize thresholds
cl = CurriculumLearning(
timeout=10,
answer_tag="answer",
think_tag="think",
window_size=50, # Track last 50 episodes
success_rate_threshold=0.8, # Require 80% success rate for advancement
demote_threshold=0.4, # Demote if success rate falls below 40%
variance_threshold=0.05, # Require low variance for consistency
level_change_cooldown=5, # Minimum steps between level changes
truthy_learning_rate=0.1, # 10% chance of truthy tasks
)

Example Usage:
from infinite_rl import CurriculumLearning
cl = CurriculumLearning()
# Get a task appropriate for current skill level
task = cl.get_prompt()
print(f"Task type: {task.task_type}, Difficulty level: {task.level}")
print(f"Prompt: {task.prompt}")
# Evaluate model response and update learning state
model_response = "<answer>4</answer>"
reward = cl.compute_reward(
task_id=task.task_id,
model_output=model_response
)
print(f"Reward: {reward}")
# Check learning progress
stats = cl.get_learning_stats()
print(f"Current level: {stats['current_level']}")
print(f"Success rate: {stats['sliding_window_stats']['mean_success_rate']:.1%}")
print(f"Variance: {stats['sliding_window_stats']['mean_variance']:.4f}")

Learning Progression:
- Maintains sliding windows of success/failure for each task type
- Calculates mean success rate and variance for recent performance
- Advances to next difficulty level when success rate and consistency thresholds are both met
- Provides detailed statistics to monitor curriculum progression
The CurriculumLearning class provides a single-call reward API that streamlines integration with GRPO and fine-tuning frameworks. The design consolidates reward computation, batch processing, and curriculum tracking into one unified method.
Core API Method:
def compute_reward(task_id: str, model_output: str) -> float

Behavior:
- Returns: A combined reward score (float between 0.0 and 1.0)
- Incomplete Batches (< `num_generations` completions): Returns the primary score immediately
- Complete Batches (>= `num_generations` completions):
  - Triggers batch LLM Judge evaluation (if enabled and not already computed)
  - Recomputes the combined score with auxiliary metrics blended in
  - Updates curriculum tracking for non-truthy tasks
  - Returns the final combined score
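The incomplete/complete branching can be sketched as follows. `PendingBatch` and the mean-based `finalize` below are illustrative stand-ins, not infinite_rl APIs; the real finalization blends judge and auxiliary scores rather than averaging:

```python
# Hypothetical sketch of the deferred-batch flow described above.
class PendingBatch:
    def __init__(self, num_generations):
        self.num_generations = num_generations
        self.primary_scores = []

    def add(self, primary_score, finalize):
        """Return the primary score until the batch fills, then finalize."""
        self.primary_scores.append(primary_score)
        if len(self.primary_scores) < self.num_generations:
            return primary_score                  # incomplete batch: primary only
        return finalize(self.primary_scores)      # complete: blend in deferred scores

batch = PendingBatch(num_generations=2)
mean = lambda scores: sum(scores) / len(scores)   # toy stand-in for real blending
print(batch.add(1.0, finalize=mean))  # 1.0 (incomplete)
print(batch.add(0.0, finalize=mean))  # 0.5 (batch complete, finalized)
```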
Design Benefits:
- Single Integration Point: No need to track task state externally or call separate methods for rewards vs. metrics
- Deferred LLM Judge: Batch evaluations happen automatically when all generations are ready, improving efficiency
- Automatic Curriculum: Task difficulty progresses transparently without separate state management
- Integrated Generation Tracking: Each task accumulates generations internally with full history preserved
Task Types & Scoring:
- Math Tasks (Level 0): Binary correctness (0.0 or 1.0) + optional LLM Judge auxiliary score
- Puzzle Tasks (Levels 1-5): Binary correctness (0.0 or 1.0) + optional LLM Judge auxiliary score
- Truthy Tasks (All Levels): Primary score = LLM Judge score (continuous 0.0-1.0), deferred to batch completion
- Format Gate: If the format is invalid (missing `<answer>` tag), the primary score is gated to 0.0 regardless of judge quality
Score Composition: When all auxiliary functions are enabled, the combined score is:
combined_score = primary_weight * primary_score + aux_weight * avg(auxiliary_scores) + judge_contribution
Where:
- `primary_score`: Task-specific correctness (binary for math/puzzle, continuous for truthy)
- `auxiliary_scores`: Optional metrics like format validity, repetition penalty, reasoning-steps bonus, and language consistency (truthy tasks only)
- `judge_contribution`: Normalized LLM Judge score (0.0 when `llm_judge_weight=0.0`)
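As a numeric illustration of the composition formula, with assumed example weights (the values below, and the assumptions that `judge_contribution = llm_judge_weight * judge_score` and that `primary_weight` is the remaining mass, are ours, not library defaults):

```python
# Assumed example weights; see the CurriculumLearning kwargs
# (aux_weight, llm_judge_weight) for the real configuration knobs.
aux_weight = 0.1
llm_judge_weight = 0.2
primary_weight = 1.0 - aux_weight - llm_judge_weight  # assumed normalization

primary_score = 1.0                 # math/puzzle: binary correctness
auxiliary_scores = [1.0, 0.9, 0.8]  # e.g. format, repetition, reasoning bonus
judge_score = 0.75                  # normalized LLM Judge score

combined_score = (
    primary_weight * primary_score
    + aux_weight * (sum(auxiliary_scores) / len(auxiliary_scores))
    + llm_judge_weight * judge_score
)
print(round(combined_score, 3))  # 0.7*1.0 + 0.1*0.9 + 0.2*0.75 = 0.94
```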
Configuration:
from infinite_rl import CurriculumLearning
# Basic: Single-call API with all defaults
cl = CurriculumLearning()
# Customize scoring:
cl = CurriculumLearning(
num_generations=4, # GRPO with 4 completions per task
aux_weight=0.1, # Auxiliary metrics contribute 10% of final score
llm_judge_weight=0.2, # LLM Judge score contributes 20% (when available)
use_format=True, # Enable format validation (default)
use_length=True, # Enable response length regularization (default: False)
use_llm_judge=True, # Enable LLM Judge for auxiliary evaluation
)

Usage in GRPO Training:
from infinite_rl import CurriculumLearning
cl = CurriculumLearning(num_generations=4)
# Training loop
for batch_idx in range(num_batches):
task = cl.get_prompt()
for generation in range(4): # 4 completions per GRPO batch
model_output = generate_response(task.prompt)
# Single-call API: computes and returns reward immediately
reward = cl.compute_reward(task.task_id, model_output)
# For batch 1: returns primary_score (incomplete batch)
# For batch 4+: returns combined_score (batch completed, LLM Judge applied)
rewards.append(reward)
# Process rewards for policy optimization
process_grpo_batch(task.task_id, rewards)

Common Patterns:
- Simple Correctness Only (no auxiliary metrics):

  cl = CurriculumLearning(aux_weight=0.0, use_llm_judge=False)
  reward = cl.compute_reward(task_id, model_output)  # Returns: 0.0 (wrong) or 1.0 (correct)

- With Auxiliary Quality Metrics:

  cl = CurriculumLearning(aux_weight=0.15)  # 15% weight for format, length, repetition penalties
  reward = cl.compute_reward(task_id, model_output)  # Returns: 0.0-1.0 blending correctness + auxiliary scores

- With LLM Judge for Truthy Tasks:

  cl = CurriculumLearning(
      use_llm_judge=True,
      llm_judge_kwargs={"api_host": "localhost", "api_port": 8000, ...}
  )
  reward = cl.compute_reward(task_id, model_output)
  # For truthy: returns 0.0-1.0 LLM Judge score (deferred to batch completion)
  # For math/puzzle: returns 0.0 or 1.0 correctness
Removed Methods:
- `get_reward()` - Consolidated into the `compute_reward()` return value
- `_normalize_score()` - No longer needed with the unified API
The system implements a clean Task → Generation hierarchy for GRPO (Group Relative Policy Optimization) batch management with zero redundancy:
Core Components:
- `Task.generations`: List of all generations for a task
- `Task.add_generation()`: Adds a new generation with output, rewards, and primary score
- `Task.latest_generation`: Gets the most recent generation
- `Session.get_batch_data(task_id)`: Retrieves all generation data for analysis
- `Session.get_batch_stats(task_id)`: Provides comprehensive batch statistics
Simplified Task Management:
- Fresh Tasks per Batch: Each GRPO batch gets a fresh task instance from the dataset
- Within-Batch Reuse: `DynamicCurriculumDataset` caches and reuses the same task for all `num_generations` completions within a batch
- Dataset Diversity: Built-in weighting reduces the probability of recently used dataset rows across batches
- Clean Separation: No complex state tracking - each batch is independent
Key Benefits:
- Zero Redundancy: No scattered state across multiple dicts
- Single Source of Truth: Task owns all its generations
- Automatic Cleanup: No manual dict management needed
- Full Queryability: Complete generation history tracking
- Clean Integration: Seamless GRPO/non-GRPO task handling
- Simplified Logic: Removed error-prone active task tracking
Usage Examples:
from infinite_rl import CurriculumLearning
cl = CurriculumLearning()
# GRPO batch automatically handled by DynamicCurriculumDataset
# Each batch gets a fresh task, within-batch reuse is automatic
for generation in range(4): # GRPO with 4 generations
task = cl.get_prompt() # Fresh task per batch, cached within batch
model_output = generate_response(task.prompt)
reward = cl.compute_reward(task.task_id, model_output)
# Task automatically accumulates generations
# Query generation statistics
from infinite_rl.session import Session
session = cl.session # Access internal session
# Get all generation data
batch_data = session.get_batch_data(task.task_id)
print(f"Generated {len(batch_data)} responses")
# Get batch statistics
stats = session.get_batch_stats(task.task_id)
print(f"Average score: {stats['scores']['avg']:.3f}")
print(f"Best generation index: {stats['best_generation']['index']}")

Test the executor with different programming languages:
from infinite_rl import RewardExecutor
executor = RewardExecutor(timeout=5)
# Test JavaScript puzzle execution
stdout, stderr = executor.run_single('{"puzzle": "SumOfDigits", "inputs": {"s": 10}, "code": "function sol(inputs) { return \'19\'; }"}', "javascript")
print(f"JS Result: {stdout}")
# Test Python puzzle execution (via local runner)
# Note: Python puzzles are executed via subprocess in the reward function, not directly through the executor

Install and test in Colab with this notebook:
# Install the package
!pip install git+https://github.com/hon9kon9ize/infinite-rl.git
# Import and test
from infinite_rl import RewardExecutor, get_reward_functions
# Test executor
executor = RewardExecutor(timeout=5)
stdout, stderr = executor.run_single("print(2 + 2)", "python")
print(f"Executor test - Output: {stdout}, Error: {stderr}")
# Test coding reward function
reward_fns = get_reward_functions(timeout=5)
python_fn = reward_fns["python"]
result = python_fn.compute_reward(
model_output="<answer>\n```python\nprint(2 + 2)\n```\n</answer>",
expected_output="4"
)
print(f"Reward Result: {result}")
print(f"Correctness Score: {result.score}")

To run all unit tests, install development dependencies and use pytest:
pip install -r requirements_dev.txt
pytest

infinite_rl/
├── curriculum.py # Curriculum learning with adaptive difficulty
├── executor.py # Multi-language code executor (WASM for JS)
├── generator.py # LLM orchestration and resume logic
├── parser.py # Robust tag extraction and markdown parsing
├── prompts.py # Task-specific system instructions
├── runner.py # Python puzzle execution via subprocess
├── puzzles.py # Puzzle data loading and utilities
├── reward_functions/
│ ├── reward_function.py # Base reward function class
│ ├── math.py # Math task evaluator (symbolic computation)
│ ├── puzzle.py # Puzzle task evaluator (validates code execution)
│ ├── reasoning_steps.py # Chain-of-thought bonus reward
│ ├── length.py # Response length regularizer (cosine decay)
│ ├── repetition.py # N-gram repetition penalty
│ ├── lang_consistency.py # Language consistency detection
│ └── format.py # Format validation
└── runtimes/
├── math.json # Math problem dataset compiled from OpenAI's GSM8K with solutions
├── puzzles.json # Programming puzzle specifications compiled from Microsoft's Python Programming Puzzles
└── puzzle_js.wasm # WASM runtime for JavaScript execution
1. Math Tasks
- Source: `infinite_rl/runtimes/math.json` (compiled from OpenAI's GSM8K)
- Evaluation: Symbolic computation with SymPy
- Reward Function: `MathRewardFunction`
2. Puzzle Tasks
- Source: `infinite_rl/runtimes/puzzles.json` (compiled from Microsoft's Python Programming Puzzles)
- Languages: Python (subprocess execution) and JavaScript (WASM execution)
- Evaluation: Code validation against SAT (satisfaction) functions
- Reward Function: `PuzzleRewardFunction`
- Difficulty: Rated 1-5 per puzzle
All task types are designed for RLHF (Reinforcement Learning from Human Feedback) readiness. Every sample follows a strict three-part structure:
- Prompt: The instruction.
- Answer: The ground-truth reference.
- Response: A detailed step-by-step reasoning trace (Chain-of-Thought) in which the final solution is always wrapped in `<answer>` tags.
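A hypothetical sample illustrating this three-part structure (the field names and content below are illustrative, not taken from the shipped datasets):

```python
import json

# Illustrative sample; the actual dataset schema may differ.
sample = {
    "prompt": "What is 2 + 2?",
    "answer": "4",
    "response": (
        "<think>First, identify the operands: 2 and 2. "
        "Then add them: 2 + 2 = 4.</think>\n"
        "<answer>4</answer>"
    ),
}
print(json.dumps(sample, indent=2))
```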
Handles execution of code in multiple languages with timeout protection and error handling. Located in infinite_rl/executor.py.
Each task type has a specialized reward function that:
- Initializes necessary components (e.g., loading embedding or ML models)
- Executes/evaluates generated content extracted from `<answer>` tags.
- Computes a reward score (0-1) combining format and correctness.
- Returns detailed evaluation metrics.
All reward functions inherit from the `RewardFunction` base class and are accessible via `get_reward_functions()`.
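The pattern can be sketched with stand-in types. The `RewardFunction` and `RewardResult` classes below are illustrative, assuming the base class exposes `initialize()` and `compute_reward()` returning an object with a `score` field; the real interface lives in `infinite_rl/reward_functions/reward_function.py` and may differ:

```python
import re
from dataclasses import dataclass, field

# Stand-in types for illustration only.
@dataclass
class RewardResult:
    score: float
    details: dict = field(default_factory=dict)

class RewardFunction:
    def initialize(self):
        """Load any heavy resources (models, datasets) before first use."""

    def compute_reward(self, model_output, expected_output):
        raise NotImplementedError

class ExactMatchReward(RewardFunction):
    """Toy evaluator following the pattern above: extract the <answer> tag,
    compare against the reference, and return a 0-1 score with details."""

    def compute_reward(self, model_output, expected_output):
        match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
        if not match:
            return RewardResult(0.0, {"error": "missing <answer> tag"})
        extracted = match.group(1).strip()
        return RewardResult(1.0 if extracted == expected_output else 0.0,
                            {"extracted": extracted})

fn = ExactMatchReward()
fn.initialize()
print(fn.compute_reward("<answer>4</answer>", "4").score)  # 1.0
```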
A utility to discourage verbosity when the answer is correct and to discourage laziness (encourage effort) when the answer is incorrect. Instead of a linear penalty, it uses a cosine curve to create a "sweet spot" for response length.
- Purpose: Prevent overly long correct answers and encourage longer attempts for incorrect answers.
- Math (short): For a normalized x in [0,1], the functions used are:
- Correct answers (decay after target): R = (cos(pi * x) + 1) / 2 (maps 1 -> 0 over range)
- Incorrect answers (encourage effort): R = (1 - cos(pi * x)) / 2 (maps 0 -> 1 over range)
- Implementation: See `infinite_rl/reward_functions/length.py`, function `cosine_length_reward(length, min_len=1, max_len=1000, target_len=None, correct=True)`.
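A minimal sketch of the two curves, assuming length is normalized linearly to x in [0, 1] (from `target_len` to `max_len` for correct answers, and from `min_len` to `max_len` for incorrect ones). The library's exact normalization may differ:

```python
import math

def sketch_length_reward(length, min_len=1, max_len=1000, target_len=200, correct=True):
    """Illustrative reimplementation of the cosine length curves above."""
    length = max(min_len, min(length, max_len))  # clamp to bounds
    if correct:
        if length <= target_len:
            return 1.0  # within the sweet spot: full credit
        x = (length - target_len) / (max_len - target_len)
        return (math.cos(math.pi * x) + 1) / 2   # decays 1 -> 0
    x = (length - min_len) / (max_len - min_len)
    return (1 - math.cos(math.pi * x)) / 2       # grows 0 -> 1

print(sketch_length_reward(200))                  # 1.0 (at the target)
print(sketch_length_reward(1000))                 # 0.0 (fully decayed)
print(sketch_length_reward(1000, correct=False))  # 1.0 (maximum effort credit)
```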
Usage example (quick):
from infinite_rl.reward_functions.length import cosine_length_reward
length = 350
len_reward = cosine_length_reward(
length=length,
min_len=1,
max_len=1000,
target_len=200, # for correct answers, lengths <= 200 get full credit
correct=True,
)
# Combine with a base correctness score (example):
final_score = base_correctness_score * len_reward

Interactive examples (print to inspect behavior):
from infinite_rl.reward_functions.length import cosine_length_reward
print("Short correct:", cosine_length_reward(10, min_len=1, max_len=1000, target_len=20, correct=True))
print("Long correct:", cosine_length_reward(500, min_len=1, max_len=1000, target_len=200, correct=True))
print("Short incorrect (encourage longer):", cosine_length_reward(5, min_len=1, max_len=1000, correct=False))
print("Moderate incorrect (some effort):", cosine_length_reward(150, min_len=1, max_len=1000, correct=False))

Notes:
- For `correct=True`, lengths <= `target_len` receive full reward (1.0); beyond that the reward decays smoothly to 0 at `max_len`.
- For `correct=False`, the reward increases smoothly with length to encourage longer reasoning attempts.
- The function clamps `length` to `[min_len, max_len]` and validates bounds.
We penalize repeated n-grams to discourage degenerate or looping responses. The penalty is a normalized negative value computed as:
from infinite_rl.reward_functions.repetition import ngram_repetition_reward
penalty = ngram_repetition_reward(text, n=3, weight=-0.1)

Behavior:
- Uses simple tokenization (lowercasing and punctuation removal) and counts duplicated n-grams.
- Returns a negative penalty (<= 0) proportional to the fraction of duplicated n-grams in the response; 0 if no duplicates.
- `weight` controls the maximum magnitude (default -0.1).
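The computation can be sketched as below, under the assumption that "fraction of duplicated n-grams" means occurrences beyond the first divided by the total n-gram count; the library's tokenizer and counting may differ:

```python
import re
from collections import Counter

def sketch_repetition_penalty(text: str, n: int = 3, weight: float = -0.1) -> float:
    """Penalty proportional to the fraction of duplicated n-grams (always <= 0)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, drop punctuation
    if len(tokens) < n:
        return 0.0  # too short to form any n-gram
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    duplicates = sum(c - 1 for c in Counter(ngrams).values())  # repeats beyond first
    if duplicates == 0:
        return 0.0
    return weight * duplicates / len(ngrams)

print(sketch_repetition_penalty("the cat sat. the cat sat.", n=2))  # about -0.04
print(sketch_repetition_penalty("all words are unique here", n=2))  # 0.0
```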
Quick example (inspect behavior):
from infinite_rl.reward_functions.repetition import ngram_repetition_reward
text = "Hello Hello Hello world world world"
penalty = ngram_repetition_reward(text, n=2, weight=-0.1)
print("Repetition penalty (n=2):", penalty)
# Combine with base score
# final_score = max(0.0, base_correctness_score + penalty)

Notes:
- Combine this penalty with the base correctness score (e.g., final_score = max(0.0, base_correctness + penalty)).
GSM8K Dataset (Math tasks source):
- Repository: https://huggingface.co/datasets/openai/gsm8k
@article{cobbe2021gsm8k,
title={Training Verifiers to Solve Math Word Problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}

Programming Puzzles (Puzzle tasks source):
@inproceedings{
schuster2021programming,
title={Programming Puzzles},
author={Tal Schuster and Ashwin Kalyan and Alex Polozov and Adam Tauman Kalai},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2021},
url={https://arxiv.org/abs/2106.05784}
}

Python Programming Puzzles Repository (Implementation source): We borrowed puzzle implementation code from Microsoft's Python Programming Puzzles repository and implemented a JavaScript version for WASM-based execution.