Infinite-RL is a reward functions toolbox for LLM Reinforcement Learning. It provides modular reward functions for evaluating programming puzzles, mathematical problems, language detection, and auxiliary metrics like length and repetition penalties. The toolbox is designed to integrate with fine-tuning frameworks like Tunix for model training and optimization.
The package includes pre-built datasets for math tasks (math.json compiled from OpenAI's GSM8K) and programming puzzles (puzzles.json compiled from Microsoft's Python Programming Puzzles), along with WASM runtimes for secure JavaScript execution.
git clone https://github.com/hon9kon9ize/infinite-rl.git
cd infinite-rl
pip install .

# Or install directly from GitHub:
pip install git+https://github.com/hon9kon9ize/infinite-rl.git
Install dependencies and language runtimes:
The installation process will automatically attempt to install required language runtimes:
- macOS: Uses Homebrew to install Node.js and ts-node
- Linux: Uses apt-get to install Node.js and ts-node
- Windows: Provides links for manual installation, ts-node installation via npm if available
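To sanity-check that the language runtimes landed on your PATH after installation, a quick check (not part of the package) is:

```python
import shutil

def runtime_available(name: str) -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(name) is not None

for tool in ("node", "ts-node"):
    print(f"{tool}: {'found' if runtime_available(tool) else 'missing'}")
```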
(Optional) Activate the Python virtual environment before using the CLI:
source .venv/bin/activate
Runtimes (WASM)
- The JS runtime is built by `build_src/build_wasm.sh`.
- A GitHub Actions workflow (`.github/workflows/build_and_release_runtimes.yml`) runs the build and uploads `puzzle_js.wasm` to a GitHub Release.
- During installation, `setup.py` will try to download these runtimes automatically from the latest release (or set the `RUNTIME_RELEASE_TAG` environment variable to pin a release). If you prefer to build locally, run `./build_src/build_wasm.sh`; the generated files will be placed in `infinite_rl/runtimes/`.
Infinite-RL provides modular reward functions for evaluating model responses across different task types. Use the package to integrate reward evaluation into fine-tuning frameworks like Tunix.
Infinite-RL includes a comprehensive test suite for reward functions and evaluation.
Use pytest to run the unit tests for reward functions and the parser:
# Run all tests
python -m pytest tests -v
# Run specific reward function tests
python -m pytest tests/test_reward_functions.py -v
# Run puzzle reward function tests
python -m pytest tests/test_puzzle_reward_function.py -v
# Run tests with coverage report
python -m pytest tests --cov=infinite_rl --cov-report=html --cov-report=term-missing
# Or use the convenience script
python tests/run_coverage.py run # Run tests with coverage
python tests/run_coverage.py view # Opens HTML report in browser
python tests/run_coverage.py badge # Update coverage badge in README
python tests/run_coverage.py all # Run tests and update badge
# View coverage report in browser
open htmlcov/index.html

The project maintains >80% code coverage. Coverage reports are generated automatically in CI and can be viewed locally:
- HTML Report: `htmlcov/index.html` - Interactive coverage report
- Terminal Report: Shows missing lines in terminal output
- XML Report: `coverage.xml` - For CI integration
- GitHub Status Checks: Coverage percentage shown in PR/commit status
- Auto-updating Badges: Coverage badges update automatically in README
- Coverage Enforcement: Builds fail if coverage drops below 80%
- Status Checks: GitHub shows coverage percentage for each commit
- Badge Updates: README coverage badges update automatically on main branch
- Multiple Report Formats: HTML, XML, and terminal reports generated
Evaluates LLM-generated solutions to programming puzzles across multiple languages with automated verification.
Supported Languages:
- Python (executed locally via subprocess)
- JavaScript (executed via WASM runtime)
Features:
- Puzzle solution validation using predefined sat functions
- Support for various puzzle types (algebra, basic math, etc.)
- Secure execution environments (WASM for JS, local subprocess for Python)
- Detailed error reporting
- Difficulty ratings: Each programming puzzle has been rated for difficulty (1-5 scale) using the Gemini 3 Flash model
- Math tasks are drawn from GSM8K, filtered for easy mathematical problems (level 0)
Example:
from infinite_rl import get_reward_functions
# Initialize with custom timeout
reward_fns = get_reward_functions(timeout=10)
puzzle_fn = reward_fns["puzzle"]
# Evaluate Python puzzle solution
result = puzzle_fn.compute_reward(
model_output="<answer>\n```python\ndef sol(s):\n return \"19\"\n```\n</answer>",
expected_output={"puzzle": "SumOfDigits", "inputs": {"s": 10}, "language": "python"}
)
print(f"Score: {result.score}")

Getting Puzzle Prompts: You can access the puzzle data programmatically to understand what problems are available or to inspect puzzle specifications:
from infinite_rl.puzzles import get_puzzle_data, get_available_puzzles
# Get all available JavaScript puzzles
js_puzzles = get_available_puzzles("javascript")
print(f"Available JS puzzles: {len(js_puzzles)}")
print(f"First few: {js_puzzles[:5]}")
# Get all available Python puzzles
py_puzzles = get_available_puzzles("python")
print(f"Available Python puzzles: {len(py_puzzles)}")
# Get a specific puzzle's data (includes sat, sol, docstring, example, etc.)
puzzle_data = get_puzzle_data("QuadraticRoot", "javascript")
if puzzle_data:
    print("QuadraticRoot puzzle data:")
    print(f"Docstring: {puzzle_data['docstring']}")
    print(f"Example inputs: {puzzle_data.get('example', {})}")
else:
    print("Puzzle not found")

Evaluates mathematical problem-solving using symbolic computation.
Example:
from infinite_rl import get_reward_functions
reward_fns = get_reward_functions()
math_fn = reward_fns["math"]
result = math_fn.compute_reward(
model_output="<answer>x^2 + 2x + 1</answer>",
expected_output="(x+1)^2"
)
print(f"Correctness: {result.score}")

Conversation-based quality evaluation using an LLM Judge as the primary evaluator. Supports multilingual quality assessment (Cantonese, Chinese, English) from the truthy-dpo and yue-truthy datasets.
Features:
- LLM Judge (Skywork Reward Model) provides continuous quality score (0.0-1.0)
- Conversation format: system prompt + user prompt + model response
- Distributed across all difficulty levels (not rating-limited)
- Configurable weight in task selection during training (default: 10%)
- Multilingual support (yue, zh, en)
- Format gate on judge: If the response format is invalid (missing `<answer>` tag), the judge reward is gated to zero
Requirements:
- Running sglang server with Skywork Reward Model
- `use_llm_judge=True` with `api_host`, `api_port`, and `model_name` configured
Example:
from infinite_rl import CurriculumLearning
# Initialize with truthy task support
cl = CurriculumLearning(
use_llm_judge=True,
llm_judge_kwargs={
"api_host": "localhost",
"api_port": 8000,
"model_name": "Skywork/Skywork-Reward-Llama-3.1-8B"
}
)
task = cl.get_prompt()
if task.task_type == "truthy":
    print(f"System: {task.expected_answer['conversation'][0]['content']}")
    print(f"User: {task.prompt}")

A small encouragement reward that detects explicit chain-of-thought style reasoning placed inside a <think>...</think> block. The ReasoningStepsRewardFunction looks for common reasoning indicators (e.g., "first", "second", "finally", "therefore") and awards a modest bonus when multiple indicators are present.
Behavior:
- If no `<think>` block is found, no bonus is awarded.
- If 1–2 unique indicators appear in the `<think>` block, a small bonus (0.1) is returned.
- If 3+ unique indicators appear, a larger encouragement bonus (0.2) is returned.
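The thresholds above can be sketched as follows. The indicator list and regex here are illustrative assumptions, not the library's actual implementation:

```python
import re

# Illustrative indicator list; the real ReasoningStepsRewardFunction
# may look for a different set of phrases.
INDICATORS = ["first", "second", "third", "finally", "therefore", "thus"]

def reasoning_bonus(text: str) -> float:
    """Return 0.0, 0.1, or 0.2 based on unique indicators in <think>...</think>."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not match:
        return 0.0  # no <think> block, no bonus
    body = match.group(1).lower()
    unique = {word for word in INDICATORS if word in body}
    if len(unique) >= 3:
        return 0.2  # strong evidence of step-by-step reasoning
    if len(unique) >= 1:
        return 0.1  # some reasoning indicators present
    return 0.0

print(reasoning_bonus("<think>First we add. Second we check. Finally done.</think>"))  # 0.2
```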
Example:
from infinite_rl import get_reward_functions
reward_fns = get_reward_functions()
reason_fn = reward_fns["reasoning_steps"]
model_out = "<think>First, we compute the sum. Second, we verify the result. Finally, we present it.</think>"
score = reason_fn.compute_reward(model_out, expected_output=None)
print(f"Reasoning bonus: {score.score}")

Uses a remote LLM-based reward model to evaluate response quality continuously. The LLMJudgeRewardFunction integrates with an sglang server running the Skywork Reward Model (V2-Qwen3-4B) to score responses on a continuous scale.
Requirements:
- Running sglang server with Skywork Reward Model
- Network access to the sglang API endpoint
Features:
- Continuous quality scoring (not binary correct/incorrect)
- Flexible score normalization (raw or tanh-based [0, 1] mapping)
- Configurable API endpoint, timeout, and score thresholds
- Graceful error handling when API unavailable
Example:
from infinite_rl.reward_functions import LLMJudgeRewardFunction
judge = LLMJudgeRewardFunction(
api_host="localhost",
api_port=8000,
normalize=True # Normalize to [0, 1]
)
judge.initialize()
result = judge.compute_reward(task)
print(f"Quality score: {result.score:.4f}")

Setup Instructions:
See docs/LLM_JUDGE_REWARD_FUNCTION.md for detailed setup, configuration, and integration guide.
The CurriculumLearning class provides adaptive task difficulty progression based on model performance using a sliding window success rate mechanism. It starts with math tasks (level 0) and progressively includes programming puzzles with increasing difficulty (levels 1-5) as the model demonstrates competence.
Features:
- Sliding Window Tracking: Tracks the last N episodes (default: 50) of success/failure per task type
- Dual-Criterion Advancement: Advances difficulty only when both conditions are met:
  - Average success rate > 80% (configurable via `success_rate_threshold`)
  - Variance < 0.05 (configurable via `variance_threshold`)
- Dual-Criterion Demotion: Demotes difficulty when both conditions are met:
  - Average success rate < 40% (configurable via `demote_threshold`)
  - Variance < 0.05 (configurable via `variance_threshold`)
- Level Change Cooldown: Prevents rapid level fluctuations by enforcing a minimum of 5 steps (configurable via `level_change_cooldown`) between changes
- Per-Task-Type Windows: Maintains independent sliding windows for math and puzzle tasks
- Simplified Task Management: Clean separation between GRPO batches with automatic diversity weighting
- Inverse Task Weighting: Uses inverse weighting based on task distribution (weight = 1.0 / num_tasks_at_level) to ensure balanced sampling across difficulty levels, giving higher priority to underrepresented levels. The current level receives an additional 2x weight multiplier to focus training on the active difficulty level.
- Multi-Task Support: Works with math problems and programming puzzles
- Automatic Judge Score Computation: `get_judge_scores()` computes missing LLM Judge scores on-demand for accurate statistics during training
This ensures the model has truly mastered a difficulty level rather than just "catching up" with lucky guesses.
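The dual-criterion checks can be sketched as below, assuming the window stores binary success/failure outcomes and variance is the population variance of those outcomes. This is a sketch of the described behavior, not the library's implementation; note that for binary outcomes the variance criterion effectively demands a very consistent window:

```python
from collections import deque

def window_stats(window):
    """Mean success rate and population variance of a sliding window."""
    mean = sum(window) / len(window)
    variance = sum((x - mean) ** 2 for x in window) / len(window)
    return mean, variance

def should_advance(window, success_rate_threshold=0.8, variance_threshold=0.05):
    if len(window) < window.maxlen:
        return False  # window not full: not enough evidence yet
    mean, variance = window_stats(window)
    return mean > success_rate_threshold and variance < variance_threshold

def should_demote(window, demote_threshold=0.4, variance_threshold=0.05):
    if len(window) < window.maxlen:
        return False
    mean, variance = window_stats(window)
    return mean < demote_threshold and variance < variance_threshold

window = deque(maxlen=20)      # smaller than the default window_size=50, for brevity
window.extend([1] * 19 + [0])  # 95% success, consistent
print(should_advance(window))  # True
print(should_demote(window))   # False
```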
Configuration:
from infinite_rl import CurriculumLearning
# Initialize with default settings
cl = CurriculumLearning()
# Or customize thresholds
cl = CurriculumLearning(
timeout=10,
answer_tag="answer",
think_tag="think",
window_size=50, # Track last 50 episodes
success_rate_threshold=0.8, # Require 80% success rate for advancement
demote_threshold=0.4, # Demote if success rate falls below 40%
variance_threshold=0.05, # Require low variance for consistency
level_change_cooldown=5, # Minimum steps between level changes
truthy_learning_rate=0.1, # 10% chance of truthy tasks
)

Example Usage:
from infinite_rl import CurriculumLearning
cl = CurriculumLearning()
# Get a task appropriate for current skill level
task = cl.get_prompt()
print(f"Task type: {task.task_type}, Difficulty level: {task.level}")
print(f"Prompt: {task.prompt}")
# Evaluate model response and update learning state
model_response = "<answer>4</answer>"
reward = cl.compute_reward(
task_id=task.task_id,
model_output=model_response
)
print(f"Reward: {reward}")
# Check learning progress
stats = cl.get_learning_stats()
print(f"Current level: {stats['current_level']}")
print(f"Success rate: {stats['sliding_window_stats']['mean_success_rate']:.1%}")
print(f"Variance: {stats['sliding_window_stats']['mean_variance']:.4f}")

Learning Progression:
- Maintains sliding windows of success/failure for each task type
- Calculates mean success rate and variance for recent performance
- Advances to next difficulty level when success rate and consistency thresholds are both met
- Provides detailed statistics to monitor curriculum progression
The CurriculumLearning class provides a single-call reward API that streamlines integration with GRPO and fine-tuning frameworks. The design consolidates reward computation, batch processing, and curriculum tracking into one unified method.
Core API Method:
def compute_reward(task_id: str, model_output: str) -> float

Behavior:
- Returns: A combined reward score (float between 0.0 and 1.0)
- Incomplete Batches (< `num_generations` completions): Returns the primary score immediately
- Complete Batches (>= `num_generations` completions):
  - Triggers batch LLM Judge evaluation (if enabled and not already computed)
  - Recomputes the combined score with auxiliary metrics blended in
  - Updates curriculum tracking for non-truthy tasks
  - Returns the final combined score
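The incomplete/complete branching can be sketched as follows. `PendingBatch` and the mean-based `finalize` below are illustrative stand-ins, not infinite_rl APIs; the real finalization blends judge and auxiliary scores rather than averaging:

```python
# Hypothetical sketch of the deferred-batch flow described above.
class PendingBatch:
    def __init__(self, num_generations):
        self.num_generations = num_generations
        self.primary_scores = []

    def add(self, primary_score, finalize):
        """Return the primary score until the batch fills, then finalize."""
        self.primary_scores.append(primary_score)
        if len(self.primary_scores) < self.num_generations:
            return primary_score                  # incomplete batch: primary only
        return finalize(self.primary_scores)      # complete: blend in deferred scores

batch = PendingBatch(num_generations=2)
mean = lambda scores: sum(scores) / len(scores)   # toy stand-in for real blending
print(batch.add(1.0, finalize=mean))  # 1.0 (incomplete)
print(batch.add(0.0, finalize=mean))  # 0.5 (batch complete, finalized)
```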
Design Benefits:
- Single Integration Point: No need to track task state externally or call separate methods for rewards vs. metrics
- Deferred LLM Judge: Batch evaluations happen automatically when all generations are ready, improving efficiency
- Automatic Curriculum: Task difficulty progresses transparently without separate state management
- Integrated Generation Tracking: Each task accumulates generations internally with full history preserved
Task Types & Scoring:
- Math Tasks (Level 0): Binary correctness (0.0 or 1.0) + optional LLM Judge auxiliary score
- Puzzle Tasks (Levels 1-5): Binary correctness (0.0 or 1.0) + optional LLM Judge auxiliary score
- Truthy Tasks (All Levels): Primary score = LLM Judge score (continuous 0.0-1.0), deferred to batch completion
- Format Gate: If the format is invalid (missing `<answer>` tag), the primary score is gated to 0.0 regardless of judge quality
Score Composition: When all auxiliary functions are enabled, the combined score is:
combined_score = primary_weight * primary_score + aux_weight * avg(auxiliary_scores) + judge_contribution
Where:
- `primary_score`: Task-specific correctness (binary for math/puzzle, continuous for truthy)
- `auxiliary_scores`: Optional metrics like format validity, repetition penalty, reasoning-steps bonus, and language consistency (truthy tasks only)
- `judge_contribution`: Normalized LLM Judge score (0.0 when `llm_judge_weight=0.0`)
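As a numeric illustration of the composition formula, with assumed example weights (the values below, and the assumptions that `judge_contribution = llm_judge_weight * judge_score` and that `primary_weight` is the remaining mass, are ours, not library defaults):

```python
# Assumed example weights; see the CurriculumLearning kwargs
# (aux_weight, llm_judge_weight) for the real configuration knobs.
aux_weight = 0.1
llm_judge_weight = 0.2
primary_weight = 1.0 - aux_weight - llm_judge_weight  # assumed normalization

primary_score = 1.0                 # math/puzzle: binary correctness
auxiliary_scores = [1.0, 0.9, 0.8]  # e.g. format, repetition, reasoning bonus
judge_score = 0.75                  # normalized LLM Judge score

combined_score = (
    primary_weight * primary_score
    + aux_weight * (sum(auxiliary_scores) / len(auxiliary_scores))
    + llm_judge_weight * judge_score
)
print(round(combined_score, 3))  # 0.7*1.0 + 0.1*0.9 + 0.2*0.75 = 0.94
```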
Configuration:
from infinite_rl import CurriculumLearning
# Basic: Single-call API with all defaults
cl = CurriculumLearning()
# Customize scoring:
cl = CurriculumLearning(
num_generations=4, # GRPO with 4 completions per task
aux_weight=0.1, # Auxiliary metrics contribute 10% of final score
llm_judge_weight=0.2, # LLM Judge score contributes 20% (when available)
use_format=True, # Enable format validation (default)
use_length=True, # Enable response length regularization (default: False)
use_llm_judge=True, # Enable LLM Judge for auxiliary evaluation
)

Usage in GRPO Training:
from infinite_rl import CurriculumLearning
cl = CurriculumLearning(num_generations=4)
# Training loop
for batch_idx in range(num_batches):
task = cl.get_prompt()
for generation in range(4): # 4 completions per GRPO batch
model_output = generate_response(task.prompt)
# Single-call API: computes and returns reward immediately
reward = cl.compute_reward(task.task_id, model_output)
# For batch 1: returns primary_score (incomplete batch)
# For batch 4+: returns combined_score (batch completed, LLM Judge applied)
rewards.append(reward)
# Process rewards for policy optimization
process_grpo_batch(task.task_id, rewards)

Common Patterns:
- Simple Correctness Only (no auxiliary metrics):

  cl = CurriculumLearning(aux_weight=0.0, use_llm_judge=False)
  reward = cl.compute_reward(task_id, model_output)  # Returns: 0.0 (wrong) or 1.0 (correct)

- With Auxiliary Quality Metrics:

  cl = CurriculumLearning(aux_weight=0.15)  # 15% weight for format, length, repetition penalties
  reward = cl.compute_reward(task_id, model_output)  # Returns: 0.0-1.0 blending correctness + auxiliary scores

- With LLM Judge for Truthy Tasks:

  cl = CurriculumLearning(
      use_llm_judge=True,
      llm_judge_kwargs={"api_host": "localhost", "api_port": 8000, ...}
  )
  reward = cl.compute_reward(task_id, model_output)
  # For truthy: returns 0.0-1.0 LLM Judge score (deferred to batch completion)
  # For math/puzzle: returns 0.0 or 1.0 correctness
Removed Methods:
- `get_reward()` - Consolidated into the `compute_reward()` return value
- `_normalize_score()` - No longer needed with the unified API
The system implements a clean Task → Generation hierarchy for GRPO (Group Relative Policy Optimization) batch management with zero redundancy:
Core Components:
- `Task.generations`: List of all generations for a task
- `Task.add_generation()`: Adds a new generation with output, rewards, and primary score
- `Task.latest_generation`: Gets the most recent generation
- `Session.get_batch_data(task_id)`: Retrieves all generation data for analysis
- `Session.get_batch_stats(task_id)`: Provides comprehensive batch statistics
Simplified Task Management:
- Fresh Tasks per Batch: Each GRPO batch gets a fresh task instance from the dataset
- Within-Batch Reuse: `DynamicCurriculumDataset` caches and reuses the same task for all `num_generations` completions within a batch
- Dataset Diversity: Built-in weighting reduces the probability of recently used dataset rows across batches
- Clean Separation: No complex state tracking - each batch is independent
Key Benefits:
- Zero Redundancy: No scattered state across multiple dicts
- Single Source of Truth: Task owns all its generations
- Automatic Cleanup: No manual dict management needed
- Full Queryability: Complete generation history tracking
- Clean Integration: Seamless GRPO/non-GRPO task handling
- Simplified Logic: Removed error-prone active task tracking
Usage Examples:
from infinite_rl import CurriculumLearning
cl = CurriculumLearning()
# GRPO batch automatically handled by DynamicCurriculumDataset
# Each batch gets a fresh task, within-batch reuse is automatic
for generation in range(4): # GRPO with 4 generations
task = cl.get_prompt() # Fresh task per batch, cached within batch
model_output = generate_response(task.prompt)
reward = cl.compute_reward(task.task_id, model_output)
# Task automatically accumulates generations
# Query generation statistics
from infinite_rl.session import Session
session = cl.session # Access internal session
# Get all generation data
batch_data = session.get_batch_data(task.task_id)
print(f"Generated {len(batch_data)} responses")
# Get batch statistics
stats = session.get_batch_stats(task.task_id)
print(f"Average score: {stats['scores']['avg']:.3f}")
print(f"Best generation index: {stats['best_generation']['index']}")

Test the executor with different programming languages:
from infinite_rl import RewardExecutor
executor = RewardExecutor(timeout=5)
# Test JavaScript puzzle execution
stdout, stderr = executor.run_single('{"puzzle": "SumOfDigits", "inputs": {"s": 10}, "code": "function sol(inputs) { return \'19\'; }"}', "javascript")
print(f"JS Result: {stdout}")
# Test Python puzzle execution (via local runner)
# Note: Python puzzles are executed via subprocess in the reward function, not directly through the executor

Install and test in Colab with this notebook:
# Install the package
!pip install git+https://github.com/hon9kon9ize/infinite-rl.git
# Import and test
from infinite_rl import RewardExecutor, get_reward_functions
# Test executor
executor = RewardExecutor(timeout=5)
stdout, stderr = executor.run_single("print(2 + 2)", "python")
print(f"Executor test - Output: {stdout}, Error: {stderr}")
# Test coding reward function
reward_fns = get_reward_functions(timeout=5)
python_fn = reward_fns["python"]
result = python_fn.compute_reward(
model_output="<answer>\n```python\nprint(2 + 2)\n```\n</answer>",
expected_output="4"
)
print(f"Reward Result: {result}")
print(f"Correctness Score: {result.score}")

To run all unit tests, install development dependencies and use pytest:
pip install -r requirements_dev.txt
pytest

infinite_rl/
├── curriculum.py # Curriculum learning with adaptive difficulty
├── executor.py # Multi-language code executor (WASM for JS)
├── generator.py # LLM orchestration and resume logic
├── parser.py # Robust tag extraction and markdown parsing
├── prompts.py # Task-specific system instructions
├── runner.py # Python puzzle execution via subprocess
├── puzzles.py # Puzzle data loading and utilities
├── reward_functions/
│ ├── reward_function.py # Base reward function class
│ ├── math.py # Math task evaluator (symbolic computation)
│ ├── puzzle.py # Puzzle task evaluator (validates code execution)
│ ├── reasoning_steps.py # Chain-of-thought bonus reward
│ ├── length.py # Response length regularizer (cosine decay)
│ ├── repetition.py # N-gram repetition penalty
│ ├── lang_consistency.py # Language consistency detection
│ └── format.py # Format validation
└── runtimes/
├── math.json # Math problem dataset compiled from OpenAI's GSM8K with solutions
├── puzzles.json # Programming puzzle specifications compiled from Microsoft's Python Programming Puzzles
└── puzzle_js.wasm # WASM runtime for JavaScript execution
1. Math Tasks
- Source: `infinite_rl/runtimes/math.json` (compiled from OpenAI's GSM8K)
- Evaluation: Symbolic computation with SymPy
- Reward Function: `MathRewardFunction`
2. Puzzle Tasks
- Source: `infinite_rl/runtimes/puzzles.json` (compiled from Microsoft's Python Programming Puzzles)
- Languages: Python (subprocess execution) and JavaScript (WASM execution)
- Evaluation: Code validation against SAT (satisfaction) functions
- Reward Function: `PuzzleRewardFunction`
- Difficulty: Rated 1-5 per puzzle
All task types are designed for RLHF (Reinforcement Learning from Human Feedback) readiness. Every sample follows a strict three-part structure:
- Prompt: The instruction.
- Answer: The ground-truth reference.
- Response: A detailed step-by-step reasoning trace (Chain-of-Thought) in which the final solution is always wrapped in `<answer>` tags.
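A hypothetical sample illustrating this three-part structure (the field names and content below are illustrative, not taken from the shipped datasets):

```python
import json

# Illustrative sample; the actual dataset schema may differ.
sample = {
    "prompt": "What is 2 + 2?",
    "answer": "4",
    "response": (
        "<think>First, identify the operands: 2 and 2. "
        "Then add them: 2 + 2 = 4.</think>\n"
        "<answer>4</answer>"
    ),
}
print(json.dumps(sample, indent=2))
```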
Handles execution of code in multiple languages with timeout protection and error handling. Located in infinite_rl/executor.py.
Each task type has a specialized reward function that:
- Initializes necessary components (e.g., loading embedding or ML models)
- Executes/evaluates generated content extracted from `<answer>` tags.
- Computes a reward score (0-1) combining format and correctness.
- Returns detailed evaluation metrics.
All reward functions inherit from the `RewardFunction` base class and are accessible via `get_reward_functions()`.
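The pattern can be sketched with stand-in types. The `RewardFunction` and `RewardResult` classes below are illustrative, assuming the base class exposes `initialize()` and `compute_reward()` returning an object with a `score` field; the real interface lives in `infinite_rl/reward_functions/reward_function.py` and may differ:

```python
import re
from dataclasses import dataclass, field

# Stand-in types for illustration only.
@dataclass
class RewardResult:
    score: float
    details: dict = field(default_factory=dict)

class RewardFunction:
    def initialize(self):
        """Load any heavy resources (models, datasets) before first use."""

    def compute_reward(self, model_output, expected_output):
        raise NotImplementedError

class ExactMatchReward(RewardFunction):
    """Toy evaluator following the pattern above: extract the <answer> tag,
    compare against the reference, and return a 0-1 score with details."""

    def compute_reward(self, model_output, expected_output):
        match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
        if not match:
            return RewardResult(0.0, {"error": "missing <answer> tag"})
        extracted = match.group(1).strip()
        return RewardResult(1.0 if extracted == expected_output else 0.0,
                            {"extracted": extracted})

fn = ExactMatchReward()
fn.initialize()
print(fn.compute_reward("<answer>4</answer>", "4").score)  # 1.0
```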
A utility to discourage verbosity when the answer is correct and to discourage laziness (encourage effort) when the answer is incorrect. Instead of a linear penalty, it uses a cosine curve to create a "sweet spot" for response length.
- Purpose: Prevent overly long correct answers and encourage longer attempts for incorrect answers.
- Math (short): For a normalized x in [0,1], the functions used are:
- Correct answers (decay after target): R = (cos(pi * x) + 1) / 2 (maps 1 -> 0 over range)
- Incorrect answers (encourage effort): R = (1 - cos(pi * x)) / 2 (maps 0 -> 1 over range)
- Implementation: See `infinite_rl/reward_functions/length.py`, function `cosine_length_reward(length, min_len=1, max_len=1000, target_len=None, correct=True)`.
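A minimal sketch of the two curves, assuming length is normalized linearly to x in [0, 1] (from `target_len` to `max_len` for correct answers, and from `min_len` to `max_len` for incorrect ones). The library's exact normalization may differ:

```python
import math

def sketch_length_reward(length, min_len=1, max_len=1000, target_len=200, correct=True):
    """Illustrative reimplementation of the cosine length curves above."""
    length = max(min_len, min(length, max_len))  # clamp to bounds
    if correct:
        if length <= target_len:
            return 1.0  # within the sweet spot: full credit
        x = (length - target_len) / (max_len - target_len)
        return (math.cos(math.pi * x) + 1) / 2   # decays 1 -> 0
    x = (length - min_len) / (max_len - min_len)
    return (1 - math.cos(math.pi * x)) / 2       # grows 0 -> 1

print(sketch_length_reward(200))                  # 1.0 (at the target)
print(sketch_length_reward(1000))                 # 0.0 (fully decayed)
print(sketch_length_reward(1000, correct=False))  # 1.0 (maximum effort credit)
```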
Usage example (quick):
from infinite_rl.reward_functions.length import cosine_length_reward
length = 350
len_reward = cosine_length_reward(
length=length,
min_len=1,
max_len=1000,
target_len=200, # for correct answers, lengths <= 200 get full credit
correct=True,
)
# Combine with a base correctness score (example):
final_score = base_correctness_score * len_reward

Interactive examples (print to inspect behavior):
from infinite_rl.reward_functions.length import cosine_length_reward
print("Short correct:", cosine_length_reward(10, min_len=1, max_len=1000, target_len=20, correct=True))
print("Long correct:", cosine_length_reward(500, min_len=1, max_len=1000, target_len=200, correct=True))
print("Short incorrect (encourage longer):", cosine_length_reward(5, min_len=1, max_len=1000, correct=False))
print("Moderate incorrect (some effort):", cosine_length_reward(150, min_len=1, max_len=1000, correct=False))

Notes:
- For `correct=True`, lengths <= `target_len` receive full reward (1.0); beyond that the reward decays smoothly to 0 at `max_len`.
- For `correct=False`, the reward increases smoothly with length to encourage longer reasoning attempts.
- The function clamps `length` to `[min_len, max_len]` and validates bounds.
We penalize repeated n-grams to discourage degenerate or looping responses. The penalty is a normalized negative value computed as:
from infinite_rl.reward_functions.repetition import ngram_repetition_reward
penalty = ngram_repetition_reward(text, n=3, weight=-0.1)

Behavior:
- Uses simple tokenization (lowercasing and punctuation removal) and counts duplicated n-grams.
- Returns a negative penalty (<= 0) proportional to the fraction of duplicated n-grams in the response; 0 if no duplicates.
- `weight` controls the maximum magnitude (default -0.1).
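The computation can be sketched as below, under the assumption that "fraction of duplicated n-grams" means occurrences beyond the first divided by the total n-gram count; the library's tokenizer and counting may differ:

```python
import re
from collections import Counter

def sketch_repetition_penalty(text: str, n: int = 3, weight: float = -0.1) -> float:
    """Penalty proportional to the fraction of duplicated n-grams (always <= 0)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, drop punctuation
    if len(tokens) < n:
        return 0.0  # too short to form any n-gram
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    duplicates = sum(c - 1 for c in Counter(ngrams).values())  # repeats beyond first
    if duplicates == 0:
        return 0.0
    return weight * duplicates / len(ngrams)

print(sketch_repetition_penalty("the cat sat. the cat sat.", n=2))  # about -0.04
print(sketch_repetition_penalty("all words are unique here", n=2))  # 0.0
```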
Quick example (inspect behavior):
from infinite_rl.reward_functions.repetition import ngram_repetition_reward
text = "Hello Hello Hello world world world"
penalty = ngram_repetition_reward(text, n=2, weight=-0.1)
print("Repetition penalty (n=2):", penalty)
# Combine with base score
# final_score = max(0.0, base_correctness_score + penalty)

Notes:
- Combine this penalty with the base correctness score (e.g., final_score = max(0.0, base_correctness + penalty)).
GSM8K Dataset (Math tasks source):
- Repository: https://huggingface.co/datasets/openai/gsm8k
@article{cobbe2021gsm8k,
title={Training Verifiers to Solve Math Word Problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}

Programming Puzzles (Puzzle tasks source):
@inproceedings{
schuster2021programming,
title={Programming Puzzles},
author={Tal Schuster and Ashwin Kalyan and Alex Polozov and Adam Tauman Kalai},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2021},
url={https://arxiv.org/abs/2106.05784}
}

Python Programming Puzzles Repository (Implementation source): We borrowed puzzle implementation code from Microsoft's Python Programming Puzzles repository and implemented a JavaScript version for WASM-based execution.