Project Story: Implementation & Extensions

What Inspired This Project

After reading "Reasoning with Sampling" by Aayush Karan and Yilun Du, we wanted to build practical tooling to experiment with MCMC sampling and compare it against other strategies on Grok. The paper demonstrated powerful results, but we needed an accessible way to explore these techniques on modern LLM APIs.

What We Built

This repo provides an evaluation framework for comparing sampling strategies across multiple benchmarks and multiple models. Built over a weekend by three contributors (Lev, Shiva, Nisarg), it focuses on developer experience and rapid experimentation.

We initially developed against grok-4-1-fast-reasoning and grok-4-1-fast-non-reasoning. However, our sampling methods hit a bottleneck: the Grok API does not return token log probabilities. To work around this, we hosted similarly sized open-source models (qwen3-8b, qwen2.5-7b) on RunPod. This allowed us to accurately benchmark the following:

  • grok-4-1-fast-non-reasoning using default greedy sampling
  • qwen3-8b using default greedy sampling
  • qwen3-8b using Parallel MCMC (Markov Chain Monte Carlo) sampling
  • qwen3-8b using Beam Search

We used the following benchmarks:

  • GSM8k
  • HumanEval
  • SWE-bench Lite

1. Interactive Streamlit UI (chat-ui/)

An interactive web app for side-by-side comparison of sampling methods:

  • Parallel execution of Greedy, MCMC, and Beam Search
  • Real-time token usage and cost tracking
  • Configurable hyperparameters per method
  • MCMC acceptance rate visualization

2. Beam Search Implementation

Added beam search as an alternative to MCMC:

  • Uses OpenAI's n parameter to generate multiple completions
  • Selects best candidate based on average log probability
  • Configurable beam width and temperature
  • Chunked generation to handle long sequences efficiently
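The candidate-selection step can be sketched in a few lines. This is an illustrative simplification (the function name and data layout are assumptions, not the repo's exact API): each candidate carries the per-token log probabilities returned by the API, and the one with the highest average wins.

```python
def select_best_candidate(candidates):
    """Pick the completion with the highest average token log-probability.

    `candidates` is a list of (text, token_logprobs) pairs, where
    token_logprobs is the per-token log-probability list from the API.
    """
    def avg_logprob(item):
        _, logprobs = item
        return sum(logprobs) / max(len(logprobs), 1)

    return max(candidates, key=avg_logprob)[0]

# Two hypothetical candidates: B has the higher average log-prob
cands = [
    ("answer A", [-0.1, -0.2, -0.3]),
    ("answer B", [-0.05, -0.05, -0.1]),
]
print(select_best_candidate(cands))  # answer B
```

Averaging (rather than summing) avoids a length bias that would otherwise favor shorter completions.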

3. Hydra Configuration System

Clean YAML-based configuration for rapid experimentation:

model:
  name: grok-4-1-fast-reasoning
  max_tokens: 3072

mcmc:
  alpha: 1.67
  steps: 10
  block_size: 192

Override via CLI: python src/run_benchmark.py mcmc.alpha=2.0
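Under the hood, a dotted override like `mcmc.alpha=2.0` is just a path into the nested config. A simplified illustration of the idea (real Hydra also handles composition, interpolation, and stricter type coercion):

```python
def apply_override(cfg: dict, override: str) -> dict:
    """Apply a Hydra-style dotted override like 'mcmc.alpha=2.0' to a
    nested dict. Simplified sketch of what Hydra does, not Hydra itself."""
    key_path, _, raw_value = override.partition("=")
    keys = key_path.split(".")
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Best-effort numeric coercion; fall back to the raw string
    try:
        value = float(raw_value) if "." in raw_value else int(raw_value)
    except ValueError:
        value = raw_value
    node[keys[-1]] = value
    return cfg

cfg = {"mcmc": {"alpha": 1.67, "steps": 10}}
apply_override(cfg, "mcmc.alpha=2.0")
print(cfg["mcmc"]["alpha"])  # 2.0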

4. Modal Evaluation for SWE-bench

Cloud-scale evaluation pipeline:

  • Parallel SWE-bench evaluation across 16+ workers
  • Automatic repo cloning, patch application, test execution
  • Results downloadable to local results/ directory
  • Cost-optimized with configurable worker counts

Usage:

modal run scripts/modal_evaluate.py --predictions-dir predictions/
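The fan-out pattern can be sketched locally with a thread pool; on Modal, each worker runs in its own remote container instead. The function names and the result schema here are illustrative stand-ins, not the repo's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_instance(instance_id: str) -> dict:
    """Stand-in for one SWE-bench evaluation (clone repo, apply patch,
    run tests). Here it just reports success for illustration."""
    return {"instance_id": instance_id, "resolved": True}

def run_parallel(instance_ids, max_workers=16):
    # Modal replaces this local executor with remote containers
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_instance, instance_ids))

results = run_parallel([f"task-{i}" for i in range(4)], max_workers=2)
print(len(results))  # 4
```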

5. Benchmark Integrations

Modular benchmark system supporting:

  • HumanEval: Code generation
  • MATH: Mathematical reasoning
  • GPQA: Graduate-level science questions
  • SWE-bench: Real-world software engineering
  • MMLU: Multitask understanding
  • GSM8K: Grade school math
  • ARC-AGI: Abstract reasoning

Each benchmark extends the Benchmark base class with custom prompting and evaluation.
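The contract can be sketched as an abstract base class; the method names below are illustrative, not the repo's exact interface:

```python
import re
from abc import ABC, abstractmethod

class Benchmark(ABC):
    """Sketch of the base-class contract: each benchmark supplies its own
    prompting and scoring logic."""

    @abstractmethod
    def build_prompt(self, example: dict) -> str: ...

    @abstractmethod
    def score(self, example: dict, completion: str) -> bool: ...

class GSM8K(Benchmark):
    def build_prompt(self, example):
        return f"Solve step by step:\n{example['question']}\nFinal answer:"

    def score(self, example, completion):
        # GSM8K answers are numeric; compare the last number in the output
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        return bool(numbers) and numbers[-1] == example["answer"]

bench = GSM8K()
print(bench.score({"question": "2+2?", "answer": "4"}, "2+2 = 4"))  # True
```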

6. Async API Client

Efficient AsyncOpenAIClient with:

  • Rate limiting via semaphores
  • Exponential backoff for retries
  • Token usage tracking and cost calculation
  • Support for Grok, GPT-4, and Claude models
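The core pattern — a semaphore for rate limiting plus exponential backoff — can be sketched as follows. This is a simplified illustration, not the repo's actual AsyncOpenAIClient; the `send` callable stands in for the real API request:

```python
import asyncio
import random

class AsyncClientSketch:
    """Simplified rate-limited client with retries and usage tracking."""

    def __init__(self, max_concurrent=8, max_retries=3):
        self.semaphore = asyncio.Semaphore(max_concurrent)  # rate limiting
        self.max_retries = max_retries
        self.total_tokens = 0

    async def complete(self, prompt, send):
        async with self.semaphore:
            for attempt in range(self.max_retries):
                try:
                    response = await send(prompt)
                    self.total_tokens += response["usage"]
                    return response["text"]
                except Exception:
                    if attempt == self.max_retries - 1:
                        raise
                    # exponential backoff with jitter
                    await asyncio.sleep(2 ** attempt * 0.01 + random.random() * 0.01)

async def fake_send(prompt):
    # Stand-in for the real API call
    return {"text": prompt.upper(), "usage": len(prompt)}

async def main():
    client = AsyncClientSketch()
    texts = await asyncio.gather(*(client.complete(p, fake_send) for p in ["a", "b"]))
    return texts, client.total_tokens

print(asyncio.run(main()))
```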

Technical Challenges

MCMC Acceptance Ratio Bug

The initial implementation computed the proposal density incorrectly. We fixed it by tracking normalized log-probabilities (for the proposal) and unnormalized log-probabilities (for the target) separately.
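Concretely, for a power-distribution target proportional to p(x)^alpha, the Metropolis-Hastings acceptance probability mixes the two quantities in a specific way. A minimal sketch (variable names are ours; the repo's code differs):

```python
import math

def accept_prob(alpha, logp_new, logp_old,
                logq_old_given_new, logq_new_given_old):
    """MH acceptance for the power target proportional to p(x)^alpha.

    logp_* are UNNORMALIZED model log-probs of the sequences (target);
    logq_* are NORMALIZED proposal log-probs (sampling distribution).
    Mixing the two roles was the original bug.
    """
    log_ratio = (alpha * (logp_new - logp_old)
                 + (logq_old_given_new - logq_new_given_old))
    return min(1.0, math.exp(min(log_ratio, 0.0)))

# Symmetric proposal + higher-probability new sample -> always accept
print(accept_prob(2.0, logp_new=-5.0, logp_old=-6.0,
                  logq_old_given_new=-1.0, logq_new_given_old=-1.0))  # 1.0
```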

SWE-bench Patch Format

Models generated invalid diffs with placeholder line numbers. Required two-stage prompting: fetch file with line numbers, then generate precise diff.

Beam Search Scalability

Naive beam search with width 10 and length 2000+ tokens was prohibitively expensive. Implemented chunked generation: run beam search over 192-token blocks, reducing API calls by 10×.
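The chunked loop looks roughly like this. The `generate` and `score` callables are stand-ins for the API call and the average-log-prob scorer, and this greedy top-1-per-block variant is a simplification of what the repo does:

```python
def chunked_beam_search(generate, score, prompt,
                        beam_width=4, block_tokens=192, num_blocks=3):
    """Block-wise beam search sketch: per block, sample `beam_width`
    continuations, keep the best-scoring one, then extend it."""
    text = prompt
    for _ in range(num_blocks):
        candidates = generate(text, n=beam_width, max_tokens=block_tokens)
        text = max(candidates, key=score)
    return text

# Toy generator: candidates append a digit; score = last digit
def toy_generate(prefix, n, max_tokens):
    return [prefix + str(i) for i in range(n)]

print(chunked_beam_search(toy_generate, lambda t: int(t[-1]), "p"))  # p333
```

Keeping only the top candidate per block means API cost grows with the number of blocks rather than with total length times beam width.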

Cost Optimization

Discovered grok-4-1-fast-reasoning is 10× cheaper than grok-2-1212 with comparable quality. Built cost tracking into all sampling methods to monitor spend.

Key Learnings

Improving Reasoning without Fine-Tuning

We noticed a few important things about both the Monte Carlo and beam search strategies for discovering chains with high probability weight under the power distribution (which Karan and Du equate with success on reasoning tasks). First, we had to keep the sampling temperature relatively low (around 0.1-0.25); otherwise our methods failed to produce meaningful results. Second, applying these methods to models that had already been supervised fine-tuned and RL-trained was not helpful: those models got caught in a "reasoning loop" (constantly revising their internal thoughts), which overflowed the context window and failed the evaluations. We also found that the original paper's traditional MCMC was quite slow to generate an output, largely due to the sequential nature of Metropolis-Hastings sampling. Still, our initial results suggest that parallelization (beam search and parallel variants of MCMC) can improve reasoning capacity while simultaneously improving latency.

Iteration Cycle Learnings

  1. Parallel execution matters: Running all sampling methods simultaneously (via asyncio) provides instant comparison feedback
  2. Hydra is powerful: Override any config from CLI without touching code
  3. Modal for heavy evaluation: SWE-bench evaluation (repos, patches, tests) is too expensive to run locally
  4. Interactive tools help: The Streamlit UI made it easy to develop intuition for MCMC hyperparameters

What's Next

  • Adaptive MCMC step sizes based on acceptance rates
  • Multi-model composition (e.g., reasoning model × code model)
  • Integration with verifiers for constrained generation
  • More benchmarks (Codeforces, LiveCodeBench, etc.)

Credit: Original MCMC sampling research by Aayush Karan and Yilun Du (Harvard) - Paper
This repo: Implementation framework by Lev Stambler, Shiva Peri, Nisarg Desai

Built With

  • beam-search
  • huggingface
  • mcmc
  • modal
  • python
  • runpod
  • streamlit
  • transformers