Project Story: Implementation & Extensions

What Inspired This Project

After reading "Reasoning with Sampling" by Aayush Karan and Yilun Du, we wanted to build practical tooling to experiment with MCMC sampling and compare it against other strategies on Grok. The paper demonstrated powerful results, but we needed an accessible way to explore these techniques on modern LLM APIs.

What We Built

This repo provides an evaluation framework for comparing sampling strategies across multiple benchmarks and multiple models. Built over a weekend by three contributors (Lev, Shiva, Nisarg), it focuses on developer experience and rapid experimentation.

We initially developed against grok-4-1-fast-reasoning and grok-4-1-fast-non-reasoning. However, our sampling methods hit a bottleneck: the Grok API does not return token log probabilities. To work around this, we hosted similarly sized open-source models (qwen3-8b, qwen2.5-7b) on RunPod. This allowed us to accurately benchmark the following:

  • grok-4-1-fast-non-reasoning using default greedy sampling
  • qwen3-8b using default greedy sampling
  • qwen3-8b using Parallel MCMC (Markov Chain Monte Carlo) sampling
  • qwen3-8b using Beam Search

We used the following benchmarks:

  • GSM8k
  • HumanEval
  • SWE-bench Lite

1. Interactive Streamlit UI (chat-ui/)

An interactive web app for side-by-side comparison of sampling methods:

  • Parallel execution of Greedy, MCMC, and Beam Search
  • Real-time token usage and cost tracking
  • Configurable hyperparameters per method
  • MCMC acceptance rate visualization

2. Beam Search Implementation

Added beam search as an alternative to MCMC:

  • Uses OpenAI's n parameter to generate multiple completions
  • Selects best candidate based on average log probability
  • Configurable beam width and temperature
  • Chunked generation to handle long sequences efficiently
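The candidate-selection step can be sketched in a few lines. This is an illustrative simplification (the function name and data layout are assumptions, not the repo's exact API): each candidate carries the per-token log probabilities returned by the API, and the one with the highest average wins.

```python
def select_best_candidate(candidates):
    """Pick the completion with the highest average token log-probability.

    `candidates` is a list of (text, token_logprobs) pairs, where
    token_logprobs is the per-token log-probability list from the API.
    """
    def avg_logprob(item):
        _, logprobs = item
        return sum(logprobs) / max(len(logprobs), 1)

    return max(candidates, key=avg_logprob)[0]

# Two hypothetical candidates: B has the higher average log-prob
cands = [
    ("answer A", [-0.1, -0.2, -0.3]),
    ("answer B", [-0.05, -0.05, -0.1]),
]
print(select_best_candidate(cands))  # answer B
```

Averaging (rather than summing) avoids a length bias that would otherwise favor shorter completions.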

3. Hydra Configuration System

Clean YAML-based configuration for rapid experimentation:

model:
  name: grok-4-1-fast-reasoning
  max_tokens: 3072

mcmc:
  alpha: 1.67
  steps: 10
  block_size: 192

Override via CLI: python src/run_benchmark.py mcmc.alpha=2.0
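Under the hood, a dotted override like `mcmc.alpha=2.0` is just a path into the nested config. A simplified illustration of the idea (real Hydra also handles composition, interpolation, and stricter type coercion):

```python
def apply_override(cfg: dict, override: str) -> dict:
    """Apply a Hydra-style dotted override like 'mcmc.alpha=2.0' to a
    nested dict. Simplified sketch of what Hydra does, not Hydra itself."""
    key_path, _, raw_value = override.partition("=")
    keys = key_path.split(".")
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Best-effort numeric coercion; fall back to the raw string
    try:
        value = float(raw_value) if "." in raw_value else int(raw_value)
    except ValueError:
        value = raw_value
    node[keys[-1]] = value
    return cfg

cfg = {"mcmc": {"alpha": 1.67, "steps": 10}}
apply_override(cfg, "mcmc.alpha=2.0")
print(cfg["mcmc"]["alpha"])  # 2.0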

4. Modal Evaluation for SWE-bench

Cloud-scale evaluation pipeline:

  • Parallel SWE-bench evaluation across 16+ workers
  • Automatic repo cloning, patch application, test execution
  • Results downloadable to local results/ directory
  • Cost-optimized with configurable worker counts

Usage:

modal run scripts/modal_evaluate.py --predictions-dir predictions/
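The fan-out pattern can be sketched locally with a thread pool; on Modal, each worker runs in its own remote container instead. The function names and the result schema here are illustrative stand-ins, not the repo's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_instance(instance_id: str) -> dict:
    """Stand-in for one SWE-bench evaluation (clone repo, apply patch,
    run tests). Here it just reports success for illustration."""
    return {"instance_id": instance_id, "resolved": True}

def run_parallel(instance_ids, max_workers=16):
    # Modal replaces this local executor with remote containers
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_instance, instance_ids))

results = run_parallel([f"task-{i}" for i in range(4)], max_workers=2)
print(len(results))  # 4
```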

5. Benchmark Integrations

Modular benchmark system supporting:

  • HumanEval: Code generation
  • MATH: Mathematical reasoning
  • GPQA: Graduate-level science questions
  • SWE-bench: Real-world software engineering
  • MMLU: Multitask understanding
  • GSM8K: Grade school math
  • ARC-AGI: Abstract reasoning

Each benchmark extends the Benchmark base class with custom prompting and evaluation.
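The contract can be sketched as an abstract base class; the method names below are illustrative, not the repo's exact interface:

```python
import re
from abc import ABC, abstractmethod

class Benchmark(ABC):
    """Sketch of the base-class contract: each benchmark supplies its own
    prompting and scoring logic."""

    @abstractmethod
    def build_prompt(self, example: dict) -> str: ...

    @abstractmethod
    def score(self, example: dict, completion: str) -> bool: ...

class GSM8K(Benchmark):
    def build_prompt(self, example):
        return f"Solve step by step:\n{example['question']}\nFinal answer:"

    def score(self, example, completion):
        # GSM8K answers are numeric; compare the last number in the output
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        return bool(numbers) and numbers[-1] == example["answer"]

bench = GSM8K()
print(bench.score({"question": "2+2?", "answer": "4"}, "2+2 = 4"))  # True
```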

6. Async API Client

Efficient AsyncOpenAIClient with:

  • Rate limiting via semaphores
  • Exponential backoff for retries
  • Token usage tracking and cost calculation
  • Support for Grok, GPT-4, and Claude models
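The core pattern — a semaphore for rate limiting plus exponential backoff — can be sketched as follows. This is a simplified illustration, not the repo's actual AsyncOpenAIClient; the `send` callable stands in for the real API request:

```python
import asyncio
import random

class AsyncClientSketch:
    """Simplified rate-limited client with retries and usage tracking."""

    def __init__(self, max_concurrent=8, max_retries=3):
        self.semaphore = asyncio.Semaphore(max_concurrent)  # rate limiting
        self.max_retries = max_retries
        self.total_tokens = 0

    async def complete(self, prompt, send):
        async with self.semaphore:
            for attempt in range(self.max_retries):
                try:
                    response = await send(prompt)
                    self.total_tokens += response["usage"]
                    return response["text"]
                except Exception:
                    if attempt == self.max_retries - 1:
                        raise
                    # exponential backoff with jitter
                    await asyncio.sleep(2 ** attempt * 0.01 + random.random() * 0.01)

async def fake_send(prompt):
    # Stand-in for the real API call
    return {"text": prompt.upper(), "usage": len(prompt)}

async def main():
    client = AsyncClientSketch()
    texts = await asyncio.gather(*(client.complete(p, fake_send) for p in ["a", "b"]))
    return texts, client.total_tokens

print(asyncio.run(main()))
```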

Technical Challenges

MCMC Acceptance Ratio Bug

The initial implementation computed the proposal density incorrectly. We fixed it by tracking normalized log-probabilities (for the proposal) and unnormalized log-probabilities (for the target) separately.
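Concretely, for a power-distribution target proportional to p(x)^alpha, the Metropolis-Hastings acceptance probability mixes the two quantities in a specific way. A minimal sketch (variable names are ours; the repo's code differs):

```python
import math

def accept_prob(alpha, logp_new, logp_old,
                logq_old_given_new, logq_new_given_old):
    """MH acceptance for the power target proportional to p(x)^alpha.

    logp_* are UNNORMALIZED model log-probs of the sequences (target);
    logq_* are NORMALIZED proposal log-probs (sampling distribution).
    Mixing the two roles was the original bug.
    """
    log_ratio = (alpha * (logp_new - logp_old)
                 + (logq_old_given_new - logq_new_given_old))
    return min(1.0, math.exp(min(log_ratio, 0.0)))

# Symmetric proposal + higher-probability new sample -> always accept
print(accept_prob(2.0, logp_new=-5.0, logp_old=-6.0,
                  logq_old_given_new=-1.0, logq_new_given_old=-1.0))  # 1.0
```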

SWE-bench Patch Format

Models generated invalid diffs with placeholder line numbers. Required two-stage prompting: fetch file with line numbers, then generate precise diff.

Beam Search Scalability

Naive beam search with width 10 and length 2000+ tokens was prohibitively expensive. Implemented chunked generation: run beam search over 192-token blocks, reducing API calls by 10×.
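The chunked loop looks roughly like this. The `generate` and `score` callables are stand-ins for the API call and the average-log-prob scorer, and this greedy top-1-per-block variant is a simplification of what the repo does:

```python
def chunked_beam_search(generate, score, prompt,
                        beam_width=4, block_tokens=192, num_blocks=3):
    """Block-wise beam search sketch: per block, sample `beam_width`
    continuations, keep the best-scoring one, then extend it."""
    text = prompt
    for _ in range(num_blocks):
        candidates = generate(text, n=beam_width, max_tokens=block_tokens)
        text = max(candidates, key=score)
    return text

# Toy generator: candidates append a digit; score = last digit
def toy_generate(prefix, n, max_tokens):
    return [prefix + str(i) for i in range(n)]

print(chunked_beam_search(toy_generate, lambda t: int(t[-1]), "p"))  # p333
```

Keeping only the top candidate per block means API cost grows with the number of blocks rather than with total length times beam width.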

Cost Optimization

Discovered grok-4-1-fast-reasoning is 10× cheaper than grok-2-1212 with comparable quality. Built cost tracking into all sampling methods to monitor spend.

Key Learnings

Improving Reasoning without Fine-Tuning

We noticed a few important things about both the Monte Carlo and beam search strategies for discovering chains with high probability weight under the power distribution (which Karan and Du equate with success on reasoning tasks). First, we had to keep the sampling temperature relatively low (around 0.1-0.25); otherwise our methods failed to produce meaningful results. Second, applying these methods to models that had already been supervised fine-tuned and RL-trained was not helpful: those models got caught in a "reasoning loop" (constantly revising their internal thoughts), which overflowed the context window and failed the evaluations. We also found that the original paper's traditional MCMC was quite slow to generate an output, largely due to the sequential nature of Metropolis-Hastings sampling. Still, our initial results suggest that parallelization (beam search and parallel variants of MCMC) can improve reasoning capacity while simultaneously improving latency.

Iteration Cycle Learnings

  1. Parallel execution matters: Running all sampling methods simultaneously (via asyncio) provides instant comparison feedback
  2. Hydra is powerful: Override any config from CLI without touching code
  3. Modal for heavy evaluation: SWE-bench evaluation (repos, patches, tests) is too expensive to run locally
  4. Interactive tools help: The Streamlit UI made it easy to develop intuition for MCMC hyperparameters

What's Next

  • Adaptive MCMC step sizes based on acceptance rates
  • Multi-model composition (e.g., reasoning model × code model)
  • Integration with verifiers for constrained generation
  • More benchmarks (Codeforces, LiveCodeBench, etc.)

Credit: Original MCMC sampling research by Aayush Karan and Yilun Du (Harvard) - Paper
This repo: Implementation framework by Lev Stambler, Shiva Peri, Nisarg Desai

Built With

  • beam-search
  • huggingface
  • mcmc
  • modal
  • python
  • runpod
  • streamlit
  • transformers