Project Story: Implementation & Extensions
What Inspired This Project
After reading "Reasoning with Sampling" by Aayush Karan and Yilun Du, we wanted to build practical tooling to experiment with MCMC sampling and compare it against other strategies on Grok. The paper demonstrated powerful results, but we needed an accessible way to explore these techniques on modern LLM APIs.
What We Built
This repo provides an evaluation framework for comparing sampling strategies across multiple benchmarks and multiple models. Built over a weekend by three contributors (Lev, Shiva, Nisarg), it focuses on developer experience and rapid experimentation.
We initially developed using grok-4-1-fast-reasoning and grok-4-1-fast-non-reasoning. However, we hit a bottleneck: the Grok API does not return token log probabilities, which our sampling methods require. To work around this, we hosted an instance of a similarly sized open-source model (qwen3-8b, qwen2.5-7b) on RunPod. This allowed us to accurately benchmark the following:
- grok-4-1-fast-non-reasoning using default greedy sampling
- qwen3-8b using default greedy sampling
- qwen3-8b using Parallel MCMC (Markov Chain Monte Carlo) sampling
- qwen3-8b using Beam Search
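Once the model is served behind an OpenAI-compatible endpoint (vLLM on RunPod exposes the `logprobs` field that the Grok API omits), recovering sequence scores is straightforward. A minimal sketch, working on the chat-completions JSON shape rather than making a live call:

```python
# Sketch: scoring a completion from the per-token log-probs an
# OpenAI-compatible endpoint returns. The response fragment below is
# hand-made for illustration; a real call would set logprobs=True.

def sequence_logprob(choice: dict) -> float:
    """Sum per-token log-probs from one chat-completion choice."""
    return sum(tok["logprob"] for tok in choice["logprobs"]["content"])

def avg_logprob(choice: dict) -> float:
    """Length-normalized score, useful for comparing candidates."""
    toks = choice["logprobs"]["content"]
    return sequence_logprob(choice) / max(len(toks), 1)

choice = {"logprobs": {"content": [{"token": "4", "logprob": -0.1},
                                   {"token": ".", "logprob": -0.3}]}}
print(round(avg_logprob(choice), 2))  # -0.2
```

Length normalization matters here: without it, longer completions are systematically penalized when candidates of different lengths are compared.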
We used the following Benchmarks:
- GSM8k
- HumanEval
- SWEBenchLite
1. Interactive Streamlit UI (chat-ui/)
An interactive web app for side-by-side comparison of sampling methods:
- Parallel execution of Greedy, MCMC, and Beam Search
- Real-time token usage and cost tracking
- Configurable hyperparameters per method
- MCMC acceptance rate visualization
2. Beam Search Implementation
Added beam search as an alternative to MCMC:
- Uses OpenAI's `n` parameter to generate multiple completions
- Selects the best candidate based on average log probability
- Configurable beam width and temperature
- Chunked generation to handle long sequences efficiently
3. Hydra Configuration System
Clean YAML-based configuration for rapid experimentation:
```yaml
model:
  name: grok-4-1-fast-reasoning
  max_tokens: 3072

mcmc:
  alpha: 1.67
  steps: 10
  block_size: 192
```
Override via CLI: `python src/run_benchmark.py mcmc.alpha=2.0`
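What a dotted override like `mcmc.alpha=2.0` does under the hood is merge a key path into the loaded YAML config. A toy version of that behavior (not Hydra's actual implementation) on a plain nested dict:

```python
# Illustration of Hydra-style dotted CLI overrides applied to a nested
# config dict. Real Hydra does much more (composition, interpolation),
# but the core override semantics look like this.

def apply_override(cfg: dict, override: str) -> dict:
    dotted, raw = override.split("=", 1)
    keys = dotted.split(".")
    node = cfg
    for k in keys[:-1]:
        node = node.setdefault(k, {})  # walk/create the nested path
    try:
        value = float(raw) if "." in raw else int(raw)
    except ValueError:
        value = raw  # leave non-numeric values as strings
    node[keys[-1]] = value
    return cfg

cfg = {"mcmc": {"alpha": 1.67, "steps": 10}}
apply_override(cfg, "mcmc.alpha=2.0")
print(cfg["mcmc"]["alpha"])  # 2.0
```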
4. Modal Evaluation for SWE-bench
Cloud-scale evaluation pipeline:
- Parallel SWE-bench evaluation across 16+ workers
- Automatic repo cloning, patch application, test execution
- Results downloadable to local `results/` directory
- Cost-optimized with configurable worker counts
Usage:
```
modal run scripts/modal_evaluate.py --predictions-dir predictions/
```
5. Benchmark Integrations
Modular benchmark system supporting:
- HumanEval: Code generation
- MATH: Mathematical reasoning
- GPQA: Graduate-level science questions
- SWE-bench: Real-world software engineering
- MMLU: Multitask understanding
- GSM8K: Grade school math
- ARC-AGI: Abstract reasoning
Each benchmark extends the `Benchmark` base class with custom prompting and evaluation.
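A hypothetical shape for that base class (method names are illustrative, not the repo's actual API): each benchmark supplies its own prompt formatting and answer checking, and the evaluation loop stays generic.

```python
# Sketch of a modular benchmark interface: subclasses define prompting
# and correctness checking; evaluate() works for any of them.
from abc import ABC, abstractmethod

class Benchmark(ABC):
    @abstractmethod
    def format_prompt(self, example: dict) -> str: ...

    @abstractmethod
    def is_correct(self, example: dict, completion: str) -> bool: ...

    def evaluate(self, examples, generate) -> float:
        """Accuracy of `generate` (prompt -> completion) over examples."""
        hits = sum(self.is_correct(ex, generate(self.format_prompt(ex)))
                   for ex in examples)
        return hits / max(len(examples), 1)

class GSM8K(Benchmark):
    def format_prompt(self, example):
        return example["question"] + "\nAnswer:"

    def is_correct(self, example, completion):
        return example["answer"] in completion

bench = GSM8K()
acc = bench.evaluate([{"question": "2+2=?", "answer": "4"}], lambda p: "4")
print(acc)  # 1.0
```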
6. Async API Client
Efficient AsyncOpenAIClient with:
- Rate limiting via semaphores
- Exponential backoff for retries
- Token usage tracking and cost calculation
- Support for Grok, GPT-4, and Claude models
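The rate-limiting and retry pattern can be sketched as below (names are illustrative, not the client's actual API): a semaphore caps in-flight requests, and failures back off exponentially before retrying.

```python
# Sketch of semaphore-based rate limiting plus exponential backoff,
# the core of an async API client.
import asyncio

async def call_with_retries(fn, *, sem, retries=3, base_delay=0.01):
    async with sem:  # cap concurrent in-flight requests
        for attempt in range(retries):
            try:
                return await fn()
            except Exception:
                if attempt == retries - 1:
                    raise
                # exponential backoff: base_delay, 2x, 4x, ...
                await asyncio.sleep(base_delay * 2 ** attempt)

async def main():
    sem = asyncio.Semaphore(8)  # at most 8 requests in flight
    state = {"calls": 0}

    async def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("rate limited")
        return "ok"

    return await call_with_retries(flaky, sem=sem)

print(asyncio.run(main()))  # ok
```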
Technical Challenges
MCMC Acceptance Ratio Bug
Initial implementation had incorrect proposal density calculation. Fixed by tracking both normalized (for proposal) and unnormalized (for target) log-probabilities separately.
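The corrected computation can be sketched like this (variable names are illustrative): the unnormalized target term uses `alpha` times the base-model log-likelihood, while the proposal correction uses the normalized log-probs the sampler actually drew from, and the two must never be conflated.

```python
# Metropolis-Hastings acceptance for the power distribution p(x)^alpha:
# log A = alpha*(lp(x') - lp(x)) + log q(x|x') - log q(x'|x)
import math
import random

def log_accept_ratio(alpha, target_lp_new, target_lp_old,
                     prop_lp_old_given_new, prop_lp_new_given_old):
    """Unnormalized target term plus normalized proposal correction."""
    return (alpha * (target_lp_new - target_lp_old)
            + prop_lp_old_given_new - prop_lp_new_given_old)

def accept(alpha, *args, rng=random):
    """Accept the proposal with probability min(1, A)."""
    return math.log(rng.random() + 1e-12) < log_accept_ratio(alpha, *args)

# With a symmetric proposal the correction cancels, leaving the target term:
print(round(log_accept_ratio(1.67, -10.0, -12.0, -5.0, -5.0), 2))  # 3.34
```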
SWE-bench Patch Format
Models generated invalid diffs with placeholder line numbers. Required two-stage prompting: fetch file with line numbers, then generate precise diff.
Beam Search Scalability
Naive beam search with width 10 and length 2000+ tokens was prohibitively expensive. Implemented chunked generation: run beam search over 192-token blocks, reducing API calls by 10×.
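The chunked loop can be sketched as follows, with `sample_block` standing in for the API call (the real implementation requests `n` continuations with log-probs per block):

```python
# Chunked beam search sketch: instead of ranking full-length completions,
# extend the running prefix one block at a time and commit to the best
# block before continuing. `sample_block` is a stub for the API call and
# returns (text, avg_logprob) pairs for n continuations of the prefix.

def chunked_beam(prompt, sample_block, width=10, block_tokens=192,
                 max_blocks=3):
    prefix = prompt
    for _ in range(max_blocks):
        blocks = sample_block(prefix, n=width, max_tokens=block_tokens)
        if not blocks:
            break
        text, _ = max(blocks, key=lambda b: b[1])  # commit to best block
        prefix += text
    return prefix

def stub(prefix, n, max_tokens):
    return [("a", -0.5), ("b", -0.1)]

print(chunked_beam("X", stub))  # Xbbb
```

Committing greedily per block trades global optimality for cost: one `n`-way call per block instead of `n` full-length generations.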
Cost Optimization
Discovered grok-4-1-fast-reasoning is 10× cheaper than grok-2-1212 with comparable quality. Built cost tracking into all sampling methods to monitor spend.
Key Learnings
Thinking without Fine-Tuning Improvements
We noticed a few important things about both the Monte Carlo and beam search strategies for discovering chains with high probability weight under the power distribution (which Karan and Du equate with success on reasoning tasks). First, we had to keep the sampling temperature relatively low (around 0.1-0.25); otherwise our methods failed to produce meaningful results. Second, applying these methods to models that had already been supervised fine-tuned and RL-trained was not helpful: those models got caught in a "reasoning loop" (constantly revising their internal thoughts), which overflowed the context and failed the evaluations.

We also found that the original paper's traditional MCMC was quite slow to generate an output, largely due to the sequential nature of Metropolis-Hastings sampling. Still, our initial results suggest that parallelized methods (such as beam search and parallel variants of MCMC) can improve reasoning capacity while simultaneously improving latency.
Iteration Cycle Learnings
- Parallel execution matters: running all sampling methods simultaneously (via `asyncio`) provides instant comparison feedback
- Hydra is powerful: override any config from the CLI without touching code
- Modal for heavy evaluation: SWE-bench evaluation (repos, patches, tests) is too expensive to run locally
- Interactive tools help: The Streamlit UI made it easy to develop intuition for MCMC hyperparameters
What's Next
- Adaptive MCMC step sizes based on acceptance rates
- Multi-model composition (e.g., reasoning model × code model)
- Integration with verifiers for constrained generation
- More benchmarks (Codeforces, LiveCodeBench, etc.)
Credit: Original MCMC sampling research by Aayush Karan and Yilun Du (Harvard) - Paper
This repo: Implementation framework by Lev Stambler, Shiva Peri, Nisarg Desai
Built With
- beam-search
- huggingface
- mcmc
- modal
- python
- runpod
- streamlit
- transformers