Inspiration
Every LLM-powered application stuffs its context window with system prompts, RAG documents, conversation history, tool definitions, and few-shot examples, but nobody knows which parts actually matter. Teams guess which sections to trim, have no way to measure quality impact, and end up paying for tokens that contribute nothing. We were inspired by ablation studies from machine learning research, the practice of systematically removing components to measure their contribution. We asked: what if we could ablate LLM context the same way? ContextForge was born from the frustration of manual prompt trimming and the realization that context optimization deserves the same rigor as model optimization.
What it does
ContextForge is an agentic context ablation testing tool that systematically optimizes LLM context payloads. You upload a JSON context payload (system prompts, RAG documents, conversation history, tool definitions, etc.), and ContextForge autonomously:
- Parses & segments the context into typed sections with token counts and TF-IDF redundancy detection
- Establishes a baseline by scoring responses on the full context across multiple reasoning tiers using LLM-as-judge evaluation
- Ablates each section individually by removing it, re-running evaluation queries, and measuring the quality delta
- Classifies every section as Essential, Moderate, Removable, or Harmful based on its impact
- Runs greedy backward elimination to find the leanest configuration that stays within a quality tolerance
- Tests section ordering (start/middle/end positions) to find optimal placement
- Generates an AI-powered Context Diet Plan with section-specific keep/remove/condense recommendations via Nova extended thinking (HIGH tier)
- Produces a publication-ready HTML report with interactive Plotly charts, statistical analysis, and cost savings projections
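The classification and greedy-elimination steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the thresholds and the function names (`classify`, `greedy_backward_eliminate`) are assumptions, not ContextForge's actual API or tuning.

```python
# Hypothetical sketch of section classification and greedy backward
# elimination. Thresholds and scoring are illustrative assumptions.

def classify(delta: float) -> str:
    """Label a section by the quality drop observed when it is removed
    (positive delta = quality fell without the section)."""
    if delta > 0.05:
        return "Essential"   # removing it hurts quality noticeably
    if delta > 0.01:
        return "Moderate"
    if delta >= -0.01:
        return "Removable"   # no measurable contribution
    return "Harmful"         # quality improved without it

def greedy_backward_eliminate(sections, score_fn, tolerance=0.03):
    """Repeatedly drop the section whose removal costs the least quality,
    while staying within `tolerance` of the full-context baseline."""
    baseline = score_fn(sections)
    kept = list(sections)
    while True:
        best = None
        for s in kept:
            candidate = [k for k in kept if k != s]
            drop = baseline - score_fn(candidate)
            if drop <= tolerance and (best is None or drop < best[1]):
                best = (s, drop)
        if best is None:
            return kept  # no further removal fits the tolerance
        kept.remove(best[0])
```

In the real pipeline `score_fn` is the LLM-as-judge evaluation, so each candidate removal costs a batch of API calls.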
On our demo payload, a 212K-token "bloated customer support agent," ContextForge identified that 58% of tokens were removable with less than 3% quality loss, projecting $37 in savings per 1,000 API calls.
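The TF-IDF redundancy detection from the parsing step can be approximated with scikit-learn, which is already in the stack. A minimal sketch; the similarity threshold and function name are assumptions:

```python
# Sketch of pairwise TF-IDF redundancy detection between context
# sections. The 0.8 threshold is an illustrative assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def redundant_pairs(sections: list[str], threshold: float = 0.8):
    """Return index pairs of sections whose TF-IDF cosine similarity
    exceeds `threshold` -- likely near-duplicates."""
    tfidf = TfidfVectorizer().fit_transform(sections)
    sim = cosine_similarity(tfidf)
    return [(i, j)
            for i in range(len(sections))
            for j in range(i + 1, len(sections))
            if sim[i, j] >= threshold]
```

Lexical TF-IDF only catches word-overlap duplicates, which is exactly the limitation the planned Nova Embeddings upgrade addresses.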
How we built it
ContextForge is built entirely on Amazon Nova 2 Lite via the AWS Bedrock Converse API, using Nova's unique capabilities as four distinct cognitive components:
- QualityScorer uses LLM-as-judge scoring with Extended Thinking (MEDIUM tier), evaluating responses on relevance, completeness, accuracy, and helpfulness
- DietPlanner generates optimization recommendations via Extended Thinking (HIGH tier) for deep analytical reasoning
- ReportGenerator produces narrative HTML reports with optional Code Interpreter for demo flair
- QueryGenerator auto-generates evaluation queries from context content
What makes this truly agentic is that Extended Thinking isn't just used for better outputs. It's used as an experimental variable. Each ablation is tested across all four reasoning tiers (disabled, low, medium, high) to measure how reasoning depth affects context sensitivity.
The tech stack includes Python 3.12+ with Pydantic v2 for type-safe data models, Streamlit for a five-page interactive UI with a pastel light theme, Plotly for interactive charts, numpy/scipy/scikit-learn for local statistical analysis, and Jinja2 for HTML report templating. The infrastructure layer handles adaptive rate limiting (RPM + TPM), exponential backoff retries, a 4-strategy JSON parser for robust LLM output parsing, and centralized usage tracking. The project has 164 tests (154 unit + 10 integration).
Challenges we ran into
- Extended thinking token budget management: Reasoning tokens consume the output budget before the text response. Setting `max_tokens` too low (e.g., 500) produced empty text output because all tokens went to reasoning. We had to carefully tune budgets, using `max_tokens=16000` for the quality scorer with MEDIUM reasoning. We also learned that HIGH tier doesn't accept `max_tokens` at all and causes a `ValidationException`.
- Robust LLM JSON parsing: Nova's outputs frequently included markdown fences, preamble text, trailing commas, and single quotes around JSON. We built a 4-strategy fallback parser (raw JSON, then fence extraction, then json-repair, then bracket extraction) to handle every variant reliably.
- Streamlit threading constraints: Streamlit doesn't allow calling `st.*` from background threads. We implemented a `queue.Queue`-based message-passing system stored in `st.session_state`, with `@st.fragment(run_every="2s")` for progress polling. This pattern took several iterations to get right.
- Cross-thread experiment cancellation: Adding a "Cancel Experiment" button required `threading.Event`-based signaling with per-API-call cancellation checks in deeply nested loops, plus a custom `ExperimentCancelled` exception that propagates cleanly through the entire ablation pipeline without being swallowed by broad `except Exception` blocks.
- CSS inside Streamlit fragments: DOM isolation in `@st.fragment` broke the CSS selectors for the cancel button styling. We had to move the button outside the fragment and use `st.container(key=...)` for reliable CSS targeting.
- Demo payload engineering: Creating a payload that produces dramatic findings required deliberate design: 100K tokens of product catalog that contributes almost nothing to quality, 35 irrelevant conversation turns, 18 unused tool definitions, and 40% internal FAQ redundancy.
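The fallback parsing described above can be sketched as a chain of progressively more forgiving strategies. This simplified version covers three of the four strategies; the real pipeline also inserts the json-repair library between fence extraction and bracket extraction:

```python
# Sketch of a multi-strategy JSON parser for LLM output. Simplified
# illustration of the 4-strategy fallback; json-repair is omitted here.
import json
import re

def parse_llm_json(text: str):
    # Strategy 1: the output is already valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strategy 2: extract the body of a ```json ... ``` markdown fence.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass
    # Strategy 3 (json-repair in the real system) is omitted here.
    # Strategy 4: fall back to the outermost {...} or [...] span.
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        start, end = text.find(open_ch), text.rfind(close_ch)
        if start != -1 and end > start:
            try:
                return json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                continue
    raise ValueError("no parseable JSON found in LLM output")
```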
Accomplishments that we're proud of
- End-to-end autonomous pipeline: ContextForge runs 35 to 800+ API calls without human intervention across a multi-phase pipeline, self-correcting with fallback reasoning tiers on parse failures.
- Extended Thinking as a research variable: Using all four reasoning tiers as an experimental dimension, not just a feature, reveals insights like sections that matter at the `disabled` tier but become irrelevant when the model can reason through their absence.
- Pareto frontier optimization: Automatically computing non-dominated quality-vs-cost configurations so users can pick the optimal tradeoff for their budget.
- 164 tests with zero Bedrock dependency for unit tests: All 154 unit tests run without AWS credentials, with comprehensive mocking of every Bedrock interaction.
- Solo project scope: Built the entire system as a solo hackathon project, including infrastructure, core engine, statistical analysis, five-page Streamlit UI with five custom chart components, HTML report generator, diet planner, demo payload generator, and comprehensive test suite.
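The Pareto frontier computation mentioned above reduces to a classic sweep: sort configurations by cost and keep each one that beats the best quality seen so far. A minimal sketch under an assumed `(cost, quality)` representation:

```python
# Sketch of non-dominated (Pareto) filtering over quality-vs-cost
# configurations. The tuple representation is an illustrative assumption.

def pareto_frontier(configs):
    """configs: list of (cost, quality) tuples; returns the configurations
    not dominated by any cheaper-and-at-least-as-good one, sorted by cost."""
    frontier = []
    best_quality = float("-inf")
    for cost, quality in sorted(configs):  # ascending cost
        if quality > best_quality:  # strictly beats every cheaper config
            frontier.append((cost, quality))
            best_quality = quality
    return frontier
```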
What we learned
- Context bloat is real and measurable: Our demo showed that over half the tokens in a typical agent context contribute almost nothing to response quality. The intuition many developers have is quantifiably correct.
- Reasoning depth changes context sensitivity: Sections that are essential at lower reasoning tiers can become redundant when the model has more reasoning budget. This is a non-obvious finding that challenges the assumption that "more context is always better."
- LLM-as-judge is viable but fragile: Nova produces excellent quality scores, but the output format varies enough that robust parsing infrastructure is non-negotiable. Our 4-strategy JSON parser was essential.
- Rate limiting is an art: With RPM=200 and TPM=8M limits, adaptive rate limiting with both request and token budgets was critical for running hundreds of sequential API calls without hitting throttling errors.
- Streamlit is powerful but opinionated: The threading model, fragment-based polling, and CSS isolation behaviors required creative workarounds, but the result is a genuinely polished interactive application.
What's next for ContextForge
- Nova Embeddings for redundancy detection: Replace TF-IDF cosine similarity with `amazon.nova-2-multimodal-embeddings-v1:0` for semantic redundancy detection that catches paraphrased duplicates, not just lexical overlap.
- Web Grounding: Enrich reports with external research via Nova's web grounding system tool, providing industry benchmarks and best practices alongside ablation findings.
- Multi-model comparison: Test ablation sensitivity across different Nova model sizes to identify model-specific context requirements.
- Additional demo payloads: Enterprise RAG and agentic workflow scenarios to demonstrate ContextForge's versatility across different LLM application patterns.
- Parallel execution: Run independent ablation experiments concurrently to dramatically reduce experiment runtime.
- Export lean configurations: Output optimized context configurations as reusable JSON templates that teams can plug directly into their applications.
Built With
- amazon-nova-2-lite
- amazon-web-services
- jinja
- json-repair
- numpy
- pandas
- plotly
- pydantic
- python
- scikit-learn
- scipy
- streamlit
- tiktoken
