Inspiration
Every LLM-powered application stuffs its context window with system prompts, RAG documents, conversation history, tool definitions, and few-shot examples, but nobody knows which parts actually matter. Teams guess which sections to trim, have no way to measure quality impact, and end up paying for tokens that contribute nothing. We were inspired by ablation studies from machine learning research, the practice of systematically removing components to measure their contribution. We asked: what if we could ablate LLM context the same way? ContextForge was born from the frustration of manual prompt trimming and the realization that context optimization deserves the same rigor as model optimization.
What it does
ContextForge is an agentic context ablation testing tool that systematically optimizes LLM context payloads. You upload a JSON context payload (system prompts, RAG documents, conversation history, tool definitions, etc.), and ContextForge autonomously:
- Parses & segments the context into typed sections with token counts and TF-IDF redundancy detection
- Establishes a baseline by scoring responses on the full context across multiple reasoning tiers using LLM-as-judge evaluation
- Ablates each section individually by removing it, re-running evaluation queries, and measuring the quality delta
- Classifies every section as Essential, Moderate, Removable, or Harmful based on its impact
- Runs greedy backward elimination to find the leanest configuration that stays within a quality tolerance
- Tests section ordering (start/middle/end positions) to find optimal placement
- Generates an AI-powered Context Diet Plan with section-specific keep/remove/condense recommendations via Nova extended thinking (HIGH tier)
- Produces a publication-ready HTML report with interactive Plotly charts, statistical analysis, and cost savings projections
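The classification and greedy-elimination steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the thresholds and the function names (`classify`, `greedy_backward_eliminate`) are assumptions, not ContextForge's actual API or tuning.

```python
# Hypothetical sketch of section classification and greedy backward
# elimination. Thresholds and scoring are illustrative assumptions.

def classify(delta: float) -> str:
    """Label a section by the quality drop observed when it is removed
    (positive delta = quality fell without the section)."""
    if delta > 0.05:
        return "Essential"   # removing it hurts quality noticeably
    if delta > 0.01:
        return "Moderate"
    if delta >= -0.01:
        return "Removable"   # no measurable contribution
    return "Harmful"         # quality improved without it

def greedy_backward_eliminate(sections, score_fn, tolerance=0.03):
    """Repeatedly drop the section whose removal costs the least quality,
    while staying within `tolerance` of the full-context baseline."""
    baseline = score_fn(sections)
    kept = list(sections)
    while True:
        best = None
        for s in kept:
            candidate = [k for k in kept if k != s]
            drop = baseline - score_fn(candidate)
            if drop <= tolerance and (best is None or drop < best[1]):
                best = (s, drop)
        if best is None:
            return kept  # no further removal fits the tolerance
        kept.remove(best[0])
```

In the real pipeline `score_fn` is the LLM-as-judge evaluation, so each candidate removal costs a batch of API calls.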
On our demo payload, a 212K-token "bloated customer support agent," ContextForge identified that 58% of tokens were removable with less than 3% quality loss, projecting $37 in savings per 1,000 API calls.
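The TF-IDF redundancy detection from the parsing step can be approximated with scikit-learn, which is already in the stack. A minimal sketch; the similarity threshold and function name are assumptions:

```python
# Sketch of pairwise TF-IDF redundancy detection between context
# sections. The 0.8 threshold is an illustrative assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def redundant_pairs(sections: list[str], threshold: float = 0.8):
    """Return index pairs of sections whose TF-IDF cosine similarity
    exceeds `threshold` -- likely near-duplicates."""
    tfidf = TfidfVectorizer().fit_transform(sections)
    sim = cosine_similarity(tfidf)
    return [(i, j)
            for i in range(len(sections))
            for j in range(i + 1, len(sections))
            if sim[i, j] >= threshold]
```

Lexical TF-IDF only catches word-overlap duplicates, which is exactly the limitation the planned Nova Embeddings upgrade addresses.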
How we built it
ContextForge is built entirely on Amazon Nova 2 Lite via the AWS Bedrock Converse API, using Nova's unique capabilities as four distinct cognitive components:
- QualityScorer uses LLM-as-judge scoring with Extended Thinking (MEDIUM tier), evaluating responses on relevance, completeness, accuracy, and helpfulness
- DietPlanner generates optimization recommendations via Extended Thinking (HIGH tier) for deep analytical reasoning
- ReportGenerator produces narrative HTML reports with optional Code Interpreter for demo flair
- QueryGenerator auto-generates evaluation queries from context content
What makes this truly agentic is that Extended Thinking isn't just used for better outputs. It's used as an experimental variable. Each ablation is tested across all four reasoning tiers (disabled, low, medium, high) to measure how reasoning depth affects context sensitivity.
The tech stack includes Python 3.12+ with Pydantic v2 for type-safe data models, Streamlit for a five-page interactive UI with a pastel light theme, Plotly for interactive charts, numpy/scipy/scikit-learn for local statistical analysis, and Jinja2 for HTML report templating. The infrastructure layer handles adaptive rate limiting (RPM + TPM), exponential backoff retries, a 4-strategy JSON parser for robust LLM output parsing, and centralized usage tracking. The project has 164 tests (154 unit + 10 integration).
Challenges we ran into
- Extended thinking token budget management: Reasoning tokens consume the output budget before the text response. Setting `max_tokens` too low (e.g., 500) produced empty text output because all tokens went to reasoning. We had to carefully tune budgets, using `max_tokens=16000` for the quality scorer with MEDIUM reasoning. We also learned that HIGH tier doesn't accept `max_tokens` at all and causes a `ValidationException`.
- Robust LLM JSON parsing: Nova's outputs frequently included markdown fences, preamble text, trailing commas, and single quotes around JSON. We built a 4-strategy fallback parser (raw JSON, then fence extraction, then json-repair, then bracket extraction) to handle every variant reliably.
- Streamlit threading constraints: Streamlit doesn't allow calling `st.*` from background threads. We implemented a `queue.Queue`-based message-passing system stored in `st.session_state`, with `@st.fragment(run_every="2s")` for progress polling. This pattern took several iterations to get right.
- Cross-thread experiment cancellation: Adding a "Cancel Experiment" button required `threading.Event`-based signaling with per-API-call cancellation checks in deeply nested loops, plus a custom `ExperimentCancelled` exception that propagates cleanly through the entire ablation pipeline without being swallowed by broad `except Exception` blocks.
- CSS inside Streamlit fragments: DOM isolation in `@st.fragment` broke the CSS selectors for the cancel button styling. We had to move the button outside the fragment and use `st.container(key=...)` for reliable CSS targeting.
- Demo payload engineering: Creating a payload that produces dramatic findings required deliberate design: 100K tokens of product catalog that contributes almost nothing to quality, 35 irrelevant conversation turns, 18 unused tool definitions, and 40% internal FAQ redundancy.
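The fallback parsing described above can be sketched as a chain of progressively more forgiving strategies. This simplified version covers three of the four strategies; the real pipeline also inserts the json-repair library between fence extraction and bracket extraction:

```python
# Sketch of a multi-strategy JSON parser for LLM output. Simplified
# illustration of the 4-strategy fallback; json-repair is omitted here.
import json
import re

def parse_llm_json(text: str):
    # Strategy 1: the output is already valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strategy 2: extract the body of a ```json ... ``` markdown fence.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass
    # Strategy 3 (json-repair in the real system) is omitted here.
    # Strategy 4: fall back to the outermost {...} or [...] span.
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        start, end = text.find(open_ch), text.rfind(close_ch)
        if start != -1 and end > start:
            try:
                return json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                continue
    raise ValueError("no parseable JSON found in LLM output")
```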
Accomplishments that we're proud of
- End-to-end autonomous pipeline: ContextForge runs 35 to 800+ API calls without human intervention across a multi-phase pipeline, self-correcting with fallback reasoning tiers on parse failures.
- Extended Thinking as a research variable: Using all four reasoning tiers as an experimental dimension, not just a feature, reveals insights like sections that matter at the `disabled` tier but become irrelevant when the model can reason through their absence.
- Pareto frontier optimization: Automatically computing non-dominated quality-vs-cost configurations so users can pick the optimal tradeoff for their budget.
- 164 tests with zero Bedrock dependency for unit tests: All 154 unit tests run without AWS credentials, with comprehensive mocking of every Bedrock interaction.
- Solo project scope: Built the entire system as a solo hackathon project, including infrastructure, core engine, statistical analysis, five-page Streamlit UI with five custom chart components, HTML report generator, diet planner, demo payload generator, and comprehensive test suite.
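The Pareto frontier computation mentioned above reduces to a classic sweep: sort configurations by cost and keep each one that beats the best quality seen so far. A minimal sketch under an assumed `(cost, quality)` representation:

```python
# Sketch of non-dominated (Pareto) filtering over quality-vs-cost
# configurations. The tuple representation is an illustrative assumption.

def pareto_frontier(configs):
    """configs: list of (cost, quality) tuples; returns the configurations
    not dominated by any cheaper-and-at-least-as-good one, sorted by cost."""
    frontier = []
    best_quality = float("-inf")
    for cost, quality in sorted(configs):  # ascending cost
        if quality > best_quality:  # strictly beats every cheaper config
            frontier.append((cost, quality))
            best_quality = quality
    return frontier
```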
What we learned
- Context bloat is real and measurable: Our demo showed that over half the tokens in a typical agent context contribute almost nothing to response quality. The intuition many developers have is quantifiably correct.
- Reasoning depth changes context sensitivity: Sections that are essential at lower reasoning tiers can become redundant when the model has more reasoning budget. This is a non-obvious finding that challenges the assumption that "more context is always better."
- LLM-as-judge is viable but fragile: Nova produces excellent quality scores, but the output format varies enough that robust parsing infrastructure is non-negotiable. Our 4-strategy JSON parser was essential.
- Rate limiting is an art: With RPM=200 and TPM=8M limits, adaptive rate limiting with both request and token budgets was critical for running hundreds of sequential API calls without hitting throttling errors.
- Streamlit is powerful but opinionated: The threading model, fragment-based polling, and CSS isolation behaviors required creative workarounds, but the result is a genuinely polished interactive application.
What's next for ContextForge
- Nova Embeddings for redundancy detection: Replace TF-IDF cosine similarity with `amazon.nova-2-multimodal-embeddings-v1:0` for semantic redundancy detection that catches paraphrased duplicates, not just lexical overlap.
- Web Grounding: Enrich reports with external research via Nova's web grounding system tool, providing industry benchmarks and best practices alongside ablation findings.
- Multi-model comparison: Test ablation sensitivity across different Nova model sizes to identify model-specific context requirements.
- Additional demo payloads: Enterprise RAG and agentic workflow scenarios to demonstrate ContextForge's versatility across different LLM application patterns.
- Parallel execution: Run independent ablation experiments concurrently to dramatically reduce experiment runtime.
- Export lean configurations: Output optimized context configurations as reusable JSON templates that teams can plug directly into their applications.
Built With
- amazon-nova-2-lite
- amazon-web-services
- jinja
- json-repair
- numpy
- pandas
- plotly
- pydantic
- python
- scikit-learn
- scipy
- streamlit
- tiktoken
