A curated list of pioneering research papers, tools, and resources on the Agent Harness — the systematic execution layer that transforms raw model capability into sustained, long-horizon autonomy.
A Survey on AI Agent Harness
Agent = Model (Stochastic Intelligence) + Harness (Deterministic Infrastructure)
The survey proposes a Unified Architectural Taxonomy that organizes the Agent Harness as a four-layered stack:
- Layer 1: Execution & Orchestration — The temporal engine driving the autonomous execution loop, model routing, and multi-agent composition.
- Layer 2: Context & Trajectory Management — The epistemic layer governing state compaction, trajectory persistence, memory hierarchies, and observability.
- Layer 3: Interaction Surface & Execution Environment — The sensory and actuation organs connecting the agent to the world via tool calling, standardized protocols, and sandboxed execution.
- Layer 4: Constraints & Guardrails — The independent observer enforcing deterministic laws through access control, permission management, and defense against agent injection.
The figure below illustrates the asymmetric co-evolution between model capability and harness responsibility:
We aim to provide a comprehensive overview for researchers, developers, and infrastructure engineers interested in this rapidly advancing field.
- Agent Harness Foundations
- Model & Agent Routing
- Multi-Agent Composition & Orchestration
- Autonomous Loop, Resilience & Human-in-the-Loop
- Memory Systems
- Context Compression
- Trajectory Persistence & Observability
- Self-Evolving Architectures
- Agentic Skills
- Skills Security
- Standardized Protocols & Interaction Surface
- Tool Use & Code Execution
- Sandboxing & Execution Environments
- Governance Boundaries
- Agent Injection & Defense
Cross-layer conceptual works that define and motivate the Agent Harness as a first-class research object.
| Title | Author | Year | Description |
|---|---|---|---|
| Effective harnesses for long-running agents | Young et al. | 2025 | long-running agent harness management |
| Natural-Language Agent Harnesses | Pan et al. | 2026 | natural-language harness design |
| Harness Engineering for Language Agents: The Harness Layer as Control, Agency, and Runtime | He et al. | 2026 | harness as control, agency, and runtime layer |
| Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned | Bui et al. | 2026 | terminal coding agent scaffolding, context engineering, lessons learned |
| Harness engineering: leveraging Codex in an agent-first world | Lopopolo et al. | 2026 | Codex-based harness engineering |
| The importance of Agent Harness in 2026 | Schmid et al. | 2026 | agent harness importance analysis |
| What is an agent harness in the context of large-language models? | Parallel Web Systems et al. | 2025 | agent harness concept overview |
| Meta-Harness: End-to-End Optimization of Model Harnesses | Lee et al. | 2026 | end-to-end automated optimization of harness code |
Acting as the temporal engine of the harness, Layer 1 drives the autonomous execution loop, manages model routing, orchestrates multi-agent compositions, and enforces resilience mechanisms to maintain forward momentum under failures.
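The execution loop described above can be sketched minimally. Everything here is illustrative — `call_model` is a stub standing in for a real LLM API call, and the tool set is a toy — but the loop shape (model call, tool execution, observation fed back, bounded iterations) is the core pattern:

```python
# Minimal harness execution loop: repeatedly call the model, execute any
# requested tool, feed the observation back, and stop when the model
# signals completion or an iteration budget is exhausted.
# `call_model` is a stub; a real harness would call an LLM API here.

def call_model(history):
    # Stub policy: request a tool call first, then finish once a tool
    # observation is present in the history.
    if any(msg["role"] == "tool" for msg in history):
        return {"done": True, "answer": "4"}
    return {"done": False, "tool": "add", "args": {"a": 2, "b": 2}}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(task, max_steps=8):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # resilience: the loop is always bounded
        action = call_model(history)
        if action["done"]:
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": str(result)})
    raise RuntimeError("iteration budget exhausted")
```

The deterministic parts — the loop bound, tool dispatch, and history bookkeeping — belong to the harness; only `call_model` is stochastic.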
Dynamically determining which LLM or specialized agent should handle a given subtask, optimizing for cost, capability, and resource constraints.
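A minimal routing sketch, assuming a hypothetical model table with capability tiers and per-call costs (all names and numbers are illustrative, not real pricing):

```python
# Cost/capability-aware router: among models meeting the subtask's
# capability requirement and budget, pick the cheapest.
# Model names, tiers, and costs are illustrative.

MODELS = [
    {"name": "small",  "capability": 1, "cost_per_call": 0.001},
    {"name": "medium", "capability": 2, "cost_per_call": 0.01},
    {"name": "large",  "capability": 3, "cost_per_call": 0.10},
]

def route(required_capability, budget):
    candidates = [m for m in MODELS
                  if m["capability"] >= required_capability
                  and m["cost_per_call"] <= budget]
    if not candidates:
        raise ValueError("no model satisfies the constraints")
    return min(candidates, key=lambda m: m["cost_per_call"])["name"]
```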
Treating agents as composable, modular entities and orchestrating concurrent subagent spawning, delegation, and synchronized state handoffs.
| Title | Author | Year | Description |
|---|---|---|---|
| Claude Code Subagents | Anthropic | 2025 | custom AI subagent spawning |
| Compass: Enhancing agent long-horizon reasoning with evolving context | Wan et al. | 2025 | evolving context for long-horizon reasoning |
| Kimi K2.5: Visual Agentic Intelligence | Team et al. | 2026 | visual agentic intelligence |

| Swarm: An educational framework exploring ergonomic, lightweight multi-agent orchestration | OpenAI et al. | 2024 | lightweight multi-agent orchestration |
| CrewAI: Framework for orchestrating role-playing autonomous AI agents | Moura et al. | 2025 | role-playing agent orchestration |
| A Declarative Language for Building And Orchestrating LLM-Powered Agent Workflows | Daunis et al. | 2025 | declarative agent workflow language |
| Orchestral AI: A Framework for Agent Orchestration | Roman et al. | 2026 | general-purpose agent orchestration |
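The subagent pattern these works explore — spawning scoped workers concurrently and synchronizing their results back into the parent's state — can be sketched as follows. The subagent body is stubbed; in practice each worker would run its own model/tool loop:

```python
# Orchestrator spawns subagents concurrently and merges their results
# into a shared state dict. Subagent logic is stubbed out.
from concurrent.futures import ThreadPoolExecutor

def subagent(role, task):
    # Stub: a real subagent would run its own execution loop here.
    return {role: f"completed: {task}"}

def orchestrate(tasks):
    state = {}
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(subagent, role, task)
                   for role, task in tasks.items()]
        for f in futures:
            state.update(f.result())  # synchronized state handoff
    return state
```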
Ensuring the execution loop is resilient to non-termination and drift, and managing the spectrum from full human oversight to closed-loop autonomy.
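The oversight spectrum can be made concrete with a risk-gated approval hook: low-risk actions run autonomously, high-risk actions require a human decision. The threshold and risk scores are illustrative assumptions:

```python
# Human-in-the-loop gate: actions below a risk threshold run in the
# closed loop; riskier actions are routed through an approval callback.
# Threshold and risk values are illustrative.

def execute(action, risk, approve, threshold=0.5):
    if risk >= threshold:
        if not approve(action):          # human oversight path
            return "rejected"
    return f"executed: {action}"         # autonomous path
```

Passing `approve=lambda a: True` collapses the gate into full autonomy; `threshold=0.0` forces full human oversight — the same harness covers the whole spectrum.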
While the orchestration layer manages execution time, Layer 2 governs the agent's epistemic space — mitigating context window saturation and catastrophic forgetting while maintaining strict observability.
Structured, queryable knowledge layers ranging from production-ready platforms to research prototypes.
Strategies to prevent Context Rot — the progressive degradation of reasoning quality due to accumulated irrelevant tokens.
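A naive compaction sketch: when the history exceeds a token budget, the oldest turns are collapsed into a single summary message while recent turns are kept verbatim. The token counter is a crude word-count proxy and the summarizer is stubbed — a real harness would use a model call for both:

```python
# Naive context compaction: replace the oldest turns with one summary
# message once the history exceeds a token budget. The token counter is
# a word-count proxy and the summary is a stub.

def count_tokens(msgs):
    return sum(len(m["content"].split()) for m in msgs)

def compact(history, budget, keep_recent=2):
    if count_tokens(history) <= budget:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = {"role": "system",
               "content": "summary of %d earlier turns" % len(old)}
    return [summary] + recent
```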
Persisting the agent's execution history to external storage for recovery, replay, and continuous learning, while decoupling observability from the model's working memory.
| Title | Author | Year | Description |
|---|---|---|---|
| Reducing Cost of LLM Agents with Trajectory Reduction | Xiao et al. | 2025 | trajectory reduction for efficiency |
| Semantic Checkpointing for Stateless LLM Agents in Multi-Tenant Enterprise Systems | Roshan et al. | 2025 | semantic checkpointing for stateless agents |
| Large-scale Evaluation of Notebook Checkpointing with AI Agents | Fang et al. | 2025 | notebook checkpointing evaluation |
| AgentTrace: A Structured Logging Framework for Agent System Observability | AlSayyad et al. | 2026 | structured logging for observability |
| AgentSight: System-Level Observability for AI Agents Using eBPF | Zheng et al. | 2025 | eBPF-based system-level observability |
| Durable Execution in LangGraph | LangChain et al. | 2026 | fault-tolerant durable execution |
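Trajectory persistence as described above reduces, at its simplest, to an append-only step log in external storage that supports replay. A minimal sketch (file path and record shape are illustrative):

```python
# Append-only trajectory log: each step is persisted as a JSON line so a
# crashed run can be recovered or replayed from external storage,
# decoupled from the model's working memory.
import json

def log_step(path, step):
    with open(path, "a") as f:
        f.write(json.dumps(step) + "\n")  # durable, append-only

def replay(path):
    with open(path) as f:
        return [json.loads(line) for line in f]
```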
Agent systems that improve their own capabilities, prompts, or memory structures at test time or through continuous interaction.
Modular, reusable capabilities that agents acquire, compose, and execute to extend their action space.
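One common realization is a registry that binds each skill to metadata and an entry point, so skills extend the action space without touching the core loop. This sketch is illustrative, not any specific framework's API:

```python
# Skill registry: skills are declared with metadata and invoked by name,
# extending the agent's action space without changing the core loop.

SKILLS = {}

def skill(name, description):
    def register(fn):
        SKILLS[name] = {"description": description, "run": fn}
        return fn
    return register

@skill("summarize", "condense text to its first sentence")
def summarize(text):
    return text.split(".")[0] + "."

def invoke(name, *args):
    return SKILLS[name]["run"](*args)
```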
Security vulnerabilities and defenses related to agentic skill systems and skill-based prompt injection.
| Title | Author | Year | Description |
|---|---|---|---|
| Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections | Schmotz et al. | 2025 | skill-based prompt injection analysis |
| Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale | Liu et al. | 2026 | skill security vulnerabilities at scale |
| Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study | Liu et al. | 2026 | malicious skill detection study |
| When Skills Lie: Hidden-Comment Injection in LLM Agents | Wang et al. | 2026 | hidden-comment skill injection |
| Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections | Yang et al. | 2026 | persistent control via self-reinforcing injection |
Because language models are inherently disembodied, Layer 3 constitutes the sensory and actuation organs of the agentic system — standardizing interfaces for tool calling and code execution, and enforcing hardware-level isolation.
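Standardized tool interfaces typically mean each tool declares a parameter schema the harness validates before dispatch, so malformed model-emitted calls fail deterministically rather than reaching the environment. A minimal sketch with an illustrative stubbed tool:

```python
# Schema-validated tool dispatch: each tool declares its parameters and
# types; the harness rejects model-emitted calls that do not match
# before any actuation happens. The tool itself is a stub.

TOOLS = {
    "get_weather": {
        "params": {"city": str},
        "run": lambda city: f"sunny in {city}",  # stubbed actuation
    }
}

def dispatch(call):
    tool = TOOLS[call["name"]]
    for param, typ in tool["params"].items():
        if not isinstance(call["args"].get(param), typ):
            raise TypeError(f"bad or missing argument: {param}")
    return tool["run"](**call["args"])
```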
Defining and standardizing how agents interact with tools, APIs, and external environments.
Benchmarks and methods for evaluating and improving agent tool use capabilities.
| Title | Author | Year | Description |
|---|---|---|---|
| WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment | Dihan et al. | 2025 | action-aware web tree search |
| Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces | Merrill et al. | 2026 | CLI task benchmarking |
| Budget-Constrained Agentic Large Language Models: Intention-Based Planning for Costly Tool Use | Liu et al. | 2026 | budget-constrained tool planning |
| ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities | Lu et al. | 2025 | stateful tool-use evaluation |
Because LLM outputs are inherently probabilistic, Layer 4 acts as an independent observer and judge — imposing deterministic laws of physics and security boundaries on the system, operating entirely out-of-band.
Isolating agent execution to contain erratic behaviors and protect host infrastructure.
Enforcing access control, permission management, and policy compliance for agent actions.
| Title | Author | Year | Description |
|---|---|---|---|
| POLARIS: Typed Planning and Governed Execution for Agentic AI in Back-Office Automation | Moslemi et al. | 2026 | typed planning and governed execution |
| ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files | Sharma et al. | 2026 | executable constraint enforcement |
| Sandbox-runtime: A lightweight sandboxing tool for enforcing filesystem and network restrictions on arbitrary processes at the OS level, without requiring a container | Anthropic et al. | 2026 | OS-level filesystem/network sandboxing |
| Securing AI Agent Execution | Buhler et al. | 2025 | agent execution security analysis |
| BashArena: A Control Setting for Highly Privileged AI Agents | Kaufman et al. | 2025 | highly-privileged agent control setting |
| Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents | Maloyan et al. | 2026 | MCP specification security analysis |
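The out-of-band, default-deny character of this layer can be illustrated with a policy table consulted before every filesystem action, independent of what the model "intends". The rules and paths are illustrative:

```python
# Out-of-band permission gate: a first-match policy table decides every
# filesystem action before it runs, with default-deny as the fallback.
# Rules and paths are illustrative.
import fnmatch

POLICY = [
    ("write", "/workspace/*", "allow"),
    ("write", "*",            "deny"),
    ("read",  "*",            "allow"),
]

def check(action, path):
    for act, pattern, verdict in POLICY:  # first match wins
        if act == action and fnmatch.fnmatch(path, pattern):
            return verdict
    return "deny"                          # default-deny fallback
```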
Defending against adversarial prompt injection attacks targeting agentic systems.
Contributions are welcome! To add a paper, open a pull request with the new entry added to the relevant section, following the format below:
| Title | Author | Year | Brief description |
Please ensure the paper is directly relevant to the Agent Harness infrastructure.
This repository is maintained in conjunction with the survey paper "A Survey on AI Agent Harness".
