
Verifiable Agent Demo

This repository is the walkthrough demo for the execution-evidence path.

It is the guided walkthrough surface across the stack; it is neither the canonical architecture hub nor the canonical evidence-profile spec.

Navigation

Fastest runnable path

This repo proves the path; agent-evidence is the evidence substrate; aro-audit is the audit control plane.

Fastest local run:

python3 -m demo.agent

Fastest enterprise sandbox artifact chain:

python3 examples/enterprise_sandbox_demo/run.py

The sandbox run writes artifacts/enterprise_sandbox_demo/ with:

  • intent.json
  • policy.json
  • trace.jsonl
  • sep.bundle.json
  • replay_verdict.json
  • audit_receipt.json
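
A minimal post-run check can confirm the chain is complete. This sketch is illustrative only: `check_bundle` and the synthetic bundle below are assumptions, not part of the demo; after a real run, point it at artifacts/enterprise_sandbox_demo/ instead.

```python
import pathlib
import tempfile

# Expected artifact names from the enterprise sandbox chain.
EXPECTED = [
    "intent.json", "policy.json", "trace.jsonl",
    "sep.bundle.json", "replay_verdict.json", "audit_receipt.json",
]

def check_bundle(out_dir):
    """Return the expected artifact names missing from out_dir."""
    out = pathlib.Path(out_dir)
    return [name for name in EXPECTED if not (out / name).exists()]

# Demonstrated against a synthetic bundle written into a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED:
        (pathlib.Path(tmp) / name).write_text("{}")
    print(check_bundle(tmp))  # → []
```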

Start here

Depends on

Status

  • active walkthrough demo
  • research annexes remain secondary to the demo path
  • not a canonical implementation repo

Shared doctrine:

Sandbox controls execution; portable evidence verifies execution.

  1. Governance decides what should be allowed.
  2. Execution integrity proves what actually happened.
  3. Audit evidence exports artifacts for independent review.
```mermaid
flowchart LR
    Persona["Persona (POP)"] --> Intent["Intent Object (AIP)"]
    Intent --> Governance["Governance Check"]
    Governance --> Trace["Execution Trace"]
    Trace --> Audit["Audit Evidence (ARO)"]
```
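
The layered path above can be sketched in miniature. Everything here (the field names, the policy shape, the allowed_tasks key) is a hypothetical illustration of the control flow, not the demo's actual schemas.

```python
import time

def run_with_evidence(persona, task, policy):
    """Hypothetical miniature of the Persona -> Intent -> Governance ->
    Trace -> Audit path; all field names are illustrative only."""
    intent = {"persona": persona, "task": task, "ts": time.time()}
    if task not in policy["allowed_tasks"]:  # Governance Check
        return {"intent": intent, "trace": [], "verdict": "denied"}
    trace = [{"step": 1, "event": "executed", "task": task}]  # Execution Trace
    return {"intent": intent, "trace": trace, "verdict": "allowed"}  # Audit Evidence

receipt = run_with_evidence("analyst", "weekly_report",
                            {"allowed_tasks": ["weekly_report"]})
print(receipt["verdict"])  # → allowed
```

Governance decides before execution, the trace records what happened, and the returned object is the exportable evidence: the same three-step doctrine stated above.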

What this demo proves

  • a portable persona-oriented entry point can be projected into runtime
  • explicit intent and action objects can be emitted before execution
  • result objects can be emitted after execution
  • execution steps can be recorded as inspectable evidence
  • audit-facing artifacts can be exported as bounded outputs
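
The intent-before / result-after pattern can be sketched as follows. The `emit` helper, file layout, and payloads are hypothetical and do not reflect the demo's actual interaction/ schemas.

```python
import json
import pathlib
import tempfile

def emit(obj_dir, name, payload):
    """Write one interaction object as pretty-printed JSON; return its path."""
    path = pathlib.Path(obj_dir) / f"{name}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path

with tempfile.TemporaryDirectory() as tmp:
    emit(tmp, "intent", {"goal": "organize client visit notes"})  # before execution
    emit(tmp, "action", {"tool": "notes_organizer"})              # the step taken
    emit(tmp, "result", {"status": "ok"})                         # after execution
    print(sorted(p.name for p in pathlib.Path(tmp).iterdir()))
    # → ['action.json', 'intent.json', 'result.json']
```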

Architecture Path in this Demo

  • Persona Layer -> POP-aligned persona context carried into the run
  • Interaction Layer -> intent, action, and result objects emitted under interaction/
  • Governance Layer -> referenced as the control checkpoint for runtime policy and budget constraints
  • Execution Integrity Layer -> runtime execution trace and verifiable execution context
  • Audit Evidence Layer -> ARO-style exported evidence artifacts

This repository does not claim a full Token Governor integration. It demonstrates a minimal aligned path across the broader stack, with explicit governance checkpoint references in the emitted interaction and result objects.

It now also includes one fixed enterprise sandbox artifact chain for the scenario "organize client visit notes -> generate weekly report -> request approval", while still not claiming a general full-stack Token Governor integration.

How to read this demo

This demo is a guided path across layers. It is not the normative specification for each layer, and it points outward to the canonical repositories for those layers: digital-biosphere-architecture, persona-object-protocol, agent-intent-protocol, token-governor, and aro-audit.

Execution Evidence Demo Note

See docs/execution-evidence-demo-note.md.

Expected Artifacts

Repo-tracked sample bundle:

  • interaction/intent.json
  • interaction/action.json
  • interaction/result.json
  • evidence/example_audit.json
  • evidence/result.json
  • evidence/sample-manifest.json
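
One way such a sample manifest could bind the bundle together is with content digests. This is a sketch of the idea only; the actual layout of evidence/sample-manifest.json may differ.

```python
import hashlib
import json
import pathlib
import tempfile

def build_manifest(paths):
    """Map each file name to a sha256 hex digest of its bytes
    (illustrative layout; the tracked manifest may use another format)."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in map(pathlib.Path, paths)
    }

with tempfile.TemporaryDirectory() as tmp:
    f = pathlib.Path(tmp) / "result.json"
    f.write_text(json.dumps({"status": "ok"}))
    manifest = build_manifest([f])
    print(len(manifest["result.json"]))  # → 64 (sha256 hex digest length)
```

A digest-based manifest makes tampering with any artifact in the bundle detectable by re-hashing and comparing.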

Additional tracked example:

  • evidence/crew_demo_audit.json

Current concrete examples in this repository include:

  • docs/quick-walkthrough.md
  • docs/interaction-flow.md
  • docs/shortest-validation-loop.md

Run the Demo

Scripted wrapper

bash scripts/run_demo.sh

This local wrapper writes fresh output under artifacts/demo_output/.

Fastest external demo path

bash scripts/run_demo.sh
make killer-demo
python3 -m http.server --directory docs 8000

The receipt for the enterprise sandbox chain is checked through the canonical ARO surface aro_audit.receipt_validation with the minimal profile.
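
The idea of a minimal-profile receipt check can be sketched as below. The required field names and verdict values here are hypothetical; the canonical check remains aro_audit.receipt_validation.

```python
REQUIRED_MINIMAL = ("receipt_id", "bundle_hash", "verdict")  # illustrative fields

def validate_receipt_minimal(receipt):
    """Return (ok, reason); a stand-in for a minimal receipt profile,
    not the canonical aro_audit implementation."""
    missing = [k for k in REQUIRED_MINIMAL if k not in receipt]
    if missing:
        return False, f"missing fields: {missing}"
    if receipt["verdict"] != "pass":
        return False, f"verdict was {receipt['verdict']!r}"
    return True, "ok"

print(validate_receipt_minimal(
    {"receipt_id": "r1", "bundle_hash": "abc", "verdict": "pass"}))
# → (True, 'ok')
```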

Existing CrewAI demo path

bash scripts/setup_framework_venv.sh
.venv/bin/python crew/crew_demo.py

Environment notes:

  • Python 3 is sufficient for the minimal local path.
  • Refresh the tracked deterministic sample bundle with python3 scripts/refresh_demo_samples.py.
  • The optional CrewAI and LangChain paths should run from a git-ignored local .venv/ created by scripts/setup_framework_venv.sh.
  • The pinned framework helper environment currently uses crewai 1.10.1, langchain 1.2.12, and langchain-core 1.2.18.
  • CrewAI currently requires Python <3.14.
  • Both demo paths use deterministic local mock data and do not require external API calls.

Repository Automation

  • The Mermaid render workflow opens PRs to main only through a dedicated GitHub App.
  • Configure repository variable PROTOCOL_BOT_APP_ID and repository secret PROTOCOL_BOT_PRIVATE_KEY under Settings -> Secrets and variables -> Actions.
  • The default repository GITHUB_TOKEN remains read-only and is not used for auto-PR promotion.

Research Evaluation Annex

This repository now includes a paper-ready evaluation harness for Execution Evidence Architecture for Agentic Software Systems: From Intent Objects to Verifiable Audit Receipts.

Primary entry points:

  • make eval-baseline
  • make eval-evidence
  • make eval-external-baseline
  • make eval-framework-pair
  • make eval-langchain-pair
  • make eval-ablation
  • make falsification-checks
  • make human-review-kit
  • make review-sample
  • make compare
  • make paper-eval
  • make top-journal-pack

Supporting material:

Generated outputs:

  • artifacts/runs/<task_id>/<mode>/
  • docs/paper_support/comparison-summary.md
  • docs/paper_support/comparison-summary.csv
  • artifacts/metrics/comparison-summary.json
  • docs/paper_support/external-baseline-summary.md
  • docs/paper_support/framework-pair-summary.md
  • docs/paper_support/langchain-pair-summary.md
  • docs/paper_support/ablation-summary.md
  • docs/paper_support/falsification-summary.md
  • artifacts/human_review/synthetic-review-summary.json

Research Manuscript Draft

The repository also includes a manuscript draft grounded in the current implemented harness and checked-in metrics:

Related Repositories

Minimal Reference Surface

  • interaction/ for explicit interaction objects
  • evidence/ for audit and result artifacts
  • demo/ and crew/ for runnable entry points
  • integration/ for persona and intent adapters
  • docs/spec/ for schema notes and example payloads

Further Reading