Sovereign Mohawk Proto: A formally verified federated learning runtime that scales to 10 million nodes with Byzantine resilience and quantum-resistant security - while keeping data private on the edge.
Federated learning breaks down at massive scale when trust has to be assumed, communication becomes the bottleneck, and a single coordinator becomes a failure domain. Sovereign Mohawk Proto addresses that by using hierarchical aggregation, inline verification, TPM-backed sovereignty, and a post-quantum migration path so edge data stays private while the system remains auditable. The result is a control plane that can be inspected, verified, and scaled instead of just hoped for.
Traditional federated learning can look simple in a lab and fail badly in the field: trust assumptions accumulate at the coordinator, bandwidth becomes the limiting factor, and one weak service can stall the whole training loop. This runtime shifts that model to hierarchical aggregation with inline zk-proof verification, TPM-backed node sovereignty, and a post-quantum migration plan so each round is harder to fake and easier to audit. It is meant for fleets where privacy, scale, and verifiability are all required at the same time.
- One command boots the full stack and its observability surfaces.
- A small local node set exercises aggregation, policy checks, and proof verification.
- The runtime exposes the same health path operators use in production-style demos.
- The primary HUD and C2 view share a structured AI interaction summary so common actions stay explainable and easy to trigger.
- The HUD now includes a review drawer and backend decision history so AI-suggested actions can be approved, edited, rejected, or undone with an audit trail.
Fastest path:
```bash
make sandbox-up
```
Then open:
- HUD: http://localhost:3000
- Grafana: http://localhost:3001
- Prometheus: http://localhost:9090
Other low-friction entry points:
- Verify the running stack: `make stack-verify`
- Browser-friendly notebook demo: examples/pysyft-integration/pysyft_mohawk_poc.ipynb
- Notebook-backed integration PoC: examples/pysyft-integration/README.md
- Public landing page source: docs/index.md
```mermaid
flowchart TB
    subgraph Edge[Trusted Edge]
        C1["Node 1<br/>TPM + PQC"]
        C2["Node 2<br/>TPM + PQC"]
        C3["Node 3<br/>TPM + PQC"]
    end
    C1 --> H[Hierarchical Aggregation]
    C2 --> H
    C3 --> H
    H --> ZK[zk-Proof Verification]
    ZK --> G[Governance + Policy Gate]
    G --> A[Auditable Coordinator]
    A --> O[Grafana + HUD]
```
| Capability | Sovereign Mohawk Proto | NVIDIA FLARE | PySyft |
|---|---|---|---|
| Scale posture | Hierarchical aggregation designed for very large edge fleets | Enterprise FL orchestration with coordinator-centric flows | Privacy-preserving workflows with policy-based data access |
| Trust model | TPM-backed sovereignty, inline verification, and auditable control paths | Policy and orchestration focused; trust is usually externalized to deployment | Strong privacy controls; verification depends on the integration pattern |
| Verification | zk-proof oriented round verification and auditable checkpoints | Typically relies on platform controls and deployment discipline | Privacy-first, but not a native zk-verification stack |
| Edge privacy | Data stays on the edge; only verified updates move upstream | Supports distributed training, but privacy depends on your deployment | Built to support private data workflows and datasite-style control |
| PQC path | Explicit post-quantum migration track | Not a default project-level emphasis | Not a default project-level emphasis |
Advanced Theorems and Proof Notes
- The deeper formalization and runtime assumptions are tracked in the technical brief and roadmap sections below.
- The published verification numbers in the badges are project metrics, not universal guarantees.
- For the broader research context, see Documentation/README.md and Documentation/Project/ROADMAP.md.
The project is open to contributors who want to work on practical scale and verification problems.
- Good first issues: CONTRIBUTING.md
- Code of conduct: CODE_OF_CONDUCT.md
- Security policy: SECURITY.md
- License: LICENSE
- Roadmap: Documentation/Project/ROADMAP.md
- Research notes: documentation/RESEARCH_FINDINGS.md
- Repo settings admin runbook: Documentation/Project/REPO_SETTINGS_ADMIN_RUNBOOK.md
Looking for help with:
- NPU and accelerator ports.
- Auditors for theorems and verification claims.
- Python SDK improvements and notebook ergonomics.
- Better GIFs/screenshots for the full-stack walkthrough.
- Full 10M-node hardware-backed validation run.
- More accelerator ports across CPU, GPU, and NPU backends.
- Cleaner browser-first demo flow and more polished release media.
- Continued PQC migration and proof-verification hardening.
Repository landing page source: docs/index.md
Documentation entrypoint: docs/README.md
Canonical docs navigation: docs/README.md for active operator guides and Documentation/MASTER_DOCUMENTATION_INDEX.md for full repository documentation indexing.
- Cross-platform CI coverage is active on `main`.
- Linux lanes: `Build and Test`, `Lint Code Base`, `Reproducibility Check`, `Observability CI`
- Windows lane: `Windows Client EXE Build`
- macOS lane: `macOS Client Smoke`
- Observability CI now runs on push, pull request, scheduled drift checks, and release events (`published`, `prereleased`).
- Current contributor baseline validation commands:

```bash
npm run test:ci
make observability-smoke
make alerts-test
~/.local/bin/black --check .
~/.local/bin/flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
```
For current roadmap status and remaining open items, see Documentation/Performance/OPS_OPTIMIZATION_ROADMAP_2026-03-24.md and Documentation/Project/ROADMAP.md.
The repository includes a public-facing Genesis Testnet drill to demonstrate post-quantum migration readiness with auditable artifacts.
Run the drill:
```bash
make quantum-kex-rotation-drill
```
Run with strict non-mock backend enforcement:
```bash
make quantum-kex-rotation-drill-strict
```
Verify production-mode guardrail policy in CI/local:
```bash
make quantum-kex-production-guardrail-check
```
Or run directly with explicit endpoint/token inputs:
```bash
NODE_AGENT_BASE_URL=http://localhost:8082 \
MOHAWK_API_TOKEN_FILE=/run/secrets/mohawk_api_token \
bash scripts/quantum-kex-rotation-drill.sh
```
Operator runbook and disclosure guidance:
Artifact output:
```
artifacts/quantum-kex-rotation/<drill-id>/
```
Production posture note:
- Testnet drills may use mock/simulated fallback backends for rehearsal.
- Production PQC readiness claims require strict non-mock enforcement and `production_pqc_ready=true` in `drill-summary.json`.
- `production_pqc_ready=true` requires the Go and Python liboqs adapters to both report `available=true`.
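The readiness rule above can be expressed as a small check over the drill summary. This is a hedged sketch: the `production_pqc_ready` flag comes from `drill-summary.json` as documented, but the nested `adapters` layout below is a hypothetical shape for illustration, not the actual artifact schema.

```python
import json

def pqc_ready(summary: dict) -> bool:
    """Return True only when the drill summary supports a production PQC claim.

    The `adapters` layout is illustrative; the real schema is defined by the
    drill script's artifact output.
    """
    adapters = summary.get("adapters", {})
    both_available = all(
        adapters.get(lang, {}).get("available") is True
        for lang in ("go", "python")
    )
    return bool(summary.get("production_pqc_ready")) and both_available

# A rehearsal run with a mock fallback must not pass the production gate.
rehearsal = {"production_pqc_ready": False,
             "adapters": {"go": {"available": True}, "python": {"available": False}}}
strict = {"production_pqc_ready": True,
          "adapters": {"go": {"available": True}, "python": {"available": True}}}
print(pqc_ready(rehearsal), pqc_ready(strict))  # -> False True
```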
If you just cloned the repo and want to run tests quickly, use this sequence.
```bash
git clone https://github.com/rwilliamspbg-ops/Sovereign_Map_Federated_Learning.git
cd Sovereign_Map_Federated_Learning
make help
make fmt
make lint
make test
make smoke
make observability-smoke
make quickstart-verify
```
Where to get contribution guidance:
- Full contribution process and PR checklist: CONTRIBUTING.md
- Quick contribution opportunities: README.md#help-wanted-quick-wins
- Runtime validation expectations: README.md#contributor-first-steps
- Operations dashboard metric contract: docs/OPERATIONS_DASHBOARD_METRIC_CONTRACT.md
The consolidated validation path now supports profile-based execution, trend SLO enforcement, artifact diff summaries, browser runtime cadence checks, and scheduled deep validation runs.
What was added:
- Required-style PR gate workflow: .github/workflows/full-validation-pr-gate.yml
- Scheduled deep workflow: .github/workflows/full-validation-scheduled-deep.yml
- Fast and deep suite profiles: tests/scripts/python/run_full_validation_suite.py
- Trend SLO checker: tests/scripts/ci/check_validation_trends.py
- CI diff summary writer: tests/scripts/ci/write_validation_diff_summary.py
- Browser runtime E2E cadence check: tests/scripts/python/test_browser_runtime_e2e.py
- Playwright runtime artifacts: tests/e2e/runtime-cadence.spec.js, tests/e2e/playwright.config.js
Canonical commands:
```bash
npm run test:setup
npm run test:full:fast
npm run test:full:deep
npm run test:trends
npm run test:summary:diff
```
Validation artifacts:
- test-results/full-validation/full_validation_<timestamp>.json
- test-results/full-validation/full_validation_<timestamp>.md
- test-results/full-validation/history.jsonl
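A trend SLO check over `history.jsonl` could look like the sketch below. The record fields (`pass_rate`, `duration_s`) and thresholds are assumptions for illustration; the authoritative logic lives in `tests/scripts/ci/check_validation_trends.py`.

```python
import json, io

# Hypothetical history.jsonl records; real schema is defined by the repo's
# trend checker, not by this sketch.
history = io.StringIO(
    '{"timestamp": "2026-03-01", "pass_rate": 0.99, "duration_s": 410}\n'
    '{"timestamp": "2026-03-02", "pass_rate": 0.97, "duration_s": 445}\n'
    '{"timestamp": "2026-03-03", "pass_rate": 0.92, "duration_s": 520}\n'
)
runs = [json.loads(line) for line in history if line.strip()]

# Example trend SLO: fail when the latest pass rate drops below a floor, or
# runtime regresses more than 20% versus the oldest retained run.
PASS_RATE_FLOOR = 0.95
slo_ok = (runs[-1]["pass_rate"] >= PASS_RATE_FLOOR
          and runs[-1]["duration_s"] <= runs[0]["duration_s"] * 1.2)
print("trend SLO ok:", slo_ok)  # -> trend SLO ok: False
```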
Documentation governance:
- Documentation maintenance runbook: docs/DOCUMENTATION_MAINTENANCE.md
- Test setup details and profile usage: tests/docs/TEST_ENV_SETUP.md
The mobile hardening and store packaging track is now implemented in-repo.
What is now live:
- Hardware-backed mobile signer wrappers for iOS Secure Enclave and Android StrongBox/Keystore integration in app flow.
- Canonical signed gradient envelope adapter on both mobile platforms, aligned to backend verifier contract.
- Backend endpoint for signed mobile gradient verification: `/mobile/verify_gradient`.
- Contract test coverage for valid and invalid mobile signed payloads.
- Production store wrapper packages for both Android Play Store and Apple App Store submission flows.
Primary references:
- Mobile implementation overview: mobile-apps/MOBILE_APP_README.md
- Android store wrapper: mobile-apps/android-node-app/store-wrapper/README.md
- iOS store wrapper: mobile-apps/ios-node-app/store-wrapper/README.md
- iOS submission checklist: mobile-apps/ios-node-app/store-wrapper/APP_STORE_CONNECT_SUBMISSION_CHECKLIST.md
- Backend verifier endpoint implementation: sovereignmap_production_backend_v2.py
- Mobile verifier contract test: tests/scripts/python/test_mobile_verify_gradient_contract.py
This upgrade ties blockchain and bridge execution telemetry directly into the production Grafana surfaces.
What was added:
- New exporter metrics for blockchain and bridge runtime state: `tokenomics_chain_height`, `tokenomics_bridge_transfers_total`, `tokenomics_bridge_routes_active`, `tokenomics_fl_verification_ratio`, `tokenomics_fl_average_confidence_bps`
- Tokenomics telemetry payload now emits these fields from backend runtime calculations.
- Grafana Operations and Tokenomics dashboards now include a dedicated Blockchain and Bridge Runtime section with verification, confidence, transfer throughput, route count, and chain height.
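For orientation, the new metrics might be rendered in Prometheus text exposition roughly as below. The metric names are from the upgrade notes; the sample values and the `render_gauges` helper are illustrative only, not the exporter's actual implementation.

```python
def render_gauges(values: dict) -> str:
    """Render a dict of metric name -> value in Prometheus text exposition."""
    lines = []
    for name, value in values.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Sample values are made up for demonstration.
snapshot = {
    "tokenomics_chain_height": 184233,
    "tokenomics_bridge_transfers_total": 5120,
    "tokenomics_bridge_routes_active": 4,
    "tokenomics_fl_verification_ratio": 0.998,
    "tokenomics_fl_average_confidence_bps": 9870,
}
exposition = render_gauges(snapshot)
print(exposition)
```

In the real exporter these values are emitted by `tokenomics_metrics_exporter.py` and scraped per `prometheus.yml`.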
Primary files updated for this upgrade:
- Metrics exporter: tokenomics_metrics_exporter.py
- Backend telemetry payload: sovereignmap_production_backend_v2.py
- Operations dashboard: grafana/provisioning/dashboards/operations_overview.json
- Tokenomics dashboard: grafana/provisioning/dashboards/tokenomics_overview.json
- Prometheus scrape config: prometheus.yml
Dashboard provisioning note:
- Canonical dashboards are served from grafana/provisioning/dashboards.
- Grafana home dashboard now defaults to grafana/provisioning/dashboards/operations_overview.json.
- STARRED live dashboard set:
- grafana/provisioning/dashboards/operations_overview.json
- grafana/provisioning/dashboards/tokenomics_overview.json
- grafana/provisioning/dashboards/llm_overview.json
Operator validation commands:
```bash
make observability-smoke
python3 scripts/check_dashboard_queries.py
```
This upgrade package adds a local-first marketplace and governance workflow with production-facing observability guardrails.
What is included:
- Marketplace flows: offers, intents, matching, escrow release, dispute workflows, and governance proposals/voting.
- Network expansion flows: attestation sharing, self-service invite requests, admin approval/rejection/revocation.
- Dashboard and metrics integration: marketplace/governance snapshots in `/metrics_summary` and expanded HUD browser demo controls.
- Prometheus additions: `marketplace_alerts.yml` with stall/high-watermark detection plus promtool tests in `marketplace_alerts.test.yml`.
- API contract tests: local positive-path and negative-path coverage under `tests/scripts/python/test_marketplace_local_contracts.py` and `tests/scripts/python/test_marketplace_negative_paths.py`.
Primary references:
- First 10 minutes guide: docs/OPEN_ECOSYSTEM_FIRST_10_MINUTES.md
- Sprint 1 roadmap: docs/OPEN_ECOSYSTEM_SPRINT1_ROADMAP.md
- Sprint 2 roadmap: docs/OPEN_ECOSYSTEM_SPRINT2_ROADMAP.md
- API examples: docs/api/http-examples.md
- Backend implementation: sovereignmap_production_backend_v2.py
- Grafana operations dashboard: grafana/provisioning/dashboards/operations_overview.json
Validation commands:
```bash
make observability-smoke
make observability-live-smoke
make alerts-test
python3 tests/scripts/python/test_marketplace_local_contracts.py
python3 tests/scripts/python/test_marketplace_negative_paths.py
```
Canonical auditable artifact capture command:
```bash
RESULTS_ROOT=artifacts/final-verification/$(date +%F) TARGET_NODES=10 STRICT_NPU=0 DURATION_SECONDS=600 INTERVAL_SECONDS=60 bash scripts/demo-10min-auditable.sh
```
The following environment variables are available for safe runtime tuning:
- Runtime profile selection and memory-pressure control loop: `RUNTIME_PROFILE=ultra_latency|balanced|throughput` (default `balanced`), `MEMORY_PRESSURE_SAMPLE_SECONDS` (default `5.0`)
- Runtime API endpoints: `GET/POST /runtime/profile`; `GET /metrics_summary` includes `runtime_profile`, `provider_execution_policy`, and `memory_pressure`
Runtime profile quick commands:
```bash
curl -fsS http://localhost:8000/runtime/profile | jq .
curl -fsS -X POST http://localhost:8000/runtime/profile \
  -H 'Content-Type: application/json' \
  -d '{"profile":"throughput"}' | jq .
```
- FL aggregation path selection: `FL_AGGREGATION_MODE=auto|loop|vectorized`, `FL_AGGREGATION_VECTORIZE_MIN_CLIENTS` (default `1000`), `FL_AGGREGATION_VECTORIZE_MAX_PEAK_BYTES` (default `536870912`)
- DP/Opacus parameters: `DP_NOISE_MULTIPLIER` (default `1.1`), `DP_MAX_GRAD_NORM` (default `1.0`)
- TPM cache and spike controls: `TPM_ATTESTATION_MAX_REPORTS` (default `256`), `TPM_ATTESTATION_CACHE_TTL` (default `30s`), `TPM_ATTESTATION_SPIKE_THRESHOLD` (default `200us`)
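How auto-mode path selection could use the documented variables and defaults is sketched below. The `choose_aggregation_path` helper is illustrative, not the backend's implementation; only the variable names and default values come from the list above.

```python
import os

# Documented variables and defaults; the selection logic is an assumption.
MODE = os.environ.get("FL_AGGREGATION_MODE", "auto")
MIN_CLIENTS = int(os.environ.get("FL_AGGREGATION_VECTORIZE_MIN_CLIENTS", "1000"))
MAX_PEAK_BYTES = int(os.environ.get("FL_AGGREGATION_VECTORIZE_MAX_PEAK_BYTES", "536870912"))

def choose_aggregation_path(num_clients: int, est_peak_bytes: int, mode: str = MODE) -> str:
    if mode in ("loop", "vectorized"):
        return mode  # explicit override wins over auto selection
    # auto: vectorize only when the fan-out is large enough to pay off and
    # the estimated peak memory stays under the configured ceiling.
    if num_clients >= MIN_CLIENTS and est_peak_bytes <= MAX_PEAK_BYTES:
        return "vectorized"
    return "loop"

print(choose_aggregation_path(50, 10_000_000))     # small round -> loop
print(choose_aggregation_path(5000, 100_000_000))  # large round -> vectorized
```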
Operational notes:
- Aggregation path usage is exported as `fl_aggregation_path_total{impl="loop|vectorized"}`.
- The Operations dashboard includes a "FL Aggregation Path Usage" panel to verify auto-mode behavior.
- The Operations dashboard includes a "What Changed (Current vs Prior Window)" panel for rapid delta-based triage.
- The Operations dashboard includes annotations for control/config changes via `sovereign_ops_control_actions_total`.
- The HUD includes runbook-match cards for active `opsHealth.alerts` to accelerate first-response triage.
Roadmap and execution tracking:
Incident bundle export (first-response evidence):
```bash
python3 scripts/export_incident_bundle.py
```
Incident tooling CI guard:
Run the fastest end-to-end path from startup to live dashboards:
```bash
docker compose -f docker-compose.full.yml up -d --scale node-agent=5
make stack-verify
```
Before running the startup command, confirm Docker has enough space for first-time image builds:
```bash
df -h
docker system df
```
If the Docker cache is full, reclaim space and retry:
```bash
docker system prune -af
```
Then open:
- HUD: http://localhost:3000
- Grafana: http://localhost:3001
- Prometheus: http://localhost:9090
Use this flow as the canonical local run sequence.
- Step 1: Preflight checks
```bash
docker --version
docker compose version
df -h
docker system df
```
- Step 2: Build all full-stack images explicitly (no start yet)
```bash
docker compose -f docker-compose.full.yml build
```
- Step 3: Start the full stack with five node agents
```bash
docker compose -f docker-compose.full.yml up -d --scale node-agent=5
```
- Step 4: Verify service state and health
```bash
docker compose -f docker-compose.full.yml ps
curl -fsS http://localhost:8000/status | jq
curl -fsS http://localhost:8000/health | jq
curl -fsS http://localhost:8000/ops/health | jq
```
- Step 5: Follow logs during first run (optional but useful)
```bash
docker compose -f docker-compose.full.yml logs -f backend frontend prometheus grafana alertmanager
```
Common build/start options:
- Rebuild from scratch if dependencies changed: `docker compose -f docker-compose.full.yml build --no-cache`
- Recreate containers after image rebuild: `docker compose -f docker-compose.full.yml up -d --force-recreate --scale node-agent=5`
- Remove stale orphans after compose changes: `docker compose -f docker-compose.full.yml up -d --remove-orphans --scale node-agent=5`

If you hit disk pressure during build (No space left on device):
```bash
docker system prune -af
docker builder prune -af
```
Stop and clean up:
```bash
docker compose -f docker-compose.full.yml down --remove-orphans
```
A ready-to-run PySyft x Mohawk integration proof-of-concept lives in examples/pysyft-integration.
Quick start:
```bash
pip install -r examples/pysyft-integration/requirements-pysyft-demo.txt
python examples/pysyft-integration/pysyft_mohawk_poc.py --mode mock --rounds 5 --participants 5
```
Sovereign Map uses a streaming aggregation model instead of loading full model updates into memory at once.
- Memory efficiency: Mohawk-style chunked processing reduces memory pressure by up to 224x for large update sets.
- Byzantine resilience: selective verification and trust scoring reduce adversarial impact with sublinear validation behavior for high node counts.
- Hardware root of trust: every node contributes attestation and certificate telemetry into the same operational control plane.
```mermaid
flowchart LR
    A[Client Updates] --> B{Traditional FL Aggregator}
    B --> C[Load full model deltas in memory]
    C --> D[High RAM footprint per round]
    A --> E{SovereignMap Mohawk Stream Aggregator}
    E --> F[Chunk updates into streaming windows]
    F --> G[Validate trust and policy per chunk]
    G --> H[Aggregate incrementally]
    H --> I[Low steady-state memory use]
```
Mohawk-style streaming aggregation treats model updates as a continuous stream of chunks rather than a monolithic tensor payload. This allows the coordinator to perform verification, filtering, and merge steps incrementally while retaining bounded working memory. In practice, this is what makes high fan-out node participation feasible on commodity infrastructure: memory usage scales with chunk window size instead of full global update size, while trust and policy checks run inline with aggregation.
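The bounded-memory idea can be sketched in a few lines. This is an illustrative toy, not the Mohawk aggregator: the chunk size, trust floor, and list-based "tensors" are assumptions chosen to keep the example self-contained.

```python
def streaming_aggregate(client_updates, trust_scores, chunk_size=4, trust_floor=0.5):
    """Average accepted client updates one chunk window at a time.

    Per-window working memory is O(chunk_size * clients), never
    O(model_len * clients), which is the property the text describes.
    """
    # Inline trust/policy gate: drop low-trust contributors before merging.
    accepted = [u for u, t in zip(client_updates, trust_scores) if t >= trust_floor]
    if not accepted:
        return []
    model_len = len(accepted[0])
    aggregate = [0.0] * model_len
    for start in range(0, model_len, chunk_size):
        window = [u[start:start + chunk_size] for u in accepted]
        for offset in range(len(window[0])):
            aggregate[start + offset] = sum(col[offset] for col in window) / len(window)
    return aggregate

updates = [[1.0] * 8, [3.0] * 8, [100.0] * 8]  # third client is low-trust
result = streaming_aggregate(updates, trust_scores=[0.9, 0.8, 0.1])
print(result)  # -> [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
```

In the real runtime the chunk window also carries attestation metadata through the policy gate, so verification runs inline with the merge rather than as a separate pass.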
Sovereign Map currently uses the Mohawk Proto streaming aggregation approach as its default high-scale aggregation path.
Advisory (informational): Mohawk Proto and related aggregation design elements are marked as provisional-patent advisory material by project maintainers.
- The implementation in this repository is provided for evaluation, research, and integration testing.
- Commercial licensing expectations may evolve as provisional filings progress.
- For enterprise/commercial questions, open a discussion in GitHub Discussions.
Legend: scan left to right in trust order: quality gates -> security/governance -> SDK/release -> deploy surfaces -> device/runtime footprint -> community signals.
Core Quality Gates:
Security and Governance Gates:
SDK and Release Engineering:
Deployment and Packaging:
Device and Runtime Footprint:
Community and Repository Signals:
- Browse all workflows: GitHub Actions workflow matrix
- Contributor and governance process: CONTRIBUTING.md
- Running secure, federated ML training across distributed nodes where raw data must stay local.
- Operating Byzantine-resilient model aggregation in adversarial or partially trusted environments.
- Building trust-aware AI infrastructure with policy controls, attestation signals, and auditable telemetry.
- Monitoring real-time FL, tokenomics, and system health through Prometheus and Grafana surfaces.
- Prototyping and scaling from local Docker deployments to large Compose/Kubernetes profiles.
| Node Class | Minimum (Functional) | Recommended (Sustained) |
|---|---|---|
| Edge CPU Node | Raspberry Pi 4 (4 GB RAM), 4-core ARM CPU, 32 GB storage, Linux, TPM 2.0 device access | Raspberry Pi 5 / x86 mini PC (8-16 GB RAM), NVMe storage, TPM 2.0, stable wired network |
| Edge GPU/NPU Node | Jetson Nano / Intel NPU-capable edge device, 8 GB RAM, CUDA/NPU drivers | NVIDIA Jetson Orin / equivalent, 16+ GB RAM, tuned CUDA/NPU stack |
| Operator / Aggregator | 8 vCPU, 16 GB RAM, SSD, Docker Compose | 16+ vCPU, 32-64 GB RAM, NVMe, GPU optional, isolated monitoring host |
| Monitoring Stack | 2 vCPU, 4 GB RAM for Prometheus + Grafana | 4-8 vCPU, 8-16 GB RAM with longer retention and dashboard concurrency |
Use hardware_auto_tuner.py to auto-profile host capability and choose an acceleration profile before large-scale runs.
Sovereign Map Federated Learning is a dual-plane runtime:
- Aggregation plane: Flower-based federated coordination with Byzantine-robust strategy logic and convergence tracking.
- Control and telemetry plane: Flask services for health, HUD, trust/policy operations, join lifecycle, and metrics publication.
Core characteristics:
- Byzantine-resilient training strategy with runtime convergence history.
- Trust and verification APIs for attestation-style governance workflows.
- Policy and join-management endpoints for operator-controlled enrollment.
- Prometheus-compatible metrics exporters for operational and tokenomics surfaces.
- Multi-profile deployment via Docker Compose and Kubernetes manifests.
- Hardware-aware tests spanning NPU, XPU, CUDA/ROCm, MPS, and CPU fallbacks.
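The fallback chain across those backends can be sketched as a probe sequence. This is a hedged illustration: the probe functions are injected so the logic runs without torch installed; in practice they would wrap calls such as `torch.cuda.is_available()` or `torch.backends.mps.is_available()`, and the real selection lives in hardware_auto_tuner.py.

```python
def pick_device(probes: dict) -> str:
    """Return the first available backend in preference order, else CPU.

    `probes` maps backend name -> zero-arg callable returning availability;
    the preference order mirrors the backends listed above.
    """
    for name in ("npu", "xpu", "cuda", "rocm", "mps"):
        if probes.get(name, lambda: False)():
            return name
    return "cpu"  # guaranteed fallback

print(pick_device({"cuda": lambda: True}))  # -> cuda
print(pick_device({}))                      # -> cpu
```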
- Backend aggregation and APIs: sovereignmap_production_backend_v2.py
- Tokenomics metrics exporter: tokenomics_metrics_exporter.py
- TPM metrics exporter: tpm_metrics_exporter.py
- Frontend HUD: frontend/src/HUD.jsx
- Primary compose profile: docker-compose.full.yml
- Kubernetes scale profile: kubernetes-5000-node-manifests.yaml
Visual proof for this project should be treated as release evidence, not optional decoration.
Expected screenshot artifacts per release:
- Operations HUD: trust score, node participation, latency wall, and resilience indicators.
- Grafana Operations Overview: gauge deck + trend wall under live load.
- Grafana Tokenomics Overview: mint/bridge/validator/wallet health sections.
Tracked asset locations:
- docs/screenshots/hud-operations-overview.png
- docs/screenshots/grafana-operations-overview.png
- docs/screenshots/grafana-llm-overview.png
- docs/screenshots/grafana-tokenomics-overview.png
Capture workflow and acceptance checklist:
Feature artifact manifest:
Current status: screenshot paths are defined and release capture workflow is documented; attach rendered PNG/GIF evidence in each tagged release.
```mermaid
sequenceDiagram
    participant Node as Edge Node Client
    participant Flower as Flower Aggregation Plane (:8080)
    participant Mohawk as Mohawk Stream Aggregator
    participant Policy as Trust/Policy Gate
    participant API as Control Plane API (:8000)
    participant Prom as Prometheus
    participant Grafana as Grafana/HUD
    Node->>Flower: Submit model update (FitRes)
    Flower->>Mohawk: Forward update chunks
    Mohawk->>Policy: Validate adapter policy + attestation metadata
    Policy-->>Mohawk: Accept/Reject + reason labels
    Mohawk->>Flower: Incremental aggregate result
    Flower->>API: Publish round metrics + convergence snapshot
    API->>Prom: Expose /metrics and event-derived gauges/counters
    Prom->>Grafana: Scrape telemetry
    API->>Grafana: Serve /health, /ops/health, /hud_data, /metrics_summary
```
| Domain | Runtime Surfaces | Purpose |
|---|---|---|
| Federated learning | sovereignmap_production_backend_v2.py, src/client.py | Round orchestration, aggregation, convergence |
| Trust and attestation | tpm_cert_manager.py, tpm_metrics_exporter.py, secure_communication.py | Identity, verification, trust signals |
| Governance and policy | bridge-policies.json, capabilities.json | Runtime controls and policy surfaces |
| Tokenomics and economics | tokenomics_metrics_exporter.py | Economic telemetry and dashboard inputs |
| Observability | prometheus.yml, alertmanager.yml, fl_slo_alerts.yml | Metrics collection, alerting, SLO validation |
| Operations | deploy.sh, deploy_demo.sh, Makefile | Deployment and repeatable operator workflows |
Use this quick index to jump directly to API command groups and docs.
API command index:
- Control Plane API commands
- Training Service API commands
- TPM Exporter API commands
- Tokenomics Exporter API commands
API docs index:
- HTTP examples: docs/api/http-examples.md
- Swagger UI (multi-spec): docs/api/swagger-ui.html
- OpenAPI specs: docs/api/openapi.yaml, docs/api/openapi.training.yaml, docs/api/openapi.tpm.yaml, docs/api/openapi.tokenomics.yaml
- Postman collection: docs/api/postman_collection.json
- API validator command: `npm run api:validate`
- API docs CI: .github/workflows/api-spec-validation.yml, .github/workflows/api-docs-pages.yml
```bash
# Control Plane API commands
curl -s http://localhost:8000/status | jq
curl -s http://localhost:8000/health | jq
curl -s http://localhost:8000/ops/health | jq
curl -s -X POST http://localhost:8000/trigger_fl \
  -H "X-Join-Admin-Token: ${JOIN_API_ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{}' | jq
curl -s http://localhost:8000/metrics_summary | jq
curl -s http://localhost:8000/convergence | jq
curl -s http://localhost:8000/autonomy/twin/summary | jq
curl -s http://localhost:8000/autonomy/planner/insights | jq
curl -s http://localhost:8000/autonomy/sensors/quality | jq
curl -s http://localhost:8000/autonomy/slo/status | jq

# Training Service API commands
curl -s http://localhost:5001/health | jq
curl -s http://localhost:8000/training/status | jq

# TPM Exporter API commands
curl -s http://localhost:9091/health | jq
curl -s http://localhost:9091/metrics/summary | jq

# Tokenomics Exporter API commands
curl -s http://localhost:9105/health | jq
```
| Endpoint | Method | Function | Responsibility |
|---|---|---|---|
| /health | GET | health | Service health, enclave status, HUD telemetry snapshot |
| /status | GET | status | Aggregator runtime identity and core port map |
| /chat | POST | chat_query | HUD assistant query handling for operator prompts |
| /hud_data | GET | hud_data | HUD metrics including audit accuracy and simulation counters |
| /founders | GET | get_founders | Founding-signature identity list for governance views |
| /trigger_fl | POST | trigger_fl_round | Manual FL round simulation and convergence updates |
| /create_enclave | POST | create_enclave | Enclave state transition workflow |
| /convergence | GET | get_convergence | Convergence history arrays for charting |
| /metrics_summary | GET | metrics_summary | Aggregated metrics summary across runtime domains |
| /model_registry | GET | model_registry_recent | Recent persisted model metadata and round snapshots |
| /simulate/<simulation_type> | POST | trigger_hud_simulation | Records HUD simulation events by scenario type |
| /ops/health | GET | ops_health | Operational dependency/system snapshot (ports, memory/disk pressure, Prometheus reachability) |
| /ops/events/recent | GET | ops_events_recent | Returns recent operations events for timeline replay |
| /ops/events | GET (SSE) | ops_events_stream | Server-sent event stream for live operations telemetry |
| /swarm/status | GET | swarm_status_view | Swarm C2 status snapshot for command deck summaries |
| /swarm/map | GET | swarm_map_view | Bounded swarm map payload with configurable layers |
| /swarm/commands | GET | swarm_commands_view | Recent accepted command history for UI replay |
| /swarm/audit/recent | GET | swarm_audit_recent_view | Signed audit-chain entries for accepted commands (admin gated) |
| /swarm/command | POST | swarm_command_submit | Authenticated + role-aware swarm command submission with validation/rate limits |
| /autonomy/twin/summary | GET | autonomy_twin_summary_view | Digital twin summary including lag, confidence, and risk score |
| /autonomy/planner/insights | GET | autonomy_planner_insights_view | Planner candidates, selected action, and rejected options |
| /autonomy/sensors/quality | GET | autonomy_sensors_quality_view | Sensor source confidence/freshness/anomaly quality matrix |
| /autonomy/slo/status | GET | autonomy_slo_status_view | Autonomy SLO snapshot for latency, confidence, and coverage targets |
| Endpoint | Method | Function | Responsibility |
|---|---|---|---|
| /trust_snapshot | GET | trust_snapshot | Current trust mode, policy state, and policy history |
| /verification_policy | POST | update_verification_policy | Runtime policy update surface |
| /llm_policy | GET | llm_policy_view | Exposes active LLM adapter validation policy |
| /join/policy | GET | join_policy_view | Join bootstrap policy and onboarding constraints |
| /join/invite | POST | create_join_invite | Issue join invites with bounded TTL and permissions |
| /join/register | POST | register_join_participant | Register participant certificates and join metadata |
| /join/registrations | GET | list_join_registrations | Admin listing of registered participants |
| /join/revoke/&lt;int:node_id&gt; | POST | revoke_join_participant | Administrative revocation of participant certificate |
| Endpoint | Method | Function | Responsibility |
|---|---|---|---|
| /training/start | POST | start_training | Trigger training start signal for HUD/ops flows |
| /training/stop | POST | stop_training | Trigger training halt signal |
| /training/status | GET | training_status | Current mocked training progress and metrics view |
| Endpoint | Method | Function | Responsibility |
|---|---|---|---|
| /metrics | GET | metrics | Prometheus exposition endpoint for tokenomics gauges |
| /health | GET | health | Tokenomics exporter liveness and source-file metadata |
| /event/tokenomics | POST | event_tokenomics | Ingest tokenomics events and persist canonical payload |
| Endpoint | Method | Function | Responsibility |
|---|---|---|---|
| /metrics | GET | metrics | Prometheus exposition endpoint for TPM/trust metrics |
| /metrics/summary | GET | metrics_summary | Aggregated TPM/trust summary snapshot |
| /health | GET | health | TPM exporter liveness and metadata |
| /event/attestation | POST | event_attestation | Ingest attestation event payloads |
| /event/message | POST | event_message | Ingest trust-related operational messages |
- Auth boundaries: `/join/registrations` and `/join/revoke/<int:node_id>` are admin-gated and require valid admin authorization headers.
- Auth boundaries: `/verification_policy` supports role-aware updates via `X-API-Role` and optional bearer token wiring.
- Auth boundaries: `/swarm/command` requires admin authorization and enforces role policy via `X-API-Role`.
- Auth boundaries: `/swarm/audit/recent` is admin-gated and intended for operator audit workflows.
- Auth boundaries: `/ai/interaction/decision` defaults to `AI_INTERACTION_DECISION_AUTH_MODE=admin_required`; set `AI_INTERACTION_DECISION_AUTH_MODE=public_local` only for local operator testing flows.
- Auth boundaries: `/trigger_fl` is admin-gated when `JOIN_API_ADMIN_TOKEN` is configured.
- Status code behavior: `/create_enclave` may return `202` while provisioning is in progress, then `200` once a stable state transition is reached.
- Status code behavior: `/trigger_fl` may return `202` for accepted async execution and non-2xx when round execution cannot proceed.
- Status code behavior: `/swarm/command` returns `403` when role policy denies a command and `429` when command rate limits trigger.
- Streaming semantics: `/ops/events` is an SSE endpoint and includes heartbeat events to keep long-lived clients connected.
- Streaming semantics: `/ops/events/recent` should be used to backfill timeline state before attaching to SSE.
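The backfill-then-attach pattern above reduces to a small merge step on the client. This sketch shows only that logic: the event shape (`{"id", "type", ...}`) is an assumption, and the HTTP/SSE transport is omitted so the example stays self-contained.

```python
def merge_event_streams(backfill, live):
    """Combine /ops/events/recent backfill with live SSE events.

    Drops heartbeat keepalives and de-duplicates the overlap between the
    backfill snapshot and the moment the SSE stream was attached.
    """
    seen = set()
    timeline = []
    for event in list(backfill) + list(live):
        if event.get("type") == "heartbeat":
            continue  # keepalive only, never timeline state
        if event["id"] in seen:
            continue  # duplicate across the backfill/live boundary
        seen.add(event["id"])
        timeline.append(event)
    return timeline

backfill = [{"id": 1, "type": "round_complete"}, {"id": 2, "type": "alert"}]
live = [{"id": 2, "type": "alert"},
        {"id": 0, "type": "heartbeat"},
        {"id": 3, "type": "policy_update"}]
print([e["id"] for e in merge_event_streams(backfill, live)])  # -> [1, 2, 3]
```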
- Open C2 HUD mode with http://localhost:3000/?view=c2.
- For operator usage and benchmark examples, see docs/C2_SWARM_HUD_QUICK_START.md.
- For full autonomy runbooks, see docs/AUTONOMY_DIGITAL_TWIN_OPERATIONS.md.
| Function | File | Responsibility |
|---|---|---|
| validate_llm_adapter_update | sovereignmap_production_backend_v2.py | Policy validation gate for incoming client updates |
| build_tokenomics_payload | sovereignmap_production_backend_v2.py | Constructs tokenomics publication payload from FL state |
| publish_tokenomics_event | sovereignmap_production_backend_v2.py | Sends tokenomics telemetry to exporter endpoint |
| publish_tpm_attestation_events | sovereignmap_production_backend_v2.py | Emits attestation events for trust metrics pipeline |
| run_flower_server | sovereignmap_production_backend_v2.py | Starts and configures Flower aggregation server |
| run_flask_metrics | sovereignmap_production_backend_v2.py | Starts Flask API plane for control and telemetry |
| create_app | tokenomics_metrics_exporter.py | Constructs exporter app and endpoint bindings |
| SelectBestCorrection | internal/autonomy/planner.go | Scores and selects policy-safe autonomous correction actions |
| PredictLinearMotion | internal/autonomy/predictor_orchestrator.go | Produces low-compute short-horizon digital twin predictions |
| ValidateReadiness | internal/autonomy/runtime_readiness.go | Enforces autonomy production readiness gates |
If you want the fastest production-feel walkthrough, run the two commands below and open HUD + Grafana immediately:
```sh
docker compose -f docker-compose.full.yml up -d --scale node-agent=5
make stack-verify
```

What this startup command does:
- Starts the full runtime profile from docker-compose.full.yml.
- Scales the `node-agent` service to 5 replicas.
- Brings up API, HUD, and observability services in one command.
Recommended verification sequence:
```sh
docker compose -f docker-compose.full.yml up -d --scale node-agent=5
docker compose -f docker-compose.full.yml ps
make stack-verify
```

Teardown after the walkthrough:

```sh
docker compose -f docker-compose.full.yml down
```

Expected outcome in under 2 minutes:
- API health endpoints respond on `:8000`.
- HUD is reachable at `http://localhost:3000`.
- Grafana is reachable at `http://localhost:3001`.
- Prometheus is reachable at `http://localhost:9090`.
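The expected-outcome checklist above can be scripted. A minimal standard-library sketch (the URL list mirrors the checklist; the 2-second timeout and the treatment of HTTP errors as "unreachable" are choices of this sketch, not runtime behavior):

```python
from urllib.request import urlopen
from urllib.error import URLError

# Surfaces taken from the expected-outcome checklist above.
SURFACES = {
    "API": "http://localhost:8000/health",
    "HUD": "http://localhost:3000",
    "Grafana": "http://localhost:3001",
    "Prometheus": "http://localhost:9090",
}

def check_surfaces(surfaces, timeout_s=2.0):
    """Return {name: bool} for each surface's HTTP reachability."""
    results = {}
    for name, url in surfaces.items():
        try:
            with urlopen(url, timeout=timeout_s) as resp:
                results[name] = 200 <= resp.status < 400
        except (URLError, OSError):
            # Covers refused connections, timeouts, and HTTP error codes.
            results[name] = False
    return results

if __name__ == "__main__":
    for name, ok in check_surfaces(SURFACES).items():
        print(f"{name:11s} {'OK' if ok else 'UNREACHABLE'}")
```

Running this after startup gives a one-glance pass/fail for the four surfaces before you open the HUD.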
Node-agent dependency profile:
`Dockerfile.node-agent` uses CPU-only PyTorch wheels (`torch==2.1.0+cpu`, `torchvision==0.16.0+cpu`) to reduce build size and keep local/CI chaos tests reliable.
- Go 1.25+
- Node.js 20+
- npm 10+
- Python 3.11+
- Docker with Compose plugin
```sh
git clone https://github.com/rwilliamspbg-ops/Sovereign_Map_Federated_Learning.git
cd Sovereign_Map_Federated_Learning
./genesis-launch.sh
```

```sh
docker compose -f docker-compose.full.yml up -d --scale node-agent=5
docker compose -f docker-compose.full.yml ps
```

```sh
curl -s http://localhost:8000/status | jq
curl -s http://localhost:8000/health | jq
curl -s http://localhost:8000/ops/health | jq
curl -s http://localhost:8000/training/status | jq
```

Control-plane mTLS verification (expected behavior when the mTLS control plane is enabled):
```sh
# This should fail without a client certificate.
curl -kfsS https://localhost:8080/p2p/info || echo "expected: mTLS client certificate required"
```

If this request succeeds without a client certificate, verify your control-plane TLS policy before production use.
Expected checkpoints:
- `/status` returns service identity and ports.
- `/ops/health` reports API, Flower, and Prometheus reachability.
- Frontend HUD is reachable at `http://localhost:3000`.
- Grafana is reachable at `http://localhost:3001`.
CLI flow:
```sh
# Trigger one global round
curl -s -X POST http://localhost:8000/trigger_fl \
  -H "X-Join-Admin-Token: ${JOIN_API_ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{}' | jq

# Verify round advanced and metrics updated
curl -s http://localhost:8000/metrics_summary | jq '.federated_learning.current_round, .federated_learning.current_accuracy, .federated_learning.current_loss'
curl -s http://localhost:8000/convergence | jq '.current_round, .current_accuracy, .current_loss'
```

Strict chaos churn validation:
```sh
JOIN_API_ADMIN_TOKEN="${JOIN_API_ADMIN_TOKEN}" \
SOAK_CHAOS_ENABLED=1 \
SOAK_CHAOS_STRICT=1 \
CHAOS_MIN_CLIENT_QUORUM=1 \
python3 tests/scripts/python/test_soak_chaos_guard.py
```

UI flow:
- Open `http://localhost:3000`.
- Confirm Network Operations HUD is the default landing view.
- Click Run Global FL Epoch.
- Confirm the live timeline shows a `TRAINING_ROUND` event and round metrics increment.
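Both flows end the same way: the round counter reported by `/convergence` should advance. A minimal sketch of that check, using payloads captured before and after the trigger (field names are taken from the jq filters above; the values shown are illustrative, not real run output):

```python
def assert_round_advanced(before, after):
    """Validate that one trigger_fl invocation moved the training loop.

    before/after are /convergence payloads captured around the trigger.
    Returns the number of rounds advanced, or raises AssertionError.
    """
    if after["current_round"] <= before["current_round"]:
        raise AssertionError(
            f"round did not advance: "
            f"{before['current_round']} -> {after['current_round']}"
        )
    return after["current_round"] - before["current_round"]

# Example with illustrative captured payloads.
before = {"current_round": 12, "current_accuracy": 0.81}
after = {"current_round": 13, "current_accuracy": 0.83}
print(assert_round_advanced(before, after))  # 1
```

The same comparison works against `/metrics_summary` by reading `.federated_learning.current_round` instead.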
Workflow verification (GitHub Actions):
```sh
gh run list --branch main --limit 20
```

```sh
docker compose -f docker-compose.full.yml up -d --scale node-agent=10
```

```sh
# Go and backend tests
go test ./... -count=1

# Monorepo package build and tests
npm ci
npm run build:libs
npm run test:ci

# Frontend build
npm --prefix frontend ci
npm --prefix frontend run build
```

```sh
make benchmark-fedavg-compare
```

To compare against a different baseline and output path:
```sh
BASE_REF=origin/main BENCH_RUNS=3 REPORT_PATH=results/metrics/fedavg_benchmark_compare.md make benchmark-fedavg-compare
```

Generated report: `results/metrics/fedavg_benchmark_compare.md`
CI workflow:
Before opening a PR, run the same fast checks maintainers use:
```sh
# Discover all available developer targets
make help

# Required baseline
make fmt
make lint
make test

# Recommended reproducibility smoke checks
make smoke

# Required visual evidence assets
make screenshots-check
```

For runtime-focused changes (HUD, observability, policy endpoints), include at least one local verification artifact in your PR description:
- `/health` and `/ops/health` output snippet.
- One screenshot from HUD or Grafana.
- Command log showing a successful `trigger_fl` round.
- Standard runtime sequence: docker-compose.full.yml
- Contribution guidelines: CONTRIBUTING.md
- Security policy: SECURITY.md
- Changelog: CHANGELOG.md
- License: LICENSE
If you want to contribute quickly, these areas have high impact and low setup friction:
- Test matrix expansion for TPM 2.0 hardware variants and Docker runtimes.
- Apple Silicon (MPS) acceleration optimization and benchmark baselines.
- Additional Grafana panel tuning for high-cardinality node fleets.
- Better synthetic fault workloads for Byzantine and partition simulation paths.
Jump directly to open issues by label:
- Good first issue: good first issue label
- Help wanted: help wanted label
- Documentation: documentation label
- Observability: observability label
Contribution process and coding standards are in CONTRIBUTING.md.
- Symptom: trust/attestation metrics stay flat or backend cannot initialize TPM flows.
- Check: container runtime must expose TPM devices and required permissions.
- Typical fix: run with explicit device mapping and appropriate group permissions for TPM access.
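One way to express that explicit device mapping as a Compose override — a sketch only: the service name, the `/dev/tpmrm0` device node, and the `tss` group are host- and deployment-dependent assumptions, so check your distribution's TPM device node and owning group before using it:

```yaml
# Hypothetical Compose override; adjust the device path and group for your host.
services:
  backend:
    devices:
      - /dev/tpmrm0:/dev/tpmrm0   # TPM 2.0 resource-manager device node
    group_add:
      - tss                       # group that owns /dev/tpmrm0 on many distros
```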
- Symptom: HUD shows backend unreachable or Grafana/Prometheus endpoints fail to bind.
- Check: local services already using frontend/backend ports.
- Typical fix: align Compose port mappings and frontend API base configuration so HUD and backend targets match.
Timestamp: 2026-03-19
- Tokenomics exporter handles directory-valued source path safely and persists payload without IsADirectoryError.
- HUD simulation controls are wired end-to-end (frontend action -> backend endpoint -> HUD counter surface).
- Backend and exporter modules compile successfully with Python syntax checks.
Verified green on main after latest changes:
- Build and Test
- Lint Code Base
- HIL Tests
- Reproducibility Check
- Governance Check
- Workflow Action Pin Check
- CodeQL Security Analysis
- Security Supply Chain
- Secret Scan
- Observability CI
- Build and Deploy
For release candidates, run one additional live smoke test using docker-compose.full.yml and verify `/health`, `/hud_data`, and `/event/tokenomics` before tagging.
Standalone report: SANITY_REPORT.md



