- Project: Dembrane ECHO GitOps • Repo type: mono (Terraform + Helm + Argo CD GitOps) (README.md:40, infra/main.tf:80)
- Entrypoints: Terraform stacks (`infra/`, `ai-infra/`), Argo CD apps (`echo-*`, `echo-monitoring-*`), Helm charts (`helm/echo`, `helm/monitoring`) (infra/main.tf:80-195, ai-infra/README.md:16-24, argo/echo-dev.yaml:1-23, helm/monitoring/Chart.yaml:1-6)
- License & governance: Business Source License 1.1 with GPLv3 transition; no CODEOWNERS/CONTRIBUTING tracked (LICENSE:1-25)
- Quickstart (≤3 commands)
  - `cd infra && terraform init -backend-config="bucket=<tf-state>" -backend-config="prefix=infra"` (README.md:107-120)
  - `terraform apply -var-file=terraform.tfvars` (infra/main.tf:9-15)
  - `kubectl apply -f argo/echo-dev.yaml && kubectl apply -f argo/echo-monitoring-dev.yaml` (README.md:142-149)
- Runtimes & package managers: Terraform ≥1.0, kubectl, Helm 3, kubeseal, doctl per prerequisites; Python 3 + `requests` for Loki tooling (README.md:84-103, scripts/LOKI_LOG_QUERY.md:5-17).
- Core libraries by role: DigitalOcean Terraform resources provision VPC/K8s/DB/cache/object storage (infra/main.tf:80-145); Argo CD Applications drive GitOps sync (argo/echo-dev.yaml:1-23); `helm/echo` manages API, Directus, worker tiers, and Neo4j (helm/echo/values.yaml:42-166); `helm/monitoring` delivers Prometheus/Grafana/Loki stack (helm/monitoring/values.yaml:1-61); `ai-infra` provisions Vertex AI endpoints and IAM (ai-infra/vertex/main.tf:1-20).
- Scripts you’ll actually use: `secret-manager.sh` for base64 edits, batch updates, and compares (secret-manager.sh:4-193); `scripts/query_logs.py` wraps Loki queries with chunking/pagination (scripts/LOKI_LOG_QUERY.md:29-132); `scripts/k6/sendChunks.js` replays participant uploads via k6 (scripts/k6/README.md:11-35).
- Code style (lint/format/type): Terraform providers are version-locked by `.terraform.lock.hcl`; run CLI formatters (`terraform fmt`, `helm lint`) locally as needed (infra/.terraform.lock.hcl:1-33).
- Modules/services & responsibilities: Terraform builds DigitalOcean infra then seeds namespaces/secrets; Helm deploys application workloads (API, workers, Directus, Neo4j) and monitoring stack (infra/main.tf:80-195, helm/echo/values.yaml:42-166, helm/monitoring/values.yaml:1-112).
- Data & external surfaces: Postgres, Redis, and Spaces are managed services; ingress exposes `directus`/`api` hostnames with TLS and monitoring endpoints with optional auth (infra/main.tf:101-145, helm/echo/values.yaml:143-159, helm/monitoring/values.yaml:1-60).
- Notable patterns: Argo CD auto-prune/self-heal enforces drift control; HPAs and priority classes tune scaling for core workloads (argo/echo-dev.yaml:18-23, helm/echo/templates/hpa-api-server.yaml:1-32, helm/echo/templates/priorityclass-echo-critical.yaml:1-11).
- Diagram → see `.agents/architecture.md`.
- Setup: Select the Terraform workspace, provide tfvars, export Spaces credentials, and follow kubeseal instructions before apply (infra/main.tf:3-52, README.md:107-133).
- Dev workflow: Update image tags and config in `helm/echo/values.yaml`, commit to `main`, and let Argo auto-sync dev/prod apps (helm/echo/values.yaml:1-40, argo/echo-dev.yaml:9-23).
- Test workflow (coverage): Use `scripts/k6/sendChunks.js` for load validation and `scripts/query_logs.py` for log triage; both require manual invocation (scripts/k6/README.md:11-35, scripts/LOKI_LOG_QUERY.md:29-132).
- CI gates: GitOps enforcement happens via Argo CD automated sync/prune; no separate CI workflows tracked in repo (argo/echo-monitoring-prod.yaml:18-23).
- Layout & naming: Repo split across `argo/`, `helm/`, `infra/`, `scripts/`, `secrets/`, and `ai-infra/` per README map (README.md:72-78).
- Commits & branches: All Argo apps follow `targetRevision: main`; protect that branch before enabling auto-sync in production (argo/echo-prod.yaml:9-10, NEED_HELP.md:17).
- Hooks & codegen: Secrets are edited via `secret-manager.sh`, which injects plaintext comments before base64 encoding; respect the workflow to avoid malformed manifests (secret-manager.sh:104-156).
- Gotchas: Managed Postgres/Spaces resources set `prevent_destroy`, so plan carefully before destructive changes; resealing secrets requires kubeseal access to each cluster (infra/main.tf:112-145, infra/main.tf:34-52).
- Secrets & env strategy: Maintain plaintext manifests locally, update with `secret-manager.sh`, seal with `kubeseal`, and apply per env (infra/main.tf:24-52, secret-manager.sh:4-189).
- Access control: Terraform issues namespaces and registry secrets tied to cluster credentials; Vertex service accounts are scoped via IAM bindings (infra/main.tf:170-195, ai-infra/vertex/main.tf:9-20).
- Observability/logging: Monitoring chart exposes Prometheus/Grafana/Loki with storage and ingress defaults; use the Loki helper script for precise queries (helm/monitoring/values.yaml:1-112, scripts/LOKI_LOG_QUERY.md:29-132).
- Perf & a11y: HPAs cap CPU/memory utilization and priority classes reserve capacity for critical pods—review before tuning resources (helm/echo/templates/hpa-api-server.yaml:1-32, helm/echo/templates/priorityclass-echo-critical.yaml:1-11).
- 2025-03-06: Initial repo scaffold and base configs (3a426a9)
- 2025-03-07: Terraform refactor aligning Argo + Kubernetes wiring (012f020)
- 2025-05-09: Added worker scheduler deployment for background jobs (434b671)
- 2025-05-15: Introduced LiteLLM sizing across deployments (c7a6eab)
- 2025-05-23: Merged Runpod/Whisper config updates for non-prod (3e6a86b)
- 2025-09-04: Landed monitoring production stack via Helm (214933d)
- 2025-09-04: Fixed Argo prod app to track `main` (2afbe72)
- 2025-09-09: Resolved Neo4j crash loops with probes/priority tweaks (e0b2b8a)
- 2025-10-08: Consolidated dev scaling adjustments for HPAs (6205791)
- 2025-10-13: Tuned prod scaling through dedicated merge (dccd907)
- 2025-10-13: Promoted backend image `v1.11.1` to prod (9164b35)
- Terraform automation: Workspace selection and apply steps remain manual → infra/main.tf:3
- Infra backlog: Azure LLMs + Runpod service provisioning TBD → NEED_HELP.md:4-6
- Secrets workflow: Move more secret management into Terraform/stateful tooling → NEED_HELP.md:6
- Frontend ops: Replace Vercel-managed front-end secrets with unified flow → NEED_HELP.md:7
- Helm polish: Add worker probes and optimize resource profiles → NEED_HELP.md:10-11
- Monitoring: Build richer dashboards and alerting coverage → NEED_HELP.md:13-15
- Governance: Enforce protection on `main` branch → NEED_HELP.md:17
- Logging: Strip `%2Cerror%2C` noise from Directus logs → NEED_HELP.md:19
- Task: Apply dev infrastructure
  - Context: Terraform in `infra/` manages DigitalOcean resources; dev uses the `default` workspace (infra/main.tf:3-97).
  - Steps:
    - UNRUN (safety) `cd infra && terraform init -backend-config="bucket=<tf-state>" -backend-config="prefix=infra"`
    - UNRUN (safety) `terraform workspace select default` (run after `init`, since workspace commands need an initialized backend)
    - UNRUN (safety) `terraform apply -var-file=terraform.tfvars`
  - Acceptance: Terraform apply finishes without pending changes and `kubernetes_namespace.echo-dev` reports up-to-date (infra/main.tf:170-175).
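The dev apply recipe can be rehearsed as a dry run before anything touches real infrastructure. The sketch below is illustrative only: `dev_apply_plan` is a hypothetical helper, not part of this repo, and it uses only the flags quoted in the recipe. It places `terraform init` first on the assumption that workspace commands need an initialized backend.

```python
import shlex


def dev_apply_plan(state_bucket: str, var_file: str = "terraform.tfvars") -> list[list[str]]:
    """Return the dev Terraform commands as argv lists without running them."""
    return [
        # Initialize the backend first so workspace commands can reach remote state.
        [
            "terraform", "init",
            f"-backend-config=bucket={state_bucket}",
            "-backend-config=prefix=infra",
        ],
        # Dev uses the default workspace according to the recipe.
        ["terraform", "workspace", "select", "default"],
        ["terraform", "apply", f"-var-file={var_file}"],
    ]


if __name__ == "__main__":
    # Print only. To execute for real, each argv could be passed to
    # subprocess.run(cmd, cwd="infra", check=True) after review.
    for cmd in dev_apply_plan("<tf-state>"):
        print(shlex.join(cmd))
```

Keeping the commands as argv lists rather than one shell string avoids quoting surprises if a bucket name or tfvars path ever contains spaces.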
- Task: Promote backend release to prod
  - Context: Argo watches the Helm chart and image tags in `values-prod.yaml` (helm/echo/values-prod.yaml:1-82, argo/echo-prod.yaml:9-23).
  - Steps:
    - Update `global.imageTag` and any config overrides in `helm/echo/values-prod.yaml`.
    - Commit to `main` with a meaningful message and push.
    - Confirm Argo sync + rollout status for `echo-prod`.
  - Acceptance: Argo shows `Sync: Synced` and pods run the new image tag without degraded HPAs.
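For the tag bump in the promotion recipe, a small text-level edit keeps the rest of the values file (comments, ordering) untouched. This is a minimal sketch under assumptions: `set_image_tag` is a hypothetical helper, and the sample layout with `imageTag` nested under `global` is inferred from the `global.imageTag` key name, not copied from the real `values-prod.yaml`.

```python
import re

# Matches the first "imageTag: <value>" line, with or without quotes around the value.
_IMAGE_TAG = re.compile(r'^(\s*imageTag:\s*)["\']?[^"\'\s#]+["\']?', re.MULTILINE)


def set_image_tag(values_text: str, new_tag: str) -> str:
    """Return values_text with the first imageTag value replaced by new_tag (quoted)."""
    updated, count = _IMAGE_TAG.subn(
        lambda m: f'{m.group(1)}"{new_tag}"', values_text, count=1
    )
    if count == 0:
        raise ValueError("no imageTag key found")
    return updated


if __name__ == "__main__":
    sample = "global:\n  imageTag: v1.11.0\n  replicas: 2\n"  # hypothetical content
    print(set_image_tag(sample, "v1.11.1"))
```

Because it only rewrites the first match, this sketch would need adjusting if a values file carried per-component `imageTag` overrides.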
- Task: Rotate a backend secret
  - Context: Secrets are stored as SealedSecrets; plaintext editing happens via the helper script (secret-manager.sh:4-189, secrets/sealed-backend-secrets-dev.yaml:1-24).
  - Steps:
    - Edit `secrets/backend-secrets-dev.yaml` via `./secret-manager.sh dev update`.
    - UNRUN (safety) `kubeseal ... < secrets/backend-secrets-dev.yaml > secrets/sealed-backend-secrets-dev.yaml` (infra/main.tf:34-39).
    - UNRUN (safety) `kubectl apply -f secrets/sealed-backend-secrets-dev.yaml`.
  - Acceptance: Argo reports healthy secrets sync and application pods consume the rotated value.
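The secret rotation recipe relies on the fact that values under a Kubernetes Secret's `data` field are base64-encoded. The sketch below illustrates only that encoding step; it does not reproduce what `secret-manager.sh` does, and the function names and the `USER` key are hypothetical.

```python
import base64


def encode_secret_data(plain: dict[str, str]) -> dict[str, str]:
    """Base64-encode each value, as a Secret's `data` field expects."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in plain.items()
    }


def decode_secret_data(encoded: dict[str, str]) -> dict[str, str]:
    """Reverse of encode_secret_data, handy for checking a manifest before sealing."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in encoded.items()
    }


if __name__ == "__main__":
    data = encode_secret_data({"USER": "admin"})  # hypothetical key and value
    print(data)  # {'USER': 'YWRtaW4='}
    assert decode_secret_data(data) == {"USER": "admin"}
```

Base64 is an encoding, not encryption, which is why the plaintext manifest stays local and only the sealed output is applied.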
- Task: Tune worker CPU scaling
  - Context: Worker CPU deployment and HPA values live in the Helm chart (helm/echo/values.yaml:96-111, helm/echo/templates/hpa-worker-cpu.yaml:1-24).
  - Steps:
    - Adjust `workerCpu` resources or replica limits in `helm/echo/values.yaml`.
    - Update HPA thresholds if required in `helm/echo/templates/hpa-worker-cpu.yaml`.
    - Commit and push; monitor Argo sync plus pod behavior.
  - Acceptance: Workers scale to the new limits without hitting throttling or OOM events.
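When choosing new HPA thresholds for the workers, it can help to estimate the replica count a change would produce. The sketch below applies the core formula from the Kubernetes HPA documentation, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min/max bounds. It deliberately omits the controller's tolerance band and stabilization window, and all numbers in the example are hypothetical rather than taken from the chart.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Estimate the HPA's desired replica count for a utilization metric."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # The HPA never scales outside its configured bounds.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas at 90% CPU against a 60% target, bounded to 2..10 replicas.
    print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))  # 6
```

If the estimate lands on `max_replicas` under normal load, that is a hint to revisit resource requests or the upper bound before committing.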
- Task: Enable monitoring basic auth
  - Context: Monitoring ingress has toggles for basic auth (helm/monitoring/values.yaml:1-59).
  - Steps:
    - Set `ingress.basicAuth.enabled: true` and provide credentials in `helm/monitoring/values-prod.yaml`.
    - Commit changes and ensure Argo sync completes.
    - Verify ingress prompts for auth before exposing dashboards.
  - Acceptance: Monitoring hosts require credentials and Grafana login works as expected.
- Task: Investigate recent errors via Loki
  - Context: Python helper script queries Loki with optional filters (scripts/LOKI_LOG_QUERY.md:29-132).
  - Steps:
    - UNRUN (safety) `kubectl port-forward svc/loki 3100:3100 -n monitoring`.
    - UNRUN (safety) `./scripts/query_logs.py --component api --hours 6 --text-contains "ERROR"`.
    - Export to CSV if escalation is needed.
  - Acceptance: Relevant log slices retrieved; findings shared or ticketed.
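To make the Loki recipe less opaque, here is a rough sketch of what a chunked query against Loki's `query_range` HTTP endpoint could look like. It is not the implementation of `scripts/query_logs.py`. The `component` label name, the chunk size, and all function names are assumptions; only the endpoint path, its `query`/`start`/`end`/`limit` parameters, and the forwarded port 3100 are taken as given.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def build_logql(component: str, text: str) -> str:
    """Build a LogQL line filter; the `component` label is an assumed label name."""
    return f'{{component="{component}"}} |= "{text}"'


def chunk_windows(end: datetime, hours: int, chunk_hours: int = 1):
    """Split the last `hours` before `end` into consecutive (start, end) windows."""
    cursor = end - timedelta(hours=hours)
    windows = []
    while cursor < end:
        nxt = min(cursor + timedelta(hours=chunk_hours), end)
        windows.append((cursor, nxt))
        cursor = nxt
    return windows


def query_range_url(base_url: str, logql: str, start: datetime, end: datetime,
                    limit: int = 1000) -> str:
    """Build a query_range URL; Loki accepts nanosecond Unix timestamps."""
    params = {
        "query": logql,
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": limit,
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    logql = build_logql("api", "ERROR")
    for start, end in chunk_windows(now, hours=6, chunk_hours=2):
        # With the port-forward active, each URL could be fetched via requests.get(url).
        print(query_range_url("http://localhost:3100", logql, start, end))
```

Splitting the range into windows keeps each response under the per-query `limit`, which is presumably why the helper script chunks and paginates.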
- Task: Reproduce participant upload flow
  - Context: k6 script sends WebM chunks against API endpoints (scripts/k6/README.md:11-35, scripts/k6/sendChunks.js:1-93).
  - Steps:
    - Place chunk files in `scripts/k6/audioChunks/`.
    - UNRUN (safety) `(cd scripts/k6 && k6 run sendChunks.js -e PROJECT_ID=<id> -e START=0 -e END=5)`.
    - Review API metrics and logs post-run.
  - Acceptance: Conversation lifecycle completes without errors; metrics show expected load.
- Task: Bootstrap Vertex AI endpoint
  - Context: `ai-infra/` includes state bucket + endpoint modules; GCP creds pulled from Argo (ai-infra/README.md:5-30, ai-infra/vertex/main.tf:1-20).
  - Steps:
    - Fetch and export the GCP service-account JSON from Argo via kube secret (ai-infra/README.md:5-12).
    - UNRUN (safety) `cd ai-infra/state && terraform init && terraform apply -auto-approve`.
    - UNRUN (safety) `cd ../vertex && terraform init ... && terraform apply -auto-approve`.
  - Acceptance: Vertex endpoint exists and service account bound to `roles/aiplatform.user`.
- Task: Add alert on worker timeouts
  - Context: Monitoring chart houses alertmanager config; NEED_HELP notes highlight worker timeouts (helm/monitoring/templates/alertmanager.yaml:1-120, NEED_HELP.md:21-25).
  - Steps:
    - Add a new alert rule in `helm/monitoring/templates/alertmanager.yaml` targeting worker timeout metrics/logs.
    - Provide Slack route details in `values-prod.yaml` if needed.
    - Commit and push; confirm Argo sync and alert delivery.
  - Acceptance: Alert rule visible in Alertmanager and fires when timeout condition reproduces.
- [E1] README.md:40 — Describes the GitOps purpose of this repository.
- [E2] infra/main.tf:80 — Terraform provisions the DigitalOcean VPC and Kubernetes cluster.
- [E3] helm/echo/values.yaml:42 — Echo workloads include Directus, API server, and worker tiers.
- [E4] helm/monitoring/values.yaml:1 — Monitoring stack covers Prometheus, Grafana, Loki, and storage.
- [E5] secret-manager.sh:4 — Secret management script handles list/update/batch/compare actions.
- [E6] scripts/LOKI_LOG_QUERY.md:5 — Loki helper depends on Python 3 and `requests`.
- [E7] scripts/k6/README.md:11 — k6 script documents the chunk upload workflow.
- [E8] ai-infra/README.md:5 — Vertex AI quickstart outlines credential fetch and Terraform apply.
- [E9] argo/echo-dev.yaml:18 — Argo CD auto-syncs with prune/self-heal enabled.
- [E10] NEED_HELP.md:4 — Outstanding infra/helm/monitoring tasks are tracked here.
- [E11] infra/.terraform.lock.hcl:1 — Provider lock file pins Terraform provider versions.
Refresh when:
- Fingerprint drift occurs (tracked path added/removed/changed) or adapters list changes.
- Topology changes: new/removed apps, services, workspaces, or major runtime shifts.
- Workflow/infra documentation changes (CI, Terraform, Helm, Argo, secrets) or policy docs update.
- TTL: 30 days since `generated_at_utc`.
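The TTL condition can be checked mechanically. The sketch below assumes `generated_at_utc` is stored as an ISO 8601 timestamp (for example `2025-01-01T00:00:00Z`); the storage format and the `is_stale` helper are assumptions, not something this file specifies.

```python
from datetime import datetime, timedelta, timezone


def is_stale(generated_at_utc: str, now: datetime, ttl_days: int = 30) -> bool:
    """Return True when more than ttl_days have passed since generated_at_utc."""
    # fromisoformat on older Python versions rejects a trailing "Z", so normalize it.
    generated = datetime.fromisoformat(generated_at_utc.replace("Z", "+00:00"))
    if generated.tzinfo is None:
        # Treat naive timestamps as UTC, matching the field name.
        generated = generated.replace(tzinfo=timezone.utc)
    return now - generated > timedelta(days=ttl_days)


if __name__ == "__main__":
    stamp = "2025-01-01T00:00:00Z"  # hypothetical value
    print(is_stale(stamp, datetime.now(timezone.utc)))
```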
Behavior:
- On refresh, recompute fingerprint, churn, architecture, deployment map, todos, and recipes in place.
- Preserve everything below the human-notes line verbatim.
- Keep edits scoped to `AGENTS.md` and `.agents/**`; never touch product code.
--- AUTO-GENERATED CONTENT ENDS --- (Human Notes below persist across refreshes. Use this zone for institutional knowledge and decisions.)
When you run the agent, capture novel architecture/workflow changes not reflected here. Batch up to five items, then ask: “Found new context; add a brief evidence-backed note?” Choices: [Append] [Revise] [Skip] [Ignore this item]. On approval, save .agents/inbox/<slug>.md and optional .agents/patches/ diff; never apply patches without explicit consent.