<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="rss.xsl"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>TalkOps Blog</title>
        <link>https://www.talkops.ai/blog</link>
        <description>TalkOps Blog</description>
        <lastBuildDate>Tue, 07 Apr 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[Talk to Your Cloud: Why We're Open-Sourcing the AWS Orchestrator Agent]]></title>
            <link>https://www.talkops.ai/blog/aws-orchestrator-open-source</link>
            <guid>https://www.talkops.ai/blog/aws-orchestrator-open-source</guid>
            <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[We just open-sourced a multi-agent framework that turns plain English into production-ready, sandbox-validated Terraform modules—no more 2 AM HCL debugging sessions.]]></description>
            <content:encoded><![CDATA[<p>We've all been there. It's 2:00 AM, you're staring at a terminal, and you just ran <code>terraform plan</code>. Your heart sinks as the screen flashes:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">Plan: 0 to add, 0 to change, 8 to destroy.</span><br></span></code></pre></div></div>
<p>You double-check every line of HCL, praying a small subnet change didn't just decide to recreate your entire production database. This is the <strong>"Toil Gap"</strong> of modern cloud operations—we spend roughly 30% of our IT budgets on cloud, yet we waste a massive chunk of that on idle resources, manual configuration errors, and tribal knowledge trapped in the heads of two senior engineers who are always on vacation when you need them.</p>
<p>Today, we're releasing the <strong><a href="https://github.com/talkops-ai/aws-orchestrator-agent" target="_blank" rel="noopener noreferrer">AWS Orchestrator Agent</a></strong> as open source. It's not a chatbot that hallucinates half-working scripts. It's a multi-agent framework, built on LangGraph, that researches live provider schemas, enforces security compliance, generates production-grade Terraform, validates it in a sandbox, and ships it to GitHub—all from a single natural language prompt.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-problem-cloud-complexity-has-outgrown-human-bandwidth">The Problem: Cloud Complexity Has Outgrown Human Bandwidth<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#the-problem-cloud-complexity-has-outgrown-human-bandwidth" class="hash-link" aria-label="Direct link to The Problem: Cloud Complexity Has Outgrown Human Bandwidth" title="Direct link to The Problem: Cloud Complexity Has Outgrown Human Bandwidth" translate="no">​</a></h2>
<p>For years, we've been the "mechanics" of the cloud—writing every line of HCL by hand, managing state file locks, and hunting down cryptic IAM permission errors that somehow only surface at 3 AM on a Friday.</p>
<p>The uncomfortable truth? Our cloud environments have become more complex than any one engineer can reliably manage. A single "simple" S3 bucket today requires versioning configuration, KMS encryption setup, public access blocks, lifecycle policies, logging, and tagging—easily 200+ lines of Terraform before you even think about variables and outputs. Multiply that across VPCs, RDS clusters, EKS node groups, and Lambda functions, and you start to see why infrastructure teams are perpetually underwater.</p>
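<p>To make that concrete, here is an abbreviated sketch of the baseline a hardened bucket needs (illustrative HCL, not the agent's actual output):</p>

```hcl
# Illustrative only — a slice of the boilerplate behind a "simple" secure bucket.
resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
  tags   = var.tags
}

resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
  }
}

resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
# …plus lifecycle rules, access logging, variables, and outputs.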
<p>We built the AWS Orchestrator Agent because we believe the industry is ready for a fundamental shift: <strong>from writing infrastructure code to orchestrating infrastructure intent</strong>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-makes-this-different-the-deep-agent-architecture">What Makes This Different: The "Deep Agent" Architecture<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#what-makes-this-different-the-deep-agent-architecture" class="hash-link" aria-label="Direct link to What Makes This Different: The &quot;Deep Agent&quot; Architecture" title="Direct link to What Makes This Different: The &quot;Deep Agent&quot; Architecture" translate="no">​</a></h2>
<p>Most AI coding tools follow a simple pattern: take a prompt, generate code, hope for the best. The AWS Orchestrator doesn't work that way. It uses a <strong>Deep Agent</strong> architecture—a multi-stage pipeline where specialized sub-agents each own a narrow slice of the problem.</p>
<p>Here's what actually happens when you type <em>"Create an S3 bucket with versioning and KMS encryption"</em>:</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="phase-1-research-before-writing">Phase 1: Research Before Writing<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#phase-1-research-before-writing" class="hash-link" aria-label="Direct link to Phase 1: Research Before Writing" title="Direct link to Phase 1: Research Before Writing" translate="no">​</a></h3>
<p>A dedicated <strong>Requirements Analyser</strong> agent doesn't guess what attributes an S3 bucket needs based on stale training data. Instead, it queries the <strong>live Terraform Registry</strong> through the <a href="https://www.talkops.ai/docs/concepts/mcp-integration">Model Context Protocol (MCP)</a> to fetch the latest AWS provider schemas, version constraints, and required inputs. Think of MCP as the "hands" that let the LLM "brain" reach out and touch real-world data sources in real time.</p>
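<p>The MCP plumbing itself lives inside the agent, but the "research before writing" idea is easy to illustrate. The sketch below queries the Terraform Registry's public HTTP API for AWS provider versions; the endpoint path and helper names are our assumptions for illustration, not the agent's code:</p>

```python
import json
import urllib.request

REGISTRY = "https://registry.terraform.io/v1/providers"


def versions_url(namespace: str, name: str) -> str:
    # e.g. .../v1/providers/hashicorp/aws/versions (assumed endpoint shape)
    return f"{REGISTRY}/{namespace}/{name}/versions"


def newest(versions: list[str]) -> str:
    # Naive semver ordering: compare (major, minor, patch) numerically,
    # so "5.31.0" correctly beats "5.4.1".
    return max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


def fetch_versions(namespace: str, name: str) -> list[str]:
    # Live lookup instead of stale training data.
    with urllib.request.urlopen(versions_url(namespace, name)) as resp:
        payload = json.load(resp)
    return [v["version"] for v in payload["versions"]]
```

<p>Pinning the generated module against what the registry actually serves today is what keeps the output from drifting behind the provider.</p>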
<p>A separate <strong>Security &amp; Best Practices</strong> agent then cross-references your request against SOC 2 and HIPAA compliance patterns—enforcing rules like "always use server-side encryption with KMS" and "never allow public access by default" before a single line of HCL is generated.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="phase-2-deterministic-code-generation">Phase 2: Deterministic Code Generation<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#phase-2-deterministic-code-generation" class="hash-link" aria-label="Direct link to Phase 2: Deterministic Code Generation" title="Direct link to Phase 2: Deterministic Code Generation" translate="no">​</a></h3>
<p>The planning phase produces a structured <code>SKILL.md</code> blueprint—essentially a contract that the code generator must follow. This eliminates the randomness problem. The <code>tf-generator</code> agent doesn't freestyle; it writes <code>main.tf</code>, <code>variables.tf</code>, <code>outputs.tf</code>, and <code>versions.tf</code> according to strict, hardcoded rules:</p>
<ul>
<li><strong>Never hardcode values.</strong> ARNs, regions, and IPs are always abstracted into variables.</li>
<li><strong>Enforce merge tagging.</strong> Every resource uses <code>tags = merge({"Name" = var.name}, var.tags)</code>.</li>
<li><strong>Pin provider versions.</strong> Breaking API updates don't silently destroy your infrastructure.</li>
<li><strong>Guard with conditionals.</strong> Resources use <code>count</code> or <code>for_each</code> with boolean gates for flexible composition.</li>
</ul>
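<p>Put together, those rules produce modules that look roughly like this (an illustrative fragment, not generated output):</p>

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # pinned: breaking provider updates can't sneak in
    }
  }
}

variable "create_kms_key" {
  type    = bool
  default = true
}

resource "aws_kms_key" "this" {
  count       = var.create_kms_key ? 1 : 0              # boolean gate for composition
  description = var.kms_key_description                 # never hardcoded
  tags        = merge({ "Name" = var.name }, var.tags)  # merge tagging
}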
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="phase-3-the-agent-tests-its-own-code">Phase 3: The Agent Tests Its Own Code<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#phase-3-the-agent-tests-its-own-code" class="hash-link" aria-label="Direct link to Phase 3: The Agent Tests Its Own Code" title="Direct link to Phase 3: The Agent Tests Its Own Code" translate="no">​</a></h3>
<p>This is the part that changes everything. The Orchestrator doesn't just generate code and hand it to you with a "good luck." It runs an <strong>internal evaluation loop</strong>:</p>
<ol>
<li>Generated code is flushed to a physical sandbox via <code>sync_workspace_to_disk()</code>.</li>
<li>The <code>tf-validator</code> agent executes <code>terraform init</code>, <code>terraform fmt</code>, and <code>terraform validate</code>.</li>
<li>If errors appear, the raw <code>stderr</code> is injected back into the graph state, and the generator <strong>rewrites the code automatically</strong>.</li>
<li>This loop continues until validation passes 100%.</li>
</ol>
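<p>Stripped of the LangGraph machinery, the loop reduces to a few lines. This is a hedged sketch of the control flow, not the project's actual implementation; the two callables stand in for the <code>tf-generator</code> and <code>tf-validator</code> agents:</p>

```python
def generate_validate_loop(generate, validate, max_attempts=5):
    """Regenerate code until validation passes, feeding stderr back in."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)    # tf-generator: uses prior errors, if any
        ok, stderr = validate(code)  # tf-validator: init / fmt / validate
        if ok:
            return code, attempt
        feedback = stderr            # inject raw stderr into the next pass
    raise RuntimeError(f"validation still failing after {max_attempts} attempts")
```

<p>The important property: the generator never sees a vague "it failed" — it sees the toolchain's own words.</p>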
<p>The validator acts as an un-bribeable code reviewer that catches every missing provider block, every invalid argument, and every syntax error—before you ever see a Pull Request.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="phase-4-human-approved-delivery">Phase 4: Human-Approved Delivery<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#phase-4-human-approved-delivery" class="hash-link" aria-label="Direct link to Phase 4: Human-Approved Delivery" title="Direct link to Phase 4: Human-Approved Delivery" translate="no">​</a></h3>
<p>Validated code hits a <strong>Human-in-the-Loop (HITL)</strong> gate. Nothing gets committed without your explicit approval. Upon approval, a JIT (Just-in-Time) GitHub agent uses the GitHub MCP Server to push code directly via API endpoints—no brittle <code>git clone</code> or <code>git add</code> shell commands that break in containerized environments.</p>
<p>The entire workflow follows a strict <strong>Propose → Approve → Ship</strong> philosophy. The agent is proactive, but never cowboy.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="try-it-in-60-seconds">Try It in 60 Seconds<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#try-it-in-60-seconds" class="hash-link" aria-label="Direct link to Try It in 60 Seconds" title="Direct link to Try It in 60 Seconds" translate="no">​</a></h2>
<p>You don't need to clone anything. Just create two files and run one command:</p>
<p><strong><code>.env</code></strong></p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">GOOGLE_API_KEY=your_google_api_key</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">GITHUB_PERSONAL_ACCESS_TOKEN=your_github_pat</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">TERRAFORM_WORKSPACE=./workspace/terraform_modules</span><br></span></code></pre></div></div>
<p><strong><code>docker-compose.yml</code></strong></p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">services</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">aws-orchestrator</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">image</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> sandeep2014/aws</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">orchestrator</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">agent</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">latest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">ports</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"10102:10102"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token key atrule" style="color:#00a4db">env_file</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> .env</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">volumes</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> ./workspace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">/app/workspace</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">restart</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> unless</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">stopped</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">talkops-ui</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">image</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> talkopsai/talkops</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">latest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span 
class="token key atrule" style="color:#00a4db">environment</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> TALKOPS_AWS_ORCHESTRATOR_URL=http</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//aws</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">orchestrator</span><span class="token punctuation" style="color:#393A34">:</span><span class="token number" style="color:#36acaa">10102</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">ports</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"8080:80"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">depends_on</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> aws</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">orchestrator</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" 
style="color:#00a4db">restart</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> unless</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">stopped</span><br></span></code></pre></div></div>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">docker compose up -d</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Open http://localhost:8080 and start talking to your cloud.</span><br></span></code></pre></div></div>
<p>The system uses a <strong>Three-Tier LLM Architecture</strong>—fast models for validation routing, mid-tier models for planning, and high-context models for deep code generation. It ships configured for Google Gemini, but swapping to OpenAI or Anthropic is a one-line <code>.env</code> change. Check the <a href="https://www.talkops.ai/docs/agents/infrastructure/aws-orchestrator/configuration">full configuration guide</a> for details.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="addressing-the-elephant-in-the-room-trust">Addressing the Elephant in the Room: Trust<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#addressing-the-elephant-in-the-room-trust" class="hash-link" aria-label="Direct link to Addressing the Elephant in the Room: Trust" title="Direct link to Addressing the Elephant in the Room: Trust" translate="no">​</a></h2>
<p><em>"I wouldn't give a new hire root access on day one, so why would I trust a bot?"</em></p>
<p>Fair. Here's how we think about it:</p>
<ol>
<li><strong>The agent generates code. It never runs <code>terraform apply</code>.</strong> Your existing CI/CD pipeline and approval process remain the final gate.</li>
<li><strong>Every decision is observable.</strong> Through our <a href="https://www.talkops.ai/docs/concepts/agent-architecture">A2UI streaming protocol</a>, you watch the agent's reasoning in real time—tool calls, security checks, validation results—not a black box.</li>
<li><strong>HITL gates are mandatory, not optional.</strong> The agent literally pauses execution and asks for human approval before any code leaves the sandbox.</li>
<li><strong>It's open source.</strong> You can audit every prompt, every tool binding, and every routing decision in the <a href="https://github.com/talkops-ai/aws-orchestrator-agent" target="_blank" rel="noopener noreferrer">source code</a>.</li>
</ol>
<p>We're not asking you to blindly trust AI with your infrastructure. We're asking you to let it do the research, write the first draft, and validate it—while you retain full veto power.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-landscape-how-teams-write-terraform-today">The Landscape: How Teams Write Terraform Today<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#the-landscape-how-teams-write-terraform-today" class="hash-link" aria-label="Direct link to The Landscape: How Teams Write Terraform Today" title="Direct link to The Landscape: How Teams Write Terraform Today" translate="no">​</a></h2>
<p>The Agentic DevOps space is heating up. If you're evaluating how your team produces infrastructure code, here's how the current approaches stack up:</p>
<table><thead><tr><th>Approach</th><th>Example</th><th>Trade-off</th></tr></thead><tbody><tr><td><strong>Write it by hand</strong></td><td>Engineers + HashiCorp docs</td><td>Full control, but slow and error-prone at scale</td></tr><tr><td><strong>General-purpose AI copilots</strong></td><td>GitHub Copilot, Amazon Q Developer</td><td>Fast autocomplete, but no validation—it generates code and hopes for the best</td></tr><tr><td><strong>Cloud-native generators</strong></td><td>AWS CDK, Pulumi</td><td>Strong typing and loops, but you're learning a new framework, not writing HCL</td></tr><tr><td><strong>Purpose-built deep agents</strong></td><td><strong>TalkOps AWS Orchestrator</strong></td><td>Researches live schemas, generates, self-validates, and ships—with human approval at every gate</td></tr></tbody></table>
<p>The key differentiator for us is the <strong>evaluation loop</strong>. Most tools in this space generate code and stop. The AWS Orchestrator generates, validates, self-corrects, and only then asks for approval. That closed feedback loop is what transforms "AI-assisted" into "AI-reliable."</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="whats-next">What's Next<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#whats-next" class="hash-link" aria-label="Direct link to What's Next" title="Direct link to What's Next" translate="no">​</a></h2>
<p>This is just the beginning. Here's what's on the roadmap:</p>
<ul>
<li><strong>Azure and GCP Orchestrators</strong> — The Deep Agent pattern is provider-agnostic. AWS is first, but the same architecture will extend to multi-cloud.</li>
<li><strong>Terraform State Import</strong> — Teaching the agent to understand and work with existing infrastructure, not just greenfield deployments.</li>
<li><strong>Cross-Agent Collaboration via A2A</strong> — The AWS Orchestrator can already talk to other TalkOps agents (like the CI-Copilot) over the Agent-to-Agent protocol. We're building more composable workflows where agents delegate to each other autonomously.</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="get-involved">Get Involved<a href="https://www.talkops.ai/blog/aws-orchestrator-open-source#get-involved" class="hash-link" aria-label="Direct link to Get Involved" title="Direct link to Get Involved" translate="no">​</a></h2>
<p>The cloud is getting bigger, but your team doesn't have to stay underwater. We built this in public because we believe the best infrastructure tools are the ones you can audit, extend, and trust.</p>
<ul>
<li>⭐ <strong><a href="https://github.com/talkops-ai/aws-orchestrator-agent" target="_blank" rel="noopener noreferrer">Star the repo</a></strong> to follow releases</li>
<li>📖 <strong><a href="https://www.talkops.ai/docs/agents/infrastructure/aws-orchestrator/overview">Read the docs</a></strong> for the full architecture deep-dive</li>
<li>💬 <strong><a href="https://discord.gg/2V8AAufgp6" target="_blank" rel="noopener noreferrer">Join our Discord</a></strong> to talk shop about multi-agent systems</li>
<li>🐛 <strong><a href="https://github.com/talkops-ai/aws-orchestrator-agent/issues" target="_blank" rel="noopener noreferrer">Open an issue</a></strong> if something breaks</li>
</ul>
<p>It might be time to stop writing Terraform and start <em>talking</em> to your infrastructure.</p>]]></content:encoded>
            <category>TalkOps</category>
            <category>AWS</category>
            <category>Terraform</category>
            <category>Open Source</category>
            <category>DevOps</category>
            <category>MCP</category>
        </item>
        <item>
            <title><![CDATA[From Ingress NGINX to Traefik: A Zero-Drama Migration Playbook (With AI Agents)]]></title>
            <link>https://www.talkops.ai/blog/ingress-nginx-to-traefik-migration-mcp</link>
            <guid>https://www.talkops.ai/blog/ingress-nginx-to-traefik-migration-mcp</guid>
            <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[If your cluster still runs ingress-nginx, you’re not alone. But the clock is ticking.]]></description>
            <content:encoded><![CDATA[<p>If your cluster still runs <code>ingress-nginx</code>, you’re not alone. But the clock is ticking.</p>
<p>By March 2026, the community <code>ingress-nginx</code> controller will be officially retired—moving into an unmaintained state with no new bug fixes or security patches. Running an internet-facing component without security updates is a ticking time bomb for production reliability and compliance.</p>
<p>Don’t panic. The Kubernetes Ingress API itself isn’t going anywhere, but you <em>must</em> swap out the controller underneath it. The ecosystem is slowly moving toward the Gateway API, making <strong>Traefik</strong> the perfect landing spot. Traefik supports your legacy Ingress objects natively while future-proofing you for the Gateway API—giving you a lift-and-shift path today, and modernization tomorrow.</p>
<p>Even better? You don't have to migrate hundreds of routes by hand. Let’s look at how the <strong>Traefik MCP Server by TalkOps.ai</strong> makes this a zero-drama migration.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="why-we-chose-traefik-and-why-you-should-too">Why We Chose Traefik (And Why You Should Too)<a href="https://www.talkops.ai/blog/ingress-nginx-to-traefik-migration-mcp#why-we-chose-traefik-and-why-you-should-too" class="hash-link" aria-label="Direct link to Why We Chose Traefik (And Why You Should Too)" title="Direct link to Why We Chose Traefik (And Why You Should Too)" translate="no">​</a></h2>
<p>When an infrastructure deadline looms, nobody wants to initiate a painful rewrite of 300+ YAML manifests.</p>
<p>Traefik provides features engineers actually care about out of the box—built-in Let's Encrypt, an intuitive dashboard, and robust observability. But the killer feature for this migration? <strong>Traefik can natively read your existing Ingress objects and gracefully interpret many of your legacy <code>nginx.ingress.kubernetes.io</code> annotations.</strong></p>
<table><thead><tr><th>Concern</th><th>With ingress-nginx after 2026</th><th>With Traefik</th></tr></thead><tbody><tr><td><strong>Security updates</strong></td><td>No new patches after retirement!</td><td>Actively maintained &amp; frequent releases</td></tr><tr><td><strong>Migration effort</strong></td><td>High risk, controller must be replaced eventually</td><td>Reuse most existing Ingress resources and logic</td></tr><tr><td><strong>Future standard</strong></td><td>Stuck with legacy Ingress API</td><td>Full Gateway API support for future-proofing</td></tr></tbody></table>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-mcp-powered-discovery--analysis-phase">The MCP-Powered Discovery &amp; Analysis Phase<a href="https://www.talkops.ai/blog/ingress-nginx-to-traefik-migration-mcp#the-mcp-powered-discovery--analysis-phase" class="hash-link" aria-label="Direct link to The MCP-Powered Discovery &amp; Analysis Phase" title="Direct link to The MCP-Powered Discovery &amp; Analysis Phase" translate="no">​</a></h2>
<p>A massive hurdle to any migration is the "unknown unknowns." Do you have bespoke <code>configuration-snippet</code> hacks buried in some <code>ecommerce</code> namespace? Do your developers use regex path rewrites that will get completely mangled?</p>
<p>By deploying the <strong>Traefik MCP Server</strong>, we hand off the entire discovery phase to our conversational AI agents.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-scan-the-cluster">1. Scan the Cluster<a href="https://www.talkops.ai/blog/ingress-nginx-to-traefik-migration-mcp#1-scan-the-cluster" class="hash-link" aria-label="Direct link to 1. Scan the Cluster" title="Direct link to 1. Scan the Cluster" translate="no">​</a></h3>
<p>Instead of blindly grepping manifests, ask your AI Agent to pull the inventory:</p>
<blockquote>
<p><em>"Scan all NGINX Ingress resources in the cluster and tell me their complexity."</em></p>
</blockquote>
<p>Behind the scenes, the agent queries <code>traefik://migration/nginx-ingress-scan</code>, pulling exact annotation values, hosts, and paths across all namespaces instantly.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-compatibility-analysis">2. Compatibility Analysis<a href="https://www.talkops.ai/blog/ingress-nginx-to-traefik-migration-mcp#2-compatibility-analysis" class="hash-link" aria-label="Direct link to 2. Compatibility Analysis" title="Direct link to 2. Compatibility Analysis" translate="no">​</a></h3>
<p>Next, we validate the ecosystem safely:</p>
<blockquote>
<p><em>"Analyze all NGINX Ingresses for Traefik compatibility. Which annotations have breaking changes?"</em></p>
</blockquote>
<p>Using <code>traefik://migration/nginx-ingress-analyze</code>, the agent categorizes every single annotation. Things like CORS (<code>enable-cors</code>) and IP whitelisting map smoothly to Traefik's ecosystem. But what about undocumented breaking configurations like custom NGINX Lua snippets? The AI tags these as <code>breakingAnnotations</code> immediately, allowing you to prioritize the risk.</p>
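<p>A clean mapping looks like this: the NGINX allowlist annotation becomes a first-class Traefik <code>Middleware</code> CRD (Traefik v3 syntax shown; v2 calls the option <code>ipWhiteList</code>, and the CIDR is a placeholder):</p>

```yaml
# Before: nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8"
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: office-allowlist
  namespace: ecommerce
spec:
  ipAllowList:
    sourceRange:
      - 10.0.0.0/8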
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="supervised-autonomy-solving-the-gotchas">Supervised Autonomy: Solving the "Gotchas"<a href="https://www.talkops.ai/blog/ingress-nginx-to-traefik-migration-mcp#supervised-autonomy-solving-the-gotchas" class="hash-link" aria-label="Direct link to Supervised Autonomy: Solving the &quot;Gotchas&quot;" title="Direct link to Supervised Autonomy: Solving the &quot;Gotchas&quot;" translate="no">​</a></h2>
<p>Not all NGINX hacks translate 1:1. This is where an MCP-powered agent truly shines through <strong>Agentic Override</strong>.</p>
<p>Let's say the scanner flags an unsupported <code>auth-signin</code> annotation. Instead of blocking the migration or forcing a manual YAML refactor, the operator and agent collaborate effortlessly:</p>
<p><strong>Operator:</strong></p>
<blockquote>
<p><em>"I see <code>auth-signin</code> is unsupported for the <code>admin</code> ingress. Please create a custom Traefik ForwardAuth middleware named <code>agent-custom-auth</code> pointing to <code>http://auth.internal</code> to replace it."</em></p>
</blockquote>
<p>The Agent automatically provisions the Traefik <code>Middleware</code> CRD. Then, we execute the migration payload by dynamically overriding the breakage:</p>
<p><strong>Operator:</strong></p>
<blockquote>
<p><em>"Run the full migration. Ignore the <code>auth-signin</code> annotation, and inject the <code>agent-custom-auth</code> middleware we just built into the routing."</em></p>
</blockquote>
<p>The tool (<code>traefik_nginx_migration</code>) executes the strategy, stripping the legacy annotations, merging the custom Middlewares, and seamlessly converting the routing spec. You just resolved a complex refactor entirely over chat!</p>
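<p>For reference, the <code>Middleware</code> the agent provisions in that exchange would look roughly like this (a sketch; the name and address come from the operator's prompt above):</p>

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: agent-custom-auth
spec:
  forwardAuth:
    address: http://auth.internal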
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-zero-downtime-execution-playbook">The Zero-Downtime Execution Playbook<a href="https://www.talkops.ai/blog/ingress-nginx-to-traefik-migration-mcp#the-zero-downtime-execution-playbook" class="hash-link" aria-label="Direct link to The Zero-Downtime Execution Playbook" title="Direct link to The Zero-Downtime Execution Playbook" translate="no">​</a></h2>
<p>So how do we cut over safely without breaking production at 2 AM?</p>
<ol>
<li><strong>Install Traefik in Parallel</strong>: Deploy Traefik in its own namespace using Helm. Enable the Ingress provider fallback so it actively reads the same generic Ingress resources. Both controllers are now happily co-existing and routing traffic from separate LoadBalancer IPs.</li>
<li><strong>Generate and Review (Dry-Run)</strong>: If you prefer strict GitOps, tell the agent: <em>"Read the migration runbook for the production namespace."</em> The MCP server outputs the complete proposed YAML (Middlewares, patched Ingress objects) for offline review.</li>
<li><strong>Execute the Migration</strong>: Ask the agent to apply the migration over the cluster. It generates and binds Traefik Middlewares (<code>IPAllowList</code>, <code>RateLimit</code>, <code>Headers</code>) directly into your Ingress routes via the robust <code>router.middlewares</code> annotation.</li>
<li><strong>Validate Traffic</strong>: Run basic <code>curl</code> tests against Traefik’s LoadBalancer IP. Confirm that TLS termination, headers, and redirects all function correctly before the DNS cutover.</li>
<li><strong>Progressive Shift</strong>: Gradually move your DNS A-records or load-balancer weights towards Traefik's IP. Leave <code>ingress-nginx</code> alive during the TTL crossover window as a fallback.</li>
</ol>
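<p>Steps 1 and 4 can be sketched as shell commands. The chart repo, release name, service name, and the <code>app.example.com</code> hostname are illustrative assumptions; substitute your own values:</p>
<pre><code class="language-shell"># Step 1 — deploy Traefik side-by-side in its own namespace,
# with the Kubernetes Ingress provider enabled:
helm repo add traefik https://traefik.github.io/charts
helm install traefik traefik/traefik \
  --namespace traefik --create-namespace \
  --set providers.kubernetesIngress.enabled=true

# Step 4 — resolve a hostname to Traefik's LoadBalancer IP
# and spot-check TLS, headers, and redirects before touching DNS:
TRAEFIK_IP=$(kubectl get svc traefik -n traefik \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -skI --resolve app.example.com:443:"$TRAEFIK_IP" https://app.example.com/
</code></pre>
<p>The <code>--resolve</code> flag lets you exercise the new controller with the real Host header and certificate while DNS still points at <code>ingress-nginx</code>.</p>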
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="clean-up--next-steps">Clean Up &amp; Next Steps<a href="https://www.talkops.ai/blog/ingress-nginx-to-traefik-migration-mcp#clean-up--next-steps" class="hash-link" aria-label="Direct link to Clean Up &amp; Next Steps" title="Direct link to Clean Up &amp; Next Steps" translate="no">​</a></h2>
<p>Once the Traefik metrics confirm smooth sailing and the old <code>ingress-nginx</code> pods show zero traffic, you can safely run <code>kubectl delete</code> on the legacy controller. Treat the March 2026 deadline as an opportunity to modernize your infrastructure's traffic posture, not just a frantic footnote.</p>
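<p>The teardown itself is only a couple of commands, assuming <code>ingress-nginx</code> was installed via Helm into its own namespace (adapt to your actual install method and release name):</p>
<pre><code class="language-shell"># Assumption: ingress-nginx was Helm-installed in the ingress-nginx namespace.
helm uninstall ingress-nginx --namespace ingress-nginx
kubectl delete namespace ingress-nginx
</code></pre>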
<p>By supercharging your workflow with AI agents and the <strong>Traefik MCP Server by TalkOps.ai</strong>, hours of manual YAML auditing and syntax porting drop to mere minutes of supervised execution.</p>
<p><em>Want to dive deeper into the technical execution? Check out our full <a href="https://www.talkops.ai/docs/integrations/traefik-mcp-server/workflows/nginx-migration">NGINX Migration Workflow Guide</a> to learn how to connect your Agent to the Traefik MCP Server today.</em></p>]]></content:encoded>
            <category>kubernetes</category>
            <category>ingress-nginx</category>
            <category>traefik</category>
            <category>gateway-api</category>
            <category>migration</category>
            <category>ai-agent</category>
            <category>talkops-mcp</category>
            <category>talkops-ai</category>
        </item>
        <item>
            <title><![CDATA[Welcome to the TalkOps Blog]]></title>
            <link>https://www.talkops.ai/blog/welcome-to-talkops</link>
            <guid>https://www.talkops.ai/blog/welcome-to-talkops</guid>
            <pubDate>Sun, 30 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Introducing the TalkOps blog — where we share updates on multi-agent DevOps automation, MCP server releases, and best practices for conversational infrastructure management.]]></description>
            <content:encoded><![CDATA[<p>We're incredibly excited to launch the TalkOps blog!</p>
<p>If you've been following the journey of DevOps, you know that managing modern infrastructure has become increasingly complex. Between juggling Terraform states, navigating dense Kubernetes manifests, and debugging CI/CD pipelines, engineers spend way too much time wrestling with tools instead of building great features.</p>
<p>That's exactly why we built TalkOps. We wanted to democratize platform engineering expertise through intelligent, conversational AI agents that act as an extension of your SRE team.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-to-expect-here">What to Expect Here<a href="https://www.talkops.ai/blog/welcome-to-talkops#what-to-expect-here" class="hash-link" aria-label="Direct link to What to Expect Here" title="Direct link to What to Expect Here" translate="no">​</a></h2>
<p>We created this space to share our learnings, product updates, and raw technical deep-dives as we build out the TalkOps ecosystem. Here's a look at what we'll be covering in the coming months:</p>
<ul>
<li><strong>Agent Spotlights:</strong> We'll crack open the hood on how our specialized agents (like the Kubernetes, CI-Copilot, and AWS Orchestrators) actually reason and execute commands dynamically.</li>
<li><strong>MCP Server Releases:</strong> As we build new Model Context Protocol servers to securely interact with Terraform, ArgoCD, and Helm, we'll post walkthroughs showing exactly how to use them in your own setups.</li>
<li><strong>Architecture &amp; Design:</strong> Expect transparent posts on how we scale LangGraph swarms, design A2A (Agent-to-Agent) communication protocols, and enforce strict governance guardrails.</li>
<li><strong>Tutorials:</strong> Real-world, step-by-step guides for automating your worst operational headaches.</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="join-the-conversation">Join the Conversation<a href="https://www.talkops.ai/blog/welcome-to-talkops#join-the-conversation" class="hash-link" aria-label="Direct link to Join the Conversation" title="Direct link to Join the Conversation" translate="no">​</a></h2>
<p>Open source is at the heart of everything we do. We're building this in public, and we want you to be a part of it.</p>
<p>If you're interested in tracking our releases, please consider starring or watching the <a href="https://github.com/talkops-ai" target="_blank" rel="noopener noreferrer">TalkOps GitHub organization</a>.</p>
<p>Got a specific topic you want us to cover? Or maybe you just want to talk shop about multi-agent systems and the future of DevOps? <a href="https://www.talkops.ai/services">Get in touch</a>—we’d love to hear from you.</p>
<p>Welcome aboard!</p>]]></content:encoded>
            <category>TalkOps</category>
        </item>
    </channel>
</rss>