"Autonomous cloud infrastructure that remediates failures before humans can notice."
SHCIA (pronounced "shia") is an agentic AI platform designed for high-availability cloud environments. It operates as a multi-agent system that continuously monitors, diagnoses, and auto-remediates infrastructure failures across microservices. For known failure classes, remediation runs without human intervention; plans that do not clear the safety gate are escalated to a human.
Built with a FAANG-standard Monorepo architecture, SHCIA combines the power of LLM-based causal reasoning with the durability of infrastructure-as-code (Terraform/Helm) and the stability of state-machine orchestration (LangGraph).
graph TD
A[Observer Agent] -- Anomaly Signal --> B[Diagnosis Agent]
B -- Root Cause Report --> C[Planner Agent]
C -- Remediation Plan --> D{Safety Gate}
D -- Confidence > 95% --> E[Execution Agent]
D -- Confidence < 95% --> F[Human Intervention]
E -- Terraform/Helm/K8s --> G[Production EKS]
H[Chaos Agent] -- Injection --> G
I[Cost Guardian] -- Optimization --> G
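The Safety Gate branch in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not SHCIA's actual code: the `RemediationPlan` dataclass, its fields, and the `route` function are hypothetical names. Only the 95% threshold and the two destinations come from the diagram.

```python
from dataclasses import dataclass

# From the diagram: plans above 95% confidence go to the Execution Agent.
CONFIDENCE_THRESHOLD = 0.95


@dataclass(frozen=True)
class RemediationPlan:
    """Hypothetical shape of the Planner Agent's output."""
    root_cause: str
    action: str
    confidence: float  # expressed as a fraction, 0.0 to 1.0


def route(plan: RemediationPlan) -> str:
    """Return the name of the next node for a plan leaving the Safety Gate."""
    if plan.confidence > CONFIDENCE_THRESHOLD:
        return "execution_agent"
    # Anything at or below the threshold is handed to a human.
    return "human_intervention"
```

The diagram does not label the case where confidence is exactly 95%. This sketch escalates it, which is the conservative reading.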
| Module | Role | Technology |
|---|---|---|
| Observer | Real-time correlation of metrics and signals. | Prometheus, CloudWatch |
| Diagnosis | AI-powered RCA using service topology awareness. | LangChain, OpenAI GPT-4 |
| Planner | Generates safe, validated remediation plans. | Pydantic, Custom Logic |
| Execution | Durable infrastructure mutation with safety gates. | Terraform, Helm, Kubectl |
| Chaos | Proactive failure injection for resilience testing. | Kubernetes API |
| Cost Guardian | Cloud waste detection and right-sizing engine. | AWS Cost Explorer |
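As an illustration of the Observer's Prometheus dependency, the sketch below runs an instant query against the Prometheus HTTP API (`/api/v1/query`) and flattens the result. It is an assumption about how such a call could look, not the Observer's real implementation: the function names are hypothetical, and the fallback URL simply mirrors the `PROMETHEUS_URL` example used in this README.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: dict) -> dict:
    """Map each series' sorted label pairs to its latest sample as a float."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return {
        tuple(sorted(series["metric"].items())): float(series["value"][1])
        for series in data["result"]
    }


def query(promql: str, base_url: str = "") -> dict:
    """Run an instant query; needs a reachable Prometheus server."""
    base_url = base_url or os.environ.get("PROMETHEUS_URL", "http://localhost:9090")
    with urlopen(build_query_url(base_url, promql), timeout=5) as resp:
        return parse_vector(json.load(resp))
```

With a server running, `query("up")` would return one entry per scrape target. The PromQL expression is only an example.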
Ensure you have Docker and Make installed.
# Clone the repository
git clone https://github.com/Ismail-2001/Self-Healing-Cloud-Infrastructure-Agent-SHCIA-.git
cd Self-Healing-Cloud-Infrastructure-Agent-SHCIA-
# Install local development dependencies
make install-deps
Create a .env file at the root:
OPENAI_API_KEY=sk-....
SHCIA_AUTH_TOKEN=super-secret-token
PROMETHEUS_URL=http://localhost:9090
make build-all
docker-compose up -d
SHCIA is built on the principle of "Safety Above Speed". Every automated action must pass through our 4-Gate Validation Pipeline:
- Schema Validation: Ensures the plan is structurally sound and follows defined contracts.
- Policy Engine (OPA): Validates the action against organization-wide security and compliance.
- Blast Radius Assessment: Estimates how many services and how much traffic the remediation will affect.
- Confidence Gate: Only plans with >95% AI Confidence are auto-executed. Everything else is escalated for human review.
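The four gates can be read as a fail-fast chain: a plan is auto-executed only if every gate passes, in order. The sketch below shows that shape under stated assumptions. The `Plan` fields, the allow-list standing in for an OPA policy query, and the blast-radius limit are all hypothetical. Only the gate order and the >95% confidence rule come from the list above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical limits for illustration; real values would come from policy.
ALLOWED_ACTIONS = frozenset({"restart_pod", "scale_out", "rollback_release"})
MAX_AFFECTED_SERVICES = 2
MIN_CONFIDENCE = 0.95  # from the README: only >95% is auto-executed


@dataclass(frozen=True)
class Plan:
    action: str
    target_services: Tuple[str, ...]
    confidence: float  # fraction between 0.0 and 1.0


# A gate returns None to pass, or a rejection reason.
Gate = Callable[[Plan], Optional[str]]


def schema_gate(plan: Plan) -> Optional[str]:
    if not plan.action or not plan.target_services:
        return "action and target_services are required"
    if not 0.0 <= plan.confidence <= 1.0:
        return "confidence must be within [0, 1]"
    return None


def policy_gate(plan: Plan) -> Optional[str]:
    # Stand-in for a policy engine query; here a simple allow-list.
    if plan.action not in ALLOWED_ACTIONS:
        return f"action '{plan.action}' is not permitted"
    return None


def blast_radius_gate(plan: Plan) -> Optional[str]:
    affected = len(set(plan.target_services))
    if affected > MAX_AFFECTED_SERVICES:
        return f"{affected} services affected, limit is {MAX_AFFECTED_SERVICES}"
    return None


def confidence_gate(plan: Plan) -> Optional[str]:
    if plan.confidence <= MIN_CONFIDENCE:
        return "confidence too low, escalate for human review"
    return None


GATES: Tuple[Tuple[str, Gate], ...] = (
    ("schema", schema_gate),
    ("policy", policy_gate),
    ("blast_radius", blast_radius_gate),
    ("confidence", confidence_gate),
)


def validate(plan: Plan) -> Tuple[bool, str]:
    """Run the gates in order and stop at the first rejection."""
    for name, gate in GATES:
        reason = gate(plan)
        if reason is not None:
            return False, f"{name}: {reason}"
    return True, "auto-execute"
```

Putting the cheap structural check first and the confidence check last means a malformed or disallowed plan is rejected before its confidence score is ever considered.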
├── libs/ # Internal shared libraries (shcia-core, shcia-contracts)
├── services/ # Microservices (agents)
│ ├── observer/ # Monitoring & Anomaly Detection
│ ├── diagnosis/ # AI Causal Reasoning
│ └── execution/ # Infrastructure Mutation
├── iac/ # Infrastructure-as-Code (Terraform, CloudFormation)
├── ops/ # Operational standards (Runbooks, Dashboards, Workflows)
├── deploy/ # Deployment mechanisms (Helm Charts)
└── tests/ # Unit, Integration, and E2E Test Suites
We follow the standard FAANG Pull Request Flow. Please ensure all PRs include:
- Updated Unit/Integration tests.
- Updated ADRs (Architecture Decision Records) for new patterns.
- Successful make lint-all and make build-all results.
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ❤️ by Ismail Sajid and the SHCIA Engineering Team.