☁️ Nephos: The Project Story

In the current cloud landscape, downtime isn't just a technical failure—it’s an expensive race against the clock. As DevOps teams scale across multi-cloud environments, they are increasingly trapped in a reactive cycle. When AWS services interact with specialized providers like Vultr, the resulting data silos create a "visibility gap."

Engineers are often left "dashboard hopping," manually correlating disparate telemetry while an outage is already in progress. This fragmented approach turns debugging into a guessing game, where linking a CPU spike on one provider to database latency on another can cost hours of uptime. This is where Nephos comes in.

What It Does

Nephos is a serverless observability platform that monitors system health across AWS and Vultr in real time. It uses an event-driven "Sentinel" to pull telemetry, stores it for both instant viewing and long-term analysis, and leverages Snowflake Cortex AI to predict failures before they happen.

  • Real-time Dashboards: Sub-100ms updates for live metrics
  • AI Insights: Automated anomaly detection and natural language summaries of system health
  • Unified View: A single pane of glass for multi-cloud orchestration

How I Built It

I orchestrated a serverless pipeline using AWS Lambda and Amazon EventBridge to poll telemetry from the Vultr API v2. I implemented a dual-layer storage architecture:

Speed Layer: Amazon DynamoDB stores real-time metrics for instant dashboard updates, acting as a buffer for the server telemetry being fed into the data pipeline.

Analytics Layer: Snowflake stores historical logs, and Cortex AI runs ML functions over them to detect latency anomalies.

The frontend is built with Next.js and Tailwind CSS, providing a high-performance UI for real-time monitoring.
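The dual-layer storage described above can be sketched roughly as follows. This is a minimal illustration, not the actual Nephos code: `MetricSample`, `fan_out`, and the two in-memory sinks are hypothetical names, and the sinks stand in for DynamoDB and Snowflake so the sketch runs without cloud credentials.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Iterable


@dataclass(frozen=True)
class MetricSample:
    """One telemetry reading (hypothetical shape, not the real Nephos schema)."""
    instance_id: str
    metric: str
    value: float
    ts_epoch: int  # seconds since the Unix epoch, UTC


# A sink is any callable that accepts a plain dict record.
Sink = Callable[[dict], None]


def fan_out(samples: Iterable[MetricSample],
            speed_sink: Sink,
            analytics_sink: Sink) -> int:
    """Write every sample to both layers and return how many were processed."""
    count = 0
    for sample in samples:
        record = asdict(sample)
        speed_sink(record)       # low-latency store read by the dashboard
        analytics_sink(record)   # append-only history for later analysis
        count += 1
    return count


# In-memory stand-ins for the two layers (assumptions for illustration only).
latest_by_key: dict = {}  # speed layer: newest record per (instance, metric)
history: list = []        # analytics layer: every record, in arrival order


def write_speed(record: dict) -> None:
    latest_by_key[(record["instance_id"], record["metric"])] = record


def write_analytics(record: dict) -> None:
    history.append(record)
```

In a real deployment, `write_speed` would presumably wrap a DynamoDB write and `write_analytics` would load rows into Snowflake. Keeping the sinks as plain callables makes the Sentinel logic testable without either service.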

See my tech stack diagram for more info!

Challenges I Faced

The biggest hurdle was cross-cloud data consistency. Vultr and AWS use different timestamp formats and enforce different API rate limits, so the Lambda "Sentinel" needed robust error-handling logic to ensure the data reaching Snowflake was clean and normalized. During the initial build, the Sentinel also polled the Vultr API so frequently that I hit its strict rate limits. I had to implement a back-off strategy and tune my polling intervals to keep 1-minute granularity without getting blocked.
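As a rough sketch of the two fixes described above, the snippet below normalizes mixed timestamp formats to UTC and retries a rate-limited call with exponential back-off and jitter. All names (`to_utc_iso`, `backoff_delays`, `call_with_backoff`, `RateLimited`) are hypothetical, and it assumes the caller's fetch function raises `RateLimited` when the API rejects a request. The actual Sentinel may handle this differently.

```python
import random
import time
from datetime import datetime, timezone
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Assumed to be raised by the caller's fetch function when rate limited."""


def to_utc_iso(value) -> str:
    """Normalize epoch seconds or an ISO 8601 string to a UTC ISO 8601 string."""
    if isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        # Older Python versions reject a trailing "Z" in fromisoformat().
        dt = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")


def backoff_delays(retries: int, base: float = 1.0,
                   cap: float = 30.0) -> Iterator[float]:
    """Yield delays base, 2*base, 4*base, ... each capped at `cap` seconds."""
    for attempt in range(retries):
        yield min(cap, base * (2 ** attempt))


def call_with_backoff(fetch: Callable[[], T], retries: int = 5,
                      base: float = 1.0, cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(), retrying up to `retries` times after RateLimited errors."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return fetch()
        except RateLimited:
            # Full jitter: wait a random time up to the current delay.
            sleep(random.uniform(0, delay))
    return fetch()  # final attempt; any exception propagates to the caller
```

Passing `sleep` as a parameter lets the retry loop be exercised in tests without actually waiting. The jitter spreads retries out so that concurrent invocations do not all hit the API at the same moment.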

What I Learned

I gained deep experience in multi-cloud orchestration. I learned how to move away from reactive monitoring toward predictive observability, and how AI-driven insights can significantly reduce recovery time for DevOps teams. I also developed a solid understanding of speed layers versus analytics layers, learning to use Amazon DynamoDB for sub-100ms real-time dashboard updates while using Snowflake for heavy historical data processing. Finally, I learned how to solo-hack: ideate, debug, and build all by myself.

What's next for Nephos

The most significant evolution for Nephos is moving from "Observing" to "Acting." Instead of just alerting a developer that a Vultr server is down, Nephos will attempt to fix it automatically. For example, if Snowflake Cortex AI detects a critical failure pattern, it will trigger a "Remediation Lambda." This function could automatically restart the Vultr instance or spin up a new instance from a snapshot of the server to restore service without human intervention.
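One way the remediation decision could look is sketched below. This is purely illustrative, since the feature is not built yet: the `Action` enum, the `choose_action` function, the 0.8 threshold, and the "restart first, then fall back to a snapshot" escalation policy are all assumptions, not part of the current design.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART = "restart"
    RESTORE_SNAPSHOT = "restore_snapshot"


def choose_action(anomaly_score: float,
                  restart_attempts: int,
                  threshold: float = 0.8,
                  max_restarts: int = 1) -> Action:
    """Pick a remediation step from an anomaly score in the range [0, 1].

    Hypothetical policy: do nothing below the threshold, try a restart
    first, and fall back to a snapshot restore once restarts are used up.
    """
    if anomaly_score < threshold:
        return Action.NONE
    if restart_attempts < max_restarts:
        return Action.RESTART
    return Action.RESTORE_SNAPSHOT
```

Separating the decision from the API calls that carry it out would let the policy be tested and tuned before it is allowed to touch live servers.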
