WaveChat: From Cloud Operations to Hardware Intelligence

The Journey

The Foundation: Cloud Operations & System Validation

My time in cloud operations and data center system validation taught me the reality of troubleshooting at scale.

In my previous roles, I spent countless hours analyzing system telemetry streams: CPU utilization traces, memory bandwidth graphs, thermal sensor logs, and network packet captures. Four lessons stuck with me:

  1. Raw data is useless without context. A 90% CPU spike means nothing without knowing which process, which cores, and when relative to other events.

  2. Visual patterns matter deeply. A waveform that looks "fine" might hide a 2-nanosecond glitch that cascades into data corruption three clock cycles later.

  3. Domain expertise is the bottleneck. Junior engineers would stare at logs for hours. Senior engineers would spot the issue in seconds by recognizing patterns.

  4. Automation scales understanding. I built dashboards and alerts that caught problems before they became incidents. But even good tooling couldn't bridge the gap between what the tools detected and what engineers needed to understand.

This experience planted a seed: What if we could teach machines to understand hardware the way experienced engineers do?

But here's the catch: a general-purpose LLM can't read raw waveform dumps or reason about nanosecond-scale timing on its own.

So I built an intelligent intermediary.


What We Built: The Architecture

WaveChat is a domain-specific analysis pipeline that feeds AI, not the other way around:

Raw VCD File
    ↓
[VCD Parser] → Extract all signal transitions + timescale
    ↓
[Signal Classifier] → Detect clocks, resets, FSMs, buses automatically
    ↓
[Timing Analyzer] → Calculate frequency, jitter, setup/hold violations, glitches
    ↓
[Protocol Decoders] → Decode I2C, SPI, UART transactions
    ↓
[SVG Renderer] → Generate interactive waveform visualization
    ↓
[Context Builder] → Synthesize findings into structured LLM prompt
    ↓
[LLM (Gemini)] → Natural language explanations
    ↓
User-Friendly Answers
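The data flow above can be sketched in a few lines of Python. Everything here is illustrative — the stage functions are toy stand-ins for WaveChat's actual modules, with a deliberately naive classifier:

```python
# Toy sketch of the pipeline's data flow; these stage functions are
# stand-ins for WaveChat's actual modules, not its real API.
def parse_vcd(transitions, timescale="1ns"):
    # Stage 1: in reality, parse the VCD file into per-signal transitions.
    return {"timescale": timescale, "transitions": transitions}

def classify_signals(model):
    # Stage 2 (naive): a signal with many perfectly regular transitions
    # is labeled CLOCK; everything else is lumped under DATA.
    kinds = {}
    for name, times in model["transitions"].items():
        periods = [b - a for a, b in zip(times, times[1:])]
        regular = bool(periods) and max(periods) == min(periods)
        kinds[name] = "CLOCK" if regular and len(times) > 4 else "DATA"
    model["kinds"] = kinds
    return model

def build_context(model):
    # Final stage before the LLM: synthesize findings into structured text.
    lines = ["=== SIGNAL CLASSIFICATION ==="]
    for name in sorted(model["kinds"]):
        n = len(model["transitions"][name])
        lines.append(f"{name}: {model['kinds'][name]}, {n} transitions")
    return "\n".join(lines)

model = parse_vcd({"clk": [0, 5, 10, 15, 20, 25], "data": [3, 17]})
print(build_context(classify_signals(model)))
```

The real classifier also measures duty cycle, jitter, and reset polarity before anything reaches the LLM; the point is that every stage enriches a shared model, and only the final text summary is sent to the model.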

Why ChatGPT cannot do this:

You can't just throw a VCD at ChatGPT. On its own, it can't:

  • Parse gigabyte-scale VCD dumps efficiently
  • Detect a 1.5-nanosecond glitch in a 1-second simulation
  • Correlate I2C transactions across clock domains
  • Calculate frequency stability (jitter) from transition patterns
  • Identify causality ("why did the ready signal drop?")
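Take the glitch case: once the transitions are parsed, detection reduces to scanning consecutive edges for pulses narrower than a threshold — trivial for a purpose-built analyzer, impractical for a chatbot reading raw text. A minimal sketch (illustrative, not WaveChat's actual code):

```python
def find_glitches(transitions, min_width_ns=2.0):
    """Flag pulses narrower than min_width_ns.

    transitions: time-sorted list of (time_ns, new_value) pairs for one
    signal. Illustrative sketch, not WaveChat's actual implementation.
    """
    glitches = []
    for (t0, v0), (t1, v1) in zip(transitions, transitions[1:]):
        width = t1 - t0
        if v0 != v1 and width < min_width_ns:
            glitches.append((t0, width))
    return glitches

# A 1.5 ns runt pulse at t=892 ns, buried among normal activity:
edges = [(0, 1), (100, 0), (892, 1), (893.5, 0), (1000, 1)]
print(find_glitches(edges))  # → [(892, 1.5)]
```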

The LLM Integration: Context is King

Classic mistake: Feed raw VCD text to an LLM and expect magic.

Instead, WaveChat builds structured context:

context = f"""
=== SIGNAL CLASSIFICATION ===
clk [1b]: CLOCK, 100.02 MHz, jitter: 0.21ns, 50.1% duty
rst_n [1b]: RESET (async), active-low, deasserts at 50ns
data[7:0] [8b]: BUS, 47 transitions, first change at 60ns
...

=== TIMING VIOLATIONS ===
Setup delay between data and clk: 2.3ns (OK, min required: 1.5ns)
Glitch detected: enable_n @ 892ns, duration 1.8ns
...

=== PROTOCOL ANALYSIS ===
I2C Controller Activity:
- START @ 100ns
- Address: 0x50, Direction: WRITE
- Data: 0xA5
- ACK received @ 928ns
...
"""

# Feed THIS to the LLM, not the raw waveform
response = gemini(system_prompt, user_question, context)

This gives WaveChat the ammunition to give intelligent answers grounded in actual hardware analysis.
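The protocol section of that context comes from decoders that walk the transition lists directly. As an illustration of the idea (not WaveChat's actual decoder), I2C START/STOP conditions fall out of a single rule — SDA changing while SCL is high:

```python
def i2c_start_stop(scl, sda):
    """Detect I2C START/STOP from (time, level) transition traces.

    START: SDA falls while SCL is high.  STOP: SDA rises while SCL is high.
    Simplified sketch; a real decoder also shifts in address/data bits
    and checks ACKs on the ninth clock.
    """
    def level_at(trace, t):
        # Level of a signal at time t, given its sorted transition list.
        lvl = trace[0][1]
        for ts, v in trace:
            if ts > t:
                break
            lvl = v
        return lvl

    events = []
    for (t0, v0), (t1, v1) in zip(sda, sda[1:]):
        if level_at(scl, t1) == 1:  # SDA edge while the clock is high
            if v0 == 1 and v1 == 0:
                events.append(("START", t1))
            elif v0 == 0 and v1 == 1:
                events.append(("STOP", t1))
    return events

scl = [(0, 1), (150, 0), (900, 1)]  # clock high, low, high again
sda = [(0, 1), (100, 0), (950, 1)]  # falls @100ns, rises @950ns
print(i2c_start_stop(scl, sda))  # → [('START', 100), ('STOP', 950)]
```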

The Challenges

1. Real-World VCD Parsing

The vcdvcd library is solid, but:

  • Some simulators output non-standard variant syntax
  • Timescale strings are parsed differently (1ns vs 1 ns vs 1e-9 s)
  • Very large VCD files (10GB+) can't fit in memory

Partial solutions:

  • Normalize timescale strings aggressively
  • Sample large files instead of loading entirely
  • Cache parsed results
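For the timescale problem, "normalize aggressively" mostly means collapsing whitespace and mapping unit suffixes to seconds. A sketch of the idea (illustrative, not the actual parser code):

```python
import re

# Seconds per unit suffix found in $timescale directives.
UNITS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9, "ps": 1e-12, "fs": 1e-15}

def normalize_timescale(raw):
    """Convert a VCD timescale string ('1ns', '1 ns', '1e-9 s') to seconds.

    Illustrative sketch of aggressive normalization, not WaveChat's parser.
    """
    text = raw.strip().replace(" ", "")  # '1 ns' -> '1ns', '1e-9 s' -> '1e-9s'
    m = re.fullmatch(r"([0-9.eE+-]+)(s|ms|us|ns|ps|fs)", text)
    if not m:
        raise ValueError(f"unrecognized timescale: {raw!r}")
    return float(m.group(1)) * UNITS[m.group(2)]

for raw in ("1ns", "1 ns", "1e-9 s", "10 us"):
    print(raw, "->", normalize_timescale(raw))
```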

A full solution would require a custom VCD parser (too complex for 24h).

2. Clock Domain Detection

Automatically finding which signals cross between clock domains is genuinely hard. I implemented a rough heuristic:

# Heuristic: clocks have regular periods; a signal whose transitions track
# two different clocks is likely a CDC (Clock Domain Crossing).
def find_clock_domain_crossings(signals, clocks, threshold=10):
    for sig in signals:
        edges_near_clock1 = count_transitions_near_clock(sig, clocks[0])
        edges_near_clock2 = count_transitions_near_clock(sig, clocks[1])

        if edges_near_clock1 > threshold and edges_near_clock2 > threshold:
            # High likelihood of CDC
            flag_as_cdc(sig)

This catches ~70% of real cases but has false positives. The LLM then filters based on context ("does this actually look like a synchronizer?").

Lesson: Algorithmic heuristics don't need to be perfect; they need to surface good candidates. The LLM handles the disambiguation.

Data & Results

During Hackathon

  • 24 hours of work → ~2000 lines of code (backend) + ~800 lines (frontend)
  • 5 analysis modules from scratch
  • Tested on 10+ real VCD files from benchmark suites (counter, UART, FIFO, FSM, ALU)
  • Streaming LLM integration working end-to-end

After Deployment

  • Uptime: 99.2% over 2 weeks (managed to not crash!)
  • Load test: 50 concurrent uploads handled fine
  • Response latency:
    • Upload + analysis: 2-5 seconds
    • Question answer (streaming): 3-8 seconds
    • Total: 5-13 seconds per query

The Bigger Picture: Why This Matters

Hardware debugging hasn't fundamentally changed in 20 years. Verification engineers still:

  1. Generate massive dumps
  2. Load them in specialized (expensive) tools
  3. Stare at waveforms manually
  4. Make guesses about causality

WaveChat sits at the intersection of algorithmic intelligence (domain-specific analysis) and AI (LLM explanations).

Key Takeaways

What Worked:

  • Building domain analysis first, LLM second
  • Ruthless scoping within time constraints
  • Streaming responses for UX
  • Visual feedback (SVG diagrams)
  • Simple heuristics + LLM filtering
  • Session caching + S3 backup

What Didn't:

  • Expecting the LLM to do novel signal processing
  • Trying to be "production-ready" in 24 hours
  • Expecting data persistence in stateless containers
  • Text-only chat interface
  • Perfect algorithmic solutions
  • Relying solely on in-memory storage
