Experience Engine — Our Story
Team Brick & Morty | UMD x Ironsite Startup Shell Hackathon 2026
What Inspired Us
It started with a Grey's Anatomy episode.
An older surgeon — unfamiliar with the newest robotic tools, slower on paper — beat a younger, technically superior resident in a live competition. Not because of knowledge. Because of muscle memory. Decades of micro-decisions baked into his hands. He didn't think about where to cut. He just knew.
That image stayed with us. And when we sat down with the Ironsite team and started looking at job site footage, we saw the same thing playing out on a construction site. The best mason on the crew doesn't think about how far to stand from the wall before setting a block. The best plumber doesn't consciously decide to check the joint before firing the torch. They just do it — automatically, efficiently, invisibly.
Then we noticed something that nobody in the room had talked about yet.
Body cam and helmet-mounted footage is fundamentally different data from security surveillance.
Surveillance footage watches what happens. Body cam footage watches how someone does it — from their own perspective, in their own space, at the exact distance and angle they chose to work from. That is not incidental. That is a gold mine of implicit behavioral data that existing construction AI products are not fully exploiting.
Ironsite had built impressive tooling for activity logging and site awareness. But the body cam angle — the spatial story of how an expert moves through their work — was sitting there underutilized. That was our opening.
What We Built
Experience Engine is an end-to-end pipeline that takes raw construction site footage and produces a queryable behavioral index of implicit expert behaviors — the things skilled workers do without knowing they're doing them.
The pipeline works like this:
1. Gemini segments and spatially analyzes the video. Each clip is broken into activity windows. For every window, Gemini extracts not just what the worker is doing but how their body is doing it — torso orientation, working distance, trajectory efficiency.
2. Claude classifies each clip into task types (check, risky, carry, install, communicate, idle) using LLM reasoning over the full context — tools, trade, phase, sequence. Not keywords. Intent. And as the model sees more videos, it dynamically generates new task types to fit what it learns.
3. Sequences of task types are grouped into implicit intent patterns. A check immediately followed by a risky action maps to Attention — the worker verified before acting. A check followed by idle maps to Hesitation — they paused because they hadn't pre-planned the next step. These mappings are how we encode instinct mathematically.
4. All of that data — scores, spatial observations, task sequences — is written to a structured JSON index that Claude can query with ./ee query. From that index, the model produces reasoning sentences like:
"The worker squares their torso and leans in at approximately 20 degrees to align the torch tip with the copper joint — a deliberate expert technique that ensures uniform heat distribution. This check-to-risky sequence, combined with a 0.3m working distance and a trajectory efficiency ratio of 0.95, signals a high-attention expert pattern that is correct and worth repeating."
No label in the source video said any of that. It was inferred from the data.
5. The AI generates targeted coaching tutorials from real footage. It writes and runs a script that renders a video tutorial directly from the indexed JSON. The agent is general-purpose: it can generate different kinds of tutorials, including one that shows what a worker did wrong, what the expert version looks like, and why, as a teaching artifact for the crew. The system then runs the generated tutorial back through Gemini to verify that its reasoning, spatial claims, and task mappings are accurate before delivering it to workers.
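Step 3's pairwise mapping can be sketched in a few lines. This is an illustrative reconstruction, not the actual Experience Engine code: the `check -> risky` and `check -> idle` rules come from the description above, while the names (`TaskType`, `IntentPattern`, `classifyIntent`) and the extra `carry -> install` rule are our own assumptions.

```typescript
// Task types produced by the classifier (step 2).
type TaskType = "check" | "risky" | "carry" | "install" | "communicate" | "idle";
type IntentPattern = "attention" | "hesitation" | "routine" | "unclassified";

// Adjacent task-type pairs encode implicit intent:
//   check -> risky : the worker verified before acting (attention)
//   check -> idle  : the worker paused without a pre-planned next step (hesitation)
// The carry -> install rule is a hypothetical extra example.
const PAIR_RULES: Record<string, IntentPattern> = {
  "check>risky": "attention",
  "check>idle": "hesitation",
  "carry>install": "routine",
};

// Slide over the sequence and classify each adjacent pair.
function classifyIntent(sequence: TaskType[]): IntentPattern[] {
  const patterns: IntentPattern[] = [];
  for (let i = 0; i + 1 < sequence.length; i++) {
    const key = `${sequence[i]}>${sequence[i + 1]}`;
    patterns.push(PAIR_RULES[key] ?? "unclassified");
  }
  return patterns;
}
```

Because the rules are a plain lookup table, adding a newly discovered pattern is a one-line change rather than a model retrain.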
What We Learned
The biggest lesson was about when to use AI and when to stay deterministic.
We went in thinking the AI could classify everything — intent, behavior, scores. We quickly learned it couldn't. LLMs are powerful at reading context and making qualitative judgments (what kind of task is this?), but they are unreliable for producing consistent numeric scores. The same segment described two different ways could yield wildly different numbers from the model.
The breakthrough was separating the two jobs cleanly:
- The AI classifies. It reads the activity, the tools, the trade, the sequence, and decides: this is check, this maps to attention, this spatial posture is lean_in_for_precision.
- The math scores. Given those classifications, deterministic formulas produce stable, auditable numbers.
This split — qualitative reasoning by the model, quantitative measurement by code — is the architecture that made the whole system work.
We also learned how to think about what data is worth storing for AI to reason over later. Not everything. The index entries need to be structured precisely enough that a language model can query them reliably, but rich enough that the reasoning it produces is meaningful. Finding that balance — what to compute up front, what to leave for the model to infer at query time — was a design challenge we iterated on throughout the hackathon.
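To make that balance concrete, here is a sketch of what one index entry might look like: structured enough to query deterministically, with a free-text observation left for the model to reason over at query time. All field names are assumptions, not the real Experience Engine schema, and `findByPattern` is a hypothetical stand-in for the kind of lookup `./ee query` might perform.

```typescript
// One entry in the behavioral index (illustrative schema).
interface IndexEntry {
  clipId: string;
  trade: string;            // e.g. "plumbing"
  taskSequence: string[];   // e.g. ["check", "risky"]
  intentPattern: string;    // e.g. "attention"
  spatial: {
    workingDistanceM: number;
    torsoAngleDeg: number;
    trajectoryEfficiency: number;
  };
  score: number;            // computed up front, deterministically
  observation: string;      // left for the model to interpret at query time
}

// Structured fields support exact queries; the observation string
// stays free-form for the model's reasoning.
function findByPattern(index: IndexEntry[], pattern: string): IndexEntry[] {
  return index.filter((e) => e.intentPattern === pattern);
}
```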
The Challenges We Faced
Classifying behavioral intent is genuinely hard.
We believe AI struggles to detect implicit behavior because implicit expertise is rarely documented, which means it almost never appears in labeled training data.
Activity labels like "applying mortar" or "positioning pipe" do not tell you what the worker intends. A worker positioning a pipe could be installing it, checking its alignment, or just moving it out of the way. Getting the model to reliably distinguish these from context — without overfitting to keywords — required careful prompt design and a lot of iteration on the task vocabulary.
Collecting enough data for the AI to reason well.
The more behavioral data in the index, the richer and more specific the model's reasoning becomes. With only a handful of videos, some patterns are underrepresented and the model generalizes too broadly. The feature functions work, the scores are correct — but the reasoning sentences are most impressive when the model has seen enough variation to know what is typical vs. exceptional. We hit the edge of our dataset size.
Spatial signals are noisy from body cam footage.
Body cam video moves. The worker's head turns, the camera shakes, the angle changes mid-segment. Extracting consistent spatial features — especially trajectory efficiency and working distance — from footage that is itself attached to the subject required Gemini to do real inference rather than simple measurement. Sometimes it was right. Sometimes it needed correction. Building confidence in those spatial signals is an ongoing challenge.
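One simple way to tame that per-frame noise, sketched here as an assumption rather than what the pipeline actually does, is to smooth the raw estimates before computing features, e.g. a trailing moving average over per-frame working-distance readings:

```typescript
// Trailing moving average over noisy per-frame estimates
// (e.g. working distance in meters). Window size is illustrative.
function movingAverage(values: number[], window: number): number[] {
  return values.map((_, i) => {
    const start = Math.max(0, i - window + 1);
    const slice = values.slice(start, i + 1);
    return slice.reduce((a, b) => a + b, 0) / slice.length;
  });
}
```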
Why This Matters
The construction industry loses billions of dollars a year to inefficiency, rework, and safety incidents that stem from skill gaps nobody can see. The knowledge to close those gaps already exists — it's in the hands and bodies of the most experienced workers on every crew. It just isn't accessible.
Experience Engine makes it accessible. Not by asking the expert to explain what they know — they often can't. But by watching them work, finding the patterns in how they move, and turning those patterns into something a new worker can learn from.
The frontier we pushed is this: AI that can explain implicit behavior, not just detect it. Every prior system could tell you what someone did. We built a system that can tell you why it matters — and teach it to someone else.
Built With
- TypeScript
- Runtime: Bun
- AI: Claude (agent + chat), Google Gemini (video analysis)
- API: Hono
- Web: Astro, React, shadcn/ui, Tailwind
- Video: FFmpeg
- Validation: Zod
- Turborepo