Video today is passive.
You scroll. You scrub. You guess timestamps.
SceneIQ changes that.
It transforms video into queryable structured intelligence — not frame detection, not static labeling, but temporal reasoning over evolving events.
| Modern Systems Fall Short | SceneIQ Delivers |
|---|---|
| Detect objects per frame | Scene-level abstraction |
| Ignore temporal continuity | Persistent identity modeling |
| Miss entity relationships | Motion reasoning |
| Cannot retrieve narrative moments | Interaction inference |
| Depend on heavy black-box transformers | Importance ranking — locally |
User types:
"Football World Cup trophy celebration"
Within seconds, SceneIQ:
- Detects trophy object
- Identifies crowd motion spike
- Tracks multi-player clustering
- Detects raised-arm celebration pattern
- Computes interaction density
- Ranks scene importance
Returns:
- Scene: Trophy Lift Celebration
- Timestamp: 01:42:18 – 01:44:03
- Importance Score: 0.93
- Motion Intensity: High
- Interaction Density: High
- Detected Objects: trophy, players
Video jumps instantly to the exact moment. No timeline scrubbing. No manual editing. Just the moment.
This is not keyword search. This is structured temporal reasoning.
SceneIQ models video as:
| Dimension | What It Models |
|---|---|
| Entities | Persistent identity tracking |
| Objects | Multi-object detection |
| Motion | Velocity-based reasoning |
| Interaction | Spatial relationship inference |
| Narrative | Scene boundary modeling |
| Importance | Scene scoring & ranking |
Each scene is represented as a semantic unit:
| Component |
|---|
| Time boundary |
| Persistent entities |
| Objects |
| Motion intensity |
| Interaction graph |
| Importance score |
Video → Structured Scene Graph.
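As a rough illustration, one scene-level unit could be held in a small record like the one below. This is only a sketch: the class name `Scene`, its field names, and the `format_timestamp` helper are assumptions made for this example, not SceneIQ's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    """One scene-level semantic unit (hypothetical field names)."""
    start_s: float                      # time boundary: start, in seconds
    end_s: float                        # time boundary: end, in seconds
    entity_ids: set[int] = field(default_factory=set)   # persistent track IDs
    objects: set[str] = field(default_factory=set)      # detected object labels
    motion_intensity: float = 0.0       # aggregate motion, assumed in [0, 1]
    # Interaction graph stored as (subject_id, relation, object_id) edges.
    interactions: list[tuple[int, str, int]] = field(default_factory=list)
    importance: float = 0.0             # importance score, assumed in [0, 1]

    @property
    def duration_s(self) -> float:
        """Length of the scene in seconds."""
        return self.end_s - self.start_s


def format_timestamp(seconds: float) -> str:
    """Render a position in seconds as HH:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


scene = Scene(start_s=6138, end_s=6243, objects={"trophy", "players"}, importance=0.93)
print(format_timestamp(scene.start_s), "-", format_timestamp(scene.end_s))
```

Keeping every component on one record means retrieval can filter and rank scenes without going back to raw frames.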
1. Video Input
2. Structural Scene Segmentation
3. Motion-Aware Frame Sampling
4. YOLOv8 Object Detection
5. Persistent Multi-Object Tracking
6. Velocity & Motion Modeling
7. Interaction Graph Construction
8. Scene Importance Scoring
9. Semantic Indexing
10. Timestamp-Accurate Retrieval
- HSV histogram comparison
- Structural similarity metrics
- Temporal smoothing
- Narrative-consistent boundary grouping
**Result:** Scene-level units, not raw frames.
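The histogram part of this step can be sketched in a few lines. The example assumes per-frame histograms have already been computed and normalized to sum to 1 (for instance from the HSV channels). It uses histogram intersection as the comparison and an exponential moving average as one possible form of temporal smoothing. The function names, `threshold`, and `alpha` are illustrative assumptions, and the structural similarity check is left out.

```python
def hist_distance(h1, h2):
    """1 minus histogram intersection, for histograms that each sum to 1."""
    return 1.0 - sum(min(a, b) for a, b in zip(h1, h2))


def scene_boundaries(histograms, threshold=0.5, alpha=0.3):
    """Return the frame indices at which a new scene starts.

    Each frame is compared against a running (exponentially smoothed)
    reference histogram of the current scene, so small frame-to-frame
    noise does not trigger a cut.
    """
    if not histograms:
        return []
    boundaries = []
    reference = list(histograms[0])
    for i in range(1, len(histograms)):
        current = histograms[i]
        if hist_distance(reference, current) > threshold:
            boundaries.append(i)
            reference = list(current)  # restart the reference for the new scene
        else:
            # Blend the current frame into the running reference.
            reference = [(1 - alpha) * r + alpha * c
                         for r, c in zip(reference, current)]
    return boundaries


frames = [[1.0, 0.0, 0.0]] * 3 + [[0.0, 1.0, 0.0]] * 3
print(scene_boundaries(frames))
```

Consecutive boundaries then delimit the scene-level units that later stages operate on.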
Each tracked entity maintains a continuous trajectory.
Track ID continuity enables:
- Long-term identity preservation
- Behavior evolution tracking
- Cross-frame reasoning
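To make track ID continuity concrete, here is a minimal greedy tracker that matches boxes across frames by intersection over union (IoU). It is a simplified stand-in chosen for illustration, not a description of the tracker SceneIQ uses. The class name `IouTracker` and the `min_iou` parameter are assumptions, and stale tracks are never retired in this sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class IouTracker:
    """Greedy frame-to-frame matcher that keeps a trajectory per track ID."""

    def __init__(self, min_iou=0.3):
        self.min_iou = min_iou
        self.next_id = 1
        self.last_boxes = {}     # track ID -> most recent box
        self.trajectories = {}   # track ID -> list of (frame_index, box)

    def update(self, frame_index, boxes):
        """Assign a track ID to each box; returns the IDs in input order."""
        unmatched = dict(self.last_boxes)  # tracks still available this frame
        ids = []
        for box in boxes:
            best_id, best_score = None, self.min_iou
            for track_id, previous in unmatched.items():
                score = iou(previous, box)
                if score >= best_score:
                    best_id, best_score = track_id, score
            if best_id is None:
                # No sufficiently overlapping track: start a new identity.
                best_id = self.next_id
                self.next_id += 1
            else:
                del unmatched[best_id]
            self.last_boxes[best_id] = box
            self.trajectories.setdefault(best_id, []).append((frame_index, box))
            ids.append(best_id)
        return ids


tracker = IouTracker()
print(tracker.update(0, [(0, 0, 10, 10), (100, 100, 110, 110)]))
print(tracker.update(1, [(101, 100, 111, 110), (1, 0, 11, 10)]))
```

The per-ID trajectory list is what makes behavior evolution and cross-frame reasoning possible downstream.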
Velocity is classified into:
| Class | Description |
|---|---|
| stationary | No movement |
| walking | Low velocity |
| running | High velocity |
| vehicle_motion | Wheeled motion |
| fast_object | Projectile / fast-moving item |
Detects: goals, celebrations, action spikes, and high-energy events.
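A velocity classifier along these lines could look like the sketch below. The five class names come from the table above. Everything else is an assumption for illustration: the pixel-per-second thresholds, the `VEHICLE_LABELS` set, and the rule that only non-person objects can be labeled `fast_object`.

```python
import math

# Assumed set of detector labels treated as wheeled vehicles.
VEHICLE_LABELS = {"car", "truck", "bus", "motorcycle", "bicycle"}


def speed(p0, p1, dt):
    """Speed in pixels per second between two centroids dt seconds apart."""
    return math.hypot(p1[0] - p0[0], p1[1] - p0[1]) / dt


def classify_motion(label, px_per_s, still=5.0, walk=80.0, fast=400.0):
    """Map an object label and its speed to one of the five motion classes.

    The thresholds are placeholders; real values depend on resolution,
    frame rate, and camera distance.
    """
    if px_per_s < still:
        return "stationary"
    if label in VEHICLE_LABELS:
        return "vehicle_motion"
    if label != "person" and px_per_s >= fast:
        return "fast_object"
    return "walking" if px_per_s < walk else "running"


print(classify_motion("person", speed((0, 0), (3, 4), 0.5)))
```

Because the rules are plain thresholds, the same input always yields the same class, which keeps the motion stage easy to inspect.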
Entities become nodes. Spatial proximity and temporal overlap form edges.
Scene graph examples:
- person driving car
- player holding trophy
- man speaking_to woman
Enables semantic reasoning beyond detection.
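The edge rule (spatial proximity plus temporal overlap) can be sketched as follows. The input layout, the function name `build_interaction_edges`, and both thresholds are assumptions for this example. The sketch only decides whether two entities interact; labeling the relation (driving, holding, speaking_to) would need additional rules that are not shown here.

```python
import math
from itertools import combinations


def build_interaction_edges(tracks, max_distance=50.0, min_shared_frames=3):
    """Connect entities that stay close to each other over shared frames.

    tracks maps a track ID to {frame_index: (cx, cy)} centroids.
    Returns a list of (id_a, id_b, close_frame_count) edges.
    """
    edges = []
    for a, b in combinations(sorted(tracks), 2):
        # Temporal overlap: frames in which both entities are present.
        shared = tracks[a].keys() & tracks[b].keys()
        # Spatial proximity: count shared frames where centroids are near.
        close = sum(
            1 for frame in shared
            if math.dist(tracks[a][frame], tracks[b][frame]) <= max_distance
        )
        if close >= min_shared_frames:
            edges.append((a, b, close))
    return edges


tracks = {
    1: {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0)},
    2: {1: (5, 0), 2: (6, 0), 3: (7, 0)},
    3: {0: (500, 500), 1: (500, 500), 2: (500, 500)},
}
print(build_interaction_edges(tracks))
```

The number of edges per scene gives a natural measure of interaction density for the ranking stage.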
Scenes are ranked by:
- Motion intensity
- Entity count
- Interaction density
The system surfaces moments that matter.
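One simple way to combine the three signals is a weighted sum. The sketch below assumes motion intensity and interaction density are already scaled to [0, 1]. The weights, the `max_entities` normalizer, and the dictionary keys are illustrative assumptions, not SceneIQ's actual formula.

```python
def clamp01(x):
    """Clamp a value to the range [0, 1]."""
    return max(0.0, min(1.0, x))


def importance_score(motion, entity_count, interaction_density,
                     weights=(0.4, 0.2, 0.4), max_entities=20):
    """Weighted sum of the three ranking signals, each clamped to [0, 1]."""
    signals = (
        clamp01(motion),
        clamp01(entity_count / max_entities),
        clamp01(interaction_density),
    )
    return sum(w * s for w, s in zip(weights, signals))


def rank_scenes(scenes):
    """Sort scene dicts by importance score, highest first."""
    return sorted(
        scenes,
        key=lambda s: importance_score(s["motion"], s["entities"], s["interaction"]),
        reverse=True,
    )


scenes = [
    {"name": "warm-up", "motion": 0.1, "entities": 2, "interaction": 0.1},
    {"name": "trophy lift", "motion": 0.9, "entities": 10, "interaction": 0.8},
]
print([s["name"] for s in rank_scenes(scenes)])
```

A linear score keeps the ranking explainable: each scene's position can be traced back to the three input signals.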
User query:
"man driving car scene"
Converted into constraints:
- person detected
- vehicle detected
- vehicle_motion > threshold
- spatial overlap: person inside car region
Matched scenes are ranked by importance and returned with exact timestamps.
Deterministic. Explainable. Local.
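The constraint check for the driving query can be sketched as a plain predicate over indexed scenes. The scene dictionary layout, the function names, and the threshold value are assumptions for this example. It also simplifies "vehicle" to the label `car` and reads "person inside car region" as the person's box center falling inside the car's box.

```python
def center_inside(inner, outer):
    """True if the center of box `inner` lies within box `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def matches_driving_query(scene, motion_threshold=0.5):
    """Check the four constraints derived from "man driving car scene"."""
    boxes = scene["objects"]  # label -> (x1, y1, x2, y2)
    if "person" not in boxes or "car" not in boxes:
        return False
    if scene["vehicle_motion"] <= motion_threshold:
        return False
    return center_inside(boxes["person"], boxes["car"])


def search(scenes, predicate):
    """Return matching scenes ordered by importance, highest first."""
    hits = [s for s in scenes if predicate(s)]
    return sorted(hits, key=lambda s: s["importance"], reverse=True)


scenes = [
    {"start": "00:03:10", "importance": 0.4, "vehicle_motion": 0.8,
     "objects": {"person": (40, 40, 60, 60), "car": (0, 0, 100, 100)}},
    {"start": "00:12:45", "importance": 0.7, "vehicle_motion": 0.9,
     "objects": {"person": (45, 45, 55, 55), "car": (10, 10, 90, 90)}},
    {"start": "00:20:00", "importance": 0.9, "vehicle_motion": 0.0,
     "objects": {"person": (200, 200, 220, 240), "car": (0, 0, 100, 100)}},
]
print([s["start"] for s in search(scenes, matches_driving_query)])
```

Because every constraint is an explicit check, a match can be explained by listing which conditions the scene satisfied.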
| Property | SceneIQ |
|---|---|
| Hardware | CPU-friendly |
| Inference | Deterministic |
| Indexing | Real-time |
| Transformer dependency | None |
| Execution | Fully local |
Unlike heavy multimodal models, SceneIQ runs on CPU-class hardware, produces deterministic results, and executes fully locally.
| Domain | Use Case |
|---|---|
| Sports | Highlight extraction |
| Film & Media | Scene indexing |
| Surveillance | Behavior analysis |
| Smart Mobility | Vehicle understanding |
| Education | Video navigation |
SceneIQ introduces:
- Structured temporal abstraction
- Persistent identity modeling
- Motion-aware scene importance ranking
- Interaction graph reasoning
- Deterministic semantic retrieval
It bridges classical computer vision and semantic video intelligence.
In the near future, users will not scrub videos. They will query them.
"Last minute winning goal."
"Professor explaining gradient descent."
"Crowd panic moment."
SceneIQ is the engine that makes video searchable by meaning.
SceneIQ doesn't detect frames.
It understands moments.
MIT