- Query-driven retrieval showing motion-aware scene matching with importance scoring and exact timestamp output.
- Demonstrates object detection and highlights the limitations of traditional models in capturing relational context.
- Examples of generated scene graphs showing entity relationships such as riding, wearing, holding, and spatial positioning.
- SceneIQ analyzes a YouTube video, segments scenes, and enables semantic timestamp-based retrieval.
- SceneIQ retrieves the trophy-lift moment using persistent character IDs and motion-aware ranking for instant timestamp navigation.
Inspiration
Video today is treated as a sequence of independent frames. But narrative video — sports, interviews, films, lectures — is not frame-based. It is event-based.
While watching long-form content, we realized something simple: people don’t search for frames. They search for moments: “Goal celebration.” “Trophy lift.” “Two people arguing.” “Man driving car scene.”
Existing systems detect objects per frame but fail to model:
- Temporal continuity
- Persistent identity
- Interaction dynamics
- Scene importance

SceneIQ was built to shift video understanding from static detection to structured temporal intelligence.
What it does
SceneIQ transforms raw video into structured, searchable scene-level intelligence. It:
- Segments video into narrative-consistent scenes
- Detects and tracks entities persistently across time
- Models motion intensity and velocity
- Infers interactions between people and objects
- Builds scene-level semantic representations
- Ranks scenes by importance
- Enables timestamp-accurate natural-language retrieval
Instead of returning labels, SceneIQ returns structured moments.

Example query: “Football World Cup trophy celebration”

SceneIQ returns:
- Trophy detected
- Multi-person interaction cluster
- High motion spike
- Timestamp: 01:42:18 – 01:44:03
- Importance score: 0.93
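A retrieved moment like the one above can be thought of as a small structured record rather than a label. A minimal sketch of such a record, with illustrative field names (not SceneIQ’s actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Moment:
    """One scene-level retrieval result (hypothetical field names)."""
    entities: list = field(default_factory=list)  # detected entities in the scene
    motion_intensity: float = 0.0                 # peak motion, normalized to [0, 1]
    start: str = ""                               # scene start, "HH:MM:SS"
    end: str = ""                                 # scene end, "HH:MM:SS"
    importance: float = 0.0                       # narrative importance, [0, 1]

# The trophy-celebration example expressed as a structured moment
m = Moment(["trophy", "person"], 0.88, "01:42:18", "01:44:03", 0.93)
```

Returning a record like this, instead of a bag of per-frame labels, is what makes timestamp-accurate navigation possible downstream.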
How we built it
Video Input
→ Structural Scene Segmentation (groups frames into narrative-consistent scenes)
→ Multi-Object Detection (identifies key entities in each frame)
→ Persistent Multi-Object Tracking (maintains identity across time)
→ Motion Intelligence Layer (analyzes movement patterns and intensity)
→ Interaction Graph Modeling (infers relationships between entities)
→ Scene Importance Scoring (ranks scenes by narrative significance)
→ Semantic Retrieval Engine (returns exact timestamps based on user query)
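The first stage, structural scene segmentation, groups consecutive frames into scenes. One common approach is to cut a scene wherever the frame-to-frame difference exceeds a threshold. A toy sketch under that assumption, using scalar “frame signatures” (e.g. a histogram-difference value) in place of real frames; the function name and threshold are illustrative, not SceneIQ’s actual implementation:

```python
def segment_scenes(frame_signatures, threshold=0.5):
    """Group consecutive frames into scenes, cutting where the change
    between adjacent frame signatures exceeds `threshold`.
    `frame_signatures` is a list of scalars standing in for per-frame features."""
    scenes, current = [], [frame_signatures[0]]
    for prev, cur in zip(frame_signatures, frame_signatures[1:]):
        if abs(cur - prev) > threshold:  # large visual change -> scene cut
            scenes.append(current)
            current = []
        current.append(cur)
    scenes.append(current)
    return scenes

# Two stable shots separated by an abrupt change -> two scenes
segment_scenes([0.0, 0.1, 0.9, 1.0], threshold=0.5)  # -> [[0.0, 0.1], [0.9, 1.0]]
```

In practice the signature would be a color histogram or embedding distance rather than a scalar, but the cut-on-change logic is the same.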
Challenges we ran into
Accomplishments that we're proud of
- Fully local execution
- Deterministic, explainable pipeline
- Structured scene-level abstraction
- Temporal entity-persistence modeling
- Motion-aware importance ranking
- Interaction-graph reasoning
- Timestamp-accurate semantic retrieval
- Modular architecture suitable for research extension

Most importantly: SceneIQ understands evolving events, not isolated frames.
What we learned
- Temporal continuity is more powerful than frame-level detection
- Motion dynamics reveal narrative significance
- Interaction density signals meaningful moments
- Structured representations outperform raw label outputs
- Efficiency and interpretability can coexist

We also learned that narrative reasoning does not require massive transformer architectures. Carefully engineered temporal modeling can achieve strong semantic understanding.
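The two signals above, motion dynamics and interaction density, can be blended into a single scene-importance score. A minimal sketch, assuming both inputs are normalized to [0, 1]; the weights are hypothetical, not the values SceneIQ uses:

```python
def importance(motion_peak, interaction_density,
               w_motion=0.6, w_interact=0.4):
    """Hypothetical weighted blend of two normalized signals:
    scenes with both a high motion spike and dense entity interactions
    (e.g. a trophy lift amid a celebrating crowd) score highest."""
    return w_motion * motion_peak + w_interact * interaction_density

# High motion spike + dense multi-person interaction -> high importance
importance(1.0, 0.9)  # -> 0.96
```

A learned scorer could replace the fixed weights, but a transparent linear blend keeps the ranking deterministic and explainable.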
What's next for SceneIQ
- Graph Neural Network refinement for interaction graphs
- Transcript-aware multimodal fusion
- Cross-video narrative linking
- Zero-shot semantic query generalization
- Large-scale evaluation benchmarks
Long-term vision: make video searchable by meaning. From video as media → video as structured knowledge.
