💡 Inspiration
The motivation for this project was born out of a tragedy that should never have happened. The Seoul Halloween crowd crush served as a devastating wake-up call to the dangers of unmanaged crowd density.
- The Context: The first Halloween without COVID restrictions saw 100,000 people celebrating on Itaewon Street.
- The Tragedy: A bottleneck in an alleyway just 3.2 m wide resulted in 159 deaths and 197 injuries.
- The Reality: Research shows that 70% of safety-related injuries at events stem directly from poor crowd management.
Current surveillance methods are passive: they record tragedies but do not prevent them. We realized that we needed intelligent surveillance that quantifies danger in real time. We built Crowd Watch to transform passive drone footage into active, life-saving data.
⚙️ How We Built It
Our system is a closed-loop pipeline designed for low latency and high accuracy. We architected the solution in three distinct stages:
1. Data Acquisition (The Eye)
As seen in our system architecture, we utilized a DJI drone to capture a bird's-eye view of the target area. We bypassed standard on-device processing to save battery and maximize compute power. Instead, we engineered the drone to broadcast a video feed via RTMP (Real-Time Messaging Protocol). This stream is intercepted by our central processing computer, which acts as the server handling the heavy lifting.
2. The AI Engine (The Brain)
This is the core of our "Accuracy" pillar. We implemented CSRNet (Congested Scene Recognition Network), a deep learning architecture specifically designed for counting objects in dense environments. We chose this over standard object detection models (like YOLO) because detection boxes fail when people overlap significantly.
Based on our implementation, the model follows a specific two-part structure:
- The Frontend (VGG16 Feature Extractor): We utilized the first 10 layers of a VGG16 network. Its strong transfer learning capabilities allow it to understand basic shapes and patterns immediately. We strictly avoided downsampling after this stage to preserve the spatial resolution needed to detect small heads in a large crowd.
- The Backend (Dilated Convolutions): This is the technical innovation. Standard pooling layers reduce image size and lose detail. Instead, we used dilated convolutions with a dilation rate of 2. This technique expands the receptive field (the area of the image the neuron "looks" at) without reducing the resolution.
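The effect of dilation on the receptive field can be checked with simple arithmetic: a k×k convolution with dilation d covers an effective extent of d·(k−1)+1 pixels, and stride-1 layers add (effective extent − 1) to the receptive field each. The sketch below illustrates this; the six-layer backend here is a simplified stand-in, not our exact layer configuration.

```python
# Sketch: how dilation grows the receptive field without downsampling.
# Effective extent of a k x k conv with dilation d is d*(k-1) + 1.

def effective_kernel(k: int, d: int) -> int:
    """Spatial extent covered by one dilated convolution."""
    return d * (k - 1) + 1

def receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 convs, given (k, d) per layer."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf

# Six 3x3 convs with dilation 2 (a CSRNet-style backend, stride 1 throughout):
print(receptive_field([(3, 2)] * 6))  # 25: each layer adds 4 pixels of context

# The same six layers without dilation see far less of the image:
print(receptive_field([(3, 1)] * 6))  # 13
```

In other words, dilation nearly doubles the context each output pixel sees, at the same resolution and parameter count as a plain 3×3 stack.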
Mathematically, this allows us to generate a high quality Density Map where the integral of the map equals the crowd count:
$$ C = \sum_{i=1}^{W} \sum_{j=1}^{H} D(x_{i,j}) $$
(Where $C$ is the total count, $W$ and $H$ are width and height, and $D$ is the density map value at pixel $x$.)
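The double sum translates directly into code: the crowd count is just the sum of every pixel in the density map. A toy example with illustrative values (in practice the map is the model output for one frame):

```python
# Sketch: the crowd count is the discrete integral of the density map.
# Toy 3x4 density map; values are illustrative.
density_map = [
    [0.0, 0.2, 0.1, 0.0],
    [0.1, 0.8, 0.5, 0.1],
    [0.0, 0.3, 0.6, 0.3],
]

# C = sum over i, j of D(x_ij)
count = sum(sum(row) for row in density_map)
print(round(count))  # ~3 people in this toy patch
```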
3. Visualization & Dashboard (The Interface)
The central computer pushes the processed data to two endpoints:
- AI SCRN: A dedicated screen for monitoring the raw AI output and debugging model performance.
- Operator Dashboard: A user-facing interface for security personnel. This provides:
- Live heatmaps overlaid on the video feed
- Real-time crowd count and density metrics
- Automatic “High Risk” alerts when density exceeds safe thresholds (e.g., >5 people/m²)
- A source toggle to switch between:
- Live drone feed
- Fixed CCTV cameras
- Uploaded video (for incident review and offline analysis)
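The alert rule above can be sketched as a simple threshold function; the exact cutoffs and the "Warning" tier below are illustrative, not our production values.

```python
# Sketch of the density-based alert rule (thresholds are illustrative).
HIGH_RISK_DENSITY = 5.0  # people per square metre

def risk_level(density: float) -> str:
    """Map a measured density to a dashboard alert level."""
    if density > HIGH_RISK_DENSITY:
        return "High Risk"
    if density > 0.7 * HIGH_RISK_DENSITY:
        return "Warning"
    return "Normal"

print(risk_level(6.2))  # "High Risk"
print(risk_level(2.0))  # "Normal"
```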
Frontend Analytics & Operator Tools
To move beyond “just a heatmap,” we designed the frontend as a full analytics cockpit:
Pin Drops for Critical Zones
Operators can click on a hotspot to drop a pin on the map. Each pin stores:
- Timestamp
- Estimated crowd count
- Risk level (Low / Medium / High / Critical)
These pins act as a visual log of where and when risk emerged.
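A minimal sketch of such a pin record, assuming a simple in-memory log (the field names and coordinates here are hypothetical, not our actual schema):

```python
# Illustrative pin record matching the fields above (names are hypothetical).
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Pin:
    lat: float
    lon: float
    crowd_count: int          # estimated count at drop time
    risk: str                 # "Low" / "Medium" / "High" / "Critical"
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

pin_log: list[Pin] = []
pin_log.append(Pin(lat=37.5345, lon=126.9946, crowd_count=430, risk="Critical"))
print(pin_log[0].risk)  # "Critical"
```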
Real-Time Risk Panel
A side panel shows a live list of Top Risk Areas, sorted by danger level:
- Zone name or camera ID
- Current density (people/m²)
- Trend (increasing, stable, decreasing)
- Time since first detected as “at risk”
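Ordering that panel is a straightforward sort on the live density readings; a minimal sketch with illustrative zone records:

```python
# Sketch: ranking zones for the "Top Risk Areas" panel (fields illustrative).
zones = [
    {"zone": "CAM-03", "density": 2.1, "trend": "stable"},
    {"zone": "Alley 3", "density": 5.6, "trend": "increasing"},
    {"zone": "Zone B", "density": 4.2, "trend": "increasing"},
]

# Highest density first = most dangerous at the top of the panel.
top_risk = sorted(zones, key=lambda z: z["density"], reverse=True)
for z in top_risk:
    print(f'{z["zone"]}: {z["density"]} people/m² ({z["trend"]})')
```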
Start Live Session & Density Timeline
Security teams can hit “Start Live Session” at the beginning of an event.
During the session we:
- Log density and risk scores frame-by-frame
- Aggregate them into minute-by-minute trends
- Render a timeline chart that shows how risk evolved over time in each zone
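The frame-to-minute aggregation can be sketched as a simple bucketed average; timestamps and densities below are illustrative, not real session data.

```python
# Sketch: collapsing per-frame density logs into minute-by-minute averages.
# Each entry is (seconds since session start, measured density); values illustrative.
from collections import defaultdict

frame_log = [(0.0, 1.0), (30.0, 2.0), (61.0, 3.0), (95.0, 4.0), (125.0, 5.0)]

buckets = defaultdict(list)
for t, density in frame_log:
    buckets[int(t // 60)].append(density)  # bucket by minute index

timeline = {minute: sum(v) / len(v) for minute, v in sorted(buckets.items())}
print(timeline)  # {0: 1.5, 1: 3.5, 2: 5.0}
```

The resulting `timeline` dict is what feeds the per-zone trend chart.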
AI Safety Report (One-Click Summary)
At the end of a session, the system generates an AI-written safety report that includes:
- Peak density and when it occurred
- Number of high-risk alerts and where they happened
- Recommended interventions (e.g., “Open an alternate exit near Zone B”, “Add barriers to prevent backflow in Alley 3”)
This report can be shared with event organizers as a post-mortem or compliance document.
Multi-Source Viewing (Drone / Camera / Video)
From the same dashboard, users can:
- Toggle between drone, CCTV, and uploaded video sources
- See consistent overlays (heatmap + risk labels) across all sources
- Use uploaded video to replay incidents and validate safety procedures
Together, these tools turn Crowd Watch from a single AI model into a full decision-support system for crowd safety: operators can spot risk, log it, act on it, and review it all in one place.
🚧 Challenges We Faced
- The "Scale" Problem: In computer vision, perspective distortion is a major issue. A person standing directly under the drone looks massive, while a person at the edge of the crowd looks like a dot. We had to fine-tune the CSRNet backend to handle these multi-scale variations so the count remained accurate across the entire frame.
- Latency Bottlenecks: Streaming high-definition video via RTMP is resource intensive. We initially faced significant lag between the real-world event and the dashboard update. We solved this by optimizing the buffer size on the Python backend and downscaling the input resolution slightly before it hit the neural network, finding the sweet spot between speed and accuracy.
- Networking Complexity: As shown in our architecture diagram, the central computer has to handle three simultaneous bidirectional connections. Managing the asynchronous data flow between the drone input, the AI processing loop, and the frontend updates required careful thread management to prevent the application from freezing.
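One buffering pattern that addresses both points above is a bounded queue that keeps only the newest frame, so a slow inference loop never falls behind the live feed. A minimal single-producer sketch, assuming the drone feed and the model run in separate threads (the demo below is single-threaded for clarity):

```python
# Sketch: a "drop stale frames" buffer between the stream reader and the model.
# A maxsize-1 queue means the AI loop always sees the most recent frame.
from queue import Queue, Empty, Full

frame_buffer: Queue = Queue(maxsize=1)

def push_frame(frame) -> None:
    """Producer side: replace any stale frame with the newest one."""
    try:
        frame_buffer.put_nowait(frame)
    except Full:
        try:
            frame_buffer.get_nowait()  # discard the stale frame
        except Empty:
            pass
        frame_buffer.put_nowait(frame)

# Simulate five frames arriving faster than inference can consume them:
for i in range(5):
    push_frame(f"frame-{i}")

print(frame_buffer.get())  # "frame-4": only the latest frame survives
```

Trading completeness for freshness this way keeps dashboard latency bounded regardless of how far the model's throughput lags the camera's frame rate.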
🧠 What We Learned
- Deep Learning Architectures: We gained a deep understanding of how Convolutional Neural Networks (CNNs) operate beyond basic tutorials. We learned specifically how $3 \times 3$ kernels work and how dilation acts as a superior alternative to pooling for density estimation tasks.
- System Design: We learned how to handle video streams programmatically. Moving beyond static image processing into continuous data pipeline management taught us about the complexities of real time systems.
- Tech for Good: Most importantly, we learned that engineering is not just about optimization but application. Building a tool that addresses a real-world humanitarian issue like crowd safety gave the code a sense of purpose that a standard class assignment never could.