ThirdEye

Cinematic shot
A front view of ThirdEye
The full setup (note the middle speaker was for demonstration purposes only)

Inspiration

Over 25% of the human population has some sort of visual impairment, and for those with total vision loss, simply navigating daily life can be incredibly difficult and dangerous. While guide dogs are increasingly accepted in many public spaces, these canine companions may still be barred from places such as zoos, airports, and hospitals. ThirdEye is like a guide dog in your pocket: a single voice command gets you a vivid, narrated description of your surroundings in the current moment, or recalls a scene from weeks, months, or even years ago.

What it does

ThirdEye hooks onto your shirt pocket and requires only your voice to operate. With one spoken word, you can instantly capture the scene in front of you. Then, ThirdEye will play a detailed description, with an emphasis on obstacles or potential tripping hazards, and save that description to a cloud database. With another command, you can ask ThirdEye a question about what it has seen before, and it will tell you exactly where, when, and what it was.
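As a rough illustration of this voice-only control, the sketch below maps a recognized transcript to one of the two actions. The trigger words, the `Command` type, and `parse_command` are all hypothetical; the actual words ThirdEye listens for are not listed in this write-up.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical trigger words, chosen only for illustration.
SNAPSHOT_WORDS = {"snapshot", "capture"}
RECALL_WORDS = {"recall", "remember"}


@dataclass
class Command:
    kind: str                     # "snapshot" or "recall"
    query: Optional[str] = None   # free-text details, only used for recall


def parse_command(transcript: str) -> Optional[Command]:
    """Return the first command found in the transcript, or None."""
    words = transcript.split()
    for i, word in enumerate(words):
        token = word.strip(".,!?").lower()
        if token in SNAPSHOT_WORDS:
            return Command("snapshot")
        if token in RECALL_WORDS:
            # Everything after the trigger word is treated as the question.
            query = " ".join(words[i + 1:])
            return Command("recall", query or None)
    return None
```

Anything that contains no trigger word returns `None`, so background chatter is ignored rather than treated as a command.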

How we built it

We built ThirdEye using Python on a Raspberry Pi 5. The user's speech, captured by a Bluetooth microphone, is processed by Python's SpeechRecognition library to tune out background noise and discern commands. If the user asks to take a snapshot, the Pi camera takes a photo, which is encoded and passed to Cohere along with a customised prompt. Cohere returns a description that is read aloud with gTTS. The description is also saved to DynamoDB, which triggers an AWS Lambda function that stores it in an OpenSearch vector database alongside its timestamp and location data. When the user asks ThirdEye to recall a certain scene by providing a few details, e.g. "Tell me where I saw a clock tower next to a river", it queries the OpenSearch database for possible candidates, which are passed to Cohere to select the most accurate entry. Finally, the description, time, and location are narrated back to the user.
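The two-stage recall step (retrieve candidates, then pick the best one) can be sketched in plain Python. This is not our production code: the in-memory list stands in for the OpenSearch vector database, the toy bag-of-words `embed` stands in for a real embedding model, and the `choose` callable stands in for the Cohere selection call. `Scene`, `top_candidates`, and `recall` are hypothetical names used only for this example.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scene:
    description: str
    timestamp: str   # e.g. an ISO 8601 string
    location: str


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumption standing in for real embeddings."""
    return Counter(word.strip(".,!?").lower() for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_candidates(query: str, scenes: List[Scene], k: int = 3) -> List[Scene]:
    """Stage 1: rank stored scenes by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(scenes, key=lambda s: cosine(q, embed(s.description)), reverse=True)
    return ranked[:k]


def recall(query: str, scenes: List[Scene],
           choose: Callable[[str, List[Scene]], Scene], k: int = 3) -> str:
    """Stage 2: let `choose` (a language model, in our case) pick the best candidate."""
    best = choose(query, top_candidates(query, scenes, k))
    return f"{best.description} Seen at {best.location} on {best.timestamp}."
```

Splitting retrieval from selection keeps the expensive model call small: it only ever sees a handful of candidates rather than the whole history.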

Challenges we ran into

We had a lot of difficulty deciding on and acquiring the correct hardware. We initially tried to use an ESP32 camera with a microphone and speaker module, but the high latency, the lack of IO pins, and the fact that the provided microphone was not a microphone at all but a sound detector posed significant challenges. We experimented with an Arduino Uno, a regular ESP32, and a QNX Raspberry Pi 4B before settling on our Pi 5 design.

We found it challenging to manage Python dependencies with half of our team developing on macOS, the others on Windows, and the deployment running on the Linux-based Raspberry Pi OS. We also struggled with audio IO; the Bluetooth connections were extraordinarily finicky, so we ended up using a USB speaker for the demo.

And last but certainly not least, as Minghao will testify, the little pushbutton was the bane of our existence for the better part of 5 hours. Just when it finally seemed to be working, it suddenly stopped after an innocuous git pull that didn't even change any relevant code. We eventually concluded that it was faulty after hours of fighting the sunk cost fallacy. We initially intended to use this button to trigger snapshots or recalls, but we had to pivot deep into the project. Thankfully, we were able to adapt through trial and error and accomplish our goal of building something cool and, most importantly, impactful at Hack the North 2025.

Accomplishments that we're proud of

Getting the hardware working!
Pivoting from using the button to voice commands
Seamlessly integrating DynamoDB and OpenSearch
Making full use of Cohere
Managing multimodal IO
Decreasing latency with multithreading
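To give a flavour of the multithreading point, here is a minimal sketch that runs two independent, IO-bound steps at the same time instead of one after the other. Which steps we actually ran in parallel is not spelled out above, so pairing narration with saving is an assumption, and `speak`, `save`, and `handle_description` are hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def speak(description: str) -> str:
    time.sleep(0.2)  # stand-in for text-to-speech playback
    return "spoken"


def save(description: str) -> str:
    time.sleep(0.2)  # stand-in for a network write to the database
    return "saved"


def handle_description(
    description: str,
    speak: Callable[[str], str] = speak,
    save: Callable[[str], str] = save,
) -> Tuple[str, str]:
    """Narrate and store a description concurrently rather than sequentially."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        spoken = pool.submit(speak, description)
        saved = pool.submit(save, description)
        # Both tasks are already running; wait for each to finish.
        return spoken.result(), saved.result()
```

Because both steps spend most of their time waiting on IO, threads are enough here; the total wait is roughly that of the slower step rather than the sum of the two.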

What we learned

Hardware can be very unreliable (THE BUTTON)
How your hardware behaves can almost seem non-deterministic
Prompt design is crucial to getting consistent responses without hallucinations

What's next for ThirdEye

Downsizing our hardware to make it more portable and all-in-one with a smaller microphone and speaker.
Adding more sensors, such as lidar, for redundancy and extra safety.