Inspiration

We were inspired by the concept of "total recall" seen in the movie Limitless and the Black Mirror episode "The Entire History of You." We wanted to move beyond the dystopian warnings and find the utopian utility: what if you never forgot a face, a promise, or where you put your keys? We wanted to build a "Ctrl+F" for real life.

What it does

MemGrep is an always-on, multimodal memory assistant embedded in Ray-Ban Meta glasses. It acts as a passive observer that:

1) Ingests the video stream to understand your visual context.

2) Transcribes and tracks conversations in real-time.

3) Indexes this multimodal data into a long-context vector database built on Mem0, where a small language model serves as the "librarian" that indexes, retrieves, and reconciles contradictions in memory.
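The indexing-and-retrieval step can be sketched as a tiny in-memory vector store. This is illustrative only; Mem0's actual API differs, and the `MemoryStore` class and toy embeddings here are hypothetical:

```python
import math
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy stand-in for the Mem0-backed index: stores (timestamp, text, embedding)."""
    entries: list = field(default_factory=list)

    def add(self, timestamp: float, text: str, embedding: list[float]) -> None:
        self.entries.append((timestamp, text, embedding))

    def search(self, query_emb: list[float], top_k: int = 3):
        # Rank stored memories by cosine similarity to the query embedding.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.entries, key=lambda e: cosine(query_emb, e[2]), reverse=True)
        return ranked[:top_k]

store = MemoryStore()
store.add(10.0, "keys on the kitchen counter", [0.9, 0.1, 0.0])
store.add(42.5, "met Alice at the booth", [0.1, 0.9, 0.0])
print(store.search([1.0, 0.0, 0.0], top_k=1)[0][1])  # → keys on the kitchen counter
```

In the real system the embeddings come from SigLIP and the transcript encoder, and the "librarian" SLM decides what gets written and merged rather than every raw chunk landing in the store.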

How we built it

We built this by leveraging Grok Live to communicate with the user in real-time. For the heavy lifting of memory, we used SigLIP to embed video frames into a semantic vector space and Phi-1.5 to act as the lightweight language backbone that processes both queries and video tokens.

All of this data is fed into our Mem0 database, which manages the long-term user context. We open a websocket between the client (a mobile relay app connected to the glasses) and our backend server to stream audio and video data continuously. This stream is chunked, embedded, and stored in the vector store. When a user asks a question, we perform a semantic search over the Mem0 database with the help of a helper SLM, retrieve the relevant time-stamped context, and use the LLM to synthesize an answer.
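The chunking step on that stream can be sketched as slicing the incoming bytes into fixed-size, timestamped pieces before embedding. This is a simplification (real websocket framing and codec boundaries are more involved), and `CHUNK_BYTES` is an assumed tuning parameter:

```python
from typing import Iterator

CHUNK_BYTES = 4096  # assumed chunk size; tuned in practice for latency vs. overhead

def chunk_stream(stream: bytes, start_ts: float, bytes_per_second: int) -> Iterator[tuple[float, bytes]]:
    """Slice a raw audio/video byte stream into timestamped chunks for embedding."""
    for offset in range(0, len(stream), CHUNK_BYTES):
        ts = start_ts + offset / bytes_per_second
        yield ts, stream[offset:offset + CHUNK_BYTES]

# 10000 bytes at 8000 B/s → chunks of 4096, 4096, and 1808 bytes,
# timestamped 0.000s, 0.512s, 1.024s.
chunks = list(chunk_stream(b"\x00" * 10000, start_ts=0.0, bytes_per_second=8000))
```

Keeping the timestamps attached to every chunk is what later lets retrieval return time-stamped context instead of bare text.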

Challenges we ran into

Latency vs. Accuracy: Balancing the speed of real-time ingestion with the computational cost of generating high-quality embeddings. We had to optimize how often we sampled frames to keep the system responsive without missing critical details.
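The frame-sampling trade-off can be illustrated with a simple change-detection gate: only embed a frame when it differs enough from the last frame we kept. This is a sketch, not our production sampler, and the threshold value is an assumption:

```python
def select_frames(frames: list[list[float]], threshold: float = 0.1) -> list[int]:
    """Return indices of frames worth embedding: the first frame, plus any
    frame whose mean absolute pixel difference from the last kept frame
    exceeds `threshold` (a hypothetical tuning knob)."""
    kept: list[int] = []
    last = None
    for i, frame in enumerate(frames):
        if last is None:
            kept.append(i)
            last = frame
            continue
        diff = sum(abs(a - b) for a, b in zip(frame, last)) / len(frame)
        if diff > threshold:
            kept.append(i)
            last = frame
    return kept

# A static scene with one change at frame 3: only frames 0 and 3 get embedded.
frames = [[0.5] * 4, [0.5] * 4, [0.51] * 4, [0.9] * 4, [0.9] * 4]
print(select_frames(frames))  # → [0, 3]
```

A gate like this is how sampling frequency stays high during activity (so critical details aren't missed) while idle scenes cost almost nothing to embed.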

Hardware Constraints: Getting a consistent stream from the glasses while managing battery life and bandwidth was a constant balancing act. We also struggled with inference latency at first, but solved it by using Mem0's SLM as the "librarian" and by swapping out a bulky VLM for our SigLIP × Phi-1.5 multimodal model, which ran at blazing speeds.

Accomplishments that we're proud of

Semantic Video Search: We successfully built a pipeline that lets you recall things that were never explicitly spoken about (such as leaving your keys on the counter) and map things that were spoken to things that were seen (such as linking a name to a face).
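The speech-to-vision linking works by timestamp proximity: a name heard at time t is associated with the face seen closest to t. A minimal sketch (function name, window size, and data shapes are all hypothetical):

```python
def link_mention_to_frame(mention_ts: float,
                          frame_entries: list[tuple[float, str]],
                          window: float = 2.0):
    """Link a spoken mention (e.g. a name) to the temporally closest visual
    memory within `window` seconds; return None if nothing is close enough."""
    best = min(frame_entries, key=lambda e: abs(e[0] - mention_ts), default=None)
    if best is None or abs(best[0] - mention_ts) > window:
        return None
    return best[1]

visual_memories = [(10.0, "face: person at booth"), (55.0, "keys on counter")]
print(link_mention_to_frame(10.8, visual_memories))  # → face: person at booth
```

With both modalities indexed into the same timestamped store, this join is cheap, which is what makes "link a name to a face" fall out of the pipeline almost for free.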
