Inspiration

As active Twitter users who are strong advocates for environmental justice and racial equality, we would love to meet other like-minded progressives on Twitter to not only understand their unique perspectives but also work with them on community-based projects to spread awareness of the problems facing our world today.

What it does

Bagel ignites discussions, fosters connections, and levels opinions.

Bagel is a data-driven dynamic grouping algorithm that stimulates opinionated conversation tailored entirely to the user. It helps foster meaningful relationships by grouping users based solely on their content. It doesn't just recommend influencers and mutual connections; it recommends people who are holistically similar to you.

In doing so, Bagel allows for meaningful 1:1 connections that support relationships beyond the virtual world. The current state of the app is essentially an algorithm that identifies the users most similar to each other based on their content history. Ultimately, we hope to extend this concept into a fully fledged, opt-in pairing platform in which users are paired frequently to build meaningful connections.

How we built it

We built Bagel with a variety of software tools. We first used the Twitter API to scrape tweets for each user on our platform, which allowed us to create a large dataset of opinionated debates and discussions both stemming from and continued by the user at hand. After merging each user's tweets into one large opinionated text file, we used Google's Bidirectional Encoder Representations from Transformers (BERT) model to create word embeddings of the tweets. These embeddings take into account tf-idf (term frequency–inverse document frequency) weights as well as words that reverse the meaning of a sentence, thanks to BERT's pretraining objectives of Masked Language Modeling (Masked LM) and Next Sentence Prediction (NSP).

Using these word embeddings, we performed a cosine similarity comparison, calculating the angle between embedding vectors to determine how similar each pair of users' embeddings was. This gave us an n x n matrix of similarity values between pairs of word embeddings, which served as the input to our manifold learning algorithm. We then used manifold learning to place each data point on a 2D graph based on word embedding similarity (the more similar the word embeddings, the closer the points would be). Each data point represents the location of a single Twitter user's opinions and discussions, and the layout stays consistent with the similarity scores between any pair of users.
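The similarity-to-2D step above can be sketched with scikit-learn. Metric MDS (multidimensional scaling) is one manifold learning method that accepts a precomputed dissimilarity matrix; the write-up does not say which manifold algorithm was used, so treat this as an illustrative choice, with random vectors standing in for the per-user embeddings:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
user_vecs = rng.normal(size=(6, 768))  # placeholder per-user embeddings

# Cosine similarity: normalize rows, then take dot products.
unit = user_vecs / np.linalg.norm(user_vecs, axis=1, keepdims=True)
sim = unit @ unit.T                    # n x n similarity matrix

# Manifold learning wants dissimilarities: similar pairs -> small distance.
dissim = 1.0 - sim
np.fill_diagonal(dissim, 0.0)

# Embed each user as a point in 2D so that pairwise distances
# approximate the dissimilarities.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)
print(coords.shape)  # (6, 2)
```

Other manifold methods (t-SNE, Isomap, UMAP) could be slotted in the same way, trading off how faithfully global versus local similarity structure is preserved.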

Using this graph, we were able to visualize a network of users who were within a certain Euclidean distance of each other. We then stored the data in Microsoft Azure and Google Cloud to maximize the cloud storage we had access to. We hope to scale our cloud storage with these two platforms as we look to launch in the near future.
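Once users are points in 2D, connecting those within a distance threshold is straightforward. A small sketch (the coordinates, user names, and threshold are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical 2D coordinates from the manifold step (one point per user).
coords = np.array([[0.0, 0.0], [0.1, 0.05], [2.0, 2.0], [2.1, 1.9]])
users = ["alice", "bob", "carol", "dave"]

# Pairwise Euclidean distances; connect users closer than a threshold.
dists = squareform(pdist(coords))
threshold = 0.5
edges = [(users[i], users[j])
         for i in range(len(users)) for j in range(i + 1, len(users))
         if dists[i, j] < threshold]
print(edges)  # [('alice', 'bob'), ('carol', 'dave')]
```

For a launched platform with many users, a spatial index such as a k-d tree (e.g. `scipy.spatial.cKDTree`) would avoid computing all n choose 2 distances.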

Challenges we ran into

Cal Hacks has been an exciting frenzy of caffeine, sleepless nights, and debugging. In building this application, so much has gone wrong. Yet, it is through these failures that we have learned so much. One of the biggest challenges we faced was accessing the Twitter API itself—we were repeatedly faced with HTTP errors, leading us to try alternative techniques to grab the data for nearly all of Saturday. Ultimately, we figured out what was wrong with our Twitter API call (it was—unfortunately—just a typo), and were able to get the request to work.

Another challenge we faced was finding the most efficient algorithm to map out users based on their tweets. This was the crux of our project, and we wanted to make sure it was high quality. As such, we implemented more than five different approaches, scrapping each one in turn. One of the most time-consuming approaches we tried was a manual comparison of tweets, in which the tweets would have to be identical in order to register as similar. This was not at all close to what we wanted, as we were aiming for a similarity score based on contextual clues. Ultimately, we tried a plethora of models and decided to go with BERT.

After doing so, it was difficult to figure out how to visually represent our differences. At first, we thought this was a linear programming question: given a set of distances between N choose 2 points, and the corresponding constraints, what are the points in an x-y plane? We spent hours trying to solve this on a whiteboard before realizing that it would be better solved by an entirely different model, explained in our tech stack (the manifold learning algorithm). Again, we rewrote much of our code, taking one step backward to move two steps forward.

Accomplishments that we're proud of

Overall, we are so proud of our product, and are excited to move forward with this idea. We truly believe that our application has the potential to revolutionize social connection and discourse, and we couldn't be happier to have come up with this in such a constrained time frame. Regardless of how we do, we are so glad to have had this experience.

What's next for Bagel

We hope to transform our code into an opt-in mobile and web platform and expand our network to include data from other social media platforms in the coming weeks.
