DataSphere

Inspiration

We've all spent hours hunting through various sources for datasets to use in data science projects, sometimes to the point where data collection takes disproportionately longer than the rest of the project.

What it does

We built a platform where users enter a prompt describing their data needs. A vector-similarity search then surfaces the most relevant datasets in our database, and the user can preview them and "chat" with the data to better understand what it contains.

How we built it

We use a multi-part pipeline to go from a natural language prompt to an agent that lets the user interact with a dataset. The pipeline begins by using an LLM to convert the natural language query into word features that represent the dataset the user is interested in. We then use Word2Vec to create a vector representation of that desired dataset and compare it to the representations we have stored. Finally, we present the most similar datasets and let the user preview them and interact with them through an LLM agent. We built the UI in Streamlit.
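The similarity step can be sketched roughly as follows. This is a minimal illustration and not our actual code: the tiny `TOY_EMBEDDINGS` table stands in for a pretrained Word2Vec model, the dataset names and keywords are made up, and averaging word vectors before taking cosine similarity is an assumption about how the word features get combined into one vector.

```python
import math

# Hypothetical 3-dimensional embeddings standing in for a pretrained
# Word2Vec model. Real vectors would have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "weather": [0.9, 0.1, 0.0],
    "temperature": [0.8, 0.2, 0.1],
    "rainfall": [0.7, 0.3, 0.0],
    "stock": [0.0, 0.9, 0.2],
    "price": [0.1, 0.8, 0.3],
    "market": [0.0, 0.7, 0.4],
}

# Hypothetical stored datasets, each described by a few keywords.
DATASETS = {
    "daily_weather": ["weather", "temperature", "rainfall"],
    "equity_prices": ["stock", "price", "market"],
}


def embed(keywords, embeddings):
    """Average the vectors of the known keywords; None if none are known."""
    vectors = [embeddings[word] for word in keywords if word in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_datasets(query_keywords, dataset_keywords, embeddings, top_k=3):
    """Return up to top_k (name, score) pairs, most similar first."""
    query_vec = embed(query_keywords, embeddings)
    if query_vec is None:
        return []
    scored = []
    for name, keywords in dataset_keywords.items():
        dataset_vec = embed(keywords, embeddings)
        if dataset_vec is not None:
            scored.append((name, cosine_similarity(query_vec, dataset_vec)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Keywords an LLM might extract from a prompt about rain and temperature data.
results = rank_datasets(["temperature", "rainfall"], DATASETS, TOY_EMBEDDINGS)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In this sketch, keywords missing from the embedding table are simply skipped, so the coverage of the pretrained vocabulary directly affects the scores.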

Challenges we ran into

We ran into challenges in several parts of the pipeline. Our similarity scores were initially quite inaccurate, which we eventually traced to the specific pretrained Word2Vec model we used. Switching to a model trained on a broader corpus significantly improved performance.

Additionally, our initial goal was to incorporate a code-writing LLM (such as one from the CodeLlama family) at the end of the pipeline for more dynamic interactions with the data, allowing the user to actively manipulate the data through natural language. We started with a 70B-parameter model but ran into challenges implementing distributed inference. We were also unable to debug the smaller 7B model in time. Over the weekend, we also had to revise our original idea because we weren't able to do as much market research as we would've liked.

Accomplishments that we're proud of

We were able to build an MVP this weekend and come up with a prototype of the entire initial pipeline. We also developed a novel idea with significant room for growth.

What we learned

We learned that the corpus a pretrained model was trained on can make a significant difference for certain use cases, as it did for our similarity scores.

What's next for DataSphere

The platform has a lot of potential for growth. In the future, we plan on allowing users to interact with multiple datasets simultaneously. They'd be able to compare, synthesize and train models across datasets that we provide, adding more novel functionality. Fine-tuning the models that allow the user to interact with the data would result in a deeper, more domain-specific conversation. DataSphere can become a platform for finding and interacting with any data. We bring the model and the data.