Inspiration

We are a startup working at the intersection of tabular data and AI. Our mission is to help people get value from data, with a particular focus on using new natural language systems to make data tools more accessible. We have released tooling for data scientists before, and one of the biggest asks has been privacy: the ability to do data work with modern chatbots without sending data to a third party.

When we saw recent code models (e.g. StarCoder) released under open source licenses, we knew we wanted to find a way to make them useful immediately. At the same time, OpenAI's "Code Interpreter" toolkit was available only as an invite-only beta, and we wanted to replicate that work and make it available to everyone. Combining these two inspirations with our data expertise led to the creation of dataDM.

We then thought: what if we connect this to a bunch of open public datasets? How cool would it be to have a tool where people could search for any data on the internet and then ask it questions? That would be valuable for researchers, journalists, and more. To demonstrate this experience, we connected dataDM to GitHub search. Now, in one app, you can search millions of CSVs on GitHub and analyze them.

What it does

DataDM is a chatbot interface that lets a user talk to an AI assistant that writes code, which is executed to answer data questions. Users can ask for data processing, feature engineering, data cleaning, question answering, visualizations, and even some data science modeling. They can bring their own CSVs (which stay local and private) or easily find CSVs via the GitHub API; either way, adding a file is a single click.

How we built it

We combined a bunch of open source tools: Jupyter kernels (for background execution), Gradio (for UI), Hugging Face Transformers (for open source models), and Microsoft Guidance (for wrapping LLM execution). For execution and analytics we rely on the great open source Python data science stack (NumPy, pandas, scikit-learn, Matplotlib, etc.).

First we got conversations working. Then we added parsing for the code, sent the parsed code to the Jupyter kernel in the background, and parsed the results back, both rendering them in the UI (such as plots) and representing them so the chatbot can "see" that the conversation has continued. We also added self-retries: if the code raises an error, the AI will try to fix the error and keep coding. Once we had that, we chased a lot of UI features (undo, retry, cancel, multiple models, the concept of agents). Lastly we added search as a separate tab, since the first thing we found ourselves doing every time was hunting for CSVs online.
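
The execute-and-retry loop above can be sketched minimally. This is a simplification, not the actual implementation: a persistent `exec` namespace stands in for the background Jupyter kernel (the real system talks to the kernel over its messaging protocol), and `ask_model` is a hypothetical callable that asks the LLM to propose a fix when code raises.

```python
# Simplified sketch of the execute-and-retry loop. A persistent dict
# namespace stands in for the background Jupyter kernel session, and
# `ask_model` is a hypothetical stand-in for the LLM call that proposes
# a fix when the code errors.
import traceback

def run_with_retries(code, session, ask_model, max_retries=2):
    """Execute `code` in the persistent `session` namespace; on error,
    ask the model for a corrected version and try again."""
    for attempt in range(max_retries + 1):
        try:
            exec(code, session)
            return code, None  # success: return the code that actually ran
        except Exception:
            error = traceback.format_exc()
            if attempt == max_retries:
                return code, error  # give up and surface the error
            code = ask_model(code, error)  # model proposes a fix

# Toy usage: a "model" that fixes a known typo.
session = {}
fixer = lambda code, err: code.replace("pritn", "print")
fixed, err = run_with_retries("pritn('hi')\nx = 41 + 1", session, fixer)
```

Because `session` persists across calls, later turns in the conversation can keep referring to variables (like a loaded DataFrame) defined earlier, which is the same property the real Jupyter kernel provides.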

Two added features rely on external services rather than open source code, and offer extra capabilities when API keys are provided:

  • If you include an OpenAI API key, you can use OpenAI models (so you can compare and contrast the quality of the open models)
  • If you include a GitHub API key, you can use GitHub code search to find public CSVs to bring into your analysis
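
As a rough illustration of the GitHub-powered CSV search, here is a hedged sketch against GitHub's documented code search REST endpoint. `GITHUB_TOKEN` is an assumed environment variable, and `to_raw_url` is an illustrative helper (not dataDM's actual code) for turning a search result into a directly downloadable link:

```python
# Sketch of finding public CSVs via the GitHub code search REST API.
# Endpoint and query syntax follow GitHub's documented API; the token
# is assumed to live in the GITHUB_TOKEN environment variable.
import json
import os
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def search_csvs(query, token=None, per_page=10):
    """Return html_urls of CSV files on GitHub matching `query`."""
    token = token or os.environ.get("GITHUB_TOKEN", "")
    params = urlencode({"q": f"{query} extension:csv", "per_page": per_page})
    req = Request(
        f"https://api.github.com/search/code?{params}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urlopen(req, timeout=10) as resp:
        return [item["html_url"] for item in json.load(resp)["items"]]

def to_raw_url(html_url):
    """Convert a github.com blob URL to its raw.githubusercontent.com
    form, which pandas.read_csv can fetch directly."""
    return html_url.replace(
        "https://github.com/", "https://raw.githubusercontent.com/"
    ).replace("/blob/", "/", 1)
```

Note that the code search endpoint requires authentication, which is why this feature only unlocks once a GitHub API key is supplied.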

We also host prebuilt images (CUDA and non-CUDA) on GitHub, using GitHub Actions to run our tests, publish to PyPI, and handle hosting. We have also deployed a hosted instance so anyone can try it out without pip installing, Docker, or anything else.

Challenges we ran into

  • Getting Gradio UI elements to style and behave well (e.g. canceling tasks that are part of a .then event chain, getting a chat box that fills the viewport, and applying custom CSS)
  • Finding ways to get Guidance to stream output for a chatbot (it wasn't officially supported, so we had lots of hacks that we were eventually able to clear out)
  • Trying to get GGML CPU models working through Guidance, which hit a surprising number of issues. We also ran into hard crashes when mixing this C code into the package, so we are waiting for the open source work on this front to stabilize.

Accomplishments that we're proud of

  • Parsing StarChat responses with Guidance and advanced prompting techniques generates code nicely in blocks, which both forces the model to write code and makes that code easy to extract.
  • The Jupyter REPL is nicely abstracted in the system and provides a persistent session state for working with data objects.
  • The concept of agents lets us create new strategies for different models, as well as new "styles" of agents (e.g. a future agent could be PySpark-focused rather than pandas-focused), introduced via some prompt engineering as a new .py file that uses the base class.
  • The search + CSV download features really enhance the experience, giving it much more of a "workspace" feel rather than just a "demo".
  • Demos are recorded via a Playwright script included in the repo. Using it we can automate tests and demo videos showcasing the tool's behavior.
  • We have a hosted version available now, at https://datadm.approx.dev/new
  • The end-to-end experience of this feels remarkable. Imagine if we could connect this to all the data in data.gov, kaggle, etc. Imagine being able to search all data on the internet and analyze it with language models. This is exciting.
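
To illustrate the code-extraction step from the first bullet: dataDM uses Guidance to constrain generation, but the core idea of pulling fenced code out of a chat reply can be sketched with a plain regex (a simplification for illustration, not the actual implementation):

```python
# Minimal sketch of extracting runnable code from a chat model's reply
# by matching fenced code blocks. The fence string is built with "`" * 3
# purely so this example doesn't contain a literal nested fence.
import re

FENCE = "`" * 3  # a literal triple-backtick
CODE_RE = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)

def extract_code(reply):
    """Return all fenced code blocks in the model's reply, in order."""
    return [block.strip() for block in CODE_RE.findall(reply)]

reply = f"Sure, here you go:\n{FENCE}python\ndf = df.dropna()\n{FENCE}\nDone."
blocks = extract_code(reply)
```

The Guidance-based approach goes further than this: rather than parsing after the fact, it constrains the model so a code block is generated in the first place, which is what makes the extraction reliable.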

What we learned

  • StarChat models are not as good as even GPT-3.5-turbo, especially when it comes to language understanding.
  • StarChat models have behavioral response issues and seem under-fine-tuned on their instruction set; they sometimes output invalid "chat" syntax (failing to end a role in the conversation, or repeating previous text).
  • Gradio is very powerful and great for specific use cases, but once you start adding specific features it has a lot of sharp edges, and it quickly feels faster to build your own UI.
  • The GitHub search API is not the same as the web search UI as of now (June 2023), and it's not clear when that will be updated.

What's next for DataDM

We have a bunch of features in the backlog, and we're hoping to get dataDM into the hands of as many users as possible to find out what works, what doesn't, and where it solves real problems. The three features we're most excited about adding:

  1. GGML model support, so that the system can run entirely on the CPU of a single machine (ideally we will have a demo of it running on an M2 MacBook Air soon)
  2. HTML export of a conversation: since we are using a Jupyter kernel in the backend, it should be possible to save the notebook and convert it to an HTML page, letting people share their conversations and analysis with others, including the code that was used!
  3. Building our own search function (beyond GitHub's API) to search across many online data stores, making this a tool for journalists, researchers, and others to find and analyze public data.
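
The HTML-export idea in point 2 could look roughly like this: map conversation turns onto the nbformat v4 notebook schema, then hand the file to nbconvert. `conversation_to_notebook` is a hypothetical helper sketched for illustration, not existing dataDM code:

```python
# Hypothetical sketch: turn conversation turns into an nbformat v4
# notebook dict, which nbconvert can then render to HTML.
def conversation_to_notebook(turns):
    """Map (role, text, is_code) turns onto the nbformat v4 structure:
    code turns become code cells, chat text becomes markdown cells."""
    cells = []
    for role, text, is_code in turns:
        if is_code:
            cells.append({"cell_type": "code", "execution_count": None,
                          "metadata": {}, "outputs": [], "source": text})
        else:
            cells.append({"cell_type": "markdown", "metadata": {},
                          "source": f"**{role}:** {text}"})
    return {"cells": cells, "metadata": {},
            "nbformat": 4, "nbformat_minor": 5}

nb = conversation_to_notebook([
    ("user", "Drop the missing rows", False),
    ("assistant", "df = df.dropna()", True),
])
# After json-dumping `nb` to conversation.ipynb, the standard CLI does
# the rest:  jupyter nbconvert --to html conversation.ipynb
```

Since the Jupyter kernel already holds the executed code and its outputs, the real export could also attach cell outputs (tables, plots) so the shared HTML shows results, not just code.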
