Vesalius

Inspiration

Public health is deeply intertwined with socioeconomics. Advances in technology and new findings from demographic analysis have been crucial in pushing past the status quo. A key contributor to this progress has been the CDC, which has funded and conducted numerous studies in these areas. However, a recent movement to eliminate federal programs that champion the importance of diversity has put many of the resources behind these studies at risk. Datasets such as the Behavioral Risk Factor Surveillance System (BRFSS) were taken down for over a month. Countless datasets on HIV/AIDS are still unavailable on federal sites. Thankfully, many contributors have uploaded datasets they had downloaded locally to the Internet Archive, which now hosts most CDC datasets. These datasets, however, are hard to use in practice: their metadata files contain little specific information, so users can't know the size, contents, or context of what they're downloading. This especially pushes novice researchers away from using and learning from these datasets, a problem whose effects will snowball over time as fewer and fewer students explore these critical fields. This problem is what led us to create Vesalius.

What it does

The application takes a user's query describing the datasets they want. It then quickly retrieves a ranked list of CDC datasets, which remain accessible regardless of their availability on the official site. Without requiring a local download, Vesalius displays the first 500 rows of the data, summarizes it, and lets the user view specific columns of their choosing. It then lets the user download either the first 100 rows or the whole dataset, with any chosen columns excluded, through an intuitive UI. By making the data more accessible and readable, we hope that people will be able to find and use the information they're looking for regardless of the current political climate.
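The preview and column-exclusion behavior described above can be sketched roughly as follows. This is a minimal illustration and not our actual implementation: the function name `preview_csv` and its parameters are hypothetical, and an in-memory string stands in for the streamed file. The idea is that reading rows lazily means only the first rows are ever consumed, however large the dataset is.

```python
import csv
import io
from itertools import islice


def preview_csv(lines, max_rows=500, exclude=()):
    """Return (header, rows) for the first `max_rows` data rows.

    `lines` is any iterable of text lines (an open file, a streamed
    response, ...). Columns named in `exclude` are dropped.
    Hypothetical helper, written for illustration only.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:  # empty input
        return [], []
    excluded = set(exclude)
    keep = [i for i, name in enumerate(header) if name not in excluded]
    rows = [
        # Pad short rows with "" so ragged files do not raise IndexError.
        [row[i] if i < len(row) else "" for i in keep]
        for row in islice(reader, max_rows)  # stops reading after max_rows
    ]
    return [header[i] for i in keep], rows


# Toy data standing in for a streamed dataset.
sample = io.StringIO("state,year,value\nGA,2020,1\nNY,2021,2\nCA,2022,3\n")
header, rows = preview_csv(sample, max_rows=2, exclude=["year"])
print(header)
print(rows)
```

Because `islice` stops pulling from the reader once the limit is reached, the same function could serve both a 500-row preview and a 100-row download by changing `max_rows`.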

How we built it

We fine-tuned Ministral-3B, a lightweight language model, on a dataset of about 14,000 samples that we built by augmenting and randomizing queries based on the metadata and titles of 1257 archived datasets that were uploaded to Internet Archive about two weeks ago. We attempted to reduce bias in the training data by including negative examples, normalizing the number of samples per dataset so that datasets with more metadata tags weren't overrepresented, and ignoring uninformative tags. To handle user queries, we use MongoDB Atlas Vector Search to retrieve the 100 datasets most relevant to the user's query, and then we use the fine-tuned Ministral model to rank them based on more advanced semantic analysis. We use the Internet Archive (IA) API to stream and display the first 500 rows of the data, and we use standard CSV editing operations to let the user both view the data without certain columns and download different versions of the data as desired. Finally, we use the fine-tuned Ministral model again to interpret the datasets and provide a summary to the user.
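The two-stage search described above (a cheap vector retrieval followed by a more expensive re-ranking) can be sketched as below. This is a simplified illustration under stated assumptions, not our production code: an in-memory list with toy three-dimensional embeddings stands in for the MongoDB Atlas collection, and `keyword_overlap` is a hypothetical placeholder for the fine-tuned model's relevance score. All names and data here are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, catalog, k=100):
    """Stage 1: keep the k catalog entries closest to the query vector."""
    ranked = sorted(
        catalog, key=lambda d: cosine(query_vec, d["embedding"]), reverse=True
    )
    return ranked[:k]


def rerank(query, candidates, score_fn):
    """Stage 2: reorder the short list with a more expensive scorer."""
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)


def keyword_overlap(query, doc):
    """Hypothetical stand-in for the fine-tuned model's relevance score."""
    return len(set(query.lower().split()) & set(doc["title"].lower().split()))


# Toy catalog; real embeddings would have far more dimensions.
catalog = [
    {"title": "Behavioral Risk Factor Surveillance System 2020",
     "embedding": [0.9, 0.1, 0.0]},
    {"title": "HIV Surveillance Report", "embedding": [0.8, 0.2, 0.1]},
    {"title": "Air Quality Measures", "embedding": [0.0, 0.1, 0.9]},
]

query = "hiv surveillance"
query_vec = [0.85, 0.15, 0.05]  # assumed embedding of the query

candidates = retrieve(query_vec, catalog, k=2)
ranked = rerank(query, candidates, keyword_overlap)
print([d["title"] for d in ranked])
```

The point of the split is cost: the expensive scorer only ever sees the short list from stage 1, so the language model runs on at most `k` candidates per query rather than on the whole catalog.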

Challenges we ran into

We faced many challenges. The first was building the dataset on which to train Ministral-3B. Internet Archive is the only repository with data at the scale we need, and even then it holds only 1257 datasets. Researching ways to augment the data to increase diversity and training robustness was a real challenge. In fact, fine-tuning the model was probably the hardest part of this process. It was something we didn't have much experience with, and we had a lot of issues using the AI Makerspace (probably due to our inexperience). We also did not have much full-stack experience, and we decided fairly late to switch from Streamlit to React + Flask, which caused many issues in the eleventh hour. Specifically, writing API calls that properly handled a wide variety of CSV files and dataframes was extremely tough, especially as these problems bled into the early hours of Sunday and we only got more tired as we tried to solve them.

Accomplishments that we're proud of

We're proud of completing every part of the project we set out to do. The problem that brought about this project is something that we are all exceedingly passionate about, and we believe we created something that could be genuinely useful. Additionally, this was the first hackathon for two of our group members, and everyone was up to the task on what was a fairly difficult project. We're proud that we were able to fine-tune an LLM and significantly improve its performance on complex medical inputs. We're proud that we were able to properly use a computing cluster to train the model. We're proud of using frameworks and technologies that none of us had used to this extent before.

What we learned

We learned basically everything that we ended up using in the project. In addition to what we accomplished, we learned how to use version control efficiently, working on different branches simultaneously to avoid conflicting code. We learned how to use MongoDB, which we'd heard a lot about without ever knowing what it actually did. We learned how LLMs actually work and why fine-tuning can be better than prompting, and we also learned that LLM usage is expensive and should be minimized unless it's necessary. We also learned how to make React.js work seamlessly with Flask. The stack was unfamiliar given our limited full-stack experience, but we were able to figure out solutions to the problems we encountered.

What's next for Vesalius

Our hope for Vesalius is that it will be able to fetch invaluable datasets not just from Internet Archive but also from other sources that are accurate yet less accessible, and process that data for easier use. Data is history, and because Internet Archive preserves that valuable history of data, we hope to make Vesalius a tool for future researchers and students who need access to more digestible healthcare information. But our biggest hope for the application is that it can expand into domains beyond public health. Academic resources across the board are under attack, and keeping them accessible independent of the political climate is something that's going to pay dividends in the future.