Inspiration

Intrigued by the large amount of data accessible on the distributed network, we set out to group files by similarity so that they could be explored in an interesting way. Powerful search tools like Google exist for information on the web, but we were unsure how to explore all of the data available on the distributed network.

What it does

This project visualizes the vast quantities of data stored on the InterPlanetary File System (IPFS) as an intuitive 3D graph. Connected nodes sit close together and have similar content. All nodes are colour-coded by file type, and larger nodes represent larger files. Hovering over a node shows more information about the file, and clicking the node downloads it.

Nodes can be dragged with the cursor and the view of the graph can be zoomed in or out with the scroll wheel.

How we built it

We built this application with three main technologies: Golang, Python, and Three.js (JavaScript). Behind the scenes, we used Estuary to interface with IPFS and fetch files, and Co:here's Embed platform to quantify the similarity between files.

Our pipeline fetches the headers of around 2000 files on IPFS, embeds the texts into vectors, reduces the dimensionality of the vectors with principal component analysis (PCA), classifies the texts with k nearest neighbors, and visualizes the resulting neighbors as a 3D graph.
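A minimal sketch of the reduction and neighbor steps, for illustration only and not the project's actual code. It assumes PCA is approximated with power iteration in pure Python, and that neighbors are found by Euclidean distance in the reduced 3D space. The names `pca_reduce` and `knn_edges` and the sample `embeddings` are hypothetical. Real embeddings have far more dimensions, where a numerical library would be the practical choice.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pca_reduce(rows, n_components=3, iterations=300):
    """Project rows onto their leading principal directions (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    components = []
    for _ in range(n_components):
        # Fixed, non-uniform start vector keeps the sketch deterministic.
        v = [float(j + 1) for j in range(d)]
        for _ in range(iterations):
            # Apply the covariance matrix without building it:
            # C v is proportional to X^T (X v) for centered data X.
            scores = [dot(r, v) for r in centered]
            w = [sum(s * r[j] for s, r in zip(scores, centered)) for j in range(d)]
            # Remove the directions already captured by earlier components.
            for c in components:
                proj = dot(w, c)
                w = [x - proj * y for x, y in zip(w, c)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                v = [0.0] * d  # no variance left to capture
                break
            v = [x / norm for x in w]
        components.append(v)
    return [[dot(r, c) for c in components] for r in centered]

def knn_edges(points, k=2):
    """Return sorted (i, j) pairs linking each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _, j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

# Hypothetical 5-dimensional "embeddings" standing in for real ones.
embeddings = [
    [0.9, 0.1, 0.0, 0.2, 0.1],
    [0.8, 0.2, 0.1, 0.1, 0.0],
    [0.1, 0.9, 0.2, 0.0, 0.1],
    [0.0, 0.8, 0.1, 0.1, 0.2],
    [0.1, 0.0, 0.9, 0.8, 0.1],
    [0.2, 0.1, 0.8, 0.9, 0.0],
]
points_3d = pca_reduce(embeddings)   # one [x, y, z] position per file
edges = knn_edges(points_3d, k=2)    # graph edges between similar files
```

Each entry of `points_3d` can serve as a node position and each pair in `edges` as a link between two nodes in the 3D graph.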

Challenges we ran into

The data on IPFS was too large to download and process, so we embedded the files based only on their metadata.
Co:here's Embed model was unable to process more than 500 lines in one request.
Data retrieval from IPFS was slower than from centralized systems.
We had to determine the best way to summarize the multi-dimensional data in three dimensions.
We were unable to fine-tune the Co:here command model.
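One way to stay under a per-request limit like the 500-line one above is to split the texts into batches before embedding. The sketch below is an assumption about how that could look, not the project's code. `embed_batch` is a hypothetical callable standing in for the real Co:here request, and `fake_embed_batch` only exists so the example runs offline.

```python
MAX_TEXTS_PER_REQUEST = 500  # request limit described in the challenge above

def batched(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

def embed_all(texts, embed_batch, batch_size=MAX_TEXTS_PER_REQUEST):
    """Embed every text by calling embed_batch once per batch, in order."""
    vectors = []
    for batch in batched(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embedding service returned the wrong number of vectors")
        vectors.extend(result)
    return vectors

def fake_embed_batch(batch):
    """Hypothetical stand-in for the real embedding call (not Co:here's API)."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]

headers = ["file %d" % i for i in range(1234)]
vectors = embed_all(headers, fake_embed_batch)  # sent as batches of 500, 500, 234
```

Keeping the embedding call behind a plain function argument makes it easy to swap in the real client or add retries without touching the batching logic.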

Accomplishments that we're proud of

Reverse engineering the Estuary API to access all files hosted on IPFS through the miners, using multiple Go scripts with parallel processing.
Concurrent performance while fetching and formatting the file headers from the network.
Handling large data in an efficient pipeline.
Using Co:here embeddings and principal component analysis to generate 3D vectors with minimal information loss.
The efficient and intuitive representation of the collected data, which was categorized with k nearest neighbors.
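The concurrent fetching mentioned above was written in Go. As a rough Python analogue (threads and a semaphore instead of goroutines, so a swapped-in technique rather than the original code), bounded concurrency can be sketched as follows. `fetch_all`, `fake_fetch_header`, and the `cid-N` identifiers are hypothetical names for illustration.

```python
import threading
import time

def fetch_all(cids, fetch_header, max_in_flight=8):
    """Fetch every header while at most max_in_flight fetches run at once."""
    tokens = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    headers, errors = {}, {}

    def worker(cid):
        try:
            header = fetch_header(cid)
            with lock:
                headers[cid] = header
        except Exception as exc:  # record the failure and keep going
            with lock:
                errors[cid] = exc
        finally:
            tokens.release()  # hand the token back for the next fetch

    threads = []
    for cid in cids:
        tokens.acquire()  # blocks while max_in_flight fetches are running
        thread = threading.Thread(target=worker, args=(cid,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return headers, errors

def fake_fetch_header(cid):
    """Hypothetical stand-in for a real network call."""
    time.sleep(0.001)
    return {"cid": cid, "content_type": "text/plain"}

headers, errors = fetch_all(
    [f"cid-{i}" for i in range(20)], fake_fetch_header, max_in_flight=4
)
```

Taking a token before starting each worker caps the number of simultaneous requests, so a slow or rate-limited endpoint is not flooded.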

What we learned

This hackathon was an opportunity to learn countless things, but we would like to highlight a couple. First, we learned about useful technologies that made the project possible, including the Estuary and Co:here APIs, and we improved our ability to code in Python, Golang, and JavaScript. In addition, the presentations hosted by various sponsors were a great opportunity to meet successful people in the technology field and get their advice on the future of technology and on how to improve ourselves, both as team members and technically.

What's next for IPFE: InterPlanetary File Explorer

Because storage and time constraints kept us from processing full file contents during the vector embedding step, IPFE could be improved by letting file content influence each file's embedding, producing a more accurate graph. Additionally, we were only able to scratch the surface of the number of files on IPFS. The project could be scaled up to many more files, where individual "InterPlanetary Clusters" of similar files make up a whole "galaxy" of files that can be visually inspected and explored.

Built With

co:here
estuary
go
python
three.js

Submitted to

UofTHacks X
- Winner: Protocol Labs - Best Use of Estuary

Created by

Worked on the visualization of interplanetary files with Three.js. Animated and produced the video.

If you're interested in what we built, we'd love to hear your ideas about where to go next.

At the moment, I'm exploring internship opportunities for summer 2023, so feel free to connect with me at [email protected].

Michael Xu
Software Engineering at UWaterloo
I concurrently scraped 69k files on IPFS with Estuary's API by implementing a token-based semaphore to manage goroutines. For data processing, I implemented buckets to evenly distribute files by content type.

Hussein Elguindi
Software Engineering student at the University of Waterloo.
In this project, I was responsible for the data analysis with Python. I formatted the headers of each file, fed them into Co:here's Embed model, and reduced the resulting embedding vectors to 3 dimensions with principal component analysis.

Arnav Kumar
Computer Science student at the University of Waterloo
Youssef Soliman
hacker