DiscoDB

What is DiscoDB?

With the theme, "Making Memories," we wondered, "what memories do computer scientists have?" RAM. However, downloading free RAM turns out to be an online scam that may give you viruses. Fortunately, we've discovered the next best thing: free, unlimited storage in the form of a NoSQL database (all hosted on a Discord server :D).

How It Works

The Stack

DiscoDB is powered by Flask and Requests, a Python web framework and a Python HTTP library. These lightweight tools gave us the power and flexibility to integrate the Discord Web API with our own web server. Using Postman, we were able to watch our API and Discord interact in real time. Finally, our demo was deployed on an AWS EC2 instance running NGINX, which forwards traffic on port 80 to our application.

Database Design

Given Discord’s channels and messages format, we naturally leaned towards creating a document-based database. We modeled the Discord server on a MongoDB database: channels are collections and messages are documents. We then built the corresponding endpoints in Flask. This gave us the basis for how we organized and queried data down the road.
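To make the mapping concrete, here is a minimal sketch of how a document could turn into a Discord message and back. The function names and the `_id` field are our own illustration rather than the actual DiscoDB code; the URL and the `Bot` authorization header follow Discord's REST API for creating a message.

```python
import json

API_BASE = "https://discord.com/api/v10"  # Discord REST base URL


def insert_request(channel_id: str, document: dict, bot_token: str) -> dict:
    """Describe the HTTP call that stores `document` as one message.

    The channel plays the role of the collection, and the message body
    holds the document as a JSON string.
    """
    return {
        "method": "POST",
        "url": f"{API_BASE}/channels/{channel_id}/messages",
        "headers": {"Authorization": f"Bot {bot_token}"},
        "json": {"content": json.dumps(document)},
    }


def document_from_message(message: dict) -> dict:
    """Turn a Discord message object back into a document.

    Assumption: the message id doubles as the document id, stored
    under a hypothetical `_id` key.
    """
    document = json.loads(message["content"])
    document["_id"] = message["id"]
    return document
```

The dictionary returned by `insert_request` can be passed straight to Requests, for example `requests.request(**insert_request(channel_id, doc, token))`, which keeps the Flask endpoint itself very thin.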

Challenges We Faced (and Learned From)

Concurrency

Originally, we planned to use Discord.py alongside Flask. However, running Discord.py blocked the process, and cross-application communication proved difficult. We looked into solutions such as message brokers and task queues. However, with the help of a mentor, we eventually settled on dropping Discord.py altogether. Instead, we integrated the Discord Web API with our Flask server using the Requests library. While we didn’t end up using a message broker, our studies would prove useful later on…

Querying

Querying was so hard! After scouring the niche-est websites known to man, we decided to stick with the naive approach. But! Discord only lets us fetch messages using three parameters: before, after, and limit (how many messages we want, up to 100). After a bit of tinkering, we discovered that the results always came back in a consistent order.

From there, it was almost Leetcode-like. We used the last message in the current group of 100 messages as the bound for fetching the next group (similar to Linked List problems). We then checked whether each document's attributes matched our query using a depth-first search (DFS).
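The loop described above can be sketched roughly as follows. This is an illustration under assumptions, not our exact code: `fetch_page` is a hypothetical callable wrapping Discord's "get channel messages" call, and we assume it returns up to 100 messages, newest first, that are older than the given `before` id.

```python
import json
from typing import Callable, Iterator, Optional

PAGE_SIZE = 100  # Discord returns at most 100 messages per request


def matches(document, query) -> bool:
    """Depth-first check that every key in `query` matches `document`."""
    if isinstance(query, dict):
        return isinstance(document, dict) and all(
            key in document and matches(document[key], value)
            for key, value in query.items()
        )
    return document == query


def scan(fetch_page: Callable[[Optional[str]], list], query: dict) -> Iterator[dict]:
    """Walk a channel page by page and yield documents matching `query`."""
    before = None  # None means "start from the newest message"
    while True:
        page = fetch_page(before)
        for message in page:
            document = json.loads(message["content"])
            if matches(document, query):
                yield document
        if len(page) < PAGE_SIZE:
            return  # a short page means we reached the end of the channel
        # The last message of this page bounds the next request,
        # much like following a `next` pointer in a linked list.
        before = page[-1]["id"]
```

Because `scan` is a generator, a caller can stop early (for example after the first match) without fetching the rest of the channel.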

String Encoding

Apparently being a database was NOT Discord’s intention. While storing data in Discord, we had to stringify JSON. This was a bit annoying, but not particularly bad while we were working on CRUD operations. However, authentication was another beast. Because the raw bytes matter in encryption, we ended up having to convert between Base64, bytes, strings, etc. just to store simple secrets.
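As a rough illustration of that round trip, the sketch below stores raw bytes inside a JSON string by passing them through Base64. The function and field names are hypothetical, and the real DiscoDB records may be shaped differently.

```python
import base64
import json


def pack_secret(username: str, secret: bytes) -> str:
    """Return a JSON string that is safe to post as a message body.

    JSON cannot hold raw bytes, so the secret goes bytes -> Base64 -> str.
    """
    encoded = base64.b64encode(secret).decode("ascii")
    return json.dumps({"username": username, "secret": encoded})


def unpack_secret(content: str) -> bytes:
    """Reverse of `pack_secret`: str -> Base64 -> the original bytes."""
    return base64.b64decode(json.loads(content)["secret"])
```

Decoding arbitrary bytes directly with `.decode("utf-8")` can fail or change the data, which is why the Base64 step sits in between.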

Rate-Limits

After making large requests, we started getting rate-limit responses. However, using the aforementioned research on concurrency, we knew what we needed: a task scheduler with a throttle. While we didn’t have time to fully set one up, we found the ratelimit Python library, which offers a task deferral decorator. It ended up being critical in ensuring the proper processing of user requests.
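To show the idea behind such a decorator, here is a small standard-library stand-in. It is not the ratelimit library's implementation or API; `throttle` and its `clock` and `sleep` parameters are names we made up for this sketch, and it is not thread-safe.

```python
import time
from functools import wraps


def throttle(calls: int, period: float, clock=time.monotonic, sleep=time.sleep):
    """Allow at most `calls` invocations per `period` seconds.

    Extra calls are deferred (the caller sleeps) instead of being dropped.
    """
    def decorator(func):
        timestamps = []  # start times of the calls inside the current window

        @wraps(func)
        def wrapper(*args, **kwargs):
            while True:
                now = clock()
                # Forget calls that have aged out of the window.
                while timestamps and now - timestamps[0] >= period:
                    timestamps.pop(0)
                if len(timestamps) < calls:
                    break
                # Window is full: wait until the oldest call expires.
                sleep(period - (now - timestamps[0]))
            timestamps.append(now)
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Wrapping the function that talks to Discord, for example with `@throttle(calls=5, period=1.0)`, keeps bursts of user requests under a chosen budget; the numbers here are placeholders, not Discord's actual limits.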

Cool Optimizations

Authenticating Using Message_ID

This was a fun little optimization. Before, when verifying JWTs, we would search the ENTIRE user database. However, on login, users are handed a JWT. We realized we could also hand them the message ID that represents their account. They then submit it within their request headers to authorize API access.
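A minimal sketch of that check is below. Everything named here is an assumption for illustration: the `X-Account-Message-Id` header, the `username` and `sub` fields, and the two injected callables (`fetch_message` for a single-message lookup, `verify_jwt` for whatever JWT library is in use).

```python
import json
from typing import Callable, Optional


def authorize(
    headers: dict,
    fetch_message: Callable[[str], Optional[dict]],
    verify_jwt: Callable[[str], Optional[dict]],
) -> bool:
    """Authorize a request with one message lookup instead of a full scan."""
    claims = verify_jwt(headers.get("Authorization", ""))
    message_id = headers.get("X-Account-Message-Id")  # hypothetical header name
    if claims is None or message_id is None:
        return False

    message = fetch_message(message_id)  # direct fetch by id
    if message is None:
        return False

    account = json.loads(message["content"])
    # The token must belong to the account stored in that message;
    # otherwise a user could present someone else's message id.
    return account.get("username") == claims.get("sub")
```

The final comparison matters: the message ID only tells the server where to look, while the JWT still proves who is asking.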

What's Next?

Bot Swarms, Sharding, and Token Rotation

Using multiple tokens and bots, we could load balance the application and “increase” the maximum requests per second. This could look like a simple deployment of an AWS Application Load Balancer and a group of EC2 instances, or a complicated sharding method. Either way, we know that we’ll be using multiple bots at a time down the line (edits in particular are difficult to handle).
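The simplest form of token rotation could look like the round-robin sketch below. `TokenPool` is a hypothetical helper, not something DiscoDB has today, and it ignores per-bot rate-limit state.

```python
from itertools import cycle


class TokenPool:
    """Hand out bot tokens in round-robin order."""

    def __init__(self, tokens):
        if not tokens:
            raise ValueError("at least one bot token is required")
        self._tokens = cycle(tokens)

    def next_headers(self) -> dict:
        """Return request headers that use the next bot's token."""
        return {"Authorization": f"Bot {next(self._tokens)}"}
```

One likely reason edits are harder: Discord only lets a message's author edit it, so an edit would have to be routed to the same bot that originally wrote the message rather than to whichever bot is next in the rotation.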

Optimize Query Endpoint Using Indexing Optimizations

Currently, the Query endpoint uses a naive approach of “collection scanning” to query matches. However, using indexing optimizations (which we don’t fully understand yet), we know we can get our query times down. This will require reading up on document database theory.

Switching to a Dedicated Task Scheduler

While the ratelimit library is incredibly convenient, it’s also part of the main program… and thus in memory. If an instance went down, clients would lose multiple requests, resulting in a mismatch between the database's state and the client's understanding of it. Using RabbitMQ or Redis, we could return a processed notification and send a server-sent event when the background task was complete. This would greatly improve the stability of DiscoDB.

Building a Frontend

From AWS to CockroachDB, most modern databases have some way to allow admins to view what exactly is going on. Creating a GUI would be a good way to gain some traction in the database world.

Migrating Languages

While Python provides many libraries that made developing DiscoDB much easier, it is simply too slow to be appealing to some developers. Switching to a compiled language such as Rust or C++ may provide the performance that engineers are looking for before committing to DiscoDB. Fortunately, with our mostly Requests-driven code, this may be easier than… py.