Inspiration

I always fear that something on the internet may one day be deleted or removed, never to be seen again. To combat that, I always try to archive whenever I can. This project came about from a need to easily and automatically record information about the things I'm archiving, especially videos.

A great example of this is that I long ago lost the original files for my old videos on YouTube. My program is a quick and easy way to have them organized somewhere on my computer, so that I do not have to worry about losing my account or YouTube going down.

What it does

It takes any YouTube link, along with a connection string to one's personal CockroachDB database, and scans the webpage for links to videos, collecting each video's link, name, creator, and more. It then stores that data in a CockroachDB relational database for easy access later. It also provides a basic way to search the database for matching video names and creator names.
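As a rough illustration of that interface, here is a minimal argparse sketch. The argument names (`link`, `connection`, `--search`) and the example values are my assumptions for illustration, not necessarily what the program actually uses.

```python
import argparse


def build_parser():
    """Build the command-line parser; every argument name here is hypothetical."""
    parser = argparse.ArgumentParser(
        description="Record metadata for the videos linked from a YouTube page."
    )
    parser.add_argument("link", help="YouTube page to scan for video links")
    parser.add_argument("connection", help="CockroachDB connection string")
    parser.add_argument(
        "--search",
        metavar="TEXT",
        help="also list stored videos whose name or creator matches TEXT",
    )
    return parser


# Example invocation with placeholder values (not real credentials).
args = build_parser().parse_args(
    [
        "https://www.youtube.com/user/example/videos",
        "postgresql://user:password@host:26257/defaultdb",
        "--search",
        "cats",
    ]
)
print(args.link, args.search)
```

Keeping the link and connection string as positional arguments mirrors the description above, where both are always required; the search text is optional.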

How I built it

I used Python 3 with the urllib, argparse, re, and psycopg2 libraries to create the main codebase. Instead of string methods, I used regular expressions to parse the webpage HTML and grab the desired data, as this was faster and made my code more concise (no other libraries or web applications were used). I used CockroachDB to store the data, as well as its client to check the database. I coded everything using Emacs on Ubuntu.
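To give a feel for the regex approach, here is a small sketch that pulls video links out of a page's HTML using only urllib and re. The pattern assumes watch links look like `/watch?v=` followed by an 11-character ID; that is an assumption about YouTube's markup, which can change, and the function names are made up for this example rather than taken from the project.

```python
import re
import urllib.request

# Assumed pattern: watch links look like /watch?v=<11-character id>.
VIDEO_ID_PATTERN = re.compile(r"/watch\?v=([A-Za-z0-9_-]{11})")


def fetch_html(url):
    """Download a page and decode it as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_video_links(html):
    """Return unique watch URLs in the order they first appear in the HTML."""
    seen = set()
    links = []
    for video_id in VIDEO_ID_PATTERN.findall(html):
        if video_id not in seen:
            seen.add(video_id)
            links.append("https://www.youtube.com/watch?v=" + video_id)
    return links


# A hand-written sample instead of a live page, so this runs offline.
sample = (
    '<a href="/watch?v=abcdefghijk">One</a> '
    '<a href="/watch?v=abcdefghijk">One again</a> '
    '<a href="/watch?v=ABCDEFGH_-1">Two</a>'
)
print(extract_video_links(sample))
```

The same link often appears several times on a page (thumbnail, title, and so on), which is why the sketch removes duplicates while keeping the original order.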

Challenges I ran into

I did not know Python, regular expressions, or CockroachDB SQL before beginning this project, so the main challenge was definitely learning these on the fly to accomplish my goals! Since I was not familiar with any of them, debugging any issues I had was definitely a struggle. For example, I didn't realize my regex for one method was wrong for half an hour because it wasn't immediately apparent that it was wrong in the first place, let alone why it was wrong!

I also came to learn that web crawling is not very easy, and that without an external package, scraping a JavaScript-driven website is quite difficult. Unfortunately, I was not able to make my program scrape more links than a page displays by default: it cannot click "load more" or scroll down to load more.

Accomplishments that I'm proud of

As I mentioned above, I'm definitely proud to be able to say I can use Python, regex, and CockroachDB now. The program also performs its tasks better than I expected I could make it!

I am definitely proud of the fact that I successfully made a program that can get information from a website and turn it into database entries, especially considering that the only external package is psycopg2 for connecting to CockroachDB.

What I learned

I learned Python syntax and how to use Python, including its sets, tuples, and the "with" statement. The lack of static typing really messes with me! I also learned how to interface with CockroachDB through its CLI, as well as through psycopg2, which made using Python to modify databases in CockroachDB very easy. Finally, I learned quite a bit about regular expressions and how to use them. They're scattered throughout my project to find the strings of data I was looking for.
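Here is a minimal sketch of how the "with" statement and psycopg2 fit together for storing and searching rows. The table name `videos`, its exact columns, and the helper names are assumptions for illustration; only the link, name, creator, and mirror fields are mentioned in this write-up, and the real schema may differ.

```python
# Hypothetical schema: only link, name, creator, and mirror come from the write-up.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS videos (
    link STRING PRIMARY KEY,
    name STRING,
    creator STRING,
    mirror STRING
)
"""

# mirror starts out NULL; re-scanning a page skips links that are already stored.
INSERT_SQL = """
INSERT INTO videos (link, name, creator, mirror)
VALUES (%s, %s, %s, NULL)
ON CONFLICT (link) DO NOTHING
"""

SEARCH_SQL = """
SELECT link, name, creator FROM videos
WHERE name ILIKE %s OR creator ILIKE %s
"""


def to_row(video):
    """Turn a dict of scraped fields into the tuple INSERT_SQL expects."""
    return (video["link"], video["name"], video["creator"])


def search_params(text):
    """Wrap the search text in wildcards for a case-insensitive substring match."""
    pattern = "%" + text + "%"
    return (pattern, pattern)


def store_videos(connection_string, videos):
    import psycopg2  # imported here so the helpers above work without the driver

    conn = psycopg2.connect(connection_string)
    try:
        with conn:  # commits on success, rolls back on an exception
            with conn.cursor() as cur:
                cur.execute(CREATE_TABLE_SQL)
        with conn:  # separate transaction for the inserts
            with conn.cursor() as cur:
                cur.executemany(INSERT_SQL, [to_row(v) for v in videos])
    finally:
        conn.close()  # "with conn" ends the transaction but does not close


def search_videos(connection_string, text):
    import psycopg2

    conn = psycopg2.connect(connection_string)
    try:
        with conn, conn.cursor() as cur:
            cur.execute(SEARCH_SQL, search_params(text))
            return cur.fetchall()
    finally:
        conn.close()
```

Passing values as query parameters (the `%s` placeholders) rather than formatting them into the SQL string lets psycopg2 handle quoting, which matters when video titles contain apostrophes or other special characters.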

What's next for Venerable_Videos

I want to support all sorts of websites and services that may include videos on their webpages, such as Vimeo or Dailymotion. I made as much of my code as flexible as possible for this purpose. I would also like to provide more command options to make the program more versatile, use Selenium to allow the loading of more videos, create a server and front-end to archive videos remotely, and create an easier method of querying the database. Most importantly, I want to link my program to youtube-dl or a similar program that can download, store, and archive the videos whose information is saved by my program. That way, the mirror column of the database will be able to point to a copy (local or on a server) of every video (for personal use, like saving one's own videos, of course!). This is why I wrote some code with that future expansion in mind, and why the database has a mirror column that is currently null.
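A rough sketch of what the youtube-dl link-up could look like is below. None of this exists in the project yet: the function names, the `videos` table name, and the archive directory are hypothetical, and the sketch assumes youtube-dl is installed and on the PATH. It uses youtube-dl's `-o` output template option to name each file after its video ID.

```python
import os
import subprocess

# Hypothetical statement for filling in the mirror column once a copy exists.
UPDATE_MIRROR_SQL = "UPDATE videos SET mirror = %s WHERE link = %s"


def build_download_command(link, archive_dir):
    """Build a youtube-dl command that names the file after the video id."""
    output_template = os.path.join(archive_dir, "%(id)s.%(ext)s")
    return ["youtube-dl", "-o", output_template, link]


def download(link, archive_dir):
    """Run youtube-dl; raises CalledProcessError if the download fails."""
    subprocess.run(build_download_command(link, archive_dir), check=True)


# Building the command does not require youtube-dl to be installed.
print(build_download_command("https://www.youtube.com/watch?v=abcdefghijk", "archive"))
```

Once a download succeeds, the path of the saved file could be written to the mirror column with a statement like `UPDATE_MIRROR_SQL`; how that path is determined is left open here.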

Best First Time Hack

I probably used Python, SQL, and regex about once each in the past. Bringing them together in a project such as this took a lot of learning, trial, and error. I spent about 18 hours total on this project, and I can't say I have much to show for it other than all that I've learned. But for my first hackathon and first true foray into all three of these, I can't say I did horribly either!

Best Open Source Hack

youtube-dl is an open-source program used to download and archive all sorts of videos on the internet. My program fits very well with the idea of downloading, archiving, storing, and organizing a wide variety of videos from anywhere on the internet (for personal use, like saving one's own videos, of course!). All that remains is to explicitly implement the connection between the two.

Best Use of CockroachDB Serverless

This project uses CockroachDB Serverless to host and store all database information relating to the videos examined by my program. My program's flexibility and freedom to expand are a good fit for CockroachDB's architecture, which will gladly accommodate its expansion!
