Nico Calderon [email protected]
Ethan Choi [email protected]
Kimberly Martin [email protected]
Miguel Paulo Riano [email protected]
Anthony Safo [email protected] (team coordinator)
Overview: A subreddit recommendation system where the user types a query pertaining to topics they are interested in, and the system uses the query and text data from the "hot" posts of the "top" active subreddits to recommend a list of subreddits to the user.
- redditScraper.py - code for scraping reddit and its subreddits. It uses PRAW to retrieve data from reddit, and pandas to store and organize the relationships between reddit, subreddits and submissions.
- flask_main.py - code for the web application; it uses the Flask library as its micro web framework
- output-submissions.csv - contains the submissions scraped from reddit in csv format, comma delimited
- output-submissions.xml - contains the submissions scraped from reddit in xml format
- output-subreddits.csv - contains the subreddits scraped from reddit in csv format, comma delimited
- ranker.py - code for the ranking calculations over the scraped subreddit and submission data
- results.csv - contains the ranked results
Description: The scraper module is responsible for scraping reddit, its subreddits and their submissions. This module uses PRAW and pandas. PRAW is the Python Reddit API Wrapper, used to mine the text data from reddit. pandas is the library used for manipulating the retrieved data. PRAW is installed via pip and requires a reddit account and the creation of a reddit instance, which authenticates with a client id, a client secret, a user id and a password.
The code file for this module is redditScraper.py
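The authentication and scraping steps described above can be sketched roughly as follows. This is not the contents of redditScraper.py: the environment variable names, the user agent string, the helper names and the choice of `reddit.subreddits.popular` as the source of "top" subreddits are all assumptions made for illustration.

```python
import os


def load_credentials(env=os.environ):
    """Collect PRAW credentials from environment variables.

    The variable names (REDDIT_CLIENT_ID, ...) are hypothetical.
    """
    keys = ("client_id", "client_secret", "username", "password")
    creds = {key: env["REDDIT_" + key.upper()] for key in keys}
    # PRAW also expects a user agent; this default string is a placeholder.
    creds["user_agent"] = env.get("REDDIT_USER_AGENT", "subreddit-recommender-sketch")
    return creds


def scrape_hot_posts(creds, n_subreddits=5, n_submissions=5):
    """Return one dict per "hot" submission of the popular subreddits."""
    import praw  # imported lazily so load_credentials works without PRAW installed

    reddit = praw.Reddit(**creds)
    rows = []
    # Assumption: "top" subreddits are approximated by PRAW's popular listing.
    for subreddit in reddit.subreddits.popular(limit=n_subreddits):
        for submission in subreddit.hot(limit=n_submissions):
            rows.append({
                "subreddit": subreddit.display_name,
                "title": submission.title,
                "text": submission.selftext,
            })
    return rows
```

The returned rows could then be handed to pandas, for example with `pandas.DataFrame(rows).to_csv("output-submissions.csv", index=False)`, to produce a comma-delimited file.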
Usage: to use this tool, simply type "python redditScraper.py -l1 10 -l2 10"
- The first parameter "-l1" is the number of subreddits you want to scrape. The default is 5, and you can put any integer value
- The second parameter "-l2" is the number of submissions you want to scrape. The default is 5, and you can put any integer value
- Once it runs, it generates the following files: output-submissions.csv, output-submissions.xml and output-subreddits.csv
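For readers curious how the "-l1" and "-l2" options could be parsed, here is a minimal sketch using the standard argparse module. It only mirrors the flags and defaults documented above; whether redditScraper.py actually uses argparse, and the `build_parser` name, are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the two documented options (both default to 5)."""
    parser = argparse.ArgumentParser(
        description="Scrape subreddits and their submissions.")
    parser.add_argument("-l1", type=int, default=5,
                        help="number of subreddits to scrape")
    parser.add_argument("-l2", type=int, default=5,
                        help="number of submissions to scrape")
    return parser


# Equivalent to: python redditScraper.py -l1 10 -l2 10
args = build_parser().parse_args(["-l1", "10", "-l2", "10"])
print("subreddits:", args.l1, "submissions:", args.l2)
```

With `type=int`, a non-integer value such as `-l1 ten` is rejected with a usage message instead of reaching the scraper.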
- Description: The ranker module is responsible for ranking the list of subreddits and submissions retrieved. It uses BM25 as the main ranking function.
- The code file for this module is ranker.py
- Usage: To run ranker.py type 'python ranker.py -n 10 -q "subreddit ranker"'
- Edit 10 and "subreddit ranker" to be the number of results and the query, respectively.
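To illustrate the kind of scoring BM25 performs, the following is a small, self-contained sketch of the common Okapi BM25 formula. It is not taken from ranker.py: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Score each document against the query; return (name, score) best-first.

    documents maps a name (e.g. a subreddit) to its concatenated post text.
    """
    if not documents:
        return []
    tokens = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokens)
    avg_len = sum(len(terms) for terms in tokens.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for terms in tokens.values():
        doc_freq.update(set(terms))
    scores = {}
    for name, terms in tokens.items():
        tf = Counter(terms)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented purely for illustration.
docs = {
    "r/python": "python code tips and python libraries",
    "r/cooking": "recipes and cooking tips",
    "r/gaming": "video game news",
}
for name, score in bm25_rank("python tips", docs)[:2]:
    print(name, round(score, 3))
```

Slicing the sorted list, as in the last loop, plays the role of the "-n" option: it keeps only the requested number of results.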
- Description: The UI is done in HTML/CSS and the framework for the web application is Flask.
- The code file for this module is flask_main.py
- Usage: to start the website, type "python flask_main.py" in a terminal. This prints an address to enter into a web browser.
- On the website, type your query into the textbox and press submit.
- This will send you to another page that lists (up to) 10 subreddits we recommend based on your query.
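The query-then-results flow described above could look roughly like the sketch below. It is not the contents of flask_main.py: the route names, the inline templates and the `recommend` stand-in (which returns a fixed list instead of calling the ranker) are hypothetical.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Inline templates keep the sketch in one file; a real app would likely use template files.
FORM = """<form method="post" action="/recommend">
  <input type="text" name="query">
  <input type="submit" value="Submit">
</form>"""

RESULTS = "<ol>{% for name in subreddits %}<li>{{ name }}</li>{% endfor %}</ol>"


def recommend(query, limit=10):
    """Hypothetical stand-in for the ranker: returns a fixed list, ignoring the query."""
    candidates = ["r/python", "r/learnprogramming", "r/datascience"]
    return candidates[:limit]


@app.route("/")
def index():
    # Page with the query textbox and submit button.
    return FORM


@app.route("/recommend", methods=["POST"])
def results():
    # Read the submitted query and list up to 10 recommended subreddits.
    query = request.form.get("query", "")
    return render_template_string(RESULTS, subreddits=recommend(query))
```

Calling `app.run()` at the end of such a file starts Flask's development server, which prints the local address to open in a browser.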
- Kimberly and Miguel worked on using PRAW and pandas to scrape and index text data from various subreddits.
- Ethan and Anthony worked on implementing the ranking algorithm using the mined subreddit post text data.
- Nico worked on implementing the interface for the recommendation system based on the results from the ranking algorithm.
For a tutorial video detailing the usage of the software, visit https://youtu.be/FlrMZn2ehzU