CS410 - Final Project Report

Team Reddit Recommenders

Nico Calderon [email protected]
Ethan Choi [email protected]
Kimberly Martin [email protected]
Miguel Paulo Riano [email protected]
Anthony Safo [email protected] (team coordinator)

Overview: A subreddit recommendation system where the user types a query pertaining to topics they are interested in, and the system uses the query and text data from the "hot" posts of the "top" active subreddits to recommend a list of subreddits to the user.

The project file structure for Reddit Recommenders is as follows:

  • redditScraper.py - code for scraping Reddit and its subreddits. It uses PRAW to retrieve data from Reddit and pandas to store and organize the relationships between Reddit, subreddits, and submissions.

  • flask_main.py - code for the web application, which uses the Flask library as its micro web framework

  • output-submissions.csv - contains the submissions scraped from reddit in csv format, comma delimited

  • output-submissions.xml - contains the submissions scraped from reddit in xml format

  • output-subreddits.csv - contains the subreddits scraped from reddit in csv format, comma delimited

  • ranker.py - code for the ranking calculations over the scraped Reddit/subreddit data

  • results.csv - contains the ranked results

Module Details

1. Scraper Module

  • Description: The scraper module is responsible for scraping Reddit, its subreddits, and their submissions. It uses PRAW and pandas. PRAW (the Python Reddit API Wrapper) is used to mine the text data from Reddit, and pandas is used to manipulate the retrieved data. PRAW is installed via pip and requires a Reddit account and the creation of a Reddit instance, which authenticates with a client ID, a client secret, a user ID, and a password.

  • The code file for this module is redditScraper.py

  • Usage: to use this tool, simply type "python redditScraper.py -l1 10 -l2 10"

    • The first parameter, "-l1", is the number of subreddits to scrape. The default is 5, and any integer value is accepted.
    • The second parameter, "-l2", is the number of submissions to scrape. The default is 5, and any integer value is accepted.
    • Once it runs, it generates the following files: output-submissions.csv, output-submissions.xml, and output-subreddits.csv
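
The two command-line parameters described above could be parsed as in the following minimal sketch. It only illustrates the documented flags and their defaults; the function name parse_args and the attribute names subreddit_limit and submission_limit are hypothetical and are not taken from redditScraper.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the scraper's command-line flags (names below are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Scrape subreddits and their submissions."
    )
    # -l1: number of subreddits to scrape (documented default: 5)
    parser.add_argument("-l1", dest="subreddit_limit", type=int, default=5,
                        help="number of subreddits to scrape")
    # -l2: number of submissions to scrape (documented default: 5)
    parser.add_argument("-l2", dest="submission_limit", type=int, default=5,
                        help="number of submissions to scrape")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.subreddit_limit, args.submission_limit)
```

Saved under a hypothetical name such as cli_sketch.py, running "python cli_sketch.py -l1 10 -l2 10" sets both limits to 10, and omitting a flag falls back to the default of 5.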

2. Ranker Module

  • Description: The ranker module is responsible for ranking the subreddits and submissions retrieved by the scraper. It uses BM25 as the main ranking function.
  • The code file for this module is ranker.py
  • Usage: To run ranker.py type 'python ranker.py -n 10 -q "subreddit ranker"'
    • Replace 10 and "subreddit ranker" with the desired number of results and the query, respectively.
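
For readers unfamiliar with BM25, the following is a minimal, self-contained sketch of Okapi BM25 scoring over a handful of subreddit texts. It is not the code in ranker.py: the function name bm25_rank, the whitespace tokenizer, the parameter values k1 = 1.5 and b = 0.75, the non-negative IDF variant, and the choice to drop zero-score subreddits are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_rank(query, docs, n=10, k1=1.5, b=0.75):
    """Return up to n (name, score) pairs, best first.

    docs maps a subreddit name to its concatenated post text.
    k1 and b are common BM25 defaults, assumed here.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    num_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / max(num_docs, 1)

    # Document frequency: in how many subreddits each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log(1 + (num_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:  # assumption: skip subreddits with no matching terms
            scores[name] = score

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    sample = {
        "r/python": "python programming questions and python tips",
        "r/cooking": "recipes for cooking dinner",
        "r/news": "world news",
    }
    for name, score in bm25_rank("python programming", sample, n=10):
        print(name, round(score, 3))
```

Because subreddits without any query term are dropped in this sketch, the result list can be shorter than n.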

3. UI/Flask Module

  • Description: The UI is done in HTML/CSS and the framework for the web application is Flask.
  • The code file for this module is flask_main.py
  • Usage: to start the website, run "python flask_main.py" from a terminal. This prints an address to open in a web browser.
    • On the website, type your query into the textbox and press submit.
    • This will send you to another page that lists (up to) 10 subreddits we recommend based on your query.
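
The query-then-results flow described above can be sketched with a small Flask app. This is an illustration only, not the contents of flask_main.py: the route paths, the inline templates, the form field name "query", and the recommend placeholder (which stands in for the real ranker) are all hypothetical.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline templates keep the sketch in one file; a real app may use template files.
FORM = """<form method="post" action="/results">
  <input type="text" name="query">
  <input type="submit" value="Submit">
</form>"""

RESULTS = "<ol>{% for name in subreddits %}<li>{{ name }}</li>{% endfor %}</ol>"


def recommend(query, limit=10):
    """Hypothetical stand-in for the ranker: simple substring matching."""
    catalog = ["r/python", "r/learnprogramming", "r/datascience"]
    words = query.lower().split()
    hits = [name for name in catalog if any(word in name for word in words)]
    return hits[:limit]


@app.route("/")
def index():
    # Page with the query textbox and submit button.
    return FORM


@app.route("/results", methods=["POST"])
def results():
    # Read the submitted query and list up to 10 recommended subreddits.
    query = request.form.get("query", "")
    return render_template_string(RESULTS, subreddits=recommend(query))


if __name__ == "__main__":
    app.run()
```

Starting the sketch with Python launches Flask's development server, which prints the local address to open in a browser.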

Team Member Contributions

-Kimberly and Miguel worked on using PRAW and pandas to scrape and index text data from various subreddits

-Ethan and Anthony worked on implementing the ranking algorithm using the mined subreddit post text data

-Nico worked on implementing the interface for the recommendation system based on the results from the ranking algorithm.

Software Tutorial Presentation

For a tutorial video detailing the usage of the software, visit https://youtu.be/FlrMZn2ehzU