masamip2/CourseProject

 
 

CS 410 - Project: Causal Topic Modeling

Overview

This project performs causal topic modeling on MLB (Major League Baseball) articles and analyzes the hidden trend topics it identifies. You can test the model without installing anything on your machine by following the section The Software Usage and Testing the Software, which most people will likely prefer. If you would like to reproduce the development environment of this project, see the section The Software Development and Installation.

Author

Team: MP

Name NetId
Masami Peak [email protected]

Introduction

The annual MVP (Most Valuable Player) of each of the two leagues is chosen by voters from the Baseball Writers’ Association of America. Because the voters are writers, topic modeling on MLB-related articles published before the MVP announcement can be used to discover the key topics correlated with that year's MVP.

LDA Algorithm

Latent Dirichlet Allocation (LDA) is one of the most widely used topic models; it extracts the topics discussed in a collection of documents. In the LDA model, each document has a topic distribution (a vector of topic probabilities), and each topic has a word distribution (a vector of word probabilities). Each topic distribution is drawn from a Dirichlet distribution with the parameter vector α, and each word distribution is drawn from a Dirichlet distribution with the parameter vector β.

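Topic distribution π_d of the d-th document over k topics (drawn from a Dirichlet distribution):

$$\pi_d = (\pi_{d,1}, \pi_{d,2}, \ldots, \pi_{d,k}) \sim \mathrm{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_k)$$

where $\pi_{d,j} = p(\theta_j \mid d)$ is the probability of the j-th topic $\theta_j$ in the d-th document.

Word distribution of the i-th topic θ_i over m words (drawn from a Dirichlet distribution):

$$p(W \mid \theta_i) = (p(w_1 \mid \theta_i), p(w_2 \mid \theta_i), \ldots, p(w_m \mid \theta_i)) \sim \mathrm{Dirichlet}(\beta_1, \beta_2, \ldots, \beta_m)$$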
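The generative story behind these two Dirichlet draws can be illustrated with a small standard-library sketch. It is not the code used in mlb.ipynb (which relies on gensim); the toy vocabulary, the parameter values, and the function names are made up for illustration. A Dirichlet sample is obtained here by normalizing independent Gamma draws.

```python
import random


def sample_dirichlet(params, rng):
    """Draw one sample from Dirichlet(params) by normalizing Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(alpha, topic_word, vocabulary, length, rng):
    """Generate one toy document following the LDA generative story."""
    # Topic distribution of this document, drawn from Dirichlet(alpha).
    pi_d = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(alpha)), weights=pi_d)[0]
        word = rng.choices(vocabulary, weights=topic_word[topic])[0]
        words.append(word)
    return pi_d, words


rng = random.Random(0)
vocabulary = ["homer", "pitch", "trade", "award"]  # toy vocabulary (made up), m = 4
alpha = [0.5, 0.5]                                  # k = 2 topics (illustrative values)
beta = [0.5] * len(vocabulary)                      # illustrative values
# One word distribution per topic, each drawn from Dirichlet(beta).
topic_word = [sample_dirichlet(beta, rng) for _ in alpha]
pi_d, words = generate_document(alpha, topic_word, vocabulary, 8, rng)
print(pi_d, words)
```

Smaller α or β values tend to give sparser distributions, meaning fewer dominant topics per document or fewer dominant words per topic.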

The Software Usage and Testing the Software

Please go to the Online Jupyter Notebook of this project and see the section 'How to Use/Test the Software' in the ProjectReport.pdf for the details of the software usage and testing the software.

The Software Implementation

Please see the section 'How the Software Implemented' in the ProjectReport.pdf for the details of the software implementation.

The Software Usage Tutorial Presentation

Please watch the Project Presentation for the software usage tutorial.

The Software Development and Installation

The remainder of this document is additional detailed documentation on how to develop and install the software on Windows 10 Pro (21H1), Mac (Catalina 10.15.7), or Linux (Ubuntu 20.04). It is useful if you would like to set up the development environment of the software.

Prerequisites

Python 3.9, Jupyter Notebook 6.4.5, ChromeDriver (Optional)

Python Modules and Tools Used on Windows

| Module / Tool | Version | Usage | Reference |
| --- | --- | --- | --- |
| Python | 3.9.4 | Programming Language | https://www.python.org/ |
| Python Built-in Modules | 3.9.4 | Commonly Used Functions | calendar, csv, datetime, os, string, re, importlib.util, subprocess, sys, warnings |
| Jupyter Notebook | 6.4.5 | Simulating Code | https://jupyter.org/ |
| bs4 for BeautifulSoup | 0.0.1 | Web Scraping | https://www.crummy.com/software/BeautifulSoup/bs4/doc/ |
| selenium for webdriver | 4.0.0 | Controlling Browser | https://selenium-python.readthedocs.io/getting-started.html |
| pandas | 1.3.4 | DataFrame | https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html |
| nltk | 3.6.5 | Stemming & Lemmatization | https://www.nltk.org/howto/stem.html https://www.nltk.org/api/nltk.stem.wordnet.html |
| gensim | 4.1.2 | Topic Modeling | https://radimrehurek.com/gensim/models/phrases.html |
| matplotlib for pyplot | 3.4.3 | Graphical Plotting | https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html |
| pyLDAvis | 3.3.1 | LDA model visualization | https://pypi.org/project/pyLDAvis/ |
| Chrome | 94.0.4606.71 | Google Browser | https://www.google.com/chrome/ |
| ChromeDriver | 94.0.4606.61 | Executing Webapps | https://chromedriver.chromium.org/downloads |
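To see how an existing environment compares with the third-party versions listed above, a check along the following lines may help. It is only a sketch: the helper name compare_versions and the report layout are hypothetical, and it uses the standard importlib.metadata module (Python 3.8+) rather than anything from this project.

```python
from importlib import metadata

# Versions copied from the list above (third-party packages only).
EXPECTED = {
    "bs4": "0.0.1",
    "selenium": "4.0.0",
    "pandas": "1.3.4",
    "nltk": "3.6.5",
    "gensim": "4.1.2",
    "matplotlib": "3.4.3",
    "pyLDAvis": "3.3.1",
}


def compare_versions(expected=EXPECTED, get_version=metadata.version):
    """Return {package: (expected, installed or None, matches)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (wanted, found, found == wanted)
    return report


for name, (wanted, found, ok) in compare_versions().items():
    status = "ok" if ok else "differs"
    print(f"{name}: expected {wanted}, installed {found} ({status})")
```

A differing version does not necessarily mean the notebook will fail; the listed versions are simply the ones the project was developed with.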

Project File Structure

masamip2/CourseProject (CourseProject-main.zip)
|   mlb.ipynb - for topic modeling and analyzing the outcome
|   ProjectProgress.pdf
|   ProjectProposal.pdf
|   ProjectReport.pdf
|   README.md
|   scraper.py - for web crawling
└───Data/
|   |   articles.csv (the dataset for reproducing the report)
|   |   (the datasets for producing the articles.csv)
|   |   espn.csv - from espn.com
|   |   mlb.csv - from mlb.com
|   |   mvps.csv - a list of the annual MVP in 2 leagues from 2011 to 2021
|   |   nyt.csv - from nytimes.com
|   |   reuters.csv - from reuters.com
|   |   wsj.csv - from wsj.com
└───Model/ (this directory will be created only if to_save=True is configured in the mlb notebook)
|   |   topic_model_{year}.pkl (the LDA model can be saved per {year})
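The per-year model files in Model/ could be written and read along the following lines. This is a hedged sketch, not the notebook's actual code: it assumes the .pkl files are plain pickle files, and the helper names model_path, save_model, and load_model are hypothetical. Only the to_save flag, the Model/ directory, and the topic_model_{year}.pkl naming come from the structure above.

```python
import pickle
from pathlib import Path


def model_path(year, model_dir="Model"):
    """Build the path Model/topic_model_{year}.pkl for a given year."""
    return Path(model_dir) / f"topic_model_{year}.pkl"


def save_model(model, year, to_save=False, model_dir="Model"):
    """Pickle the model for one year; do nothing unless to_save is True."""
    if not to_save:
        return None
    path = model_path(year, model_dir)
    # The directory is created only when something is actually saved.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return path


def load_model(year, model_dir="Model"):
    """Load a previously saved model for one year."""
    with model_path(year, model_dir).open("rb") as handle:
        return pickle.load(handle)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.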

Project Steps

This project has 4 different steps:

  1. Downloading the Project Source Code
  2. Setting Up the Local Environment
  3. Web Crawling to Obtain MLB-Related Articles
  4. Data Preprocessing and Topic Model Analysis

1 Downloading the Project Source Code

Please follow the steps below to download the project zip file.

  1. Go to https://github.com/masamip2/CourseProject .
  2. Click the 'Code' button.
  3. Choose 'Download ZIP'.

NOTE: Alternatively, you can clone the project repository if you prefer to use Git.

git clone https://github.com/masamip2/CourseProject.git

2 Setting Up the Local Environment

2_1 Python and Pip Installation

This project has been verified to work on:

  • Windows 10 Pro (21H1): Python 3.9.4
  • Mac (Catalina 10.15.7): Python 3.9.7
  • Linux (Ubuntu 20.04): Python 3.9.7

If your Python version differs from the versions above, please see the section Python and Pip Installation Windows, Mac, and Linux and consider installing the supported version of Python for this project. Pip is bundled with Python 3.4+.
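A quick way to check whether the running interpreter belongs to the tested 3.9 series is sketched below. The helper name is_supported is hypothetical, and comparing only the major and minor numbers is an assumption about what counts as compatible.

```python
import sys

SUPPORTED = (3, 9)  # major.minor series this project was verified on


def is_supported(version_info=sys.version_info, supported=SUPPORTED):
    """Return True if the interpreter's major.minor matches the tested series."""
    return tuple(version_info[:2]) == supported


current = ".".join(str(part) for part in sys.version_info[:3])
if is_supported():
    print(f"Python {current} is in the tested 3.9 series")
else:
    print(f"Python {current} differs from the tested 3.9 series")
```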

2_2 Finding Out Available Python and Pip Versions

If Python and Pip versions are not up to date, please see the section Finding Out Available Python and Pip Versions Windows, Mac, Linux to upgrade the versions.

2_3 Jupyter Notebook Installation

If Jupyter Notebook is not installed or the version is not up to date, please see the section Jupyter Notebook Installation Windows, Mac, Linux.

2_4 Setting Up Virtual Environment

Please see the section Setting Up Virtual Environment Windows, Mac, Linux to set up the virtual environment 'mlb' for this project. Venv is already installed with Python 3.3+.
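Several later steps must be run either ON or NOT ON the virtual environment 'mlb'. A small check such as the following can tell which case applies; it relies on the standard behavior that sys.prefix differs from sys.base_prefix inside a venv. The function name is hypothetical.

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv."""
    prefix = sys.prefix if prefix is None else prefix
    if base_prefix is None:
        base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return prefix != base_prefix


print("inside a virtual environment:", in_virtual_environment())
```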

3 Web Crawling to Obtain MLB-Related Articles

This step is OPTIONAL because web crawling takes about 1 hour 40 minutes to complete. Also, because the articles available on the corresponding sites change over time, the fetched articles will differ slightly from the datasets already provided in this project. Moreover, scraper.py may have to be run multiple times on some of the article sites, where advertisements and pop-ups may cause issues.

3_1 ChromeDriver Installation

  1. Check your Chrome version: click the menu icon at the top-right corner of the browser to open the settings.

  2. Select 'Help' -> 'About Google Chrome' and note the version (examples below):

  • Windows 10 Pro (21H1): Version 94.0.4606.81 (Official Build) (64-bit)
  • Mac (Catalina 10.15.7): Version 95.0.4638.69 (Official Build) (x86_64)
  • Linux (Ubuntu 20.04): Version 95.0.4638.54 (Official Build) (64-bit)

  3. Go to https://chromedriver.chromium.org/downloads , click the link to the download page, download the appropriate chromedriver zip file, unzip the file, and place the executable driver file 1 level above scraper.py:

  • Windows 10 Pro (21H1): ChromeDriver 94.0.4606.61 -> chromedriver_win32.zip -> chromedriver.exe
  • Mac (Catalina 10.15.7): ChromeDriver 95.0.4638.54 -> chromedriver_mac64.zip -> chromedriver
  • Linux (Ubuntu 20.04): ChromeDriver 95.0.4638.17 -> chromedriver_linux64.zip -> chromedriver

  4. Alternatively, place the driver file at any applicable place and set the constant 'CHROME_DRIVER_PATH' in scraper.py to the file path.

  5. NOT ON the virtual environment 'mlb', run the chromedriver:

  • Windows 10 Pro (21H1): Double-click the chromedriver.exe
  • Mac (Catalina 10.15.7): Right-click the chromedriver -> Open -> Open
  • Linux (Ubuntu 20.04): Type ~/Documents/CS410/chromedriver in terminal and hit enter key
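The default driver location described above (1 level above scraper.py, with a platform-specific file name) could be computed as in the sketch below. How scraper.py actually defines CHROME_DRIVER_PATH is not shown in this README, so the helper default_driver_path is a hypothetical illustration only.

```python
import platform
from pathlib import Path


def default_driver_path(script_path, system=None):
    """Return the chromedriver path one directory above the script's folder."""
    system = platform.system() if system is None else system
    # The unzipped driver is chromedriver.exe on Windows, chromedriver elsewhere.
    name = "chromedriver.exe" if system == "Windows" else "chromedriver"
    script_dir = Path(script_path).resolve().parent
    return script_dir.parent / name


# Example with an illustrative location of scraper.py:
print(default_driver_path("CourseProject-main/scraper.py"))
```

If the driver lives somewhere else, setting CHROME_DRIVER_PATH directly, as described in the steps above, remains the documented route.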

3_2 Start Web Crawling

ON the virtual environment 'mlb', run a command like the ones below, replacing the site_name argument with Reuters, MLB, WSJ, NYTimes, or ESPN.

Windows 10 Pro (21H1) Environment

py -3.9 scraper.py ESPN # example: web crawling on ESPN.com

Mac (Catalina 10.15.7) Environment

python3 scraper.py ESPN # example: web crawling on ESPN.com

NOTE1: For the first time, a popup will show up asking "'Google Chrome' would like to access files in your Documents folder."

Linux (Ubuntu 20.04) Environment

python3.9 scraper.py ESPN # example: web crawling on ESPN.com
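For illustration, a site_name argument restricted to the five supported sites could be validated with argparse as sketched below. This is an assumption about one reasonable way to do it, not necessarily how scraper.py parses its argument; parse_args and SITES are hypothetical names.

```python
import argparse

# The five site names accepted as the site_name argument.
SITES = ("Reuters", "MLB", "WSJ", "NYTimes", "ESPN")


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Crawl MLB-related articles from one site."
    )
    parser.add_argument("site_name", choices=SITES, help="article site to crawl")
    return parser.parse_args(argv)


# Equivalent to running: python3 scraper.py ESPN
args = parse_args(["ESPN"])
print(args.site_name)
```

With choices set, an unsupported name makes argparse print a usage message and exit instead of starting a crawl.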

NOTE2: If the script stops very quickly or gets stuck, please run scraper.py again for that article site; advertisements and pop-ups may cause issues.
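Re-running the scraper by hand can be automated with a small wrapper like the one below. It is a sketch under two assumptions that this README does not confirm: that scraper.py exits with a non-zero status when it fails, and that a run exceeding the chosen timeout can safely be treated as stuck. The name run_with_retries and its parameters are hypothetical.

```python
import subprocess
import sys
import time


def run_with_retries(command, attempts=3, timeout=None, delay_seconds=5.0,
                     runner=subprocess.run):
    """Run command until it exits with status 0; return the attempt number."""
    for attempt in range(1, attempts + 1):
        try:
            result = runner(command, timeout=timeout)
            if result.returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            pass  # treat a run that exceeded the timeout like a failed run
        if attempt < attempts:
            time.sleep(delay_seconds)  # short pause before trying again
    raise RuntimeError(f"{command!r} did not succeed in {attempts} attempts")


# Example (not executed here): retry the ESPN crawl up to 3 times,
# giving up on any single run after 2 hours.
# run_with_retries([sys.executable, "scraper.py", "ESPN"], timeout=2 * 60 * 60)
```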

NOTE3: Once web crawling is done, you can stop the chromedriver by 'Ctrl' + 'c'.

4 Data Preprocessing and Topic Model Analysis

4_1 Start Jupyter Notebook

NOT ON the virtual environment 'mlb', run the command like below to start Jupyter Notebook.

cd ~/Documents/CS410/CourseProject-main # go to a suitable directory for the root of the Jupyter Notebook tree
jupyter notebook

NOTE1: In case a browser for Jupyter Notebook does not open up on Mac, copy and paste the url for the host like http://127.0.0.1:8888/?token={48_Alphanumeric_Characters} on a browser URL bar.

4_2 Run mlb notebook

  1. On the automatically opened Jupyter Notebook in a browser, navigate to the file 'mlb.ipynb' and click it.

  2. Make sure that the virtual environment 'mlb' has already been created in the step Setting Up Virtual Environment, click 'Kernel' -> 'Change kernel' -> select 'mlb'.

  3. Click 'Kernel' -> 'Restart & Clear Output' -> click 'Restart and Clear All Outputs' to clear cache.

  4. Click 'Cell' -> 'Run All' (alternatively, click 'Run' for each cell one by one).

NOTE1: Once all the tasks in Jupyter Notebook are done, you can stop the Jupyter Notebook by 'Ctrl' + 'c'.

NOTE2: In 'mlb.ipynb', please ignore the one type of DeprecationWarning that cannot be suppressed when the function gensimvis.prepare() is called for the first time in Python 3.9.7.

Appendix

Windows 10 Pro (21H1) Environment

Python and Pip Installation on Windows

Please follow the steps below as an example, only if you do not have Python version 3.9.4. Pip is already installed with Python 3.4+.

  1. Go to https://www.python.org/downloads/windows .
  2. Look for the section 'Python 3.9.4 - April 4, 2021'.
  3. Click 'Download Windows installer (64-bit)'.
  4. Double-click the downloaded 'python-3.9.4-amd64.exe' and click 'Install Now'.

NOTE1: The Python (python.exe) is, for example, installed at C:\Users\Masami\AppData\Local\Programs\Python\Python39\python.exe .

NOTE2: Please follow the steps below in case Pip does not exist in your machine.

  1. Download get-pip.py from https://bootstrap.pypa.io/get-pip.py . Alternatively, run the following command.
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
  2. Install Pip for the Python on your PATH as a User variable.
python get-pip.py

Finding Out Available Python and Pip Versions on Windows

Please follow the steps below to check your environment.

  1. Find out the Python versions in your machine and see the output like below.
py -0
Installed Pythons found by py Launcher for Windows
 -3.10-64 *
 -3.9-64
 -3.7-64
  2. Find out the Pip version for the specific Python version.
py -3.9 -m pip -V
  3. Upgrade the Pip version for the specific Python version and check to see the latest version like below.
py -3.9 -m pip install pip --upgrade

py -3.9 -m pip -V
pip 21.3.1 from C:\Users\Masami\AppData\Local\Programs\Python\Python39\lib\site-packages\pip (python 3.9)

Jupyter Notebook Installation on Windows

  1. Install Jupyter Notebook for the specific Python version.
py -3.9 -m pip install notebook --upgrade
  2. Check to see the latest version like below.
jupyter notebook -V
6.4.5

Setting Up Virtual Environment on Windows

Venv is already installed with Python 3.3+.

  1. Create a virtual environment 'mlb' at the same location as scraper.py and mlb.ipynb.
cd C:\Users\Masami\Documents\CS410\CourseProject-main
py -3.9 -m venv mlb
mlb\Scripts\activate
  2. Install ipykernel for running Jupyter Notebook on a kernel inside the virtual environment.
py -3.9 -m pip install pip --upgrade
py -3.9 -m pip install --user ipykernel
  3. Add the kernel to Jupyter Notebook and see the output like below.
py -3.9 -m ipykernel install --user --name=mlb
Installed kernelspec mlb in C:\Users\Masami\AppData\Roaming\jupyter\kernels\mlb
  4. To delete the kernel and the virtual environment, run the following commands.
jupyter kernelspec uninstall mlb
deactivate
rmdir /s /q mlb

Mac (Catalina 10.15.7) Environment

Python and Pip Installation on Mac

Please follow the steps below only if you do not have Python version 3.9.7. Pip is already installed with Python 3.4+.

  1. Install Python 3.9.7 with its bundled Pip, and see the output like below.
brew update && brew upgrade
brew install python3 && brew upgrade python3

python3 -V
Python 3.9.7

pip3 -V
pip 21.2.4 from /usr/local/lib/python3.9/site-packages/pip (python 3.9)

NOTE1: Refer to the following steps, in case you want to clean up your python3 environment.

  1. Remove all the symbolic links to python and pip.
sudo rm /usr/local/bin/python*
sudo rm /usr/local/bin/pip*
  2. Remove versions of Python in the Python.framework.
sudo rm -rf /Library/Frameworks/Python.framework/Versions/*
  3. Check your PATH environment variables in ~/.bash_profile.
export PATH=/usr/local/bin:/usr/local/sbin:${PATH}
  4. Uninstalling python3 first could make reinstalling python3 easier depending on your situation.
brew uninstall --ignore-dependencies python3

NOTE2: The Python is installed at /usr/local/Cellar/python@3.9/3.9.7_1/bin/python3 .

Finding Out Available Python and Pip Versions on Mac

Please follow the steps below to check your environment.

  1. Find out the detail of Python3 version on your PATH and see the output like below.
python3 -V
Python 3.9.7
  2. Find out the Pip version for the specific Python version.
python3 -m pip -V
  3. Upgrade the Pip version for the specific Python version and check to see the output like below.
python3.9 -m pip install pip --upgrade

python3 -m pip -V
pip 21.3.1 from /usr/local/lib/python3.9/site-packages/pip (python 3.9)

Jupyter Notebook Installation on Mac

  1. Install Jupyter Notebook for the specific Python version.
python3 -m pip install notebook --upgrade
  2. Check to see the latest version like below.
jupyter notebook -V
6.4.5

Setting Up Virtual Environment on Mac

Venv is already installed with Python 3.3+.

  1. Create a virtual environment 'mlb' at the same location as scraper.py and mlb.ipynb.
cd ~/Documents/CS410/CourseProject-main
python3 -m venv mlb
source mlb/bin/activate
  2. Install ipykernel for running Jupyter Notebook on a kernel inside the virtual environment.
python3 -m pip install pip --upgrade
python3 -m pip install ipykernel
  3. Add the kernel to Jupyter Notebook and see the output like below.
python3 -m ipykernel install --user --name=mlb
Installed kernelspec mlb in /home/masami/.local/share/jupyter/kernels/mlb
  4. To delete the kernel and the virtual environment, run the following commands.
jupyter kernelspec uninstall mlb
deactivate
rm -r mlb

Linux (Ubuntu 20.04) Environment

Python and Pip Installation on Linux

Please follow the steps below only if you do not have Python version 3.9.7.

  1. Install Python 3.9.7 and its Pip 21.3.1, and see the output like below. Pip is already installed with Python 3.4+.
sudo apt update && sudo apt upgrade
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get install python3.9
python3.9 -V
Python 3.9.7

python3.9 -m pip install pip --upgrade
python3.9 -m pip -V
pip 21.3.1 from /home/masami/.local/lib/python3.9/site-packages/pip (python 3.9)

NOTE1: The Python is installed at /usr/bin/python3.9 .

NOTE2: In case you see the warning below, a command like the one below adds Pip to your PATH.

WARNING: The scripts pip, pip3, and pip3.9 are installed in '/home/masami/.local/bin' which is not on PATH.

echo "export PATH=\"/home/masami/.local/bin:\$PATH\"" >> ~/.bashrc && source ~/.bashrc
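Whether a directory such as the one named in the warning is already on PATH can be checked from Python before editing ~/.bashrc. The helper is_on_path below is a hypothetical sketch using only the standard library.

```python
import os


def is_on_path(directory, path_value=None):
    """Return True if directory is one of the entries of PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = os.path.normpath(os.path.expanduser(str(directory)))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry  # skip empty entries such as a trailing separator
    ]
    return target in entries


# Check the per-user script directory mentioned in the pip warning.
print(is_on_path("~/.local/bin"))
```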

Finding Out Available Python and Pip Versions on Linux

Please follow the steps below to check your environment.

  1. Find out the detail of Python3.9 version at /usr/bin on your machine and see the output like below.
python3.9 -V
Python 3.9.7
  2. Find out the Pip version for the specific Python version.
python3.9 -m pip -V
  3. Upgrade the Pip version for the specific Python version and check to see the output like below.
python3.9 -m pip install pip --upgrade

python3.9 -m pip -V
pip 21.3.1 from /home/masami/.local/lib/python3.9/site-packages/pip (python 3.9)

Jupyter Notebook Installation on Linux

  1. Install Jupyter Notebook for the specific Python version.
python3.9 -m pip install notebook --upgrade
  2. Check to see the latest version like below.
jupyter notebook -V
6.4.5

Setting Up Virtual Environment on Linux

  1. Install Venv.
sudo apt-get install python3.9-venv
  2. Create a virtual environment 'mlb' at the same location as scraper.py and mlb.ipynb.
cd ~/Documents/CS410/CourseProject-main
python3.9 -m venv mlb
source mlb/bin/activate
  3. Install ipykernel for running Jupyter Notebook on a kernel inside the virtual environment.
python3.9 -m pip install pip --upgrade
python3.9 -m pip install ipykernel
  4. Add the kernel to Jupyter Notebook and see the output like below.
python3.9 -m ipykernel install --user --name=mlb
Installed kernelspec mlb in /home/masami/.local/share/jupyter/kernels/mlb
  5. To delete the kernel and the virtual environment, run the following commands.
jupyter kernelspec uninstall mlb
deactivate
rm -rf mlb
