This project performs causal topic modeling on MLB (Major League Baseball) articles to identify and analyze hidden trending topics. You can test the model without installing anything on your machine, which I suspect most people prefer, by following the section The Software Usage and Testing the Software. If you would like to reproduce the development environment of this project, see the section The Software Development and Installation.
| Name | NetId |
|---|---|
| Masami Peak | [email protected] |
- Introduction
- LDA Algorithm
- The Software Usage and Testing the Software
- The Software Implementation
- The Software Usage Tutorial Presentation
- The Software Development and Installation
Each year, the MVP (Most Valuable Player) of each of the 2 leagues is chosen by voters from the Baseball Writers’ Association of America. Because the voters are writers, topic modeling on MLB-related articles published before the MVP announcement can be used to discover the key topics correlated with that year's MVP.
LDA (Latent Dirichlet Allocation) is one of the most popular topic models and extracts the topics discussed in a collection of documents. In the LDA model, each document has a topic distribution (a vector of topic probabilities), and each topic has a word distribution (a vector of word probabilities). Each topic distribution is drawn from a Dirichlet distribution with the parameter vector α, and each word distribution is drawn from a Dirichlet distribution with the parameter vector β.
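Topic distribution over k topics for the d-th document, πd (drawn from a Dirichlet distribution with parameters α):
$$\pi_d = (\pi_{d,1}, \pi_{d,2}, \ldots, \pi_{d,k}) = (p(\theta_1 \mid d), p(\theta_2 \mid d), \ldots, p(\theta_k \mid d)), \qquad \pi_d \sim \mathrm{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_k)$$
Word distribution over m words for the i-th topic, θi (drawn from a Dirichlet distribution with parameters β):
$$p(W \mid \theta_i) = (p(w_1 \mid \theta_i), p(w_2 \mid \theta_i), \ldots, p(w_m \mid \theta_i)), \qquad p(W \mid \theta_i) \sim \mathrm{Dirichlet}(\beta_1, \beta_2, \ldots, \beta_m)$$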
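The generative story behind these two distributions can be illustrated with a small standard-library sketch. It is not the code used in mlb.ipynb (which relies on gensim); the vocabulary, the parameter values, and the helper names `sample_dirichlet` and `generate_document` are made up for illustration.

```python
import random

def sample_dirichlet(params, rng):
    """Draw one probability vector from Dirichlet(params) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(vocab, topic_word, alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    # Topic distribution of this document, drawn from Dirichlet(alpha).
    pi_d = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(alpha)), weights=pi_d)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return pi_d, words

rng = random.Random(0)
vocab = ["pitcher", "homer", "trade", "contract"]  # toy vocabulary (made up)
alpha = [0.5, 0.5]             # k = 2 topics
beta = [0.5] * len(vocab)      # m = 4 words
# One word distribution per topic, each drawn from Dirichlet(beta).
topic_word = [sample_dirichlet(beta, rng) for _ in alpha]
pi_d, doc = generate_document(vocab, topic_word, alpha, length=8, rng=rng)
print(pi_d)
print(doc)
```

Fitting an LDA model runs this story in reverse: given only the words of each document, it estimates the topic distributions and word distributions that most plausibly generated them.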
Please go to the Online Jupyter Notebook of this project and see the section 'How to Use/Test the Software' in the ProjectReport.pdf for the details of the software usage and testing the software.
Please see the section 'How the Software Implemented' in the ProjectReport.pdf for the details of the software implementation.
Please watch the Project Presentation for the software usage tutorial.
The rest of this section is extra, detailed documentation on how to develop and install the software on Windows 10 Pro (21H1), Mac (Catalina 10.15.7), or Linux (Ubuntu 20.04). It is useful if you would like to set up the development environment of the software.
Python 3.9, Jupyter Notebook 6.4.5, ChromeDriver (Optional)
| Module / Tool | Version | Usage | Reference |
|---|---|---|---|
| Python | 3.9.4 | Programming Language | https://www.python.org/ |
| Python Built-in Modules | 3.9.4 | Commonly Used Functions | calendar, csv, datetime, os, string, re, importlib.util, subprocess, sys, warnings |
| Jupyter Notebook | 6.4.5 | Simulating Code | https://jupyter.org/ |
| bs4 for BeautifulSoup | 0.0.1 | Web Scraping | https://www.crummy.com/software/BeautifulSoup/bs4/doc/ |
| selenium for webdriver | 4.0.0 | Controlling Browser | https://selenium-python.readthedocs.io/getting-started.html |
| pandas | 1.3.4 | DataFrame | https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html |
| nltk | 3.6.5 | Stemming & Lemmatization | https://www.nltk.org/howto/stem.html https://www.nltk.org/api/nltk.stem.wordnet.html |
| gensim | 4.1.2 | Topic Modeling | https://radimrehurek.com/gensim/models/phrases.html |
| matplotlib for pyplot | 3.4.3 | Graphical Plotting | https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html |
| pyLDAvis | 3.3.1 | LDA model visualization | https://pypi.org/project/pyLDAvis/ |
| Chrome | 94.0.4606.71 | Google Browser | https://www.google.com/chrome/ |
| ChromeDriver | 94.0.4606.61 | Executing Webapps | https://chromedriver.chromium.org/downloads |
masamip2/CourseProject (CourseProject-main.zip)
| mlb.ipynb - for topic modeling and analyzing the outcome
| ProjectProgress.pdf
| ProjectProposal.pdf
| ProjectReport.pdf
| README.md
| scraper.py - for web crawling
└───Data/
| | articles.csv (the dataset for reproducing the report)
| | (the datasets for producing the articles.csv)
| | espn.csv - from espn.com
| | mlb.csv - from mlb.com
| | mvps.csv - a list of the annual MVP in 2 leagues from 2011 to 2021
| | nyt.csv - from nytimes.com
| | reuters.csv - from reuters.com
| | wsj.csv - from wsj.com
└───Model/ (this directory will be created only if to_save=True is configured in the mlb notebook)
| | topic_model_{year}.pkl (the LDA model can be saved per {year})
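As a rough illustration of the Model/ directory behavior described above, the sketch below saves one object per year under the name topic_model_{year}.pkl and creates the directory only when saving is enabled. It assumes the .pkl files are written with Python's `pickle` module; the function names `save_topic_model` and `load_topic_model` are hypothetical and are not taken from mlb.ipynb.

```python
import pickle
from pathlib import Path

def save_topic_model(model, year, model_dir="Model", to_save=True):
    """Pickle `model` as topic_model_{year}.pkl, creating the directory on demand."""
    if not to_save:
        return None  # nothing is written and no directory is created
    directory = Path(model_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"topic_model_{year}.pkl"
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return path

def load_topic_model(year, model_dir="Model"):
    """Load the object previously saved for `year`."""
    path = Path(model_dir) / f"topic_model_{year}.pkl"
    with path.open("rb") as handle:
        return pickle.load(handle)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.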
This project has 4 different steps:
- Downloading the Project Source Code
- Setting Up the Local Environment
- Web Crawling to Obtain MLB-Related Articles
- Data Preprocessing and Topic Model Analysis
Please follow the steps below to download the project zip file.
- Go to https://github.com/masamip2/CourseProject .
- Click the 'Code' button.
- Choose 'Download ZIP'.
NOTE: Alternatively, you can clone the project repository if you prefer to use Git.
git clone https://github.com/masamip2/CourseProject.git
This project has been confirmed to work on:
- Windows 10 Pro (21H1): Python 3.9.4
- Mac (Catalina 10.15.7): Python 3.9.7
- Linux (Ubuntu 20.04): Python 3.9.7
If your Python version differs from the versions above, please see the section Python and Pip Installation (Windows, Mac, or Linux) and consider installing the supported version of Python for this project. Pip is already installed with Python 3.4+.
If your Python and Pip versions are not up to date, please see the section Finding Out Available Python and Pip Versions (Windows, Mac, or Linux) to upgrade them.
If Jupyter Notebook is not installed or its version is not up to date, please see the section Jupyter Notebook Installation (Windows, Mac, or Linux).
Please see the section Setting Up Virtual Environment (Windows, Mac, or Linux) to set up the virtual environment 'mlb' for this project. Venv is already installed with Python 3.3+.
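Before going further, it can help to confirm which interpreter is running and whether a virtual environment is active. The following is a small sketch using only the standard library; the helper names are illustrative, and treating any 3.9.x release as supported is an assumption based on the versions listed above.

```python
import sys

SUPPORTED = (3, 9)  # major.minor version this README reports as working

def is_supported_python(version_info=sys.version_info):
    """Return True if the interpreter is a 3.9.x release."""
    return tuple(version_info[:2]) == SUPPORTED

def in_virtual_environment():
    """Return True when running inside a venv (prefix differs from base prefix)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("Python", sys.version.split()[0])
print("supported version:", is_supported_python())
print("inside a virtual environment:", in_virtual_environment())
```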
This step is OPTIONAL, because web crawling takes about 1 hour 40 minutes to complete. Also, because the article sites keep updating their available articles, the fetched articles will differ slightly from the datasets already provided in this project. Moreover, scraper.py may have to be run multiple times on some of the article sites, where advertisements and pop-ups can cause issues.
- Check your Chrome version: click the menu at the top-right corner of the browser to open the settings.
- 'Help' -> 'About Google Chrome' -> Version (examples below):
  - Windows 10 Pro (21H1): Version 94.0.4606.81 (Official Build) (64-bit)
  - Mac (Catalina 10.15.7): Version 95.0.4638.69 (Official Build) (x86_64)
  - Linux (Ubuntu 20.04): Version 95.0.4638.54 (Official Build) (64-bit)
- Go to https://chromedriver.chromium.org/downloads , follow the link to the download page for your version, download the appropriate chromedriver zip file, unzip it, and place the executable driver file one directory level above scraper.py:
  - Windows 10 Pro (21H1): ChromeDriver 94.0.4606.61 -> chromedriver_win32.zip -> chromedriver.exe
  - Mac (Catalina 10.15.7): ChromeDriver 95.0.4638.54 -> chromedriver_mac64.zip -> chromedriver
  - Linux (Ubuntu 20.04): ChromeDriver 95.0.4638.17 -> chromedriver_linux64.zip -> chromedriver
- Alternatively, place the driver file anywhere suitable and set the constant 'CHROME_DRIVER_PATH' in scraper.py to its file path.
- NOT inside the virtual environment 'mlb', run the chromedriver:
  - Windows 10 Pro (21H1): Double-click chromedriver.exe
  - Mac (Catalina 10.15.7): Right-click the chromedriver -> Open -> Open
  - Linux (Ubuntu 20.04): Type ~/Documents/CS410/chromedriver in a terminal and press Enter
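In each Chrome/ChromeDriver pair listed above, the first three version components (major, minor, build) are identical and only the last one differs. The sketch below checks that relationship for a pair of version strings. Using the first three components as the compatibility criterion is an assumption drawn from those pairs, and the function names are illustrative.

```python
def version_prefix(version, parts=3):
    """Return the first `parts` numeric components of a dotted version string."""
    return tuple(int(p) for p in version.split(".")[:parts])

def driver_matches_chrome(chrome_version, driver_version):
    """Return True when major.minor.build agree between Chrome and ChromeDriver."""
    return version_prefix(chrome_version) == version_prefix(driver_version)

# The pairs listed in this README for Windows, Mac, and Linux.
pairs = [
    ("94.0.4606.81", "94.0.4606.61"),
    ("95.0.4638.69", "95.0.4638.54"),
    ("95.0.4638.54", "95.0.4638.17"),
]
for chrome, driver in pairs:
    print(chrome, driver, driver_matches_chrome(chrome, driver))
```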
Inside the virtual environment 'mlb', run a command like the ones below, replacing the site_name argument (ESPN in these examples) with Reuters, MLB, WSJ, NYTimes, or ESPN.
- Windows 10 Pro (21H1): py -3.9 scraper.py ESPN # example: web crawling on ESPN.com
- Mac (Catalina 10.15.7): python3 scraper.py ESPN # example: web crawling on ESPN.com
- Linux (Ubuntu 20.04): python3.9 scraper.py ESPN # example: web crawling on ESPN.com
NOTE1: On Mac, the first time you run it, a popup will appear asking "'Google Chrome' would like to access files in your Documents folder."
NOTE2: If the script stops very quickly or gets stuck, please run scraper.py again for that particular article site; advertisements and pop-ups may cause issues.
NOTE3: Once web crawling is done, you can stop the chromedriver by pressing 'Ctrl' + 'c'.
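One way a script can validate the site_name argument is sketched below. It does not reproduce scraper.py: the mapping from site names to CSV files is inferred from the Data/ listing in this README, and `SITE_TO_CSV` and `parse_site_name` are hypothetical names.

```python
# Site names accepted per this README; CSV names are inferred from the Data/ listing.
SITE_TO_CSV = {
    "Reuters": "reuters.csv",
    "MLB": "mlb.csv",
    "WSJ": "wsj.csv",
    "NYTimes": "nyt.csv",
    "ESPN": "espn.csv",
}

def parse_site_name(argv):
    """Return the site name from an argv-style list, or exit with a usage message."""
    if len(argv) != 2 or argv[1] not in SITE_TO_CSV:
        valid = ", ".join(SITE_TO_CSV)
        raise SystemExit(f"usage: {argv[0]} site_name  (one of: {valid})")
    return argv[1]

# Example with an explicit argument list, as if run as `scraper.py ESPN`.
site = parse_site_name(["scraper.py", "ESPN"])
print(site, "->", SITE_TO_CSV[site])
```

Rejecting unknown names up front avoids starting a long crawl with a mistyped site.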
NOT inside the virtual environment 'mlb', run the commands below to start Jupyter Notebook.
cd ~/Documents/CS410/CourseProject-main # go to a suitable directory for the root of the Jupyter Notebook tree
jupyter notebook
NOTE1: If a browser for Jupyter Notebook does not open on Mac, copy the host URL, which looks like http://127.0.0.1:8888/?token={48_Alphanumeric_Characters}, and paste it into a browser's URL bar.
- In the Jupyter Notebook page that opens automatically in a browser, navigate to the file 'mlb.ipynb' and click it.
- Make sure that the virtual environment 'mlb' has already been created in the step Setting Up Virtual Environment, then click 'Kernel' -> 'Change kernel' -> select 'mlb'.
- Click 'Kernel' -> 'Restart & Clear Output' -> click 'Restart and Clear All Outputs' to clear the cache.
- Click 'Cell' -> 'Run All' (alternatively, click 'Run' for each cell one by one).
NOTE1: Once all the tasks in Jupyter Notebook are done, you can stop Jupyter Notebook by pressing 'Ctrl' + 'c'.
NOTE2: In 'mlb.ipynb', please ignore one type of DeprecationWarning that cannot be suppressed when the function gensimvis.prepare() is called for the first time in Python 3.9.7.
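For context, the usual way to silence DeprecationWarning around a single call is the standard `warnings` module, sketched below. This only shows the general mechanism; as noted above, one warning raised by gensimvis.prepare() is reported to get past suppression in the notebook. The helper and the stand-in function are illustrative names.

```python
import warnings

def call_without_deprecation_warnings(func, *args, **kwargs):
    """Call `func` with DeprecationWarning ignored for the duration of the call."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return func(*args, **kwargs)

def legacy_function():
    """Stand-in for a library call that emits a DeprecationWarning."""
    warnings.warn("old API", DeprecationWarning)
    return "done"

print(call_without_deprecation_warnings(legacy_function))
```

The filter is restored when the `with` block exits, so warnings elsewhere in the notebook stay visible.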
On Windows, please follow the steps below as an example, only if you do not have Python version 3.9.4. Pip is already installed with Python 3.4+.
- Go to https://www.python.org/downloads/windows .
- Look for the section 'Python 3.9.4 - April 4, 2021'.
- Click 'Download Windows installer (64-bit)'.
- Double-click the downloaded 'python-3.9.4-amd64.exe' and click 'Install Now'.
NOTE1: Python (python.exe) is installed at, for example, C:\Users\Masami\AppData\Local\Programs\Python\Python39\python.exe .
NOTE2: Please follow the steps below in case Pip does not exist on your machine.
- Download get-pip.py from https://bootstrap.pypa.io/get-pip.py .
- Alternatively, run the following command.
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
- Install Pip for the Python on your PATH as a User variable.
python get-pip.py
Please follow the steps below to check your environment.
- Find out the Python versions on your machine and see output like below.
py -0
Installed Pythons found by py Launcher for Windows
-3.10-64 *
-3.9-64
-3.7-64
- Find out the Pip version for the specific Python version.
py -3.9 -m pip -V
- Upgrade the Pip version for the specific Python version and check to see the latest version like below.
py -3.9 -m pip install pip --upgrade
py -3.9 -m pip -V
pip 21.3.1 from C:\Users\Masami\AppData\Local\Programs\Python\Python39\lib\site-packages\pip (python 3.9)
- Install Jupyter Notebook for the specific Python version.
py -3.9 -m pip install notebook --upgrade
- Check to see the latest version like below.
jupyter notebook -V
6.4.5
Venv is already installed with Python 3.3+.
- Create a virtual environment 'mlb' at the same location as scraper.py and mlb.ipynb.
cd C:\Users\Masami\Documents\CS410\CourseProject-main
py -3.9 -m venv mlb
mlb\Scripts\activate
- Install ipykernel for running Jupyter Notebook on a kernel inside the virtual environment.
py -3.9 -m pip install pip --upgrade
py -3.9 -m pip install --user ipykernel
- Add the kernel to Jupyter Notebook and see output like below.
py -3.9 -m ipykernel install --user --name=mlb
Installed kernelspec mlb in C:\Users\Masami\AppData\Roaming\jupyter\kernels\mlb
- To delete the kernel and the virtual environment, run the following commands.
jupyter kernelspec uninstall mlb
deactivate
rmdir /s /q mlb
On Mac, please follow the steps below only if you do not have Python version 3.9.7. Pip is already installed with Python 3.4+.
- Install Python 3.9.7 and its Pip, and see output like below.
brew update && brew upgrade
brew install python3 && brew upgrade python3
python3 -V
Python 3.9.7
pip3 -V
pip 21.2.4 from /usr/local/lib/python3.9/site-packages/pip (python 3.9)
NOTE1: Refer to the following steps in case you want to clean up your python3 environment.
- Remove all the symbolic links to python and pip.
sudo rm /usr/local/bin/python*
sudo rm /usr/local/bin/pip*
- Remove versions of Python in the Python.framework.
sudo rm -rf /Library/Frameworks/Python.framework/Versions/*
- Check your PATH environment variables in ~/.bash_profile.
export PATH=/usr/local/bin:/usr/local/sbin:${PATH}
- Uninstalling python3 first could make reinstalling python3 easier, depending on your situation.
brew uninstall --ignore-dependencies python3
NOTE2: Python is installed at /usr/local/Cellar/python@3.9/3.9.7_1/bin/python3 .
Please follow the steps below to check your environment.
- Find out the details of the Python3 version on your PATH and see output like below.
python3 -V
Python 3.9.7
- Find out the Pip version for the specific Python version.
python3 -m pip -V
- Upgrade the Pip version for the specific Python version and check to see output like below.
python3.9 -m pip install pip --upgrade
python3 -m pip -V
pip 21.3.1 from /usr/local/lib/python3.9/site-packages/pip (python 3.9)
- Install Jupyter Notebook for the specific Python version.
python3 -m pip install notebook --upgrade
- Check to see the latest version like below.
jupyter notebook -V
6.4.5
Venv is already installed with Python 3.3+.
- Create a virtual environment 'mlb' at the same location as scraper.py and mlb.ipynb.
cd ~/Documents/CS410/CourseProject-main
python3 -m venv mlb
source mlb/bin/activate
- Install ipykernel for running Jupyter Notebook on a kernel inside the virtual environment.
python3 -m pip install pip --upgrade
python3 -m pip install ipykernel
- Add the kernel to Jupyter Notebook and see output like below.
python3 -m ipykernel install --user --name=mlb
Installed kernelspec mlb in /home/masami/.local/share/jupyter/kernels/mlb
- To delete the kernel and the virtual environment, run the following commands.
jupyter kernelspec uninstall mlb
deactivate
rm -r mlb
On Linux, please follow the steps below only if you do not have Python version 3.9.7.
- Install Python 3.9.7 and its Pip 21.3.1, and see output like below. Pip is already installed with Python 3.4+.
sudo apt update && sudo apt upgrade
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get install python3.9
python3.9 -V
Python 3.9.7
python3.9 -m pip install pip --upgrade
python3.9 -m pip -V
pip 21.3.1 from /home/masami/.local/lib/python3.9/site-packages/pip (python 3.9)
NOTE1: Python is installed at /usr/bin/python3.9 .
NOTE2: If you see the warning below, the command that follows it puts Pip on your PATH.
WARNING: The scripts pip, pip3, and pip3.9 are installed in '/home/masami/.local/bin' which is not on PATH.
echo "export PATH=\"/home/masami/.local/bin:\$PATH\"" >> ~/.bashrc && source ~/.bashrc
Please follow the steps below to check your environment.
- Find out the details of the Python3.9 version at /usr/bin on your machine and see output like below.
python3.9 -V
Python 3.9.7
- Find out the Pip version for the specific Python version.
python3.9 -m pip -V
- Upgrade the Pip version for the specific Python version and check to see output like below.
python3.9 -m pip install pip --upgrade
python3.9 -m pip -V
pip 21.3.1 from /home/masami/.local/lib/python3.9/site-packages/pip (python 3.9)
- Install Jupyter Notebook for the specific Python version.
python3.9 -m pip install notebook --upgrade
- Check to see the latest version like below.
jupyter notebook -V
6.4.5
- Install Venv.
sudo apt-get install python3.9-venv
- Create a virtual environment 'mlb' at the same location as scraper.py and mlb.ipynb.
cd ~/Documents/CS410/CourseProject-main
python3.9 -m venv mlb
source mlb/bin/activate
- Install ipykernel for running Jupyter Notebook on a kernel inside the virtual environment.
python3.9 -m pip install pip --upgrade
python3.9 -m pip install ipykernel
- Add the kernel to Jupyter Notebook and see output like below.
python3.9 -m ipykernel install --user --name=mlb
Installed kernelspec mlb in /home/masami/.local/share/jupyter/kernels/mlb
- To delete the kernel and the virtual environment, run the following commands.
jupyter kernelspec uninstall mlb
deactivate
rm -rf mlb