This code is an NLP pipeline for topic modeling large collections of documents. It is generalizable to any text data set, once formatted properly (see `src/raw_corpus2tsv.py` for an example processing script).
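As a hedged illustration of "formatted properly", here is a minimal sketch of a processing script for a new text data set, analogous in spirit to `raw_corpus2tsv.py`. The column names (`id`, `text`) and the one-document-per-file layout are assumptions, not the pipeline's actual schema:

```python
import csv
import os

def corpus_dir_to_tsv(input_dir, output_tsv):
    """Write one TSV row per .txt file: a document id and its flattened text.

    Illustrative sketch only -- the column names are assumptions; match them
    to whatever schema the rest of the pipeline expects.
    """
    with open(output_tsv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["id", "text"])
        for name in sorted(os.listdir(input_dir)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(input_dir, name), encoding="utf-8") as f:
                # Collapse newlines/tabs so each document stays on one TSV row
                text = " ".join(f.read().split())
            writer.writerow([os.path.splitext(name)[0], text])
```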
This code runs on a compute cluster at Brown University using the Slurm scheduler. Minor adjustments should allow it to run on other compute clusters or locally.
To run locally:

- Install the software/packages in the Requirements section, change paths, and comment out compute-cluster-specific code (e.g. `os.system('module load *')`)
- Toggle switches for the desired pipeline steps in `main.py`
- Run `python main.py`
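The toggle switches mentioned above might look like the following minimal sketch. The flag names and the commands they guard are illustrative assumptions, not the actual contents of `main.py`:

```python
import os

# Hypothetical switch section in the spirit of main.py; flag names and
# commands below are illustrative assumptions.
RUN_RAW_TO_TSV = True       # transform raw XML into tabular data
RUN_PREPROCESS = True       # clean the text
RUN_MALLET_IMPORT = False   # convert the cleaned TSV into MALLET format
RUN_TRAIN_LDA = False       # train the LDA topic model

def selected_steps():
    """Return the shell commands enabled by the switches, in pipeline order."""
    steps = []
    if RUN_RAW_TO_TSV:
        steps.append("python raw_corpus2tsv.py")
    if RUN_PREPROCESS:
        steps.append("python preprocess.py")
    if RUN_MALLET_IMPORT:
        steps.append("bash mallet_import_from_file.sh")
    if RUN_TRAIN_LDA:
        steps.append("bash mallet_train_lda.sh")
    return steps

if __name__ == "__main__":
    for cmd in selected_steps():
        os.system(cmd)
```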
To run on a compute cluster:

- Set up the compute cluster with the software/packages in the Requirements section and change any paths or compute-cluster-specific code (e.g. `os.system('module load *')`)
- Toggle switches for the desired pipeline steps in `main.py`
- Adjust resources in `sbatch.sh` and run `sbatch sbatch.sh`
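The resource lines to adjust live in `sbatch.sh`. A minimal sketch of such a Slurm submission script follows; every resource value and module name below is a placeholder assumption, not the repository's actual settings:

```shell
#!/bin/bash
# Hypothetical sketch of a Slurm submission script like sbatch.sh.
# All resource values and module names are placeholders -- adjust for your cluster.
#SBATCH --job-name=topic-model
#SBATCH --time=24:00:00
#SBATCH --mem=32G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

module load anaconda/3-5.2.0   # site-specific; name is an assumption
module load mallet             # site-specific; name is an assumption

python main.py
```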
- `data/`: contains tabular data output from all steps in the pipeline and two additional folders:
  - `sentiment/`: contains positively and negatively charged adjective lists (not used in this pipeline)
  - `stoplists/`: contains generic and custom stopword lists
- `src/`: contains the code
  - `sbatch.sh`: script to run the pipeline on a compute cluster in batch mode
  - `main.py`: top-level script with switches to control the other scripts/parts of the pipeline
  - `raw_corpus2tsv.py`: custom Hansard script to transform raw XML data into tabular data
  - `preprocess.py`: both custom Hansard cleaning and generic data-cleaning functions
  - `mallet_import_from_file.sh`: MALLET command to import data into MALLET format on a compute cluster
  - `mallet_train_lda.sh`: MALLET command to train an LDA model on MALLET-format data on a compute cluster
- `test/`: contains 10 sample Hansard XML data files
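The two MALLET wrapper scripts above boil down to MALLET's `import-file` and `train-topics` commands. A hedged sketch of those two steps is shown below; the file names and parameter values are placeholders, not the scripts' actual contents:

```shell
# Hypothetical sketch of the steps wrapped by mallet_import_from_file.sh
# and mallet_train_lda.sh; paths and parameter values are placeholders.

# Import tabular data into MALLET's binary format
mallet import-file \
    --input data/corpus.tsv \
    --output data/corpus.mallet \
    --keep-sequence \
    --remove-stopwords \
    --extra-stopwords data/stoplists/custom.txt

# Train an LDA topic model on the imported data
mallet train-topics \
    --input data/corpus.mallet \
    --num-topics 100 \
    --optimize-interval 10 \
    --output-topic-keys data/topic_keys.txt \
    --output-doc-topics data/doc_topics.txt
```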
- Python 3.6.1
- Anaconda 3-5.2.0
- MALLET
- numpy
- pandas
- sys
- os
- time
- string
- csv
- pickle
- sklearn
- nltk
- enchant
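Note that `sys`, `os`, `time`, `string`, `csv`, and `pickle` ship with Python's standard library; only the remaining packages need installing. A small sketch for checking that the third-party requirements are importable (the list of import names is an assumption based on the requirements above):

```python
import importlib

def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Third-party requirements; stdlib modules (sys, os, ...) need no install.
REQUIRED = ["numpy", "pandas", "sklearn", "nltk", "enchant"]

if __name__ == "__main__":
    gaps = missing_packages(REQUIRED)
    if gaps:
        print("Missing packages:", ", ".join(gaps))
```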