Final Project Repo for CS 410 Text Info Systems

Credit to team members:

Chuanyue Shen (cs11),
Jianjia Zhang (jianjia2),
Runpeng Nie (runpeng3)

Introduction

This is TEAM PYTHON's repo for the CS 410 final project classification competition: Twitter Sarcasm Detection Using Natural Language Processing and Machine Learning Techniques.

Project highlights

Constructed a vocabulary library for 5,000 Tweets after pre-processing the data through stop word removal, stemming, and lemmatization using Python NLTK package; and reduced the vector space dimension from 40,000 to 10,000
Implemented Naive Bayes, SVM, and LSTM to classify a tweet as Sarcastic or Not Sarcastic based on the context and response
In particular, the LSTM model hyperparameters were optimized to achieve a higher accuracy than other models

Prerequisite

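Please use Python 3 and install the following packages:

numpy
jsonlines
nltk
autocorrect
Keras
scikit-learn (imported as sklearn)
pandas

The string and re modules are also used; they are part of the Python standard library and do not need to be installed.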

We recommend running the program on a machine with a GPU.

Run the code

Data cleaning and preprocessing

Type

python preProcessData.py

Train the model and make predictions

To run the LSTM model, type

python lstm.py

To run the SVM model, type

python svm.py

To run the Naive Bayes model, type

python naiveBayes.py

Test dataset prediction

After running the model, the test dataset prediction will be saved in the local directory, named

answer.txt

Reference

https://www.kaggle.com/kredy10/simple-lstm-for-text-classification

Presentation

https://youtu.be/IC9ncGVvbcQ

More details about the project

Source code

For the source code, please refer to the "Run the code" section above. The test set predictions of our best model can be found in answer.txt. The F1 score of one of our best LSTM results beat the baseline and can be found on the LiveDataLab leaderboard under the name cs11 and/or jianjia2.

Implementation details

Data preprocessing:

The training and test data are stored in JSON Lines files. Each data item has three fields. The first field of a training item is a label indicating whether the response is sarcastic; the first field of a test item is its ID. Both training and test items also have a response field and a context field. The response field stores the tweet to be classified, and the context field stores the conversation that the response replies to. Both response and context are strings.

Because the data consists of tweets, it contains many emojis and @USER marks, so the first step is to remove them. Emojis are encoded in Unicode, so we define the emoji Unicode pattern with re.compile() and remove the matches with re.sub(). @USER is easier to remove. As we learned in the lectures, stop words such as "the" and "a" are almost useless for text classification. NLTK provides an English stop word list; we add "@USER" to that list and remove every word on it from the response and the context. Punctuation is not useful for text classification either, since it does not carry sentiment the way ordinary words do, so we drop it using string.punctuation. After these operations, the data is clean.

A clean dataset alone is not enough. English has many words that share a root and have a similar meaning, for instance computer, computing, and computational. Treating each form as a separate term increases the complexity of matrix operations. To improve performance and raise classification accuracy, we replace words with their stems using NLTK's PorterStemmer. After this replacement, the computational cost dropped significantly.
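The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in preProcessData.py: the emoji ranges, the tiny STOP_WORDS set (a stand-in for NLTK's English stop word list), and the function name clean_tweet are all assumptions. For simplicity, @USER is removed by string replacement here instead of through the stop word list, and stemming is left out.

```python
import re
import string

# Assumed emoji code point ranges; the project's actual pattern is not shown here.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "]+"
)

# Tiny stand-in for NLTK's English stop word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and"}

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Return the list of cleaned, lowercased tokens of one tweet."""
    text = EMOJI_PATTERN.sub(" ", text)      # remove emojis
    text = text.replace("@USER", " ")        # remove user marks
    text = text.lower().translate(PUNCT_TABLE)  # drop punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(clean_tweet("@USER The weather is GREAT today!!! \U0001F602"))
```

In the real pipeline, the stop word set would come from NLTK and each remaining token would then be passed through the Porter stemmer.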

With the steps mentioned above, we generate an n * 3 numpy array. The 3 columns store the label, response, and context, respectively. The array is saved in the 'dataStep1.csv' file.

Using the n * 3 numpy array, we build a term-document matrix that counts the frequency of each vocabulary word in each tweet. The term-document matrix is saved in the 'term_doc_matrix.csv' file.
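As a rough illustration of what such a count matrix looks like, here is a small standard-library sketch. It assumes one row per tweet and one column per vocabulary word, with the vocabulary sorted alphabetically; the actual layout of 'term_doc_matrix.csv' may differ, and build_term_doc_matrix is a hypothetical name.

```python
from collections import Counter


def build_term_doc_matrix(docs):
    """Count how often each vocabulary word occurs in each tokenized document.

    docs is a list of token lists. Returns (vocab, matrix), where matrix[i][j]
    is the count of vocab[j] in docs[i].
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            matrix[i][column[term]] = count
    return vocab, matrix


if __name__ == "__main__":
    vocab, matrix = build_term_doc_matrix(
        [["great", "weather", "great"], ["weather", "today"]]
    )
    print(vocab)
    for row in matrix:
        print(row)
```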

'dataStep1.csv' is mainly used in LSTM model implementation with some further processing.

'term_doc_matrix.csv' is mainly used in Naive Bayes and SVM models.

Model implementation and test result

• Naive Bayes

For this model, we choose the Gaussian distribution to model the count of each word, and then fit a Naive Bayes classifier to the data.

Result: ▪ Precision = 0.5371 ▪ Recall = 0.7967 ▪ F1 = 0.6416

• SVM

For this model, we preprocess the data using TF-IDF. The corpus has too many unique words, which would result in around 20,000 features, while the training data has only 5,000 rows; that many features relative to the number of rows is likely to cause overfitting. We therefore remove the words with a term count below 3. After this cleaning, around 6,000 unique words remain, which is reasonable compared to the original vocabulary.

In the training process, we shuffle the training dataset and split it into training data (0.8) and validation data (0.2).
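The vocabulary pruning and the shuffled 80/20 split can be sketched as below. This is only an illustration under assumptions: "term count" is taken to mean the total count over the whole corpus, the seed is arbitrary, and the names prune_vocabulary and shuffle_split are hypothetical, not taken from svm.py.

```python
import random
from collections import Counter


def prune_vocabulary(docs, min_count=3):
    """Drop every token whose total count over all documents is below min_count."""
    totals = Counter(tok for doc in docs for tok in doc)
    keep = {tok for tok, count in totals.items() if count >= min_count}
    return [[tok for tok in doc if tok in keep] for doc in docs]


def shuffle_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and split it into (training, validation) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


if __name__ == "__main__":
    docs = [["a", "b", "a"], ["a", "c"], ["b", "b"]]
    print(prune_vocabulary(docs))          # "c" occurs once and is removed
    train, val = shuffle_split(range(10))
    print(len(train), len(val))
```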

Result: ▪ Precision = 0.5829 ▪ Recall = 0.8633 ▪ F1 = 0.6959

• LSTM (best model)

For this model, we create a neural network with LSTM; the word embeddings are learned while fitting the neural network. We vary the number of layers and the layer sizes to find the best model architecture. Our final best RNN with LSTM is composed of an embedding layer, an LSTM layer, and 3 densely connected layers, with activation and dropout layers in between. ReLU is used as the activation function, and dropout = 0.5 is used to reduce overfitting. We adopt sigmoid as the activation function in the output layer.

In the training process, we compare different loss functions (MSE, binary cross-entropy), optimizers (Adam, SGD, RMSprop), and batch sizes (16, 32, 64, 128). Our final best model uses binary cross-entropy as the loss function, RMSprop as the optimizer, and a batch size of 128. As with the SVM, we shuffle the training dataset and split it into training data (0.8) and validation data (0.2).

For the data input, we further process 'dataStep1.csv'. First, we take the 5,000 strings as input and, for each string, keep at most max_len = 100 words, counting from the right end toward the left. Then, based on the vocabulary, we store the index of each word of each string in a table (matrix_sequences) of size 5,000 by 100. The input to the LSTM model is therefore a 5,000 by 100 matrix.
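A minimal sketch of this indexing step is shown below, using only the standard library. It assumes that index 0 is reserved for padding, that shorter strings are padded with zeros on the left, and that word indices are assigned in order of first appearance; none of these details are confirmed here, and build_index and to_sequences are hypothetical names.

```python
def build_index(docs):
    """Map each word to a positive integer in order of first appearance.

    Index 0 is left unused so it can serve as the padding value (assumption).
    """
    index = {}
    for doc in docs:
        for tok in doc:
            if tok not in index:
                index[tok] = len(index) + 1
    return index


def to_sequences(docs, index, max_len=100):
    """Turn tokenized documents into fixed-length rows of word indices.

    Each row keeps at most the last max_len words of a document and is
    left-padded with zeros up to max_len.
    """
    rows = []
    for doc in docs:
        ids = [index[tok] for tok in doc if tok in index][-max_len:]
        rows.append([0] * (max_len - len(ids)) + ids)
    return rows


if __name__ == "__main__":
    docs = [["so", "funny"], ["so", "not", "funny", "at", "all"]]
    index = build_index(docs)
    for row in to_sequences(docs, index, max_len=4):
        print(row)
```

With 5,000 documents and max_len = 100, the result has the 5,000 by 100 shape described above.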

Result: ▪ Precision = 0.6068 ▪ Recall = 0.8967 ▪ F1 = 0.7238
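F1 is the harmonic mean of precision and recall, so the three reported results can be checked against each other. The short sketch below recomputes F1 from the precision and recall values listed above; the function name f1_score and the REPORTED dictionary are only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the results above.
REPORTED = {
    "Naive Bayes": (0.5371, 0.7967, 0.6416),
    "SVM": (0.5829, 0.8633, 0.6959),
    "LSTM": (0.6068, 0.8967, 0.7238),
}

if __name__ == "__main__":
    for name, (precision, recall, reported_f1) in REPORTED.items():
        computed = f1_score(precision, recall)
        print(f"{name}: computed F1 = {computed:.4f}, reported F1 = {reported_f1}")
```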

Contribution

All team members contributed equally to the project, and each committed 20+ hours to it.
