Sentiment Analysis for Movie Reviews

This project implements and compares different approaches for sentiment analysis on movie reviews, using both supervised and unsupervised learning techniques. The goal is to automatically determine whether a movie review expresses a positive or negative opinion.

Project Overview

This project is structured in two main parts:

Supervised Learning Approach: Using machine learning algorithms trained on labeled data to classify reviews
Unsupervised Learning Approach: Using SentiWordNet lexical resource without requiring labeled training data

Dataset

The project uses the Movie Reviews Corpus from NLTK, which contains movie reviews labeled as either positive or negative. The dataset is preprocessed and split into training and testing sets with an 80/20 ratio.

Methodology

Text Preprocessing

Before training models, the text data undergoes several preprocessing steps:

Conversion to lowercase
Removal of newlines
Removal of punctuation
Tokenization
Stopword removal
Lemmatization

Supervised Learning Approach

Several supervised machine learning models were implemented and evaluated:

Random Forest: An ensemble of decision trees
K-Nearest Neighbors (KNN): Classification based on the k closest training examples
Support Vector Machine (SVM): Finding the optimal hyperplane to separate classes
Logistic Regression: A linear model for binary classification
LightGBM: A gradient boosting framework

Each model underwent hyperparameter optimization using Grid Search with 5-fold cross-validation to find the best configuration.

Unsupervised Learning Approach

The unsupervised approach uses SentiWordNet, a lexical resource derived from WordNet that assigns sentiment scores to words:

Each word is analyzed for its part of speech (POS) and assigned positive and negative scores
Different aggregation methods were tested:
- Standard: Sum all positive and negative scores directly
- Max: Only sum the highest score (positive or negative) for each word
- Binary: Count how many words are more positive than negative and vice versa
- Squared: Square the scores to give more weight to intense sentiments
- Harmonic: Use the harmonic mean between positive and negative scores

Results

Model Performance

The supervised models significantly outperformed the unsupervised approach
Random Forest achieved the best performance with 86.5% accuracy on the test set
The unsupervised approach using SentiWordNet reached 65.2% accuracy
Among different grammatical categories, adjectives proved most informative for sentiment detection

Comparative Analysis

Supervised models excel at capturing complex relationships between words and phrases
Unsupervised approaches struggle with context, sarcasm, and mixed sentiments
Even with small training sets, supervised models outperformed the lexicon-based approach
An ROC curve analysis for the Random Forest model showed an AUC of 0.93, indicating excellent classification performance

Installation & Requirements

pip install nltk scikit-learn pandas numpy matplotlib lightgbm

NLTK resources needed:

nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('sentiwordnet')
nltk.download('movie_reviews')

Usage

The project is implemented as a Jupyter notebook. To run it:

Install the required dependencies
Download the necessary NLTK resources
Run the notebook cells sequentially

Error Analysis

Common classification errors include:

Reviews with complex language and subtle nuances
Reviews discussing negative aspects of a movie while being globally positive (or vice versa)
Text containing sarcasm or irony
Mixed sentiment reviews

Authors

Sergi Flores & Sam Brumwell

License

This project is for educational purposes and part of university coursework.

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.gitattributes		.gitattributes
Practica2_final.ipynb		Practica2_final.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sentiment Analysis for Movie Reviews

Project Overview

Dataset

Methodology

Text Preprocessing

Supervised Learning Approach

Unsupervised Learning Approach

Results

Model Performance

Comparative Analysis

Installation & Requirements

Usage

Error Analysis

Authors

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Sentiment Analysis for Movie Reviews

Project Overview

Dataset

Methodology

Text Preprocessing

Supervised Learning Approach

Unsupervised Learning Approach

Results

Model Performance

Comparative Analysis

Installation & Requirements

Usage

Error Analysis

Authors

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages