Sentiment Analysis on IMDb Movie Reviews

Overview

This project applies Natural Language Processing (NLP) techniques to analyze the sentiment of movie reviews from the IMDb dataset. The goal is to classify each review as positive or negative based on its text content.

The workflow includes:

- Data exploration and visualization
- Text preprocessing and cleaning
- Feature extraction using TF-IDF
- Model training with Logistic Regression
- Evaluation and interpretation of results

Dataset

Source: IMDb Movie Reviews dataset
Size: 50,000 reviews
Columns:
- review: text of the movie review
- sentiment: label (positive or negative)

Balance: 25,000 positive reviews and 25,000 negative reviews
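
A quick way to confirm the class balance is to count the values in the sentiment column. The sketch below is illustrative only: it uses the standard library, and a small in-memory sample (`SAMPLE_CSV`, a hypothetical stand-in) replaces the real file, whose name and location are not stated here.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the real dataset, which has the same two columns.
SAMPLE_CSV = """review,sentiment
"A touching, powerful film.",positive
"Awful. A complete waste of time.",negative
"Terrific cast and a realistic story.",positive
"The worst movie I have seen.",negative
"""


def label_counts(csv_file):
    """Count how many reviews carry each sentiment label."""
    reader = csv.DictReader(csv_file)
    return Counter(row["sentiment"] for row in reader)


counts = label_counts(io.StringIO(SAMPLE_CSV))
for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

On the full dataset, the same count should report 25,000 reviews for each label.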

Methodology

Data Exploration
- Checked distribution of labels and review lengths
- Visualized frequent words with word clouds

Preprocessing
- Lowercased text
- Removed HTML tags, punctuation, and stopwords

Feature Extraction
- Converted text into numerical vectors using TF-IDF

Model Training
- Trained a Logistic Regression classifier
- Chose it for its interpretability and efficiency on high-dimensional sparse data

Evaluation
- Accuracy: 88.85%
- ROC AUC: 0.9571
- Confusion matrix shows balanced performance across classes
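
The steps above can be sketched end to end as follows. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn (no library is named here), and the `clean_text` helper, the toy reviews, the built-in English stop-word list, and every hyperparameter are assumptions chosen for brevity.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import make_pipeline


def clean_text(text):
    """Lowercase the text, strip HTML tags, and drop punctuation."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # HTML tags such as <br />
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


# Toy data standing in for the 50,000 IMDb reviews (1 = positive, 0 = negative).
train_texts = [
    "A terrific, touching masterpiece.<br />Loved it!",
    "Powerful and realistic. I enjoyed every minute.",
    "Awful and pathetic, a waste of time.",
    "The worst, most horrible film of the year.",
]
train_labels = [1, 1, 0, 0]
test_texts = ["A touching and powerful story.", "Horrible. What a waste."]
test_labels = [1, 0]

# TF-IDF handles tokenization and stop-word removal; clean_text runs first.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_text, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

pred = model.predict(test_texts)
proba = model.predict_proba(test_texts)[:, 1]  # probability of the positive class

print("Accuracy:", accuracy_score(test_labels, pred))
print("ROC AUC:", roc_auc_score(test_labels, proba))
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is learned from the training texts only, which avoids leaking information from the evaluation set.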

Results

The model performs well on both classes, with precision, recall, and F1-scores all at approximately 0.89.
Top positive words: realistic, terrific, masterpiece, enjoy, powerful, touching
Top negative words: awful, bad, worst, waste, horrible, terrible, pathetic

Visualizations include:

- Distribution of sentiment labels
- Histogram of review lengths
- Word clouds of overall reviews and influential positive/negative words
- ROC curve

Conclusion

The project demonstrates that a simple Logistic Regression model with TF-IDF features can classify movie reviews by sentiment with about 89% accuracy. The analysis also surfaces the words that drive positive and negative predictions, and these align with human intuition.

Author

This project was developed by Alba Górriz.
