This project applies Natural Language Processing (NLP) techniques to analyze the sentiment of movie reviews from the IMDb dataset. The goal is to classify each review as positive or negative based on its text content.
The workflow includes:
- Data exploration and visualization
- Text preprocessing and cleaning
- Feature extraction using TF-IDF
- Model training with Logistic Regression
- Evaluation and interpretation of results
Dataset:
- Source: IMDb Movie Reviews dataset
- Size: 50,000 reviews
- Columns:
  - review: text of the movie review
  - sentiment: label (positive or negative)
- Balance: 25,000 positive reviews and 25,000 negative reviews
Data Exploration
- Checked distribution of labels and review lengths
- Visualized frequent words with word clouds
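The two exploration checks above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the `sample` rows are made-up stand-ins for the dataset, and `label_distribution` and `review_lengths` are hypothetical helper names.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical stand-in for the dataset: (review, sentiment) pairs.
sample = [
    ("A touching, powerful film.", "positive"),
    ("Terrible plot and awful acting, a waste of time.", "negative"),
    ("An enjoyable masterpiece.", "positive"),
    ("The worst movie I have seen this year.", "negative"),
]

def label_distribution(rows):
    """Count how many reviews carry each sentiment label."""
    return Counter(label for _, label in rows)

def review_lengths(rows):
    """Length of each review in whitespace-separated tokens."""
    return [len(text.split()) for text, _ in rows]

counts = label_distribution(sample)
lengths = review_lengths(sample)

print(counts)
print("mean length:", mean(lengths), "median length:", median(lengths))
```

On the full dataset the same counts would feed the label distribution plot and the review length histogram.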
Preprocessing
- Lowercased text
- Removed HTML tags, punctuation, and stopwords
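A rough sketch of these cleaning steps follows. The stopword set here is a tiny illustrative assumption (the project's actual list is not given), and `clean_review` is a hypothetical function name.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "of", "was", "i", "to", "in"}

# Matches HTML tags such as <br />.
TAG_RE = re.compile(r"<[^>]+>")

# Maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_review(text):
    """Lowercase, strip HTML tags, remove punctuation, and drop stopwords."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)          # remove HTML tags first, while < and > still exist
    text = text.translate(PUNCT_TABLE)    # punctuation becomes whitespace
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

print(clean_review("This movie was <br /><br />GREAT, and I loved it!"))
# -> movie great loved
```

Tags are removed before punctuation because the punctuation step would otherwise delete the angle brackets and leave the tag names behind as words. Replacing punctuation with spaces rather than deleting it keeps words such as "good.bad" from being glued together.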
Feature Extraction
- Converted text into numerical vectors using TF-IDF
Model Training
- Trained a Logistic Regression classifier
- Chose it for its interpretability and efficiency on high-dimensional sparse data
Evaluation
- Accuracy: 88.85%
- ROC AUC: 0.9571
- Confusion matrix shows balanced performance across classes
- Precision, recall, and F1-score are all about 0.89 for both the positive and the negative class
- Top positive words: realistic, terrific, masterpiece, enjoy, powerful, touching
- Top negative words: awful, bad, worst, waste, horrible, terrible, pathetic
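Metrics of the kind reported above can be computed with `sklearn.metrics`, as in the sketch below. The labels and scores are made-up toy values chosen so the arithmetic is easy to follow; they are not the project's predictions, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

# Toy ground truth (1 = positive, 0 = negative) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

# Hard predictions at an assumed threshold of 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

accuracy = accuracy_score(y_true, y_pred)   # fraction of correct predictions
auc = roc_auc_score(y_true, y_score)        # uses the scores, not the hard labels
cm = confusion_matrix(y_true, y_pred)       # rows: true class, columns: predicted class

print("accuracy:", accuracy)                # 6 of 8 correct -> 0.75
print("ROC AUC:", auc)                      # 14 of 16 positive/negative pairs ranked correctly -> 0.875
print(cm)
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```

Accuracy and the confusion matrix depend on the chosen threshold, while ROC AUC measures how well the scores rank positives above negatives across all thresholds.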
Visualizations include:
- Distribution of sentiment labels
- Histogram of review lengths
- Word clouds of overall reviews and influential positive/negative words
- ROC curve
The project demonstrates that a simple Logistic Regression model with TF-IDF features classifies IMDb movie reviews by sentiment with 88.85% accuracy and a ROC AUC of 0.9571. The words that most strongly drive positive and negative predictions are also easy to interpret and match human intuition.
This project was developed by Alba Górriz.