albaaggbb/sentiment-analysis-imdb

Sentiment Analysis on IMDb Movie Reviews

Overview

This project applies Natural Language Processing (NLP) techniques to analyze the sentiment of movie reviews from the IMDb dataset. The goal is to classify each review as positive or negative based on its text content.

The workflow includes:

  • Data exploration and visualization
  • Text preprocessing and cleaning
  • Feature extraction using TF-IDF
  • Model training with Logistic Regression
  • Evaluation and interpretation of results

Dataset

  • Source: IMDb Movie Reviews dataset
  • Size: 50,000 reviews
  • Columns:
    • review: text of the movie review
    • sentiment: label (positive or negative)

  • Balance: 25,000 positive reviews and 25,000 negative reviews
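
A quick way to confirm the class balance is to count the values in the sentiment column. The sketch below is illustrative only: it assumes the data is a CSV with the review and sentiment columns listed above, and it reads a two-row in-memory sample (made-up rows) instead of the real file, whose name is not given here.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file):
    """Count how many reviews carry each sentiment label."""
    reader = csv.DictReader(csv_file)
    return Counter(row["sentiment"] for row in reader)


# Tiny in-memory stand-in for the real dataset (hypothetical rows).
sample = io.StringIO(
    "review,sentiment\n"
    '"A touching, powerful film.",positive\n'
    '"Awful. A waste of time.",negative\n'
)

counts = label_counts(sample)
print(counts)
```

To run it on the full dataset, pass an open file handle instead of the sample, for example `open(path, newline="", encoding="utf-8")`, where `path` points to your local copy.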

Methodology

  1. Data Exploration

    • Checked distribution of labels and review lengths
    • Visualized frequent words with word clouds
  2. Preprocessing

    • Lowercased text
    • Removed HTML tags, punctuation, and stopwords
  3. Feature Extraction

    • Converted text into numerical vectors using TF-IDF
  4. Model Training

    • Trained a Logistic Regression classifier
    • Chose it for its interpretability and efficiency on high-dimensional sparse data
  5. Evaluation

    • Accuracy: 88.85%
    • ROC AUC: 0.9571
    • Confusion matrix shows balanced performance across classes
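
Steps 2 to 5 can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the project's actual code: the `clean_review` helper, the toy reviews, the use of scikit-learn's built-in English stopword list, and all hyperparameters are assumptions. The toy data will not reproduce the scores reported above.

```python
import html
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import make_pipeline

TAG_RE = re.compile(r"<[^>]+>")
# Map every punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text):
    """Lowercase, strip HTML tags and punctuation, collapse whitespace."""
    text = html.unescape(text).lower()
    text = TAG_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())


# Hypothetical toy reviews; 1 = positive, 0 = negative.
train_texts = [
    "A terrific, touching masterpiece.<br />Truly powerful!",
    "Powerful acting and a realistic story. I enjoyed it.",
    "Awful plot, horrible acting. A waste of time.",
    "The worst movie ever: pathetic and terrible.",
]
train_labels = [1, 1, 0, 0]
test_texts = ["A touching and realistic film.", "Terrible. What a waste!"]
test_labels = [1, 0]

# TF-IDF features feed a Logistic Regression classifier.
# Stopword removal is delegated to the vectorizer (an assumption).
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_review, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

pred = model.predict(test_texts)
proba = model.predict_proba(test_texts)[:, 1]  # probability of class 1
acc = accuracy_score(test_labels, pred)
auc = roc_auc_score(test_labels, proba)
print(f"accuracy={acc:.4f}  roc_auc={auc:.4f}")
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is learned from the training split only, which avoids leaking test data into the features.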

Results

  • The model performs well on both classes with balanced precision, recall, and F1-scores (~0.89)
  • Top positive words: realistic, terrific, masterpiece, enjoy, powerful, touching
  • Top negative words: awful, bad, worst, waste, horrible, terrible, pathetic

Visualizations include:

  • Distribution of sentiment labels
  • Histogram of review lengths
  • Word clouds of overall reviews and influential positive/negative words
  • ROC curve

Conclusion

The project demonstrates that a simple Logistic Regression model with TF-IDF features can classify movie reviews by sentiment effectively, reaching 88.85% accuracy and a ROC AUC of 0.9571. The analysis also gives interpretable insight into the words that drive positive and negative predictions, and those words match human intuition.

Author

This project was developed by Alba Górriz.
