- Alejandro Figueroa
- Raynard Flores
This project analyzes IMDb movie data to provide insights into the movie industry. Our primary focus is on understanding which movies viewers value most within specific genres, based on user ratings and vote counts. We also explore the relationship between a movie's gross earnings and its IMDb rating to determine whether higher ratings are associated with higher earnings. Finally, we determine which of four genres (Action, Animation, Horror, History) receives the most votes on average and which has the longest average runtime.
As cinephiles, we were curious to discover significant films within specific genres based on IMDb ratings and vote counts. As data analysts in training, we developed three hypotheses about these top-rated movies in order to apply the techniques we have learned so far: data cleaning, data wrangling, EDA, and data visualization.
- Determine the highest-rated movies within specific genres based on user ratings and vote counts. The genres analyzed are:
  - Action
  - Animation
  - Horror
  - History
- Analyze the correlation between a movie's gross earnings and its IMDb rating to understand the factors contributing to a movie's financial and critical success.
- Among the highest-rated movies in the specified genres, determine which genre has the highest average number of votes.
- Among the highest-rated movies in the specified genres, determine which genre has the highest average runtime.
Our main dataset comes from Kaggle and is a collection of CSV files containing information about movies in specific genres.
The dataset contains the following columns:
- movie_id: IMDb movie ID
- movie_name: Name of the movie
- year: Release year
- certificate: Movie certificate rating
- run_time: Total runtime of the movie
- genre: Genre of the movie
- rating: IMDb rating of the movie
- description: Description of the movie
- director: Director of the movie
- director_id: IMDb ID of the director
- star: Star of the movie
- star_id: IMDb ID of the star
- votes: Number of votes the movie received on IMDb
- gross: Gross box office revenue of the movie in dollars
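As a rough illustration of the cleaning step, the sketch below converts the numeric columns listed above and drops rows without a rating or vote count. The sample rows are made up, the "120 min" runtime format is an assumption about the raw files, and `clean_movies` is a hypothetical helper, not the function used in our notebook.

```python
import pandas as pd

# Made-up sample rows for illustration; the real data would be loaded
# from the Kaggle CSV files with pd.read_csv.
raw = pd.DataFrame({
    "movie_name": ["Movie A", "Movie B", "Movie C"],
    "genre": ["Action", "Horror", "Action"],
    "run_time": ["120 min", "95 min", None],   # assumed raw format
    "rating": [8.1, 6.4, 7.0],
    "votes": [150000.0, 32000.0, None],
    "gross": [250000000.0, None, 40000000.0],
})


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns converted and unrated rows dropped."""
    out = df.copy()
    # Pull the first run of digits out of run_time ("120 min" -> 120.0).
    minutes = out["run_time"].astype(str).str.extract(r"(\d+)", expand=False)
    out["run_time"] = pd.to_numeric(minutes, errors="coerce")
    # Coerce the remaining numeric columns; unparseable values become NaN.
    for col in ["rating", "votes", "gross"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Ratings and votes drive every hypothesis, so rows missing either are dropped.
    return out.dropna(subset=["rating", "votes"]).reset_index(drop=True)


cleaned = clean_movies(raw)
```

Missing `gross` values are kept here on purpose, since only the rating versus gross comparison needs them and they can be dropped at that step.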
Our second data source is the Movie Database API, which collects movie information from IMDb and is freely hosted on RapidAPI.
- Movie Database API: https://rapidapi.com/SAdrian/api/moviesdatabase
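A request to this API might be sketched as follows. The host name, the `/titles/{id}` path, and the helper names `build_request` and `fetch_title` are assumptions made for illustration; check the RapidAPI page above for the actual endpoints and response shape.

```python
from typing import Dict, Tuple

import requests

# Assumed RapidAPI host for the Movie Database API.
API_HOST = "moviesdatabase.p.rapidapi.com"


def build_request(title_id: str, api_key: str) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for looking up one title by its IMDb ID."""
    url = f"https://{API_HOST}/titles/{title_id}"  # assumed endpoint path
    headers = {
        "X-RapidAPI-Key": api_key,
        "X-RapidAPI-Host": API_HOST,
    }
    return url, headers


def fetch_title(title_id: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON body."""
    url, headers = build_request(title_id, api_key)
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of parsing them
    return response.json()
```

In a notebook, the key would typically be read from a `.env` file with `load_dotenv()` and `os.getenv` rather than hard-coded; the variable name you choose for it is up to you.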
Here, we will outline our initial hypotheses based on our problem statement. These hypotheses will guide our analysis and help us focus on specific relationships within the data.
- Hypothesis 1: There is a correlation between a movie's IMDb rating and its worldwide gross.
- Hypothesis 2: Among the selected common genres—Action, Horror, Animation, and History—we hypothesize that the Action genre generates the highest average number of votes.
- Hypothesis 3: Among the selected genres, we hypothesize that the Action genre has the highest average runtime.
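One way to check these three hypotheses with pandas is sketched below. The rows are invented purely to show the mechanics and say nothing about the real results; the column names follow the dataset description, and the use of Pearson correlation is our assumption about a reasonable default.

```python
import pandas as pd

# Invented rows for illustration only; not real IMDb figures.
movies = pd.DataFrame({
    "genre": ["Action", "Action", "Horror", "Horror", "Animation", "History"],
    "rating": [8.0, 7.0, 6.0, 5.0, 7.5, 7.2],
    "votes": [200000, 100000, 40000, 20000, 90000, 30000],
    "run_time": [130, 110, 95, 85, 100, 140],
    "gross": [300.0, 200.0, 50.0, 30.0, 180.0, 60.0],
})

# Hypothesis 1: Pearson correlation between rating and gross.
rating_gross_corr = movies["rating"].corr(movies["gross"])

# Hypotheses 2 and 3: average votes and runtime per genre.
by_genre = movies.groupby("genre")[["votes", "run_time"]].mean()
top_votes_genre = by_genre["votes"].idxmax()
top_runtime_genre = by_genre["run_time"].idxmax()

print(f"rating vs gross correlation: {rating_gross_corr:.2f}")
print(by_genre)
```

A correlation coefficient alone does not show that ratings drive earnings, so a scatter plot of the two columns is a useful companion to this number.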
We will employ various data analytics techniques including data visualization, EDA, and data wrangling to explore the dataset and validate our hypotheses.
To perform a similar analysis, follow these steps:
- Download the dataset from Kaggle (see the dataset description above)
- Install the dependencies in your notebook environment
- Run the notebook cells to explore and analyze the data
- Draw conclusions from your findings
You will need to import the following:
- Pandas --> import pandas as pd
- Requests --> import requests
- Pyplot --> import matplotlib.pyplot as plt
- DotEnv --> from dotenv import load_dotenv
- Path --> from pathlib import Path
- OS --> import os
Contributions to this project are welcome. You can contribute by:
- Extending the analysis to include additional movie metrics.
- Refining the visualizations and interpretations.
Please refer to the contribution guidelines before making a contribution.