<p>Adrian’s Data Science Blog (<a href="https://adrianll.github.io/">adrianll.github.io</a>): I am a programmer and IT professional turned Data Scientist out of fascination for data-driven work, problem-solving, and the high societal impact of big data.</p>
<h2>Project Six: APIs and Random Forests (2017-03-17)</h2>
<h3 id="the-problem">The Problem:</h3>
<p>There were three portions to this project:</p>
<ul>
<li>Collection</li>
<li>Cleaning</li>
<li>Analysis and Modeling</li>
</ul>
<p>The end goal was to be able to accurately predict what a high rated movie might be and the contributing factors towards a high movie rating.</p>
<h3 id="risks-and-assumptions">Risks and Assumptions:</h3>
<p>This project came down to the dataset: the reliability of the model ultimately depended on how much data was extracted as well as the reliability of the feature extraction.
Some of the features in this project were not extracted as meticulously as they could have been; many were simplified or dropped due to time constraints.
Given more time for data cleaning, more analysis could have been done on the null values and missing data such as the meta score. Critic score seems like it could have had a large impact on overall movie rating.</p>
<h3 id="scraping-and-dataset">Scraping and Dataset</h3>
<p>The first step in the project was to actually get the dataset for movies in the US.</p>
<p>To start, there was a top 250 rating page I planned to use, but I felt that 250 movies might not be enough for building a good model.</p>
<p>In order to pull a larger number of movies, I found this list of movies released in the U.S. from 1972-2016:</p>
<p><a href="http://www.imdb.com/list/ls057823854/">All U.S. Released Movies: 1972-2016</a></p>
<p>Since there were around 10,000 movies in this dataset, I thought it would not be a good idea to scrape each movie page individually.
The last time I scraped, the average scrape time was 2-3 seconds per page, and I ran into memory issues as well.</p>
<p>After checking the OMDb API, it seemed easier to look movies up by their IMDb ID rather than by title, since titles can vary.</p>
<p>This led me to scraping each movie title along with its IMDb ID and storing them in a dataframe:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd

def scrape_panel(soup):
    """Pull every movie title and IMDb ID out of the list page."""
    col = soup.find('div', class_='list compact')
    names = []
    imdbID = []
    for cell in col.find_all('td', class_='title'):
        # Title text; fail soft to None on malformed cells
        try:
            names.append(cell.get_text(strip=True))
        except AttributeError:
            names.append(None)
        # The first link's href looks like '/title/tt0068646/',
        # so the IMDb ID is the third path segment
        try:
            href = cell.find_all('a')[0]['href']
            imdbID.append(href.split('/')[2])
        except (AttributeError, IndexError, KeyError):
            imdbID.append(None)
    return pd.DataFrame({'name': names, 'id': imdbID})</code></pre></figure>
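<p>As a self-contained illustration of that extraction logic, the snippet below runs the same parsing against a small hand-written stand-in for the IMDb list markup (the real page was fetched over HTTP, and the two rows here are hypothetical examples):</p>

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for the real IMDb list page markup
html = """
<div class="list compact"><table>
  <tr><td class="title"><a href="/title/tt0068646/">The Godfather</a></td></tr>
  <tr><td class="title"><a href="/title/tt0073195/">Jaws</a></td></tr>
</table></div>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for cell in soup.find('div', class_='list compact').find_all('td', class_='title'):
    link = cell.find('a')
    # href has the form '/title/tt0068646/', so index 2 is the IMDb ID
    rows.append({'name': link.get_text(strip=True),
                 'id': link['href'].split('/')[2]})
data = pd.DataFrame(rows)
```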
<h3 id="api-calls-and-extraction">API Calls and Extraction</h3>
<p>To get all the movie information, the dataframe created from the web scrape was used along with the loop below, which builds one API URL per movie:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">api_calls</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">ids</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span>
<span class="n">api_calls</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">'http://www.omdbapi.com/?i='</span><span class="o">+</span><span class="n">data</span><span class="p">[</span><span class="s">'id'</span><span class="p">][</span><span class="n">ids</span><span class="p">]</span><span class="o">+</span><span class="s">'&plot=full'</span><span class="p">)</span></code></pre></figure>
<p>The API call itself was quite simple, since I wanted to extract all the information provided.</p>
<p>Surprisingly, making the calls was the easiest part; parsing the JSON output ended up being more difficult.</p>
<p>Each JSON response was converted to a dictionary and then into a series. The series were stacked side by side and then transposed to generate a dataframe with all the data.</p>
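<p>A minimal sketch of that stacking step, with two hand-written dictionaries standing in for real OMDb JSON responses (only a few of OMDb's fields are shown):</p>

```python
import pandas as pd

# Hypothetical responses standing in for json.loads() of each OMDb reply
responses = [
    {'Title': 'The Godfather', 'Year': '1972', 'imdbRating': '9.2'},
    {'Title': 'Jaws', 'Year': '1975', 'imdbRating': '8.1'},
]

# One Series per movie, stacked side by side, then transposed so each
# movie becomes a row of the final dataframe
movies = pd.concat([pd.Series(r) for r in responses], axis=1).T
```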
<h3 id="data-cleaning">Data Cleaning</h3>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="n">RangeIndex</span><span class="p">:</span> <span class="mi">9951</span> <span class="n">entries</span><span class="p">,</span> <span class="mi">0</span> <span class="n">to</span> <span class="mi">9950</span>
<span class="n">Data</span> <span class="n">columns</span> <span class="p">(</span><span class="n">total</span> <span class="mi">25</span> <span class="n">columns</span><span class="p">):</span>
<span class="n">Actors</span> <span class="mi">9760</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Awards</span> <span class="mi">7021</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Country</span> <span class="mi">9786</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Director</span> <span class="mi">9652</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Episode</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Error</span> <span class="mi">154</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Genre</span> <span class="mi">9773</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Language</span> <span class="mi">9761</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Metascore</span> <span class="mi">4575</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Plot</span> <span class="mi">9718</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Poster</span> <span class="mi">9612</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Rated</span> <span class="mi">8638</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Released</span> <span class="mi">9568</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Response</span> <span class="mi">9951</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">bool</span>
<span class="n">Runtime</span> <span class="mi">9625</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Season</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Title</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Type</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Writer</span> <span class="mi">9550</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Year</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">imdbID</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">imdbRating</span> <span class="mi">9662</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">imdbVotes</span> <span class="mi">9661</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">seriesID</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">totalSeasons</span> <span class="mi">71</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">dtypes</span><span class="p">:</span> <span class="nb">bool</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">float64</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span> <span class="nb">object</span><span class="p">(</span><span class="mi">19</span><span class="p">)</span>
<span class="n">memory</span> <span class="n">usage</span><span class="p">:</span> <span class="mf">1.8</span><span class="o">+</span> <span class="n">MB</span></code></pre></figure>
<p><strong>Data Removal Process:</strong></p>
<p>TV Shows - I wanted to run this model on movies exclusively since tv shows may have different metrics that made them good or bad.</p>
<p>Errors Column - This column only had error messages for some API calls that did not go through properly.</p>
<p>Poster Image - Given that no features were going to be generated from the image, the image was dropped.</p>
<p>Only keep movies with ratings - Imputing ratings for unrated movies would be difficult to do well given how varied the set of movies is, so those rows were dropped.</p>
<p>Remove Meta Score Column - Although this seemed like a good metric, there were too many missing values for it to be a reliable predictor of overall rating.</p>
<p>Awards Column - This column was particularly complex since it bundled Oscars, other awards, and nominations into one string. I decided to sum all of the counts into a single award total; rows with null values got 0.</p>
<p>Numerical Values - All columns holding numerical values were converted from strings to numbers.</p>
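<p>A plausible sketch of those last two steps. The notebook's exact parsing code isn't shown, so the helpers below are illustrative:</p>

```python
import re

def award_total(text):
    """Sum every count in strings like 'Won 3 Oscars. Another 23 wins and 30 nominations.'"""
    if not isinstance(text, str):
        return 0  # null/NaN rows get 0
    return sum(int(n) for n in re.findall(r'\d+', text))

def to_int(text):
    """Strip non-digits: '1,234' -> 1234, '120 min' -> 120."""
    digits = re.sub(r'\D', '', str(text))
    return int(digits) if digits else 0
```

These would then be applied column-wise, e.g. <code>df['Awards'].apply(award_total)</code>, and similarly with <code>to_int</code> for <code>Runtime</code> and <code>imdbVotes</code>.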
<h3 id="data-cleaning-1">Final Cleaned Dataset</h3>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="n">RangeIndex</span><span class="p">:</span> <span class="mi">8475</span> <span class="n">entries</span><span class="p">,</span> <span class="mi">0</span> <span class="n">to</span> <span class="mi">8474</span>
<span class="n">Data</span> <span class="n">columns</span> <span class="p">(</span><span class="n">total</span> <span class="mi">16</span> <span class="n">columns</span><span class="p">):</span>
<span class="n">Actors</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Awards</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">Country</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Director</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Genre</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Language</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Plot</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Rated</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Runtime</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">Title</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Writer</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Year</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int32</span>
<span class="n">imdbRating</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">imdbVotes</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int32</span>
<span class="n">MonthReleased</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">DayReleased</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">dtypes</span><span class="p">:</span> <span class="n">float64</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">int32</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="n">int64</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span> <span class="nb">object</span><span class="p">(</span><span class="mi">9</span><span class="p">)</span></code></pre></figure>
<h3 id="visuazlizations">Visualizations</h3>
<p><img src="https://adrianll.github.io//assets/images/project6/AwardsNominations.png" alt="Rating Histogram" /></p>
<p>Median: 6.4 Rating</p>
<p>The movie ratings mostly cluster around this median, with the distribution left-skewed toward the higher ratings.</p>
<p><img src="https://adrianll.github.io//assets/images/project6/yearhist.png" alt="Year Histogram" />
This histogram shows movies released by year and highlights the heavy concentration of releases after 2000. This may make the models work better for movies from that period.</p>
<h3 id="modeling-and-analysis">Modeling and Analysis</h3>
<p>All the columns with string values were turned into dummy variables, aside from writers and plots.
Writers - For one, around 13-14 thousand writer dummy columns would have been created, and the names were not very consistent, so they were left out. Given more time, the writer data could have been cleaned up and used.</p>
<p>Given that the ratings are distributed around the median, it seemed sensible to categorize movies as good or bad, similar to how YouTube categorizes videos (thumbs up and thumbs down). Given more time, this might become three categories: good, bad, and neutral. I used the median rating of 6.4 as the cutoff for the high/low label.</p>
<p>A binary target was generated from the ratings, indicating whether each movie fell above or below the median:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">high_rating</span><span class="p">(</span><span class="n">rating</span><span class="p">):</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">median</span><span class="p">(</span><span class="n">movies</span><span class="p">[</span><span class="s">'imdbRating'</span><span class="p">])</span>
<span class="k">if</span> <span class="n">rating</span><span class="o">>=</span> <span class="n">target</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">0</span></code></pre></figure>
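<p>Note that this function recomputes the median on every call. An equivalent vectorized version (the <code>high</code> column name and the sample ratings below are illustrative) computes it once:</p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real movies dataframe
movies = pd.DataFrame({'imdbRating': [5.0, 6.4, 8.3, 7.7, 4.9]})

target = np.median(movies['imdbRating'])  # 6.4 for this sample
movies['high'] = (movies['imdbRating'] >= target).astype(int)
```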
<p>Using a random forest classifier together with gradient boosting yielded the best results, with an accuracy score of about 73%.</p>
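<p>A minimal sketch of that comparison with scikit-learn. Synthetic data stands in for the dummy-encoded movie features, and the hyperparameters are illustrative rather than the notebook's actual settings:</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dummy-encoded movie features
X, y = make_classification(n_samples=2000, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for model in (RandomForestClassifier(n_estimators=100, random_state=42),
              GradientBoostingClassifier(random_state=42)):
    model.fit(X_train, y_train)
    results[type(model).__name__] = accuracy_score(y_test, model.predict(X_test))
```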
<table>
<tr>
<th> </th>
<th>Predicted High Rating</th>
<th>Predicted Low Rating</th>
</tr>
<tr>
<td>High Rating</td>
<td>867</td>
<td>393</td>
</tr>
<tr>
<td>Low Rating</td>
<td>289</td>
<td>994</td>
</tr>
</table>
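<p>As a sanity check, the reported accuracy can be recomputed directly from the confusion matrix above (rows are actual labels, columns are predictions):</p>

```python
# Counts from the confusion matrix above
true_high_pred_high, true_high_pred_low = 867, 393
true_low_pred_high, true_low_pred_low = 289, 994

total = true_high_pred_high + true_high_pred_low + true_low_pred_high + true_low_pred_low
correct = true_high_pred_high + true_low_pred_low
accuracy = correct / total  # 1861 / 2543, about 0.732, matching the ~73% above
```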
<h3 id="conclusions">Conclusions</h3>
<p>Overall, the random forest did not improve significantly with gradient boosting; however, the score itself was quite good for a first attempt.</p>
<p>The improvement was slight, but in these types of prediction models an increase of about a percentage point is quite significant.</p>
<p>I think more features could have been engineered from the descriptions and writers. The descriptions in particular could have yielded better results, since they may describe aspects of the movies that the genres do not capture.</p>
<p>However, adding those features might also cause overfitting, so further analysis would be needed to get the right number of descriptors out of the description.</p>
<p>Further analysis would involve pinning down the exact features needed to improve the model even more.</p>
<h2>DSI Movie Rating Predictor and Analysis (2017-03-17)</h2>
<h3 id="github-repo">Github Repo</h3>
<p><a href="https://github.com/AdrianLl/DSI-Movie-Rating-Predictor/blob/master/Part%20I%20-%20Scraping%20Movie%20ID.ipynb">Part I - Scraping Movie ID’s</a></p>
<p><a href="https://github.com/AdrianLl/DSI-Movie-Rating-Predictor/blob/master/Part%20II%20-%20API%20Data%20Extraction.ipynb">Part II - API Data Extraction</a></p>
<p><a href="https://github.com/AdrianLl/DSI-Movie-Rating-Predictor/blob/master/Part%20III%20-%20Data%20Cleaning%20%26%20Feature%20Engineering.ipynb">Part III - Data Cleaning & Feature Engineering</a></p>
<p><a href="https://github.com/AdrianLl/DSI-Movie-Rating-Predictor/blob/master/Part%20IV%20-%20Data%20Modeling%20%26%20Clustering.ipynb">Part IV - Data Modeling & Clustering</a></p>
<h3 id="the-problem">The Problem:</h3>
<p>There were three portions to this project:</p>
<ul>
<li>Collection</li>
<li>Cleaning</li>
<li>Analysis and Modeling</li>
</ul>
<p>The end goal was to be able to accurately predict what a high rated movie might be and the contributing factors towards a high movie rating.</p>
<h3 id="risks-and-assumptions">Risks and Assumptions:</h3>
<p>This project came down to the dataset and the reliability of the model in the end was highly dependent on how much data was extracted as well as the reliabiltiy of the feature extraction.
In terms of reliability, some of the features extracted in this project were not done as meticulously as they could have. Many features were simplified or extracted due to time contraints.
Given more time to go over the data cleaning, more analysis could have been done on the null values and missing data such as meta score. Crtic score seems like it could have had a large impact on overall movie rating.</p>
<h3 id="scraping-and-dataset">Scraping and Dataset</h3>
<p>The first step in the project was to actually get the dataset for movies in the US.</p>
<p>To start, there was a top 250 rating page that was going to be used but I felt that the 250 movies might not be enough for creating a good model.</p>
<p>In order to pull a good number of songs, I found this list of movies released in the U.S from 1972-2016</p>
<p><a href="http://www.imdb.com/list/ls057823854/">All U.S. Released Movies: 1972-2016</a></p>
<p>Since there were around 10,000 movies in this dataset I thought it would not be a good idea to scrape.
Last time when I scraped the average scrape time was 2-3 seconds per page as well as running into overall memory issues.</p>
<p>After checking the omdb api, it seemed easier to search movies via their imdb ID as opposed to title since there can be variations in the title.</p>
<p>This lead me to scraping the movie title followed by the IMDB ID and store it in a dataframe:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">scrape_panel</span><span class="p">(</span><span class="n">soup</span><span class="p">):</span>
<span class="n">col</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'div'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">'list compact'</span><span class="p">)</span>
<span class="n">names</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">imdbID</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">col</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"td"</span><span class="p">,</span><span class="n">class_</span><span class="o">=</span><span class="s">"title"</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">names</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">n</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">))</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">names</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">col</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"td"</span><span class="p">,</span><span class="n">class_</span><span class="o">=</span><span class="s">"title"</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">imdbID</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">i</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">imdbID</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'name'</span><span class="p">:</span> <span class="n">names</span><span class="p">,</span> <span class="s">'id'</span><span class="p">:</span> <span class="n">imdbID</span><span class="p">})</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">return</span> <span class="n">data</span></code></pre></figure>
<h3 id="api-calls-and-extraction">API Calls and Extraction</h3>
<p>To get all the movie information, the dataframe created from teh web scrape was used along with this function below:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">api_calls</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">ids</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span>
<span class="n">api_calls</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">'http://www.omdbapi.com/?i='</span><span class="o">+</span><span class="n">data</span><span class="p">[</span><span class="s">'id'</span><span class="p">][</span><span class="n">ids</span><span class="p">]</span><span class="o">+</span><span class="s">'&plot=full'</span><span class="p">)</span></code></pre></figure>
<p>The API call itself was quite simple since I wanted to extract all the information provided.</p>
<p>Surprisingly this was the easiest part since parsing the JSON output ended up being a bit difficult.</p>
<p>The information was all converted to a dictionary format, converted into a series. The information was stacked and the transformed to genrate a dataframe with all the data.</p>
<h3 id="data-cleaning">Data Cleaning</h3>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="n">RangeIndex</span><span class="p">:</span> <span class="mi">9951</span> <span class="n">entries</span><span class="p">,</span> <span class="mi">0</span> <span class="n">to</span> <span class="mi">9950</span>
<span class="n">Data</span> <span class="n">columns</span> <span class="p">(</span><span class="n">total</span> <span class="mi">25</span> <span class="n">columns</span><span class="p">):</span>
<span class="n">Actors</span> <span class="mi">9760</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Awards</span> <span class="mi">7021</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Country</span> <span class="mi">9786</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Director</span> <span class="mi">9652</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Episode</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Error</span> <span class="mi">154</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Genre</span> <span class="mi">9773</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Language</span> <span class="mi">9761</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Metascore</span> <span class="mi">4575</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Plot</span> <span class="mi">9718</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Poster</span> <span class="mi">9612</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Rated</span> <span class="mi">8638</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Released</span> <span class="mi">9568</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Response</span> <span class="mi">9951</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">bool</span>
<span class="n">Runtime</span> <span class="mi">9625</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Season</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Title</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Type</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Writer</span> <span class="mi">9550</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Year</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">imdbID</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">imdbRating</span> <span class="mi">9662</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">imdbVotes</span> <span class="mi">9661</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">seriesID</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">totalSeasons</span> <span class="mi">71</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">dtypes</span><span class="p">:</span> <span class="nb">bool</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">float64</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span> <span class="nb">object</span><span class="p">(</span><span class="mi">19</span><span class="p">)</span>
<span class="n">memory</span> <span class="n">usage</span><span class="p">:</span> <span class="mf">1.8</span><span class="o">+</span> <span class="n">MB</span></code></pre></figure>
<p><strong>Data Removal Process:</strong></p>
<p>TV Shows - I wanted to run this model on movies exclusively, since TV shows may have different metrics that make them good or bad.</p>
<p>Errors Column - This column only had error messages for some API calls that did not go through properly.</p>
<p>Poster Image - Given that no features were going to be generated from the image, the image was dropped.</p>
<p>Only keep movies with ratings - Movies with no ratings were dropped, since figuring out proper values to impute would have taken a while given how varied the set of movies is.</p>
<p>Remove Meta Score Column - Although this seemed like a good metric, there were too many missing values for it to be a reliable predictor of movie ratings overall.</p>
<p>Awards Column - This column was particularly complex, since it bundled Oscars, other awards, and nominations together. I decided to sum them all into a single award-count value, with null values set to 0.</p>
<p>Numerical Values - All columns with numerical values were converted from string format to number format.</p>
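Taken together, the removal steps above can be sketched in pandas. The column names come from the data summary earlier; the awards-string format (e.g. "Won 2 Oscars. Another 5 wins & 3 nominations.") is an assumption based on typical OMDb output, so treat this as an illustration rather than the exact code used:

```python
import re
import pandas as pd

def total_awards(text):
    # Sum every number in an OMDb-style awards string; nulls count as 0.
    # e.g. "Won 2 Oscars. Another 5 wins & 3 nominations." -> 10
    if pd.isnull(text):
        return 0
    return sum(int(n) for n in re.findall(r"\d+", str(text)))

def clean_movies(df):
    # drop TV shows and anything without a rating
    df = df[(df["Type"] == "movie") & df["imdbRating"].notnull()].copy()
    # drop the error/poster/metascore columns
    df = df.drop(columns=["Error", "Poster", "Metascore"], errors="ignore")
    # bundle all awards and nominations into one count
    df["Awards"] = df["Awards"].apply(total_awards)
    # convert string columns holding numbers into numeric types
    df["Runtime"] = pd.to_numeric(
        df["Runtime"].astype(str).str.extract(r"(\d+)")[0], errors="coerce")
    df["imdbVotes"] = pd.to_numeric(
        df["imdbVotes"].astype(str).str.replace(",", ""), errors="coerce")
    return df
```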
<h3 id="data-cleaning-1">Data Cleaning</h3>
<p>Final clean data:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="n">RangeIndex</span><span class="p">:</span> <span class="mi">8475</span> <span class="n">entries</span><span class="p">,</span> <span class="mi">0</span> <span class="n">to</span> <span class="mi">8474</span>
<span class="n">Data</span> <span class="n">columns</span> <span class="p">(</span><span class="n">total</span> <span class="mi">16</span> <span class="n">columns</span><span class="p">):</span>
<span class="n">Actors</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Awards</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">Country</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Director</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Genre</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Language</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Plot</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Rated</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Runtime</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">Title</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Writer</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Year</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int32</span>
<span class="n">imdbRating</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">imdbVotes</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int32</span>
<span class="n">MonthReleased</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">DayReleased</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">dtypes</span><span class="p">:</span> <span class="n">float64</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">int32</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="n">int64</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span> <span class="nb">object</span><span class="p">(</span><span class="mi">9</span><span class="p">)</span></code></pre></figure>
<h3 id="visuazlizations">Visualizations</h3>
<p><img src="https://adrianll.github.io//assets/images/project6/AwardsNominations.png" alt="Rating Histogram" /></p>
<p>Median: 6.4 Rating</p>
<p>The movie ratings mostly lie around this median, and the distribution is left skewed, with more mass toward the higher ratings.</p>
<p><img src="https://adrianll.github.io//assets/images/project6/yearhist.png" alt="Year Histogram" />
This histogram shows movies released by year and highlights the heavy number of movies released after 2000. This may make the models work better for movies of that time period.</p>
<h3 id="modeling-and-analysis">Modeling and Analysis</h3>
<p>All the columns with string values were turned into dummy variables, aside from writers and plots.
Writers - Around 13-14 thousand writer dummy values would have been created, and the names were not very consistent, so they were kept out. Given more time, the writer data could have been cleaned up and used.</p>
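The dummy-variable step can be sketched with pandas; the toy columns below stand in for the real categorical columns such as Rated and Country:

```python
import pandas as pd

movies = pd.DataFrame({
    "Rated": ["PG", "R", "PG-13"],
    "Country": ["USA", "USA", "UK"],
})
# one binary indicator column per category value
dummies = pd.get_dummies(movies, columns=["Rated", "Country"])
print(sorted(dummies.columns))
```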
<p>Given the distribution of the ratings around the median, it made sense to categorize them into what is considered good and bad, similar to how YouTube categorizes videos (thumbs up and thumbs down). Given more time this might become three categories: good, bad, and neutral. I used the median rating of 6.4 as the cutoff for the binary high/low indicator.</p>
<p>A binary target was generated from the ratings: 1 at or above the median, 0 below it.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">high_rating</span><span class="p">(</span><span class="n">rating</span><span class="p">):</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">median</span><span class="p">(</span><span class="n">movies</span><span class="p">[</span><span class="s">'imdbRating'</span><span class="p">])</span>
<span class="k">if</span> <span class="n">rating</span><span class="o">>=</span> <span class="n">target</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">0</span></code></pre></figure>
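Applying the function above labels every movie; the toy ratings here are for illustration. Note that the median is recomputed on every call, so precomputing it once outside the function would be faster on the full dataset:

```python
import numpy as np
import pandas as pd

# toy stand-in for the real movies DataFrame
movies = pd.DataFrame({"imdbRating": [5.2, 6.4, 8.1, 3.0, 7.7]})

def high_rating(rating):
    target = np.median(movies['imdbRating'])  # 6.4 for this toy data
    if rating >= target:
        return 1
    else:
        return 0

movies["HighRating"] = movies["imdbRating"].apply(high_rating)
print(movies["HighRating"].tolist())  # [0, 1, 1, 0, 1]
```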
<p>Using a random forest classifier with gradient boosting yielded the best results, with an accuracy score of about 73%.</p>
<table>
<tr>
<th> </th>
<th>Predicted High Rating</th>
<th>Predicted Low Rating</th>
</tr>
<tr>
<td>High Rating</td>
<td>867</td>
<td>393</td>
</tr>
<tr>
<td>Low Rating</td>
<td>289</td>
<td>994</td>
</tr>
</table>
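The reported ~73% can be checked directly from the counts in the table above, and the same counts give precision and recall for the high-rating class:

```python
# Counts from the confusion matrix above (rows are actual classes)
tp, fn = 867, 393   # actual high rating: predicted high / predicted low
fp, tn = 289, 994   # actual low rating:  predicted high / predicted low

accuracy = (tp + tn) / (tp + fn + fp + tn)   # matches the ~73% reported
precision = tp / (tp + fp)                   # of predicted-high, how many were high
recall = tp / (tp + fn)                      # of actual-high, how many were caught
print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # 0.732 0.75 0.688
```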
<h3 id="conclusions">Conclusions</h3>
<p>Overall the random forest did not improve significantly with gradient boosting; however, the score in itself was quite good for a first attempt.</p>
<p>The improvement was slight, but in these types of prediction models an increase of about a percentage point is quite significant.</p>
<p>I think more features could have been engineered using the descriptions and writers. The descriptions in particular could have yielded better results, since they may describe parts of the movies that the genres do not capture.</p>
<p>However adding those features might also cause overfitting, so further analysis would need to be done to get the right number of descriptors out of the description.</p>
<p>Further analysis would be pinning down the exact features to improve the model even more.</p>The goal of this project was to use ensemble methods to create a movie prediction model. The model would determine if a movie would be getting a high or low rating. Some of the contributing factors to the rating would also be explored.Project Five Classification Disaster Management2017-03-01T12:00:00+00:002017-03-01T12:00:00+00:00https://adrianll.github.io//Project-5<h3 id="the-problem">The Problem:</h3>
<p>This project consists of accessing a remote database for the titanic disaster dataset, acquiring the data, and using that information to predict survival rates with the resulting regression model.</p>
<h3 id="risks-and-assumptions">Risks and Assumptions:</h3>
<p>There are some limitations to the titanic dataset in terms of missing data in both the age and cabin information. For the cabin allocations, it is assumed the gaps are clerical errors, so these factors were not used in the overall analysis. For age, there also seemed to be some data entry issues; for the scope of this analysis, the median age for each gender was used.</p>
<p>Cabin information is not being used due to how incomplete it is, so it will be assumed there just wasn’t enough data there or that it was not necessary for this analysis.</p>
<p>One other assumption is that the collected data is correct and accurate, since many of the individuals involved did pass away.</p>
<h3 id="making-sense-of-the-data-and-problems-with-the-data">Making Sense of the Data and Problems with the Data</h3>
<p>The titanic dataset was imported from a PSQL database into a pandas notebook.</p>
<p>Data Size on import was 891 passengers with 12 features for each one, including the target.</p>
<p>Upon an initial look at the imported data, it was clear there were some missing values:</p>
<p>There seem to be some missing values for age in:
Age 714 non-null float64</p>
<p>There is also a lot of missing data for the Cabin number of most of the passengers:
Cabin 204 non-null object</p>
<p>There are two missing values for the embarked location:
Embarked 889 non-null object</p>
<p><strong>Dealing with Age Missing Values</strong>
Missing age values were replaced with the median age of the corresponding gender group. I went with this solution since the alternative was losing those rows and all the other information they provided. The median was picked over the mean, since the mean could be skewed upward by some of the old-age outliers in the data.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s">'Age'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Sex'</span><span class="p">)[</span><span class="s">'Age'</span><span class="p">]</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">median</span><span class="p">()))</span></code></pre></figure>
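On toy data, the transform above fills each missing age with the median age of that passenger's gender group:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "female"],
    "Age": [20.0, None, 30.0, None, 40.0],
})
# male median is 20.0, female median is 35.0
df["Age"] = df.groupby("Sex")["Age"].transform(lambda x: x.fillna(x.median()))
print(df["Age"].tolist())  # [20.0, 20.0, 30.0, 35.0, 40.0]
```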
<p><strong>Dealing with Cabin Missing Values</strong>
Initially before looking at the data, I began to think that cabin location might be a good indicator of class as well as location during the disaster which might affect the survival rate of passengers. After looking through the data, it was determined that 687 values were missing and thus this column just got dropped. However, given more time it could possibly be explored a bit more and use the little data that we did get.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'Cabin'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span></code></pre></figure>
<p><strong>Dealing with Embarked Missing Values</strong>
Since there are only two missing values here, I looked up the names of these passengers and just filled in the missing value with the actual embarked location.</p>
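A minimal sketch of that fill on toy data; in the real dataset, both passengers with a missing port turn out to have embarked at Southampton ("S"):

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", None, "Q", None]})
# both looked-up passengers boarded at Southampton, so fill with "S"
df["Embarked"] = df["Embarked"].fillna("S")
print(df["Embarked"].tolist())  # ['S', 'C', 'S', 'Q', 'S']
```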
<p>Once the data was ready to be worked with, I decided it was time to look at some patterns in the data.</p>
<h3 id="visualization-and-analysis">Visualization and analysis</h3>
<p><img src="https://adrianll.github.io//assets/images/project5/gender.png" alt="Gender Chart" /></p>
<p>This was perhaps the most important graph and metric seen initially in the dataset. There was a huge disparity between the deaths of men and women during the titanic disaster. It seems the women and children were given priority for rescue over the men.</p>
<p><img src="https://adrianll.github.io//assets/images/project5/class.png" alt="Class Chart" />
From the above graph, aside from gender it seems there were far more deaths in the lowest class of passengers. This is somewhat troubling, given that the bulk of the deaths happened within that class, as opposed to the roughly even survival rates elsewhere.</p>
<p><img src="https://adrianll.github.io//assets/images/project5/port.png" alt="Port Chart" />
This third bar at first seemed to indicate more death rates from port C but after a second look it may be the case that there were just more people embarking from port C. This is most likely the case since there are more deaths and survivals from port C.</p>
<p><img src="https://adrianll.github.io//assets/images/project5/Age.png" alt="Age Chart" />
Aside from gender, class, or port, one other important metric seemed to be age. In the above graph the survivors are in blue and show a proportionally much higher survival rate in the younger ages, around 0-13. The death rates spike a bit after this but improve toward some of the older ages. The median-age region of this graph should not be taken too seriously, since its frequency is inflated by the data cleaning, which assigned the median age to over 100 individuals. Overall, it seemed that the younger you were, the better your chances of being rescued.</p>
<p>In general, the most important metric in determining the survival of the passengers was gender, and the odds were heavily against the ship’s male population.</p>
<h3 id="creating-prediction-models">Creating Prediction Models</h3>
<p>In order to create the regression models as well as some other test models, it was important to get the data setup correctly.</p>
<p>Since one of the models was KNN, it was important to preprocess and scale the data. The categorical values for gender and port location also needed to be converted to numbers.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Turn all male/female notations into 1 or 0</span>
<span class="n">df</span><span class="p">[</span><span class="s">'Sex'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Sex'</span><span class="p">]</span><span class="o">.</span><span class="nb">map</span><span class="p">({</span><span class="s">'male'</span><span class="p">:</span><span class="mi">1</span><span class="p">,</span><span class="s">'female'</span><span class="p">:</span><span class="mi">0</span><span class="p">})</span>
<span class="c"># Turn all the ports into categoricals as 0,1,2</span>
<span class="n">df</span><span class="p">[</span><span class="s">'Embarked'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Embarked'</span><span class="p">]</span><span class="o">.</span><span class="nb">map</span><span class="p">({</span><span class="s">'S'</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="s">'Q'</span><span class="p">:</span><span class="mi">1</span><span class="p">,</span> <span class="s">'C'</span><span class="p">:</span><span class="mi">2</span><span class="p">})</span></code></pre></figure>
<p><strong>Set up our parameters and scale the data</strong></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Survived'</span><span class="p">]</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'PassengerId'</span><span class="p">,</span><span class="s">'Name'</span><span class="p">,</span> <span class="s">'Ticket'</span><span class="p">,</span><span class="s">'Survived'</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">X_scaled</span> <span class="o">=</span> <span class="n">MinMaxScaler</span><span class="p">()</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span></code></pre></figure>
<p>Once the data was set up, I created a train test split for the dataset and began searching for the best parameters using a grid search.</p>
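A sketch of the split-and-search step on synthetic stand-in data; the actual parameter grid used in the project isn't given in the post, so the C values below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the scaled titanic features and survival target
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + 0.2 * rng.randn(200) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# grid search over the regularization strength of a logistic regression
grid = GridSearchCV(LogisticRegression(solver="liblinear"),
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 2))
```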
<p>For Logistic Regression, the best score attained was: 80% accuracy</p>
<p>Below are the ROC curve and confusion matrix, denoting the true and false predictions. The ROC curve gives a good view of the predictions at different thresholds, against a 50% baseline.</p>
<table>
<tr>
<th> </th>
<th>Positive</th>
<th>Negative</th>
</tr>
<tr>
<td>Positive</td>
<td>80</td>
<td>21</td>
</tr>
<tr>
<td>Negative</td>
<td>31</td>
<td>136</td>
</tr>
</table>
<p><img src="https://adrianll.github.io//assets/images/project5/roc reg.png" alt="ROC Reg" /></p>
<p>Other models were also tested with very similar results; below is KNN:</p>
<table>
<tr>
<th> </th>
<th>Positive</th>
<th>Negative</th>
</tr>
<tr>
<td>Positive</td>
<td>69</td>
<td>11</td>
</tr>
<tr>
<td>Negative</td>
<td>42</td>
<td>146</td>
</tr>
</table>
<p><img src="https://adrianll.github.io//assets/images/project5/roc_knn.png" alt="ROC KNN" /></p>
<p>Decision Tree Classifier:</p>
<table>
<tr>
<th> </th>
<th>Positive</th>
<th>Negative</th>
</tr>
<tr>
<td>Positive</td>
<td>68</td>
<td>23</td>
</tr>
<tr>
<td>Negative</td>
<td>43</td>
<td>134</td>
</tr>
</table>
<p><img src="https://adrianll.github.io//assets/images/project5/roc dtc.png" alt="ROC DTC" /></p>
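Computing accuracy from the three confusion matrices above makes the comparison concrete (each tuple is the four table cells in reading order, so the correct predictions are the first and last entries):

```python
# cell counts copied from the three tables above
models = {
    "logistic regression": (80, 21, 31, 136),
    "knn":                 (69, 11, 42, 146),
    "decision tree":       (68, 23, 43, 134),
}
for name, (a, b, c, d) in models.items():
    accuracy = (a + d) / (a + b + c + d)
    print(name, round(accuracy, 3))
# logistic regression 0.806
# knn 0.802
# decision tree 0.754
```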
<h3 id="conclusion-and-final-thoughts">Conclusion and Final thoughts</h3>
<p>In regards to the modeling predictions, the best performing model was the logistic regression, with KNN close behind and the decision tree classifier trailing. The accuracy score isn’t all that great, but I believe it could have been improved with better features as well as some more extensive feature extraction.</p>
<p>Some improvements could have been using some of the titles found in the passenger names, which might be indicative of some other hidden metric. Additionally the cabin data as mentioned earlier could have proven to be useful to improve the model.</p>The Problem: This project consists of accessing a remote database for the titanic disaster dataset. Acquiring the data for the titanic disaster dataset and using that information to predict survival rates using the created regression model.Project Four Data Scraping Project2017-03-01T12:00:00+00:002017-03-01T12:00:00+00:00https://adrianll.github.io//Project-4<h3 id="the-problem">The Problem</h3>
<p>The objective of this project is to create a regression model using binary indicators to help predict whether a salary is high or low. Only basic binary indicators are given initially, and many of them need to be constructed from the acquired data. To acquire the data, Glassdoor was scraped, and the results were cleaned and loaded into Pandas DataFrames for analysis.</p>
<h3 id="risks">Risks:</h3>
<ul>
<li>
<p>The dataset is mostly concentrated around a small number of cities which I found to have a large number of data science positions; many positions outside of these states were not taken into account due to the time constraints of the project.</p>
</li>
<li>
<p>There is missing data for certain locations, since the initial scrape would not let me go past ~30 pages per location. This means each “state” isn’t actually the entire state, but rather the first 30 or so pages of results for that particular state.</p>
</li>
<li>
<p>The salary estimates themselves are limited, since I am basing my predictions on the salaries already estimated by Glassdoor’s own algorithm. Since exact salaries aren’t provided, I took the midpoint between the min and max of the salary range and used that as my salary indicator.</p>
</li>
<li>
<p>The feature selection was done somewhat arbitrarily, by picking the most common words that sounded impactful; this may not be the best approach.</p>
</li>
</ul>
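The midpoint used as the salary indicator can be sketched like this; the exact Glassdoor range format is an assumption for illustration:

```python
import re

def salary_midpoint(text):
    # midpoint of a salary-range string such as "$80k-$120k"
    # (figures assumed to be in thousands)
    nums = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    return (min(nums) + max(nums)) / 2 * 1000

print(salary_midpoint("$80k-$120k"))  # 100000.0
```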
<p><em>Assumptions:</em></p>
<ul>
<li>
<p>Given the enormous dataset and time constraints, some job positions not related to data science may have slipped in. Although I tried filtering these out in the initial scrape, it was difficult to do so completely, so it will be assumed that all scraped data is data science related.</p>
</li>
<li>
<p>Some data may overlap if the same company is hiring in multiple states. I did try to mitigate this, but some overlap may remain.</p>
</li>
</ul>
<h3 id="webscraping">Webscraping</h3>
<p>In order to web scrape Glassdoor, selenium and beautiful soup were both used. Selenium was needed since the website was an AJAX website and would return an error with a normal request. Beautiful soup was used to parse through the page and find the correct panels needed for scraping.</p>
<p>The first step in scraping was getting the needed search results, meaning data science salaries for a specific state. From those search results, an initial extraction pulled the following features:</p>
<ul>
<li>Company name</li>
<li>Location name</li>
<li>Salary</li>
<li>Post URL</li>
<li>Position</li>
</ul>
<p>In order to do this for all the states I wanted, it was easier to break each step into functions. Seven states were scraped separately, since Glassdoor does not show results beyond ~30 pages. The results were saved to CSV output and then combined into one data frame with all the page URLs and relevant information. This is where the time-consuming scrape started, since about 4K pages had to be opened and scraped for information. The process took about 30 minutes per 500 pages.</p>
<p>Below is the function used to do the initial scrape.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">scrape_panel</span><span class="p">(</span><span class="n">soup</span><span class="p">):</span>
<span class="n">leftcol</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'ul'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"jlGrid hover"</span><span class="p">)</span>
<span class="n">comp_name</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">loc_name</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">salary</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">job_urls</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">position</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">leftcol</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"div"</span><span class="p">,</span><span class="n">class_</span><span class="o">=</span><span class="s">"flexbox empLoc"</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">comp_name</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">n</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">comp_name</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">loc_name</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">n</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">)[</span><span class="mi">2</span><span class="p">])</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">loc_name</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">leftcol</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'li'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">'jl'</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">salary</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'span'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">'green small'</span><span class="p">)</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">))</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">salary</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">leftcol</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">job_urls</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">'https://www.glassdoor.com'</span><span class="o">+</span><span class="n">l</span><span class="p">[</span><span class="s">'href'</span><span class="p">]</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">))</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">job_urls</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">position</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">))</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">position</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="n">job_urls</span> <span class="o">=</span> <span class="n">job_urls</span><span class="p">[</span><span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span>
<span class="n">position</span> <span class="o">=</span> <span class="n">position</span><span class="p">[</span><span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'company_name'</span><span class="p">:</span> <span class="n">comp_name</span><span class="p">,</span>\
<span class="s">'location'</span><span class="p">:</span> <span class="n">loc_name</span><span class="p">,</span>\
<span class="s">'salary'</span><span class="p">:</span><span class="n">salary</span><span class="p">,</span>\
<span class="s">'position'</span><span class="p">:</span><span class="n">position</span><span class="p">,</span>\
<span class="s">'urls'</span><span class="p">:</span> <span class="n">job_urls</span><span class="p">})</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">return</span> <span class="n">data</span></code></pre></figure>
<h3 id="data-cleaning">Data Cleaning</h3>
<p>Once the scraping portion was completed, the data had to be cleaned. Here are some of the main things I was looking for when cleaning:</p>
<ul>
<li>Repeated job postings</li>
<li>Empty cells / null values</li>
<li>Must have salary information provided</li>
</ul>
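The three cleaning criteria above can be sketched with pandas. This is a minimal illustration on a toy frame; the column names follow the scraper's DataFrame, but the rows (and the duplicate/missing-salary cases) are made up:

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the scraped columns; the duplicate posting and the
# missing salary are illustrative
data = pd.DataFrame({
    'company_name': ['Acme', 'Acme', 'Beta'],
    'location': ['New York, NY', 'New York, NY', 'San Jose, CA'],
    'position': ['Data Scientist', 'Data Scientist', 'Analyst'],
    'salary': [100000, 100000, np.nan],
    'urls': ['u1', 'u1', 'u2'],
})

# 1) Repeated job postings: same company, position, and location
data = data.drop_duplicates(subset=['company_name', 'position', 'location'])
# 2/3) Empty cells: postings must have salary information provided
data = data.dropna(subset=['salary']).reset_index(drop=True)
```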
<p>Upon finishing the initial cleaning, here are the salary distributions that were found:</p>
<p align="center">
<iframe width="800" height="550" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1rZoq2xoI2YRi5Y-A-MI8pxZHRbaKvQTfx_Kvwh1rLws/pubchart?oid=943616562&format=interactive"></iframe>
</p>
<p>From the histogram, the salary distribution is roughly normal, with the main concentration of data around the 100K region.</p>
<p>Median = 100,000
Mean = 102,961.48</p>
<p>It is also important to note how the data is distributed across states. That is, which states are most of the postings coming from after the cleaning?</p>
<p align="center">
<iframe width="600" height="171.5" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1rZoq2xoI2YRi5Y-A-MI8pxZHRbaKvQTfx_Kvwh1rLws/pubchart?oid=20439445&format=interactive"></iframe>
</p>
<p>Although I started with around 1000 postings per state, many were lost during the cleaning. As seen above, the majority of the remaining listings are from California and New York.</p>
<h3 id="modeling-and-feature-extraction">Modeling and Feature Extraction</h3>
<p>The median salary was chosen as the classification cutoff for the model, since it splits the postings into the higher- and lower-paying halves.</p>
<p>Features were extracted by taking value counts of all the words in the titles and job descriptions. I then looked at the top 20 and hand-picked the most relevant ones. Each description was checked for these words and given a True or False marker depending on whether the word was found. Once these values were converted into dummy variables, the data was split into features and target as X and y. The target column was produced by comparing each salary against the median cutoff.</p>
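A minimal sketch of this keyword-flagging and median-cutoff step. The keyword list and the two postings below are made up for illustration; the real list came from the top-20 value counts:

```python
import pandas as pd

# Hypothetical hand-picked keywords (illustrative, not the project's list)
keywords = ['python', 'machine learning', 'phd']

descriptions = pd.Series([
    'Seeking data scientist with Python and machine learning experience',
    'Entry level analyst role, Excel required',
])
salaries = pd.Series([120000, 80000])

# One True/False column per keyword: found in the description or not
features = pd.DataFrame({
    kw: descriptions.str.lower().str.contains(kw) for kw in keywords
})

# Binary target: 1 if the salary is above the median cutoff, else 0
target = (salaries > salaries.median()).astype(int)
```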
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">reg_classf</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">'target'</span><span class="p">]</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.30</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">logreg</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">solver</span><span class="o">=</span><span class="s">'liblinear'</span><span class="p">)</span>
<span class="n">C_vals</span> <span class="o">=</span> <span class="p">[</span> <span class="o">.</span><span class="mi">1</span><span class="p">,</span><span class="o">.</span><span class="mi">2</span><span class="p">,</span><span class="o">.</span><span class="mi">3</span><span class="p">,</span><span class="o">.</span><span class="mi">7</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">20</span><span class="p">,</span><span class="mi">30</span><span class="p">,</span><span class="mi">40</span><span class="p">]</span>
<span class="n">penalties</span> <span class="o">=</span> <span class="p">[</span><span class="s">'l1'</span><span class="p">,</span><span class="s">'l2'</span><span class="p">]</span>
<span class="n">gs</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">logreg</span><span class="p">,</span> <span class="p">{</span><span class="s">'penalty'</span><span class="p">:</span> <span class="n">penalties</span><span class="p">,</span> <span class="s">'C'</span><span class="p">:</span> <span class="n">C_vals</span><span class="p">},</span>\
<span class="n">verbose</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">gs</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span></code></pre></figure>
<p>This data was able to provide an accuracy of: <strong>65%</strong></p>
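The accuracy figure corresponds to the grid search's best cross-validated score. A hedged sketch of reading it out, using a synthetic stand-in for the real keyword-feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the keyword-feature matrix and target
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

gs = GridSearchCV(LogisticRegression(solver='liblinear'),
                  {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]}, cv=5)
gs.fit(X_train, y_train)

# Best cross-validated accuracy, and held-out accuracy of the refit model
best_cv = gs.best_score_
test_acc = accuracy_score(y_test, gs.predict(X_test))
```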
<p>Below is a confusion matrix showing the actual (top) vs. predicted (left) classes.</p>
<table>
<tr>
<th> </th>
<th>Actual Positive</th>
<th>Actual Negative</th>
</tr>
<tr>
<td>Predicted Positive</td>
<td>412</td>
<td>139</td>
</tr>
<tr>
<td>Predicted Negative</td>
<td>247</td>
<td>353</td>
</tr>
</table>
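As a sanity check, the counts in this table can be fed back through scikit-learn to recover the matrix and the overall accuracy. Note that sklearn's convention is rows = actual, columns = predicted, the transpose of the layout above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Reconstruct label vectors from the table's counts (1 = above-median salary)
y_true = [1] * (412 + 247) + [0] * (139 + 353)
y_pred = [1] * 412 + [0] * 247 + [1] * 139 + [0] * 353

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
acc = accuracy_score(y_true, y_pred)
```

The accuracy works out to about 66%, consistent with the reported score.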
<p>I have also included the ROC curve, which shows fairly poor results for this specific model.
<img src="https://adrianll.github.io//assets/images/project4/roc.png" alt="ROC" /></p>
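An ROC curve like this is built from the positive-class probabilities. A small sketch with made-up probabilities; in the project these would come from the fitted grid search, e.g. `gs.predict_proba(X_test)[:, 1]`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative labels and predicted probabilities for the positive class
y_test = np.array([1, 1, 0, 1, 0, 0, 1, 0])
probs = np.array([0.9, 0.6, 0.55, 0.4, 0.3, 0.65, 0.7, 0.2])

# False-positive rate, true-positive rate, and the thresholds that produce them
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) gives the curve; the AUC summarizes it in one number.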
<h3 id="conclusion-and-final-thoughts">Conclusion and Final thoughts</h3>
<p>The model had a best score of about 65%, which is quite poor but acceptable for a first pass at this project. It could be improved by adding more data, cleaning it more thoroughly, and improving feature selection.</p>
<p>Finding patterns between the job descriptions and the salaries is probably the most crucial part of this project, and that portion suffered from the incomplete data cleaning and the slow scrape.</p>The ProblemProject Three House Prices2017-02-19T12:00:00+00:002017-02-19T12:00:00+00:00https://adrianll.github.io//Project-3-House-Prices<h1 id="predicting-house-prices-using-linear-regression">Predicting House Prices Using Linear Regression</h1>
<h2 id="main-problem-and-objectives">Main Problem and Objectives</h2>
<ul>
<li>Build a prediction model for house pricing in Ames, IA</li>
<li>Where are most sales taking place?</li>
<li>Where are the most expensive houses located?</li>
<li>Discuss Possible Improvements</li>
</ul>
<h2 id="describing-the-data-and-limitations">Describing the Data and Limitations</h2>
<ul>
<li>Target Prediction Feature: Sale Price</li>
<li>Number of Instances: 1460</li>
<li>Number of Attributes Allowed: 18</li>
<li>Years of data collected: 2006 - 2010</li>
<li>Missing</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The attributes provided are not necessarily the best indicators of the house pricing</li>
<li>The data was collected mostly during a particularly unstable period in the housing market (2006&ndash;2010)</li>
</ul>
<h2 id="understanding-the-data">Understanding the Data</h2>
<p><strong>Looking at Correlations</strong>
<img src="https://adrianll.github.io//assets/images/project3/CorrelationMap.png" alt="CorrelationMap" />
<strong>Quality & Price Correlation</strong>
<img src="https://adrianll.github.io//assets/images/project3/SalevQual.png" alt="SaleVsQual" /></p>
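A correlation matrix like the one behind the heatmap can be computed directly with pandas. This sketch uses a few of the Ames column names with illustrative values:

```python
import pandas as pd

# Toy stand-in for the Ames data; values are illustrative
df = pd.DataFrame({
    'SalePrice':   [208500, 181500, 223500, 140000, 250000],
    'OverallQual': [7, 6, 7, 7, 8],
    'GrLivArea':   [1710, 1262, 1786, 1717, 2198],
})

# Pairwise correlations; a heatmap visualizes a matrix like this
corr = df.corr()
# Features ranked by correlation with the target
top = corr['SalePrice'].sort_values(ascending=False)
```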
<p><strong>Looking at Sale Prices Across Neighborhoods</strong>
<img src="https://adrianll.github.io//assets/images/project3/SalePriceBox1.png" alt="Sale Price Box Plot" /></p>
<h2 id="where-are-the-most-sales-happening">Where are the most sales happening?</h2>
<ul>
<li>Most Sales Happening in:
<ul>
<li>North Ames</li>
</ul>
</li>
<li>How many happened?
<ul>
<li>225 (about 15.41% of total sales)</li>
</ul>
</li>
</ul>
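The per-neighborhood counts behind these numbers come from a simple frequency count. A toy sketch; with the full 1460 sales, North Ames gives 225/1460 &asymp; 15.41%:

```python
import pandas as pd

# Toy neighborhood column; the real data has 1460 sales
neigh = pd.Series(['NAmes', 'NAmes', 'CollgCr', 'NAmes', 'OldTown'])

counts = neigh.value_counts()                 # sales per neighborhood
share = neigh.value_counts(normalize=True)    # fraction of total sales
```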
<p align="center">
<iframe width="600" height="371" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1_7Afgm5NYnjFvN1vYmyvqtW30sBuUa9kT7x0gFPYSMs/pubchart?oid=1718000841&format=interactive"></iframe>
</p>
<h2 id="where-are-the-most-expensive-homes">Where are the most expensive homes?</h2>
<p align="center">
<iframe width="600" height="371" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1_7Afgm5NYnjFvN1vYmyvqtW30sBuUa9kT7x0gFPYSMs/pubchart?oid=1071517443&format=interactive"></iframe>
</p>
<h2 id="creating-a-regression-model">Creating a Regression model</h2>
<ul>
<li>Type of Regression: Linear</li>
<li>Attributes Dropped: Utilities</li>
<li>Dummy Variable Selection: All except Lot Area and GrLivArea
<ul>
<li>Accuracy Testing:</li>
<li>R Squared = 0.899</li>
<li>Mean Absolute Error: 16261.24</li>
<li>Mean Squared Error: 637850318.41</li>
<li>Root Mean Squared Error: 25255.70</li>
<li>Cross Validation Score: 0.738</li>
</ul>
</li>
<li>Limitations of the Model</li>
</ul>
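A minimal sketch of fitting a linear regression and computing the metrics listed above (R squared, MAE, RMSE, cross-validation score). It uses synthetic stand-in data rather than the dummy-encoded Ames features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and sale prices
rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([100000.0, 50000.0, 25000.0]) + rng.randn(100) * 10000

model = LinearRegression().fit(X, y)
pred = model.predict(X)

r2 = model.score(X, y)                              # in-sample R squared
mae = mean_absolute_error(y, pred)                  # mean absolute error
rmse = np.sqrt(mean_squared_error(y, pred))         # root mean squared error
cv = cross_val_score(model, X, y, cv=5).mean()      # cross-validation score
```

Comparing `r2` against `cv`, as in the report above, is one quick way to spot overfitting: a large gap means the model fits the training data better than held-out folds.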
<p>There were many outliers in the data, which caused the RMSE and MSE to be quite high.
The cross-validation score was lower than that of the initial model; although R squared improved, this gap could be a sign of overfitting.</p>
<p><img src="https://adrianll.github.io//assets/images/project3/regression.png" alt="Regression" /></p>
<h2 id="possible-improvements">Possible Improvements</h2>
<ul>
<li>
<p>More location-based metrics such as surrounding businesses, schools, police stations, etc.</p>
</li>
<li>
<p>More insight into the overall condition and quality metric</p>
</li>
<li>
<p>More data points for expensive homes, to improve predictions at the high end of the market.</p>
</li>
</ul>Predicting House Prices Using Linear regressionProject Two Billboard2017-02-07T12:00:00+00:002017-02-07T12:00:00+00:00https://adrianll.github.io//Project-2-Billboard<h1 id="top-100-billboard-singles-of-the-year-2000">Top 100 Billboard Singles of the Year 2000</h1>
<h2 id="introduction">Introduction</h2>
<p>The objective of this analysis was to utilize billboard data for the top 100 songs of the year 2000. The data set was not clean, so a fair amount of data cleaning was needed before doing anything with the numbers.</p>
<h2 id="making-sense-of-the-data">Making sense of the data</h2>
<p>My first step into this project was to look into the data and get as much information as possible from the given CSV. This would help me find errors, next steps, and overall give me a good plan of action to see what I would be able to extract from the data later on.</p>
<p><strong>What am I looking at?</strong></p>
<p>Billboard data for the top 100 charts for the year 2000, showing the song peak cycle from the time the song entered the top 100 to the time it left. The peak position and date is also given along with the song artist, name, genre, and length.</p>
<p><strong>How big is the data?</strong></p>
<p><em>Rows:</em>
317, containing songs that reached the Top 100 in the year 2000</p>
<p><em>Columns:</em>
83, containing song attributes</p>
<p>There is no missing information from this dataset.</p>
<p><strong>What do these headers mean?</strong></p>
<ul>
<li><em>year</em> - Year is 2000 for all songs, denoting they peaked in the top 100 during this year</li>
<li><em>artist.inverted</em> - artist or band name, artist full name will be inverted</li>
<li><em>track</em> - Track title</li>
<li><em>time</em> - Track length, is later on converted to seconds</li>
<li><em>genre</em> - Track genre from 12 given genres</li>
<li><em>date.entered</em> - Date the track entered the top 100</li>
<li><em>date.peaked</em> - Peak date of the track (highest position on the top 100)</li>
<li><em>x1st.week - x76th.week</em> - position at given week number for the given track</li>
</ul>
<p>columns added later on:</p>
<ul>
<li><em>weeks to peak</em> - how long it took for the song to peak in the charts</li>
<li><em>weeks on chart</em> - how long the song remained in the top 100 charts</li>
<li><em>worst position</em> - worst chart ranking during its top 100 position</li>
<li><em>best position</em> - best chart ranking during its top 100 position</li>
<li><em>enter rank</em> - the rank the song entered the top 100 charts with</li>
<li><em>exit rank</em> - the rank the song exited the top 100 charts with</li>
</ul>
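The date-derived columns such as <em>weeks to peak</em> require parsing the string dates first. A sketch with illustrative dates; the real columns are <em>date entered</em> and <em>date peaked</em>:

```python
import pandas as pd

# Illustrative dates; in the project these come from the billboard CSV
bb = pd.DataFrame({
    'date entered': ['2000-02-12', '2000-09-02'],
    'date peaked':  ['2000-04-08', '2000-11-18'],
})

# Convert the strings to real date values, then derive "weeks to peak"
entered = pd.to_datetime(bb['date entered'])
peaked = pd.to_datetime(bb['date peaked'])
bb['weeks to peak'] = (peaked - entered).dt.days // 7
```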
<p><strong>Are there any problems with the data?</strong></p>
<p>Overall, the issues with the data were mostly formatting related, such as the header naming. I needed to rename the headers to make them more understandable.</p>
<p>Time and date were also going to be an issue: the song length was not set up properly in MM:SS format, and the date-entered and date-peaked columns were strings instead of date values. This would make any calculations on these dates difficult down the line.</p>
<p>Some big problems I found were also around the genre and the classification of data around it. The given classifications are very confusing since they don’t seem to match the songs. There also seem to be some input errors with the genre R&B that would need some cleaning.</p>
<p>Finally, there are some data type conversions needed to handle the numbers properly, as well as invalid entries (*) to deal with.</p>
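One way to handle the invalid (*) entries is to let the numeric conversion coerce them to NaN. A small sketch on one toy column; in practice this would be applied to each weekly ranking column:

```python
import pandas as pd

# Weekly rankings with an invalid '*' entry; coercion turns any
# non-number into NaN while converting the rest to floats
week = pd.Series(['78', '63', '*', '49'])
week = pd.to_numeric(week, errors='coerce')
```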
<p><strong>What risks am I taking with this data?</strong></p>
<p>Given the issues outlined above, I see some of the biggest risks coming from the genre section of the data. There appears to be widespread misclassification of the music, especially within rock n roll: songs from every other sub-genre are thrown into it. Given these errors, it leads one to wonder what other issues may exist in the overall genre classification or how it was derived.</p>
<p>For the purposes of this analysis, it will be assumed that the genre classifications have been done correctly. Aside from combining the two R&B genres, the rest will be left as is. This would also prevent personal bias against a genre that could possibly affect the end classification.</p>
<p>It is also assumed that the song attributes given have been input correctly.</p>
<p><strong>What am I trying to accomplish here?</strong></p>
<ul>
<li>
<p>Problem: Does music popularity inherently depend on public attraction, or are there other forces such as marketing, distribution, and contracts that affect the popularity of our songs?</p>
</li>
<li>
<p>Hypothesis: Track duration is a big decider in track popularity and will have specific parameters that will increase the probability of a song making it to the top 100.</p>
</li>
</ul>
<h2 id="data-cleaning">Data Cleaning</h2>
<p>In the data cleaning process, it was my goal to convert all the chart information into proper data types that could be used as well as making the information as clear as possible. Here are some of the data cleaning processes I used in order to get my final working data frame.</p>
<p><em>Cleaning Up the Headers</em></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#cleaned columns by removing the '.'</span>
<span class="c">#also removed the 'x' at the beginning of the week columns</span>
<span class="n">bb</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'.'</span><span class="p">,</span><span class="s">' '</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">bb</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span>
<span class="n">bb</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'x'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">bb</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span>
<span class="c">#renamed artist and duration columns to clarify their content</span>
<span class="n">bb</span> <span class="o">=</span> <span class="n">bb</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"artist inverted"</span><span class="p">:</span> <span class="s">"artist"</span><span class="p">})</span>
<span class="n">bb</span> <span class="o">=</span> <span class="n">bb</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"time"</span><span class="p">:</span> <span class="s">"duration"</span><span class="p">})</span></code></pre></figure>
<p><em>Converted the track duration column</em></p>
<p>This was the time format provided by the data:</p>
<p><strong>mm,ss,ms AM</strong></p>
<p>In order to use this time information, I thought it best to convert it to seconds. This makes visualization a lot easier, and conversion to other data types simple should the need arise.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Create a function that takes in a string and outputs seconds</span>
<span class="k">def</span> <span class="nf">get_seconds</span><span class="p">(</span><span class="n">string</span><span class="p">):</span>
<span class="n">sp</span> <span class="o">=</span> <span class="n">string</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">','</span><span class="p">)</span>
<span class="n">seconds</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">sp</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="o">*</span><span class="mi">60</span> <span class="o">+</span> <span class="nb">int</span><span class="p">(</span><span class="n">sp</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">return</span> <span class="n">seconds</span>
<span class="n">bb</span><span class="p">[</span><span class="s">'duration'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bb</span><span class="p">[</span><span class="s">'duration'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">get_seconds</span><span class="p">)</span></code></pre></figure>
<p>These are just some of the cleaning steps done on this data. All the changes made are documented in <a href="https://github.com/AdrianLl/AdrianLl.github.io/blob/master/projects/billboard/Project%202%20Billboard%20Hits%20%2B%20Data%20Munging.ipynb">my jupyter notebook</a>.</p>
<h1 id="generating-new-data-from-the-clean-data">Generating New Data From the Clean Data</h1>
<p>One important piece of information that is given but is not quickly visible or accessible is the song ranking upon entering the top 100 list and the exit ranking.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#generated a list of the enter and exit positions for all the songs</span>
<span class="n">exit_loc</span> <span class="o">=</span> <span class="n">week_data</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="o">.</span><span class="n">last_valid_index</span> <span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">enter_loc</span> <span class="o">=</span> <span class="n">week_data</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="o">.</span><span class="n">first_valid_index</span> <span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="nb">exit</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">enter</span><span class="o">=</span> <span class="p">[]</span>
<span class="c">#converted the lists added them to billboard data</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">bb</span><span class="p">)):</span>
<span class="nb">exit</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">bb</span><span class="p">[</span><span class="n">exit_loc</span><span class="p">[</span><span class="n">i</span><span class="p">]][</span><span class="n">i</span><span class="p">]))</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">bb</span><span class="p">)):</span>
<span class="n">enter</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">bb</span><span class="p">[</span><span class="n">enter_loc</span><span class="p">[</span><span class="n">j</span><span class="p">]][</span><span class="n">j</span><span class="p">]))</span></code></pre></figure>
<p>Another set of columns I added were the worst and best positions of a given song during their time in the top 100.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#get max and min values for the rankings for the columns with "weeks"</span>
<span class="n">bb</span><span class="p">[</span><span class="s">'worst position'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bb</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">6</span><span class="p">:</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">bb</span><span class="p">[</span><span class="s">'best position'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bb</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">6</span><span class="p">:</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="nb">min</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="c">#converted the positions to int, since rankings are not floats</span>
<span class="n">bb</span><span class="p">[</span><span class="s">'worst position'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bb</span><span class="p">[</span><span class="s">'worst position'</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">bb</span><span class="p">[</span><span class="s">'best position'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bb</span><span class="p">[</span><span class="s">'best position'</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span></code></pre></figure>
<h2 id="visualizing-the-data">Visualizing the data</h2>
<p><img src="https://adrianll.github.io//assets/images/project2/Weeks vs Exit Rank.png" alt="Weeks vs Exit Ranking" />
<img src="https://adrianll.github.io//assets/images/project2/Hot 100 Weeks on Billboard.png" alt="Chart Duration Histogram" /></p>
<p>Before seeing this histogram, I expected the frequency to be much flatter, with a small decline as the weeks went on. However, there is a very specific number of weeks that songs stayed on the chart, around 20, where the count spikes sharply. (One possible explanation is Billboard's recurrent rule, which removes descending songs from the Hot 100 after 20 weeks once they fall below a cutoff position.)
I did some research here:</p>
<p>http://www.billboard.com/articles/columns/ask-billboard/5740625/ask-billboard-how-does-the-hot-100-work</p>
<blockquote>
<p>Generally speaking, our Hot 100 formula targets a ratio of sales (35-45%), airplay (30-40%) and streaming (20-30%). (Year 2013)</p>
</blockquote>
<p>This is how the metrics were calculated around 2013, but in the year 2000 streaming was a non-existent metric. Assuming sales and airplay carried roughly equal weight in the chart formula, purchase patterns or airplay contracts could be factors in a song's position, perhaps more so than actual popularity.</p>
<p><img src="https://adrianll.github.io//assets/images/project2/Hot 100 Track Durations.png" alt="Histogram of Track Durations" />
<img src="https://adrianll.github.io//assets/images/project2/Track Duration v Weeks.png" alt="Track Duration vs Weeks" /></p>
<p>I also wanted to look into the track duration metric to see if there are any patterns on the billboard rankings. As seen in the initial histogram, there seems to be a strong concentration around the 200 to 300 second mark for song popularity. The songs around that range seem to do the best overall in terms of best position as well as duration on the top 100 charts.</p>
<h1 id="conclusions-and-findings">Conclusions and Findings</h1>
<p>Overall, my initial hypothesis does not seem to be completely correct. I expected some strong metric (track duration) to drive the popularity of songs in the top 100 chart. Initially this does seem to be the case, but I also found very steep drop-offs in popularity around the 20-week mark, as if songs simply stopped being popular at that specific point.
The heavy positive skew in track duration does suggest that shorter songs are more popular. However, the abrupt drop-offs in chart duration show that greater external factors are in play. It would be interesting to know how popularity was calculated on average for this year; that might provide better insight into the current data.
To improve insight into what makes a song popular, it might help to use actual user listening patterns rather than company-regulated charts. Calculating popularity from pure listening patterns could remove some external factors such as distribution and contracts. However, doing so might also bias the findings toward a specific type of online music listener.</p>
<p>My Jupyter notebook on this project can be found <a href="https://github.com/AdrianLl/AdrianLl.github.io/blob/master/projects/billboard/Project%202%20Billboard%20Hits%20%2B%20Data%20Munging.ipynb">here</a></p>Top 100 Billboard Singles of the Year 2000