<p>Adrian’s Data Science Blog (<a href="https://adrianll.github.io/">adrianll.github.io</a>): I am a programmer and IT professional turned Data Scientist out of fascination for data-driven work, problem-solving, and the high societal impact of big data.</p>
<h2>Project Six: APIs and Random Forests (2017-03-17)</h2>
<h3 id="the-problem">The Problem:</h3>
<p>There were three portions to this project:</p>
<ul>
<li>Collection</li>
<li>Cleaning</li>
<li>Analysis and Modeling</li>
</ul>
<p>The end goal was to be able to accurately predict what a high rated movie might be and the contributing factors towards a high movie rating.</p>
<h3 id="risks-and-assumptions">Risks and Assumptions:</h3>
<p>This project came down to the dataset: the reliability of the model ultimately depended on how much data was extracted as well as the reliability of the feature extraction.
Some of the features in this project were not extracted as meticulously as they could have been; many were simplified or dropped due to time constraints.
Given more time for data cleaning, more analysis could have been done on the null values and missing data such as the meta score. Critic score seems like it could have had a large impact on overall movie rating.</p>
<h3 id="scraping-and-dataset">Scraping and Dataset</h3>
<p>The first step in the project was to actually get the dataset for movies in the US.</p>
<p>To start, there was a top 250 rating page I planned to use, but I felt that 250 movies might not be enough for building a good model.</p>
<p>In order to pull a larger number of movies, I found this list of movies released in the U.S. from 1972-2016:</p>
<p><a href="http://www.imdb.com/list/ls057823854/">All U.S. Released Movies: 1972-2016</a></p>
<p>Since there were around 10,000 movies in this dataset, I thought it would not be a good idea to scrape each movie page individually.
The last time I scraped, the average scrape time was 2-3 seconds per page, and I ran into memory issues as well.</p>
<p>After checking the OMDb API, it seemed easier to look movies up by their IMDb ID rather than by title, since titles can vary.</p>
<p>This led me to scraping each movie title along with its IMDb ID and storing them in a dataframe:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd

def scrape_panel(soup):
    """Pull every movie title and IMDb ID out of the list page."""
    col = soup.find('div', class_='list compact')
    names = []
    imdbID = []
    for cell in col.find_all('td', class_='title'):
        # Title text; fail soft to None on malformed cells
        try:
            names.append(cell.get_text(strip=True))
        except AttributeError:
            names.append(None)
        # The first link's href looks like '/title/tt0068646/',
        # so the IMDb ID is the third path segment
        try:
            href = cell.find_all('a')[0]['href']
            imdbID.append(href.split('/')[2])
        except (AttributeError, IndexError, KeyError):
            imdbID.append(None)
    return pd.DataFrame({'name': names, 'id': imdbID})</code></pre></figure>
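<p>As a self-contained illustration of that extraction logic, the snippet below runs the same parsing against a small hand-written stand-in for the IMDb list markup (the real page was fetched over HTTP, and the two rows here are hypothetical examples):</p>

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for the real IMDb list page markup
html = """
<div class="list compact"><table>
  <tr><td class="title"><a href="/title/tt0068646/">The Godfather</a></td></tr>
  <tr><td class="title"><a href="/title/tt0073195/">Jaws</a></td></tr>
</table></div>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for cell in soup.find('div', class_='list compact').find_all('td', class_='title'):
    link = cell.find('a')
    # href has the form '/title/tt0068646/', so index 2 is the IMDb ID
    rows.append({'name': link.get_text(strip=True),
                 'id': link['href'].split('/')[2]})
data = pd.DataFrame(rows)
```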
<h3 id="api-calls-and-extraction">API Calls and Extraction</h3>
<p>To get all the movie information, the dataframe created from the web scrape was used along with the loop below, which builds one API URL per movie:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">api_calls</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">ids</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span>
<span class="n">api_calls</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">'http://www.omdbapi.com/?i='</span><span class="o">+</span><span class="n">data</span><span class="p">[</span><span class="s">'id'</span><span class="p">][</span><span class="n">ids</span><span class="p">]</span><span class="o">+</span><span class="s">'&plot=full'</span><span class="p">)</span></code></pre></figure>
<p>The API call itself was quite simple, since I wanted to extract all the information provided.</p>
<p>Surprisingly, making the calls was the easiest part; parsing the JSON output ended up being more difficult.</p>
<p>Each JSON response was converted to a dictionary and then into a series. The series were stacked side by side and then transposed to generate a dataframe with all the data.</p>
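<p>A minimal sketch of that stacking step, with two hand-written dictionaries standing in for real OMDb JSON responses (only a few of OMDb's fields are shown):</p>

```python
import pandas as pd

# Hypothetical responses standing in for json.loads() of each OMDb reply
responses = [
    {'Title': 'The Godfather', 'Year': '1972', 'imdbRating': '9.2'},
    {'Title': 'Jaws', 'Year': '1975', 'imdbRating': '8.1'},
]

# One Series per movie, stacked side by side, then transposed so each
# movie becomes a row of the final dataframe
movies = pd.concat([pd.Series(r) for r in responses], axis=1).T
```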
<h3 id="data-cleaning">Data Cleaning</h3>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="n">RangeIndex</span><span class="p">:</span> <span class="mi">9951</span> <span class="n">entries</span><span class="p">,</span> <span class="mi">0</span> <span class="n">to</span> <span class="mi">9950</span>
<span class="n">Data</span> <span class="n">columns</span> <span class="p">(</span><span class="n">total</span> <span class="mi">25</span> <span class="n">columns</span><span class="p">):</span>
<span class="n">Actors</span> <span class="mi">9760</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Awards</span> <span class="mi">7021</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Country</span> <span class="mi">9786</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Director</span> <span class="mi">9652</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Episode</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Error</span> <span class="mi">154</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Genre</span> <span class="mi">9773</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Language</span> <span class="mi">9761</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Metascore</span> <span class="mi">4575</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Plot</span> <span class="mi">9718</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Poster</span> <span class="mi">9612</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Rated</span> <span class="mi">8638</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Released</span> <span class="mi">9568</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Response</span> <span class="mi">9951</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">bool</span>
<span class="n">Runtime</span> <span class="mi">9625</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Season</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Title</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Type</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Writer</span> <span class="mi">9550</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Year</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">imdbID</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">imdbRating</span> <span class="mi">9662</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">imdbVotes</span> <span class="mi">9661</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">seriesID</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">totalSeasons</span> <span class="mi">71</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">dtypes</span><span class="p">:</span> <span class="nb">bool</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">float64</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span> <span class="nb">object</span><span class="p">(</span><span class="mi">19</span><span class="p">)</span>
<span class="n">memory</span> <span class="n">usage</span><span class="p">:</span> <span class="mf">1.8</span><span class="o">+</span> <span class="n">MB</span></code></pre></figure>
<p><strong>Data Removal Process:</strong></p>
<p>TV Shows - I wanted to run this model on movies exclusively since tv shows may have different metrics that made them good or bad.</p>
<p>Errors Column - This column only had error messages for some API calls that did not go through properly.</p>
<p>Poster Image - Given that no features were going to be generated from the image, the image was dropped.</p>
<p>Only keep movies with ratings - Imputing ratings for unrated movies would be difficult to do well given how varied the set of movies is, so those rows were dropped.</p>
<p>Remove Meta Score Column - Although this seemed like a good metric, there were too many missing values for it to be a reliable predictor of overall rating.</p>
<p>Awards Column - This column was particularly complex since it bundled Oscars, other awards, and nominations into one string. I decided to sum all of the counts into a single award total; rows with null values got 0.</p>
<p>Numerical Values - All columns holding numerical values were converted from strings to numbers.</p>
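<p>A plausible sketch of those last two steps. The notebook's exact parsing code isn't shown, so the helpers below are illustrative:</p>

```python
import re

def award_total(text):
    """Sum every count in strings like 'Won 3 Oscars. Another 23 wins and 30 nominations.'"""
    if not isinstance(text, str):
        return 0  # null/NaN rows get 0
    return sum(int(n) for n in re.findall(r'\d+', text))

def to_int(text):
    """Strip non-digits: '1,234' -> 1234, '120 min' -> 120."""
    digits = re.sub(r'\D', '', str(text))
    return int(digits) if digits else 0
```

These would then be applied column-wise, e.g. <code>df['Awards'].apply(award_total)</code>, and similarly with <code>to_int</code> for <code>Runtime</code> and <code>imdbVotes</code>.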
<h3 id="data-cleaning-1">Final Cleaned Dataset</h3>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="n">RangeIndex</span><span class="p">:</span> <span class="mi">8475</span> <span class="n">entries</span><span class="p">,</span> <span class="mi">0</span> <span class="n">to</span> <span class="mi">8474</span>
<span class="n">Data</span> <span class="n">columns</span> <span class="p">(</span><span class="n">total</span> <span class="mi">16</span> <span class="n">columns</span><span class="p">):</span>
<span class="n">Actors</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Awards</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">Country</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Director</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Genre</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Language</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Plot</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Rated</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Runtime</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">Title</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Writer</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Year</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int32</span>
<span class="n">imdbRating</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">imdbVotes</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int32</span>
<span class="n">MonthReleased</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">DayReleased</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">dtypes</span><span class="p">:</span> <span class="n">float64</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">int32</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="n">int64</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span> <span class="nb">object</span><span class="p">(</span><span class="mi">9</span><span class="p">)</span></code></pre></figure>
<h3 id="visuazlizations">Visualizations</h3>
<p><img src="https://adrianll.github.io//assets/images/project6/AwardsNominations.png" alt="Rating Histogram" /></p>
<p>Median: 6.4 Rating</p>
<p>The movie ratings mostly cluster around this median, with the distribution left-skewed toward the higher ratings.</p>
<p><img src="https://adrianll.github.io//assets/images/project6/yearhist.png" alt="Year Histogram" />
This histogram shows movies released by year and highlights the heavy concentration of releases after 2000. This may make the models work better for movies from that period.</p>
<h3 id="modeling-and-analysis">Modeling and Analysis</h3>
<p>All the columns with string values were turned into dummy variables, aside from writers and plots.
Writers - For one, around 13-14 thousand writer dummy columns would have been created, and the names were not very consistent, so they were left out. Given more time, the writer data could have been cleaned up and used.</p>
<p>Given that the ratings are distributed around the median, it seemed sensible to categorize movies as good or bad, similar to how YouTube categorizes videos (thumbs up and thumbs down). Given more time, this might become three categories: good, bad, and neutral. I used the median rating of 6.4 as the cutoff for the high/low label.</p>
<p>A binary target was generated from the ratings, indicating whether each movie fell above or below the median:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">high_rating</span><span class="p">(</span><span class="n">rating</span><span class="p">):</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">median</span><span class="p">(</span><span class="n">movies</span><span class="p">[</span><span class="s">'imdbRating'</span><span class="p">])</span>
<span class="k">if</span> <span class="n">rating</span><span class="o">>=</span> <span class="n">target</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">0</span></code></pre></figure>
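<p>Note that this function recomputes the median on every call. An equivalent vectorized version (the <code>high</code> column name and the sample ratings below are illustrative) computes it once:</p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real movies dataframe
movies = pd.DataFrame({'imdbRating': [5.0, 6.4, 8.3, 7.7, 4.9]})

target = np.median(movies['imdbRating'])  # 6.4 for this sample
movies['high'] = (movies['imdbRating'] >= target).astype(int)
```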
<p>Using a random forest classifier together with gradient boosting yielded the best results, with an accuracy score of about 73%.</p>
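<p>A minimal sketch of that comparison with scikit-learn. Synthetic data stands in for the dummy-encoded movie features, and the hyperparameters are illustrative rather than the notebook's actual settings:</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dummy-encoded movie features
X, y = make_classification(n_samples=2000, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for model in (RandomForestClassifier(n_estimators=100, random_state=42),
              GradientBoostingClassifier(random_state=42)):
    model.fit(X_train, y_train)
    results[type(model).__name__] = accuracy_score(y_test, model.predict(X_test))
```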
<table>
<tr>
<th> </th>
<th>Predicted High Rating</th>
<th>Predicted Low Rating</th>
</tr>
<tr>
<td>High Rating</td>
<td>867</td>
<td>393</td>
</tr>
<tr>
<td>Low Rating</td>
<td>289</td>
<td>994</td>
</tr>
</table>
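<p>As a sanity check, the reported accuracy can be recomputed directly from the confusion matrix above (rows are actual labels, columns are predictions):</p>

```python
# Counts from the confusion matrix above
true_high_pred_high, true_high_pred_low = 867, 393
true_low_pred_high, true_low_pred_low = 289, 994

total = true_high_pred_high + true_high_pred_low + true_low_pred_high + true_low_pred_low
correct = true_high_pred_high + true_low_pred_low
accuracy = correct / total  # 1861 / 2543, about 0.732, matching the ~73% above
```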
<h3 id="conclusions">Conclusions</h3>
<p>Overall, the random forest did not improve significantly with gradient boosting; however, the score itself was quite good for a first attempt.</p>
<p>The improvement was slight, but in these types of prediction models an increase of about a percentage point is quite significant.</p>
<p>I think more features could have been engineered from the descriptions and writers. The descriptions in particular could have yielded better results, since they may describe aspects of the movies that the genres do not capture.</p>
<p>However, adding those features might also cause overfitting, so further analysis would be needed to get the right number of descriptors out of the description.</p>
<p>Further analysis would involve pinning down the exact features needed to improve the model even more.</p>
<h2>DSI Movie Rating Predictor and Analysis (2017-03-17)</h2>
<h3 id="github-repo">Github Repo</h3>
<p><a href="https://github.com/AdrianLl/DSI-Movie-Rating-Predictor/blob/master/Part%20I%20-%20Scraping%20Movie%20ID.ipynb">Part I - Scraping Movie ID’s</a></p>
<p><a href="https://github.com/AdrianLl/DSI-Movie-Rating-Predictor/blob/master/Part%20II%20-%20API%20Data%20Extraction.ipynb">Part II - API Data Extraction</a></p>
<p><a href="https://github.com/AdrianLl/DSI-Movie-Rating-Predictor/blob/master/Part%20III%20-%20Data%20Cleaning%20%26%20Feature%20Engineering.ipynb">Part III - Data Cleaning & Feature Engineering</a></p>
<p><a href="https://github.com/AdrianLl/DSI-Movie-Rating-Predictor/blob/master/Part%20IV%20-%20Data%20Modeling%20%26%20Clustering.ipynb">Part IV - Data Modeling & Clustering</a></p>
<h3 id="the-problem">The Problem:</h3>
<p>There were three portions to this project:</p>
<ul>
<li>Collection</li>
<li>Cleaning</li>
<li>Analysis and Modeling</li>
</ul>
<p>The end goal was to be able to accurately predict what a high rated movie might be and the contributing factors towards a high movie rating.</p>
<h3 id="risks-and-assumptions">Risks and Assumptions:</h3>
<p>This project came down to the dataset and the reliability of the model in the end was highly dependent on how much data was extracted as well as the reliabiltiy of the feature extraction.
In terms of reliability, some of the features extracted in this project were not done as meticulously as they could have. Many features were simplified or extracted due to time contraints.
Given more time to go over the data cleaning, more analysis could have been done on the null values and missing data such as meta score. Crtic score seems like it could have had a large impact on overall movie rating.</p>
<h3 id="scraping-and-dataset">Scraping and Dataset</h3>
<p>The first step in the project was to actually get the dataset for movies in the US.</p>
<p>To start, there was a top 250 rating page that was going to be used but I felt that the 250 movies might not be enough for creating a good model.</p>
<p>In order to pull a good number of songs, I found this list of movies released in the U.S from 1972-2016</p>
<p><a href="http://www.imdb.com/list/ls057823854/">All U.S. Released Movies: 1972-2016</a></p>
<p>Since there were around 10,000 movies in this dataset I thought it would not be a good idea to scrape.
Last time when I scraped the average scrape time was 2-3 seconds per page as well as running into overall memory issues.</p>
<p>After checking the omdb api, it seemed easier to search movies via their imdb ID as opposed to title since there can be variations in the title.</p>
<p>This lead me to scraping the movie title followed by the IMDB ID and store it in a dataframe:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">scrape_panel</span><span class="p">(</span><span class="n">soup</span><span class="p">):</span>
<span class="n">col</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'div'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">'list compact'</span><span class="p">)</span>
<span class="n">names</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">imdbID</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">col</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"td"</span><span class="p">,</span><span class="n">class_</span><span class="o">=</span><span class="s">"title"</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">names</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">n</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">))</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">names</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">col</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"td"</span><span class="p">,</span><span class="n">class_</span><span class="o">=</span><span class="s">"title"</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">imdbID</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">i</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">imdbID</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'name'</span><span class="p">:</span> <span class="n">names</span><span class="p">,</span> <span class="s">'id'</span><span class="p">:</span> <span class="n">imdbID</span><span class="p">})</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">return</span> <span class="n">data</span></code></pre></figure>
<h3 id="api-calls-and-extraction">API Calls and Extraction</h3>
<p>To get all the movie information, the dataframe created from teh web scrape was used along with this function below:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">api_calls</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">ids</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span>
<span class="n">api_calls</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">'http://www.omdbapi.com/?i='</span><span class="o">+</span><span class="n">data</span><span class="p">[</span><span class="s">'id'</span><span class="p">][</span><span class="n">ids</span><span class="p">]</span><span class="o">+</span><span class="s">'&plot=full'</span><span class="p">)</span></code></pre></figure>
<p>The API call itself was quite simple since I wanted to extract all the information provided.</p>
<p>Surprisingly this was the easiest part since parsing the JSON output ended up being a bit difficult.</p>
<p>The information was all converted to a dictionary format, converted into a series. The information was stacked and the transformed to genrate a dataframe with all the data.</p>
<h3 id="data-cleaning">Data Cleaning</h3>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="n">RangeIndex</span><span class="p">:</span> <span class="mi">9951</span> <span class="n">entries</span><span class="p">,</span> <span class="mi">0</span> <span class="n">to</span> <span class="mi">9950</span>
<span class="n">Data</span> <span class="n">columns</span> <span class="p">(</span><span class="n">total</span> <span class="mi">25</span> <span class="n">columns</span><span class="p">):</span>
<span class="n">Actors</span> <span class="mi">9760</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Awards</span> <span class="mi">7021</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Country</span> <span class="mi">9786</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Director</span> <span class="mi">9652</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Episode</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Error</span> <span class="mi">154</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Genre</span> <span class="mi">9773</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Language</span> <span class="mi">9761</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Metascore</span> <span class="mi">4575</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Plot</span> <span class="mi">9718</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Poster</span> <span class="mi">9612</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Rated</span> <span class="mi">8638</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Released</span> <span class="mi">9568</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Response</span> <span class="mi">9951</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">bool</span>
<span class="n">Runtime</span> <span class="mi">9625</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Season</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">Title</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Type</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Writer</span> <span class="mi">9550</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Year</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">imdbID</span> <span class="mi">9797</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">imdbRating</span> <span class="mi">9662</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">imdbVotes</span> <span class="mi">9661</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">seriesID</span> <span class="mi">1</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">totalSeasons</span> <span class="mi">71</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">dtypes</span><span class="p">:</span> <span class="nb">bool</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">float64</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span> <span class="nb">object</span><span class="p">(</span><span class="mi">19</span><span class="p">)</span>
<span class="n">memory</span> <span class="n">usage</span><span class="p">:</span> <span class="mf">1.8</span><span class="o">+</span> <span class="n">MB</span></code></pre></figure>
<p><strong>Data Removal Process:</strong></p>
<p>TV Shows - I wanted to run this model on movies exclusively, since TV shows may have different metrics that make them good or bad.</p>
<p>Errors Column - This column only had error messages for some API calls that did not go through properly.</p>
<p>Poster Image - Given that no features were going to be generated from the image, the image was dropped.</p>
<p>Only keep movies with ratings - Movies with no ratings were dropped, since figuring out proper values to impute would have taken a while given how varied the set of movies is.</p>
<p>Remove Meta Score Column - Although this seemed like a good metric, there were too many missing values for it to be a reliable predictor of movie ratings overall.</p>
<p>Awards Column - This column was particularly complex, since it bundled Oscars, other awards, and nominations together. I decided to sum them all into a single award-count value, with null values set to 0.</p>
<p>Numerical Values - All columns with numerical values were converted from string format to number format.</p>
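Taken together, the removal steps above can be sketched in pandas. The column names come from the data summary earlier; the awards-string format (e.g. "Won 2 Oscars. Another 5 wins & 3 nominations.") is an assumption based on typical OMDb output, so treat this as an illustration rather than the exact code used:

```python
import re
import pandas as pd

def total_awards(text):
    # Sum every number in an OMDb-style awards string; nulls count as 0.
    # e.g. "Won 2 Oscars. Another 5 wins & 3 nominations." -> 10
    if pd.isnull(text):
        return 0
    return sum(int(n) for n in re.findall(r"\d+", str(text)))

def clean_movies(df):
    # drop TV shows and anything without a rating
    df = df[(df["Type"] == "movie") & df["imdbRating"].notnull()].copy()
    # drop the error/poster/metascore columns
    df = df.drop(columns=["Error", "Poster", "Metascore"], errors="ignore")
    # bundle all awards and nominations into one count
    df["Awards"] = df["Awards"].apply(total_awards)
    # convert string columns holding numbers into numeric types
    df["Runtime"] = pd.to_numeric(
        df["Runtime"].astype(str).str.extract(r"(\d+)")[0], errors="coerce")
    df["imdbVotes"] = pd.to_numeric(
        df["imdbVotes"].astype(str).str.replace(",", ""), errors="coerce")
    return df
```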
<h3 id="data-cleaning-1">Data Cleaning</h3>
<p>Final clean data:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="n">RangeIndex</span><span class="p">:</span> <span class="mi">8475</span> <span class="n">entries</span><span class="p">,</span> <span class="mi">0</span> <span class="n">to</span> <span class="mi">8474</span>
<span class="n">Data</span> <span class="n">columns</span> <span class="p">(</span><span class="n">total</span> <span class="mi">16</span> <span class="n">columns</span><span class="p">):</span>
<span class="n">Actors</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Awards</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">Country</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Director</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Genre</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Language</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Plot</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Rated</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Runtime</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">Title</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Writer</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="nb">object</span>
<span class="n">Year</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int32</span>
<span class="n">imdbRating</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">float64</span>
<span class="n">imdbVotes</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int32</span>
<span class="n">MonthReleased</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">DayReleased</span> <span class="mi">8475</span> <span class="n">non</span><span class="o">-</span><span class="n">null</span> <span class="n">int64</span>
<span class="n">dtypes</span><span class="p">:</span> <span class="n">float64</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">int32</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="n">int64</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span> <span class="nb">object</span><span class="p">(</span><span class="mi">9</span><span class="p">)</span></code></pre></figure>
<h3 id="visuazlizations">Visualizations</h3>
<p><img src="https://adrianll.github.io//assets/images/project6/AwardsNominations.png" alt="Rating Histogram" /></p>
<p>Median: 6.4 Rating</p>
<p>The movie ratings mostly lie around this median, and the distribution is left skewed, with more mass toward the higher ratings.</p>
<p><img src="https://adrianll.github.io//assets/images/project6/yearhist.png" alt="Year Histogram" />
This histogram shows movies released by year and highlights the heavy number of movies released after 2000. This may make the models work better for movies of that time period.</p>
<h3 id="modeling-and-analysis">Modeling and Analysis</h3>
<p>All the columns with string values were turned into dummy variables, aside from writers and plots.
Writers - Around 13-14 thousand writer dummy values would have been created, and the names were not very consistent, so they were kept out. Given more time, the writer data could have been cleaned up and used.</p>
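The dummy-variable step can be sketched with pandas; the toy columns below stand in for the real categorical columns such as Rated and Country:

```python
import pandas as pd

movies = pd.DataFrame({
    "Rated": ["PG", "R", "PG-13"],
    "Country": ["USA", "USA", "UK"],
})
# one binary indicator column per category value
dummies = pd.get_dummies(movies, columns=["Rated", "Country"])
print(sorted(dummies.columns))
```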
<p>Given the distribution of the ratings around the median, it made sense to categorize them into what is considered good and bad, similar to how YouTube categorizes videos (thumbs up and thumbs down). Given more time this might become three categories: good, bad, and neutral. I used the median rating of 6.4 as the cutoff for the binary high/low indicator.</p>
<p>A binary target was generated from the ratings: 1 at or above the median, 0 below it.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">high_rating</span><span class="p">(</span><span class="n">rating</span><span class="p">):</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">median</span><span class="p">(</span><span class="n">movies</span><span class="p">[</span><span class="s">'imdbRating'</span><span class="p">])</span>
<span class="k">if</span> <span class="n">rating</span><span class="o">>=</span> <span class="n">target</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">0</span></code></pre></figure>
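Applying the function above labels every movie; the toy ratings here are for illustration. Note that the median is recomputed on every call, so precomputing it once outside the function would be faster on the full dataset:

```python
import numpy as np
import pandas as pd

# toy stand-in for the real movies DataFrame
movies = pd.DataFrame({"imdbRating": [5.2, 6.4, 8.1, 3.0, 7.7]})

def high_rating(rating):
    target = np.median(movies['imdbRating'])  # 6.4 for this toy data
    if rating >= target:
        return 1
    else:
        return 0

movies["HighRating"] = movies["imdbRating"].apply(high_rating)
print(movies["HighRating"].tolist())  # [0, 1, 1, 0, 1]
```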
<p>Using a random forest classifier with gradient boosting yielded the best results, with an accuracy score of about 73%.</p>
<table>
<tr>
<th> </th>
<th>Predicted High Rating</th>
<th>Predicted Low Rating</th>
</tr>
<tr>
<td>High Rating</td>
<td>867</td>
<td>393</td>
</tr>
<tr>
<td>Low Rating</td>
<td>289</td>
<td>994</td>
</tr>
</table>
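The reported ~73% can be checked directly from the counts in the table above, and the same counts give precision and recall for the high-rating class:

```python
# Counts from the confusion matrix above (rows are actual classes)
tp, fn = 867, 393   # actual high rating: predicted high / predicted low
fp, tn = 289, 994   # actual low rating:  predicted high / predicted low

accuracy = (tp + tn) / (tp + fn + fp + tn)   # matches the ~73% reported
precision = tp / (tp + fp)                   # of predicted-high, how many were high
recall = tp / (tp + fn)                      # of actual-high, how many were caught
print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # 0.732 0.75 0.688
```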
<h3 id="conclusions">Conclusions</h3>
<p>Overall the random forest did not improve significantly with gradient boosting; however, the score in itself was quite good for a first attempt.</p>
<p>The improvement was slight, but in these types of prediction models an increase of about a percentage point is quite significant.</p>
<p>I think more features could have been engineered using the descriptions and writers. The descriptions in particular could have yielded better results, since they may describe parts of the movies that the genres do not capture.</p>
<p>However adding those features might also cause overfitting, so further analysis would need to be done to get the right number of descriptors out of the description.</p>
<p>Further analysis would be pinning down the exact features to improve the model even more.</p>The goal of this project was to use ensemble methods to create a movie prediction model. The model would determine if a movie would be getting a high or low rating. Some of the contributing factors to the rating would also be explored.Project Five Classification Disaster Management2017-03-01T12:00:00+00:002017-03-01T12:00:00+00:00https://adrianll.github.io//Project-5<h3 id="the-problem">The Problem:</h3>
<p>This project consists of accessing a remote database for the titanic disaster dataset, acquiring the data, and using that information to predict survival rates with the resulting regression model.</p>
<h3 id="risks-and-assumptions">Risks and Assumptions:</h3>
<p>There are some limitations to the titanic dataset in terms of missing data in both the age and cabin information. For the cabin allocations, it is assumed the gaps are clerical errors, so these factors were not used in the overall analysis. For age, there also seemed to be some data entry issues; for the scope of this analysis, the median age for each gender was used.</p>
<p>Cabin information is not being used due to how incomplete it is, so it will be assumed there just wasn’t enough data there or that it was not necessary for this analysis.</p>
<p>One other assumption is that the collected data is correct and accurate, since many of the individuals involved did pass away.</p>
<h3 id="making-sense-of-the-data-and-problems-with-the-data">Making Sense of the Data and Problems with the Data</h3>
<p>The titanic dataset was imported from a PSQL database into a pandas notebook.</p>
<p>Data Size on import was 891 passengers with 12 features for each one, including the target.</p>
<p>Upon an initial look at the imported data, it was clear there were some missing values:</p>
<p>There seem to be some missing values for age in:
Age 714 non-null float64</p>
<p>There is also a lot of missing data for the Cabin number of most of the passengers:
Cabin 204 non-null object</p>
<p>There are two missing values for the embarked location:
Embarked 889 non-null object</p>
<p><strong>Dealing with Age Missing Values</strong>
Missing age values were replaced with the median age of the corresponding gender group. I went with this solution since the alternative was losing those rows and all the other information they provided. The median was picked over the mean, since the mean could be skewed upward by some of the old-age outliers in the data.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s">'Age'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Sex'</span><span class="p">)[</span><span class="s">'Age'</span><span class="p">]</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">median</span><span class="p">()))</span></code></pre></figure>
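On toy data, the transform above fills each missing age with the median age of that passenger's gender group:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "female"],
    "Age": [20.0, None, 30.0, None, 40.0],
})
# male median is 20.0, female median is 35.0
df["Age"] = df.groupby("Sex")["Age"].transform(lambda x: x.fillna(x.median()))
print(df["Age"].tolist())  # [20.0, 20.0, 30.0, 35.0, 40.0]
```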
<p><strong>Dealing with Cabin Missing Values</strong>
Initially before looking at the data, I began to think that cabin location might be a good indicator of class as well as location during the disaster which might affect the survival rate of passengers. After looking through the data, it was determined that 687 values were missing and thus this column just got dropped. However, given more time it could possibly be explored a bit more and use the little data that we did get.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'Cabin'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span></code></pre></figure>
<p><strong>Dealing with Embarked Missing Values</strong>
Since there are only two missing values here, I looked up the names of these passengers and just filled in the missing value with the actual embarked location.</p>
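A minimal sketch of that fill on toy data; in the real dataset, both passengers with a missing port turn out to have embarked at Southampton ("S"):

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", None, "Q", None]})
# both looked-up passengers boarded at Southampton, so fill with "S"
df["Embarked"] = df["Embarked"].fillna("S")
print(df["Embarked"].tolist())  # ['S', 'C', 'S', 'Q', 'S']
```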
<p>Once the data was ready to be worked with, I decided it was time to look at some patterns in the data.</p>
<h3 id="visualization-and-analysis">Visualization and analysis</h3>
<p><img src="https://adrianll.github.io//assets/images/project5/gender.png" alt="Gender Chart" /></p>
<p>This was perhaps the most important graph and metric seen initially in the dataset. There was a huge disparity between the deaths of men and women during the titanic disaster. It seems the women and children were given priority for rescue over the men.</p>
<p><img src="https://adrianll.github.io//assets/images/project5/class.png" alt="Class Chart" />
From the above graph, aside from gender it seems there were far more deaths in the lowest class of passengers. This is somewhat troubling, given that the bulk of the deaths happened within that class, as opposed to the roughly even survival rates elsewhere.</p>
<p><img src="https://adrianll.github.io//assets/images/project5/port.png" alt="Port Chart" />
This third bar at first seemed to indicate more death rates from port C but after a second look it may be the case that there were just more people embarking from port C. This is most likely the case since there are more deaths and survivals from port C.</p>
<p><img src="https://adrianll.github.io//assets/images/project5/Age.png" alt="Age Chart" />
Aside from gender, class, or port, one other important metric seemed to be age. In the above graph the survivors are in blue and show a proportionally much higher survival rate in the younger ages, around 0-13. The death rates spike a bit after this but improve toward some of the older ages. The median-age region of this graph should not be taken too seriously, since its frequency is inflated by the data cleaning, which assigned the median age to over 100 individuals. Overall, it seemed that the younger you were, the better your chances of being rescued.</p>
<p>In general, the most important metric in determining the survival of the passengers was gender, and the odds were heavily against the ship’s male population.</p>
<h3 id="creating-prediction-models">Creating Prediction Models</h3>
<p>In order to create the regression models as well as some other test models, it was important to get the data setup correctly.</p>
<p>Since one of the models was KNN, it was important to preprocess and scale the data. The categorical values for gender and port location also needed to be converted to numbers.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Turn all male/female notations into 1 or 0</span>
<span class="n">df</span><span class="p">[</span><span class="s">'Sex'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Sex'</span><span class="p">]</span><span class="o">.</span><span class="nb">map</span><span class="p">({</span><span class="s">'male'</span><span class="p">:</span><span class="mi">1</span><span class="p">,</span><span class="s">'female'</span><span class="p">:</span><span class="mi">0</span><span class="p">})</span>
<span class="c"># Turn all the ports into categoricals as 0,1,2</span>
<span class="n">df</span><span class="p">[</span><span class="s">'Embarked'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Embarked'</span><span class="p">]</span><span class="o">.</span><span class="nb">map</span><span class="p">({</span><span class="s">'S'</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="s">'Q'</span><span class="p">:</span><span class="mi">1</span><span class="p">,</span> <span class="s">'C'</span><span class="p">:</span><span class="mi">2</span><span class="p">})</span></code></pre></figure>
<p><strong>Set up our parameters and scale the data</strong></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Survived'</span><span class="p">]</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'PassengerId'</span><span class="p">,</span><span class="s">'Name'</span><span class="p">,</span> <span class="s">'Ticket'</span><span class="p">,</span><span class="s">'Survived'</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">X_scaled</span> <span class="o">=</span> <span class="n">MinMaxScaler</span><span class="p">()</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span></code></pre></figure>
<p>Once the data was set up, I created a train test split for the dataset and began searching for the best parameters using a grid search.</p>
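A sketch of the split-and-search step on synthetic stand-in data; the actual parameter grid used in the project isn't given in the post, so the C values below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the scaled titanic features and survival target
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + 0.2 * rng.randn(200) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# grid search over the regularization strength of a logistic regression
grid = GridSearchCV(LogisticRegression(solver="liblinear"),
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 2))
```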
<p>For Logistic Regression, the best score attained was: 80% accuracy</p>
<p>Below are the ROC curve and confusion matrix, denoting the true and false predictions. The ROC curve gives a good view of the predictions at different thresholds, against a 50% baseline.</p>
<table>
<tr>
<th> </th>
<th>Positive</th>
<th>Negative</th>
</tr>
<tr>
<td>Positive</td>
<td>80</td>
<td>21</td>
</tr>
<tr>
<td>Negative</td>
<td>31</td>
<td>136</td>
</tr>
</table>
<p><img src="https://adrianll.github.io//assets/images/project5/roc reg.png" alt="ROC Reg" /></p>
<p>Other models were also tested with very similar results; below is KNN:</p>
<table>
<tr>
<th> </th>
<th>Positive</th>
<th>Negative</th>
</tr>
<tr>
<td>Positive</td>
<td>69</td>
<td>11</td>
</tr>
<tr>
<td>Negative</td>
<td>42</td>
<td>146</td>
</tr>
</table>
<p><img src="https://adrianll.github.io//assets/images/project5/roc_knn.png" alt="ROC KNN" /></p>
<p>Decision Tree Classifier:</p>
<table>
<tr>
<th> </th>
<th>Positive</th>
<th>Negative</th>
</tr>
<tr>
<td>Positive</td>
<td>68</td>
<td>23</td>
</tr>
<tr>
<td>Negative</td>
<td>43</td>
<td>134</td>
</tr>
</table>
<p><img src="https://adrianll.github.io//assets/images/project5/roc dtc.png" alt="ROC DTC" /></p>
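Computing accuracy from the three confusion matrices above makes the comparison concrete (each tuple is the four table cells in reading order, so the correct predictions are the first and last entries):

```python
# cell counts copied from the three tables above
models = {
    "logistic regression": (80, 21, 31, 136),
    "knn":                 (69, 11, 42, 146),
    "decision tree":       (68, 23, 43, 134),
}
for name, (a, b, c, d) in models.items():
    accuracy = (a + d) / (a + b + c + d)
    print(name, round(accuracy, 3))
# logistic regression 0.806
# knn 0.802
# decision tree 0.754
```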
<h3 id="conclusion-and-final-thoughts">Conclusion and Final thoughts</h3>
<p>In regards to the modeling predictions, the best performing model was the logistic regression, with KNN close behind and the decision tree classifier trailing. The accuracy score isn’t all that great, but I believe it could have been improved with better features as well as some more extensive feature extraction.</p>
<p>Some improvements could have been using some of the titles found in the passenger names, which might be indicative of some other hidden metric. Additionally the cabin data as mentioned earlier could have proven to be useful to improve the model.</p>The Problem: This project consists of accessing a remote database for the titanic disaster dataset. Acquiring the data for the titanic disaster dataset and using that information to predict survival rates using the created regression model.Project Four Data Scraping Project2017-03-01T12:00:00+00:002017-03-01T12:00:00+00:00https://adrianll.github.io//Project-4<h3 id="the-problem">The Problem</h3>
<p>The objective of this project is to create a regression model using binary indicators to help predict whether a salary is high or low. Only basic binary indicators are given initially, and many of them need to be constructed from the acquired data. To acquire the data, Glassdoor was scraped, and the results were cleaned and loaded into Pandas DataFrames for analysis.</p>
<h3 id="risks">Risks:</h3>
<ul>
<li>
<p>The dataset is mostly concentrated around a small number of cities which I found to have a large number of data science positions; many positions outside of these states were not taken into account due to the time constraints of the project.</p>
</li>
<li>
<p>There is missing data for certain locations, since the initial scrape would not let me go past ~30 pages per location. This means each “state” isn’t actually the entire state, but rather the first 30 or so pages of results for that particular state.</p>
</li>
<li>
<p>The salary estimates themselves are limited, since I am basing my predictions on the salaries already estimated by Glassdoor’s own algorithm. Since exact salaries aren’t provided, I took the midpoint between the min and max of the salary range and used that as my salary indicator.</p>
</li>
<li>
<p>The feature selection was done somewhat arbitrarily, by picking the most common words that sounded impactful; this may not be the best approach.</p>
</li>
</ul>
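The midpoint used as the salary indicator can be sketched like this; the exact Glassdoor range format is an assumption for illustration:

```python
import re

def salary_midpoint(text):
    # midpoint of a salary-range string such as "$80k-$120k"
    # (figures assumed to be in thousands)
    nums = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    return (min(nums) + max(nums)) / 2 * 1000

print(salary_midpoint("$80k-$120k"))  # 100000.0
```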
<p><em>Assumptions:</em></p>
<ul>
<li>
<p>Given the enormous dataset and time constraints, some job positions not related to data science may have slipped in. Although I tried filtering these out in the initial scrape, it was difficult to do so completely, so it will be assumed that all scraped data is data science related.</p>
</li>
<li>
<p>Some data may overlap if the same company is hiring in multiple states. I did try to mitigate this, but some overlap may remain.</p>
</li>
</ul>
<h3 id="webscraping">Webscraping</h3>
<p>In order to web scrape Glassdoor, selenium and beautiful soup were both used. Selenium was needed since the website was an AJAX website and would return an error with a normal request. Beautiful soup was used to parse through the page and find the correct panels needed for scraping.</p>
<p>The first step in scraping was getting the needed search results, meaning data science salaries for a specific state. From those search results, an initial extraction pulled the following features:</p>
<ul>
<li>Company name</li>
<li>Location name</li>
<li>Salary</li>
<li>Post URL</li>
<li>Position</li>
</ul>
<p>In order to do this for all the states I wanted, it was easier to break each step into functions. Seven states were scraped separately, since Glassdoor does not show results beyond ~30 pages. The results were saved to CSV output and then combined into one data frame with all the page URLs and relevant information. This is where the time-consuming scrape started, since about 4K pages had to be opened and scraped for information. The process took about 30 minutes per 500 pages.</p>
<p>Below is the function used to do the initial scrape.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">scrape_panel</span><span class="p">(</span><span class="n">soup</span><span class="p">):</span>
<span class="n">leftcol</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'ul'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"jlGrid hover"</span><span class="p">)</span>
<span class="n">comp_name</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">loc_name</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">salary</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">job_urls</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">position</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">leftcol</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"div"</span><span class="p">,</span><span class="n">class_</span><span class="o">=</span><span class="s">"flexbox empLoc"</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">comp_name</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">n</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">comp_name</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">loc_name</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">n</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">)[</span><span class="mi">2</span><span class="p">])</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">loc_name</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">leftcol</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'li'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">'jl'</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">salary</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'span'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">'green small'</span><span class="p">)</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">))</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">salary</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">leftcol</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">job_urls</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">'https://www.glassdoor.com'</span><span class="o">+</span><span class="n">l</span><span class="p">[</span><span class="s">'href'</span><span class="p">]</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">))</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">job_urls</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">position</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span><span class="s">'ignore'</span><span class="p">))</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">position</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="n">job_urls</span> <span class="o">=</span> <span class="n">job_urls</span><span class="p">[</span><span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span>
<span class="n">position</span> <span class="o">=</span> <span class="n">position</span><span class="p">[</span><span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'company_name'</span><span class="p">:</span> <span class="n">comp_name</span><span class="p">,</span>\
<span class="s">'location'</span><span class="p">:</span> <span class="n">loc_name</span><span class="p">,</span>\
<span class="s">'salary'</span><span class="p">:</span><span class="n">salary</span><span class="p">,</span>\
<span class="s">'position'</span><span class="p">:</span><span class="n">position</span><span class="p">,</span>\
<span class="s">'urls'</span><span class="p">:</span> <span class="n">job_urls</span><span class="p">})</span>
<span class="c">#------------------------------------------------------------#</span>
<span class="k">return</span> <span class="n">data</span></code></pre></figure>
<h3 id="data-cleaning">Data Cleaning</h3>
<p>Once the scraping portion was completed, the data had to be cleaned. Here are some of the main things I was looking for when cleaning:</p>
<ul>
<li>Repeated job postings</li>
<li>Empty cells / null values</li>
<li>Must have salary information provided</li>
</ul>
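The three cleaning criteria above can be sketched with pandas. This is a minimal illustration on a toy frame; the column names follow the scraper's DataFrame, but the rows (and the duplicate/missing-salary cases) are made up:

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the scraped columns; the duplicate posting and the
# missing salary are illustrative
data = pd.DataFrame({
    'company_name': ['Acme', 'Acme', 'Beta'],
    'location': ['New York, NY', 'New York, NY', 'San Jose, CA'],
    'position': ['Data Scientist', 'Data Scientist', 'Analyst'],
    'salary': [100000, 100000, np.nan],
    'urls': ['u1', 'u1', 'u2'],
})

# 1) Repeated job postings: same company, position, and location
data = data.drop_duplicates(subset=['company_name', 'position', 'location'])
# 2/3) Empty cells: postings must have salary information provided
data = data.dropna(subset=['salary']).reset_index(drop=True)
```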
<p>Upon finishing the initial cleaning, here are the salary distributions that were found:</p>
<p align="center">
<iframe width="800" height="550" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1rZoq2xoI2YRi5Y-A-MI8pxZHRbaKvQTfx_Kvwh1rLws/pubchart?oid=943616562&format=interactive"></iframe>
</p>
<p>From the histogram, the salary distribution is roughly normal, with the main concentration of data around the 100K region.</p>
<p>Median = 100,000
Mean = 102,961.48</p>
<p>It is also important to note how the data is distributed across states. That is, which states are most of the postings coming from after the cleaning?</p>
<p align="center">
<iframe width="600" height="171.5" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1rZoq2xoI2YRi5Y-A-MI8pxZHRbaKvQTfx_Kvwh1rLws/pubchart?oid=20439445&format=interactive"></iframe>
</p>
<p>Although I started with around 1000 postings per state, many were lost during the cleaning. As seen above, the majority of the remaining listings are from California and New York.</p>
<h3 id="modeling-and-feature-extraction">Modeling and Feature Extraction</h3>
<p>The median salary was chosen as the classification cutoff for the model, since it splits the postings into the higher- and lower-paying halves.</p>
<p>Features were extracted by taking value counts of all the words in the titles and job descriptions. I then looked at the top 20 and hand-picked the most relevant ones. Each description was checked for these words and given a True or False marker depending on whether the word was found. Once these values were converted into dummy variables, the data was split into features and target as X and y. The target column was produced by comparing each salary against the median cutoff.</p>
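A minimal sketch of this keyword-flagging and median-cutoff step. The keyword list and the two postings below are made up for illustration; the real list came from the top-20 value counts:

```python
import pandas as pd

# Hypothetical hand-picked keywords (illustrative, not the project's list)
keywords = ['python', 'machine learning', 'phd']

descriptions = pd.Series([
    'Seeking data scientist with Python and machine learning experience',
    'Entry level analyst role, Excel required',
])
salaries = pd.Series([120000, 80000])

# One True/False column per keyword: found in the description or not
features = pd.DataFrame({
    kw: descriptions.str.lower().str.contains(kw) for kw in keywords
})

# Binary target: 1 if the salary is above the median cutoff, else 0
target = (salaries > salaries.median()).astype(int)
```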
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">reg_classf</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">'target'</span><span class="p">]</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.30</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">logreg</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">solver</span><span class="o">=</span><span class="s">'liblinear'</span><span class="p">)</span>
<span class="n">C_vals</span> <span class="o">=</span> <span class="p">[</span> <span class="o">.</span><span class="mi">1</span><span class="p">,</span><span class="o">.</span><span class="mi">2</span><span class="p">,</span><span class="o">.</span><span class="mi">3</span><span class="p">,</span><span class="o">.</span><span class="mi">7</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">20</span><span class="p">,</span><span class="mi">30</span><span class="p">,</span><span class="mi">40</span><span class="p">]</span>
<span class="n">penalties</span> <span class="o">=</span> <span class="p">[</span><span class="s">'l1'</span><span class="p">,</span><span class="s">'l2'</span><span class="p">]</span>
<span class="n">gs</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">logreg</span><span class="p">,</span> <span class="p">{</span><span class="s">'penalty'</span><span class="p">:</span> <span class="n">penalties</span><span class="p">,</span> <span class="s">'C'</span><span class="p">:</span> <span class="n">C_vals</span><span class="p">},</span>\
<span class="n">verbose</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">gs</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span></code></pre></figure>
<p>This data was able to provide an accuracy of: <strong>65%</strong></p>
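The accuracy figure corresponds to the grid search's best cross-validated score. A hedged sketch of reading it out, using a synthetic stand-in for the real keyword-feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the keyword-feature matrix and target
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

gs = GridSearchCV(LogisticRegression(solver='liblinear'),
                  {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]}, cv=5)
gs.fit(X_train, y_train)

# Best cross-validated accuracy, and held-out accuracy of the refit model
best_cv = gs.best_score_
test_acc = accuracy_score(y_test, gs.predict(X_test))
```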
<p>Below is a confusion matrix showing the actual (top) vs. predicted (left) classes.</p>
<table>
<tr>
<th> </th>
<th>Actual Positive</th>
<th>Actual Negative</th>
</tr>
<tr>
<td>Predicted Positive</td>
<td>412</td>
<td>139</td>
</tr>
<tr>
<td>Predicted Negative</td>
<td>247</td>
<td>353</td>
</tr>
</table>
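As a sanity check, the counts in this table can be fed back through scikit-learn to recover the matrix and the overall accuracy. Note that sklearn's convention is rows = actual, columns = predicted, the transpose of the layout above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Reconstruct label vectors from the table's counts (1 = above-median salary)
y_true = [1] * (412 + 247) + [0] * (139 + 353)
y_pred = [1] * 412 + [0] * 247 + [1] * 139 + [0] * 353

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
acc = accuracy_score(y_true, y_pred)
```

The accuracy works out to about 66%, consistent with the reported score.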
<p>I have also included the ROC curve, which shows fairly poor results for this specific model.
<img src="https://adrianll.github.io//assets/images/project4/roc.png" alt="ROC" /></p>
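An ROC curve like this is built from the positive-class probabilities. A small sketch with made-up probabilities; in the project these would come from the fitted grid search, e.g. `gs.predict_proba(X_test)[:, 1]`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative labels and predicted probabilities for the positive class
y_test = np.array([1, 1, 0, 1, 0, 0, 1, 0])
probs = np.array([0.9, 0.6, 0.55, 0.4, 0.3, 0.65, 0.7, 0.2])

# False-positive rate, true-positive rate, and the thresholds that produce them
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) gives the curve; the AUC summarizes it in one number.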
<h3 id="conclusion-and-final-thoughts">Conclusion and Final thoughts</h3>
<p>The model had a best score of about 65%, which is quite poor but acceptable for a first pass at this project. It could be improved by adding more data, cleaning it more thoroughly, and improving feature selection.</p>
<p>Finding patterns between the job descriptions and the salaries is probably the most crucial part of this project, and that portion suffered from the incomplete data cleaning and the slow scrape.</p>The ProblemProject Three House Prices2017-02-19T12:00:00+00:002017-02-19T12:00:00+00:00https://adrianll.github.io//Project-3-House-Prices<h1 id="predicting-house-prices-using-linear-regression">Predicting House Prices Using Linear Regression</h1>
<h2 id="main-problem-and-objectives">Main Problem and Objectives</h2>
<ul>
<li>Build a prediction model for house pricing in Ames, IA</li>
<li>Where are most sales taking place?</li>
<li>Where are the most expensive houses located?</li>
<li>Discuss Possible Improvements</li>
</ul>
<h2 id="describing-the-data-and-limitations">Describing the Data and Limitations</h2>
<ul>
<li>Target Prediction Feature: Sale Price</li>
<li>Number of Instances: 1460</li>
<li>Number of Attributes Allowed: 18</li>
<li>Years of data collected: 2006 - 2010</li>
<li>Missing</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The attributes provided are not necessarily the best indicators of the house pricing</li>
<li>The data was collected mostly during a particularly unstable period in the housing market (2006&ndash;2010)</li>
</ul>
<h2 id="understanding-the-data">Understanding the Data</h2>
<p><strong>Looking at Correlations</strong>
<img src="https://adrianll.github.io//assets/images/project3/CorrelationMap.png" alt="CorrelationMap" />
<strong>Quality & Price Correlation</strong>
<img src="https://adrianll.github.io//assets/images/project3/SalevQual.png" alt="SaleVsQual" /></p>
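A correlation matrix like the one behind the heatmap can be computed directly with pandas. This sketch uses a few of the Ames column names with illustrative values:

```python
import pandas as pd

# Toy stand-in for the Ames data; values are illustrative
df = pd.DataFrame({
    'SalePrice':   [208500, 181500, 223500, 140000, 250000],
    'OverallQual': [7, 6, 7, 7, 8],
    'GrLivArea':   [1710, 1262, 1786, 1717, 2198],
})

# Pairwise correlations; a heatmap visualizes a matrix like this
corr = df.corr()
# Features ranked by correlation with the target
top = corr['SalePrice'].sort_values(ascending=False)
```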
<p><strong>Looking at Sale Prices Across Neighborhoods</strong>
<img src="https://adrianll.github.io//assets/images/project3/SalePriceBox1.png" alt="Sale Price Box Plot" /></p>
<h2 id="where-are-the-most-sales-happening">Where are the most sales happening?</h2>
<ul>
<li>Most Sales Happening in:
<ul>
<li>North Ames</li>
</ul>
</li>
<li>How many happened?
<ul>
<li>225 (about 15.41% of total sales)</li>
</ul>
</li>
</ul>
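The per-neighborhood counts behind these numbers come from a simple frequency count. A toy sketch; with the full 1460 sales, North Ames gives 225/1460 &asymp; 15.41%:

```python
import pandas as pd

# Toy neighborhood column; the real data has 1460 sales
neigh = pd.Series(['NAmes', 'NAmes', 'CollgCr', 'NAmes', 'OldTown'])

counts = neigh.value_counts()                 # sales per neighborhood
share = neigh.value_counts(normalize=True)    # fraction of total sales
```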
<p align="center">
<iframe width="600" height="371" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1_7Afgm5NYnjFvN1vYmyvqtW30sBuUa9kT7x0gFPYSMs/pubchart?oid=1718000841&format=interactive"></iframe>
</p>
<h2 id="where-are-the-most-expensive-homes">Where are the most expensive homes?</h2>
<p align="center">
<iframe width="600" height="371" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1_7Afgm5NYnjFvN1vYmyvqtW30sBuUa9kT7x0gFPYSMs/pubchart?oid=1071517443&format=interactive"></iframe>
</p>
<h2 id="creating-a-regression-model">Creating a Regression model</h2>
<ul>
<li>Type of Regression: Linear</li>
<li>Attributes Dropped: Utilities</li>
<li>Dummy Variable Selection: All except Lot Area and GrLivArea
<ul>
<li>Accuracy Testing:</li>
<li>R Squared = 0.899</li>
<li>Mean Absolute Error: 16261.24</li>
<li>Mean Squared Error: 637850318.41</li>
<li>Root Mean Squared Error: 25255.70</li>
<li>Cross Validation Score: 0.738</li>
</ul>
</li>
<li>Limitations of the Model</li>
</ul>
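A minimal sketch of fitting a linear regression and computing the metrics listed above (R squared, MAE, RMSE, cross-validation score). It uses synthetic stand-in data rather than the dummy-encoded Ames features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and sale prices
rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([100000.0, 50000.0, 25000.0]) + rng.randn(100) * 10000

model = LinearRegression().fit(X, y)
pred = model.predict(X)

r2 = model.score(X, y)                              # in-sample R squared
mae = mean_absolute_error(y, pred)                  # mean absolute error
rmse = np.sqrt(mean_squared_error(y, pred))         # root mean squared error
cv = cross_val_score(model, X, y, cv=5).mean()      # cross-validation score
```

Comparing `r2` against `cv`, as in the report above, is one quick way to spot overfitting: a large gap means the model fits the training data better than held-out folds.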
<p>There were many outliers in the data, which caused the RMSE and MSE to be quite high.
The cross-validation score was lower than that of the initial model; although R squared improved, this gap could be a sign of overfitting.</p>
<p><img src="https://adrianll.github.io//assets/images/project3/regression.png" alt="Regression" /></p>
<h2 id="possible-improvements">Possible Improvements</h2>
<ul>
<li>
<p>More location-based metrics such as surrounding businesses, schools, police stations, etc.</p>
</li>
<li>
<p>More insight into the overall condition and quality metric</p>
</li>
<li>
<p>More data points for expensive homes, to improve predictions at the high end of the market.</p>
</li>
</ul>Predicting House Prices Using Linear regressionProject Two Billboard2017-02-07T12:00:00+00:002017-02-07T12:00:00+00:00https://adrianll.github.io//Project-2-Billboard<h1 id="top-100-billboard-singles-of-the-year-2000">Top 100 Billboard Singles of the Year 2000</h1>
<h2 id="introduction">Introduction</h2>
<p>The objective of this analysis was to utilize billboard data for the top 100 songs of the year 2000. The data set was not clean, so a fair amount of data cleaning was needed before doing anything with the numbers.</p>
<h2 id="making-sense-of-the-data">Making sense of the data</h2>
<p>My first step into this project was to look into the data and get as much information as possible from the given CSV. This would help me find errors, next steps, and overall give me a good plan of action to see what I would be able to extract from the data later on.</p>
<p><strong>What am I looking at?</strong></p>
<p>Billboard data for the top 100 charts for the year 2000, showing the song peak cycle from the time the song entered the top 100 to the time it left. The peak position and date is also given along with the song artist, name, genre, and length.</p>
<p><strong>How big is the data?</strong></p>
<p><em>Rows:</em>
317, containing songs that reached the Top 100 in the year 2000</p>
<p><em>Columns:</em>
83, containing song attributes</p>
<p>There is no missing information from this dataset.</p>
<p><strong>What do these headers mean?</strong></p>
<ul>
<li><em>year</em> - Year is 2000 for all songs, denoting they peaked in the top 100 during this year</li>
<li><em>artist.inverted</em> - artist or band name, artist full name will be inverted</li>
<li><em>track</em> - Track title</li>
<li><em>time</em> - Track length, is later on converted to seconds</li>
<li><em>genre</em> - Track genre from 12 given genres</li>
<li><em>date.entered</em> - Date the track entered the top 100</li>
<li><em>date.peaked</em> - Peak date of the track (highest position on the top 100)</li>
<li><em>x1st.week - x76th.week</em> - position at given week number for the given track</li>
</ul>
<p>columns added later on:</p>
<ul>
<li><em>weeks to peak</em> - how long it took for the song to peak in the charts</li>
<li><em>weeks on chart</em> - how long the song remained in the top 100 charts</li>
<li><em>worst position</em> - worst chart ranking during its top 100 position</li>
<li><em>best position</em> - best chart ranking during its top 100 position</li>
<li><em>enter rank</em> - the rank the song entered the top 100 charts with</li>
<li><em>exit rank</em> - the rank the song exited the top 100 charts with</li>
</ul>
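The date-derived columns such as <em>weeks to peak</em> require parsing the string dates first. A sketch with illustrative dates; the real columns are <em>date entered</em> and <em>date peaked</em>:

```python
import pandas as pd

# Illustrative dates; in the project these come from the billboard CSV
bb = pd.DataFrame({
    'date entered': ['2000-02-12', '2000-09-02'],
    'date peaked':  ['2000-04-08', '2000-11-18'],
})

# Convert the strings to real date values, then derive "weeks to peak"
entered = pd.to_datetime(bb['date entered'])
peaked = pd.to_datetime(bb['date peaked'])
bb['weeks to peak'] = (peaked - entered).dt.days // 7
```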
<p><strong>Are there any problems with the data?</strong></p>
<p>Overall, the issues with the data were mostly formatting related, such as the header naming. I needed to rename the headers to make them more understandable.</p>
<p>Time and date were also going to be an issue: the song length was not set up properly in MM:SS format, and the date-entered and date-peaked columns were strings instead of date values. This would make any calculations on these dates difficult down the line.</p>
<p>Some big problems I found were also around the genre and the classification of data around it. The given classifications are very confusing since they don’t seem to match the songs. There also seem to be some input errors with the genre R&B that would need some cleaning.</p>
<p>Finally, there are some data type conversions needed to handle the numbers properly, as well as invalid entries (*) to deal with.</p>
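One way to handle the invalid (*) entries is to let the numeric conversion coerce them to NaN. A small sketch on one toy column; in practice this would be applied to each weekly ranking column:

```python
import pandas as pd

# Weekly rankings with an invalid '*' entry; coercion turns any
# non-number into NaN while converting the rest to floats
week = pd.Series(['78', '63', '*', '49'])
week = pd.to_numeric(week, errors='coerce')
```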
<p><strong>What risks am I taking with this data?</strong></p>
<p>Given the issues outlined above, I see some of the biggest risks coming from the genre section of the data. There appears to be widespread misclassification of the music, especially within rock n roll: songs from every other sub-genre are thrown into it. Given these errors, it leads one to wonder what other issues may exist in the overall genre classification or how it was derived.</p>
<p>For the purposes of this analysis, it will be assumed that the genre classifications have been done correctly. Aside from combining the two R&B genres, the rest will be left as is. This would also prevent personal bias against a genre that could possibly affect the end classification.</p>
<p>It is also assumed that the song attributes given have been input correctly.</p>
<p><strong>What am I trying to accomplish here?</strong></p>
<ul>
<li>
<p>Problem: Does music popularity inherently depend on public attraction, or are there other forces such as marketing, distribution, and contracts that affect the popularity of our songs?</p>
</li>
<li>
<p>Hypothesis: Track duration is a big decider in track popularity and will have specific parameters that will increase the probability of a song making it to the top 100.</p>
</li>
</ul>
<h2 id="data-cleaning">Data Cleaning</h2>
<p>In the data cleaning process, it was my goal to convert all the chart information into proper data types that could be used as well as making the information as clear as possible. Here are some of the data cleaning processes I used in order to get my final working data frame.</p>
<p><em>Cleaning Up the Headers</em></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#cleaned columns by removing the '.'</span>
<span class="c">#also removed the 'x' at the beginning of the week columns</span>
<span class="n">bb</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'.'</span><span class="p">,</span><span class="s">' '</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">bb</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span>
<span class="n">bb</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'x'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">bb</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span>
<span class="c">#renamed artist and duration columns to clarify their content</span>
<span class="n">bb</span> <span class="o">=</span> <span class="n">bb</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"artist inverted"</span><span class="p">:</span> <span class="s">"artist"</span><span class="p">})</span>
<span class="n">bb</span> <span class="o">=</span> <span class="n">bb</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"time"</span><span class="p">:</span> <span class="s">"duration"</span><span class="p">})</span></code></pre></figure>
<p><em>Converted the track duration column</em></p>
<p>This was the time format provided by the data:</p>
<p><strong>mm,ss,ms AM</strong></p>
<p>In order to use this time information, I thought it best to convert it to seconds. This makes visualization a lot easier, and conversion to other data types simple should the need arise.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Create a function that takes in a string and outputs seconds</span>
<span class="k">def</span> <span class="nf">get_seconds</span><span class="p">(</span><span class="n">string</span><span class="p">):</span>
<span class="n">sp</span> <span class="o">=</span> <span class="n">string</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">','</span><span class="p">)</span>
<span class="n">seconds</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">sp</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="o">*</span><span class="mi">60</span> <span class="o">+</span> <span class="nb">int</span><span class="p">(</span><span class="n">sp</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">return</span> <span class="n">seconds</span>
<span class="n">bb</span><span class="p">[</span><span class="s">'duration'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bb</span><span class="p">[</span><span class="s">'duration'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">get_seconds</span><span class="p">)</span></code></pre></figure>
<p>These are just some of the cleaning steps done on this data. All the changes made are documented in <a href="https://github.com/AdrianLl/AdrianLl.github.io/blob/master/projects/billboard/Project%202%20Billboard%20Hits%20%2B%20Data%20Munging.ipynb">my jupyter notebook</a>.</p>
<h1 id="generating-new-data-from-the-clean-data">Generating New Data From the Clean Data</h1>
<p>One important piece of information that is given but is not quickly visible or accessible is the song ranking upon entering the top 100 list and the exit ranking.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#generated a list of the enter and exit positions for all the songs</span>
<span class="n">exit_loc</span> <span class="o">=</span> <span class="n">week_data</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="o">.</span><span class="n">last_valid_index</span> <span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">enter_loc</span> <span class="o">=</span> <span class="n">week_data</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="o">.</span><span class="n">first_valid_index</span> <span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="nb">exit</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">enter</span><span class="o">=</span> <span class="p">[]</span>
<span class="c">#converted the lists added them to billboard data</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">bb</span><span class="p">)):</span>
<span class="nb">exit</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">bb</span><span class="p">[</span><span class="n">exit_loc</span><span class="p">[</span><span class="n">i</span><span class="p">]][</span><span class="n">i</span><span class="p">]))</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">bb</span><span class="p">)):</span>
<span class="n">enter</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">bb</span><span class="p">[</span><span class="n">enter_loc</span><span class="p">[</span><span class="n">j</span><span class="p">]][</span><span class="n">j</span><span class="p">]))</span></code></pre></figure>
<p>Another set of columns I added were the worst and best positions of a given song during their time in the top 100.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#get max and min values for the rankings for the columns with "weeks"</span>
<span class="n">bb</span><span class="p">[</span><span class="s">'worst position'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bb</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">6</span><span class="p">:</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">bb</span><span class="p">[</span><span class="s">'best position'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bb</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">6</span><span class="p">:</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="nb">min</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="c">#converted the positions to int, since rankings are not floats</span>
<span class="n">bb</span><span class="p">[</span><span class="s">'worst position'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bb</span><span class="p">[</span><span class="s">'worst position'</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">bb</span><span class="p">[</span><span class="s">'best position'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bb</span><span class="p">[</span><span class="s">'best position'</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span></code></pre></figure>
<h2 id="visualizing-the-data">Visualizing the data</h2>
<p><img src="https://adrianll.github.io//assets/images/project2/Weeks vs Exit Rank.png" alt="Weeks vs Exit Ranking" />
<img src="https://adrianll.github.io//assets/images/project2/Hot 100 Weeks on Billboard.png" alt="Chart Duration Histogram" /></p>
<p>Before seeing this histogram, I expected the frequency to be much flatter, with a small decline as the weeks went on. However, there is a very specific number of weeks that songs stayed on the chart, around 20, where the count spikes sharply. (One possible explanation is Billboard's recurrent rule, which removes descending songs from the Hot 100 after 20 weeks once they fall below a cutoff position.)
I did some research here:</p>
<p>http://www.billboard.com/articles/columns/ask-billboard/5740625/ask-billboard-how-does-the-hot-100-work</p>
<blockquote>
<p>Generally speaking, our Hot 100 formula targets a ratio of sales (35-45%), airplay (30-40%) and streaming (20-30%). (Year 2013)</p>
</blockquote>
<p>This is how the metrics were calculated around 2013, but in the year 2000 streaming was a non-existent metric. Assuming sales and airplay carried roughly equal weight in the chart formula, purchase patterns or airplay contracts could be factors in a song's position, perhaps more so than actual popularity.</p>
<p><img src="https://adrianll.github.io//assets/images/project2/Hot 100 Track Durations.png" alt="Histogram of Track Durations" />
<img src="https://adrianll.github.io//assets/images/project2/Track Duration v Weeks.png" alt="Track Duration vs Weeks" /></p>
<p>I also wanted to look into the track duration metric to see if there are any patterns on the billboard rankings. As seen in the initial histogram, there seems to be a strong concentration around the 200 to 300 second mark for song popularity. The songs around that range seem to do the best overall in terms of best position as well as duration on the top 100 charts.</p>
<h1 id="conclusions-and-findings">Conclusions and Findings</h1>
<p>Overall, my initial hypothesis does not seem to be completely correct. I expected some strong metric (track duration) to drive the popularity of songs in the top 100 chart. Initially this does seem to be the case, but I also found very steep drop-offs in popularity around the 20-week mark, as if songs simply stopped being popular at that specific point.
The heavy positive skew in track duration does suggest that shorter songs are more popular. However, the abrupt drop-offs in chart duration show that greater external factors are in play. It would be interesting to know how popularity was calculated on average for this year; that might provide better insight into the current data.
To improve insight into what makes a song popular, it might help to use actual user listening patterns rather than company-regulated charts. Calculating popularity from pure listening patterns could remove some external factors such as distribution and contracts. However, doing so might also bias the findings toward a specific type of online music listener.</p>
<p>My Jupyter notebook on this project can be found <a href="https://github.com/AdrianLl/AdrianLl.github.io/blob/master/projects/billboard/Project%202%20Billboard%20Hits%20%2B%20Data%20Munging.ipynb">here</a></p>Top 100 Billboard Singles of the Year 2000