tjjdoherty/python-statistical-modelling

Statistical-Modelling-with-Python

Jan 2025 portfolio update

  • I'm generally happy with the state of this project. It was designed to develop my statistical modelling and my use of APIs. I want to keep developing both, but having already worked with Yelp and Foursquare, I would likely benefit more from using other APIs in different projects. Per the Future Goals section below, I have already taken two API pulls comparing times of day. No significant changes to this repo are planned.

Project/Goals

  • The goal of the project was to statistically model whether there is a relationship between bike availability at a station and the number of Points of Interest (POIs herein) in the vicinity, their specific categories, and the location of the bike station itself.
  • To do this, I used the City Bikes, Foursquare and Yelp APIs, combining the data and extracting the points of interest, their locations and the bike availability at each station. With this I did some data visualization and linear regression modelling.

Process

Step 1: CityBikes:

  • I used the City Bikes API to find the 826 bike stations in BikeShare Toronto's network, and each station's bike availability (the number of free bikes divided by the total number of bike slots) as it stood live when I requested the data on Saturday Aug 3, 2024. I primarily used bike_availability rather than free bikes or open slots because the total number of slots varies from station to station, so the % availability seemed more appropriate for comparison.
  • See city_bikes.ipynb for more.
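The availability calculation described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the endpoint URL and network id are assumptions about the public CityBikes v2 API, and treating the total number of slots as `free_bikes + empty_slots` is also an assumption.

```python
import json
from urllib.request import urlopen

# Assumed endpoint and network id for Toronto on the CityBikes v2 API.
NETWORK_URL = "http://api.citybik.es/v2/networks/bixi-toronto"


def fetch_stations(url=NETWORK_URL):
    """Download the live station list (requires network access)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)["network"]["stations"]


def bike_availability(station):
    """Free bikes divided by total slots (assumed: free bikes + empty slots)."""
    free = station.get("free_bikes") or 0
    empty = station.get("empty_slots") or 0
    total = free + empty
    # A station reporting no slots at all has no meaningful availability.
    return free / total if total else None


def to_rows(stations):
    """Keep only the fields used later in the analysis."""
    return [
        {
            "name": s.get("name"),
            "latitude": s.get("latitude"),
            "longitude": s.get("longitude"),
            "bike_availability": bike_availability(s),
        }
        for s in stations
    ]
```

Calling `to_rows(fetch_stations())` would give one record per station, ready to load into a dataframe.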

Step 2: Foursquare/Yelp:

  • I used Foursquare's API to obtain various places of interest within a radius of each station. I chose an 800m radius because Toronto has a large number of bike stations and I judged it unlikely that a cyclist would dock any further than 800m from their intended destination. Additionally, Foursquare returns at most 50 POIs per call, so any larger radius in a densely packed city like Toronto would see virtually all of the bike stations return the maximum 50 POIs, which would reduce any power that column has to predict bike_availability. I needed a radius small enough to keep the total number of POIs varied, but large enough to be realistic for a cyclist docking and walking to their final destination.
  • I began by playing with a sample request for one bike station's latitude/longitude and manipulating the data returned by Foursquare. For each bike station I recorded the number of bars/restaurants, parks/outdoor spaces, live venues and cafes, as well as the total number of POIs that fell into these four categories. I ran a number of small sample Foursquare API calls on different bike stations to build up a list of category names for the venues/POIs that Foursquare/Yelp would return.
  • See yelp_foursquare_EDA.ipynb for more.
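The per-station counting described above might look something like this sketch. The category names in `CATEGORY_TO_BUCKET` are hypothetical stand-ins for the list built up from the sample calls, and each POI is assumed to carry a `categories` list of objects with a `name` field.

```python
from collections import Counter

# The four groups recorded for each bike station.
BUCKETS = ("bar_restaurant", "park", "cafe", "live_venue")

# Hypothetical category names; the real list came from sample API calls.
CATEGORY_TO_BUCKET = {
    "Bar": "bar_restaurant",
    "Restaurant": "bar_restaurant",
    "Park": "park",
    "Cafe": "cafe",
    "Coffee Shop": "cafe",
    "Music Venue": "live_venue",
}


def bucket_for(poi):
    """Return the first matching bucket for a POI, or None if none match."""
    for category in poi.get("categories", []):
        bucket = CATEGORY_TO_BUCKET.get(category.get("name"))
        if bucket is not None:
            return bucket
    return None


def count_pois(pois):
    """Count POIs per bucket, plus the total across the four buckets."""
    counts = Counter({bucket: 0 for bucket in BUCKETS})
    for poi in pois:
        bucket = bucket_for(poi)
        if bucket is not None:
            counts[bucket] += 1
    result = dict(counts)
    result["total_pois"] = sum(counts.values())
    return result
```

POIs whose categories are not in the mapping are ignored, so `total_pois` counts only venues in the four groups, as in the description above.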

Step 3: Joining the Data:

  • I joined the City Bikes and Foursquare data by importing the City Bikes dataframe and calling the Foursquare API on the entire bike stations dataset. This means that for each of the 826 stations, Foursquare returned up to 50 POIs within an 800 metre radius in the bar/restaurant, park, cafe and live venue categories. This gave me each bike station's number of nearby bars/restaurants, cafes, parks and live venues alongside its bike availability and latitude/longitude, ready to train the model and perform a linear regression later. I wanted to know if there is a relationship between bike availability and the POIs around the station.

Step 3b:

  • I also saved the individual venue data, such as name, address and venue category, into a SQL .db file, as I did with the City Bikes stations. This created a database holding tens of thousands of venues and hundreds of bike stations, where the two tables can be joined and viewed together across the city of Toronto by latitude/longitude and address.
  • See joining_data.ipynb for more.
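A two-table SQLite layout of the kind described above could be sketched like this. The table and column names are hypothetical, and the sketch assumes each venue row stores the id of the station whose search returned it. The actual database may join on location instead.

```python
import sqlite3


def build_db(stations, venues, path=":memory:"):
    """Create the two tables and load the rows (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE stations (
            station_id TEXT PRIMARY KEY,
            name TEXT,
            latitude REAL,
            longitude REAL,
            bike_availability REAL
        );
        CREATE TABLE venues (
            venue_id INTEGER PRIMARY KEY,
            station_id TEXT REFERENCES stations(station_id),
            name TEXT,
            address TEXT,
            category TEXT
        );
        """
    )
    conn.executemany("INSERT INTO stations VALUES (?, ?, ?, ?, ?)", stations)
    conn.executemany(
        "INSERT INTO venues (station_id, name, address, category) VALUES (?, ?, ?, ?)",
        venues,
    )
    conn.commit()
    return conn


def venues_per_station(conn):
    """Join the tables and count the venues found around each station."""
    query = """
        SELECT s.name, s.bike_availability, COUNT(v.venue_id) AS n_venues
        FROM stations AS s
        LEFT JOIN venues AS v ON v.station_id = s.station_id
        GROUP BY s.station_id
        ORDER BY s.name
    """
    return conn.execute(query).fetchall()
```

The `LEFT JOIN` keeps stations with no venues in range, which appear with a count of zero rather than dropping out of the result.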

Step 4: Building a model:

  • I used the data for some EDA to find correlations between the numeric columns, using Seaborn correlation heatmaps and scatter plots. I identified the strongest correlations, ran Pearson tests to obtain p-values and determine whether the relationships were statistically significant, then fit a model to those features.
  • See model_building.ipynb for more.
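As a rough illustration of what the Pearson test measures, the sketch below computes the correlation coefficient and the t-statistic that underlies its p-value. It uses only the standard library. The notebook itself may well rely on a library routine such as `scipy.stats.pearsonr`, which is an assumption here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t-statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

For example, assuming all 826 stations enter the test, a correlation of -0.31 gives `t_statistic(-0.31, 826)` of roughly -9.4. That is far beyond the value of about 1.96 needed for significance at the 0.05 level, so even a weak correlation is clearly significant at this sample size.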

Results

The strongest correlations found in my model that were not clear cases of collinearity were latitude with the number of POIs (-0.52) and latitude with bike availability (-0.31). These mean that as you travel south in Toronto (latitude decreases), the number of POIs increases, as does bike availability. The first finding is to be expected because Toronto faces south onto Lake Ontario, so the southernmost part of Downtown is more densely packed with POIs. The second finding is of considerable value though, as it suggests it is harder to find an available bike the further north you are in the city. I validated these correlations by finding p-values well below 0.05 for all of the relationships tested, so we can now say that:

  • There is a statistically significant relationship between latitude in Toronto and bike availability at bike stations. The correlation is -0.31, meaning bike availability increases as you move South in Toronto.
  • There is also a statistically significant relationship between latitude and the number of Points of Interest returned by Foursquare. The correlation is -0.52, meaning the number of POIs increases as you move South in Toronto.
  • There is a statistically significant relationship between the number of Points of Interest and bike availability at bike stations. The correlation is 0.17, which is weak.

I ran a simple linear regression of bike availability against latitude to fit a model. The R^2 (R-Squared) came out to 0.097, which is disappointing: latitude alone accounts for only 9.7% of the variance in bike availability. I also ran a simple linear regression of the number of Points of Interest against latitude. The R^2 came out to 0.27, which indicates that latitude alone accounts for only 27% of the variance in the number of POIs. To tie these two together, I ran a multiple linear regression to predict bike availability from the number of POIs and latitude, but its R^2 also came out low at 0.096 - a very poorly fitting model.
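The link between these R^2 values and the correlations reported earlier can be checked with a small sketch. It fits a one-variable least-squares line by hand. The function names are illustrative, not taken from the notebook.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot
```

For a simple regression with one predictor, R^2 equals the square of the Pearson correlation. A correlation of -0.31 therefore implies an R^2 of about 0.096, and an R^2 of 0.27 implies a correlation of about 0.52 in magnitude, which is in line with the figures above. Note also that an R^2 computed on the training data cannot decrease when a predictor is added, so a multiple-regression value below the single-predictor one would suggest an adjusted R^2 or a held-out score. The source does not say which was used.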

What I took from this is that latitude and number of points of interest do have a statistically significant relationship to the bike availability but it is a very weak relationship and other factors are contributing to the bike availability.

Challenges

  • I decided not to run with Yelp because its API leaned heavily towards bars/restaurants rather than the other categories for the same location calls. Matching establishments between Foursquare and Yelp would also have introduced significant complexity, with the only tangible benefit being ratings data for individual establishments/venues. Capturing the sub-category of a venue was also more difficult in Yelp.
  • Both Yelp and Foursquare limit each API call to a maximum of 50 establishments in the returned JSON data. The POI counts could be significantly improved if the limit on venues returned was higher. I made the deliberate call to use a radius of 800m around the bike stations because any higher would lead to virtually all of the downtown stations hitting 50 POIs, with very little insight into which 50 made the cut and which were left out, and very little analytical value could be drawn from that.
  • Avoiding multicollinearity is very difficult here. The counts for individual categories, such as live venues or parks, which may explain bike availability, contribute directly to the total number of POIs, just as the number of free bikes and the number of empty slots contribute directly to bike availability.

Future Goals

  • I would like to run the model on each individual category of POI which would be a simple step forward to make. I would also like to explore other POI categories that may become more appropriate during certain times of the week, or year, or when the weather is a particular way.
  • The city_bike dataset should be gathered at different times of the week and the model re-run to capture trends, e.g. weekday rush hour (commuters), weekday lunchtime (quiet time), late night on the weekend (nightlife), early morning on the weekend (few commutes happening).
  • The Yelp API may need further exploration because it returned different results from the Foursquare API, which was more generalized and caught many more establishments that were not bars or restaurants. Yelp offered far more specific restaurant/bar information. I find it hard to believe this would statistically influence bike_availability if the number of bars/restaurants itself did not, but it would be interesting nonetheless.
  • Could this become a classification problem? If we looked into the bike station availability and the number of POIs for a station, we could categorize them by high traffic or low traffic (or medium) for predictive purposes. For example, where is a high-traffic bike station or high-density bike station (where # POIs >= 50) likely to be found? If given some inputs, is it likely this is a high-traffic or high density bike station?
