Repository of the ADA project Twinhoods by team ADA-Orion
Abstract
Data description
Feasibility and Risks
Deliverables
Timeplan
Scraped places data from one of, or a combination of, the following APIs:
- We want to use the Factual API to gather information on the locations of restaurants, cafés, theatres, etc. These will be our main descriptors for an area.
- If we decide to use user recommendations of restaurants etc., we have considered Foursquare as a source.
- As a backup plan we have the Google Places API, which does much the same as the Factual API. Because of Google's strict usage regulations, we will try to base our solution on the Factual API.
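As a sketch of how we might query Factual for geolocated venues around a point — the parameter names (`KEY`, `geo`, `limit`) and the `$circle` filter format are assumptions based on our reading of Factual's v3 REST API, not verified behavior:

```python
import json

# Assumed endpoint of Factual's v3 Places table.
FACTUAL_PLACES_URL = "https://api.v3.factual.com/t/places"

def build_query(api_key, lat, lon, radius_m=1000, limit=50):
    """Assemble query parameters for venues around (lat, lon).

    Factual appears to cap results at 50 per request, so `limit`
    is clamped to that maximum.
    """
    geo_filter = {"$circle": {"$center": [lat, lon], "$meters": radius_m}}
    return {
        "KEY": api_key,
        "geo": json.dumps(geo_filter),
        "limit": min(limit, 50),
    }

# The actual request would then be something like:
#   requests.get(FACTUAL_PLACES_URL, params=build_query(key, 46.52, 6.63))
```

Keeping the parameter assembly in a pure function makes it easy to test without spending any of the request quota.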
- It should be noted that common API restrictions and limits on the number of requests within a certain timeframe may affect, or at least slow down, the data acquisition process.
- For Google's APIs we have already found that there are quite strict regulations on how the data may be used. This might oblige us to divert data gathering to secondary sources such as the other APIs mentioned in the 'Data description' chapter above.
- Further, there is some uncertainty about our ability to extract different tags or categories for places or venues from the provided data, which might make it harder to properly cluster the scraped data into aggregated descriptors. We have already devised some ideas to address this challenge, but a functional prototype is still pending.
- The most promising dataset we have found is the one by Factual. From our tests we seem to be able to extract the data we need: geolocated, categorized venue information. Query results are capped at a maximum of 50 results, but we believe that by structuring the querying/scraping we can work around this limit.
- One data processing risk relates to combining multiple data sources into one dataset. It might be challenging in terms of general logistics as well as processing power (the datasets might be larger than what we can easily manipulate on one machine).
- Creating meaningful metrics to accurately describe and compare neighborhoods might prove challenging, and there is no real baseline model for verification. We can only validate the results ourselves, based on our own knowledge of the neighborhoods.
- We hope to obtain a dataset large enough to yield valuable insight, yet small enough to manipulate on a single machine.
- Our visualization is going to include maps, and we have not yet found a good overlay for showing neighborhood boundaries in Switzerland. We hope to use data from OpenStreetMap for this, but may have to find other resources.
- We have not yet decided how to visualize the connection between two neighborhoods, but one suggestion is to let the user select a neighborhood they like and then show some kind of heatmap over the compared city.
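One way we might work around the 50-results-per-query cap mentioned above is to tile a city's bounding box into small cells and issue one query per cell. A minimal sketch, where the default step size is an untested tuning assumption:

```python
def tile_bbox(min_lat, min_lon, max_lat, max_lon, step=0.01):
    """Split a bounding box into small tiles so that each per-tile
    query is likely to stay under the 50-result cap.

    Returns a list of (min_lat, min_lon, max_lat, max_lon) tuples.
    """
    tiles = []
    lat = min_lat
    while lat < max_lat:
        lon = min_lon
        while lon < max_lon:
            # Clamp tile edges so we never query outside the city bbox.
            tiles.append((lat, lon,
                          min(lat + step, max_lat),
                          min(lon + step, max_lon)))
            lon += step
        lat += step
    return tiles
```

A tile that still returns exactly 50 results could be recursively subdivided, since that count suggests the query was truncated.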
- Scraped places data
- Processed neighborhood descriptors
- Visualized neighborhood mappings
- Accessing APIs
- Parsing data
- Storing data for future use
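The 'storing data' step could be as simple as caching parsed venue records in a local SQLite file, so that repeated runs do not re-spend the API request quota. The schema and the field names (`factual_id`, `latitude`, `longitude`) are our own assumptions about the record shape:

```python
import json
import sqlite3

def store_venues(db_path, venues):
    """Cache venue dicts in a local SQLite file.

    Each record keeps the fields we query on (id, name, coordinates)
    plus the full raw JSON for later reprocessing.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS venues ("
        "id TEXT PRIMARY KEY, name TEXT, lat REAL, lon REAL, raw TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO venues VALUES (?, ?, ?, ?, ?)",
        [(v["factual_id"], v.get("name"), v.get("latitude"),
          v.get("longitude"), json.dumps(v)) for v in venues],
    )
    conn.commit()
    conn.close()
```

`INSERT OR REPLACE` makes the scraper idempotent: re-running a query over the same area just refreshes the cached rows instead of duplicating them.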
- Processing of data to extract descriptive neighborhood characteristics
- Find representative descriptors
- Extract descriptors
- Process data
- Visualization of aggregated data
- Heatmaps
- Aggregation of weighted variables
- Mapping of one city to another city
- Display findings in HTML
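For the aggregation of weighted variables and the city-to-city mapping, a minimal sketch could turn each neighborhood's venue categories into a weighted count vector and compare neighborhoods by cosine similarity. The category weights below are illustrative placeholders, not values we have chosen:

```python
import math
from collections import Counter

# Placeholder weights; the real weighting scheme is still to be decided.
CATEGORY_WEIGHTS = {"restaurant": 1.0, "cafe": 1.0, "theatre": 1.5, "bar": 0.8}

def describe(venue_categories):
    """Turn a list of venue categories into a weighted descriptor vector."""
    counts = Counter(venue_categories)
    return {c: counts[c] * CATEGORY_WEIGHTS.get(c, 1.0) for c in counts}

def similarity(desc_a, desc_b):
    """Cosine similarity between two neighborhood descriptors, in [0, 1]."""
    keys = set(desc_a) | set(desc_b)
    dot = sum(desc_a.get(k, 0.0) * desc_b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in desc_a.values()))
    norm_b = math.sqrt(sum(v * v for v in desc_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Computing `similarity` between one selected neighborhood and every neighborhood of the other city would directly yield the values for the proposed heatmap overlay.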
- Presentation and report of the project
- The Flickr dataset contains only 350k Swiss geolocated pictures, and they mostly show mountains/places/people, not food.
- By default, both Instagram and Facebook strip the EXIF geolocation from the images they store, so this database might prove quite useless for our application (but maybe there is some other way of obtaining picture geolocation from Instagram? Maybe it is directly in the dataset's schema?).
