aigatdula/datadoubleconfirm

 
 


Data is nutrients for the soul

Provides simple datasets for data visualization, statistical analysis and modelling
Suitable for those starting out in data science and of course all who find the datasets useful
Data visualizations can be found here
Tutorials/ exercises can be found here

List of datasets along with descriptions

Dataset: akcdogs.csv
Description: Cleaned data on dog breeds scraped from akc.org (as at 17 Jan 2018)
Variables: Breed, Trait1, Trait2, Trait3, Energy level, Size, Rank, Good with Children, Good with other Dogs, Shedding, Grooming, Trainability, Height, Weight, Life expectancy, Barking level, Group
Mode of data collection: Web scraping
Source: American Kennel Club

Dataset: bookdepo.csv
Description: Raw data on bestsellers scraped from bookdepository.com (as at 11 Jan 2018)
Variables: (blank) (row index number), name (book title), material (book material), author (author), rank (bestsellers rank), maincat (main category), subcat (sub category), rating (rating by readers), ratingcount (number of readers who gave ratings), saleprice (discounted price in S$), listprice (original price in S$), numofpages (number of pages), datepub (date published), isbn13 (ISBN13 number)
Mode of data collection: Web scraping
Source: Book Depository

Dataset: bookdepobest.csv
Description: Cleaned data on bestsellers scraped from bookdepository.com (as at 11 Jan 2018)
Variables: SN, name, rank, maincat, subcat, rating, saleprice, listprice, datepub, isbn13, GoodreadsRateCount, BookMaterial, Author(s), PageCount
Mode of data collection: Web scraping
Source: Book Depository

Dataset: Class1.csv
Description: Hypothetical dataset consisting of the scores of 100 students on three tests
Variables: id, gender, test1, test2, test3
Mode of data collection: N.A.
Source: N.A.

Dataset: Class2.csv
Description: Hypothetical dataset consisting of the scores of 100 students on four tests
Variables: id, gender, test1, test2, test3, test4
Mode of data collection: N.A.
Source: N.A.
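
A dataset with the same columns as Class1.csv and Class2.csv can be generated with a short script. The following is a minimal sketch rather than the code used for this repository: the function names, the "M"/"F" gender coding and the score distribution (normal with mean 65 and standard deviation 12, clipped to 0 to 100) are all assumptions made for illustration.

```python
import csv
import io
import random


def make_class_dataset(n_students=100, n_tests=3, seed=0):
    """Return a list of dict rows with id, gender, and test1..testN scores."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    rows = []
    for student_id in range(1, n_students + 1):
        # Assumed gender coding; the real files may use different labels.
        row = {"id": student_id, "gender": rng.choice(["M", "F"])}
        for t in range(1, n_tests + 1):
            # Assumed score model: normal(65, 12), rounded and clipped to 0..100.
            score = round(rng.gauss(65, 12))
            row[f"test{t}"] = max(0, min(100, score))
        rows.append(row)
    return rows


def to_csv_text(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


class1 = make_class_dataset(n_tests=3)          # columns as in Class1.csv
class2 = make_class_dataset(n_tests=4, seed=1)  # columns as in Class2.csv
print(to_csv_text(class1).splitlines()[0])      # header row
```

Fixing the seed keeps the generated scores identical between runs, which is convenient when the file is used for exercises with expected answers.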

Dataset: FreqWordsObama.csv
Description: Frequently mentioned words in Barack Obama's tweets between 2007 and 2017 (as at 12 Dec 2017)
Variables: Year (year of tweet), Word (frequently mentioned word), Count (number of tweets containing word), Year Volume (total number of tweets in the year), Percentage (percentage of tweets containing word)
Mode of data collection: Twitter web scraping and text mining
Source: Barack Obama's Twitter account
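
Going by the variable descriptions, Percentage in FreqWordsObama.csv is presumably Count divided by Year Volume. The sketch below shows that relationship under this assumption; the function name is hypothetical, and whether the file stores the value as a percentage (0 to 100) or a fraction (0 to 1) should be checked against the data.

```python
def percentage_of_tweets(count, year_volume):
    """Share of a year's tweets that contain a word, in percent."""
    if year_volume <= 0:
        raise ValueError("year_volume must be positive")
    return 100.0 * count / year_volume


# Made-up numbers for illustration only; not taken from FreqWordsObama.csv.
print(percentage_of_tweets(count=50, year_volume=400))  # 12.5
```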

Dataset: GovSG.csv
Description: Addresses with GIS location and contact information of Ministries and Statutory Boards in Singapore
Variables: Organisation, Type (Ministry/ Statutory Board), Zipcode, Latitude, Longitude, Website, Tel, Fax, Email, Enquiry/ Feedback Form (url), Parent Ministry (Statutory Boards under respective Ministries)
Mode of data collection: Web scraping, Manual, Tableau-generated latitude/longitude based on Zipcode
Source: Singapore Government Directory, The Public Service | Careers@Gov

Dataset: mrtfaretime.csv
Description: Travel time and fare information between train (MRT/LRT) stations in Singapore (as at Oct 2018)
Variables: Station_start (Boarding station), Station_end (Alighting station), Time (Travel time in mins), Adult (Adult fare), Senior (Fare for Seniors and Persons with Disabilities), Standard (Fare for Standard Ticket), Student (Student fare), WTCS (Fare under Workfare Transport Concession Scheme), REF_STNSTART, Latitude_Start, Longitude_Start, REF_STNEND, Latitude_End, Longitude_End
Mode of data collection: Web scraping
Source: TransitLink Electronic Guide

Dataset: mrtsg.csv
Description: Latitude and longitude of train (MRT/LRT) stations in Singapore (as at Jun 2017)
Variables: OBJECTID (id), STN_NAME (station name), STN_NO (station number), X (X coord in SVY21 format), Y (Y coord in SVY21 format), Latitude, Longitude, COLOR (color of train line)
Mode of data collection: Public dataset, Coordinate conversion web scraping
Source: LTA DataMall, OneMap Singapore

Dataset: passport.csv
Description: Top 10 Passports (in the 2017 Global Passport Power Rank) and their visa requirements to other countries
Variables: Top 10 Country (name of country with passport in Top 10), Country (visiting country), Type of Visa (visa required)
Mode of data collection: Manual
Source: Passport Index 2017

Dataset: pokemon.csv
Description: Pokemon and their attack and defense statistics
Variables: HP, Attack, Defense, Sp Atk, Sp Def, Speed, Total, HP Percentile, Attack Percentile, Defense Percentile, Sp Atk Percentile, Sp Def Percentile, Speed Percentile, Total Percentile
Mode of data collection: Manual, Excel function for percentile ranking
Source: Pokemon database
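
The percentile columns in pokemon.csv were produced with an Excel function, but the README does not say which one. As a rough sketch, the code below assumes one common definition of percentile rank: the number of values strictly below a value, divided by n - 1. The function name and the sample numbers are hypothetical, and results may differ from the file if a different Excel function or rounding was used.

```python
import bisect


def percentile_rank(values):
    """Rank each value as (count of strictly smaller values) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    # bisect_left returns how many entries in `ordered` are strictly below v.
    return [bisect.bisect_left(ordered, v) / (n - 1) for v in values]


# Made-up stats for illustration only; not taken from pokemon.csv.
attack = [40, 55, 70, 45, 60]
print(percentile_rank(attack))  # [0.0, 0.5, 1.0, 0.25, 0.75]
```

With this definition the smallest value always ranks 0 and the largest 1, and tied values share the same rank.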

Dataset: primaryschoolsg.csv
Description: Locations of Primary schools in Singapore and their popularity
Variables: Name, Type, GenderMix, Area, Zone, PostalCode, Latitude, Longitude, PlacestakenuptillPhase2B
Mode of data collection: Public data, Web scraping, Tableau-generated latitude/longitude based on Postal code
Source: Wikipedia, School Information Service - MOE, Salary.sg

Dataset: secsch_cleaned.csv
Description: Locations of Secondary schools in Singapore and their 2017 PSLE cut-off scores
Variables: row (row number), SCHNAME (name of school in upper case), zipcode, area, zone, type, latitude, longitude, School (name of school for PSLE cut-off scores as matching key), Rank2017, IB, IP, SAP, Girls, Boys, Co-ed, O-level track (PSLE cut-off score for O-level track), PSLE2017 Cut Off, Gender Mix
Mode of data collection: Web scraping, Tableau-generated latitude/longitude based on zipcode, Manual
Source: School Information Service - MOE, Salary.sg

Dataset: tfresults02.csv
Description: Results of 100m and 200m national track-and-field finals for "A" division boys and girls between 2002 and 2016
Variables: Year, Event, Division, Gender, School, Name, Position, Timing (in s)
Mode of data collection: Manual
Source: Singapore Athletics LIVE Results

List of notebooks along with descriptions

Notebook: AKCDogs.ipynb
Description: Python code for scraping akc.org dog information

Notebook: bookdepository.ipynb
Description: Python code for scraping bookdepository.com bestsellers information

Notebook: imdb.ipynb
Description: Python code for scraping imdb.com most popular movies information

Notebook: Creating Datasets in Python.ipynb
Description: Python code for importing practice datasets from R to Python and creating hypothetical datasets

Notebook: Creating Datasets in R.ipynb
Description: R code for creating hypothetical datasets

Notebook: Data Cleaning in Python.ipynb
Description: Python code for performing various data cleaning tasks

Notebook: Data Cleaning in R.ipynb
Description: R code for performing various data cleaning tasks

Notebook: seleniummrt.ipynb
Description: Python code for scraping time and fare information between train stations in Singapore from TransitLink Electronic Guide

Notebook: Statistical tests.ipynb
Description: R code for performing various types of two-sample tests and correlation checks
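
The notebook itself is written in R and the README does not list which tests it covers. For readers working in Python, here is a minimal standard-library sketch of two typical calculations of this kind: Welch's t statistic for two independent samples and the Pearson correlation coefficient. The function names and sample scores are hypothetical, and the sketch computes the statistics only, not p-values.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))  # standard error of the mean difference
    return (statistics.mean(a) - statistics.mean(b)) / se


def pearson_r(x, y):
    """Pearson correlation coefficient between paired samples x and y."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


# Made-up scores for illustration only.
group_a = [60, 65, 70, 75, 80]
group_b = [55, 60, 65, 70, 75]
print(welch_t(group_a, group_b))
print(pearson_r(group_a, group_b))
```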

Notebook: StatutoryBoardSG.ipynb
Description: Python code for scraping addresses/ contact information of statutory boards in Singapore from Singapore Government Directory and automating download of organisation logo images

Notebook: WiDS.ipynb
Description: R code for predicting gender of survey respondents as part of the WiDS 2018 Datathon

About

Simple datasets and notebooks for data visualization, statistical analysis and modelling http://projectosyo.wix.com/datadoubleconfirm
