Example datasets retrieval

This blog post announces and briefly describes the Python package “ExampleDatasets” for obtaining example datasets.

Currently, this package contains only the datasets metadata; the datasets themselves are downloaded on demand from the repository Rdatasets, [VAB1].

This package follows the design of the Raku package “Data::ExampleDatasets”; see [AAr1].


Usage examples

Setup

Here we load the Python packages time and pandas, and this package:

from ExampleDatasets import *
import pandas
import time

Get a dataset by using an identifier

Here we get a dataset by using an identifier and display part of the obtained dataset:

tbl = example_dataset(itemSpec = 'Baumann')
tbl.head()
   Unnamed: 0  group  pretest.1  pretest.2  post.test.1  post.test.2  post.test.3
0           1  Basal          4          3            5            4           41
1           2  Basal          6          5            9            5           41
2           3  Basal          9          4            5            3           43
3           4  Basal         12          6            8            5           46
4           5  Basal         16          5           10            9           46

Here we summarize the dataset obtained above:

tbl.describe()
       Unnamed: 0  pretest.1  pretest.2  post.test.1  post.test.2  post.test.3
count   66.000000  66.000000  66.000000    66.000000    66.000000    66.000000
mean    33.500000   9.787879   5.106061     8.075758     6.712121    44.015152
std     19.196354   3.020520   2.212752     3.393707     2.635644     6.643661
min      1.000000   4.000000   1.000000     1.000000     0.000000    30.000000
25%     17.250000   8.000000   3.250000     5.000000     5.000000    40.000000
50%     33.500000   9.000000   5.000000     8.000000     6.000000    45.000000
75%     49.750000  12.000000   6.000000    11.000000     8.000000    49.000000
max     66.000000  16.000000  13.000000    15.000000    13.000000    57.000000

Remark: The values for the arguments itemSpec and packageSpec correspond to the values of the columns “Item” and “Package”, respectively, in the metadata dataset from the GitHub repository “Rdatasets”, [VAB1]. See the datasets metadata sub-section below.
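To make the remark concrete, here is a sketch of how such a look-up could work: filter the metadata by the “Item” and “Package” columns and take the matching CSV URL. The function resolve_csv_url and the two-row metadata frame below are hypothetical stand-ins, not the package's actual internals:

```python
import pandas

# Toy stand-in for the Rdatasets metadata table; the real one is
# obtained with load_datasets_metadata().
dfMeta = pandas.DataFrame({
    "Package": ["carData", "COUNT"],
    "Item":    ["Baumann", "titanic"],
    "CSV":     ["https://vincentarelbundock.github.io/Rdatasets/csv/carData/Baumann.csv",
                "https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv"],
})

def resolve_csv_url(itemSpec, packageSpec=None):
    """Pick the CSV URL for a given Item (and, optionally, Package)."""
    sel = dfMeta["Item"] == itemSpec
    if packageSpec is not None:
        sel &= dfMeta["Package"] == packageSpec
    matches = dfMeta[sel]
    if len(matches) != 1:
        raise ValueError("Dataset specification is ambiguous or not found.")
    return matches["CSV"].iloc[0]

print(resolve_csv_url("Baumann"))
```

Giving packageSpec disambiguates items that appear in more than one R package.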

Get a dataset by using a URL

Here we find the URLs of datasets whose titles match a regex:

dfMeta = load_datasets_metadata()
print(dfMeta[dfMeta.Title.str.contains('^tita')][["Package", "Item", "CSV"]].to_string())
    Package        Item                                                                      CSV
288   COUNT     titanic     https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv
289   COUNT  titanicgrp  https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanicgrp.csv

Here we get a dataset through pandas by using a URL and display the head of the obtained dataset:

import pandas
url = 'https://raw.githubusercontent.com/antononcube/Raku-Data-Reshapers/main/resources/dfTitanic.csv'
tbl2 = pandas.read_csv(url)
tbl2.head()
   id passengerClass  passengerAge passengerSex passengerSurvival
0   1            1st            30       female          survived
1   2            1st             0         male          survived
2   3            1st             0       female              died
3   4            1st            30         male              died
4   5            1st            20       female              died

Datasets metadata

Here we:

  1. Get the dataset of the datasets metadata
  2. Filter it to contain only datasets with 13 rows
  3. Keep only the columns “Item”, “Title”, “Rows”, and “Cols”
  4. Display the result

tblMeta = load_datasets_metadata()
tblMeta = tblMeta[tblMeta["Rows"] == 13]
tblMeta = tblMeta[["Item", "Title", "Rows", "Cols"]]
tblMeta
            Item                                              Title  Rows  Cols
805   Snow.pumps  John Snow’s Map and Data on the 1854 London Ch…    13     4
820          BCG                                   BCG Vaccine Data    13     7
935       cement                    Heat Evolved by Setting Cements    13     5
1354    kootenay  Waterflow Measurements of Kootenay River in Li…    13     2
1644  Newhouse77  Medical-Care Expenditure: A Cross-National Sur…    13     5
1735      Saxony                                 Families in Saxony    13     2
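The same filter-and-select steps can also be written as one chained pandas expression. Here is a sketch over a small stand-in frame (the real frame comes from load_datasets_metadata()):

```python
import pandas

# Small stand-in for the metadata table returned by load_datasets_metadata().
tblMeta = pandas.DataFrame({
    "Item":  ["BCG", "cement", "Saxony", "mtcars"],
    "Title": ["BCG Vaccine Data", "Heat Evolved by Setting Cements",
              "Families in Saxony", "Motor Trend Car Road Tests"],
    "Rows":  [13, 13, 13, 32],
    "Cols":  [7, 5, 2, 11],
})

# Select the columns of interest, then keep only the 13-row datasets.
result = (tblMeta[["Item", "Title", "Rows", "Cols"]]
          .query("Rows == 13"))
print(result)
```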

Keeping downloaded data

By default the data is obtained over the web from Rdatasets, but example_dataset has an option to keep the data locally. (The data is saved in XDG_DATA_HOME; see [SS1].)
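The caching behavior can be sketched as follows. The function cached_fetch and its fetch argument are hypothetical stand-ins for the package's internals, assuming a simple file look-up under XDG_DATA_HOME:

```python
import os
from pathlib import Path

def cached_fetch(name, fetch, keep=True):
    """Return a dataset's CSV text, saving it under XDG_DATA_HOME on first use.

    fetch is a zero-argument callable that downloads the CSV text;
    it is invoked only on a cache miss."""
    data_home = Path(os.environ.get("XDG_DATA_HOME",
                                    Path.home() / ".local" / "share"))
    cache_file = data_home / "ExampleDatasets" / (name + ".csv")
    if keep and cache_file.exists():
        return cache_file.read_text()   # cache hit: no web request
    text = fetch()                      # cache miss: download
    if keep:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(text)
    return text
```

With keep=True the second call for the same dataset reads the saved file instead of going over the web, which is what the timings below illustrate.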

This can be demonstrated with the following timings of a dataset with ~1300 rows:

import time
startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data first time took " + str( endTime - startTime ) + " seconds")
Getting the data first time took 0.003923892974853516 seconds

import time
startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data second time took " + str( endTime - startTime ) + " seconds")
Getting the data second time took 0.003058910369873047 seconds

References

Functions, packages, repositories

[AAf1] Anton Antonov, ExampleDataset, (2020), Wolfram Function Repository.

[AAr1] Anton Antonov, Data::ExampleDatasets Raku package, (2021), GitHub/antononcube.

[VAB1] Vincent Arel-Bundock, Rdatasets, (2020), GitHub/vincentarelbundock.

[SS1] Scott Stevenson, xdg Python package, (2016-2021), PyPI.org.

Interactive interfaces

[AAi1] Anton Antonov, Example datasets recommender interface, (2021), Shinyapps.io.