Example datasets retrieval

This blog post announces and briefly describes the Python package “ExampleDatasets” for obtaining example datasets.

Currently, this package contains only the datasets metadata; the datasets themselves are downloaded on demand from the repository Rdatasets, [VAB1].

This package follows the design of the Raku package “Data::ExampleDatasets”; see [AAr1].


Usage examples

Setup

Here we load the Python packages time and pandas, and this package:

from ExampleDatasets import *
import pandas
import time

Get a dataset by using an identifier

Here we get a dataset by using an identifier and display part of the obtained dataset:

tbl = example_dataset(itemSpec = 'Baumann')
tbl.head()
   Unnamed: 0  group  pretest.1  pretest.2  post.test.1  post.test.2  post.test.3
0           1  Basal          4          3            5            4           41
1           2  Basal          6          5            9            5           41
2           3  Basal          9          4            5            3           43
3           4  Basal         12          6            8            5           46
4           5  Basal         16          5           10            9           46

Here we summarize the dataset obtained above:

tbl.describe()
       Unnamed: 0  pretest.1  pretest.2  post.test.1  post.test.2  post.test.3
count   66.000000  66.000000  66.000000    66.000000    66.000000    66.000000
mean    33.500000   9.787879   5.106061     8.075758     6.712121    44.015152
std     19.196354   3.020520   2.212752     3.393707     2.635644     6.643661
min      1.000000   4.000000   1.000000     1.000000     0.000000    30.000000
25%     17.250000   8.000000   3.250000     5.000000     5.000000    40.000000
50%     33.500000   9.000000   5.000000     8.000000     6.000000    45.000000
75%     49.750000  12.000000   6.000000    11.000000     8.000000    49.000000
max     66.000000  16.000000  13.000000    15.000000    13.000000    57.000000

Remark: The values for the arguments itemSpec and packageSpec correspond to the values of the columns “Item” and “Package”, respectively, in the metadata dataset from the GitHub repository “Rdatasets”, [VAB1]. See the datasets metadata sub-section below.
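To make the remark concrete, here is a sketch of how such a look-up could work: filter the metadata by the “Item” and “Package” columns and take the matching CSV URL. The function resolve_csv_url and the two-row metadata frame below are hypothetical stand-ins, not the package's actual internals:

```python
import pandas

# Toy stand-in for the Rdatasets metadata table; the real one is
# obtained with load_datasets_metadata().
dfMeta = pandas.DataFrame({
    "Package": ["carData", "COUNT"],
    "Item":    ["Baumann", "titanic"],
    "CSV":     ["https://vincentarelbundock.github.io/Rdatasets/csv/carData/Baumann.csv",
                "https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv"],
})

def resolve_csv_url(itemSpec, packageSpec=None):
    """Pick the CSV URL for a given Item (and, optionally, Package)."""
    sel = dfMeta["Item"] == itemSpec
    if packageSpec is not None:
        sel &= dfMeta["Package"] == packageSpec
    matches = dfMeta[sel]
    if len(matches) != 1:
        raise ValueError("Dataset specification is ambiguous or not found.")
    return matches["CSV"].iloc[0]

print(resolve_csv_url("Baumann"))
```

Giving packageSpec disambiguates items that appear in more than one R package.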

Get a dataset by using a URL

Here we find the URLs of datasets whose titles match a regex:

dfMeta = load_datasets_metadata()
print(dfMeta[dfMeta.Title.str.contains('^tita')][["Package", "Item", "CSV"]].to_string())
    Package        Item                                                                      CSV
288   COUNT     titanic     https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv
289   COUNT  titanicgrp  https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanicgrp.csv

Here we get a dataset through pandas by using a URL and display the head of the obtained dataset:

import pandas
url = 'https://raw.githubusercontent.com/antononcube/Raku-Data-Reshapers/main/resources/dfTitanic.csv'
tbl2 = pandas.read_csv(url)
tbl2.head()
   id passengerClass  passengerAge passengerSex passengerSurvival
0   1            1st            30       female          survived
1   2            1st             0         male          survived
2   3            1st             0       female              died
3   4            1st            30         male              died
4   5            1st            20       female              died

Datasets metadata

Here we:

  1. Get the dataset of the datasets metadata
  2. Filter it to contain only datasets with 13 rows
  3. Keep only the columns “Item”, “Title”, “Rows”, and “Cols”
  4. Display the result

tblMeta = load_datasets_metadata()
tblMeta = tblMeta[tblMeta["Rows"] == 13]
tblMeta = tblMeta[["Item", "Title", "Rows", "Cols"]]
tblMeta
            Item                                              Title  Rows  Cols
805   Snow.pumps  John Snow’s Map and Data on the 1854 London Ch…    13     4
820          BCG                                   BCG Vaccine Data    13     7
935       cement                    Heat Evolved by Setting Cements    13     5
1354    kootenay  Waterflow Measurements of Kootenay River in Li…    13     2
1644  Newhouse77  Medical-Care Expenditure: A Cross-National Sur…    13     5
1735      Saxony                                 Families in Saxony    13     2
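The same filter-and-select steps can also be written as one chained pandas expression. Here is a sketch over a small stand-in frame (the real frame comes from load_datasets_metadata()):

```python
import pandas

# Small stand-in for the metadata table returned by load_datasets_metadata().
tblMeta = pandas.DataFrame({
    "Item":  ["BCG", "cement", "Saxony", "mtcars"],
    "Title": ["BCG Vaccine Data", "Heat Evolved by Setting Cements",
              "Families in Saxony", "Motor Trend Car Road Tests"],
    "Rows":  [13, 13, 13, 32],
    "Cols":  [7, 5, 2, 11],
})

# Select the columns of interest, then keep only the 13-row datasets.
result = (tblMeta[["Item", "Title", "Rows", "Cols"]]
          .query("Rows == 13"))
print(result)
```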

Keeping downloaded data

By default the data is obtained over the web from Rdatasets, but example_dataset has an option to keep the data locally. (The data is saved in XDG_DATA_HOME; see [SS1].)
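The caching behavior can be sketched as follows. The function cached_fetch and its fetch argument are hypothetical stand-ins for the package's internals, assuming a simple file look-up under XDG_DATA_HOME:

```python
import os
from pathlib import Path

def cached_fetch(name, fetch, keep=True):
    """Return a dataset's CSV text, saving it under XDG_DATA_HOME on first use.

    fetch is a zero-argument callable that downloads the CSV text;
    it is invoked only on a cache miss."""
    data_home = Path(os.environ.get("XDG_DATA_HOME",
                                    Path.home() / ".local" / "share"))
    cache_file = data_home / "ExampleDatasets" / (name + ".csv")
    if keep and cache_file.exists():
        return cache_file.read_text()   # cache hit: no web request
    text = fetch()                      # cache miss: download
    if keep:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(text)
    return text
```

With keep=True the second call for the same dataset reads the saved file instead of going over the web, which is what the timings below illustrate.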

This can be demonstrated with the following timings of a dataset with ~1300 rows:

import time
startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data first time took " + str( endTime - startTime ) + " seconds")
Getting the data first time took 0.003923892974853516 seconds

import time
startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data second time took " + str( endTime - startTime ) + " seconds")
Getting the data second time took 0.003058910369873047 seconds

References

Functions, packages, repositories

[AAf1] Anton Antonov, ExampleDataset, (2020), Wolfram Function Repository.

[AAr1] Anton Antonov, Data::ExampleDatasets Raku package, (2021), GitHub/antononcube.

[VAB1] Vincent Arel-Bundock, Rdatasets, (2020), GitHub/vincentarelbundock.

[SS1] Scott Stevenson, xdg Python package, (2016-2021), PyPI.org.

Interactive interfaces

[AAi1] Anton Antonov, Example datasets recommender interface, (2021), Shinyapps.io.