In this tutorial, we will pull the COVID-19 Public Data from a Singer tap and put it in TerminusDB/ TerminusX.
You can download the TerminusDB docker image to work locally (recommended to use Bootstrap here) or you can connect to TerminusX. If you are using docker image, make sure that your TerminusDB container is running at localhost (https://127.0.0.1). If you are using TerminusX, get the information of the endpoint, team, and API token ready (it should be accessible in the TerminusX dashboard under profile.)
Clone the COVID-19 Public Data from this GitHub repo
$ git clone [email protected]:singer-io/tap-covid-19.git
go into the tap-covid-19 directory
$ cd tap-covid-19/
It is highly recommended to install different singer.io tap and targets in different python environments. To install the COVID-19 Public Data tap from the setup.py in the repo into a new venv environment.
$ python3 -m venv ~/.virtualenvs/tap-covid-19
$ source ~/.virtualenvs/tap-covid-19/bin/activate
$ python3 -m pip install .
$ deactivate
With similar principal as installing the tap, we install target-terminusdb from PyPI in another venv environment:
$ python3 -m venv ~/.virtualenvs/target-terminusdb
$ source ~/.virtualenvs/target-terminusdb/bin/activate
$ python3 -m pip install target-terminusdb
we don't deactivate the environment this time as we can use the terminusdb command if we are in this environment
Now go to the project directory (or start a new one):
$ cd ../covid_data
In the project directory start a TerminusDB project:
$ terminusdb startproject
You will be prompt with a few questions. Pick a project name and if you are running the localhost server with default port you can just press Enter. You have to provide the endpoint and other login information if you are using TerminusX or otherwise.
This is what I did:
Please enter a project name (this will also be the database name): covid_data
Please enter a endpoint location (press enter to use localhost default) [http://127.0.0.1:6363/]:
config.json and schema.py created, please customize them to start your project.
Create the tap's config file tap_config.json with the example. This tap connects to GitHub with a GitHub OAuth2 Token. This may be a Personal Access Token.
{
"api_token": "YOUR_GITHUB_API_TOKEN",
"start_date": "2021-01-01T00:00:00Z",
"user_agent": "tap-covid-19 <api_user_email@your_company.com>"
}
Put in your api_token, user_agent and start_date.
More about how to use the tap-covid-19 can be found here.
$ ~/.virtualenvs/tap-covid-19/bin/tap-covid-19 -c tap_config.json --discover > catalog.json
To selected the stream that we want to import, we have to mark it in the catalog.json. We select eu_daily stream. In it's metadata, added "selected" : true.
...
"stream": "eu_daily",
"metadata": [
{
"breadcrumb": [],
"metadata": {
"selected" : true,
"table-key-properties": [
"git_path",
"__sdc_row_number"
],
"forced-replication-method": "INCREMENTAL",
"valid-replication-keys": [
"git_last_modified"
],
"inclusion": "available"
}
},
...
]
},
...
$ ~/.virtualenvs/tap-covid-19/bin/tap-covid-19 -c tap_config.json --catalog catalog.json | ~/.virtualenvs/target-terminusdb/bin/target-terminusdb -c config.json >> state.json
Check if the documents are looking good by using the following command to get 10 documents to inspect:
$ terminusdb alldocs -h 10
Now the data is in TerminusDB/ TerminusX, you can query the data and export it as CSV for investigation later by using the terminusdb command. For example, I want to get all the data on a particular date:
$ terminusdb alldocs -t eu_daily -q datetime=2020-04-07T00:00:00Z -e --filename eu_2020-04-07.csv