This is the fifth major iteration of the OONI Data Pipeline.
For historical context, these are the major revisions:
- **v0** - The "pipeline" was basically just writing the raw JSON files into a public www directory. Used until ~2013.
- **v1** - OONI Pipeline based on custom CLI scripts using MongoDB as a backend. Used until ~2015.
- **v2** - OONI Pipeline based on luigi. Used until ~2017.
- **v3** - OONI Pipeline based on airflow. Used until ~2020.
- **v4** - OONI Pipeline based on custom scripts and systemd units (aka fastpath). Currently in use in production.
- **v5** - Next generation OONI Pipeline. What this README is relevant to. Expected to be in production by Q4 2024.
In order to run the pipeline you should set up the following dependencies and clone the repository:

```
git clone https://github.com/ooni/data
```
Start the ClickHouse server:

```
mkdir -p _clickhouse-data
cd _clickhouse-data
clickhouse server
```
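Before running any workflows, you may want a quick sanity check that the server is reachable. This is an optional step, assuming the server is listening on the default localhost port:

```
clickhouse client --query "SELECT version()"
```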
Workflows are started by first scheduling them and then triggering a backfill operation on them. Once scheduled, they will also run on a daily basis.
You can then trigger the run operation like so:

```
hatch run oonipipeline run --create-tables --probe-cc US --test-name signal --workflow-name observations --start-at 2024-01-01 --end-at 2024-02-01
```
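Since the command above was invoked with `--create-tables`, the observation tables should exist in ClickHouse once it completes. One way to inspect what was created (the exact table names are determined by the pipeline, not by this README):

```
clickhouse client --query "SHOW TABLES"
```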
In production it's recommended that you trigger the workflows using Airflow.
You can find the Airflow DAGs in the `dags` folder at the root of this repo.