This repository contains the implementation of CenAlert, a user-driven system for detecting spikes in circumvention-related Google Trends data. These spikes represent changes in users' experiences or expectations of Internet restrictions such as censorship.
This implementation of CenAlert is designed to run on historical datasets only. We provide the relevant code and data to run CenAlert for 76 censoring countries from January 2011 to December 2024.
CenAlert was developed using Python 3.12. Certain dependencies are not compatible with newer versions of Python.
```
├── auxiliary_data
│   ├── dashboard_data_view.png
│   ├── dashboard_event_view.png
│   ├── spike_end_slack_notification.png
│   ├── spike_start_slack_notification.png
│   └── top_100_impact_events.csv
├── cenalert
│   ├── __init__.py
│   ├── lib
│   ├── run.py
│   ├── select_parameters.py
│   ├── stitch_windows.py
│   └── tune_parameters.py
├── countries.txt
├── events
│   ├── by_country
│   ├── custom
│   ├── scripts
│   └── sources
├── parameters
│   ├── chebyshev_selected
│   └── chebyshev_tuning
├── raw_data
│   ├── sample0
│   ├── sample1
│   ├── sample2
│   ├── sample3
│   └── sample4
├── requirements.txt
├── scripts
│   ├── run.sh
│   ├── select_parameters.sh
│   └── tune_parameters.sh
└── series
    └── 012t0g
```
- `cenalert/` contains the core implementation of CenAlert.
- `countries.txt` contains a list of ISO 3166-1 alpha-2 codes, specifying all of the countries on which to run CenAlert in batch scripts.
- `events/` contains lists of events that CenAlert can use to attribute explanations to detected spikes.
- `parameters/` contains the Pareto fronts produced during parameter tuning and the selected per-country parameter sets for the Z-Score anomaly detection algorithm (also referred to as Chebyshev throughout the code and documentation).
- `raw_data/` contains a small illustrative subset of the raw Google Trends data from which the processed time series were derived.
- `scripts/` contains batch scripts to run the parameter tuning and anomaly detection components of CenAlert across several countries.
- `series/` contains processed, ready-to-use time series data for the "Virtual Private Network" topic (represented by the code `/m/012t0g`) in Google Trends.
```
python3.12 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip setuptools wheel
pip3 install -r requirements.txt
```

Our stitching process is designed to address three core challenges of working with Google Trends data: (1) variability, where identical requests at different times may yield different results; (2) normalization, where values are relative proportions of the time series maximum; and (3) resolution constraints, where fine-grained data is only available for short time ranges.
To illustrate the stitching process, we provide a small subset of our raw Google Trends data (5 downloads of data for Türkiye) in `raw_data/`.
This data can be stitched by running the following script:
```
python3 -m cenalert.stitch_windows --countries countries.txt --data raw_data --output <output_directory>
```

We provide already-stitched (from 45 downloads) time series for 76 censoring countries in `series/`.
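The core idea behind stitching is to undo each window's local normalization by rescaling it so that it agrees with the series built so far on their overlapping dates. The sketch below is a minimal illustration of that idea, not CenAlert's exact algorithm; the helper name and the use of a median ratio over the overlap are assumptions.

```python
import numpy as np

def stitch_windows(windows):
    """Stitch overlapping, independently normalized windows into one
    series. `windows` is a list of (dates, values) pairs sorted by
    start date; consecutive windows overlap in time, and each
    window's values are normalized to that window's own maximum."""
    dates = list(windows[0][0])
    values = [float(v) for v in windows[0][1]]
    for win_dates, win_values in windows[1:]:
        win_values = np.asarray(win_values, dtype=float)
        seen = set(dates)
        # Values on the dates shared with the stitched series so far.
        overlap = [d for d in win_dates if d in seen]
        prev = np.array([values[dates.index(d)] for d in overlap])
        curr = np.array([win_values[list(win_dates).index(d)] for d in overlap])
        # Rescale the new window so the overlap agrees with the series.
        mask = curr > 0
        scale = float(np.median(prev[mask] / curr[mask])) if mask.any() else 1.0
        for d, v in zip(win_dates, win_values):
            if d not in seen:
                dates.append(d)
                values.append(float(v) * scale)
    return dates, values
```

Averaging several downloads of the same window before stitching (CenAlert uses 45) additionally smooths out the sample-to-sample variability.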
CenAlert's anomaly detection requires tuning and selection of parameters.
Parameter tuning for a single country can be run with:
```
python3.12 -u -m cenalert.tune_parameters --series <time_series>.csv --algorithm <chebyshev|median|iforest|lof> --output <output>.pkl
```

`<output>.pkl` will contain the parameter sets comprising the Pareto front.
We also provide a batch script for running parameter tuning across all countries:
```
scripts/tune_parameters.sh <countries_file> <time_series_directory> <chebyshev|median|iforest|lof>
```

Parameter tuning is both non-deterministic (two runs of the script are not guaranteed to test the same sets of parameters) and time-intensive. We therefore provide the Pareto fronts for the Chebyshev (we use "Chebyshev" and "Z-Score" interchangeably) algorithm in `parameters/chebyshev_tuning`. The Pareto fronts are provided as JSON files but must be converted to pickle files with the following command:
```
python3 -c "import sys, json, pickle, numpy as np; j = json.load(open(sys.argv[1])); obj = [tuple(np.array(a) for a in t) for t in j]; pickle.dump(obj, open(sys.argv[2], 'wb'), protocol=pickle.HIGHEST_PROTOCOL)" <input>.json <output>.pkl
```

The Pareto fronts and selected parameters can be visualized by running:
```
python3 -m cenalert.select_parameters --path <pareto_front>.pkl --output <selected_parameters>.pkl
```

We also provide a batch script for selecting the appropriate parameters for all countries:
```
scripts/select_parameters.sh <parameters_directory> <output_directory>
```

The selected parameter sets for the Chebyshev (Z-Score) algorithm, which is the anomaly detection algorithm ultimately used by CenAlert, are already provided in `parameters/chebyshev_selected` as JSON files. These files can be converted to pickle files using:
```
python3 -c "import sys, pickle, json, numpy as np; pickle.dump(tuple(np.float64(x) for x in json.load(open(sys.argv[1]))), open(sys.argv[2],'wb'))" <input>.json <output>.pkl
```

CenAlert can be run for a single country as follows:
```
python3 -m cenalert.run --path <time_series>.csv --algorithm <chebyshev|median|iforest|lof> --parameters <selected_parameters>.pkl --output <output_directory> [--events <events>.csv]
```

For example:
```
python3 -m cenalert.run --path series/012t0g/RU.csv --algorithm chebyshev --parameters parameters/chebyshev_selected/RU.pkl --output . --events events/by_country/RU.csv
```

Events are matched to spikes purely by date, so it is important that the event list only contains events for the relevant country (events split by country are provided in `events/by_country`).
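Date-based matching can be sketched as follows. The event tuples and the `match_event` helper below are hypothetical illustrations; the 6-day explainability window and the sign convention for proximity follow the output descriptions in this README.

```python
from datetime import date

# Hypothetical event records: (start_date, cause, who). In CenAlert's
# event lists these come from the affected_services and source fields.
EVENTS = [
    (date(2021, 3, 10), "Twitter", "OONI"),
    (date(2021, 6, 4), "Telegram", "NetBlocks"),
]

def match_event(spike_start, events, max_days=6):
    """Match a spike to the nearest event by start date (sketch).

    Returns (proximity, cause, who); proximity is negative if the
    spike precedes the event. Spikes more than `max_days` days from
    every event are left unexplained (None)."""
    nearest = min(events, key=lambda e: abs((spike_start - e[0]).days))
    proximity = (spike_start - nearest[0]).days
    if abs(proximity) > max_days:
        return None
    return proximity, nearest[1], nearest[2]
```

Because matching ignores everything but dates, an event list containing another country's events would silently produce spurious matches, which is why per-country lists are required.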
CenAlert generates three files in the output directory:
`annotated.csv` contains the Google Trends time series annotated with several pieces of metadata used during anomaly detection:

- `anomaly` is a boolean denoting whether the value was considered an anomaly.
- `score` is the anomaly score that the selected anomaly detection algorithm assigned to the value, if the sliding window is not sparse.
- `residual` is the difference between the value and the forecast given by Croston's method, if the sliding window is sparse.
- `threshold` is the minimum value corresponding to `min_score` (see below). For performance reasons, particularly with algorithms such as Isolation Forest and Local Outlier Factor, the threshold is only computed for anomalies.
- `min_score` represents the minimum anomaly score required for a value to be considered an anomaly. This field is most useful for the Chebyshev (Z-Score) algorithm, where there are two separate minimum anomaly scores depending on whether the data in the sliding window is normally distributed.
- `cov2` is the squared coefficient of variation of the sliding window. It is used in determining whether the sliding window is sparse.
- `adi` is the average number of data points between non-zero values. It is used in determining whether the sliding window is sparse.
- `demand_pattern` is the categorization of the sliding window into one of four demand patterns (erratic, lumpy, intermittent, or smooth) based on `cov2` and `adi`. A classification of lumpy or intermittent indicates that the window is sparse.
`anomalies.csv` contains all spikes detected by CenAlert. Each spike is represented by the following information:

- `start` is the start date of the spike.
- `end` is the end date of the spike.
- `peak` is the date on which the spike peaks.
- `score` is the anomaly score assigned to the first point of the spike, if the sliding window was not sparse.
- `residual` is the difference between the first point of the spike and the forecast given by Croston's method, if the sliding window was sparse.
- `impact` is the impact factor of the spike. Unlike `score` and `residual`, which are local measures of deviation from the sliding window, `impact` provides a global measure of significance.
- `proximity` is the distance (in days) from the start of the nearest event in the provided events list. A negative value means the spike occurs before the event, while a positive value means it occurs after.
- `cause` is the set of blocked services (for censorship events) or the explanation (for non-censorship events) associated with the nearest event.
- `who` is the organization(s) responsible for reporting the nearest event in the provided events list when the event involves censorship; for non-censorship events, it is recorded as Other.
`explainable.csv` contains all spikes that were matched to an event (i.e., were within 6 days of an event). If no event list was provided to CenAlert, this file will be empty.
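The sparsity check built on `cov2`, `adi`, and `demand_pattern`, together with the Croston fallback used for sparse windows, can be sketched as follows. The cutoffs are the standard Syntetos-Boylan values, the smoothing constant is illustrative, and computing `cov2` over non-zero values only is an assumption; CenAlert's actual details may differ.

```python
import numpy as np

# Standard Syntetos-Boylan cutoffs; CenAlert's thresholds may differ.
ADI_CUTOFF = 1.32
COV2_CUTOFF = 0.49

def demand_pattern(window):
    """Classify a sliding window as smooth, erratic, intermittent, or
    lumpy from cov2 (squared coefficient of variation, here of the
    non-zero values) and adi (average inter-demand interval)."""
    window = np.asarray(window, dtype=float)
    nonzero = window[window > 0]
    if nonzero.size == 0:
        return "intermittent"  # no demand at all: treat as sparse
    cov2 = (nonzero.std() / nonzero.mean()) ** 2
    adi = window.size / nonzero.size
    if adi < ADI_CUTOFF:
        return "smooth" if cov2 < COV2_CUTOFF else "erratic"
    return "intermittent" if cov2 < COV2_CUTOFF else "lumpy"

def is_sparse(window):
    # Lumpy or intermittent windows are treated as sparse.
    return demand_pattern(window) in ("intermittent", "lumpy")

def croston_forecast(window, alpha=0.1):
    """One-step-ahead forecast via Croston's method: smooth the
    non-zero demand sizes and the intervals between them separately,
    then forecast their ratio. The residual reported by CenAlert is
    the observed value minus such a forecast."""
    z_hat = p_hat = None  # smoothed demand size / interval
    q = 1                 # periods since the last non-zero value
    for x in window:
        if x > 0:
            if z_hat is None:
                z_hat, p_hat = float(x), float(q)
            else:
                z_hat += alpha * (x - z_hat)
                p_hat += alpha * (q - p_hat)
            q = 1
        else:
            q += 1
    return 0.0 if z_hat is None else z_hat / p_hat
```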
We also provide a batch script for running CenAlert on several countries:
```
bash scripts/run.sh <countries> <series_directory> <events_directory> <chebyshev|median|iforest|lof> <parameters_directory> <output_directory>
```

We collected event lists from four Internet freedom community organizations. These lists only contain service-blocking events, where certain platforms or protocols were blocked but the Internet remained broadly accessible.
- `events/sources/AccessNow.csv` contains service-blocking events from the #KeepItOn STOP database (from 2016 to 2023), downloaded using `events/scripts/getAccessNowEvents.sh`.
- `events/sources/Pulse.csv` contains service-blocking events from the Internet Society Pulse Shutdowns Tracker (from 2019 to 2024), downloaded using `events/scripts/getPulseEvents.sh`. Due to changes to the Pulse API, this script is now deprecated.
- `events/sources/NetBlocks.csv` contains service-blocking events manually recorded from NetBlocks reports (through 2023).
- `events/sources/OONI.csv` contains service-blocking events manually recorded from OONI reports (through 2024).
We also provide the following event lists:
- `events/custom/Community.csv` contains all events reported by at least one of the above organizations. Reports from multiple organizations are merged into a single event if they begin on the same or consecutive days. Discrepancies in reporting (e.g., a >= 2 day difference in recorded start dates) may cause the same event to be treated as distinct events.
- `events/custom/CenAlert.csv` contains censorship events not included in the community datasets that we manually verified, as well as explanations for spikes beyond censorship. Only events with the source tag CenAlert are considered manually verified. We retroactively assigned events to community organizations in two cases: (1) when the event was designated as a full network shutdown in the community datasets and was therefore filtered out (we only considered service-blocking events), or (2) when the event was reported via other channels (e.g., social media) and verified using observatory data (OONI or NetBlocks). We emphasize that this file is not comprehensive: it does not contain explanations for spikes beyond the 100 largest by impact factor or for those occurring in nine highly restrictive countries (Azerbaijan, Egypt, Ethiopia, Iran, Kazakhstan, Pakistan, Russia, Türkiye, and Venezuela).
- `events/custom/All.csv` contains a concatenation of the previous two lists. The events split by country in `events/by_country` come from this event list.
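The same-or-consecutive-day merge rule for `Community.csv` can be sketched as below. Comparing each report against the earliest start date of the current merged group is an assumption about how chains of consecutive reports are handled; the helper name is hypothetical.

```python
from datetime import date

def merge_events(events):
    """Merge reports that begin on the same or consecutive days into
    one event (sketch). `events` is a list of (start_date, source)
    tuples; reports whose start dates differ by two or more days
    from the group's first report stay distinct."""
    merged = []
    for start, source in sorted(events):
        if merged and (start - merged[-1][0]).days <= 1:
            merged[-1][1].add(source)  # same or consecutive day: merge
        else:
            merged.append((start, {source}))
    return merged
```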
The only fields that are truly mandatory for an event list are `country_code`, `start_date`, `affected_services` (which provides `cause` in `anomalies.csv`), and `source` (which provides `who` in `anomalies.csv`).
We provide multiple utilities for managing event lists.
`merge_event_lists.py` concatenates multiple event lists.
```
python3 events/scripts/merge_event_lists.py --directory <events_directory> --output <output>.csv
```

Since CenAlert expects each event list to correspond to a single country, `split_events_by_country.py` splits aggregate event lists into per-country files stored under `events/by_country`.
```
python3 events/scripts/split_events_by_country.py --events <events>.csv
```

`auxiliary_data/` contains several demonstrations of CenAlert that were excluded from the paper due to space constraints.
- `auxiliary_data/dashboard_data_view.png` and `auxiliary_data/dashboard_event_view.png` show example views of the CenAlert dashboard, which currently displays the stitched time series data, highlights of the spikes detected by CenAlert, and summaries of explainable events.
- `auxiliary_data/spike_start_slack_notification.png` and `auxiliary_data/spike_end_slack_notification.png` show example Slack notifications triggered at the start and end of spikes.
- `auxiliary_data/top_100_impact_events.csv` contains the full set of the 100 highest-impact events investigated in Section 5.1, including the corresponding country codes, start dates, descriptions, and event types (i.e., whether the event was community-known, manually verified, a non-censorship event, or unknown). The relevant sources for these events are available in the event lists provided in `events/`.