This page lists and describes all the environment variables used by the application.
Define them in a .env file in the root directory.

| Name | Description | Example |
|---|---|---|
| GOOGLE_API_KEY | The API key required for accessing the Google Custom Search API. | abc123 |
| GOOGLE_CSE_ID | The CSE ID required for accessing the Google Custom Search API. | abc123 |
| POSTGRES_USER | The username for the test database. | test_source_collector_user |
| POSTGRES_PASSWORD | The password for the test database. | HanviliciousHamiltonHilltops |
| POSTGRES_DB | The database name for the test database. | source_collector_test_db |
| POSTGRES_HOST | The host for the test database. | 127.0.0.1 |
| POSTGRES_PORT | The port for the test database. | 5432 |
| DS_APP_SECRET_KEY | The secret key used for decoding JWT tokens produced by the Data Sources App. Must match the secret token JWT_SECRET_KEY that is used in the Data Sources App for encoding. | abc123 |
| DEV | Set to any value to run the application in development mode. | true |
| DEEPSEEK_API_KEY | The API key required for accessing the DeepSeek API. | abc123 |
| OPENAI_API_KEY | The API key required for accessing the OpenAI API. | abc123 |
| PDAP_EMAIL | An email address for accessing the PDAP API.[^1] | [email protected] |
| PDAP_PASSWORD | A password for accessing the PDAP API.[^1] | abc123 |
| PDAP_API_KEY | An API key for accessing the PDAP API. | abc123 |
| PDAP_API_URL | The URL for the PDAP API. | https://data-sources-v2.pdap.dev/api |
| DISCORD_WEBHOOK_URL | The URL for the Discord webhook used for notifications. | abc123 |
| HUGGINGFACE_INFERENCE_API_KEY | The API key required for accessing the Hugging Face Inference API. | abc123 |
| HUGGINGFACE_HUB_TOKEN | The API key required for uploading to the PDAP Hugging Face account via the Hugging Face Hub API. | abc123 |
| INTERNET_ARCHIVE_S3_KEYS | Keys used for saving a URL to the Internet Archive. | abc123:gpb0dk |


[^1]: The user account in question requires elevated permissions to access certain endpoints. At a minimum, the user requires the source_collector and db_write permissions.

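
A startup check can catch a missing variable before the application fails partway through a run. The following is a minimal sketch, not the application's own code: the helper name `missing_required` is hypothetical, and treating the five POSTGRES_* variables as strictly required is an assumption made for illustration.

```python
import os
from typing import Mapping

# Names taken from the table above. Which variables the application treats
# as strictly required is an assumption of this sketch.
REQUIRED_VARS = (
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DB",
    "POSTGRES_HOST",
    "POSTGRES_PORT",
)


def missing_required(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_required()` with no arguments inspects the current process environment; passing a plain dictionary makes the check easy to test.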
The following environment variables have default values that will be used if not otherwise defined.

| Variable | Description | Default |
|---|---|---|
| URL_TASKS_FREQUENCY_MINUTES | The frequency for the RUN_URL_TASKS Scheduled Task, in minutes. | 60 |

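
Reading a variable with a fallback might look like the sketch below. The function name is hypothetical, and both the handling of blank values and the rejection of non-positive numbers are assumptions rather than documented behavior.

```python
import os
from typing import Mapping

DEFAULT_URL_TASKS_FREQUENCY_MINUTES = 60  # default listed in the table above


def url_tasks_frequency_minutes(env: Mapping[str, str] = os.environ) -> int:
    """Return the configured frequency in minutes, or the default if unset."""
    raw = env.get("URL_TASKS_FREQUENCY_MINUTES")
    if raw is None or raw.strip() == "":
        return DEFAULT_URL_TASKS_FREQUENCY_MINUTES
    minutes = int(raw)  # raises ValueError on non-numeric input
    if minutes <= 0:
        # Assumption: a zero or negative frequency is never meaningful.
        raise ValueError("URL_TASKS_FREQUENCY_MINUTES must be a positive integer")
    return minutes
```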
Flags are used to enable/disable certain features. They are set to 1 to enable the feature and 0 to disable the feature. By default, all flags are enabled.
Configuration flags are used to enable/disable certain configurations.

| Flag | Description |
|---|---|
| POST_TO_DISCORD_FLAG | Enables posting errors to Discord. |
| PROGRESS_BAR_FLAG | Enables progress bars on some tasks. |

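
The flag convention described above (1 enables, 0 disables, unset means enabled) can be sketched as a small helper. The name `flag_enabled` is hypothetical, and raising an error for any value other than 0 or 1 is an assumption; the application may be more lenient.

```python
import os
from typing import Mapping


def flag_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Return True if the flag is unset or set to 1, False if set to 0."""
    raw = env.get(name)
    if raw is None:
        return True  # all flags are enabled by default
    value = raw.strip()
    if value == "1":
        return True
    if value == "0":
        return False
    # Assumption: anything other than 0 or 1 is treated as a config error.
    raise ValueError(f"{name} must be 0 or 1, got {raw!r}")
```

For example, `flag_enabled("POST_TO_DISCORD_FLAG", {"POST_TO_DISCORD_FLAG": "0"})` returns `False`, while an empty environment yields `True` for every flag.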
Task flags are used to enable/disable certain tasks.
Note that some tasks/subtasks are themselves enabled by other tasks.

| Flag | Description |
|---|---|
| SCHEDULED_TASKS_FLAG | Controls all scheduled tasks. Disabling this flag disables all other scheduled tasks. |
| PUSH_TO_HUGGING_FACE_TASK_FLAG | Pushes data to Hugging Face. |
| POPULATE_BACKLOG_SNAPSHOT_TASK_FLAG | Populates the backlog snapshot. |
| DELETE_OLD_LOGS_TASK_FLAG | Deletes old logs. |
| RUN_URL_TASKS_TASK_FLAG | Runs URL tasks. |
| IA_PROBE_TASK_FLAG | Extracts and links Internet Archive metadata to URLs. |
| IA_SAVE_TASK_FLAG | Saves URLs to the Internet Archive. |
| MARK_TASK_NEVER_COMPLETED_TASK_FLAG | Marks tasks that were started but never completed (usually due to a restart). |
| DELETE_STALE_SCREENSHOTS_TASK_FLAG | Deletes stale screenshots for URLs already validated. |
| TASK_CLEANUP_TASK_FLAG | Cleans up tasks that are no longer needed. |
| REFRESH_MATERIALIZED_VIEWS_TASK_FLAG | Refreshes materialized views. |
| UPDATE_URL_STATUS_TASK_FLAG | Updates the status of URLs. |
| DS_APP_SYNC_AGENCY_ADD_TASK_FLAG | Adds new agencies to the Data Sources App. |
| DS_APP_SYNC_AGENCY_UPDATE_TASK_FLAG | Updates existing agencies in the Data Sources App. |
| DS_APP_SYNC_AGENCY_DELETE_TASK_FLAG | Deletes agencies in the Data Sources App. |
| DS_APP_SYNC_DATA_SOURCE_ADD_TASK_FLAG | Adds new data sources to the Data Sources App. |
| DS_APP_SYNC_DATA_SOURCE_UPDATE_TASK_FLAG | Updates existing data sources in the Data Sources App. |
| DS_APP_SYNC_DATA_SOURCE_DELETE_TASK_FLAG | Deletes data sources in the Data Sources App. |
| DS_APP_SYNC_META_URL_ADD_TASK_FLAG | Adds new meta URLs to the Data Sources App. |
| DS_APP_SYNC_META_URL_UPDATE_TASK_FLAG | Updates existing meta URLs in the Data Sources App. |
| DS_APP_SYNC_META_URL_DELETE_TASK_FLAG | Deletes meta URLs in the Data Sources App. |
| DS_APP_SYNC_USER_FOLLOWS_GET_TASK_FLAG | Gets user follows from the Data Sources App. |
| INTEGRITY_MONITOR_TASK_FLAG | Runs integrity checks. |

URL Task Flags are collectively controlled by the RUN_URL_TASKS_TASK_FLAG flag.

| Flag | Description |
|---|---|
| URL_HTML_TASK_FLAG | URL HTML scraping task. |
| URL_RECORD_TYPE_TASK_FLAG | Automatically assigns Record Types to URLs. |
| URL_AGENCY_IDENTIFICATION_TASK_FLAG | Automatically assigns and suggests Agencies for URLs. |
| URL_MISC_METADATA_TASK_FLAG | Adds misc metadata to URLs. |
| URL_AUTO_RELEVANCE_TASK_FLAG | Automatically assigns Relevances to URLs. |
| URL_PROBE_TASK_FLAG | Probes URLs for web metadata. |
| URL_ROOT_URL_TASK_FLAG | Extracts and links Root URLs to URLs. |
| URL_SCREENSHOT_TASK_FLAG | Takes screenshots of URLs. |
| URL_AUTO_VALIDATE_TASK_FLAG | Automatically validates URLs. |
| URL_AUTO_NAME_TASK_FLAG | Automatically names URLs. |
| URL_SUSPEND_TASK_FLAG | Suspends URLs meeting suspension criteria. |

Agency ID Subtasks are collectively disabled by the URL_AGENCY_IDENTIFICATION_TASK_FLAG flag.

| Flag | Description |
|---|---|
| AGENCY_ID_HOMEPAGE_MATCH_FLAG | Enables the homepage match subtask for agency identification. |
| AGENCY_ID_NLP_LOCATION_MATCH_FLAG | Enables the NLP location match subtask for agency identification. |
| AGENCY_ID_CKAN_FLAG | Enables the CKAN subtask for agency identification. |
| AGENCY_ID_MUCKROCK_FLAG | Enables the MuckRock subtask for agency identification. |
| AGENCY_ID_BATCH_LINK_FLAG | Enables the Batch Link subtask for agency identification. |

Location ID Subtasks are collectively disabled by the URL_LOCATION_IDENTIFICATION_TASK_FLAG flag.

| Flag | Description |
|---|---|
| LOCATION_ID_NLP_LOCATION_MATCH_FLAG | Enables the NLP location match subtask for location identification. |
| LOCATION_ID_BATCH_LINK_FLAG | Enables the Batch Link subtask for location identification. |

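
Because some flags gate others, a subtask effectively runs only when its own flag and every flag above it are enabled. The sketch below illustrates that idea under stated assumptions: the `PARENT_FLAG` mapping and the `effectively_enabled` helper are hypothetical, only a few flags are included, and placing URL_LOCATION_IDENTIFICATION_TASK_FLAG and RUN_URL_TASKS_TASK_FLAG in a single chain up to SCHEDULED_TASKS_FLAG is an inference from this page, not confirmed behavior.

```python
import os
from typing import Mapping, Optional

# Parent relationships inferred from the descriptions on this page.
# Treating them as one chain up to SCHEDULED_TASKS_FLAG is an assumption.
PARENT_FLAG: dict[str, str] = {
    "RUN_URL_TASKS_TASK_FLAG": "SCHEDULED_TASKS_FLAG",
    "URL_AGENCY_IDENTIFICATION_TASK_FLAG": "RUN_URL_TASKS_TASK_FLAG",
    "URL_LOCATION_IDENTIFICATION_TASK_FLAG": "RUN_URL_TASKS_TASK_FLAG",
    "AGENCY_ID_CKAN_FLAG": "URL_AGENCY_IDENTIFICATION_TASK_FLAG",
    "LOCATION_ID_BATCH_LINK_FLAG": "URL_LOCATION_IDENTIFICATION_TASK_FLAG",
}


def effectively_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if the flag and all of its ancestors are enabled."""
    current: Optional[str] = name
    while current is not None:
        # Unset flags count as enabled; only an explicit 0 disables.
        if env.get(current, "1").strip() == "0":
            return False
        current = PARENT_FLAG.get(current)
    return True
```

With this layout, setting only `RUN_URL_TASKS_TASK_FLAG=0` makes `effectively_enabled("AGENCY_ID_CKAN_FLAG")` return `False` even though the subtask's own flag is untouched.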
FDW_DATA_SOURCES_HOST=127.0.0.1 # The host of the Data Sources Database, used for FDW setup
FDW_DATA_SOURCES_PORT=1234 # The port of the Data Sources Database, used for FDW setup
FDW_DATA_SOURCES_USER=fdw_user # The username for the Data Sources Database, used for FDW setup
FDW_DATA_SOURCES_PASSWORD=password # The password for the Data Sources Database, used for FDW setup
FDW_DATA_SOURCES_DB=db_name # The database name for the Data Sources Database, used for FDW setup
PROD_DATA_SOURCES_HOST=127.0.0.1 # The host of the production Data Sources Database, used for Data Dumper
PROD_DATA_SOURCES_PORT=1234 # The port of the production Data Sources Database, used for Data Dumper
PROD_DATA_SOURCES_USER=dump_user # The username for the production Data Sources Database, used for Data Dumper
PROD_DATA_SOURCES_PASSWORD=password # The password for the production Data Sources Database, used for Data Dumper
PROD_DATA_SOURCES_DB=db_name # The database name for the production Data Sources Database, used for Data Dumper
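
The FDW_DATA_SOURCES_* and PROD_DATA_SOURCES_* groups share the same five suffixes, so one helper can assemble a connection string for either. This is a sketch only: `build_postgres_url` is a hypothetical name, and whether the application builds a PostgreSQL connection URI from these values is an assumption.

```python
import os
from typing import Mapping
from urllib.parse import quote


def build_postgres_url(prefix: str, env: Mapping[str, str] = os.environ) -> str:
    """Build a postgresql:// URI from <prefix>_HOST, _PORT, _USER, _PASSWORD, _DB."""
    # Percent-encode credentials and database name so characters such as
    # '@' or '/' cannot break the URI structure.
    user = quote(env[f"{prefix}_USER"], safe="")
    password = quote(env[f"{prefix}_PASSWORD"], safe="")
    database = quote(env[f"{prefix}_DB"], safe="")
    host = env[f"{prefix}_HOST"]
    port = int(env[f"{prefix}_PORT"])  # raises ValueError if not numeric
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

Calling `build_postgres_url("FDW_DATA_SOURCES")` or `build_postgres_url("PROD_DATA_SOURCES")` reads the matching group from the environment; a missing variable raises `KeyError`, which surfaces configuration gaps early.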