Interactive news-intelligence app for exploring global sentiment from GDELT.

The app continuously ingests GDELT Translingual GKG updates and lets you inspect:
- World sentiment map
- Top topics by region
- Top people/organizations by region
- Time-series sentiment for selected topic/entity
- Article cards for topic/entity (`about` vs `from` region)
- Ingest quality + pipeline timeline
- Colors countries by average tone for the selected window/topic.
- Click a country to set region context for topic/entity analysis.
`Top topics` and `Top entities` come from aggregated mention facts.
- Each row includes mention count and average sentiment.
- Selecting a row opens article cards and a time-series chart.
- `about <region>`: articles tagged as about that region.
- `from <region>`: articles filtered by inferred publisher country.
- Mention counts can be non-zero while article cards are empty. This is expected when:
  - records contributed to aggregate mentions but had no retained URL row, or
  - `from` scope cannot infer a publisher country for those URLs.
- Ingest quality (recent): pass/fail checks per batch.
- Pipeline timeline: stage events and stage durations.
- Backfill `<window>h` control: fetches historical batches to populate long lookback windows.
- `mentions` in topic/entity tables are aggregate counts from parsed GKG records.
- Article cards are a bounded subset (`theme_article`/`entity_article`) and are not a 1:1 representation of aggregate mentions.
- Tone uses the first value of GDELT `V2Tone`.
```
docker compose up --build
```
Open http://localhost:8000
Backend:
```
cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000
```
Frontend:
```
cd frontend
npm install
npm run dev
```
Open http://localhost:5173
- `GET /api/regions`
- `GET /api/country_summary`
- `GET /api/top_topics`
- `GET /api/top_entities`
- `GET /api/timeseries_topic`
- `GET /api/timeseries_entity`
- `GET /api/topic_articles`
- `GET /api/entity_articles`
- `GET /api/ingest_status`
- `GET /api/ingest_quality`
- `GET /api/ingest_stages`
- `POST /api/ingest_now`
- `POST /api/history_backfill`
- `POLL_SECONDS`: ingest poll interval
- `DEFAULT_WINDOW_HOURS`: default lookback window
- `USE_DAILY_MARTS_THRESHOLD_HOURS`: switch to hourly+daily hybrid query plan
- `DQ_MIN_PARSED_ROWS`, `DQ_MAX_PARSED_ROWS`: batch quality limits
- `HISTORY_BACKFILL_MAX_HOURS`: maximum on-demand historical backfill window
- `GDELT_SSL_VERIFY`, `GDELT_CA_BUNDLE`, `GDELT_ALLOW_INSECURE_FALLBACK`: upstream TLS behavior
frontend/:
- React SPA (Vite build) for map, ranking tables, drilldowns, and operational panels.
- Uses TanStack Query for pull-based refresh and local cache invalidation.
- Uses WebSocket updates (`/ws/updates`) to react quickly to newly ingested batches.
backend/:
- FastAPI service exposing read APIs, ingest control APIs, and WebSocket updates.
- DuckDB-backed analytical store for both fact tables and operational metadata.
- Ingest worker runs continuously in-process (startup background tasks).
GDELT lastupdate -> discover latest batch id
-> download zip
-> parse + aggregate rows
-> data quality validation
-> contract write
-> persist hourly facts + daily marts + article subsets
-> emit ingest status + WS update
-> UI refetches affected queries
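The end-to-end flow above can be sketched as a minimal loop. Stage boundaries mirror the diagram, but all names and the in-memory state here are illustrative, not the actual backend API.

```python
# Hypothetical sketch of one ingest cycle; stages follow the pipeline
# diagram (discover -> download -> aggregate -> validate -> contract ->
# persist -> complete). State and helpers are stand-ins, not backend code.
import asyncio

async def ingest_once(state: dict) -> str:
    batch_id = state["latest_batch"]              # discover latest batch id
    if batch_id in state["ingested"]:
        return "already_ingested"                 # short-circuit
    raw = f"zip-for-{batch_id}"                   # download (stubbed)
    rows = [raw.upper()]                          # parse + aggregate (stubbed)
    if not rows:                                  # data quality validation
        return "quality_failed"
    # contract write gates persist on quality_success
    state["contracts"][batch_id] = {"rows": len(rows), "quality_success": True}
    state["facts"].extend(rows)                   # persist facts/marts/subsets
    state["ingested"].add(batch_id)
    return "complete"                             # emit status + WS update

state = {"latest_batch": "20240101120000", "ingested": set(),
         "contracts": {}, "facts": []}
print(asyncio.run(ingest_once(state)))  # complete
print(asyncio.run(ingest_once(state)))  # already_ingested
```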
Stage sequence:
discover -> download -> aggregate -> validate -> persist -> complete
Stage behavior:
discover
- Reads `lastupdate-translation.txt`.
- Resolves current batch metadata (`batch_id`, file URL).
- Short-circuits with `already_ingested` when the batch already exists.
download
- Downloads `*.translation.gkg.csv.zip` into the cache.
- Supports TLS verification and optional insecure fallback when upstream certs are invalid.
aggregate
- Parses tab-separated GKG lines.
- Maps FIPS location codes to ISO country codes.
- Produces hourly aggregates for:
- country mentions/tone
- theme mentions/tone
- entity mentions/tone
- Produces bounded URL subsets for topic/entity drilldowns.
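A minimal sketch of the aggregate step for one GKG line, assuming the GKG 2.1 column layout (DATE at index 1, V2Locations at index 10, V2Tone at index 15) and a tiny stand-in FIPS-to-ISO map; the real backend uses a full mapping table.

```python
# Illustrative per-line aggregation: bucket by UTC hour, take the first
# V2Tone value, map FIPS location codes to ISO2, and increment
# (mentions, tone_sum) once per distinct country.
from collections import defaultdict

FIPS_TO_ISO2 = {"GM": "DE", "FR": "FR", "UK": "GB"}  # stand-in mapping

def aggregate_line(line: str, country_hour: dict) -> bool:
    f = line.split("\t")
    if len(f) < 16:
        return False                          # malformed row: skipped
    try:
        hour_ts = f[1][:10]                   # YYYYMMDDHH bucket from V2.1DATE
        tone = float(f[15].split(",")[0])     # first value of V2Tone
    except (ValueError, IndexError):
        return False
    isos = set()
    for loc in filter(None, f[10].split(";")):   # V2Locations blocks
        parts = loc.split("#")                   # Type#FullName#CountryCode#...
        if len(parts) > 2 and parts[2] in FIPS_TO_ISO2:
            isos.add(FIPS_TO_ISO2[parts[2]])
    if not isos:
        return False                          # no mapped country: excluded
    for iso2 in isos:                         # once per distinct country
        m, t = country_hour[(hour_ts, iso2)]
        country_hour[(hour_ts, iso2)] = (m + 1, t + tone)
    return True

country_hour = defaultdict(lambda: (0, 0.0))
row = ["id", "20240101123000"] + [""] * 8 + ["1#Germany#GM#####"] + [""] * 4 + ["-2.5,1,2"]
aggregate_line("\t".join(row), country_hour)
print(dict(country_hour))
```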
validate
- Runs batch quality expectations (row-count bounds, no negative mentions, tone sanity).
- Writes the quality result into `ingestion_quality`.
contract
- Writes a stage contract row (`ingestion_contract`) with materialized batch stats + quality result.
- The persist stage is gated on a successful contract (`quality_success=true`).
persist
- Upserts hourly fact tables.
- Upserts daily marts for long-window acceleration.
- Upserts article-card tables.
- Records the batch in `ingestion_batches`.
complete
- Writes stage event log entries and final ingest status.
- Broadcasts websocket event to connected clients.
Primary dimensional table:
- `countries`: FIPS -> ISO2/ISO3/name/continent mapping.
Hourly facts (high-fidelity recent data):
- `country_hour(hour_ts, iso2, mentions, tone_sum)`
- `theme_hour(hour_ts, iso2, theme, mentions, tone_sum)`
- `entity_hour(hour_ts, iso2, entity_type, entity, mentions, tone_sum)`
Daily marts (long-window acceleration):
- `country_day(day_ts, iso2, mentions, tone_sum)`
- `theme_day(day_ts, iso2, theme, mentions, tone_sum)`
- `entity_day(day_ts, iso2, entity_type, entity, mentions, tone_sum)`
Drilldown subsets (URL-level):
- `theme_article(hour_ts, iso2, theme, url, source, source_country, tone_sum, mentions)`
- `entity_article(hour_ts, iso2, entity_type, entity, url, source, source_country, tone_sum, mentions)`
Operational/observability tables:
- `ingestion_batches`
- `ingestion_quality`
- `ingestion_contract`
- `ingestion_stage_events`
- `article_meta` (best-effort title/image/description cache)
Short windows:
- Reads directly from hourly tables for maximum precision.
Long windows:
- Uses hybrid plan:
- hourly rows for edge periods
- daily marts for interior period
- Controlled by `USE_DAILY_MARTS_THRESHOLD_HOURS`.
Reason:
- Keeps query latency stable for 30-day windows without losing near-real-time detail at the edges.
Write path:
- Write operations run behind backend lock wrappers (`with_lock`) to avoid concurrent conflicting writes.
Read path:
- API reads are lock-light (`with_read_lock`) and optimized for dashboard responsiveness.
Ingest control:
- Ingest run lock prevents overlapping ingest cycles.
- Manual trigger endpoint and background loop share the same lock boundary.
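The write-lock boundary above can be sketched with a single asyncio lock; the wrapper name mirrors `with_lock` from the doc, but the implementation is illustrative, not the backend's.

```python
# Hypothetical with_lock wrapper: one shared asyncio.Lock serializes
# upsert-heavy writes so two writers can never interleave.
import asyncio
from contextlib import asynccontextmanager

def make_lock_wrapper():
    lock = asyncio.Lock()
    @asynccontextmanager
    async def with_lock():
        async with lock:
            yield
    return with_lock

async def demo():
    with_lock = make_lock_wrapper()   # one shared write lock
    order = []
    async def writer(name):
        async with with_lock():
            order.append(f"{name}:start")
            await asyncio.sleep(0)    # yield control while holding the lock
            order.append(f"{name}:end")
    await asyncio.gather(writer("a"), writer("b"))
    return order

print(asyncio.run(demo()))  # ['a:start', 'a:end', 'b:start', 'b:end']
```

Even though writer `a` yields mid-critical-section, writer `b` cannot start until `a` releases the lock, which is the property the ingest/persist path relies on.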
Expected failure classes:
- Upstream fetch errors (network/TLS)
- Parsing anomalies in raw feed
- Quality-rule failures
Behavior:
- Errors are captured in ingest status and stage events.
- Failed quality batches do not pass contract gate and therefore are not persisted into fact tables.
- Loop continues polling after failures (no permanent halt).
Data refresh:
- Polls APIs on interval via TanStack Query.
- Ingest panels poll faster while ingest is running.
- WebSocket notifications trigger near-real-time refresh behavior.
UX semantics:
- Table mentions come from aggregate facts.
- Article cards are a constrained subset and may not exist for every aggregate mention.
- `about` sentiment now reflects the table aggregate when available; `from` reflects the shown article subset.
A raw GKG row contributes to aggregate mentions only if all conditions are true:
- A valid `V2.1DATE` parseable into a UTC hour bucket.
- A valid `V2Tone` first value parseable as a float.
- At least one valid location country code extracted from `V2Locations`.
A raw row may still contribute to aggregate mentions even if:
- URL is missing/unparseable.
- Theme/entity fields are empty (it still contributes to country aggregates).
Implication:
- Mention aggregates are intentionally broader than article-card subsets.
For a single parsed document:
- Country mentions: incremented once per distinct mapped country.
- Theme mentions: incremented for each `(country, theme)` pair.
- Entity mentions: incremented for each `(country, entity_type, entity)` pair.
This means one document can inflate mentions across multiple countries/themes/entities. This is expected and required for co-mention analysis.
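The fan-out above can be sketched directly; the counter structures are illustrative stand-ins for the fact tables, not the backend's actual storage.

```python
# Illustrative fan-out for one parsed document: country counts increment
# once per distinct country, theme counts per (country, theme) pair, and
# entity counts per (country, entity_type, entity) triple.
from collections import Counter

def count_document(countries, themes, entities, facts):
    for c in countries:
        facts["country"][c] += 1                   # once per distinct country
        for th in themes:
            facts["theme"][(c, th)] += 1           # per (country, theme)
        for etype, ent in entities:
            facts["entity"][(c, etype, ent)] += 1  # per (country, type, entity)

facts = {"country": Counter(), "theme": Counter(), "entity": Counter()}
count_document({"DE", "FR"}, ["ECON_INFLATION"], [("PERSON", "Jane Doe")], facts)
# One document about two countries yields two theme rows: the expected
# "inflation" across dimensions that co-mention analysis depends on.
print(sum(facts["theme"].values()))  # 2
```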
URL-level tables are bounded by three controls:
- `MAX_ARTICLE_URLS_PER_BATCH`
- `MAX_ARTICLE_URLS_PER_THEME`
- `MAX_ARTICLE_URLS_PER_ENTITY`
Rows are deduplicated by primary keys:
- theme rows: `(hour_ts, iso2, theme, url)`
- entity rows: `(hour_ts, iso2, entity_type, entity, url)`
Consequence:
- URL tables are a representative drilldown subset, not full coverage.
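A sketch of the dedup-then-cap retention described above, for theme rows only; the cap name mirrors the config variable but the value and function are illustrative.

```python
# Illustrative retention for theme article rows: drop primary-key
# duplicates (hour_ts, iso2, theme, url), then enforce a per-theme cap.
MAX_ARTICLE_URLS_PER_THEME = 2  # arbitrary value for the example

def retain_theme_rows(rows):
    seen, kept, per_theme = set(), [], {}
    for r in rows:                 # r = (hour_ts, iso2, theme, url)
        theme = r[2]
        if r in seen:
            continue               # primary-key dedup
        if per_theme.get(theme, 0) >= MAX_ARTICLE_URLS_PER_THEME:
            continue               # per-theme cap reached: URL not retained
        seen.add(r)
        per_theme[theme] = per_theme.get(theme, 0) + 1
        kept.append(r)
    return kept

rows = [("h1", "DE", "T", "u1"), ("h1", "DE", "T", "u1"),
        ("h1", "DE", "T", "u2"), ("h1", "DE", "T", "u3")]
print(retain_theme_rows(rows))  # u1 kept once, u2 kept, u3 capped out
```

Rows dropped by the cap still contributed to aggregate mentions upstream, which is exactly why article cards can undercount relative to the tables.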
ingestion_stage_events event payload model:
- `event_id`: UUID
- `batch_id`: nullable for pre-discovery/no-update events
- `stage`: discover/download/aggregate/validate/persist/complete/ingest
- `status`: started/success/failed/error/skipped/already_ingested/no_update
- `detail_json`: stage-specific payload
- `created_at`: server timestamp
ingestion_contract model (gate between validate and persist):
- `batch_id`, `contract_version`
- materialized batch stats (`parsed_rows`, `country_rows`, `theme_rows`, `entity_rows`)
- `quality_success`
- serialized quality report JSON
Gate invariant:
- The persist stage must not execute unless a contract exists and `quality_success=true`.
Idempotency is guaranteed at two levels:
- Batch-level: `ingestion_batches.batch_id` prevents re-ingesting the same batch.
- Table-level: upserts on fact/article tables merge repeated writes safely.
Operational behavior:
- Poll loop may discover the same latest batch repeatedly.
- Backend suppresses noisy duplicate `already_ingested` stage logs for the same batch over a cooldown window.
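Table-level idempotency can be sketched as a keyed upsert. DuckDB supports `INSERT ... ON CONFLICT DO UPDATE`; sqlite3 is used below only because it ships with Python and accepts the same clause, so treat the snippet as a stand-in for the DuckDB write path.

```python
# Illustrative upsert on the country_hour primary key (hour_ts, iso2):
# replaying the same batch write leaves the table in the same final state.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE country_hour (
    hour_ts TEXT, iso2 TEXT, mentions INT, tone_sum REAL,
    PRIMARY KEY (hour_ts, iso2))""")

def upsert(hour_ts, iso2, mentions, tone_sum):
    con.execute("""INSERT INTO country_hour VALUES (?, ?, ?, ?)
        ON CONFLICT(hour_ts, iso2) DO UPDATE SET
            mentions = excluded.mentions, tone_sum = excluded.tone_sum""",
        (hour_ts, iso2, mentions, tone_sum))

upsert("2024010112", "DE", 5, -1.5)
upsert("2024010112", "DE", 5, -1.5)   # replayed batch: no duplicate row
print(con.execute("SELECT * FROM country_hour").fetchall())
```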
/api/top_topics and /api/top_entities:
- Source: aggregate facts (`theme_hour`/`theme_day`, `entity_hour`/`entity_day`).
- Guarantee: mentions/tone reflect the full aggregate scope for the selected region + window.
/api/topic_articles and /api/entity_articles:
- Source: URL subset tables (`theme_article`, `entity_article`).
- Guarantee: only reflects the retained URL subset within the window.
scope=about:
- Filters by topic/entity association and region membership (`iso2` on the article row).
scope=from:
- Filters by inferred publisher country (`source_country`) derived from URL host/TLD.
- Coverage depends on domain-country inferability.
For windows below threshold:
- planner reads hourly tables only.
For windows above threshold:
- planner composes:
- leading partial-day hourly slice
- middle full-day daily-mart slice
- trailing current-day hourly slice
This preserves edge accuracy while reducing scan volume for multi-week windows.
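The three-slice composition can be sketched as a small planner function; names and the slice representation are illustrative, not the backend's planner API.

```python
# Illustrative hybrid-plan split: windows at or under the threshold read
# hourly tables only; wider windows become a leading hourly slice, an
# interior daily-mart slice of whole days, and a trailing hourly slice.
from datetime import datetime, timedelta

def plan_window(start: datetime, end: datetime, threshold_hours: int):
    if (end - start) <= timedelta(hours=threshold_hours):
        return [("hourly", start, end)]           # short window: hourly only
    first_full_day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if first_full_day < start:
        first_full_day += timedelta(days=1)       # round up to next midnight
    last_full_day = end.replace(hour=0, minute=0, second=0, microsecond=0)
    return [("hourly", start, first_full_day),    # leading partial day
            ("daily", first_full_day, last_full_day),  # interior full days
            ("hourly", last_full_day, end)]       # trailing partial day

for slice_ in plan_window(datetime(2024, 1, 1, 7), datetime(2024, 1, 20, 15), 48):
    print(slice_)
```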
Concurrency controls:
- async run lock prevents overlapping ingest runs.
- task lock prevents duplicate manual-trigger tasks.
- DB write lock serializes upsert-heavy operations.
CPU/IO split:
- Zip parsing/aggregation is executed in worker threads to avoid blocking event loop.
- Network fetch and API handling remain async.
Main hot spots:
- Zip parsing and per-row split/tokenization.
- Wide group-by writes during upsert into fact tables.
- Metadata fetch for external article pages (best-effort).
Main latency levers:
- Reducing `MAX_THEMES_PER_DOC` and `MAX_ENTITIES_PER_DOC`.
- Tuning article subset caps.
- Adjusting `USE_DAILY_MARTS_THRESHOLD_HOURS`.
/api/topic_articles_backfill:
- Scans recent expected batch timestamps backward from latest known batch.
- Downloads missing zips on demand (unless already cached).
- Extracts matching topic URLs and upserts `theme_article`.
- Optional cache cleanup controlled by `BACKFILL_KEEP_ZIPS`.
Use case:
- recover article-card coverage when aggregate mentions exist but URL subset is sparse.
If ingest appears stale:
- Check `GET /api/ingest_status` for `running`, `phase`, `last_error`.
- Check `GET /api/ingest_stages` for the most recent stage transitions.
- Trigger `POST /api/ingest_now`.
If quality failures occur:
- Inspect `GET /api/ingest_quality`.
- Verify parsed row counts against the expected feed size.
- Adjust `DQ_MIN_PARSED_ROWS`/`DQ_MAX_PARSED_ROWS` only with evidence.
If “mentions but no articles” is reported:
- Confirm scope (`about` vs `from`).
- Run the topic backfill.
- Verify URL-country inferability constraints for `from`.
- This is a single-process service architecture; no external queue/orchestrator yet.
- DuckDB is ideal for local analytics, but write concurrency is intentionally serialized.
- URL-country inference via TLD is heuristic and incomplete.
- Article metadata enrichment depends on third-party site availability and SSL behavior.
- Add multi-source ingestion beyond news web domains.
- Ingest and persist `SourceCollectionIdentifier` from raw GKG rows.
- Add a social-in-news source pipeline (e.g., daily social link datasets referenced by GDELT).
- Add TV-GKG as a second source family and map into shared fact schemas.
- Add `source_type` and `source_family` dimensions to facts and article tables.
- Add UI filters/toggles for source scope: `News`, `Social`, `TV`, `All`.
- Expose source coverage diagnostics in technical panels (per source type, per batch).
- GDELT location country codes are FIPS; backend maps to ISO2/ISO3.
- Theme interpretations are sourced from GDELT theme lookup and stored in:
  `backend/app/ingest/data/gdelt_theme_interpretations.json`