com-digi-s/world-sentiment-map

GDELT Sentiment Visualizer

Interactive news-intelligence app for exploring global sentiment from GDELT.

The app continuously ingests GDELT Translingual GKG updates and lets you inspect:

  • World sentiment map
  • Top topics by region
  • Top people/organizations by region
  • Time-series sentiment for selected topic/entity
  • Article cards for topic/entity (about vs from region)
  • Ingest quality + pipeline timeline

What You See In The UI

1) World Sentiment Map

  • Colors countries by average tone for the selected window/topic.
  • Click a country to set region context for topic/entity analysis.

2) Topics and Entities

  • Top topics and Top entities come from aggregated mention facts.
  • Each row includes mention count and average sentiment.
  • Selecting a row opens article cards and a time-series chart.

3) Article Cards (about vs from)

  • about <region>: articles tagged as about that region.
  • from <region>: articles filtered by inferred publisher country.
  • Mention counts can be non-zero while article cards are empty. This is expected when:
    • records contributed to aggregate mentions but had no retained URL row, or
    • from scope cannot infer publisher country for those URLs.

4) Technical Panels

  • Ingest quality (recent): pass/fail checks per batch.
  • Pipeline timeline: stage events and stage durations.
  • Backfill <window>h control: fetches historical batches to populate long lookback windows.

Data Semantics (Important)

  • mentions in topic/entity tables are aggregate counts from parsed GKG records.
  • Article cards are a bounded subset (theme_article / entity_article) and are not a 1:1 representation of aggregate mentions.
  • Tone uses the first value of GDELT V2Tone.
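
To make the tone rule concrete, here is a minimal sketch (the helper name is hypothetical; the field layout follows the GDELT GKG 2.1 codebook, where V2Tone is a comma-separated list whose first value is the document tone):

```python
def parse_tone(v2tone_field: str):
    """Return the document tone: the first comma-separated value of V2Tone.

    Returns None when the field is empty or not parseable as a float.
    """
    first = v2tone_field.split(",")[0]
    try:
        return float(first)
    except ValueError:
        return None
```

Rows whose tone cannot be parsed this way are not admissible to the aggregates (see the parser contract below).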

Run The App

Docker (recommended)

docker compose up --build

Open http://localhost:8000

Local Dev

Backend:

cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

Frontend:

cd frontend
npm install
npm run dev

Open http://localhost:5173

Key API Endpoints

  • GET /api/regions
  • GET /api/country_summary
  • GET /api/top_topics
  • GET /api/top_entities
  • GET /api/timeseries_topic
  • GET /api/timeseries_entity
  • GET /api/topic_articles
  • GET /api/entity_articles
  • GET /api/ingest_status
  • GET /api/ingest_quality
  • GET /api/ingest_stages
  • POST /api/ingest_now
  • POST /api/history_backfill

Config Highlights

  • POLL_SECONDS: ingest poll interval
  • DEFAULT_WINDOW_HOURS: default lookback window
  • USE_DAILY_MARTS_THRESHOLD_HOURS: window length above which queries switch to the hourly+daily hybrid plan
  • DQ_MIN_PARSED_ROWS, DQ_MAX_PARSED_ROWS: batch quality limits
  • HISTORY_BACKFILL_MAX_HOURS: maximum on-demand historical backfill window
  • GDELT_SSL_VERIFY, GDELT_CA_BUNDLE, GDELT_ALLOW_INSECURE_FALLBACK: upstream TLS behavior

Architecture (In Depth)

System Boundaries

frontend/:

  • React SPA (Vite build) for map, ranking tables, drilldowns, and operational panels.
  • Uses TanStack Query for pull-based refresh and local cache invalidation.
  • Uses WebSocket updates (/ws/updates) to react quickly to newly ingested batches.

backend/:

  • FastAPI service exposing read APIs, ingest control APIs, and WebSocket updates.
  • DuckDB-backed analytical store for both fact tables and operational metadata.
  • Ingest worker runs continuously in-process (startup background tasks).

End-to-End Data Flow

GDELT lastupdate -> discover latest batch id
               -> download zip
               -> parse + aggregate rows
               -> data quality validation
               -> contract write
               -> persist hourly facts + daily marts + article subsets
               -> emit ingest status + WS update
               -> UI refetches affected queries

Ingest Pipeline Stages

Stage sequence: discover -> download -> aggregate -> validate -> contract -> persist -> complete

Stage behavior:

  1. discover
  • Reads lastupdate-translation.txt.
  • Resolves current batch metadata (batch_id, file URL).
  • Short-circuits with already_ingested when the batch is already recorded.
  2. download
  • Downloads *.translation.gkg.csv.zip into the cache.
  • Supports TLS verification and an optional insecure fallback when upstream certs are invalid.
  3. aggregate
  • Parses tab-separated GKG lines.
  • Maps FIPS location codes to ISO country codes.
  • Produces hourly aggregates for:
    • country mentions/tone
    • theme mentions/tone
    • entity mentions/tone
  • Produces bounded URL subsets for topic/entity drilldowns.
  4. validate
  • Runs batch quality expectations (row-count bounds, no negative mentions, tone sanity).
  • Writes the quality result into ingestion_quality.
  5. contract
  • Writes a stage contract row (ingestion_contract) with materialized batch stats + quality result.
  • The persist stage is gated by a successful contract (quality_success=true).
  6. persist
  • Upserts hourly fact tables.
  • Upserts daily marts for long-window acceleration.
  • Upserts article-card tables.
  • Records ingestion_batches.
  7. complete
  • Writes stage event log entries and the final ingest status.
  • Broadcasts a websocket event to connected clients.
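
The stage sequence can be sketched as a small driver loop. This is an illustrative stand-in, not the service's actual implementation; stage names and statuses mirror the list above:

```python
def run_pipeline(stages, ctx):
    """Run named stage functions in order, recording start and result events.

    Stops early when a stage fails or reports the batch as already ingested,
    mirroring the discover short-circuit and the contract gate.
    """
    events = []
    for name, fn in stages:
        events.append((name, "started"))
        status = fn(ctx)
        events.append((name, status))
        if status in ("failed", "already_ingested"):
            break
    return events

# Toy usage: discover short-circuits on a batch id that was already ingested.
def discover(ctx):
    return "already_ingested" if ctx["batch_id"] in ctx["seen"] else "success"

events = run_pipeline([("discover", discover)], {"batch_id": "b1", "seen": {"b1"}})
```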

Storage Model (DuckDB)

Primary dimensional table:

  • countries: FIPS -> ISO2/ISO3/name/continent mapping.

Hourly facts (high-fidelity recent data):

  • country_hour(hour_ts, iso2, mentions, tone_sum)
  • theme_hour(hour_ts, iso2, theme, mentions, tone_sum)
  • entity_hour(hour_ts, iso2, entity_type, entity, mentions, tone_sum)

Daily marts (long-window acceleration):

  • country_day(day_ts, iso2, mentions, tone_sum)
  • theme_day(day_ts, iso2, theme, mentions, tone_sum)
  • entity_day(day_ts, iso2, entity_type, entity, mentions, tone_sum)

Drilldown subsets (URL-level):

  • theme_article(hour_ts, iso2, theme, url, source, source_country, tone_sum, mentions)
  • entity_article(hour_ts, iso2, entity_type, entity, url, source, source_country, tone_sum, mentions)

Operational/observability tables:

  • ingestion_batches
  • ingestion_quality
  • ingestion_contract
  • ingestion_stage_events
  • article_meta (best-effort title/image/description cache)

Query Planner Strategy

Short windows:

  • Reads directly from hourly tables for maximum precision.

Long windows:

  • Uses hybrid plan:
    • hourly rows for edge periods
    • daily marts for interior period
  • Controlled by USE_DAILY_MARTS_THRESHOLD_HOURS.

Reason:

  • Keeps query latency stable for 30-day windows without losing near-real-time detail at the edges.

Consistency and Concurrency

Write path:

  • Write operations run behind backend lock wrappers (with_lock) to avoid concurrent conflicting writes.

Read path:

  • API reads are lock-light (with_read_lock) and optimized for dashboard responsiveness.

Ingest control:

  • Ingest run lock prevents overlapping ingest cycles.
  • Manual trigger endpoint and background loop share the same lock boundary.
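
One way to share a single lock boundary between the background loop and the manual trigger is an asyncio.Lock, sketched below. Names are assumptions; the service's actual with_lock wrappers may differ:

```python
import asyncio

_ingest_lock = asyncio.Lock()  # shared by the poll loop and /api/ingest_now

async def run_ingest_once() -> str:
    if _ingest_lock.locked():
        return "skipped"  # another ingest cycle is already running
    async with _ingest_lock:
        await asyncio.sleep(0)  # stand-in for the actual ingest work
        return "completed"

async def demo():
    # Two concurrent triggers: only one cycle actually runs.
    return await asyncio.gather(run_ingest_once(), run_ingest_once())
```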

Failure Model and Recovery

Expected failure classes:

  • Upstream fetch errors (network/TLS)
  • Parsing anomalies in raw feed
  • Quality-rule failures

Behavior:

  • Errors are captured in ingest status and stage events.
  • Failed quality batches do not pass contract gate and therefore are not persisted into fact tables.
  • Loop continues polling after failures (no permanent halt).

Frontend Runtime Model

Data refresh:

  • Polls APIs on interval via TanStack Query.
  • Ingest panels poll faster while ingest is running.
  • WebSocket notifications trigger near-real-time refresh behavior.

UX semantics:

  • Table mentions come from aggregate facts.
  • Article cards are a constrained subset and may not exist for every aggregate mention.
  • about sentiment reflects the table aggregate when available; from sentiment reflects the displayed article subset.

Parser and Record-Admissibility Contract

A raw GKG row contributes to aggregate mentions only if all conditions are true:

  • Valid V2.1DATE parseable into UTC hour bucket.
  • Valid V2Tone first value parseable as float.
  • At least one valid location country code extracted from V2Locations.

A raw row may still contribute to aggregate mentions even if:

  • URL is missing/unparseable.
  • Theme/entity fields are empty (it still contributes to country aggregates).

Implication:

  • Mention aggregates are intentionally broader than article-card subsets.
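
The admissibility contract can be expressed as a single predicate. Field names here are illustrative; the parser's real record shape may differ:

```python
from datetime import datetime

def admissible(row: dict) -> bool:
    """True when a parsed GKG row may contribute to aggregate mentions."""
    try:
        # Valid V2.1DATE, parseable into a UTC hour bucket.
        datetime.strptime(row.get("v21date", ""), "%Y%m%d%H%M%S")
    except ValueError:
        return False
    try:
        # Valid first V2Tone value, parseable as float.
        float(row.get("v2tone", "").split(",")[0])
    except ValueError:
        return False
    # At least one mapped location country; URL and themes may be absent.
    return bool(row.get("countries"))
```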

Cardinality and Mention Expansion Rules

For a single parsed document:

  • Country mentions: incremented once per distinct mapped country.
  • Theme mentions: incremented for each (country, theme) pair.
  • Entity mentions: incremented for each (country, entity_type, entity) pair.

This means one document can inflate mentions across multiple countries/themes/entities. This is expected and required for co-mention analysis.
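
The expansion rules can be sketched as follows (function and argument names are illustrative):

```python
def expand_mentions(countries, themes, entities):
    """Fan one parsed document out into per-table mention increments.

    countries: iterable of ISO2 codes; themes: iterable of theme names;
    entities: iterable of (entity_type, entity) pairs.
    """
    country_rows = [(c,) for c in countries]                              # once per country
    theme_rows = [(c, t) for c in countries for t in themes]              # per (country, theme)
    entity_rows = [(c, et, e) for c in countries for (et, e) in entities]  # per (country, type, entity)
    return country_rows, theme_rows, entity_rows
```

A document mapped to 2 countries with 2 themes therefore contributes 4 theme-mention increments, which is the cross-product behavior the co-mention analysis relies on.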

URL Subset Materialization Rules

URL-level tables are bounded by three controls:

  • MAX_ARTICLE_URLS_PER_BATCH
  • MAX_ARTICLE_URLS_PER_THEME
  • MAX_ARTICLE_URLS_PER_ENTITY

Rows are deduplicated by primary keys:

  • theme rows: (hour_ts, iso2, theme, url)
  • entity rows: (hour_ts, iso2, entity_type, entity, url)

Consequence:

  • URL tables are a representative drilldown subset, not full coverage.
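
A sketch of the bounding and dedup logic for theme rows; the cap arguments mirror the config names above, but the function itself is hypothetical:

```python
from collections import Counter

def retain_theme_urls(rows, per_theme_cap=50, batch_cap=1000):
    """Keep a bounded, deduplicated subset of (hour_ts, iso2, theme, url) rows."""
    seen = set()
    per_theme = Counter()
    kept = []
    for row in rows:
        hour_ts, iso2, theme, url = row
        if row in seen:
            continue  # primary-key dedup on (hour_ts, iso2, theme, url)
        if per_theme[theme] >= per_theme_cap or len(kept) >= batch_cap:
            continue  # MAX_ARTICLE_URLS_PER_THEME / MAX_ARTICLE_URLS_PER_BATCH
        seen.add(row)
        per_theme[theme] += 1
        kept.append(row)
    return kept
```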

Stage Event and Contract Schemas

ingestion_stage_events event payload model:

  • event_id: UUID
  • batch_id: nullable for pre-discovery/no-update events
  • stage: discover/download/aggregate/validate/persist/complete/ingest
  • status: started/success/failed/error/skipped/already_ingested/no_update
  • detail_json: stage-specific payload
  • created_at: server timestamp

ingestion_contract model (gate between validate and persist):

  • batch_id, contract_version
  • materialized batch stats (parsed_rows, country_rows, theme_rows, entity_rows)
  • quality_success
  • serialized quality report JSON

Gate invariant:

  • Persist stage must not execute unless contract exists and quality_success=true.
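
The invariant reduces to a small guard, sketched here with field names taken from the contract model above:

```python
def may_persist(contract) -> bool:
    """Persist is allowed only when a contract row exists and quality passed."""
    return contract is not None and contract.get("quality_success") is True
```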

Idempotency Model

Idempotency is guaranteed at two levels:

  • Batch-level: ingestion_batches.batch_id prevents re-ingesting the same batch.
  • Table-level: upserts on fact/article tables merge repeated writes safely.

Operational behavior:

  • Poll loop may discover the same latest batch repeatedly.
  • Backend suppresses noisy duplicate already_ingested stage logs for the same batch over a cooldown window.
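
Table-level idempotency comes from keyed upserts. The sketch below uses sqlite3 as a stand-in for DuckDB (both support INSERT ... ON CONFLICT); the schema and values are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE country_hour ("
    " hour_ts TEXT, iso2 TEXT, mentions INTEGER, tone_sum REAL,"
    " PRIMARY KEY (hour_ts, iso2))"
)

def upsert_country_hour(hour_ts, iso2, mentions, tone_sum):
    # Re-writing the same batch replaces the keyed row instead of duplicating it.
    con.execute(
        "INSERT INTO country_hour VALUES (?, ?, ?, ?) "
        "ON CONFLICT(hour_ts, iso2) DO UPDATE SET "
        "mentions = excluded.mentions, tone_sum = excluded.tone_sum",
        (hour_ts, iso2, mentions, tone_sum),
    )

upsert_country_hour("2024-01-01 12:00", "US", 10, -3.2)
upsert_country_hour("2024-01-01 12:00", "US", 10, -3.2)  # idempotent re-ingest
```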

Query Path Contracts

/api/top_topics and /api/top_entities:

  • Source: aggregate facts (theme_hour/day, entity_hour/day)
  • Guarantee: mention/tone reflects full aggregate scope for selected region + window.

/api/topic_articles and /api/entity_articles:

  • Source: URL subset tables (theme_article, entity_article)
  • Guarantee: only reflects retained URL subset within window.

scope=about:

  • Filters by topic/entity association and region membership (iso2 on article row).

scope=from:

  • Filters by inferred publisher country (source_country) derived from URL host/TLD.
  • Coverage depends on domain-country inferability.
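
Publisher-country inference from the URL host can be sketched as below; the mapping is a tiny illustrative subset, not the real lookup table:

```python
from urllib.parse import urlparse

CCTLD_TO_ISO2 = {"uk": "GB", "de": "DE", "fr": "FR", "jp": "JP"}  # illustrative subset

def infer_source_country(url: str):
    """Best-effort publisher country from the host's country-code TLD."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_ISO2.get(tld)  # None for generic TLDs like .com
```

Generic TLDs return None, which is exactly the coverage gap that makes from-scope article cards sparser than aggregate mentions.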

Time-Window Planner Internals

For windows below threshold:

  • Planner reads hourly tables only.

For windows above threshold:

  • Planner composes:
    • leading partial-day hourly slice
    • middle full-day daily-mart slice
    • trailing current-day hourly slice

This preserves edge accuracy while reducing scan volume for multi-week windows.
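
The three-slice composition can be sketched as follows; the default threshold and function name are assumptions:

```python
from datetime import datetime, timedelta

def plan_slices(start: datetime, end: datetime, threshold_hours: int = 72):
    """Split a query window into hourly edges and a daily-mart interior."""
    if end - start <= timedelta(hours=threshold_hours):
        return [("hourly", start, end)]  # short window: hourly tables only
    first_midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if start > first_midnight:
        first_midnight += timedelta(days=1)
    last_midnight = end.replace(hour=0, minute=0, second=0, microsecond=0)
    slices = []
    if start < first_midnight:
        slices.append(("hourly", start, first_midnight))        # leading partial day
    if first_midnight < last_midnight:
        slices.append(("daily", first_midnight, last_midnight))  # interior full days
    if last_midnight < end:
        slices.append(("hourly", last_midnight, end))            # trailing partial day
    return slices
```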

Locks, Threads, and Execution Model

Concurrency controls:

  • async run lock prevents overlapping ingest runs.
  • task lock prevents duplicate manual-trigger tasks.
  • DB write lock serializes upsert-heavy operations.

CPU/IO split:

  • Zip parsing/aggregation is executed in worker threads to avoid blocking event loop.
  • Network fetches and API handling remain async.

Performance Characteristics

Main hot spots:

  • Zip parsing and per-row split/tokenization.
  • Wide group-by writes during upsert into fact tables.
  • Metadata fetch for external article pages (best-effort).

Main latency levers:

  • reducing MAX_THEMES_PER_DOC and MAX_ENTITIES_PER_DOC
  • tuning article subset caps
  • adjusting USE_DAILY_MARTS_THRESHOLD_HOURS

Backfill Path (Topic URL Recovery)

/api/topic_articles_backfill:

  • Scans recent expected batch timestamps backward from latest known batch.
  • Downloads missing zips on demand (unless already cached).
  • Extracts matching topic URLs and upserts theme_article.
  • Optional cache cleanup controlled by BACKFILL_KEEP_ZIPS.

Use case:

  • Recover article-card coverage when aggregate mentions exist but the URL subset is sparse.

Operational Runbook

If ingest appears stale:

  • Check GET /api/ingest_status for running, phase, last_error.
  • Check GET /api/ingest_stages for most recent stage transitions.
  • Trigger POST /api/ingest_now.

If quality failures occur:

  • Inspect GET /api/ingest_quality.
  • Verify parsed row counts against expected feed size.
  • Adjust DQ_MIN_PARSED_ROWS/DQ_MAX_PARSED_ROWS only with evidence.

If “mentions but no articles” is reported:

  • Confirm scope (about vs from).
  • Run topic backfill.
  • Verify URL-country inferability constraints for from.

Known Tradeoffs and Limits

  • This is a single-process service architecture; no external queue/orchestrator yet.
  • DuckDB is ideal for local analytics, but write concurrency is intentionally serialized.
  • URL-country inference via TLD is heuristic and incomplete.
  • Article metadata enrichment depends on third-party site availability and SSL behavior.

To-Dos

  • Add multi-source ingestion beyond news web domains.
  • Ingest and persist SourceCollectionIdentifier from raw GKG rows.
  • Add social-in-news source pipeline (e.g., daily social link datasets referenced by GDELT).
  • Add TV-GKG as a second source family and map into shared fact schemas.
  • Add source_type and source_family dimensions to facts and article tables.
  • Add UI filters/toggles for source scope: News, Social, TV, All.
  • Expose source coverage diagnostics in technical panels (per source type, per batch).

Notes

  • GDELT location country codes are FIPS; backend maps to ISO2/ISO3.
  • Theme interpretations are sourced from GDELT theme lookup and stored in:
    • backend/app/ingest/data/gdelt_theme_interpretations.json
