com-digi-s/world-sentiment-map

GDELT Sentiment Visualizer

Interactive news-intelligence app for exploring global sentiment from GDELT.

The app continuously ingests GDELT Translingual GKG updates and lets you inspect:

  • World sentiment map
  • Top topics by region
  • Top people/organizations by region
  • Time-series sentiment for selected topic/entity
  • Article cards for topic/entity (about vs from region)
  • Ingest quality + pipeline timeline

What You See In The UI

1) World Sentiment Map

  • Colors countries by average tone for the selected window/topic.
  • Click a country to set region context for topic/entity analysis.

2) Topics and Entities

  • Top topics and Top entities come from aggregated mention facts.
  • Each row includes mention count and average sentiment.
  • Selecting a row opens article cards and a time-series chart.

3) Article Cards (about vs from)

  • about <region>: articles tagged as about that region.
  • from <region>: articles filtered by inferred publisher country.
  • Mention counts can be non-zero while article cards are empty. This is expected when:
    • records contributed to aggregate mentions but had no retained URL row, or
    • from scope cannot infer publisher country for those URLs.

4) Technical Panels

  • Ingest quality (recent): pass/fail checks per batch.
  • Pipeline timeline: stage events and stage durations.
  • Backfill <window>h control: fetches historical batches to populate long lookback windows.

Data Semantics (Important)

  • mentions in topic/entity tables are aggregate counts from parsed GKG records.
  • Article cards are a bounded subset (theme_article / entity_article) and are not a 1:1 representation of aggregate mentions.
  • Tone uses the first value of GDELT V2Tone.
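
To make the tone rule concrete, here is a minimal sketch (the helper name is hypothetical; the field layout follows the GDELT GKG 2.1 codebook, where V2Tone is a comma-separated list whose first value is the document tone):

```python
def parse_tone(v2tone_field: str):
    """Return the document tone: the first comma-separated value of V2Tone.

    Returns None when the field is empty or not parseable as a float.
    """
    first = v2tone_field.split(",")[0]
    try:
        return float(first)
    except ValueError:
        return None
```

Rows whose tone cannot be parsed this way are not admissible to the aggregates (see the parser contract below).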

Run The App

Docker (recommended)

docker compose up --build

Open http://localhost:8000

Local Dev

Backend:

cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

Frontend:

cd frontend
npm install
npm run dev

Open http://localhost:5173

Key API Endpoints

  • GET /api/regions
  • GET /api/country_summary
  • GET /api/top_topics
  • GET /api/top_entities
  • GET /api/timeseries_topic
  • GET /api/timeseries_entity
  • GET /api/topic_articles
  • GET /api/entity_articles
  • GET /api/ingest_status
  • GET /api/ingest_quality
  • GET /api/ingest_stages
  • POST /api/ingest_now
  • POST /api/history_backfill

Config Highlights

  • POLL_SECONDS: ingest poll interval
  • DEFAULT_WINDOW_HOURS: default lookback window
  • USE_DAILY_MARTS_THRESHOLD_HOURS: window length above which queries switch to the hourly+daily hybrid plan
  • DQ_MIN_PARSED_ROWS, DQ_MAX_PARSED_ROWS: batch quality limits
  • HISTORY_BACKFILL_MAX_HOURS: maximum on-demand historical backfill window
  • GDELT_SSL_VERIFY, GDELT_CA_BUNDLE, GDELT_ALLOW_INSECURE_FALLBACK: upstream TLS behavior

Architecture (In Depth)

System Boundaries

frontend/:

  • React SPA (Vite build) for map, ranking tables, drilldowns, and operational panels.
  • Uses TanStack Query for pull-based refresh and local cache invalidation.
  • Uses WebSocket updates (/ws/updates) to react quickly to newly ingested batches.

backend/:

  • FastAPI service exposing read APIs, ingest control APIs, and WebSocket updates.
  • DuckDB-backed analytical store for both fact tables and operational metadata.
  • Ingest worker runs continuously in-process (startup background tasks).

End-to-End Data Flow

GDELT lastupdate -> discover latest batch id
               -> download zip
               -> parse + aggregate rows
               -> data quality validation
               -> contract write
               -> persist hourly facts + daily marts + article subsets
               -> emit ingest status + WS update
               -> UI refetches affected queries

Ingest Pipeline Stages

Stage sequence: discover -> download -> aggregate -> validate -> contract -> persist -> complete

Stage behavior:

  1. discover
  • Reads lastupdate-translation.txt.
  • Resolves current batch metadata (batch_id, file URL).
  • Short-circuits with already_ingested when the batch is already recorded.
  2. download
  • Downloads *.translation.gkg.csv.zip into the cache.
  • Supports TLS verification and an optional insecure fallback when upstream certs are invalid.
  3. aggregate
  • Parses tab-separated GKG lines.
  • Maps FIPS location codes to ISO country codes.
  • Produces hourly aggregates for:
    • country mentions/tone
    • theme mentions/tone
    • entity mentions/tone
  • Produces bounded URL subsets for topic/entity drilldowns.
  4. validate
  • Runs batch quality expectations (row-count bounds, no negative mentions, tone sanity).
  • Writes the quality result into ingestion_quality.
  5. contract
  • Writes a stage contract row (ingestion_contract) with materialized batch stats + quality result.
  • The persist stage is gated by a successful contract (quality_success=true).
  6. persist
  • Upserts hourly fact tables.
  • Upserts daily marts for long-window acceleration.
  • Upserts article-card tables.
  • Records ingestion_batches.
  7. complete
  • Writes stage event log entries and the final ingest status.
  • Broadcasts a websocket event to connected clients.
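
The stage sequence can be sketched as a small driver loop. This is an illustrative stand-in, not the service's actual implementation; stage names and statuses mirror the list above:

```python
def run_pipeline(stages, ctx):
    """Run named stage functions in order, recording start and result events.

    Stops early when a stage fails or reports the batch as already ingested,
    mirroring the discover short-circuit and the contract gate.
    """
    events = []
    for name, fn in stages:
        events.append((name, "started"))
        status = fn(ctx)
        events.append((name, status))
        if status in ("failed", "already_ingested"):
            break
    return events

# Toy usage: discover short-circuits on a batch id that was already ingested.
def discover(ctx):
    return "already_ingested" if ctx["batch_id"] in ctx["seen"] else "success"

events = run_pipeline([("discover", discover)], {"batch_id": "b1", "seen": {"b1"}})
```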

Storage Model (DuckDB)

Primary dimensional table:

  • countries: FIPS -> ISO2/ISO3/name/continent mapping.

Hourly facts (high-fidelity recent data):

  • country_hour(hour_ts, iso2, mentions, tone_sum)
  • theme_hour(hour_ts, iso2, theme, mentions, tone_sum)
  • entity_hour(hour_ts, iso2, entity_type, entity, mentions, tone_sum)

Daily marts (long-window acceleration):

  • country_day(day_ts, iso2, mentions, tone_sum)
  • theme_day(day_ts, iso2, theme, mentions, tone_sum)
  • entity_day(day_ts, iso2, entity_type, entity, mentions, tone_sum)

Drilldown subsets (URL-level):

  • theme_article(hour_ts, iso2, theme, url, source, source_country, tone_sum, mentions)
  • entity_article(hour_ts, iso2, entity_type, entity, url, source, source_country, tone_sum, mentions)

Operational/observability tables:

  • ingestion_batches
  • ingestion_quality
  • ingestion_contract
  • ingestion_stage_events
  • article_meta (best-effort title/image/description cache)

Query Planner Strategy

Short windows:

  • Reads directly from hourly tables for maximum precision.

Long windows:

  • Uses hybrid plan:
    • hourly rows for edge periods
    • daily marts for interior period
  • Controlled by USE_DAILY_MARTS_THRESHOLD_HOURS.

Reason:

  • Keeps query latency stable for 30-day windows without losing near-real-time detail at the edges.

Consistency and Concurrency

Write path:

  • Write operations run behind backend lock wrappers (with_lock) to avoid concurrent conflicting writes.

Read path:

  • API reads are lock-light (with_read_lock) and optimized for dashboard responsiveness.

Ingest control:

  • Ingest run lock prevents overlapping ingest cycles.
  • Manual trigger endpoint and background loop share the same lock boundary.
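
One way to share a single lock boundary between the background loop and the manual trigger is an asyncio.Lock, sketched below. Names are assumptions; the service's actual with_lock wrappers may differ:

```python
import asyncio

_ingest_lock = asyncio.Lock()  # shared by the poll loop and /api/ingest_now

async def run_ingest_once() -> str:
    if _ingest_lock.locked():
        return "skipped"  # another ingest cycle is already running
    async with _ingest_lock:
        await asyncio.sleep(0)  # stand-in for the actual ingest work
        return "completed"

async def demo():
    # Two concurrent triggers: only one cycle actually runs.
    return await asyncio.gather(run_ingest_once(), run_ingest_once())
```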

Failure Model and Recovery

Expected failure classes:

  • Upstream fetch errors (network/TLS)
  • Parsing anomalies in raw feed
  • Quality-rule failures

Behavior:

  • Errors are captured in ingest status and stage events.
  • Failed quality batches do not pass contract gate and therefore are not persisted into fact tables.
  • Loop continues polling after failures (no permanent halt).

Frontend Runtime Model

Data refresh:

  • Polls APIs on interval via TanStack Query.
  • Ingest panels poll faster while ingest is running.
  • WebSocket notifications trigger near-real-time refresh behavior.

UX semantics:

  • Table mentions come from aggregate facts.
  • Article cards are a constrained subset and may not exist for every aggregate mention.
  • about sentiment reflects the table aggregate when available; from sentiment reflects the displayed article subset.

Parser and Record-Admissibility Contract

A raw GKG row contributes to aggregate mentions only if all conditions are true:

  • Valid V2.1DATE parseable into UTC hour bucket.
  • Valid V2Tone first value parseable as float.
  • At least one valid location country code extracted from V2Locations.

A raw row may still contribute to aggregate mentions even if:

  • URL is missing/unparseable.
  • Theme/entity fields are empty (it still contributes to country aggregates).

Implication:

  • Mention aggregates are intentionally broader than article-card subsets.
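
The admissibility contract can be expressed as a single predicate. Field names here are illustrative; the parser's real record shape may differ:

```python
from datetime import datetime

def admissible(row: dict) -> bool:
    """True when a parsed GKG row may contribute to aggregate mentions."""
    try:
        # Valid V2.1DATE, parseable into a UTC hour bucket.
        datetime.strptime(row.get("v21date", ""), "%Y%m%d%H%M%S")
    except ValueError:
        return False
    try:
        # Valid first V2Tone value, parseable as float.
        float(row.get("v2tone", "").split(",")[0])
    except ValueError:
        return False
    # At least one mapped location country; URL and themes may be absent.
    return bool(row.get("countries"))
```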

Cardinality and Mention Expansion Rules

For a single parsed document:

  • Country mentions: incremented once per distinct mapped country.
  • Theme mentions: incremented for each (country, theme) pair.
  • Entity mentions: incremented for each (country, entity_type, entity) pair.

This means one document can inflate mentions across multiple countries/themes/entities. This is expected and required for co-mention analysis.
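
The expansion rules can be sketched as follows (function and argument names are illustrative):

```python
def expand_mentions(countries, themes, entities):
    """Fan one parsed document out into per-table mention increments.

    countries: iterable of ISO2 codes; themes: iterable of theme names;
    entities: iterable of (entity_type, entity) pairs.
    """
    country_rows = [(c,) for c in countries]                              # once per country
    theme_rows = [(c, t) for c in countries for t in themes]              # per (country, theme)
    entity_rows = [(c, et, e) for c in countries for (et, e) in entities]  # per (country, type, entity)
    return country_rows, theme_rows, entity_rows
```

A document mapped to 2 countries with 2 themes therefore contributes 4 theme-mention increments, which is the cross-product behavior the co-mention analysis relies on.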

URL Subset Materialization Rules

URL-level tables are bounded by three controls:

  • MAX_ARTICLE_URLS_PER_BATCH
  • MAX_ARTICLE_URLS_PER_THEME
  • MAX_ARTICLE_URLS_PER_ENTITY

Rows are deduplicated by primary keys:

  • theme rows: (hour_ts, iso2, theme, url)
  • entity rows: (hour_ts, iso2, entity_type, entity, url)

Consequence:

  • URL tables are a representative drilldown subset, not full coverage.
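
A sketch of the bounding and dedup logic for theme rows; the cap arguments mirror the config names above, but the function itself is hypothetical:

```python
from collections import Counter

def retain_theme_urls(rows, per_theme_cap=50, batch_cap=1000):
    """Keep a bounded, deduplicated subset of (hour_ts, iso2, theme, url) rows."""
    seen = set()
    per_theme = Counter()
    kept = []
    for row in rows:
        hour_ts, iso2, theme, url = row
        if row in seen:
            continue  # primary-key dedup on (hour_ts, iso2, theme, url)
        if per_theme[theme] >= per_theme_cap or len(kept) >= batch_cap:
            continue  # MAX_ARTICLE_URLS_PER_THEME / MAX_ARTICLE_URLS_PER_BATCH
        seen.add(row)
        per_theme[theme] += 1
        kept.append(row)
    return kept
```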

Stage Event and Contract Schemas

ingestion_stage_events event payload model:

  • event_id: UUID
  • batch_id: nullable for pre-discovery/no-update events
  • stage: discover/download/aggregate/validate/persist/complete/ingest
  • status: started/success/failed/error/skipped/already_ingested/no_update
  • detail_json: stage-specific payload
  • created_at: server timestamp

ingestion_contract model (gate between validate and persist):

  • batch_id, contract_version
  • materialized batch stats (parsed_rows, country_rows, theme_rows, entity_rows)
  • quality_success
  • serialized quality report JSON

Gate invariant:

  • Persist stage must not execute unless contract exists and quality_success=true.
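
The invariant reduces to a small guard, sketched here with field names taken from the contract model above:

```python
def may_persist(contract) -> bool:
    """Persist is allowed only when a contract row exists and quality passed."""
    return contract is not None and contract.get("quality_success") is True
```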

Idempotency Model

Idempotency is guaranteed at two levels:

  • Batch-level: ingestion_batches.batch_id prevents re-ingesting the same batch.
  • Table-level: upserts on fact/article tables merge repeated writes safely.

Operational behavior:

  • Poll loop may discover the same latest batch repeatedly.
  • Backend suppresses noisy duplicate already_ingested stage logs for the same batch over a cooldown window.
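
Table-level idempotency comes from keyed upserts. The sketch below uses sqlite3 as a stand-in for DuckDB (both support INSERT ... ON CONFLICT); the schema and values are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE country_hour ("
    " hour_ts TEXT, iso2 TEXT, mentions INTEGER, tone_sum REAL,"
    " PRIMARY KEY (hour_ts, iso2))"
)

def upsert_country_hour(hour_ts, iso2, mentions, tone_sum):
    # Re-writing the same batch replaces the keyed row instead of duplicating it.
    con.execute(
        "INSERT INTO country_hour VALUES (?, ?, ?, ?) "
        "ON CONFLICT(hour_ts, iso2) DO UPDATE SET "
        "mentions = excluded.mentions, tone_sum = excluded.tone_sum",
        (hour_ts, iso2, mentions, tone_sum),
    )

upsert_country_hour("2024-01-01 12:00", "US", 10, -3.2)
upsert_country_hour("2024-01-01 12:00", "US", 10, -3.2)  # idempotent re-ingest
```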

Query Path Contracts

/api/top_topics and /api/top_entities:

  • Source: aggregate facts (theme_hour/day, entity_hour/day)
  • Guarantee: mention/tone reflects full aggregate scope for selected region + window.

/api/topic_articles and /api/entity_articles:

  • Source: URL subset tables (theme_article, entity_article)
  • Guarantee: only reflects retained URL subset within window.

scope=about:

  • Filters by topic/entity association and region membership (iso2 on article row).

scope=from:

  • Filters by inferred publisher country (source_country) derived from URL host/TLD.
  • Coverage depends on domain-country inferability.
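
Publisher-country inference from the URL host can be sketched as below; the mapping is a tiny illustrative subset, not the real lookup table:

```python
from urllib.parse import urlparse

CCTLD_TO_ISO2 = {"uk": "GB", "de": "DE", "fr": "FR", "jp": "JP"}  # illustrative subset

def infer_source_country(url: str):
    """Best-effort publisher country from the host's country-code TLD."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_ISO2.get(tld)  # None for generic TLDs like .com
```

Generic TLDs return None, which is exactly the coverage gap that makes from-scope article cards sparser than aggregate mentions.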

Time-Window Planner Internals

For windows below threshold:

  • Planner reads hourly tables only.

For windows above threshold:

  • Planner composes:
    • leading partial-day hourly slice
    • middle full-day daily-mart slice
    • trailing current-day hourly slice

This preserves edge accuracy while reducing scan volume for multi-week windows.
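
The three-slice composition can be sketched as follows; the default threshold and function name are assumptions:

```python
from datetime import datetime, timedelta

def plan_slices(start: datetime, end: datetime, threshold_hours: int = 72):
    """Split a query window into hourly edges and a daily-mart interior."""
    if end - start <= timedelta(hours=threshold_hours):
        return [("hourly", start, end)]  # short window: hourly tables only
    first_midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if start > first_midnight:
        first_midnight += timedelta(days=1)
    last_midnight = end.replace(hour=0, minute=0, second=0, microsecond=0)
    slices = []
    if start < first_midnight:
        slices.append(("hourly", start, first_midnight))        # leading partial day
    if first_midnight < last_midnight:
        slices.append(("daily", first_midnight, last_midnight))  # interior full days
    if last_midnight < end:
        slices.append(("hourly", last_midnight, end))            # trailing partial day
    return slices
```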

Locks, Threads, and Execution Model

Concurrency controls:

  • async run lock prevents overlapping ingest runs.
  • task lock prevents duplicate manual-trigger tasks.
  • DB write lock serializes upsert-heavy operations.

CPU/IO split:

  • Zip parsing/aggregation is executed in worker threads to avoid blocking event loop.
  • Network fetches and API handling remain async.

Performance Characteristics

Main hot spots:

  • Zip parsing and per-row split/tokenization.
  • Wide group-by writes during upsert into fact tables.
  • Metadata fetch for external article pages (best-effort).

Main latency levers:

  • reducing MAX_THEMES_PER_DOC and MAX_ENTITIES_PER_DOC
  • tuning article subset caps
  • adjusting USE_DAILY_MARTS_THRESHOLD_HOURS

Backfill Path (Topic URL Recovery)

/api/topic_articles_backfill:

  • Scans recent expected batch timestamps backward from latest known batch.
  • Downloads missing zips on demand (unless already cached).
  • Extracts matching topic URLs and upserts theme_article.
  • Optional cache cleanup controlled by BACKFILL_KEEP_ZIPS.

Use case:

  • Recover article-card coverage when aggregate mentions exist but the URL subset is sparse.

Operational Runbook

If ingest appears stale:

  • Check GET /api/ingest_status for running, phase, last_error.
  • Check GET /api/ingest_stages for most recent stage transitions.
  • Trigger POST /api/ingest_now.

If quality failures occur:

  • Inspect GET /api/ingest_quality.
  • Verify parsed row counts against expected feed size.
  • Adjust DQ_MIN_PARSED_ROWS/DQ_MAX_PARSED_ROWS only with evidence.

If “mentions but no articles” is reported:

  • Confirm scope (about vs from).
  • Run topic backfill.
  • Verify URL-country inferability constraints for from.

Known Tradeoffs and Limits

  • This is a single-process service architecture; no external queue/orchestrator yet.
  • DuckDB is ideal for local analytics, but write concurrency is intentionally serialized.
  • URL-country inference via TLD is heuristic and incomplete.
  • Article metadata enrichment depends on third-party site availability and SSL behavior.

To-Dos

  • Add multi-source ingestion beyond news web domains.
  • Ingest and persist SourceCollectionIdentifier from raw GKG rows.
  • Add social-in-news source pipeline (e.g., daily social link datasets referenced by GDELT).
  • Add TV-GKG as a second source family and map into shared fact schemas.
  • Add source_type and source_family dimensions to facts and article tables.
  • Add UI filters/toggles for source scope: News, Social, TV, All.
  • Expose source coverage diagnostics in technical panels (per source type, per batch).

Notes

  • GDELT location country codes are FIPS; backend maps to ISO2/ISO3.
  • Theme interpretations are sourced from GDELT theme lookup and stored in:
    • backend/app/ingest/data/gdelt_theme_interpretations.json
