Problem

Merging banks must consolidate customer, account, and transaction data that live in different schemas and file formats. Today this is done with spreadsheets and ad-hoc scripts: slow, error-prone, hard to audit, and difficult to reproduce under regulatory scrutiny.

What We Built — EY DataFusion

EY DataFusion is a working, end-to-end platform for bank data consolidation. It ingests two banks’ tabular datasets (CSV/XLSX/JSON), profiles each table, auto-pairs tables across banks, proposes column mappings with per-signal confidence and explanations, merges with lineage, validates against contracts, and exports audit-ready documentation (Markdown + normalized JSON) signed with a SHA-256 manifest hash. It runs as a FastAPI backend with a React frontend and ships with CI, Docker Compose, and an evidence bundle tool.
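The SHA-256 signing of the normalized JSON can be sketched as hashing a canonicalized manifest. The field names below are illustrative, not the platform's actual schema:

```python
import hashlib
import json

# Illustrative manifest; real field names and contents may differ.
manifest = {
    "run_id": "demo-001",
    "tables": ["accounts", "customers"],
    "mappings_approved": 42,
}

# Normalize first: sorted keys + compact separators make the serialized
# bytes stable, so the same manifest always produces the same hash.
normalized = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
manifest_hash = hashlib.sha256(normalized).hexdigest()
print(manifest_hash)
```

Hashing the normalized form (rather than whatever key order the serializer happens to emit) is what makes the signature reproducible for auditors.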

Why it’s different

  • No mispairs: a Table Pairing Engine auto-pairs Bank A ↔ Bank B tables (Accounts, Customers, Loans…) using profiles + semantic tags + Hungarian matching.
  • Explainable AI: each suggestion shows per-signal scores (name/type/value-overlap/semantic), reasons, warnings, and masked sample values.
  • Audit-first: lineage columns, versioned manifest, run ledger with file/manifest hashes, presigned artifact URLs, and an evidence bundle generator.
  • Human-in-the-loop: analysts approve/reject with one click; transforms applied via a safe DSL (to_datetime, concat, map_values, …).
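The table-pairing step can be sketched with SciPy's Hungarian solver (`linear_sum_assignment`). The similarity matrix below is a hypothetical stand-in for the real profile + semantic-tag + name signals:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cross-bank similarity scores (rows = Bank A, cols = Bank B).
bank_a = ["accounts", "customers", "loans"]
bank_b = ["deposit_accounts", "clients", "loan_book"]
sim = np.array([
    [0.91, 0.12, 0.08],
    [0.10, 0.88, 0.05],
    [0.07, 0.09, 0.93],
])

# The Hungarian algorithm minimizes total cost, so negate similarity
# to find the globally optimal one-to-one pairing.
rows, cols = linear_sum_assignment(-sim)
pairs = [(bank_a[r], bank_b[c], float(sim[r, c])) for r, c in zip(rows, cols)]
for a, b, s in pairs:
    print(f"{a} <-> {b} (score={s:.2f})")
```

Solving the assignment globally (instead of greedily picking each table's best match) is what rules out two Bank A tables claiming the same Bank B table.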

🔐 Secure by default (Regulated Mode)

  • Masked examples (email/phone/IBAN), optional masking on profile samples.
  • Embeddings off (no outbound LLM calls); API-key auth, strict CORS, optional OpenAPI disable.
  • Presigned artifact URLs with TTL; PII redaction in logs.
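Deterministic masking of the kind used for sample values might look like this minimal sketch; the salting scheme and output format are assumptions, not the platform's actual code:

```python
import hashlib

def mask_email(value: str, salt: str = "demo-salt") -> str:
    """Deterministically mask an email: the same input always yields the
    same mask, so masked samples stay comparable across runs without
    exposing the underlying PII."""
    local, _, domain = value.partition("@")
    digest = hashlib.sha256((salt + local).encode()).hexdigest()[:8]
    return f"{local[0]}***{digest}@{domain}"

print(mask_email("jane.doe@examplebank.com"))
```

Determinism matters for audit: two reviewers looking at the same profile see identical masked examples, and value-overlap checks still work on the masked forms.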

🧠 How it works (end-to-end)

  1. Upload & Profile → detect types, nulls, uniqueness, semantic tags.
  2. Auto-Pairing → match tables across banks (e.g., Accounts ↔ Deposit Accounts, Customers ↔ Customers).
  3. Suggest Mappings → AI proposes column pairs with explanations & confidence.
  4. Merge Preview → apply decisions + transforms; add lineage columns.
  5. Validate → rules (not-null, unique, regex, enums, ranges, date order, outliers); failures set gate_blocked=true, which blocks export until they are fixed.
  6. Export Docs → Markdown + normalized JSON + manifest_hash (SHA-256); evidence bundle zip.
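Steps 5–6 can be illustrated with a toy validation gate. The contract rules and `gate_blocked` wiring below are a sketch, not the production contract format:

```python
import pandas as pd

# Toy merged table with deliberate contract violations.
merged = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],                               # duplicate id
    "email": ["a@x.com", None, "b@y.com", "not-an-email"],     # null + bad format
})

# Hypothetical contract: rule name -> count of failing rows.
rules = {
    "customer_id_unique": lambda df: int(df["customer_id"].duplicated().sum()),
    "email_not_null": lambda df: int(df["email"].isna().sum()),
    "email_regex": lambda df: int(
        (~df["email"].dropna().str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")).sum()
    ),
}

failures = {name: n for name, rule in rules.items() if (n := rule(merged)) > 0}
gate_blocked = bool(failures)  # export stays blocked until failures are fixed
print(gate_blocked, failures)
```

The point of the gate is that export is a function of validation state, not a separate button an analyst can press early.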

Confidence formula
$$ \text{confidence} \;=\; 0.45\,\text{name} \;+\; 0.20\,\text{type} \;+\; 0.20\,\text{overlap} \;+\; 0.15\,\text{embedding} $$ (Weights are env-configurable; embeddings disabled in Secure Mode.)
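In code, the weighted sum might look like the sketch below. The env-var names are hypothetical, and renormalizing the remaining weights when embeddings are off in Secure Mode is an assumption:

```python
import os

# Defaults match the documented formula; overridable via (hypothetical) env vars.
W = {
    "name": float(os.getenv("CONF_W_NAME", 0.45)),
    "type": float(os.getenv("CONF_W_TYPE", 0.20)),
    "overlap": float(os.getenv("CONF_W_OVERLAP", 0.20)),
    "embedding": float(os.getenv("CONF_W_EMBEDDING", 0.15)),
}

def confidence(signals: dict, secure_mode: bool = False) -> float:
    """Weighted sum of per-signal scores in [0, 1]. In Secure Mode the
    embedding signal is dropped and the remaining weights renormalized
    (an assumption; the real fallback may differ)."""
    weights = dict(W)
    if secure_mode:
        weights.pop("embedding")
        total = sum(weights.values())
        weights = {k: v / total for k, v in weights.items()}
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

score = confidence({"name": 0.9, "type": 1.0, "overlap": 0.8, "embedding": 0.7})
print(round(score, 3))
```

A mapping like the one above (0.45·0.9 + 0.20·1.0 + 0.20·0.8 + 0.15·0.7 = 0.87) would clear a ~0.80 auto-approve threshold.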


🏗️ Architecture

  • Backend: FastAPI (Python 3.11), Pydantic v2, Pandas, RapidFuzz, SQLAlchemy, MinIO/S3, orjson, Uvicorn
  • Frontend: React + TypeScript + Vite, Zustand, TanStack Query (MSW only in dev)
  • DevOps/CI: Docker/Compose, GitHub Actions, Postgres (prod) / SQLite (dev)

📊 What we achieved (demo dataset)

  • 60–85% of mappings auto-approved at a ~0.80 confidence threshold (with table pairing enabled).
  • Validation gates prevent risky merges; the evidence bundle provides an instant audit trail.

🧪 What we learned

  • Pairing tables before column matching removes most confusion and boosts precision.
  • “Truth from backend” builds trust: the UI renders API-returned scores, thresholds, and stats, with no client-side heuristics.

🚧 Challenges

  • Mixed Excel/CSV encodings & header collisions.
  • Balancing privacy with usefulness of examples (deterministic masking).
  • Designing reasons/warnings that non-technical users find helpful.
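For the mixed-encoding problem above, a common tactic is an encoding fallback chain; a minimal sketch, with an illustrative encoding order:

```python
import io
import pandas as pd

def read_csv_robust(raw: bytes) -> pd.DataFrame:
    """Try a chain of encodings common in bank exports; the first one
    that decodes cleanly wins."""
    for enc in ("utf-8-sig", "utf-8", "cp1252", "latin-1"):
        try:
            return pd.read_csv(io.BytesIO(raw), encoding=enc)
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError("no known encoding decoded the file")

# A cp1252-encoded export: invalid as UTF-8, so the chain falls through.
raw = "id,name\n1,Müller\n".encode("cp1252")
df = read_csv_robust(raw)
print(df.loc[0, "name"])
```

Because `latin-1` accepts any byte sequence, the chain always terminates; the trade-off is that a genuinely unknown encoding decodes to mojibake rather than an error, which is why the platform's profiling step still matters.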

🗺️ What’s next

  • Entity resolution & survivorship policies.
  • Expanded validation contracts (KYC/AML, cross-table FK checks).
  • Pluggable enrichment (FX normalization, ISO country codes) under policy controls.
