Problem
Merging banks must consolidate customer, account, and transaction data that live in different schemas and file formats. Today this is done with spreadsheets and ad-hoc scripts: slow, error-prone, hard to audit, and difficult to reproduce under regulatory scrutiny.
What We Built — EY DataFusion
EY DataFusion is a working, end-to-end platform that ingests two banks’ tabular datasets (CSV/XLSX/JSON), profiles each table, auto-pairs tables across banks, proposes column mappings with per-signal confidence and explanations, merges with lineage, validates against contracts, and exports audit-ready documentation (Markdown + normalized JSON) signed with a SHA-256 hash. It runs as a FastAPI backend with a React frontend and includes CI, Docker Compose, and an evidence bundle tool.
Why it’s different
- No mispairs: a Table Pairing Engine auto-pairs Bank A ↔ Bank B tables (Accounts, Customers, Loans…) using profiles + semantic tags + Hungarian matching.
- Explainable AI: each suggestion shows per-signal scores (name/type/value-overlap/semantic), reasons, warnings, and masked sample values.
- Audit-first: lineage columns, versioned manifest, run ledger with file/manifest hashes, presigned artifact URLs, and an evidence bundle generator.
- Human-in-the-loop: analysts approve or reject with one click; transforms are applied via a safe DSL (`to_datetime`, `concat`, `map_values`, …).
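The safe DSL can be pictured as a whitelist registry: only named, pre-registered operations are dispatchable, so no arbitrary code reaches the merge pipeline. A minimal sketch (the names `TRANSFORMS` and `apply_transform` are illustrative, not the platform's actual API):

```python
from datetime import datetime

# Whitelist of allowed transforms; anything not registered is rejected.
TRANSFORMS = {
    "to_datetime": lambda v, fmt="%Y-%m-%d": datetime.strptime(v, fmt),
    "concat": lambda *parts, sep=" ": sep.join(str(p) for p in parts),
    "map_values": lambda v, mapping: mapping.get(v, v),
}

def apply_transform(op: str, *args, **kwargs):
    """Dispatch to a registered transform; unknown ops raise immediately."""
    if op not in TRANSFORMS:
        raise ValueError(f"unknown transform: {op}")
    return TRANSFORMS[op](*args, **kwargs)
```

For example, `apply_transform("map_values", "CHK", {"CHK": "checking"})` normalizes a bank-specific account code, while an unregistered op name fails loudly instead of executing.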
🔐 Secure by default (Regulated Mode)
- Masked examples (email/phone/IBAN), optional masking on profile samples.
- Embeddings off (no outbound LLM calls); API-key auth, strict CORS, optional OpenAPI disable.
- Presigned artifact URLs with TTL; PII redaction in logs.
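Deterministic masking (mentioned under Challenges below) means the same input always yields the same masked output, so analysts can still spot matching values across banks without seeing PII. A hedged sketch for emails only; the platform's actual masking rules for phone/IBAN may differ:

```python
import hashlib

def mask_email(value: str) -> str:
    """Deterministically mask an email: keep the first character and the
    domain, replace the rest of the local part with a stable hash prefix.
    Illustrative only -- not the platform's exact masking scheme."""
    local, _, domain = value.partition("@")
    digest = hashlib.sha256(local.encode()).hexdigest()[:6]
    return f"{local[:1]}***{digest}@{domain}"
```

Because the hash prefix is stable, `alice@bank-a.com` masks identically on every run, which keeps value-overlap signals usable on masked samples.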
🧠 How it works (end-to-end)
- Upload & Profile → detect types, nulls, uniqueness, semantic tags.
- Auto-Pairing → match tables across banks (e.g., Accounts ↔ Deposit Accounts, Customers ↔ Customers).
- Suggest Mappings → AI proposes column pairs with explanations & confidence.
- Merge Preview → apply decisions + transforms; add lineage columns.
- Validate → rules (not-null, unique, regex, enums, ranges, date order, outliers); `gate_blocked=true` prevents export until issues are fixed.
- Export Docs → Markdown + normalized JSON + `manifest_hash` (SHA-256); evidence bundle zip.
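The `manifest_hash` signing step can be sketched as a SHA-256 over canonical JSON: serializing with sorted keys and fixed separators makes the hash depend only on content, not key order. The helper below is illustrative (the backend uses orjson, but the idea is the same):

```python
import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    """SHA-256 of a normalized manifest. Canonical serialization (sorted
    keys, no extra whitespace) guarantees the same content always yields
    the same hash, which is what makes the export reproducible."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Two manifests with the same content but different key order hash identically, so an auditor can recompute and verify the signature from the normalized JSON alone.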
Confidence formula
$$
\text{confidence} \;=\; 0.45\,\text{name} \;+\; 0.20\,\text{type} \;+\; 0.20\,\text{overlap} \;+\; 0.15\,\text{embedding}
$$
(Weights are env-configurable; embeddings disabled in Secure Mode.)
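The formula above maps directly to a weighted sum with env-configurable weights. A minimal sketch, assuming hypothetical variable names (`DF_W_NAME`, …) and one plausible treatment of Secure Mode (drop the embedding signal and renormalize the remaining weights):

```python
import os

# Defaults mirror the formula: 0.45 name + 0.20 type + 0.20 overlap + 0.15 embedding.
WEIGHTS = {
    "name": float(os.getenv("DF_W_NAME", "0.45")),
    "type": float(os.getenv("DF_W_TYPE", "0.20")),
    "overlap": float(os.getenv("DF_W_OVERLAP", "0.20")),
    "embedding": float(os.getenv("DF_W_EMBED", "0.15")),
}

def confidence(scores: dict, secure_mode: bool = False) -> float:
    """Weighted sum of per-signal scores in [0, 1]. In Secure Mode the
    embedding signal is unavailable, so its weight is removed and the
    rest are renormalized to keep confidence on the same scale."""
    weights = dict(WEIGHTS)
    if secure_mode:
        weights.pop("embedding")
        total = sum(weights.values())
        weights = {k: w / total for k, w in weights.items()}
    return sum(weights[k] * scores.get(k, 0.0) for k in weights)
```

With all signals at 1.0 the confidence is 1.0 in either mode, so the ~0.80 auto-approve threshold stays meaningful when embeddings are switched off.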
🏗️ Architecture
- Backend: FastAPI (Python 3.11), Pydantic v2, Pandas, RapidFuzz, SQLAlchemy, MinIO/S3, orjson, Uvicorn
- Frontend: React + TypeScript + Vite, Zustand, TanStack Query (MSW only in dev)
- DevOps/CI: Docker/Compose, GitHub Actions, Postgres (prod) / SQLite (dev)
📊 What we achieved (demo dataset)
- 60–85% auto-approved mappings at ~0.80 threshold (with table pairing).
- Validation gates prevent risky merges; the evidence bundle provides an instant audit trail.
🧪 What we learned
- Pairing tables before column matching removes most confusion and boosts precision.
- “Truth from backend” builds trust: the UI renders API-returned scores, thresholds, and stats, with no client-side heuristics.
🚧 Challenges
- Mixed Excel/CSV encodings & header collisions.
- Balancing privacy with usefulness of examples (deterministic masking).
- Designing reasons/warnings that non-technical users find helpful.
🗺️ What’s next
- Entity resolution & survivorship policies.
- Expanded validation contracts (KYC/AML, cross-table FK checks).
- Pluggable enrichment (FX normalization, ISO country codes) under policy controls.

