Problem

Merging banks must consolidate customer, account, and transaction data that live in different schemas and file formats. Today this is done with spreadsheets and ad-hoc scripts: slow, error-prone, hard to audit, and difficult to reproduce under regulatory scrutiny.

What We Built — EY DataFusion

EY DataFusion is a working, end-to-end platform for bank data consolidation. It ingests two banks’ tabular datasets (CSV/XLSX/JSON), profiles each table, auto-pairs tables across banks, proposes column mappings with per-signal confidence and explanations, merges with lineage, validates against contracts, and exports audit-ready documentation (Markdown + normalized JSON) signed with a SHA-256 manifest hash. It runs as a FastAPI backend with a React frontend and ships with CI, Docker Compose, and an evidence bundle tool.
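The SHA-256 signing of the normalized JSON can be sketched as hashing a canonicalized manifest. The field names below are illustrative, not the platform's actual schema:

```python
import hashlib
import json

# Illustrative manifest; real field names and contents may differ.
manifest = {
    "run_id": "demo-001",
    "tables": ["accounts", "customers"],
    "mappings_approved": 42,
}

# Normalize first: sorted keys + compact separators make the serialized
# bytes stable, so the same manifest always produces the same hash.
normalized = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
manifest_hash = hashlib.sha256(normalized).hexdigest()
print(manifest_hash)
```

Hashing the normalized form (rather than whatever key order the serializer happens to emit) is what makes the signature reproducible for auditors.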

Why it’s different

  • No mispairs: a Table Pairing Engine auto-pairs Bank A ↔ Bank B tables (Accounts, Customers, Loans…) using profiles + semantic tags + Hungarian matching.
  • Explainable AI: each suggestion shows per-signal scores (name/type/value-overlap/semantic), reasons, warnings, and masked sample values.
  • Audit-first: lineage columns, versioned manifest, run ledger with file/manifest hashes, presigned artifact URLs, and an evidence bundle generator.
  • Human-in-the-loop: analysts approve/reject with one click; transforms applied via a safe DSL (to_datetime, concat, map_values, …).
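The table-pairing step can be sketched with SciPy's Hungarian solver (`linear_sum_assignment`). The similarity matrix below is a hypothetical stand-in for the real profile + semantic-tag + name signals:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cross-bank similarity scores (rows = Bank A, cols = Bank B).
bank_a = ["accounts", "customers", "loans"]
bank_b = ["deposit_accounts", "clients", "loan_book"]
sim = np.array([
    [0.91, 0.12, 0.08],
    [0.10, 0.88, 0.05],
    [0.07, 0.09, 0.93],
])

# The Hungarian algorithm minimizes total cost, so negate similarity
# to find the globally optimal one-to-one pairing.
rows, cols = linear_sum_assignment(-sim)
pairs = [(bank_a[r], bank_b[c], float(sim[r, c])) for r, c in zip(rows, cols)]
for a, b, s in pairs:
    print(f"{a} <-> {b} (score={s:.2f})")
```

Solving the assignment globally (instead of greedily picking each table's best match) is what rules out two Bank A tables claiming the same Bank B table.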

🔐 Secure by default (Regulated Mode)

  • Masked examples (email/phone/IBAN), optional masking on profile samples.
  • Embeddings off (no outbound LLM calls); API-key auth, strict CORS, optional OpenAPI disable.
  • Presigned artifact URLs with TTL; PII redaction in logs.
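Deterministic masking of the kind used for sample values might look like this minimal sketch; the salting scheme and output format are assumptions, not the platform's actual code:

```python
import hashlib

def mask_email(value: str, salt: str = "demo-salt") -> str:
    """Deterministically mask an email: the same input always yields the
    same mask, so masked samples stay comparable across runs without
    exposing the underlying PII."""
    local, _, domain = value.partition("@")
    digest = hashlib.sha256((salt + local).encode()).hexdigest()[:8]
    return f"{local[0]}***{digest}@{domain}"

print(mask_email("jane.doe@examplebank.com"))
```

Determinism matters for audit: two reviewers looking at the same profile see identical masked examples, and value-overlap checks still work on the masked forms.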

🧠 How it works (end-to-end)

  1. Upload & Profile → detect types, nulls, uniqueness, semantic tags.
  2. Auto-Pairing → match tables across banks (e.g., Accounts ↔ Deposit Accounts, Customers ↔ Customers).
  3. Suggest Mappings → AI proposes column pairs with explanations & confidence.
  4. Merge Preview → apply decisions + transforms; add lineage columns.
  5. Validate → rules (not-null, unique, regex, enums, ranges, date order, outliers); failures set gate_blocked=true, which blocks export until they are fixed.
  6. Export Docs → Markdown + normalized JSON + manifest_hash (SHA-256); evidence bundle zip.
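Steps 5–6 can be illustrated with a toy validation gate. The contract rules and `gate_blocked` wiring below are a sketch, not the production contract format:

```python
import pandas as pd

# Toy merged table with deliberate contract violations.
merged = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],                               # duplicate id
    "email": ["a@x.com", None, "b@y.com", "not-an-email"],     # null + bad format
})

# Hypothetical contract: rule name -> count of failing rows.
rules = {
    "customer_id_unique": lambda df: int(df["customer_id"].duplicated().sum()),
    "email_not_null": lambda df: int(df["email"].isna().sum()),
    "email_regex": lambda df: int(
        (~df["email"].dropna().str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")).sum()
    ),
}

failures = {name: n for name, rule in rules.items() if (n := rule(merged)) > 0}
gate_blocked = bool(failures)  # export stays blocked until failures are fixed
print(gate_blocked, failures)
```

The point of the gate is that export is a function of validation state, not a separate button an analyst can press early.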

Confidence formula
$$ \text{confidence} \;=\; 0.45\,\text{name} \;+\; 0.20\,\text{type} \;+\; 0.20\,\text{overlap} \;+\; 0.15\,\text{embedding} $$ (Weights are env-configurable; embeddings disabled in Secure Mode.)
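In code, the weighted sum might look like the sketch below. The env-var names are hypothetical, and renormalizing the remaining weights when embeddings are off in Secure Mode is an assumption:

```python
import os

# Defaults match the documented formula; overridable via (hypothetical) env vars.
W = {
    "name": float(os.getenv("CONF_W_NAME", 0.45)),
    "type": float(os.getenv("CONF_W_TYPE", 0.20)),
    "overlap": float(os.getenv("CONF_W_OVERLAP", 0.20)),
    "embedding": float(os.getenv("CONF_W_EMBEDDING", 0.15)),
}

def confidence(signals: dict, secure_mode: bool = False) -> float:
    """Weighted sum of per-signal scores in [0, 1]. In Secure Mode the
    embedding signal is dropped and the remaining weights renormalized
    (an assumption; the real fallback may differ)."""
    weights = dict(W)
    if secure_mode:
        weights.pop("embedding")
        total = sum(weights.values())
        weights = {k: v / total for k, v in weights.items()}
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

score = confidence({"name": 0.9, "type": 1.0, "overlap": 0.8, "embedding": 0.7})
print(round(score, 3))
```

A mapping like the one above (0.45·0.9 + 0.20·1.0 + 0.20·0.8 + 0.15·0.7 = 0.87) would clear a ~0.80 auto-approve threshold.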


🏗️ Architecture

  • Backend: FastAPI (Python 3.11), Pydantic v2, Pandas, RapidFuzz, SQLAlchemy, MinIO/S3, orjson, Uvicorn
  • Frontend: React + TypeScript + Vite, Zustand, TanStack Query (MSW only in dev)
  • DevOps/CI: Docker/Compose, GitHub Actions, Postgres (prod) / SQLite (dev)

📊 What we achieved (demo dataset)

  • 60–85% of mappings auto-approved at a ~0.80 confidence threshold (with table pairing enabled).
  • Validation gates prevent risky merges; the evidence bundle provides an instant audit trail.

🧪 What we learned

  • Pairing tables before column matching removes most confusion and boosts precision.
  • “Truth from backend” builds trust: the UI renders API-returned scores, thresholds, and stats, with no client-side heuristics.

🚧 Challenges

  • Mixed Excel/CSV encodings & header collisions.
  • Balancing privacy with usefulness of examples (deterministic masking).
  • Designing reasons/warnings that non-technical users find helpful.
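For the mixed-encoding problem above, a common tactic is an encoding fallback chain; a minimal sketch, with an illustrative encoding order:

```python
import io
import pandas as pd

def read_csv_robust(raw: bytes) -> pd.DataFrame:
    """Try a chain of encodings common in bank exports; the first one
    that decodes cleanly wins."""
    for enc in ("utf-8-sig", "utf-8", "cp1252", "latin-1"):
        try:
            return pd.read_csv(io.BytesIO(raw), encoding=enc)
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError("no known encoding decoded the file")

# A cp1252-encoded export: invalid as UTF-8, so the chain falls through.
raw = "id,name\n1,Müller\n".encode("cp1252")
df = read_csv_robust(raw)
print(df.loc[0, "name"])
```

Because `latin-1` accepts any byte sequence, the chain always terminates; the trade-off is that a genuinely unknown encoding decodes to mojibake rather than an error, which is why the platform's profiling step still matters.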

🗺️ What’s next

  • Entity resolution & survivorship policies.
  • Expanded validation contracts (KYC/AML, cross-table FK checks).
  • Pluggable enrichment (FX normalization, ISO country codes) under policy controls.
