DataMorph — AI‑Driven ETL on Google Cloud Run
Turn messy CSV/JSON into clean, structured outputs with an AI‑assisted pipeline. DataMorph handles upload → profiling → canonicalization → transformation → export, with a simple API and a lightweight React UI.
Stack: .NET 9, React + TypeScript, Docker, Google Cloud (Cloud Run, Firestore, Firebase Auth, Cloud Storage, Pub/Sub, Eventarc)
How does this work?
DataMorph lets you upload messy datasets (CSV or JSON) and have Gemini perform the necessary ETL on them. Gemini can also answer any question you have about the dataset, clean it up for you, or even generate a new dataset based on what you've provided as input. Pretty much anything: the sky is the limit on what you can do with DataMorph.
DataMorph vs LLM Chat UI
You're probably wondering why you'd use DataMorph when you could upload a file to any LLM chat UI directly and ask for a response. For a one-off question, you could! But a raw LLM UI falls over when you need reliability, scale, governance, and integration. Here's what DataMorph does differently:
- Schema enforcement: you declare exactly what the output must look like, and violations are rejected or flagged.
- DataMorph emits strict JSONL with per-row errors; no prose, no markdown.
- DataMorph can handle large files (currently disabled, to avoid huge Gemini bills) via GCS uploads, chunked processing, and idempotency keys. LLM chat UIs cap out pretty quickly.
- Every run is tied to a hash, so the exact same prompt can be re-run against the same input and traced to its results.
- Cost control: a preflight profiler estimates tokens and cost, and cheap rule-based transformations run before anything reaches the LLM, avoiding unnecessary calls.
In a nutshell: DataMorph takes something messy and gives you cleaned-up data, versioned transform plans, scalable chunked processing, and cost control. That's something an ad-hoc LLM chat cannot operationalize.
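As a rough illustration of the "strict JSONL with per-row errors" contract and the cheap rule-based checks that run before any LLM call, here is a minimal TypeScript sketch. The field names, validation rule, and function name are invented for the example and are not DataMorph's actual schema:

```typescript
// One result object per input row: either clean data or a list of errors.
interface RowResult {
  row: number;
  data: Record<string, string> | null;
  errors: string[];
}

// Rule-based preflight: flag rows missing required fields, emit strict JSONL.
// No prose, no markdown - each output line is a single JSON object.
function validateRows(
  rows: Record<string, string>[],
  required: string[]
): string[] {
  return rows.map((data, i) => {
    const errors = required
      .filter((f) => !data[f] || data[f].trim() === "")
      .map((f) => `missing required field: ${f}`);
    const result: RowResult = {
      row: i,
      data: errors.length ? null : data,
      errors,
    };
    return JSON.stringify(result);
  });
}
```

Because each line is independently parseable JSON, downstream chunked processing can stream the output row by row rather than re-parsing a whole document.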
Behind the scenes
- When a file is uploaded in the UI, the frontend calls an HTTP API running on Cloud Run: `/init/pipelines`. This API creates a job record in Firestore with the user prompt, job id, and other metadata.
- The backend then returns a GCS signed URL for uploading the user's file, and the file is sent to that URL with a `PUT` request.
- Once the file lands in Google Cloud Storage, an Eventarc trigger picks it up and calls the `/parse` endpoint, which grabs the file and cleans it up. At this stage the PII mode is checked to ensure that PII-allowed fields are specified, and fields are scrubbed accordingly alongside the regular cleaning transformations. The transformed field data is then published to Pub/Sub as a message.
- The Pub/Sub message is then picked up by another Eventarc trigger. A sample message looks like:
`{"jobId":"c8017eb92aae43bca1cebcb90cef0d3e","bucket":"BUCKET_NAME","parsedPath":"parsed/c8017eb92aae43bca1cebcb90cef0d3e/data.jsonl","profilePath":"profile/c8017eb92aae43bca1cebcb90cef0d3e/profile.json","state":"READY_FOR_TRANSFORM"}`
On publish of a message, this Eventarc trigger calls the transformation API, which converts the file into a canonical format. The canonical file is stored in the output Google Cloud Storage bucket.
- A final trigger picks up the canonical file and transforms it according to the user prompt by calling the `gemini-2.5-flash-lite` model.
The job state in Firestore is constantly updated so that the user can keep track of which file is being processed and its current status.
Architecture

Notes
- The signed URL is time-limited and scope-limited to the job path. The download signed URL has a 7-day expiry, after which the user can no longer download the file.
- Rate limiting is set to one upload per user per 60 seconds. This prevents abuse of the system and keeps the server and Gemini bills that DataMorph racks up under control!
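A minimal in-memory sketch of that per-user rate limit. The real service presumably checks a persisted timestamp (e.g. in Firestore); this Map-based version, with invented names, only illustrates the 60-second window check:

```typescript
// 60-second window per user, matching the rate limit described above.
const WINDOW_MS = 60_000;

// Last accepted upload time per user id (in-memory stand-in for a
// persisted store such as Firestore).
const lastUpload = new Map<string, number>();

// Returns true and records the attempt if the user is outside the window;
// returns false (rejecting the upload) if they are still inside it.
function allowUpload(userId: string, nowMs: number): boolean {
  const last = lastUpload.get(userId);
  if (last !== undefined && nowMs - last < WINDOW_MS) {
    return false;
  }
  lastUpload.set(userId, nowMs);
  return true;
}
```

Taking the clock as a parameter keeps the check deterministic and easy to test; a rejected call could map to an HTTP 429 response from the upload endpoint.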
Built With
- .net
- cloudrun
- firebase
- firebaseauth
- firestore
- gcp
- googlecloudbucket
- googlecloudpubsub
- googleiam
- react
- typescript