Disclaimer:
I created this repository to experiment with Apache Fluss (Incubating). Please note that the cluster sizing, security configuration, and fraud detection logic are intended for exploration only and are not production-ready.
Following my previous post introducing Fluss, here’s how I built a Streamhouse architecture leveraging Fluss, Flink, and Iceberg to:
- Process bank transactions in real time
- Detect fraud
- Serve data seamlessly across hot (sub-second latency) and cold (minutes latency) layers
Using Docker Compose, I deployed a small Fluss and Flink cluster.
I chose MinIO as the S3-compatible storage to hold Iceberg tables, and an Iceberg REST Catalog for metadata management.
Fluss Java Client creates three tables:
- Transaction Log Table (append-only)
- Account Primary Table (supports INSERT, UPDATE, DELETE)
- Enriched Fraud Log Table with the datalake option enabled (append-only)
When the datalake option is enabled, Fluss automatically initializes an Iceberg table on MinIO through the REST Catalog for historical data.
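For illustration, here is roughly what those three tables look like as Flink SQL DDL. The repo creates them through the Fluss Java Client; the schemas and names below are my own simplified assumptions, not the project's exact definitions:
CREATE TABLE transactions (
  txn_id     STRING,
  account_id STRING,
  amount     DECIMAL(12, 2),
  txn_time   TIMESTAMP(3),
  ptime AS PROCTIME()  -- processing-time attribute, used by the lookup join later
);  -- no primary key => append-only Log Table

CREATE TABLE accounts (
  account_id   STRING,
  account_name STRING,
  PRIMARY KEY (account_id) NOT ENFORCED  -- Primary Table: supports INSERT/UPDATE/DELETE
);

CREATE TABLE enriched_fraud (
  txn_id       STRING,
  account_id   STRING,
  account_name STRING,
  amount       DECIMAL(12, 2),
  txn_time     TIMESTAMP(3)
) WITH (
  'table.datalake.enabled' = 'true'  -- Fluss provisions the matching Iceberg table via the REST Catalog
);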
Finally, the client generates transaction and account records and pushes them to Fluss.
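In SQL terms, those writes amount to something like the following; the repo does this through the Java client API, and the values are made up (column names match the DDL sketch above):
INSERT INTO accounts VALUES
  ('acc-001', 'Alice'),
  ('acc-002', 'Bob');

INSERT INTO transactions (txn_id, account_id, amount, txn_time) VALUES
  ('txn-001', 'acc-001', 42.00, TIMESTAMP '2025-01-01 10:00:00'),
  ('txn-002', 'acc-002', 99999.00, TIMESTAMP '2025-01-01 10:00:05');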
The Flink-based fraud detection job continuously streams transactions from Fluss and identifies fraudulent records in real time.
- Once fraud is detected, the records are enriched with the account name by referencing the Account Primary Table.
- This enrichment uses a temporal streaming lookup join on the Account table, which serves as a dimension table.
- Finally, the enriched fraud records are appended to the Enriched Fraud Log Table.
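Expressed in Flink SQL, the job boils down to something like the sketch below. The repo ships it as a packaged Flink job, and the WHERE clause here is a placeholder threshold rather than the project's actual fraud logic (names follow the DDL sketch above):
SET 'execution.runtime-mode' = 'streaming';

INSERT INTO enriched_fraud
SELECT t.txn_id, t.account_id, a.account_name, t.amount, t.txn_time
FROM transactions AS t
JOIN accounts FOR SYSTEM_TIME AS OF t.ptime AS a  -- temporal lookup join on the Account Primary Table
  ON t.account_id = a.account_id
WHERE t.amount > 10000;  -- placeholder fraud rule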
The Fluss Tiering Service (a Flink job provided by the Fluss ecosystem) appends and compacts the enriched fraud records into the Iceberg table on MinIO.
Fluss highlights:
- Queryable Tables: Unlike Apache Kafka, where topics are not queryable, Fluss Log Tables allow direct querying for real-time insights.
- No More External Caches: Eliminate the need to deploy and scale a cache/DB/state store for lookups; just use Fluss Primary Tables.
- Column Pruning for Streaming Reads: Fluss supports column pruning for Log Tables and Primary Key Table changelogs, reducing data reads and network costs, even during streaming reads (see the sketch after this list).
- Automatic Tiering to Lakehouse: Real-time data is automatically compacted into Iceberg, Paimon, or Lance via a built-in Flink service, seamlessly bridging streaming and batch.
- Union Reads: Fluss enables combined reads of real-time and historical data (Fluss tables + Iceberg), delivering true real-time analytics without duplication.
- Cut Disk Storage Costs: Fluss offloads old data (Log Table segments and Primary Table snapshots) to remote storage, significantly reducing disk costs.
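Column pruning is easy to observe from the SQL client; a minimal sketch, assuming the illustrative transactions table from the DDL above:
SET 'execution.runtime-mode' = 'streaming';

-- Only txn_id and amount are read and sent over the network;
-- the remaining columns are pruned at the source.
SELECT txn_id, amount FROM transactions;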
Fluss/Flink required JARs:
https://fluss.apache.org/docs/streaming-lakehouse/integrate-data-lakes/iceberg/
Fraud Job JAR packaging cmd:
mvn package -f pom.xml

Flink Job submit parameters:
--parallelism 2 --fluss.bootstrap.servers coordinator-server:9122 --datalake.format iceberg --datalake.iceberg.type rest --datalake.iceberg.warehouse s3://fluss/data --datalake.iceberg.uri http://rest:8181 --datalake.iceberg.s3.endpoint http://minio:9000 --datalake.iceberg.s3.path-style-access true

Docker commands:
docker-compose down -v
docker-compose up --build

Flink SQL client commands:
CREATE CATALOG fluss_catalog WITH (
'type' = 'fluss',
'bootstrap.servers' = 'coordinator-server:9122'
);
USE CATALOG fluss_catalog;
SET 'execution.runtime-mode' = 'batch';
SET 'sql-client.execution.result-mode' = 'tableau';
-- Union read (batch mode): combines fresh rows still in Fluss with rows already tiered to Iceberg
SELECT * FROM fraud;

-- Lake-only read: the $lake suffix queries just the tiered Iceberg data
SELECT * FROM fraud$lake;