A cloud-native pipeline for streaming synthetic logs from Kafka to Spark (Dataproc) and storing them in PostgreSQL, all provisioned with Terraform.
- Kafka VM: Runs Apache Kafka for log ingestion.
- Flog VM: Generates synthetic logs using flog and streams them to Kafka via a Node.js producer.
- Dataproc Cluster: Runs Spark streaming jobs to process logs from Kafka and write to PostgreSQL.
- Cloud SQL PostgreSQL: Stores processed log data.
Provisioned via Terraform:
- VPC network and firewall rules
- Compute instances for Kafka and Flog
- Dataproc cluster with custom service account and GCS buckets
- Cloud SQL PostgreSQL instance, database, and user
Edit terraform/cred.tfvars with your GCP project ID, region, and zone.
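The exact variable names are defined in `terraform/variables.tf`; assuming conventional names, a minimal `cred.tfvars` might look like this (values are placeholders):

```hcl
# terraform/cred.tfvars -- variable names are illustrative; match variables.tf
project_id = "my-gcp-project"
region     = "us-central1"
zone       = "us-central1-a"
```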
```sh
cd terraform
terraform init
terraform apply -var-file=cred.tfvars
```
After `terraform apply`, note the outputs:
- Kafka external/internal IP
- Flog external IP
- PostgreSQL public IP and connection string
- Dataproc cluster name
- Kafka VM: Installs and starts Kafka automatically.
- Flog VM: Installs flog and a Node.js producer that generates logs and sends them to Kafka.
- Dataproc Cluster: Runs a Spark streaming job (`stream-job2.py`) that:
  - Reads logs from Kafka
  - Parses and transforms them
  - Writes batches to PostgreSQL using JDBC
The Spark job is designed to run on Dataproc and write to PostgreSQL.
```sh
gcloud dataproc jobs submit pyspark gs://script-bucket-logflow/stream-job2.py \
  --cluster=spark-streaming-cluster \
  --region=us-central1 \
  --properties=spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,org.postgresql:postgresql:42.5.1 \
  -- \
  "<KAFKA_INTERNAL_IP>:9092" \
  "<POSTGRES_PUBLIC_IP>"
```
Replace `<KAFKA_INTERNAL_IP>` and `<POSTGRES_PUBLIC_IP>` with the Terraform outputs.
A null_resource in Terraform can submit the job automatically after provisioning.
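A sketch of such a `null_resource`, assuming resource names like `google_compute_instance.kafka`, `google_sql_database_instance.postgres`, and `google_dataproc_cluster.spark_cluster` (adapt to the actual names in `main.tf`, `postgres.tf`, and `dataproc.tf`):

```hcl
resource "null_resource" "submit_stream_job" {
  # Re-submit the job when the script changes (assumed local path).
  triggers = {
    script_hash = filemd5("../stream-job2.py")
  }

  provisioner "local-exec" {
    command = <<-EOT
      gcloud dataproc jobs submit pyspark gs://script-bucket-logflow/stream-job2.py \
        --cluster=spark-streaming-cluster \
        --region=us-central1 \
        --properties=spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,org.postgresql:postgresql:42.5.1 \
        -- "${google_compute_instance.kafka.network_interface[0].network_ip}:9092" \
           "${google_sql_database_instance.postgres.public_ip_address}"
    EOT
  }

  depends_on = [google_dataproc_cluster.spark_cluster]
}
```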
- Instance: Cloud SQL PostgreSQL, public IP enabled for development.
- Database: `logs`
- User: `loguser` / password from Terraform (`changeme123!` by default)
- Table: `kafka_logs` (created automatically by the Spark job)

Connection string example:
```
postgresql://loguser:changeme123!@<POSTGRES_PUBLIC_IP>/logs
```
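If you build the connection string programmatically, percent-encoding the password keeps arbitrary characters safe in the URI. A small Python sketch (the host is a placeholder for the Terraform output):

```python
from urllib.parse import quote

# Assumed values; substitute the Terraform outputs for your deployment.
user = "loguser"
password = "changeme123!"   # default from Terraform; change in production
host = "203.0.113.10"       # placeholder for <POSTGRES_PUBLIC_IP>
database = "logs"

# Percent-encode the password so special characters survive in the URI.
uri = f"postgresql://{user}:{quote(password, safe='')}@{host}/{database}"
print(uri)  # postgresql://loguser:changeme123%21@203.0.113.10/logs
```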
- `stream-job2.py`: Spark streaming job (Kafka → PostgreSQL)
- `terraform/main.tf`: Core infrastructure (Kafka, Flog, VPC)
- `terraform/dataproc.tf`: Dataproc cluster and buckets
- `terraform/postgres.tf`: Cloud SQL PostgreSQL resources
- `terraform/variables.tf`: Input variables
- `terraform/cred.tfvars`: Project credentials
- The PostgreSQL instance is open to all IPs (`0.0.0.0/0`) for development. Restrict this in production.
- Store secrets (like DB passwords) in Secret Manager for production.
MIT © 2025 Smit Patel