Featherman brings DuckDB's powerful DuckLake functionality to Kubernetes, enabling declarative management of data lakes with the simplicity of DuckDB and the scalability of cloud object storage.
This project is in early development. Any contributions are appreciated.
- 🦆 DuckDB-Native: Leverages DuckDB's simplicity and performance
- 🎯 Declarative: Define your data lake structure using Kubernetes CRDs
- ☁️ Cloud Storage: Seamless integration with S3-compatible object stores
- 🔒 Enterprise Ready: Built-in backup, encryption, and monitoring
- 🚀 Kubernetes-Native: Fully integrated with the K8s ecosystem
- ⚡ Warm Pod Pool: Pre-initialized DuckDB pods for low-latency queries
Prerequisites:
- Go 1.21+
- Docker
- KinD (Kubernetes in Docker)
- kubectl
Set up a local development cluster:

```bash
make kind-setup
```

Deploy MinIO (for local S3-compatible storage):

```bash
make minio-setup
```

Build and load the operator:

```bash
make docker-build
make kind-load
make deploy
```

Install Featherman:
```bash
# Clone the repository
git clone https://github.com/TFMV/featherman.git
cd featherman/operator

# Install CRDs
make install

# Deploy the operator
make deploy
```

Create a catalog:
```yaml
apiVersion: ducklake.featherman.dev/v1alpha1
kind: DuckLakeCatalog
metadata:
  name: example
spec:
  storageClass: standard
  size: 10Gi
  objectStore:
    endpoint: s3.amazonaws.com
    bucket: my-data-lake
    credentialsSecret:
      name: s3-credentials
  backupPolicy:
    schedule: "0 2 * * *" # Daily at 2 AM
    retentionDays: 7
```

Create a table:
```yaml
apiVersion: ducklake.featherman.dev/v1alpha1
kind: DuckLakeTable
metadata:
  name: users
spec:
  catalogRef: example
  name: users
  columns:
    - name: id
      type: INTEGER
    - name: name
      type: VARCHAR
  format:
    compression: ZSTD
    partitioning: ["created_at"]
```

Featherman follows a cloud-native architecture designed for reliability and scalability:
```mermaid
graph TD
  subgraph Control-Plane
    OP[Operator]
    BM[Backup Manager]
    WH[Webhooks]
    PM[Pool Manager]
  end
  subgraph Data-Plane
    JOB[DuckDB Jobs]
    PVC[Catalog PVC]
    S3[Object Store]
    WP[Warm Pods]
  end
  OP --> JOB
  OP --> PM
  PM --> WP
  BM --> S3
  JOB --> PVC
  JOB --> S3
  WP --> PVC
  WP --> S3
```
Control Plane
- Operator: Manages CRDs and orchestrates data operations
- Backup Manager: Handles scheduled backups and retention
- Webhooks: Validates and defaults resource configurations
- Pool Manager: Maintains warm pod pool for low-latency queries
Data Plane
- DuckDB Jobs: Ephemeral pods that execute SQL operations
- Warm Pods: Pre-initialized pods for immediate query execution
- Catalog Storage: Persistent volumes storing DuckDB metadata
- Object Store: S3-compatible storage for Parquet data files
- Separation of Concerns: Metadata and data are stored separately for better scalability
- Stateless Operations: All operations run in ephemeral jobs for reliability
- Cloud-Native Storage: Leverages object storage for data and K8s volumes for metadata
- Kubernetes Patterns: Follows standard K8s patterns like operator pattern and CRDs
- Performance Optimization: Warm pod pool eliminates cold start latency
The warm pod pool feature eliminates cold start latency by maintaining pre-initialized DuckDB pods ready to execute queries immediately. This is particularly useful for interactive workloads and low-latency query requirements.
```yaml
apiVersion: ducklake.featherman.dev/v1alpha1
kind: DuckLakePool
metadata:
  name: default-pool
spec:
  # Pool sizing
  minSize: 2
  maxSize: 10
  targetUtilization: 0.8

  # Pod template
  template:
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
      limits:
        memory: "4Gi"
        cpu: "2"

  # Lifecycle policies
  maxIdleTime: 300s   # Terminate pods idle > 5 min
  maxLifetime: 3600s  # Recycle pods after 1 hour
  maxQueries: 100     # Recycle after N queries

  # Scaling behavior
  scaleUpRate: 2      # Max pods to add per interval
  scaleDownRate: 1    # Max pods to remove per interval
  scaleInterval: 30s  # Evaluation interval

  # Catalog mounting
  catalogRef:
    name: main-catalog
    readOnly: true
```

- Performance: Eliminates cold start latency (typically 5-10s → <100ms)
- Resource Efficiency: Reuses initialized pods
- Predictable Latency: Consistent query response times
- Graceful Degradation: Falls back to Jobs if pool unavailable
- Cost Optimization: Scales based on actual demand
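The scaling behavior configured above (targetUtilization, scaleUpRate, scaleDownRate) can be sketched as a per-interval sizing decision. This is an illustrative model under the semantics implied by the spec, not the operator's actual implementation:

```python
import math

def desired_pool_size(busy: int, current: int,
                      min_size: int = 2, max_size: int = 10,
                      target_utilization: float = 0.8,
                      scale_up_rate: int = 2, scale_down_rate: int = 1) -> int:
    """Illustrative scaling decision for one evaluation interval."""
    # Size the pool so busy pods sit at the target utilization.
    target = math.ceil(busy / target_utilization) if busy else min_size
    # Clamp to the configured bounds.
    target = max(min_size, min(max_size, target))
    # Rate-limit the step taken this interval.
    if target > current:
        return min(target, current + scale_up_rate)
    return max(target, current - scale_down_rate)

print(desired_pool_size(busy=5, current=4))  # 6: wants ceil(5/0.8)=7, capped by scaleUpRate
```

Rate-limiting the step (rather than jumping straight to the target) avoids thrashing pods on bursty workloads while still converging within a few intervals.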
Featherman can materialize query results back to object storage. Define the
materializeTo block on a DuckLakeTable to export a view as Parquet.
```yaml
spec:
  materializeTo:
    enabled: true
    sql: "SELECT country, COUNT(*) AS cnt FROM users GROUP BY country"
    destination:
      bucket: my-exports
      prefix: top_users/
    format:
      type: parquet
      compression: ZSTD
```

When enabled, the operator runs the query and writes the results to the specified bucket and prefix.
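A spec like this maps naturally onto DuckDB's `COPY ... TO` statement. The sketch below shows how such a statement could be assembled from the materializeTo fields; the object key `result.parquet` is a hypothetical naming choice, and the operator's real codepath may differ:

```python
def build_copy_statement(sql: str, bucket: str, prefix: str,
                         compression: str = "ZSTD") -> str:
    """Assemble a DuckDB COPY statement from a materializeTo spec (illustrative)."""
    # Hypothetical object key under the configured bucket/prefix.
    destination = f"s3://{bucket}/{prefix}result.parquet"
    return (f"COPY ({sql}) TO '{destination}' "
            f"(FORMAT PARQUET, COMPRESSION {compression})")

stmt = build_copy_statement(
    sql="SELECT country, COUNT(*) AS cnt FROM users GROUP BY country",
    bucket="my-exports",
    prefix="top_users/",
)
print(stmt)
```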
The pool manager exposes Prometheus metrics for:
- ducklake_pool_size_current: Current number of pods
- ducklake_pool_size_desired: Target number of pods
- ducklake_pool_pods_idle: Number of idle pods
- ducklake_pool_pods_busy: Number of busy pods
- ducklake_pool_queue_length: Pending requests
- ducklake_pool_query_duration: Query execution time
- ducklake_pool_wait_duration: Time waiting for pod
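For dashboards or alerts, pool utilization can be derived from the busy/idle gauges; a minimal sketch (the metric names are as listed above, the sample values are illustrative):

```python
def pool_utilization(busy: int, idle: int) -> float:
    """Fraction of pool pods currently executing queries."""
    total = busy + idle
    # An empty pool reports zero utilization rather than dividing by zero.
    return busy / total if total else 0.0

# e.g. ducklake_pool_pods_busy=6, ducklake_pool_pods_idle=2
print(pool_utilization(busy=6, idle=2))  # 0.75
```

Comparing this ratio against the pool's targetUtilization shows whether the next scaling interval is likely to add or remove pods.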
The operator includes comprehensive test suites:
- Unit Tests: Test individual components and functions

```bash
make test
```

- End-to-End Tests: Test full operator functionality in a KinD cluster

```bash
make e2e-test
```

The E2E tests require:
- Running KinD cluster
- MinIO deployment (for S3 testing)
- Controller image built and loaded
- Proper RBAC and namespace configuration
featherman-query exposes a simple HTTP endpoint for ad-hoc SQL queries using the warm pod pool.
```bash
curl -X POST http://localhost:8080/query \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql":"SELECT 1","format":"csv","catalog":"example"}'
```

Results stream back as CSV (default) or Arrow IPC.
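The same request can be issued from Python with only the standard library. The endpoint path, JSON fields, and headers mirror the curl example above; the localhost URL assumes a port-forwarded service:

```python
import json
import urllib.request

def build_query_request(sql: str, catalog: str, token: str,
                        fmt: str = "csv",
                        url: str = "http://localhost:8080/query") -> urllib.request.Request:
    """Build a POST request for the featherman-query endpoint."""
    body = json.dumps({"sql": sql, "format": fmt, "catalog": catalog}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_query_request("SELECT 1", catalog="example", token="dev-token")
# resp = urllib.request.urlopen(req)  # uncomment against a live endpoint
# print(resp.read().decode())         # CSV rows stream back by default
```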
The operator exposes Prometheus metrics for:
- Catalog operations (create, update, delete)
- Storage usage
- Backup status
- Job durations
- Error counts
Metrics are served at the :8080/metrics endpoint (configurable).
MIT
