Featherman brings DuckDB's powerful DuckLake functionality to Kubernetes, enabling declarative management of data lakes with the simplicity of DuckDB and the scalability of cloud object storage.
This project is in early development. Any contributions are appreciated.
- 🦆 DuckDB-Native: Leverages DuckDB's simplicity and performance
- 🎯 Declarative: Define your data lake structure using Kubernetes CRDs
- ☁️ Cloud Storage: Seamless integration with S3-compatible object stores
- 🔒 Enterprise Ready: Built-in backup, encryption, and monitoring
- 🚀 Kubernetes-Native: Fully integrated with the K8s ecosystem
- ⚡ Warm Pod Pool: Pre-initialized DuckDB pods for low-latency queries
Prerequisites:
- Go 1.21+
- Docker
- KinD (Kubernetes in Docker)
- kubectl
Set up a local development cluster:

```bash
make kind-setup
```

Deploy MinIO (for local S3-compatible storage):

```bash
make minio-setup
```

Build and load the operator:

```bash
make docker-build
make kind-load
make deploy
```

Install Featherman:
```bash
# Clone the repository
git clone https://github.com/TFMV/featherman.git
cd featherman/operator

# Install CRDs
make install

# Deploy the operator
make deploy
```

Create a catalog:
```yaml
apiVersion: ducklake.featherman.dev/v1alpha1
kind: DuckLakeCatalog
metadata:
  name: example
spec:
  storageClass: standard
  size: 10Gi
  objectStore:
    endpoint: s3.amazonaws.com
    bucket: my-data-lake
    credentialsSecret:
      name: s3-credentials
  backupPolicy:
    schedule: "0 2 * * *" # Daily at 2 AM
    retentionDays: 7
```

Create a table:
```yaml
apiVersion: ducklake.featherman.dev/v1alpha1
kind: DuckLakeTable
metadata:
  name: users
spec:
  catalogRef: example
  name: users
  columns:
    - name: id
      type: INTEGER
    - name: name
      type: VARCHAR
  format:
    compression: ZSTD
    partitioning: ["created_at"]
```

Featherman follows a cloud-native architecture designed for reliability and scalability:
```mermaid
graph TD
  subgraph Control-Plane
    OP[Operator]
    BM[Backup Manager]
    WH[Webhooks]
    PM[Pool Manager]
  end
  subgraph Data-Plane
    JOB[DuckDB Jobs]
    PVC[Catalog PVC]
    S3[Object Store]
    WP[Warm Pods]
  end
  OP --> JOB
  OP --> PM
  PM --> WP
  BM --> S3
  JOB --> PVC
  JOB --> S3
  WP --> PVC
  WP --> S3
```
Control Plane
- Operator: Manages CRDs and orchestrates data operations
- Backup Manager: Handles scheduled backups and retention
- Webhooks: Validates and defaults resource configurations
- Pool Manager: Maintains warm pod pool for low-latency queries
Data Plane
- DuckDB Jobs: Ephemeral pods that execute SQL operations
- Warm Pods: Pre-initialized pods for immediate query execution
- Catalog Storage: Persistent volumes storing DuckDB metadata
- Object Store: S3-compatible storage for Parquet data files
- Separation of Concerns: Metadata and data are stored separately for better scalability
- Stateless Operations: All operations run in ephemeral jobs for reliability
- Cloud-Native Storage: Leverages object storage for data and K8s volumes for metadata
- Kubernetes Patterns: Follows standard K8s patterns like operator pattern and CRDs
- Performance Optimization: Warm pod pool eliminates cold start latency
The warm pod pool feature eliminates cold start latency by maintaining pre-initialized DuckDB pods ready to execute queries immediately. This is particularly useful for interactive workloads and low-latency query requirements.
```yaml
apiVersion: ducklake.featherman.dev/v1alpha1
kind: DuckLakePool
metadata:
  name: default-pool
spec:
  # Pool sizing
  minSize: 2
  maxSize: 10
  targetUtilization: 0.8

  # Pod template
  template:
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
      limits:
        memory: "4Gi"
        cpu: "2"

  # Lifecycle policies
  maxIdleTime: 300s   # Terminate pods idle > 5 min
  maxLifetime: 3600s  # Recycle pods after 1 hour
  maxQueries: 100     # Recycle after N queries

  # Scaling behavior
  scaleUpRate: 2      # Max pods to add per interval
  scaleDownRate: 1    # Max pods to remove per interval
  scaleInterval: 30s  # Evaluation interval

  # Catalog mounting
  catalogRef:
    name: main-catalog
    readOnly: true
```

- Performance: Eliminates cold start latency (typically 5-10s → <100ms)
- Resource Efficiency: Reuses initialized pods
- Predictable Latency: Consistent query response times
- Graceful Degradation: Falls back to Jobs if pool unavailable
- Cost Optimization: Scales based on actual demand
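The scaling behavior configured above (targetUtilization, scaleUpRate, scaleDownRate) can be sketched as a per-interval sizing decision. This is an illustrative model under the semantics implied by the spec, not the operator's actual implementation:

```python
import math

def desired_pool_size(busy: int, current: int,
                      min_size: int = 2, max_size: int = 10,
                      target_utilization: float = 0.8,
                      scale_up_rate: int = 2, scale_down_rate: int = 1) -> int:
    """Illustrative scaling decision for one evaluation interval."""
    # Size the pool so busy pods sit at the target utilization.
    target = math.ceil(busy / target_utilization) if busy else min_size
    # Clamp to the configured bounds.
    target = max(min_size, min(max_size, target))
    # Rate-limit the step taken this interval.
    if target > current:
        return min(target, current + scale_up_rate)
    return max(target, current - scale_down_rate)

print(desired_pool_size(busy=5, current=4))  # 6: wants ceil(5/0.8)=7, capped by scaleUpRate
```

Rate-limiting the step (rather than jumping straight to the target) avoids thrashing pods on bursty workloads while still converging within a few intervals.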
Featherman can materialize query results back to object storage. Define the
materializeTo block on a DuckLakeTable to export a view as Parquet.
```yaml
spec:
  materializeTo:
    enabled: true
    sql: "SELECT country, COUNT(*) AS cnt FROM users GROUP BY country"
    destination:
      bucket: my-exports
      prefix: top_users/
    format:
      type: parquet
      compression: ZSTD
```

When enabled, the operator runs the query and writes the results to the specified bucket and prefix.
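A spec like this maps naturally onto DuckDB's `COPY ... TO` statement. The sketch below shows how such a statement could be assembled from the materializeTo fields; the object key `result.parquet` is a hypothetical naming choice, and the operator's real codepath may differ:

```python
def build_copy_statement(sql: str, bucket: str, prefix: str,
                         compression: str = "ZSTD") -> str:
    """Assemble a DuckDB COPY statement from a materializeTo spec (illustrative)."""
    # Hypothetical object key under the configured bucket/prefix.
    destination = f"s3://{bucket}/{prefix}result.parquet"
    return (f"COPY ({sql}) TO '{destination}' "
            f"(FORMAT PARQUET, COMPRESSION {compression})")

stmt = build_copy_statement(
    sql="SELECT country, COUNT(*) AS cnt FROM users GROUP BY country",
    bucket="my-exports",
    prefix="top_users/",
)
print(stmt)
```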
The pool manager exposes Prometheus metrics for:
- ducklake_pool_size_current: Current number of pods
- ducklake_pool_size_desired: Target number of pods
- ducklake_pool_pods_idle: Number of idle pods
- ducklake_pool_pods_busy: Number of busy pods
- ducklake_pool_queue_length: Pending requests
- ducklake_pool_query_duration: Query execution time
- ducklake_pool_wait_duration: Time waiting for pod
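For dashboards or alerts, pool utilization can be derived from the busy/idle gauges; a minimal sketch (the metric names are as listed above, the sample values are illustrative):

```python
def pool_utilization(busy: int, idle: int) -> float:
    """Fraction of pool pods currently executing queries."""
    total = busy + idle
    # An empty pool reports zero utilization rather than dividing by zero.
    return busy / total if total else 0.0

# e.g. ducklake_pool_pods_busy=6, ducklake_pool_pods_idle=2
print(pool_utilization(busy=6, idle=2))  # 0.75
```

Comparing this ratio against the pool's targetUtilization shows whether the next scaling interval is likely to add or remove pods.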
The operator includes comprehensive test suites:
- Unit Tests: Test individual components and functions

```bash
make test
```

- End-to-End Tests: Test full operator functionality in a KinD cluster

```bash
make e2e-test
```

The E2E tests require:
- Running KinD cluster
- MinIO deployment (for S3 testing)
- Controller image built and loaded
- Proper RBAC and namespace configuration
featherman-query exposes a simple HTTP endpoint for ad-hoc SQL queries using the warm pod pool.
```bash
curl -X POST http://localhost:8080/query \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql":"SELECT 1","format":"csv","catalog":"example"}'
```

Results stream back as CSV (default) or Arrow IPC.
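The same request can be issued from Python with only the standard library. The endpoint path, JSON fields, and headers mirror the curl example above; the localhost URL assumes a port-forwarded service:

```python
import json
import urllib.request

def build_query_request(sql: str, catalog: str, token: str,
                        fmt: str = "csv",
                        url: str = "http://localhost:8080/query") -> urllib.request.Request:
    """Build a POST request for the featherman-query endpoint."""
    body = json.dumps({"sql": sql, "format": fmt, "catalog": catalog}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_query_request("SELECT 1", catalog="example", token="dev-token")
# resp = urllib.request.urlopen(req)  # uncomment against a live endpoint
# print(resp.read().decode())         # CSV rows stream back by default
```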
The operator exposes Prometheus metrics for:
- Catalog operations (create, update, delete)
- Storage usage
- Backup status
- Job durations
- Error counts
Metrics are served at the :8080/metrics endpoint (configurable).
MIT
