This directory contains example YAML configurations demonstrating various use cases for Dataiku IaC.
Status: 🚧 Experimental (Waves 1-3 Complete)
Complexity: ⭐ Beginner | Resources: 1 project, 1 dataset | Best For: Learning, testing, quick validation
The most minimal example - perfect for understanding the basic YAML structure.
project:
  key: SIMPLE_PROJECT
  name: Simple Project
datasets:
  - name: SAMPLE_DATA
    type: managed
    format_type: csv

Use Cases:
- First-time IaC users learning the format
- Testing IaC installation
- Validating configuration syntax
- CI/CD smoke tests
Try it:
python -m dataikuapi.iac.cli.plan -c simple-project.yml -e dev

Complexity: ⭐⭐⭐ Intermediate | Resources: 1 project, 6 datasets, 4 recipes | Best For: Machine learning workflows, data science projects
A realistic ML pipeline for customer churn prediction:
- Load data from Snowflake
- Clean and prepare data
- Engineer features
- Split train/test sets
- Generate predictions
project:
  key: CHURN_PREDICTION
  name: Customer Churn Prediction
datasets:
  - name: RAW_CUSTOMERS
    type: snowflake
    connection: "{{ env.DB_CONNECTION }}"
  - name: CLEANED_CUSTOMERS
    type: managed
    format_type: parquet
  # ... more datasets
recipes:
  - name: clean_customers
    type: python
    inputs: [RAW_CUSTOMERS]
    outputs: [CLEANED_CUSTOMERS]
  # ... more recipes

Key Features:
- External data source (Snowflake)
- Environment variables for connections
- Multiple transformation steps
- Python recipes with realistic code
- Schema definitions
Use Cases:
- ML model training pipelines
- Feature engineering workflows
- Customer analytics projects
- Data science team collaboration
Try it:
export DB_CONNECTION=snowflake_prod
python -m dataikuapi.iac.cli.plan -c ml-pipeline.yml -e prod

Complexity: ⭐⭐⭐⭐ Advanced | Resources: 1 project, 9 datasets, 5 recipes | Best For: Data integration, ETL pipelines, multi-source analytics
A comprehensive data integration pipeline:
- Load from multiple sources (Snowflake, PostgreSQL, S3)
- Join and enrich data
- Compute aggregations
- Generate business metrics
project:
  key: SALES_ANALYTICS
  name: Sales Analytics Pipeline
datasets:
  - name: SNOWFLAKE_CUSTOMERS
    type: snowflake
    connection: "{{ env.SNOWFLAKE_CONN }}"
  - name: POSTGRES_SALES
    type: postgresql
    connection: "{{ env.POSTGRES_CONN }}"
  - name: S3_PRODUCTS
    type: s3
    connection: "{{ env.S3_CONN }}"
  # ... intermediate and final datasets
recipes:
  - name: enrich_sales_with_customers
    type: sql
    inputs: [POSTGRES_SALES, SNOWFLAKE_CUSTOMERS]
    outputs: [ENRICHED_SALES]
  # ... more recipes

Key Features:
- Multiple data sources (Snowflake, PostgreSQL, S3)
- Complex dependency chains
- SQL and Python recipes
- Multi-step transformations
- Business metrics computation
Use Cases:
- Data warehouse ETL
- Multi-source data integration
- Business intelligence pipelines
- Cross-platform analytics
- Enterprise data consolidation
Try it:
export SNOWFLAKE_CONN=snowflake_prod
export POSTGRES_CONN=postgres_prod
export S3_CONN=s3_data_lake
python -m dataikuapi.iac.cli.plan -c multi-dataset.yml -e prod

All examples use environment variables for environment-specific configuration:

datasets:
  - name: SOURCE_DATA
    type: snowflake
    connection: "{{ env.DB_CONNECTION }}"

Dev environment:

export DB_CONNECTION=snowflake_dev
python -m dataikuapi.iac.cli.plan -c config.yml -e dev

Prod environment:

export DB_CONNECTION=snowflake_prod
python -m dataikuapi.iac.cli.plan -c config.yml -e prod

Define dataset schemas for validation and documentation:
datasets:
  - name: CUSTOMER_DATA
    type: snowflake
    schema:
      columns:
        - name: CUSTOMER_ID
          type: bigint
        - name: EMAIL
          type: string
        - name: CREATED_AT
          type: timestamp

IaC determines recipe execution order automatically from each recipe's inputs and outputs:
recipes:
  - name: clean_data
    inputs: [RAW_DATA]
    outputs: [CLEAN_DATA]
  - name: compute_metrics
    inputs: [CLEAN_DATA]  # Depends on clean_data
    outputs: [METRICS]

IaC ensures clean_data runs before compute_metrics.
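The ordering rule can be illustrated with a small, self-contained sketch. This is not the Dataiku IaC implementation; it only assumes that recipes are plain dicts with `name`, `inputs`, and `outputs`, and it uses the standard-library `graphlib` module (Python 3.9+) to sort them. The helper name `execution_order` is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(recipes):
    """Return recipe names so that producers come before consumers."""
    # Map each dataset to the recipe that writes it.
    producer = {out: r["name"] for r in recipes for out in r.get("outputs", [])}
    # A recipe depends on every recipe that produces one of its inputs.
    # Inputs with no producer (external sources) add no dependency.
    graph = {
        r["name"]: {producer[i] for i in r.get("inputs", []) if i in producer}
        for r in recipes
    }
    return list(TopologicalSorter(graph).static_order())


# Declared out of order on purpose: the sort fixes the order.
recipes = [
    {"name": "compute_metrics", "inputs": ["CLEAN_DATA"], "outputs": ["METRICS"]},
    {"name": "clean_data", "inputs": ["RAW_DATA"], "outputs": ["CLEAN_DATA"]},
]
print(execution_order(recipes))  # ['clean_data', 'compute_metrics']

# Two recipes that feed each other cannot be ordered at all.
cyclic = [
    {"name": "recipe_a", "inputs": ["DATASET_B"], "outputs": ["DATASET_A"]},
    {"name": "recipe_b", "inputs": ["DATASET_A"], "outputs": ["DATASET_B"]},
]
try:
    execution_order(cyclic)
except CycleError:
    print("Circular dependency detected")
```

A topological sort is one common way to derive such an order, and it fails naturally when recipes form a cycle.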
Python recipes for complex transformations:
recipes:
  - name: feature_engineering
    type: python
    inputs: [CLEAN_DATA]
    outputs: [FEATURES]
    code: |
      import dataiku
      import pandas as pd

      df = dataiku.Dataset("CLEAN_DATA").get_dataframe()
      # Complex transformations
      dataiku.Dataset("FEATURES").write_with_schema(df)

SQL recipes for aggregations and joins:
recipes:
  - name: daily_aggregation
    type: sql
    inputs: [TRANSACTION_DATA]
    outputs: [DAILY_METRICS]
    code: |
      SELECT
        transaction_date,
        COUNT(*) as total_transactions,
        SUM(amount) as total_revenue
      FROM TRANSACTION_DATA
      GROUP BY transaction_date

- New to IaC? Start with simple-project.yml
- Building an ML pipeline? Use ml-pipeline.yml
- Multi-source ETL? Use multi-dataset.yml
Edit the YAML file:
- Update project key and name
- Change dataset names
- Modify connections (use your Dataiku connection names)
- Adjust recipe code
python -m dataikuapi.iac.cli.plan -c your-config.yml -e dev

The plan shows what would change:
- `+` (green) = Resources to create
- `~` (yellow) = Resources to update
- `-` (red) = Resources to delete
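To make the three plan symbols concrete, here is a minimal sketch of how a plan could be derived by comparing desired and current resources. It is an illustration only, not the actual planner: the function name `plan` and the dict-of-dicts layout are assumptions.

```python
def plan(desired, current):
    """Compare two {resource_name: settings} dicts and list the changes."""
    actions = []
    for name in sorted(desired.keys() | current.keys()):
        if name not in current:
            actions.append(("+", name))  # in config only: create
        elif name not in desired:
            actions.append(("-", name))  # in current state only: delete
        elif desired[name] != current[name]:
            actions.append(("~", name))  # in both, settings differ: update
        # Identical resources produce no action.
    return actions


desired = {
    "RAW_DATA": {"type": "managed"},
    "CLEAN_DATA": {"type": "managed", "format_type": "parquet"},
}
current = {
    "CLEAN_DATA": {"type": "managed", "format_type": "csv"},
    "OLD_DATA": {"type": "managed"},
}
for symbol, name in plan(desired, current):
    print(symbol, name)
# ~ CLEAN_DATA
# - OLD_DATA
# + RAW_DATA
```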
All examples follow Dataiku IaC naming conventions:
| Resource | Convention | Example |
|---|---|---|
| Project Keys | UPPERCASE_WITH_UNDERSCORES | CUSTOMER_ANALYTICS |
| Dataset Names | UPPERCASE_WITH_UNDERSCORES | RAW_CUSTOMERS |
| Recipe Names | lowercase_with_underscores | clean_customers |
Why UPPERCASE?
- Snowflake resolves unquoted table and column names to UPPERCASE
- Prevents case-sensitivity issues
- Maintains SQL compatibility
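The conventions in the table can be checked mechanically. The sketch below is a hypothetical helper, not part of `dataikuapi`: the table only names the styles, so the exact character rules here (a leading letter, digits allowed) are assumptions, as is the config being a plain dict.

```python
import re

# Assumed patterns for the two naming styles in the table.
UPPER_SNAKE = re.compile(r"[A-Z][A-Z0-9_]*")
LOWER_SNAKE = re.compile(r"[a-z][a-z0-9_]*")


def check_names(config):
    """Return a list of naming problems found in a parsed config dict."""
    problems = []
    key = config.get("project", {}).get("key", "")
    if not UPPER_SNAKE.fullmatch(key):
        problems.append(f"project key {key!r} is not UPPERCASE_WITH_UNDERSCORES")
    for dataset in config.get("datasets", []):
        if not UPPER_SNAKE.fullmatch(dataset["name"]):
            problems.append(f"dataset {dataset['name']!r} is not UPPERCASE_WITH_UNDERSCORES")
    for recipe in config.get("recipes", []):
        if not LOWER_SNAKE.fullmatch(recipe["name"]):
            problems.append(f"recipe {recipe['name']!r} is not lowercase_with_underscores")
    return problems


config = {
    "project": {"key": "my_project", "name": "My Project"},
    "datasets": [{"name": "RAW_CUSTOMERS", "type": "managed"}],
    "recipes": [{"name": "clean_customers", "inputs": ["RAW_CUSTOMERS"]}],
}
for problem in check_names(config):
    print(problem)
# project key 'my_project' is not UPPERCASE_WITH_UNDERSCORES
```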
simple-project.yml
Good for small projects (< 10 resources).
my-project/
├── project.yml # Project metadata
├── datasets/
│ ├── sources.yml # External data sources
│ ├── prepared.yml # Prepared datasets
│ └── outputs.yml # Final outputs
└── recipes/
├── cleaning.yml # Data cleaning recipes
├── features.yml # Feature engineering
└── metrics.yml # Metric computation
Good for large projects (10+ resources). Use ConfigParser().parse_directory().
from dataikuapi.iac.config import ConfigParser, ConfigValidator

parser = ConfigParser()
config = parser.parse_file("your-config.yml")

validator = ConfigValidator(strict=True)
validator.validate(config)  # Raises exception if invalid

# See what would be created
python -m dataikuapi.iac.cli.plan -c your-config.yml -e dev

Error: "Project key must be UPPERCASE"
# ❌ Wrong
project:
  key: my_project

# ✓ Correct
project:
  key: MY_PROJECT

Error: "Recipe references non-existent dataset"
# ❌ Wrong - RAW_DATA not defined
recipes:
  - name: clean_data
    inputs: [RAW_DATA]

# ✓ Correct - RAW_DATA defined
datasets:
  - name: RAW_DATA
    type: managed
recipes:
  - name: clean_data
    inputs: [RAW_DATA]

Error: "Circular dependency detected"
# ❌ Wrong - circular dependency
recipes:
  - name: recipe_a
    inputs: [DATASET_B]
    outputs: [DATASET_A]
  - name: recipe_b
    inputs: [DATASET_A]  # Circular!
    outputs: [DATASET_B]

- Quick Start: ../docs/IAC_QUICKSTART.md
- Full Overview: ../docs/IAC_OVERVIEW.md
- Planning Docs: ../dataiku-iac-planning/README.md
- State Management: ../demos/week1_state_workflow.py
- Plan Workflow: ../demos/week2_plan_workflow.py
- Copy one of these examples
- Customize for your project
- Validate with ConfigValidator
- Generate plan to preview changes
- (Coming soon) Apply to deploy to Dataiku
- Parse YAML configurations
- Validate syntax, naming, references
- Generate execution plans
- Detect dependencies automatically
- Apply execution (actually create resources)
- Rollback on failure
- Progress reporting
- Import existing projects to YAML
- Drift detection
- Remote state backends
- CI/CD templates
Questions or Issues?
- Review ../docs/TROUBLESHOOTING.md
- Check ../docs/IAC_OVERVIEW.md FAQ section
Version: 1.0 | Last Updated: 2025-11-27 | IaC Status: Experimental (Waves 1-3 Complete)