
Hive Partitioning Metadata Design for STAC #232

@nlebovits

Description


Context

Cloud-native data lakes commonly use Hive-style partitioning to organize large datasets:

demographics/
├── year=2020/
│   ├── state=CA/data.parquet
│   └── state=NY/data.parquet
└── year=2021/
    └── state=CA/data.parquet

Portolan needs to catalog partitioned datasets in STAC metadata. However, the STAC Table Extension v1.2.0 does not define partition metadata—it only describes table schemas, not how tables are organized on disk.

This ticket focuses on how to represent partition metadata in STAC for discovery, validation, and consumption by downstream tools.


Background: What is Hive Partitioning?

Hive-style partitioning organizes data into directories based on column values:

Pattern: key1=value1/key2=value2/...

Example:

sales/
├── year=2023/
│   ├── quarter=Q1/
│   │   └── data.parquet
│   └── quarter=Q2/
│       └── data.parquet
└── year=2024/
    └── quarter=Q1/
        └── data.parquet

Benefits:

  • Query optimization — Filter by partition keys without scanning all data
  • Incremental updates — Add new partitions without rewriting existing data
  • Parallel processing — Distribute work across partitions

Used by: Apache Iceberg, Delta Lake, Apache Hive, DuckDB, Polars, pandas, PyArrow
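The `key=value` path pattern is straightforward to generate and parse. As an illustration, a hypothetical `hive_path` helper (not part of any library) might build such paths:

```python
from pathlib import PurePosixPath

def hive_path(root: str, partition_values: dict, filename: str = "data.parquet") -> str:
    """Build a Hive-style path: root/key1=val1/key2=val2/filename."""
    segments = [f"{key}={value}" for key, value in partition_values.items()]
    return str(PurePosixPath(root, *segments, filename))

hive_path("sales", {"year": 2023, "quarter": "Q1"})
# → 'sales/year=2023/quarter=Q1/data.parquet'
```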


Current STAC Table Extension Gaps

The table extension defines:

  • table:columns — Column metadata (name, type, description)
  • table:row_count — Total row count
  • table:primary_geometry — Geometry column name

What's missing:

  • ❌ Partition key definitions (which columns are partition keys)
  • ❌ Partition value ranges (min/max for each partition)
  • ❌ Partition statistics (row count per partition, file count)
  • ❌ Partitioning scheme (Hive-style, directory-based, other)

Proposed Approach: Single Logical Table at Collection Level

Based on how modern data lake formats (Iceberg, Delta Lake) handle partitioned datasets, we propose:

✅ Decisions We Can Make Now

1. Partitioned datasets = single logical table at Collection level

Treat the entire partition tree as one asset of the collection:

demographics/
├── collection.json          ← Describes entire partitioned dataset
├── year=2020/
│   ├── state=CA/data.parquet
│   └── state=NY/data.parquet
└── year=2021/
    └── state=CA/data.parquet

Rationale:

  • Matches Iceberg/Delta Lake semantics (partitioning is physical, not logical)
  • Aligns with table extension pattern (collection-level assets for tabular data)
  • Simplifies queries (one schema applies to all partitions)

Alternative (rejected): Create one Item per partition

  • ❌ Creates explosion of Items for large datasets
  • ❌ Breaks semantic meaning (partitions aren't independent entities)
  • ❌ Complicates queries (must aggregate across Items)

2. table:row_count = sum across all partitions

{
  "table:row_count": 150000  // Sum of all partition row counts
}

Rationale:

  • Users want to know total dataset size, not per-partition size
  • Matches DuckDB/pandas behavior (query entire dataset)
  • Per-partition counts can be derived from Parquet metadata if needed

3. Partition discovery from Parquet metadata (not STAC)

Consumers should read partition information directly from Parquet files, not duplicate it in STAC:

Why:

  • Source of truth: Parquet file metadata already contains partition info
  • Avoids drift: STAC metadata can become stale if partitions are added/removed
  • Standard tools: DuckDB, pandas, PyArrow already do this

Example (DuckDB):

SELECT * FROM 'demographics/**/*.parquet' WHERE year = 2020;
-- DuckDB automatically detects Hive partitioning from directory structure

Example (PyArrow):

import pyarrow.dataset as ds
dataset = ds.dataset("demographics/", partitioning="hive")
# PyArrow reads partition schema from directory names

Implication for STAC: We don't need to store partition values in collection.json—just document that the dataset is partitioned.


Research Questions

While we've established the logical model (single table at collection level), we need to research how to express partition metadata in STAC for discoverability and validation.

Section A: Partition Metadata Representation

A1. How Should We Document Partitioning in STAC?

Option A: Human-readable description only

{
  "description": "Census data partitioned by year and state (Hive-style: year=YYYY/state=XX/)"
}

Pros:

  • Simple, no spec changes needed
  • Flexible for any partitioning scheme
  • Works with STAC today

Cons:

  • Not machine-readable
  • Can't validate partition structure
  • Can't generate partition-aware queries

Option B: Custom STAC extension for partitioning

{
  "stac_extensions": ["https://portolan.org/stac-extensions/partition/v1.0.0/schema.json"],
  "partition:scheme": "hive",
  "partition:keys": [
    {"name": "year", "type": "int64"},
    {"name": "state", "type": "string"}
  ]
}

Pros:

  • Machine-readable
  • Can validate partition consistency
  • Other tools could adopt if useful

Cons:

  • Requires maintaining custom extension
  • May duplicate work if upstream adds this to table extension
  • Needs community review for STAC compliance

Option C: Extend STAC Table Extension upstream

Contribute partition metadata fields to the table extension:

{
  "stac_extensions": ["https://stac-extensions.github.io/table/v1.3.0/schema.json"],
  "table:columns": [...],
  "table:partitioning": {
    "scheme": "hive",
    "keys": ["year", "state"]
  }
}

Pros:

  • Standard STAC (if accepted)
  • Benefits entire STAC ecosystem
  • Aligns with table extension's purpose

Cons:

  • Requires upstream approval (may take time)
  • Table extension is "Pilot" maturity (evolving spec)
  • May need to prove value with real-world usage first

Research tasks:

  • Check if any STAC catalogs document partitioned datasets
  • Search STAC GitHub issues/discussions for partition-related requests
  • Review Iceberg/Delta Lake metadata formats for inspiration
  • Test STAC validators with custom extensions
  • Draft potential table extension PR (explore feasibility)

Questions to answer:

  • Has anyone else requested partition metadata in STAC?
  • Would the STAC table extension maintainers accept this?
  • Is there prior art in other STAC extensions?

A2. Should We Store Partition Statistics in STAC?

Beyond partition keys, should we track:

  • Row count per partition?
  • File count per partition?
  • Min/max values per partition?
  • Partition creation timestamps?

Example:

{
  "partition:statistics": {
    "year=2020/state=CA": {
      "row_count": 50000,
      "file_count": 5,
      "min_values": {"population": 1000},
      "max_values": {"population": 5000000}
    }
  }
}

Research tasks:

  • Check how Iceberg stores partition statistics
  • Review Delta Lake transaction log format
  • Test DuckDB's partition pruning behavior (does it need STAC metadata?)
  • Determine if STAC consumers would use this information

Questions to answer:

  • Do STAC consumers need partition stats, or can they derive them?
  • Would this metadata become stale quickly (high maintenance)?
  • Is this STAC's responsibility or Parquet's?

A3. How to Handle Partition Schema Evolution?

What happens when partition structure changes?

Scenario:

# Original partitioning
data/
└── year=2020/data.parquet

# Later, add state partitioning
data/
├── year=2020/data.parquet  ← Old partition (no state key)
└── year=2021/
    ├── state=CA/data.parquet
    └── state=NY/data.parquet

Research tasks:

  • Check how Iceberg/Delta handle partition evolution
  • Test DuckDB behavior with inconsistent partition structure
  • Determine if STAC should warn about this
  • Consider versioning strategy (new collection version?)

Questions to answer:

  • Should portolan validate partition consistency?
  • Should we block mixed partition schemas?
  • Is this a user error or valid use case?

Section B: Asset Representation

B1. How Should Collection Assets Point to Partitioned Data?

Option A: Single asset pointing to root

{
  "assets": {
    "data": {
      "href": "./",  // Points to partition tree root
      "type": "application/vnd.apache.parquet",
      "roles": ["data"]
    }
  }
}

Pros:

  • Simple, one asset for entire dataset
  • Matches "single logical table" model
  • Tools can glob for **/*.parquet

Cons:

  • href points to directory, not file (unusual for STAC)
  • May confuse tools expecting file URLs

Option B: Asset per partition

{
  "assets": {
    "year=2020/state=CA": {
      "href": "./year=2020/state=CA/data.parquet",
      "type": "application/vnd.apache.parquet",
      "roles": ["data"]
    },
    "year=2020/state=NY": {...}
  }
}

Pros:

  • Each asset is a file (standard STAC pattern)
  • Can include per-partition metadata

Cons:

  • Explosion of assets for large datasets (hundreds/thousands)
  • Breaks "single logical table" model
  • Hard to maintain as partitions are added

Option C: Single asset with glob pattern

{
  "assets": {
    "data": {
      "href": "./**/*.parquet",  // Glob pattern
      "type": "application/vnd.apache.parquet",
      "roles": ["data"]
    }
  }
}

Pros:

  • Concise, works for any number of partitions
  • Tools know how to resolve globs

Cons:

  • Glob patterns aren't standard in STAC hrefs
  • May not work with HTTP range requests
  • Unclear for cloud storage (S3 doesn't support globs directly)

Research tasks:

  • Check STAC spec for href requirements (must be file? can be directory?)
  • Test STAC Browser with directory hrefs
  • Test pystac asset validation
  • Review how other STAC catalogs handle multi-file assets
  • Test cloud storage tools (fsspec, s3fs) with directory URLs

Questions to answer:

  • What does STAC allow in asset hrefs?
  • How do STAC clients handle directory-based assets?
  • Is there precedent for glob patterns in STAC?

B2. Should We Use table:storage_options for Partition Access?

The table extension defines table:storage_options for fsspec parameters:

{
  "assets": {
    "data": {
      "href": "s3://bucket/demographics/",
      "table:storage_options": {
        "anon": false,
        "requester_pays": true
      }
    }
  }
}

Research tasks:

  • Test fsspec with partitioned Parquet datasets
  • Check if DuckDB/PyArrow respect storage_options
  • Determine if this is sufficient for cloud access
  • Test with Portolan's S3 integration

Questions to answer:

  • Does table:storage_options work for partitioned datasets?
  • Do tools automatically discover partitions via fsspec?
  • Should we document specific storage_options patterns?

Section C: Validation and Tooling

C1. How Should portolan scan Detect Partitioning?

When scanning a directory with Hive-style partitioning:

portolan scan demographics/

Expected output:

✓ Found: demographics/ (Hive-partitioned GeoParquet)
  → Partition keys: year, state
  → Partitions: 3 (year=2020/state=CA, year=2020/state=NY, year=2021/state=CA)
  → Total rows: 150,000
  → Recommended: Collection-level partitioned asset

Research tasks:

  • Implement partition key detection from directory names
  • Validate partition consistency (all use same keys?)
  • Count total rows by reading Parquet metadata
  • Detect mixed partition schemas (warn if inconsistent)

Questions to answer:

  • Should scan automatically detect partitioning?
  • Should we validate partition structure?
  • What warnings should we show for malformed partitions?
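The key-detection step above can be sketched with only the standard library (the `detect_partition_keys` name and return convention are ours, not an existing portolan API):

```python
import os
import re

HIVE_SEGMENT = re.compile(r"^([^=/]+)=([^=/]*)$")

def detect_partition_keys(root: str) -> set:
    """Walk a directory tree and collect Hive partition key tuples per leaf.

    Returns the set of key tuples found; more than one entry signals a
    mixed (inconsistent) partition scheme worth warning about.
    """
    schemes = set()
    for dirpath, dirnames, filenames in os.walk(root):
        if any(f.endswith(".parquet") for f in filenames):
            rel = os.path.relpath(dirpath, root)
            keys = tuple(
                m.group(1)
                for seg in (rel.split(os.sep) if rel != "." else [])
                if (m := HIVE_SEGMENT.match(seg))
            )
            schemes.add(keys)
    return schemes
```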

C2. How Should portolan add Handle Partitioned Datasets?

When adding a partitioned dataset:

portolan add demographics/

Option A: Auto-detect and create collection

# Detects partitioning, creates collection.json with partition metadata
# Single collection asset pointing to partition tree

Option B: Require explicit flag

portolan add demographics/ --partitioned

Option C: Interactive prompt

portolan add demographics/
# → "Detected Hive-style partitioning (keys: year, state). Add as partitioned dataset? [Y/n]"

Research tasks:

  • Design UX for partition detection
  • Consider --auto flag implications
  • Test with various partition structures
  • Handle edge cases (incomplete partitions, nested partitioning)

Questions to answer:

  • Which option aligns with portolan's "interactive + automatable" principle?
  • Should auto-detection be default behavior?
  • How do we handle ambiguous cases?

C3. Validation Requirements for Partitioned Datasets

What should portolan validate?

Partition structure:

  • All partitions use same keys?
  • Partition values are valid (no special characters)?
  • All leaf directories contain Parquet files?

Schema consistency:

  • All partitions have same Parquet schema?
  • Partition keys exist in Parquet schema?
  • Data types match across partitions?

Metadata accuracy:

  • table:row_count matches sum of partition row counts?
  • table:columns matches actual Parquet schema?

Research tasks:

  • Design validation pipeline for partitioned datasets
  • Implement schema comparison across partitions
  • Test with real-world partitioned datasets
  • Determine performance implications (reading all Parquet metadata)

Questions to answer:

  • What level of validation is appropriate?
  • Should validation be fast (heuristics) or exhaustive (full scan)?
  • How do we report validation errors?

Section D: Upstream Contribution

D1. Should We Propose Partition Metadata to STAC Table Extension?

If we implement a custom partition metadata approach, should we:

  • Contribute it upstream to the table extension?
  • Keep it as Portolan-specific extension?
  • Wait for community demand before proposing?

Research tasks:

  • Draft potential table extension PR
  • Review table extension governance (how to propose changes)
  • Check if other STAC users have this need
  • Engage with STAC community in discussions

Questions to answer:

  • Is there appetite for partition metadata in STAC?
  • Should we prove value with Portolan first, then propose?
  • What would the contribution process look like?

D2. Compatibility with Other Table Formats

Beyond Parquet, partitioning applies to:

  • Apache Iceberg (sophisticated partition evolution)
  • Delta Lake (transaction log with partition info)
  • Apache Hive (original Hive-style partitioning)

Research tasks:

  • Review Iceberg metadata format
  • Review Delta Lake transaction log
  • Determine if partition metadata should be format-agnostic
  • Consider future plugin architecture (ADR-0003)

Questions to answer:

  • Should partition metadata be Parquet-specific or generic?
  • How would this work with Iceberg plugin (future)?
  • Can we design for extensibility?

Implementation Scope

Once research is complete, this ticket will cover:

Phase 1: Basic Partition Support

  • Detect Hive-style partitioning in portolan scan
  • Create collection-level assets for partitioned datasets
  • Document partitioning in collection description (human-readable)
  • Compute table:row_count by summing partition row counts

Phase 2: Partition Metadata (TBD based on research)

  • Implement chosen metadata representation (description / custom extension / upstream)
  • Extract partition keys from directory structure
  • Validate partition consistency

Phase 3: Documentation

  • ADR: "Partitioned Dataset Cataloging"
  • Examples in docs/ showing partitioned datasets
  • Update CLAUDE.md with partitioning patterns
  • Possibly: Contribution to STAC table extension

Dependencies

Depends on: #231 (Collection-Level Assets and Nested Collections)

  • Need collection-level assets working before adding partition metadata

Blocks:

  • #TBD (Table Extension Integration) — partition metadata affects schema extraction
  • #TBD (Enhanced Scan UX) — partition detection is part of scan improvements

Out of Scope (Separate Work)

  • Partition pruning optimization → DuckDB/PyArrow handle this, not Portolan's job
  • Partition evolution → Complex topic, defer to future if needed
  • Iceberg/Delta integration → Covered by plugin architecture (ADR-0003)
  • Spatial partitioning → Different pattern (tile-based), separate research

Success Criteria

This ticket is complete when:

  1. All research questions have documented answers
  2. Partitioned datasets can be cataloged with appropriate metadata
  3. portolan scan detects and reports partitioning
  4. Validation ensures partition consistency
  5. ADR documents the partitioning approach
  6. Decision made on upstream contribution (if applicable)

Related Issues


Additional Context

Why this matters: Partitioning is fundamental to cloud-native data lakes. Modern data tools (DuckDB, Polars, PyArrow) expect partitioned Parquet datasets. Portolan should catalog them correctly to enable efficient queries.

Real-world example: The Den Haag datasets could benefit from partitioning by year or district if they grow over time. Designing this now prevents future breaking changes.

STAC ecosystem gap: The table extension doesn't define partition metadata. This is an opportunity to contribute back to STAC if our approach proves useful.

Metadata

Assignees: no one assigned

Labels: enhancement (New feature or request), roadmap:mvp (Phase 1: Core CLI + Spec)

Milestone: none