Hive Partitioning Metadata Design for STAC
Context
Cloud-native data lakes commonly use Hive-style partitioning to organize large datasets:
demographics/
├── year=2020/
│ ├── state=CA/data.parquet
│ └── state=NY/data.parquet
└── year=2021/
└── state=CA/data.parquet
Portolan needs to catalog partitioned datasets in STAC metadata. However, the STAC Table Extension v1.2.0 does not define partition metadata—it only describes table schemas, not how tables are organized on disk.
This ticket focuses on how to represent partition metadata in STAC for discovery, validation, and consumption by downstream tools.
Background: What is Hive Partitioning?
Hive-style partitioning organizes data into directories based on column values:
Pattern: key1=value1/key2=value2/...
Example:
sales/
├── year=2023/
│ ├── quarter=Q1/
│ │ └── data.parquet
│ └── quarter=Q2/
│ └── data.parquet
└── year=2024/
└── quarter=Q1/
└── data.parquet
Benefits:
- Query optimization — Filter by partition keys without scanning all data
- Incremental updates — Add new partitions without rewriting existing data
- Parallel processing — Distribute work across partitions
Used by: Apache Iceberg, Delta Lake, Apache Hive, DuckDB, Polars, pandas, PyArrow
Current STAC Table Extension Gaps
The table extension defines:
- table:columns — Column metadata (name, type, description)
- table:row_count — Total row count
- table:primary_geometry — Geometry column name
What's missing:
- ❌ Partition key definitions (which columns are partition keys)
- ❌ Partition value ranges (min/max for each partition)
- ❌ Partition statistics (row count per partition, file count)
- ❌ Partitioning scheme (Hive-style, directory-based, other)
Proposed Approach: Single Logical Table at Collection Level
Based on how modern data lake formats (Iceberg, Delta Lake) handle partitioned datasets, we propose:
✅ Decisions We Can Make Now
1. Partitioned datasets = single logical table at Collection level
Treat the entire partition tree as one asset of the collection:
demographics/
├── collection.json ← Describes entire partitioned dataset
├── year=2020/
│ ├── state=CA/data.parquet
│ └── state=NY/data.parquet
└── year=2021/
└── state=CA/data.parquet
Rationale:
- Matches Iceberg/Delta Lake semantics (partitioning is physical, not logical)
- Aligns with table extension pattern (collection-level assets for tabular data)
- Simplifies queries (one schema applies to all partitions)
Alternative (rejected): Create one Item per partition
- ❌ Creates explosion of Items for large datasets
- ❌ Breaks semantic meaning (partitions aren't independent entities)
- ❌ Complicates queries (must aggregate across Items)
2. table:row_count = sum across all partitions
{
"table:row_count": 150000 // Sum of all partition row counts
}
Rationale:
- Users want to know total dataset size, not per-partition size
- Matches DuckDB/pandas behavior (query entire dataset)
- Per-partition counts can be derived from Parquet metadata if needed
3. Partition discovery from Parquet metadata (not STAC)
Consumers should read partition information directly from Parquet files, not duplicate it in STAC:
Why:
- Source of truth: Parquet file metadata already contains partition info
- Avoids drift: STAC metadata can become stale if partitions are added/removed
- Standard tools: DuckDB, pandas, PyArrow already do this
Example (DuckDB):
SELECT * FROM 'demographics/**/*.parquet' WHERE year = 2020;
-- DuckDB automatically detects Hive partitioning from directory structure
Example (PyArrow):
import pyarrow.dataset as ds
dataset = ds.dataset("demographics/", partitioning="hive")
# PyArrow reads partition schema from directory names
Implication for STAC: We don't need to store partition values in collection.json—just document that the dataset is partitioned.
Research Questions
While we've established the logical model (single table at collection level), we need to research how to express partition metadata in STAC for discoverability and validation.
Section A: Partition Metadata Representation
A1. How Should We Document Partitioning in STAC?
Option A: Human-readable description only
{
"description": "Census data partitioned by year and state (Hive-style: year=YYYY/state=XX/)"
}
Pros:
- Simple, no spec changes needed
- Flexible for any partitioning scheme
- Works with STAC today
Cons:
- Not machine-readable
- Can't validate partition structure
- Can't generate partition-aware queries
Option B: Custom STAC extension for partitioning
{
"stac_extensions": ["https://portolan.org/stac-extensions/partition/v1.0.0/schema.json"],
"partition:scheme": "hive",
"partition:keys": [
{"name": "year", "type": "int64"},
{"name": "state", "type": "string"}
]
}
Pros:
- Machine-readable
- Can validate partition consistency
- Other tools could adopt if useful
Cons:
- Requires maintaining custom extension
- May duplicate work if upstream adds this to table extension
- Needs community review for STAC compliance
Option C: Extend STAC Table Extension upstream
Contribute partition metadata fields to the table extension:
{
"stac_extensions": ["https://stac-extensions.github.io/table/v1.3.0/schema.json"],
"table:columns": [...],
"table:partitioning": {
"scheme": "hive",
"keys": ["year", "state"]
}
}
Pros:
- Standard STAC (if accepted)
- Benefits entire STAC ecosystem
- Aligns with table extension's purpose
Cons:
- Requires upstream approval (may take time)
- Table extension is "Pilot" maturity (evolving spec)
- May need to prove value with real-world usage first
Research tasks:
Questions to answer:
- Has anyone else requested partition metadata in STAC?
- Would the STAC table extension maintainers accept this?
- Is there prior art in other STAC extensions?
A2. Should We Store Partition Statistics in STAC?
Beyond partition keys, should we track:
- Row count per partition?
- File count per partition?
- Min/max values per partition?
- Partition creation timestamps?
Example:
{
"partition:statistics": {
"year=2020/state=CA": {
"row_count": 50000,
"file_count": 5,
"min_values": {"population": 1000},
"max_values": {"population": 5000000}
}
}
}
Research tasks:
Questions to answer:
- Do STAC consumers need partition stats, or can they derive them?
- Would this metadata become stale quickly (high maintenance)?
- Is this STAC's responsibility or Parquet's?
A3. How to Handle Partition Schema Evolution?
What happens when partition structure changes?
Scenario:
# Original partitioning
data/
└── year=2020/data.parquet
# Later, add state partitioning
data/
├── year=2020/data.parquet ← Old partition (no state key)
└── year=2021/
├── state=CA/data.parquet
└── state=NY/data.parquet
Research tasks:
Questions to answer:
- Should portolan validate partition consistency?
- Should we block mixed partition schemas?
- Is this a user error or valid use case?
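A cheap heuristic for this scenario: compare the set of partition keys seen on each file path. A stdlib sketch over the mixed layout above:

```python
# Relative Parquet paths as a scanner might collect them (illustrative)
paths = [
    "year=2020/data.parquet",              # old layout: no state key
    "year=2021/state=CA/data.parquet",
    "year=2021/state=NY/data.parquet",
]

def partition_keys(path: str) -> tuple:
    """Ordered key names from the key=value directory segments of a path."""
    return tuple(seg.split("=", 1)[0] for seg in path.split("/")[:-1] if "=" in seg)

key_sets = {partition_keys(p) for p in paths}
consistent = len(key_sets) == 1
print(consistent)  # False: the tree mixes ('year',) and ('year', 'state')
```

Whether an inconsistent result is an error or a warning is exactly the open question.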
Section B: Asset Representation
B1. How Should Collection Assets Point to Partitioned Data?
Option A: Single asset pointing to root
{
"assets": {
"data": {
"href": "./", // Points to partition tree root
"type": "application/vnd.apache.parquet",
"roles": ["data"]
}
}
}
Pros:
- Simple, one asset for entire dataset
- Matches "single logical table" model
- Tools can glob for **/*.parquet
Cons:
- href points to directory, not file (unusual for STAC)
- May confuse tools expecting file URLs
Option B: Asset per partition
{
"assets": {
"year=2020/state=CA": {
"href": "./year=2020/state=CA/data.parquet",
"type": "application/vnd.apache.parquet",
"roles": ["data"]
},
"year=2020/state=NY": {...}
}
}
Pros:
- Each asset is a file (standard STAC pattern)
- Can include per-partition metadata
Cons:
- Explosion of assets for large datasets (hundreds/thousands)
- Breaks "single logical table" model
- Hard to maintain as partitions are added
Option C: Single asset with glob pattern
{
"assets": {
"data": {
"href": "./**/*.parquet", // Glob pattern
"type": "application/vnd.apache.parquet",
"roles": ["data"]
}
}
}
Pros:
- Concise, works for any number of partitions
- Tools know how to resolve globs
Cons:
- Glob patterns aren't standard in STAC hrefs
- May not work with HTTP range requests
- Unclear for cloud storage (S3 doesn't support globs directly)
Research tasks:
Questions to answer:
- What does STAC allow in asset hrefs?
- How do STAC clients handle directory-based assets?
- Is there precedent for glob patterns in STAC?
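For Option A, a client would expand the directory href itself. A stdlib sketch of resolving such an href into concrete Parquet files (the tree is illustrative; cloud storage would need a listing call instead of a filesystem walk):

```python
import pathlib
import tempfile

# Build an illustrative partition tree with placeholder files
root = pathlib.Path(tempfile.mkdtemp())
for rel in ["year=2020/state=CA/data.parquet", "year=2021/state=CA/data.parquet"]:
    target = root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch()

# Resolving a directory href: recurse below it for Parquet files
files = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.parquet"))
print(files)
```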
B2. Should We Use table:storage_options for Partition Access?
The table extension defines table:storage_options for fsspec parameters:
{
"assets": {
"data": {
"href": "s3://bucket/demographics/",
"table:storage_options": {
"anon": false,
"requester_pays": true
}
}
}
}
Research tasks:
Questions to answer:
- Does table:storage_options work for partitioned datasets?
- Do tools automatically discover partitions via fsspec?
- Should we document specific storage_options patterns?
Section C: Validation and Tooling
C1. How Should portolan scan Detect Partitioning?
When scanning a directory with Hive-style partitioning:
portolan scan demographics/
Expected output:
✓ Found: demographics/ (Hive-partitioned GeoParquet)
→ Partition keys: year, state
→ Partitions: 3 (year=2020/state=CA, year=2020/state=NY, year=2021/state=CA)
→ Total rows: 150,000
→ Recommended: Collection-level partitioned asset
Research tasks:
Questions to answer:
- Should scan automatically detect partitioning?
- Should we validate partition structure?
- What warnings should we show for malformed partitions?
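The expected output above implies detection logic along these lines; a stdlib sketch (the report format is an assumption, not portolan's actual implementation):

```python
# Relative Parquet paths found under the scanned directory (illustrative)
paths = [
    "year=2020/state=CA/data.parquet",
    "year=2020/state=NY/data.parquet",
    "year=2021/state=CA/data.parquet",
]

# Each partition is the directory portion of a path; keys come from key=value segments
partitions = sorted({"/".join(p.split("/")[:-1]) for p in paths})
keys = [seg.split("=", 1)[0] for seg in paths[0].split("/")[:-1] if "=" in seg]

print(f"Partition keys: {', '.join(keys)}")
print(f"Partitions: {len(partitions)} ({', '.join(partitions)})")
```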
C2. How Should portolan add Handle Partitioned Datasets?
When adding a partitioned dataset:
portolan add demographics/
Option A: Auto-detect and create collection
# Detects partitioning, creates collection.json with partition metadata
# Single collection asset pointing to partition tree
Option B: Require explicit flag
portolan add demographics/ --partitioned
Option C: Interactive prompt
portolan add demographics/
# → "Detected Hive-style partitioning (keys: year, state). Add as partitioned dataset? [Y/n]"
Research tasks:
- --auto flag implications
Questions to answer:
- Which option aligns with portolan's "interactive + automatable" principle?
- Should auto-detection be default behavior?
- How do we handle ambiguous cases?
C3. Validation Requirements for Partitioned Datasets
What should portolan validate?
Partition structure:
Schema consistency:
Metadata accuracy:
- table:row_count matches sum of partition row counts?
- table:columns matches actual Parquet schema?
Research tasks:
Questions to answer:
- What level of validation is appropriate?
- Should validation be fast (heuristics) or exhaustive (full scan)?
- How do we report validation errors?
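For the metadata-accuracy side, the core check is a sum comparison between the declared count and footer-derived counts. A sketch (the total of 150,000 comes from the example earlier; the per-partition split is illustrative):

```python
# Declared value from collection.json and counts derived from Parquet footers
declared_row_count = 150000
partition_counts = {
    "year=2020/state=CA": 50000,
    "year=2020/state=NY": 60000,
    "year=2021/state=CA": 40000,
}

computed = sum(partition_counts.values())
ok = declared_row_count == computed
print(ok, computed)  # True 150000
```

This check is fast (footers only), which matters for the heuristics-vs-exhaustive question.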
Section D: Upstream Contribution
D1. Should We Propose Partition Metadata to STAC Table Extension?
If we implement a custom partition metadata approach, should we:
- Contribute it upstream to the table extension?
- Keep it as Portolan-specific extension?
- Wait for community demand before proposing?
Research tasks:
Questions to answer:
- Is there appetite for partition metadata in STAC?
- Should we prove value with Portolan first, then propose?
- What would the contribution process look like?
D2. Compatibility with Other Table Formats
Beyond Parquet, partitioning applies to:
- Apache Iceberg (sophisticated partition evolution)
- Delta Lake (transaction log with partition info)
- Apache Hive (original Hive-style partitioning)
Research tasks:
Questions to answer:
- Should partition metadata be Parquet-specific or generic?
- How would this work with Iceberg plugin (future)?
- Can we design for extensibility?
Implementation Scope
Once research is complete, this ticket will cover:
Phase 1: Basic Partition Support
- Detect Hive-style partitioning in portolan scan
- Compute table:row_count by summing partition row counts
Phase 2: Partition Metadata (TBD based on research)
Phase 3: Documentation
Dependencies
Depends on: #231 (Collection-Level Assets and Nested Collections)
- Need collection-level assets working before adding partition metadata
Blocks:
- #TBD (Table Extension Integration) — partition metadata affects schema extraction
- #TBD (Enhanced Scan UX) — partition detection is part of scan improvements
Out of Scope (Separate Work)
- Partition pruning optimization → DuckDB/PyArrow handle this, not Portolan's job
- Partition evolution → Complex topic, defer to future if needed
- Iceberg/Delta integration → Covered by plugin architecture (ADR-0003)
- Spatial partitioning → Different pattern (tile-based), separate research
Success Criteria
This ticket is complete when:
- All research questions have documented answers
- Partitioned datasets can be cataloged with appropriate metadata
- portolan scan detects and reports partitioning
- Validation ensures partition consistency
- ADR documents the partitioning approach
- Decision made on upstream contribution (if applicable)
Related Issues
Additional Context
Why this matters: Partitioning is fundamental to cloud-native data lakes. Modern data tools (DuckDB, Polars, PyArrow) expect partitioned Parquet datasets. Portolan should catalog them correctly to enable efficient queries.
Real-world example: The Den Haag datasets could benefit from partitioning by year or district if they grow over time. Designing this now prevents future breaking changes.
STAC ecosystem gap: The table extension doesn't define partition metadata. This is an opportunity to contribute back to STAC if our approach proves useful.