
Hive Partitioning Metadata Design for STAC #232

@nlebovits

Description


Context

Cloud-native data lakes commonly use Hive-style partitioning to organize large datasets:

demographics/
├── year=2020/
│   ├── state=CA/data.parquet
│   └── state=NY/data.parquet
└── year=2021/
    └── state=CA/data.parquet

Portolan needs to catalog partitioned datasets in STAC metadata. However, the STAC Table Extension v1.2.0 does not define partition metadata—it only describes table schemas, not how tables are organized on disk.

This ticket focuses on how to represent partition metadata in STAC for discovery, validation, and consumption by downstream tools.


Background: What is Hive Partitioning?

Hive-style partitioning organizes data into directories based on column values:

Pattern: key1=value1/key2=value2/...

Example:

sales/
├── year=2023/
│   ├── quarter=Q1/
│   │   └── data.parquet
│   └── quarter=Q2/
│       └── data.parquet
└── year=2024/
    └── quarter=Q1/
        └── data.parquet

Benefits:

  • Query optimization — Filter by partition keys without scanning all data
  • Incremental updates — Add new partitions without rewriting existing data
  • Parallel processing — Distribute work across partitions

Used by: Apache Iceberg, Delta Lake, Apache Hive, DuckDB, Polars, pandas, PyArrow
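The `key=value` path pattern is straightforward to generate and parse. As an illustration, a hypothetical `hive_path` helper (not part of any library) might build such paths:

```python
from pathlib import PurePosixPath

def hive_path(root: str, partition_values: dict, filename: str = "data.parquet") -> str:
    """Build a Hive-style path: root/key1=val1/key2=val2/filename."""
    segments = [f"{key}={value}" for key, value in partition_values.items()]
    return str(PurePosixPath(root, *segments, filename))

hive_path("sales", {"year": 2023, "quarter": "Q1"})
# → 'sales/year=2023/quarter=Q1/data.parquet'
```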


Current STAC Table Extension Gaps

The table extension defines:

  • table:columns — Column metadata (name, type, description)
  • table:row_count — Total row count
  • table:primary_geometry — Geometry column name

What's missing:

  • ❌ Partition key definitions (which columns are partition keys)
  • ❌ Partition value ranges (min/max for each partition)
  • ❌ Partition statistics (row count per partition, file count)
  • ❌ Partitioning scheme (Hive-style, directory-based, other)

Proposed Approach: Single Logical Table at Collection Level

Based on how modern data lake formats (Iceberg, Delta Lake) handle partitioned datasets, we propose:

✅ Decisions We Can Make Now

1. Partitioned datasets = single logical table at Collection level

Treat the entire partition tree as one asset of the collection:

demographics/
├── collection.json          ← Describes entire partitioned dataset
├── year=2020/
│   ├── state=CA/data.parquet
│   └── state=NY/data.parquet
└── year=2021/
    └── state=CA/data.parquet

Rationale:

  • Matches Iceberg/Delta Lake semantics (partitioning is physical, not logical)
  • Aligns with table extension pattern (collection-level assets for tabular data)
  • Simplifies queries (one schema applies to all partitions)

Alternative (rejected): Create one Item per partition

  • ❌ Creates explosion of Items for large datasets
  • ❌ Breaks semantic meaning (partitions aren't independent entities)
  • ❌ Complicates queries (must aggregate across Items)

2. table:row_count = sum across all partitions

{
  "table:row_count": 150000  // Sum of all partition row counts
}

Rationale:

  • Users want to know total dataset size, not per-partition size
  • Matches DuckDB/pandas behavior (query entire dataset)
  • Per-partition counts can be derived from Parquet metadata if needed

3. Partition discovery from Parquet metadata (not STAC)

Consumers should read partition information directly from Parquet files, not duplicate it in STAC:

Why:

  • Source of truth: Parquet file metadata already contains partition info
  • Avoids drift: STAC metadata can become stale if partitions are added/removed
  • Standard tools: DuckDB, pandas, PyArrow already do this

Example (DuckDB):

SELECT * FROM 'demographics/**/*.parquet' WHERE year = 2020;
-- DuckDB automatically detects Hive partitioning from directory structure

Example (PyArrow):

import pyarrow.dataset as ds
dataset = ds.dataset("demographics/", partitioning="hive")
# PyArrow reads partition schema from directory names

Implication for STAC: We don't need to store partition values in collection.json—just document that the dataset is partitioned.


Research Questions

While we've established the logical model (single table at collection level), we need to research how to express partition metadata in STAC for discoverability and validation.

Section A: Partition Metadata Representation

A1. How Should We Document Partitioning in STAC?

Option A: Human-readable description only

{
  "description": "Census data partitioned by year and state (Hive-style: year=YYYY/state=XX/)"
}

Pros:

  • Simple, no spec changes needed
  • Flexible for any partitioning scheme
  • Works with STAC today

Cons:

  • Not machine-readable
  • Can't validate partition structure
  • Can't generate partition-aware queries

Option B: Custom STAC extension for partitioning

{
  "stac_extensions": ["https://portolan.org/stac-extensions/partition/v1.0.0/schema.json"],
  "partition:scheme": "hive",
  "partition:keys": [
    {"name": "year", "type": "int64"},
    {"name": "state", "type": "string"}
  ]
}

Pros:

  • Machine-readable
  • Can validate partition consistency
  • Other tools could adopt if useful

Cons:

  • Requires maintaining custom extension
  • May duplicate work if upstream adds this to table extension
  • Needs community review for STAC compliance

Option C: Extend STAC Table Extension upstream

Contribute partition metadata fields to the table extension:

{
  "stac_extensions": ["https://stac-extensions.github.io/table/v1.3.0/schema.json"],
  "table:columns": [...],
  "table:partitioning": {
    "scheme": "hive",
    "keys": ["year", "state"]
  }
}

Pros:

  • Standard STAC (if accepted)
  • Benefits entire STAC ecosystem
  • Aligns with table extension's purpose

Cons:

  • Requires upstream approval (may take time)
  • Table extension is "Pilot" maturity (evolving spec)
  • May need to prove value with real-world usage first

Research tasks:

  • Check if any STAC catalogs document partitioned datasets
  • Search STAC GitHub issues/discussions for partition-related requests
  • Review Iceberg/Delta Lake metadata formats for inspiration
  • Test STAC validators with custom extensions
  • Draft potential table extension PR (explore feasibility)

Questions to answer:

  • Has anyone else requested partition metadata in STAC?
  • Would the STAC table extension maintainers accept this?
  • Is there prior art in other STAC extensions?

A2. Should We Store Partition Statistics in STAC?

Beyond partition keys, should we track:

  • Row count per partition?
  • File count per partition?
  • Min/max values per partition?
  • Partition creation timestamps?

Example:

{
  "partition:statistics": {
    "year=2020/state=CA": {
      "row_count": 50000,
      "file_count": 5,
      "min_values": {"population": 1000},
      "max_values": {"population": 5000000}
    }
  }
}

Research tasks:

  • Check how Iceberg stores partition statistics
  • Review Delta Lake transaction log format
  • Test DuckDB's partition pruning behavior (does it need STAC metadata?)
  • Determine if STAC consumers would use this information

Questions to answer:

  • Do STAC consumers need partition stats, or can they derive them?
  • Would this metadata become stale quickly (high maintenance)?
  • Is this STAC's responsibility or Parquet's?

A3. How to Handle Partition Schema Evolution?

What happens when partition structure changes?

Scenario:

# Original partitioning
data/
└── year=2020/data.parquet

# Later, add state partitioning
data/
├── year=2020/data.parquet  ← Old partition (no state key)
└── year=2021/
    ├── state=CA/data.parquet
    └── state=NY/data.parquet

Research tasks:

  • Check how Iceberg/Delta handle partition evolution
  • Test DuckDB behavior with inconsistent partition structure
  • Determine if STAC should warn about this
  • Consider versioning strategy (new collection version?)

Questions to answer:

  • Should portolan validate partition consistency?
  • Should we block mixed partition schemas?
  • Is this a user error or valid use case?

Section B: Asset Representation

B1. How Should Collection Assets Point to Partitioned Data?

Option A: Single asset pointing to root

{
  "assets": {
    "data": {
      "href": "./",  // Points to partition tree root
      "type": "application/vnd.apache.parquet",
      "roles": ["data"]
    }
  }
}

Pros:

  • Simple, one asset for entire dataset
  • Matches "single logical table" model
  • Tools can glob for **/*.parquet

Cons:

  • href points to directory, not file (unusual for STAC)
  • May confuse tools expecting file URLs

Option B: Asset per partition

{
  "assets": {
    "year=2020/state=CA": {
      "href": "./year=2020/state=CA/data.parquet",
      "type": "application/vnd.apache.parquet",
      "roles": ["data"]
    },
    "year=2020/state=NY": {...}
  }
}

Pros:

  • Each asset is a file (standard STAC pattern)
  • Can include per-partition metadata

Cons:

  • Explosion of assets for large datasets (hundreds/thousands)
  • Breaks "single logical table" model
  • Hard to maintain as partitions are added

Option C: Single asset with glob pattern

{
  "assets": {
    "data": {
      "href": "./**/*.parquet",  // Glob pattern
      "type": "application/vnd.apache.parquet",
      "roles": ["data"]
    }
  }
}

Pros:

  • Concise, works for any number of partitions
  • Tools know how to resolve globs

Cons:

  • Glob patterns aren't standard in STAC hrefs
  • May not work with HTTP range requests
  • Unclear for cloud storage (S3 doesn't support globs directly)

Research tasks:

  • Check STAC spec for href requirements (must be file? can be directory?)
  • Test STAC Browser with directory hrefs
  • Test pystac asset validation
  • Review how other STAC catalogs handle multi-file assets
  • Test cloud storage tools (fsspec, s3fs) with directory URLs

Questions to answer:

  • What does STAC allow in asset hrefs?
  • How do STAC clients handle directory-based assets?
  • Is there precedent for glob patterns in STAC?

B2. Should We Use table:storage_options for Partition Access?

The table extension defines table:storage_options for fsspec parameters:

{
  "assets": {
    "data": {
      "href": "s3://bucket/demographics/",
      "table:storage_options": {
        "anon": false,
        "requester_pays": true
      }
    }
  }
}

Research tasks:

  • Test fsspec with partitioned Parquet datasets
  • Check if DuckDB/PyArrow respect storage_options
  • Determine if this is sufficient for cloud access
  • Test with Portolan's S3 integration

Questions to answer:

  • Does table:storage_options work for partitioned datasets?
  • Do tools automatically discover partitions via fsspec?
  • Should we document specific storage_options patterns?

Section C: Validation and Tooling

C1. How Should portolan scan Detect Partitioning?

When scanning a directory with Hive-style partitioning:

portolan scan demographics/

Expected output:

✓ Found: demographics/ (Hive-partitioned GeoParquet)
  → Partition keys: year, state
  → Partitions: 3 (year=2020/state=CA, year=2020/state=NY, year=2021/state=CA)
  → Total rows: 150,000
  → Recommended: Collection-level partitioned asset

Research tasks:

  • Implement partition key detection from directory names
  • Validate partition consistency (all use same keys?)
  • Count total rows by reading Parquet metadata
  • Detect mixed partition schemas (warn if inconsistent)

Questions to answer:

  • Should scan automatically detect partitioning?
  • Should we validate partition structure?
  • What warnings should we show for malformed partitions?
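The key-detection step above can be sketched with only the standard library (the `detect_partition_keys` name and return convention are ours, not an existing portolan API):

```python
import os
import re

HIVE_SEGMENT = re.compile(r"^([^=/]+)=([^=/]*)$")

def detect_partition_keys(root: str) -> set:
    """Walk a directory tree and collect Hive partition key tuples per leaf.

    Returns the set of key tuples found; more than one entry signals a
    mixed (inconsistent) partition scheme worth warning about.
    """
    schemes = set()
    for dirpath, dirnames, filenames in os.walk(root):
        if any(f.endswith(".parquet") for f in filenames):
            rel = os.path.relpath(dirpath, root)
            keys = tuple(
                m.group(1)
                for seg in (rel.split(os.sep) if rel != "." else [])
                if (m := HIVE_SEGMENT.match(seg))
            )
            schemes.add(keys)
    return schemes
```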

C2. How Should portolan add Handle Partitioned Datasets?

When adding a partitioned dataset:

portolan add demographics/

Option A: Auto-detect and create collection

# Detects partitioning, creates collection.json with partition metadata
# Single collection asset pointing to partition tree

Option B: Require explicit flag

portolan add demographics/ --partitioned

Option C: Interactive prompt

portolan add demographics/
# → "Detected Hive-style partitioning (keys: year, state). Add as partitioned dataset? [Y/n]"

Research tasks:

  • Design UX for partition detection
  • Consider --auto flag implications
  • Test with various partition structures
  • Handle edge cases (incomplete partitions, nested partitioning)

Questions to answer:

  • Which option aligns with portolan's "interactive + automatable" principle?
  • Should auto-detection be default behavior?
  • How do we handle ambiguous cases?

C3. Validation Requirements for Partitioned Datasets

What should portolan validate?

Partition structure:

  • All partitions use same keys?
  • Partition values are valid (no special characters)?
  • All leaf directories contain Parquet files?

Schema consistency:

  • All partitions have same Parquet schema?
  • Partition keys exist in Parquet schema?
  • Data types match across partitions?

Metadata accuracy:

  • table:row_count matches sum of partition row counts?
  • table:columns matches actual Parquet schema?

Research tasks:

  • Design validation pipeline for partitioned datasets
  • Implement schema comparison across partitions
  • Test with real-world partitioned datasets
  • Determine performance implications (reading all Parquet metadata)

Questions to answer:

  • What level of validation is appropriate?
  • Should validation be fast (heuristics) or exhaustive (full scan)?
  • How do we report validation errors?

Section D: Upstream Contribution

D1. Should We Propose Partition Metadata to STAC Table Extension?

If we implement a custom partition metadata approach, should we:

  • Contribute it upstream to the table extension?
  • Keep it as Portolan-specific extension?
  • Wait for community demand before proposing?

Research tasks:

  • Draft potential table extension PR
  • Review table extension governance (how to propose changes)
  • Check if other STAC users have this need
  • Engage with STAC community in discussions

Questions to answer:

  • Is there appetite for partition metadata in STAC?
  • Should we prove value with Portolan first, then propose?
  • What would the contribution process look like?

D2. Compatibility with Other Table Formats

Beyond Parquet, partitioning applies to:

  • Apache Iceberg (sophisticated partition evolution)
  • Delta Lake (transaction log with partition info)
  • Apache Hive (original Hive-style partitioning)

Research tasks:

  • Review Iceberg metadata format
  • Review Delta Lake transaction log
  • Determine if partition metadata should be format-agnostic
  • Consider future plugin architecture (ADR-0003)

Questions to answer:

  • Should partition metadata be Parquet-specific or generic?
  • How would this work with Iceberg plugin (future)?
  • Can we design for extensibility?

Implementation Scope

Once research is complete, this ticket will cover:

Phase 1: Basic Partition Support

  • Detect Hive-style partitioning in portolan scan
  • Create collection-level assets for partitioned datasets
  • Document partitioning in collection description (human-readable)
  • Compute table:row_count by summing partition row counts

Phase 2: Partition Metadata (TBD based on research)

  • Implement chosen metadata representation (description / custom extension / upstream)
  • Extract partition keys from directory structure
  • Validate partition consistency

Phase 3: Documentation

  • ADR: "Partitioned Dataset Cataloging"
  • Examples in docs/ showing partitioned datasets
  • Update CLAUDE.md with partitioning patterns
  • Possibly: Contribution to STAC table extension

Dependencies

Depends on: #231 (Collection-Level Assets and Nested Collections)

  • Need collection-level assets working before adding partition metadata

Blocks:

  • #TBD (Table Extension Integration) — partition metadata affects schema extraction
  • #TBD (Enhanced Scan UX) — partition detection is part of scan improvements

Out of Scope (Separate Work)

  • Partition pruning optimization → DuckDB/PyArrow handle this, not Portolan's job
  • Partition evolution → Complex topic, defer to future if needed
  • Iceberg/Delta integration → Covered by plugin architecture (ADR-0003)
  • Spatial partitioning → Different pattern (tile-based), separate research

Success Criteria

This ticket is complete when:

  1. All research questions have documented answers
  2. Partitioned datasets can be cataloged with appropriate metadata
  3. portolan scan detects and reports partitioning
  4. Validation ensures partition consistency
  5. ADR documents the partitioning approach
  6. Decision made on upstream contribution (if applicable)

Related Issues


Additional Context

Why this matters: Partitioning is fundamental to cloud-native data lakes. Modern data tools (DuckDB, Polars, PyArrow) expect partitioned Parquet datasets. Portolan should catalog them correctly to enable efficient queries.

Real-world example: The Den Haag datasets could benefit from partitioning by year or district if they grow over time. Designing this now prevents future breaking changes.

STAC ecosystem gap: The table extension doesn't define partition metadata. This is an opportunity to contribute back to STAC if our approach proves useful.

Metadata

Assignees: no one assigned

Labels: enhancement (New feature or request), roadmap:mvp (Phase 1: Core CLI + Spec)

Milestone: none