Skip to content

Support Iceberg Metadata Files Cache#77156

Merged
hanfei1991 merged 50 commits intomasterfrom
hanfei/datalake_meatadata_cache
Apr 9, 2025
Merged

Support Iceberg Metadata Files Cache#77156
hanfei1991 merged 50 commits intomasterfrom
hanfei/datalake_meatadata_cache

Conversation

@hanfei1991
Copy link
Member

@hanfei1991 hanfei1991 commented Mar 5, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support IcebergMetadataFilesCache, which will store manifest files/list and metadata.json in one cache.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@hanfei1991 hanfei1991 added the pr-feature Pull request with new product feature label Mar 6, 2025
@clickhouse-gh
Copy link
Contributor

clickhouse-gh bot commented Mar 6, 2025

Workflow [PR], commit [7e4e53b]

@kssenii kssenii self-assigned this Mar 11, 2025
@hanfei1991 hanfei1991 marked this pull request as ready for review March 14, 2025 14:58
@hanfei1991 hanfei1991 changed the title [WIP] Support DataLake Metadata Cache Support DataLake Metadata Cache Mar 14, 2025
@hanfei1991
Copy link
Member Author

hanfei1991 commented Mar 20, 2025

Before I finish this PR, I have to implement cache of ManifestList/File first. Because:

  • Right now we have cached ManifestList/File in IcebergMetadata, but don't evict anything, so there will be memory leak for Iceberg Storage.
  • Reusing metadata_ptr is infeasible
    • After we have supported TimeTravel. Every metadata may get different snapshot, which need different manifest list / files.
    • update will cause data race
  • but copying metadata_ptr is also infeasible, because manifest file can be too huge to copy.

@hanfei1991
Copy link
Member Author

hanfei1991 commented Apr 3, 2025

fuzzer fail #73653

@kssenii PTAL again

@hanfei1991 hanfei1991 force-pushed the hanfei/datalake_meatadata_cache branch from 26dd9e9 to c05f8d0 Compare April 7, 2025 19:01
@hanfei1991
Copy link
Member Author

Fuzzer fail due to #78639

@kssenii PTAL

if (manifest_cache)
{
auto manifest_file = manifest_cache->getOrSetManifestFile(IcebergMetadataFilesCache::getKey(configuration_ptr, filename), create_fn);
schema_processor.addIcebergTableSchema(manifest_file->getSchemaObject());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw why do we need

schema_processor.addIcebergTableSchema(manifest_file->getSchemaObject());

?

Copy link
Member Author

@hanfei1991 hanfei1991 Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to construct schema_processor while constructing ManifestFileContent. See line 604.
If we get it from cache, we also need to do it (again), because it could be a different IcebergMetadata without adding anything to schema_processor

Copy link
Member

@kssenii kssenii Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be remove it from line 604 if we anyway do it here? to avoid doing it twice unnecessarily. And also add it in line 623

Copy link
Member Author

@hanfei1991 hanfei1991 Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to do it once before the ctor of ManifestFileContent, because we need it here https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/ObjectStorage/DataLakes/Iceberg/ManifestFile.cpp#L181

And adding it twice is not bad. We use map inside it to deduplicate

@hanfei1991 hanfei1991 force-pushed the hanfei/datalake_meatadata_cache branch from dd8cbb0 to 7e4e53b Compare April 8, 2025 15:26
@hanfei1991 hanfei1991 added this pull request to the merge queue Apr 9, 2025
Merged via the queue into master with commit 8f76db8 Apr 9, 2025
110 of 125 checks passed
@hanfei1991 hanfei1991 deleted the hanfei/datalake_meatadata_cache branch April 9, 2025 14:41
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 9, 2025
ianton-ru pushed a commit to Altinity/ClickHouse that referenced this pull request Apr 11, 2025
…eatadata_cache

Support Iceberg Metadata Files Cache
Enmk added a commit to Altinity/ClickHouse that referenced this pull request Apr 14, 2025
ianton-ru pushed a commit to Altinity/ClickHouse that referenced this pull request May 23, 2025
…eatadata_cache

Support Iceberg Metadata Files Cache
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants