[deprecates after 25.8] Glue: Deduce Iceberg table metadata location if metadata_location not specified#1070
Conversation
bbd2471 to
b1229b7
Compare
88224fa to
53eb8cc
Compare
metadata_location not specified
499df74 to
52ffc6c
Compare
arthurpassos
left a comment
There was a problem hiding this comment.
Overall it looks good, just a couple of minor adjustments.
Btw, is it possible to add a test? If it's too hard / we are running against the clock, then it is ok.
| // Construct path to version-hint.text | ||
| String version_hint_path = table_location + "metadata/version-hint.text"; | ||
|
|
||
| DB::ASTStorage * storage = table_engine_definition->as<DB::ASTStorage>(); |
There was a problem hiding this comment.
This code seems to be duplicated with GlueCatalog::classifyTimestampTZ. Instictively I would ask you to consider creating a function for this, but I am under the impression this PR is not going to upstream, right? Extracting it into a function would make rebasing harder, so it is ok.
There was a problem hiding this comment.
Btw, why is it not going into upstream?
There was a problem hiding this comment.
Yes, I'd keep it as is for now. Later I will bring this PR to upstream, and then a small refactor will be done.
There was a problem hiding this comment.
Btw, why is it not going into upstream?
Because we want it in the next 25.8 antalya release ASAP
| String version_hint_object_path = version_hint_path; | ||
| if (version_hint_object_path.starts_with("s3://")) | ||
| { | ||
| version_hint_object_path = version_hint_object_path.substr(5); | ||
| // Remove bucket from path | ||
| std::size_t pos = version_hint_object_path.find('/'); | ||
| if (pos != std::string::npos) | ||
| version_hint_object_path = version_hint_object_path.substr(pos + 1); | ||
| } |
There was a problem hiding this comment.
Consider using S3::URI, I think it offers the functionality you need. All you gotta do is to instantiate it passing the path
There was a problem hiding this comment.
Here again, I am keeping the code that is similar to upstream. This is a good note, I will keep it in mind when making a PR to upstream
|
|
||
| return table_location + "metadata/v" + version_str + "-metadata.json"; | ||
| } | ||
| catch (...) |
There was a problem hiding this comment.
Instead of nest try-catch blocks, consider creating two auxiliary functions that return a std::optional
There was a problem hiding this comment.
Here we are trying to read objects that may not exist at all -- for we still need to have try/catch
There was a problem hiding this comment.
I know, but instead of having nested try catch blocks, you can have two different functions that return std::optional
Wouldn't something like the below work?
std::optional<std::string> resolveApproach1(...)
{
try
{
// stuff that may throw
return resolved_path;
}
catch(...)
{
return std::nullopt;
}
}
std::optional<std::string> resolveApproach2(...)
{
try
{
// stuff that may throw
return resolved_path;
}
catch(...)
{
return std::nullopt;
}
}
String GlueCatalog::resolveMetadataPathFromTableLocation(const String & table_location, const TableMetadata & table_metadata) const
{
....
if (const auto path = resolveApproach1())
{
return path;
}
if (const auto path = resolveApproach2())
{
return path;
}
return std::nulopt;
}
There was a problem hiding this comment.
Yeah, maybe
But I actually do not think it shall be done here (as it is a style and code beauty issue, though I agree that what you suggest looks better). I will probably refactor some of this code when submitting into upstream (e.g. as suggested in your other comments), so I do not see a good reason to spend time on it now.
WDYT?
Antalya 25.6.5 Backport of 87733: Fix handling of timestamp(tz) columns in Glue Catalog
Not really. IDK if a good Glue emulator exists. But the thing that existing tests work with != AWS Glue. So, it can only be tested against real AWS. |
03ae62a to
0402d8d
Compare
metadata_location not specifiedmetadata_location not specified
Also includes #1053 (ClickHouse#87733)
In future releases shall be superseded by ClickHouse#91994
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Deduce Iceberg metadata from Glue's
LocationExclude tests: