Define S3 client with bucket and endpoint resolution #45783
nikitamikhaylov merged 23 commits into master
Conversation
```cpp
{
    auto params = BaseRequest::GetEndpointContextParams();
    if (!region_override.empty())
        params.emplace_back("Region", region_override);
```
Is it possible that BaseRequest::GetEndpointContextParams() already has this override?
It's possible, but the params are processed sequentially, so the latest one always overrides the previous.
Also, the params are fetched from 3 different sources, and BaseRequest::GetEndpointContextParams is the last one, so we are sure it will always win. That's an assumption we need to be careful about if they change it in the SDK (I don't see a reason for them to do it, but it could happen). I need to add a comment explaining exactly this.
```cpp
ClientCache(const ClientCache & other)
    : region_for_bucket_cache(other.region_for_bucket_cache)
    , uri_for_bucket_cache(other.uri_for_bucket_cache)
```
No, but I didn't see a reason not to do it. We copy the client for globbed URLs, and if it's copied then the endpoint is for sure the same, so why not reuse the cache.
```cpp
ClientCacheRegistry() = default;

std::mutex clients_mutex;
std::unordered_map<ClientCache *, std::weak_ptr<ClientCache>> client_caches;
```
Is it necessary to have this two-dimensional system? I mean can't we just keep region_for_bucket_cache and uri_for_bucket_cache here in this singleton class?
We can, but I wanted a way to detect a client that was deleted while the destructor failed to unregister it, so we can clean up the dead pointers here if that happens.
If that's premature optimization, I can simplify it the way you described.
```cpp
    mutable std::optional<S3::URI> uri_override;
};

using HeadObjectRequest = ExtendedRequest<Model::HeadObjectRequest>;
```
Can't we just modify BaseRequest without inheriting every S3 request type?
I wanted to avoid modifying the SDK because it's more complex to extend with new types (update the fork, then update the commit of the submodule). This way it's much easier to add support for a new type of request, IMO.
```cpp
    ClientCacheRegistry::instance().registerClient(cache);
}

/// Make regular functions private
```
If you use requests from the SDK directly, the client can't override the region/endpoint, so we make the functions accepting Aws::S3::Model::*Request private; the only way to call those functions on our client is with our extended requests.
```cpp
    , max_redirects(max_redirects_)
    , log(&Poco::Logger::get("S3Client"))
{
    auto * endpoint_provider = dynamic_cast<Aws::S3::Endpoint::S3DefaultEpProviderBase *>(accessEndpointProvider().get());
```
Can endpoint_provider be nullptr?
It shouldn't be; AWS always does a nullptr check before accessing it. I can add a chassert.
```cpp
template <typename RequestType, typename RequestFn>
std::invoke_result_t<RequestFn, RequestType>
doRequest(const RequestType & request, RequestFn request_fn) const
```
This template function is not called anywhere except Client.cpp so it can be easily moved to Client.cpp.
```cpp
detect_region = explicit_region == Aws::Region::AWS_GLOBAL && endpoint.find(".amazonaws.com") != std::string::npos;

cache = std::make_shared<ClientCache>();
ClientCacheRegistry::instance().registerClient(cache);
```
Let's move the constructor to Client.cpp
```cpp
if (auto region = getRegionForBucket(bucket); !region.empty())
{
    if (!detect_region)
        LOG_INFO(log, "Using region override {} for bucket {}", region, bucket);
```
If the endpoint is specified with an explicit region then detect == false and we will always see this line in the logs. Looks excessive.
Only if we explicitly defined the wrong region; if it was correct, no log will be printed.
But yeah, it can still be excessive.
```cpp
if (checkIfWrongRegionDefined(bucket, error, new_region))
{
    request.overrideRegion(new_region);
    return HeadObject(request);
```
The HeadObject function doesn't have to be recursive here. If possible, let's use just a loop; they're usually easier to understand and debug.
```cpp
{
    auto uri_override = request.getURIOverride();
    assert(uri_override.has_value());
    updateURIForBucket(bucket, std::move(*uri_override));
```
There is only one place where you set found_new_endpoint later - let's move this code there. SCOPE_EXIT makes the code harder to read.
```cpp
    request.overrideRegion(std::move(region));
}

if (auto uri = getURIForBucket(bucket); uri.has_value())
```
I think we can add an empty() function to S3::URI and so make the code more consistent, without using std::optional<S3::URI>.
```cpp
    return wrapped_strategy->RequestBookkeeping(httpResponseOutcome, lastError);
}

bool Client::checkIfWrongRegionDefined(const std::string & bucket, const Aws::S3::S3Error & error, std::string & region) const
```
Let's rename the function to checkIfExplicitRegionIsWrong for consistency with the variable explicit_region.
```cpp
    }
);

for (size_t attempt = 0; attempt <= max_redirects; ++attempt)
```
request_fn already has its own retries, see AWSClient::AttemptExhaustively(). And our code which uses S3::Client often has its own retries too. I'm afraid there can be too many retries.
We already do something similar with PocoHTTPClient and with s3_max_redirects.
This is basically a redirect attempt, and it should be much rarer, with 1 or 2 jumps.
```cpp
region = GetErrorMarshaller()->ExtractRegion(error);

if (region.empty())
    region = getRegionForBucket(bucket, /*force_detect*/ true);
```
I don't understand why we need to call HeadBucket after our request failed. In all known cases, if a request fails because of the region, the correct region will be in the response body. With one exception, HeadObject, but we cover this case by calling HeadBucket before the main request.
If the region is explicitly defined, HeadBucket is not called.
```cpp
auto bucket_uri = getURIForBucket(bucket);
if (!bucket_uri)
{
    if (auto maybe_error = updateURIForBucketForHead(bucket); maybe_error.has_value())
```
I don't think we should make our code that complicated. If HeadBucket and the HeadObject's response header both didn't help us to get a correct region then ListObjects will hardly help.
It will help us if we got 301 but we couldn't get the location because HEAD doesn't have a response body.
I thought about this PR for a while. I hope the time when I updated the AWS SDK in our contrib wasn't the last time we did it, and this PR introduces many changes which will make that update quite hard (inheriting every S3 request type, for example). It seems all we needed were: …

For 1 we could add two virtual functions to … and then call them from …. For 2 we could add two virtual functions to … and again call them from …. And for 3 we could call ….

What's the difference? While updating the AWS SDK it's much easier to resolve a few small conflicts because we added a few empty functions than to try to understand what happened with that very complicated retry mechanism in our code, or to figure out what we should do with new requests possibly added by AWS in the future: should we inherit them too or not?
I can try doing it in this way, but I think we should do
Do you know any example where some non-HeadObject request failed because of an undefined region but the response didn't contain the correct region?
Not sure now; I think I did see a request like that, but why not cover this case just in case?
Hi, @antonio2368. Is there any benchmark on this improved client? Just like @vitlibar said, it's a great modification which may bring trouble when updating the aws submodule. So I'd like to know how much improvement this client can bring.
Hi @Alima777. But there are many other client improvements already introduced, and more planned to be introduced in other PRs.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Improve internal S3 client to correctly deduce regions and redirections for different types of URLs.
I defined a client that will try to detect the region for the endpoint (if it's not explicitly defined in the URL or config) by using `HeadBucket`, allowing us to use `HeadObject` correctly. Regions will be cached for each bucket per client.
Additionally, I noticed that the SDK doesn't correctly handle a 301 response when the style of URL changes.
For example, this never worked with CH:
because the redirect points to a different-style URL (virtual-hosted address) while the AWS SDK simply replaces the URI authority with the location from the response body.
I defined similar logic with a cache, as for the region, and now this also works correctly, even with `HeadObject`.
To verify that both caches are working, you can define an S3 table and check the performance difference between the first and second run (if a URL without a region is used, and even more if it's redirected).
Things left to do:
- `SYSTEM DROP S3 CLIENT CACHE`
- AWS SDK PR ClickHouse/aws-sdk-cpp#13
Documentation entry for user-facing changes