Conversation
Hi @azr. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 2531c18 to 0748b0f
@azr Thanks for the PR, this looks promising. I wonder if you were able to get any memory usage data from your tests? A previous effort to use the ECR containerd resolver, which has a similar multipart layer download, showed that it can take up a disproportionate amount of memory, especially when we increase the number of parallel chunks (without providing a significant benefit to latency). The high memory utilization was mainly from the […].

Also, can you share some information about your test image? Number of layers? Size of individual layers?
/ok-to-test
Hey @swagatbora90, of course! The theory in my mind is that this should use, worst case, roughly `parallelism * chunk_size` of memory (e.g. with parallelism 110 and 32MB chunks, that's about 3.4GB). I think memory usage would be better if we were to write in parallel directly into a file at different positions, with 'holes', and sort of tell our progress to the checksumer with no-op writers that report where we are, etc. (Downloads actually were so much faster this way in a test program I did, but it was not doing any unpacking, etc.)

I also think it could be nice to have a per-registry parallelism setting, because not all registries are S3-backed, and docker.io seems to throttle things at 60MB/s.

Topology of images:
- ~8GB image (from dive infos)
- ~27GB image (from dive infos)

Here are memory usages, recorded periodically:
- ~27GB image pull, max_concurrent_downloads: 2, 0 parallelism (before)
- ~27GB image pull, max_concurrent_downloads: 2, 110 parallelism, 32mb chunks

GC traces:
- 8GB image with `GODEBUG=gctrace=1`, parallelism set to 110 and chunksize set to 32
- 8GB image with `GODEBUG=gctrace=1`, parallelism set to 0 (existing code)

Grpc tracing screenshots from the same run (8GB image with `GODEBUG=gctrace=1`, parallelism set to 110 and chunksize set to 32):

Screenshot from another run for a ~27GB image: after a while, all chunks seem to take the same amount of time, ~22s; we've probably reached the write-speed burst limit and are slowly taking more time to do things.
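A rough, runnable sketch of the chunked-download scheme being discussed (illustrative names and structure, not the PR's actual code): ranged GETs run in parallel, but completed chunks are written out strictly in order so the digest verifier can stream as usual; the writer loop is the one releasing semaphore slots, which is what keeps buffered data near `parallelism * chunk_size`:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// parallelFetch downloads [0, size) in chunkSize ranges with at most
// `parallelism` chunks outstanding, writing them to w strictly in order.
func parallelFetch(w io.Writer, url string, size, chunkSize int64, parallelism int) error {
	type result struct {
		data []byte
		err  error
	}
	sem := make(chan struct{}, parallelism)
	queue := make(chan chan result, parallelism)

	go func() { // dispatcher: one fetcher goroutine per chunk, in order
		defer close(queue)
		for off := int64(0); off < size; off += chunkSize {
			ch := make(chan result, 1)
			sem <- struct{}{} // slot acquired in chunk order; released by the writer
			queue <- ch
			go func(off int64, ch chan result) {
				end := min(off+chunkSize-1, size-1)
				req, _ := http.NewRequest(http.MethodGet, url, nil)
				req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, end))
				resp, err := http.DefaultClient.Do(req)
				if err != nil {
					ch <- result{err: err}
					return
				}
				defer resp.Body.Close()
				data, err := io.ReadAll(resp.Body)
				ch <- result{data: data, err: err}
			}(off, ch)
		}
	}()

	var firstErr error
	for ch := range queue { // consume in order; drain fully so no goroutine leaks
		r := <-ch
		<-sem // chunk consumed: a new fetch may start
		if firstErr == nil {
			firstErr = r.err
		}
		if firstErr == nil {
			if _, err := w.Write(r.data); err != nil { // in-order writes feed the digester
				firstErr = err
			}
		}
	}
	return firstErr
}

func main() {
	payload := strings.Repeat("containerd!", 100_000) // ~1.1 MB test blob
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.ServeContent(w, r, "blob", time.Time{}, strings.NewReader(payload)) // honors Range
	}))
	defer srv.Close()

	var buf bytes.Buffer
	if err := parallelFetch(&buf, srv.URL, int64(len(payload)), 64<<10, 4); err != nil {
		panic(err)
	}
	fmt.Println("bytes:", buf.Len(), "match:", buf.String() == payload)
}
```

Error handling is simplified here: on failure the sketch keeps draining instead of cancelling in-flight requests, which real code would do via a context.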
Force-pushed from c13969f to 8fc47db
@azr Thanks for adding the performance numbers. I ran some tests as well using your patch, and the memory usage looks better than what I saw in the htcat implementation, especially with a high parallelism count. However, I do observe that increasing parallelism does not yield better latency and may lead to higher memory usage (there are a number of other factors to consider here, mainly the type of instance used for testing and the network bandwidth). I tried to limit the test to a single image with a single layer, fixing the chunk size to 20 MB. A lower parallelism count (3 or 4) may be preferable to setting parallelism upwards of 10. I used a c7.12xlarge instance to pull a 3GB single-layer image from an ECR private repo.

Also, the network download time was much faster (see Network Pull time, ~15sec), while containerd took an additional ~20secs to complete the pull (before it started unpacking). I calculated the network download time by periodically calling […].
core/remotes/resolver.go
```diff
 // All content fetched from the returned fetcher will be
 // from the namespace referred to by ref.
-Fetcher(ctx context.Context, ref string) (Fetcher, error)
+Fetcher(ctx context.Context, ref string, opts ...FetcherOpt) (Fetcher, error)
```
I don't think we should make this interface change. I was trying to think of some better options. Previously we had another interface appended with WithOptions, but I don't think that should be necessary here. This configuration could just be given directly as resolver options. The resolver could have a SetOptions function defined and type-asserted to in the transfer case. I think we can avoid changing the interface here, especially since these settings are global and not defined per call.
Okay, updated. Thanks for the review, I think this looks better, WDYT?
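A minimal, self-contained sketch of the type-assertion pattern suggested above (the names `FetchConfig`, `optionSetter`, and `dockerResolver` are illustrative, not containerd's actual API):

```go
package main

import "fmt"

// FetchConfig is a hypothetical bundle of global fetch settings.
type FetchConfig struct {
	ConcurrentLayerFetchParallelism int
}

// optionSetter is the optional capability a resolver may implement.
type optionSetter interface {
	SetOptions(FetchConfig)
}

type dockerResolver struct {
	cfg FetchConfig
}

func (r *dockerResolver) SetOptions(c FetchConfig) { r.cfg = c }

// configureResolver type-asserts instead of widening the Fetcher
// interface, so resolvers without the setter are simply left alone.
func configureResolver(resolver any, c FetchConfig) {
	if s, ok := resolver.(optionSetter); ok {
		s.SetOptions(c)
	}
}

func main() {
	r := &dockerResolver{}
	configureResolver(r, FetchConfig{ConcurrentLayerFetchParallelism: 4})
	fmt.Println("parallelism:", r.cfg.ConcurrentLayerFetchParallelism)
}
```

The point of the assertion is that the `Fetcher` interface stays unchanged, and these global settings never appear in the per-call signature.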
/retest
Hey @dmcgowan & @djdongjin, I updated the code to only add one setting: […]. I also applied @dmcgowan's review to configure these settings (…).

In the meantime, I just realised that while one pull at a time was much faster than before, pulling two things at a time made for a longer total pull than if I were to pull them at two different times, which doesn't sound like the best. Will investigate a bit.
re: […]

Turns out it's not the case on tmpfs, so I think this is okay. So yeah, this PR looks quite good to me; thanks for the many reviews, and please let me know if anything needs to change. Will try to get the tests to pass now.
dmcgowan
left a comment
I think these two conditions need to be checked to avoid a leak and a panic, but they are hard to simulate in tests.
core/remotes/docker/fetcher.go
```go
case <-stopChan:
	return errors.New("another worker failed")
default:
	close(stopChan)
```
This is still racy: multiple goroutines may hit this condition and cause a panic. There is no guarantee that one goroutine selects the default case and closes the channel before another goroutine also selects the default case.
Ditto here. I put the close in a sync.Once to be sure, and also made the stop cancel the context. LMK if anything else needs changing.
Ah, made it even better: just using the context with cancel and its done chan.
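A small illustration of that final approach (simplified, with hypothetical helpers, not the PR's actual code): `ctx.Done()` can be selected on by any number of goroutines, and `cancel()` is idempotent, so there is no double-close panic to worry about:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// fetchChunks runs n workers; the first failure cancels the shared
// context, which every other worker observes via ctx.Done().
func fetchChunks(ctx context.Context, n int) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // safe to call repeatedly, unlike close() on a channel

	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			select {
			case <-ctx.Done(): // another worker already failed; bail out
				return
			default:
			}
			if err := fetchChunk(ctx, i); err != nil {
				once.Do(func() { firstErr = err })
				cancel() // tell the remaining workers to stop
			}
		}(i)
	}
	wg.Wait()
	return firstErr
}

// fetchChunk stands in for one ranged download; chunk 3 fails on purpose.
func fetchChunk(ctx context.Context, i int) error {
	if i == 3 {
		return errors.New("simulated chunk failure")
	}
	return nil
}

func main() {
	fmt.Println(fetchChunks(context.Background(), 8))
}
```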
Signed-off-by: Adrien Delorme <[email protected]>
Co-Authored-By: Corentin REGAL <[email protected]>
/test pull-containerd-k8s-e2e-ec2
dmcgowan
left a comment
We should just get this in; there are still maybe a few interface tweaks we can make before the final release that won't affect the functionality.
```go
r.Release(1)
for range parallelism {
	go func() {
		for i := range queue { // first in first out
```
Not sure if there is a race condition in the http2 stdlib around window updating.
Will check this part later; it's not a blocker.
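For context, a runnable toy version of the pattern in the excerpt above (simplified; `queue` and `parallelism` come from the diff, the rest is illustrative): a fixed pool of workers drains a FIFO channel of chunk indices:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const parallelism = 4
	queue := make(chan int) // chunk indices, consumed first in first out
	var wg sync.WaitGroup

	for range parallelism { // Go 1.22+: range over an integer
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range queue { // each worker takes the next queued chunk
				fmt.Println("fetching chunk", i)
			}
		}()
	}

	for i := 0; i < 16; i++ {
		queue <- i // enqueue in order; an idle worker picks it up
	}
	close(queue) // lets the workers' range loops end
	wg.Wait()
}
```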
@azr thank you for all the hard work on this, it's an awesome improvement! 🙏

Aw thanks! ❤️ You're welcome!! 🚀







TLDR: this makes pulls of big images ~2x faster (edit: and a bit more in the latest iteration), and closes #9922.
cc: #8160, #4989
Hello Containerd People, I have this draft PR I would like to get your eyes on.
It basically makes pulls faster while trying not to have too big a memory impact, by fetching consecutive chunks of each layer and immediately pushing them into the pipe (the one that writes to a file and feeds the digest/checksum verification).

I noticed it made pulls ~2x faster when using the right settings.
The settings have a big impact, so I did a bunch of perf tests with different settings. Here are some results on a ~8GB image using an `r6id.4xlarge` instance, pulling it from S3. Gains are somewhat similar on a ~27GB and a ~100GB image (with a tiny bit of slowdown).

I also tried on nvme and EBS drives; they are of course slower, but the gains are still the same.

Metrics on an `r6id.4xlarge`, timing `crictl pull` of an 8.6GB image. The first one, with 13 tries, is with 0 parallelism (the current code). The rest are tries with different settings.

tmpfs tests: […]

nvme (885GB) tests: […]
Observations: I wrote a little Go program to multipart-download big files directly into a file, at different positions, with different requests, and that was much faster than piping single-threadedly into a file. containerd pipes through a checksumer and then pipes into a file; I think this can, in some conditions, create some sort of thrashing, hence why the parameters are very important here.

That simple Go program had pretty bad perf with one connection, but I was able to saturate the network with multiple connections, with better or on-par perf versus aws-crt.

I think that for maximum perf we could try to re-architecture things a bit: e.g. concurrently write directly into the temp file, then tell the checksumer our progress so that it can do its work in parallel, and then carry on as usual.
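To make that last idea concrete, here is a rough sketch of the "write at offsets" variant (my illustrative naming, not containerd code): each worker streams its range straight into the file through an offset writer, so nothing is funneled through a single pipe; digest verification would then have to track contiguous progress separately, which is exactly the re-architecture trade-off described above.

```go
// Package fetchsketch: parallel ranged GETs written directly at their
// final offsets via io.NewOffsetWriter (Go 1.20+). Error handling is
// simplified; real code would cancel in-flight requests via a context.
package fetchsketch

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"sync"
)

func downloadAt(url string, f *os.File, size, chunkSize int64, parallelism int) error {
	sem := make(chan struct{}, parallelism) // bounds in-flight requests
	errCh := make(chan error, 1)            // keeps only the first error
	var wg sync.WaitGroup

	for off := int64(0); off < size; off += chunkSize {
		wg.Add(1)
		go func(off int64) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()

			end := min(off+chunkSize-1, size-1)
			req, _ := http.NewRequest(http.MethodGet, url, nil)
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, end))
			resp, err := http.DefaultClient.Do(req)
			if err == nil {
				// Stream the chunk straight to its offset: no pipe, no
				// per-chunk buffer; the file just fills in with 'holes'.
				_, err = io.Copy(io.NewOffsetWriter(f, off), resp.Body)
				resp.Body.Close()
			}
			if err != nil {
				select {
				case errCh <- err:
				default:
				}
			}
		}(off)
	}
	wg.Wait()

	select {
	case err := <-errCh:
		return err
	default:
		return nil
	}
}
```

Compared to the in-order pipe approach, buffered memory here stays flat regardless of parallelism; the cost is that the digest can no longer be computed as a straight stream.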