Skip to content

Update CRI to use transfer service for image pull by default#8515

Merged
dmcgowan merged 1 commit intocontainerd:mainfrom
fangn2:cri-image-transfer
Apr 24, 2025
Merged

Update CRI to use transfer service for image pull by default#8515
dmcgowan merged 1 commit intocontainerd:mainfrom
fangn2:cri-image-transfer

Conversation

@fangn2
Copy link
Contributor

@fangn2 fangn2 commented May 15, 2023

This PR is to update CRI to optionally use transfer service for image pull.

Part of #8227

  1. Create a new CRI config ImagePullWithTransferService to specify whether to use transfer service for image pull
  2. Move CRI image pull related configs to server side, i.e. transfer service.

The PR is still in progress, put out to for early review and utilize CI for integration tests.

@fangn2 fangn2 marked this pull request as draft May 15, 2023 03:01
@k8s-ci-robot
Copy link

Hi @fangn2. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fangn2 fangn2 changed the title [WIP]Update CRI to optional use transfer service for image pull [WIP]Update CRI to optionally use transfer service for image pull May 15, 2023
@fangn2 fangn2 force-pushed the cri-image-transfer branch from a62e6ed to ed26de0 Compare May 15, 2023 03:07
@mikebrow
Copy link
Member

/ok-to-test

@fangn2 fangn2 force-pushed the cri-image-transfer branch from ed26de0 to 3ed06cd Compare May 30, 2023 16:56
@dmcgowan dmcgowan self-assigned this May 30, 2023
@fangn2 fangn2 force-pushed the cri-image-transfer branch from 3ed06cd to 88d975b Compare June 12, 2023 03:12
return nil, fmt.Errorf("failed to pull and unpack image %q: %w", ref, err)
}

return &runtime.PullImageResponse{ImageRef: ref}, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to update the image store.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review! Will address it in the coming changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
Just curious CRI currently maintain an in-memory image cache(there could be situations that the CRI image store and containerd image store are out of sync, like other clients e.g. ctr removed an image). Can we get rid of it by directly using contained image store?

@fangn2 fangn2 force-pushed the cri-image-transfer branch 2 times, most recently from fd0d0c3 to 113293b Compare August 9, 2023 14:41
@fangn2 fangn2 force-pushed the cri-image-transfer branch 6 times, most recently from 45eaa5d to 88e6abe Compare August 16, 2023 16:52
@mxpv mxpv added the status/needs-update Awaiting contributor update label Sep 7, 2023
@fangn2 fangn2 force-pushed the cri-image-transfer branch from b47bf45 to 95e3822 Compare October 3, 2023 01:58
@estesp
Copy link
Member

estesp commented Oct 3, 2023

CRI test failures seem related to the new (optional) image verifier service merged in #8493; the image verifier plugin is a required plugin for transfer plugin now and the error seems to be related to that? Seems you rebased, so I'm not sure what else has to change to make that work properly?

Error:

2023-10-03T02:31:08.9836244Z === NAME  TestCRIImagePullTimeout/HoldingContentOpenWriter
2023-10-03T02:31:08.9837268Z     build_local_containerd_helper_test.go:109: 
2023-10-03T02:31:08.9838989Z         	Error Trace:	/home/runner/work/containerd/containerd/integration/build_local_containerd_helper_test.go:109
2023-10-03T02:31:08.9840452Z         	            				/home/runner/work/containerd/containerd/integration/image_pull_timeout_test.go:78
2023-10-03T02:31:08.9841049Z         	Error:      	Received unexpected error:
2023-10-03T02:31:08.9842280Z         	            	no plugins registered for io.containerd.image-verifier.v1: not found
2023-10-03T02:31:08.9843017Z         	Test:       	TestCRIImagePullTimeout/HoldingContentOpenWriter
2023-10-03T02:31:08.9843552Z     build_local_containerd_helper_test.go:120: 
2023-10-03T02:31:08.9844427Z         	Error Trace:	/home/runner/work/containerd/containerd/integration/build_local_containerd_helper_test.go:120
2023-10-03T02:31:08.9845822Z         	            				/home/runner/work/containerd/containerd/integration/image_pull_timeout_test.go:78
2023-10-03T02:31:08.9846406Z         	Error:      	Received unexpected error:
2023-10-03T02:31:08.9847983Z         	            	failed to get "io.containerd.transfer.v1" plugin: no plugins registered for io.containerd.image-verifier.v1: not found
2023-10-03T02:31:08.9848807Z         	Test:       	TestCRIImagePullTimeout/HoldingContentOpenWriter
2023-10-03T02:31:08.9861131Z === NAME  TestCRIImagePullTimeout/NoDataTransferred
2023-10-03T02:31:08.9862011Z     build_local_containerd_helper_test.go:109: 
2023-10-03T02:31:08.9863128Z         	Error Trace:	/home/runner/work/containerd/containerd/integration/build_local_containerd_helper_test.go:109
2023-10-03T02:31:08.9864564Z         	            				/home/runner/work/containerd/containerd/integration/image_pull_timeout_test.go:216
2023-10-03T02:31:08.9865157Z         	Error:      	Received unexpected error:
2023-10-03T02:31:08.9866373Z         	            	no plugins registered for io.containerd.image-verifier.v1: not found
2023-10-03T02:31:08.9867085Z         	Test:       	TestCRIImagePullTimeout/NoDataTransferred
2023-10-03T02:31:08.9867605Z     build_local_containerd_helper_test.go:120: 
2023-10-03T02:31:08.9868487Z         	Error Trace:	/home/runner/work/containerd/containerd/integration/build_local_containerd_helper_test.go:120
2023-10-03T02:31:08.9870257Z         	            				/home/runner/work/containerd/containerd/integration/image_pull_timeout_test.go:216
2023-10-03T02:31:08.9871126Z         	Error:      	Received unexpected error:
2023-10-03T02:31:08.9872738Z         	            	failed to get "io.containerd.transfer.v1" plugin: no plugins registered for io.containerd.image-verifier.v1: not found
2023-10-03T02:31:08.9873513Z         	Test:       	TestCRIImagePullTimeout/NoDataTransferred
2023-10-03T02:31:08.9874021Z --- FAIL: TestCRIImagePullTimeout (0.01s)

Copy link
Member

@dmcgowan dmcgowan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get this in for 2.1. Either in an update to this PR or a followup, I would like to see more cases where it will fallback to the local behavior if CRI is configured for something that is not supported in transfer service. We can keep it as a default, but should minimize the disruption for this release, we don't need to rush the cutover.

@swagatbora90
Copy link
Contributor

We can get this in for 2.1. Either in an update to this PR or a followup, I would like to see more cases where it will fallback to the local behavior if CRI is configured for something that is not supported in transfer service. We can keep it as a default, but should minimize the disruption for this release, we don't need to rush the cutover.

Hey @dmcgowan, are you suggesting changing the default to use local pull to minimize any risk? I know that the initial implementation kept local pull as the default behavior, but then there was some discussion to make transfer service default since we were initially targeting this change for the 2.0 major release.

@dmcgowan
Copy link
Member

are you suggesting changing the default to use local pull to minimize any risk? I know that the initial implementation kept local pull as the default behavior, but then there was some discussion to make transfer service default since we were initially targeting this change for the 2.0 major release.

I'm fine keeping transfer as default, but we should fallback to use local if there is a configuration that is not supported with the transfer service yet.

// TODO: Add support for DisableSnapshotAnnotations, DiscardUnpackedLayers, ImagePullWithSyncFs and unpackDuplicationSuppressor

Referring to this comment. Either update the comment if some of those are no longer relevant or fallback to local if they are set. We don't want a configuration to suddenly stop working in 2.1.

@sajuptpm
Copy link

Please check the perf results @swagatbora90 @samuelkarp
#11726

@estesp
Copy link
Member

estesp commented Apr 22, 2025

We just merged a vagrant fix to CI; if you rebase on main you should be able to get a valid CI run

@swagatbora90
Copy link
Contributor

/retest

@estesp
Copy link
Member

estesp commented Apr 23, 2025

Given @dmcgowan's comment and the fact that @swagatbora90 put in the fallback checks, we can go ahead and merge and do any fixups in a follow-up PR that are still outstanding. This should be good for 2.1 RC though

@dmcgowan
Copy link
Member

Its getting kicked out of the merge queue from TestCRIImagePullTimeout/NoDataTransferredWithTransferService flaking

@swagatbora90
Copy link
Contributor

The flaky tests seems to be testing two different Registry config. One using Registry.ConfigPath and another with Registry.Mirror

for idx, registryCfg := range []criconfig.Registry{
{
ConfigPath: filepath.Dir(hostCfgDir),
},
// TODO(fuweid):
//
// Both Mirrors and Configs are deprecated in the future. And
// this registryCfg should also be removed at that time.
{
Mirrors: map[string]criconfig.Mirror{
mirrorURL.Host: {
Endpoints: []string{mirrorURL.String()},
},
},
},

Transfer Service does not support Registry.Mirror, so this may be the issue. But I am confused why this has been passing in some cases. If we do not use the mirror, we should never hit the circuit breaker and the test should fail always.

Logs does show that the test passes once but when it runs again, no content gets fetched

=== NAME  TestCRIImagePullTimeout/NoDataTransferredWithTransferService
    log_hook.go:47: time="2025-04-23T05:02:23.420499600Z" level=debug msg="progress for image pull" func="images.(*transferProgressReporter).checkProgress" file="/home/runner/work/containerd/containerd/internal/cri/server/images/image_pull.go:1012" activeReqs=0 lastSeenBytesRead=0 lastSeenTimestamp="2025-04-23T05:02:20Z" ref="127.0.0.1:37887/containerd/volume-ownership:2.1" reportInterval=2.5s testcase=TestCRIImagePullTimeout/NoDataTransferredWithTransferService totalBytesRead=0

- adds a transfer service progress reporter to handle timeouts. Also other test fixes
- fallback to local image pull when configuration conflict

Signed-off-by: Tony Fang <[email protected]>

Co-authored-by: Swagat Bora <[email protected]>
@swagatbora90
Copy link
Contributor

Skipping Registry.Mirrors test when running with Transfer Service.

@swagatbora90
Copy link
Contributor

There are additional test failures, but does not appear to be related to the PR changes:

Run script/test/test2annotation.sh *-gotest.json
jq: error: Could not open file *-gotest.json: No such file or directory

@swagatbora90
Copy link
Contributor

I don't see any report of this particular error, so may be related to my changes. Taking a look.

@estesp
Copy link
Member

estesp commented Apr 23, 2025

There are additional test failures, but does not appear to be related to the PR changes:

Run script/test/test2annotation.sh *-gotest.json
jq: error: Could not open file *-gotest.json: No such file or directory

That's a common problem after a previous stage failure on the Windows CI jobs; in this case a 503 from registry.k8s.io killed the image tests in the first integration run. We're not having the best of luck with CI stability these days

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.