
Sandbox: Handle unexpected shim kill events #9112

Merged

fuweid merged 1 commit into containerd:main from adityaramani:handle-shim-kill on Sep 22, 2023
Conversation

@adityaramani (Contributor) commented Sep 18, 2023

Observations:

When a shim process is unexpectedly killed in a way that was not initiated through containerd, containerd reports the pod as not ready but the containers as running. As a result, kubelet repeatedly sends container kill requests that fail because containerd cannot connect to the shim.

Note: This behavior is seen only for sandbox shims

Changes:

  • In the container exit handler, treat an `Unavailable` error as if the container has already exited
  • When attempting to get a connection to the shim, if the controller isn't available, assume that the shim has been killed. This is needed because a separate exit handler cleans up the reference to the shim controller before kubelet has a chance to call `StopPodSandbox`:
    `cleanupAfterDeadShim(cleanup.Background(ctx), id, m.shims, m.events, b)`

@k8s-ci-robot

Hi @adityaramani. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dcantah dcantah self-assigned this Sep 18, 2023
Signed-off-by: Aditya Ramani <[email protected]>
@dcantah dcantah self-requested a review September 19, 2023 00:58
@dcantah dcantah removed their assignment Sep 19, 2023
@dcantah dcantah changed the title Handle unexpected shim kill events Sandbox: Handle unexpected shim kill events Sep 19, 2023
@dcantah (Member) left a comment


LGTM. We should really enlighten one of our shims with sandbox support; we don't have anything to test this in CI at the moment 😪, but the changes make sense.

```diff
 )
 if err != nil {
-	if !errdefs.IsNotFound(err) {
+	if !errdefs.IsNotFound(err) && !errdefs.IsUnavailable(err) {
```
@fuweid (Member) commented:

Does the unavailable error belong to a specific sandbox implementation?

@adityaramani (Contributor, Author) replied:

No, the unavailable error is independent of the sandbox implementation; it is the result of a failed gRPC connection (since the shim is dead). The error originates from

`return runtime.State{}, errdefs.FromGRPC(err)`

when we try to contact the shim to get its state and realize we can't reach it.

The error looks like this:

```
ERRO[2023-09-19T16:11:06.940444000Z] StopContainer for "6668b4c6f405e58613b8cbfb1a5299b01d4eaf288115766a36f008156f746cad" failed  error="rpc error: code = Unavailable desc = failed to stop container \"6668b4c6f405e58613b8cbfb1a5299b01d4eaf288115766a36f008156f746cad\": connection error: desc = \"transport: Error while dialing: dial unix <path>: connect: connection refused\": unavailable"
```

@fuweid (Member) replied:

TIL. gRPC will reconnect over TCP but fast-fails on a unix socket. Thanks for the details. I think we should have a work item to set up integration testing for the gRPC shim. @adityaramani, would you mind creating an issue to track this? Thanks

@adityaramani (Contributor, Author) replied:

@fuweid Here is the issue: #9137

Please tweak the details as you see fit.

@adityaramani adityaramani requested a review from fuweid September 22, 2023 05:05
@fuweid (Member) left a comment

LGTM
