
Sandbox: Handle unexpected shim kill events #9112

Merged

fuweid merged 1 commit into containerd:main from adityaramani:handle-shim-kill on Sep 22, 2023
Conversation

@adityaramani (Contributor) commented Sep 18, 2023

Observations:

When a shim process is unexpectedly killed in a way that was not initiated through containerd, containerd reports the pod as not ready but the containers as running. As a result, kubelet repeatedly sends container kill requests that fail because containerd cannot connect to the shim.

Note: This behavior is seen only for sandbox shims

Changes:

  • In the container exit handler, treat an `Unavailable` error as if the container has already exited
  • When attempting to get a connection to the shim, if the controller isn't available, assume that the shim has been killed. This is needed because a separate exit handler cleans up the reference to the shim controller before kubelet has a chance to call `StopPodSandbox`:
    `cleanupAfterDeadShim(cleanup.Background(ctx), id, m.shims, m.events, b)`

@k8s-ci-robot

Hi @adityaramani. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dcantah dcantah self-assigned this Sep 18, 2023
Signed-off-by: Aditya Ramani <[email protected]>
@dcantah dcantah self-requested a review September 19, 2023 00:58
@dcantah dcantah removed their assignment Sep 19, 2023
@dcantah dcantah changed the title Handle unexpected shim kill events Sandbox: Handle unexpected shim kill events Sep 19, 2023
@dcantah (Member) left a comment


LGTM. We should really enlighten one of our shims with sandbox support; we don't have anything to test this in CI at the moment 😪, but the changes make sense.

```diff
 )
 if err != nil {
-	if !errdefs.IsNotFound(err) {
+	if !errdefs.IsNotFound(err) && !errdefs.IsUnavailable(err) {
```
@fuweid (Member) commented:

Does the unavailable error belong to a specific sandbox implementation?

@adityaramani (Contributor, Author) replied:

No, the unavailable error is independent of the sandbox implementation; it is the result of a failed gRPC connection (since the shim is dead). The error originates from

`return runtime.State{}, errdefs.FromGRPC(err)`

when we try to contact the shim to get its state and realize we can't reach it.

The error looks like this:

```
ERRO[2023-09-19T16:11:06.940444000Z] StopContainer for "6668b4c6f405e58613b8cbfb1a5299b01d4eaf288115766a36f008156f746cad" failed  error="rpc error: code = Unavailable desc = failed to stop container \"6668b4c6f405e58613b8cbfb1a5299b01d4eaf288115766a36f008156f746cad\": connection error: desc = \"transport: Error while dialing: dial unix <path>: connect: connection refused\": unavailable"
```

@fuweid (Member) replied:

TIL. gRPC will reconnect over TCP but fast-fails on a unix socket. Thanks for the details. I think we should have a work item to set up integration testing for the gRPC shim. @adityaramani, would you mind creating an issue to track this? Thanks

@adityaramani (Contributor, Author) replied:

@fuweid Here is the issue: #9137

Please tweak the details as you see fit.

@adityaramani adityaramani requested a review from fuweid September 22, 2023 05:05
@fuweid (Member) left a comment

LGTM
