Sandbox: Handle unexpected shim kill events #9112
Conversation
Hi @adityaramani. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
When a shim process is unexpectedly killed in a way that was not initiated through containerd, containerd reports the pod as not ready but the containers as running. This results in kubelet repeatedly sending container kill requests that fail, since containerd cannot connect to the shim.

Changes:
- In the container exit handler, treat `err: Unavailable` as if the container has already exited.
- When attempting to get a connection to the shim, if the controller isn't available, assume that the shim has been killed. (This is needed because a separate exit handler cleans up the reference to the shim controller before kubelet has a chance to call StopPodSandbox.)

Signed-off-by: Aditya Ramani <[email protected]>
Force-pushed dd8aa2b to 729c97c
dcantah left a comment:
LGTM. We should really enlighten one of our shims with sandbox support; we don't have anything to test this in CI atm 😪, but the changes make sense.
```diff
 )
 if err != nil {
-	if !errdefs.IsNotFound(err) {
+	if !errdefs.IsNotFound(err) && !errdefs.IsUnavailable(err) {
```
Does the unavailable error belong to a specific sandbox implementation?
No, the unavailable error is independent of the sandbox implementation: it is the result of a failed gRPC connection (since the shim is dead). The error would originate from

Line 700 in 0066676

when we try to contact the shim to get its state and realize we can't reach it. The error looks like this:

```
ERRO[2023-09-19T16:11:06.940444000Z] StopContainer for "6668b4c6f405e58613b8cbfb1a5299b01d4eaf288115766a36f008156f746cad" failed error="rpc error: code = Unavailable desc = failed to stop container \"6668b4c6f405e58613b8cbfb1a5299b01d4eaf288115766a36f008156f746cad\": connection error: desc = \"transport: Error while dialing: dial unix <path>: connect: connection refused\": unavailable"
```
TIL. gRPC will reconnect over TCP but fast-fail on a unix socket. Thanks for the details. I think we should have a work item to set up integration testing for the gRPC shim. @adityaramani, would you mind creating an issue to track this? Thanks.
Observations:

As described above, when a shim process is unexpectedly killed outside of containerd, containerd reports the pod as not ready but the containers as running, and kubelet repeatedly sends container kill requests that fail.

Note: this behavior is seen only for sandbox shims.

Changes:
- In the container exit handler, treat `err: Unavailable` as if the container has already exited: `containerd/runtime/v2/manager.go`, line 271 in 82df7d5.