Handle SIGTERMs during helm upgrade and helm install by Moser-ss · Pull Request #9180 · helm/helm

Moser-ss · 2020-12-30T19:48:55Z

What this PR does / why we need it:
This PR is an attempt to fix issue #4558 and issue #8987. By handling SIGTERM and returning an error that can be handled internally, my intent is to mark the release as FAILED when helm is interrupted or terminated.
Special notes for your reviewer:
I still have some TODOS:

Implement the same logic for the install action
If the --atomic flag is passed, do a rollback
Implement/modify tests
I would like an opinion from the maintainers on whether this approach makes sense from your point of view.
I tested on my machine (macOS), and when Ctrl+C is pressed the release is marked as failed.

If applicable:

this PR contains documentation
this PR contains unit tests
this PR has been tested for backwards compatibility

jdolitsky

Hello. This appears to be a useful change, but definitely introduces lots of new code with channels etc. that might cause unexpected issues.

Is there any way to test these with unit tests? By inserting SIGTERM on a timer, for example?

Moser-ss · 2021-01-25T16:00:00Z

Thanks for the suggestion.
I was thinking about how to add unit tests to this flow, and I was hoping to get some suggestions from the maintainers. I will look into the approach using a timer and SIGTERM.

bacongobbler · 2021-01-25T16:28:52Z

How do you suggest we handle cases where an upgrade is midway through completion and CTRL+C is invoked? If I understand this correctly, any hooks currently being executed (e.g. pre-upgrade) are not interrupted. There are a few scenarios I could perceive where an upgrade is cancelled halfway but the pre-upgrade hooks install/upgrade certain resources, and a second call to helm upgrade would fail as the previous upgrade did not complete. By introducing this behaviour, you could effectively enter a deadlock. It is also possible that providing the --atomic flag with this behaviour may enter a deadlock as well.

Have you tested for these scenarios? What was the outcome?

PhilThurston · 2021-01-25T20:56:30Z

@bacongobbler from my understanding CTRL+C sends the SIGINT signal so simply having different logic for SIGTERM vs SIGINT would work here.

bacongobbler · 2021-01-25T20:59:42Z

Fair enough. Point still stands though... Just replace a CTRL+C with a hook that runs past the given timeout threshold (#4558). The same problem cases and issues are present... Just a different event triggering the stoppage.

bacongobbler · 2021-01-25T21:05:18Z

Here's the case I'm trying to clarify here:

I have a Helm chart.
I introduce a hook that accidentally runs for 30 days instead of 30 minutes.
I call helm upgrade --timeout 31m --wait ...
(as per this PR) My release enters the FAILED state.
I correct my mistake, changing the hook's execution time to 30 minutes.
I call helm upgrade --timeout 31m --wait

Isn't this going to cause a deadlock? We haven't "cleaned up" or rolled back any of the changes that occurred during the failed release. We just marked it as FAILED.

Is this a situation where --atomic might be more practical?

Moser-ss · 2021-01-26T00:03:50Z

How do you suggest we handle cases where an upgrade is midway through completion and CTRL+C is invoked?

The flows I have in my mind are the following:
In the case of an upgrade without the --atomic flag, just mark the release as failed, with a message saying it was marked as failed because helm received a SIGTERM or SIGINT. This way the release is not left as pending.

 ➜  helm git:(feature-handle-SIGINT)  ./bin/helm upgrade -i issue /Users/stephanemoser/tmp/issue-4558 --wait
^CError: UPGRADE FAILED: SIGTERM or SIGINT received, release failed

REVISION        UPDATED                         STATUS          CHART                   APP VERSION     DESCRIPTION
1               Mon Jan 25 23:11:09 2021        deployed        helm-to-fail-0.2.0                      Install complete
2               Mon Jan 25 23:13:19 2021        failed          helm-to-fail-0.2.0                      Upgrade "issue" failed: SIGTERM or SIGINT received, release failed

In the case of an upgrade with the --atomic flag, it will roll back to the previous release, using the same logic as when a release fails normally.

 ➜  helm git:(feature-handle-SIGINT)  ./bin/helm upgrade -i issue /Users/stephanemoser/tmp/issue-4558 --atomic
^CError: UPGRADE FAILED: release issue failed, and has been rolled back due to atomic being set: SIGTERM or SIGINT received, release failed

3               Mon Jan 25 23:17:20 2021        superseded      helm-to-fail-0.2.0                      Upgrade complete
4               Mon Jan 25 23:20:42 2021        failed          helm-to-fail-0.2.0                      Upgrade "issue" failed: SIGTERM or SIGINT received, release failed
5               Mon Jan 25 23:20:52 2021        deployed        helm-to-fail-0.2.0                      Rollback to 3

Moser-ss · 2021-01-26T00:07:53Z

Here's the case I'm trying to clarify here:

I have a Helm chart.

I introduce a hook that accidentally runs for 30 days instead of 30 minutes.

I call helm upgrade --timeout 31m --wait ...

(as per this PR) My release enters the FAILED state.

I correct my mistake, changing the hook's execution time to 30 minutes.

I call helm upgrade --timeout 31m --wait

I tried to reproduce this behavior using the example in #4558 (comment).
The first time, it fails with the timeout. Then I fixed the job so it is able to exit, and the release finished successfully.

bacongobbler · 2021-02-03T16:10:21Z

The reason I'm asking is because this appears to be the only case where Helm can enter the PENDING_UPGRADE state during an upgrade.

If the intention is to return to the FAILED state, instead of introducing all this timeout/capture logic (which can be prone to deadlocks or unintended breaking behaviour), why not just enter the FAILED state until the upgrade succeeds? If Helm were to fail due to a timeout, it'd already be in the FAILED state and we don't have to add all this complex logic to the action handlers.

I'm also wondering if we should mark the PENDING_UPGRADE state as deprecated - it does not appear there's any specific rhyme or reason why it was introduced, why Helm should enter that state, or how to resolve a release stuck in that state (other than the workarounds described in earlier bug reports).

Moser-ss · 2021-02-03T16:41:26Z

why not just enter the FAILED state until the upgrade succeeds?

I think I am not the right person to answer this question; I was waiting for some guidance from the maintainers about this topic.

From my experience as a Helm user, I notice that when an upgrade starts, helm sets the state of the release to PENDING_UPGRADE until it hits a timeout or succeeds.

In version 3.4.0 it looks like helm started to check whether any release is in PENDING_UPGRADE before starting the upgrade. I don't know if that was introduced to avoid concurrent upgrades, but in reality, if some external factor kills the helm execution, we are locked in that PENDING_UPGRADE state.

So my initial idea was to handle SIGINT and SIGTERM to mark the release as FAILED. But then I found that if I set the atomic flag, it would not revert to the previous successful release. For that reason, I introduced the mutex to lock the flow, and the release is rolled back.

But I am open to suggestions on the best approach to tackle the problem of releases being locked when helm is killed or times out in a pre-hook.
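The lock-in described above can be pictured with a small sketch. The types and the function canStartUpgrade are hypothetical stand-ins, not Helm's actual code; only the status strings and the error text are taken from output quoted in this thread and the linked issues.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for a release status.
type Status string

const (
	StatusDeployed        Status = "deployed"
	StatusFailed          Status = "failed"
	StatusPendingInstall  Status = "pending-install"
	StatusPendingUpgrade  Status = "pending-upgrade"
	StatusPendingRollback Status = "pending-rollback"
)

// IsPending reports whether the status marks an operation still in progress.
func (s Status) IsPending() bool {
	return s == StatusPendingInstall || s == StatusPendingUpgrade || s == StatusPendingRollback
}

var errPending = errors.New("another operation (install/upgrade/rollback) is in progress")

// canStartUpgrade (hypothetical) refuses to start while the last revision is pending.
func canStartUpgrade(last Status) error {
	if last.IsPending() {
		return errPending
	}
	return nil
}

func main() {
	// A killed process never leaves the pending state, so every later upgrade is refused.
	fmt.Println(canStartUpgrade(StatusPendingUpgrade))
	// If the interrupted release had been marked failed, the next upgrade could proceed.
	fmt.Println(canStartUpgrade(StatusFailed))
}
```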

Moser-ss · 2021-02-18T23:40:44Z

@bacongobbler / @jdolitsky Anything I should improve in this PR? Or should we have a different approach for this issue?

bacongobbler · 2021-02-23T18:45:08Z

Sorry for not being totally clear. I'm asking you to research whether it would be more feasible to replace any reference to the PENDING_UPGRADE state with the FAILED state and report back with your findings.

I do not know which is the right approach, but if my intuition is right, moving forward with that behavior would be much simpler to maintain, as you would not have to introduce all this control/timing logic in the first place. And the use cases where PENDING_UPGRADE is used vs. where the FAILED state is introduced should be identical.

Moser-ss · 2021-02-24T16:10:42Z

Well, we have this PENDING_UPGRADE state to avoid corruption of the storage, or at least that is what I understand from this PR that created the lock around the PENDING_UPGRADE state.
If we accept that the only solution for releases that are blocked in the PENDING_X state is to do a rollback, I can open a PR to improve the documentation.

hiddeco

With an eye on projects that make use of pkg/action as a "Helm SDK" (without CLI bits), and that may have their own way of dealing with locks.

I have some concerns about the introduction of the sync.Mutex, channels, etc. directly into the package, instead of them e.g. being injected from higher up.

kzap · 2021-03-11T02:59:52Z

If PENDING_UPGRADE is used as a lock so that two upgrades don't happen at the same time and corrupt storage, why not have a new state, WAITING_STATUS or something, that is entered after the upgrade is complete but while we are still waiting because of --wait?

Then a new deployment can still proceed, as long as storage cannot be corrupted during the actual upgrade process, when helm applies all the manifests.

Moser-ss · 2021-03-12T23:00:58Z

I have some concerns about the introduction of the sync.Mutex, channels, etc. directly into the package, instead of them e.g. being injected from higher up.

Any suggestion for where the Helm CLI should start listening for and handling OS signals?

hiddeco · 2021-03-13T11:37:47Z

Any suggestion for where the Helm CLI should start listening for and handling OS signals?

I am not an expert on Helm's architecture, but have a pretty good idea on how things are wired together due to having read most of the code.

My first concern about the sync.Mutex could be solved by moving it from the global var to the Install structure. By doing this, the lock only applies to a single release, and no longer gets in the way of SDK consumers that e.g. perform concurrent operations on different releases.

The entry point for starting to listen to OS signals is in my opinion https://github.com/helm/helm/tree/master/cmd/helm.

Change the logic to release Upgrade to handle SIGTERMs Extract logic to 2 goroutine so it is possible to handle SIGTERMS and the release flow Fix go style Signed-off-by: Stephane Moser <[email protected]>

pkg/action/upgrade.go

pkg/action/install_test.go

bacongobbler · 2021-07-21T19:15:41Z

looking good! provided some feedback to help push things along. Let me know when you're finished with those changes and I will schedule another round of reviews.

hiddeco · 2021-07-22T11:36:32Z

Sorry about the late reply after getting pinged, I was on vacation and very busy the weeks prior to that.

The mutex on the action in combination with the latest comments from Matthew around the signal handler would address all my previous concerns 👍

Use context to handle SIGTERM in the cmd/helm instead of pkg/action Signed-off-by: Stephane Moser <[email protected]>

Moser-ss · 2021-07-26T00:23:28Z

The logic to handle signals was moved from pkg/action to cmd/helm.
A new function, RunWithContext, was added to the Install and Upgrade types.

I think I am duplicating code in install and upgrade, but I didn't find a good place to add a function that sets up the context and handles the signals.

The tests were rewritten and are cleaner now that we are using context.

I didn't add any customization to the chart description; basically, it will receive the error message from the context.

@bacongobbler / @jdolitsky the PR is ready for a new review

LuckySB · 2021-08-02T18:18:43Z

And what about cases when the deployment process is interrupted not by a SIGTERM, but by a network loss?

cmd/helm/install.go

cmd/helm/upgrade.go

bacongobbler · 2021-08-03T16:13:56Z

This looks much better. Thanks for taking the time to refactor!

bacongobbler · 2021-08-03T16:14:53Z

And what about cases when the deployment process is interrupted not by a SIGTERM, but by a network loss?

We could potentially tie into the context.Context by adding a WithTimeout. But that's out of this PR's current scope.

Fix typos Remove condition arround time.Sleep Because a negative or zero duration causes Sleep to return immediately. Signed-off-by: Stephane Moser <[email protected]>

bridgetkromhout · 2021-08-19T16:59:12Z

Discussed in today's community call; an additional reviewer is needed due to the scope of this PR. Thanks!

hickeyma

Thanks for the effort @Moser-ss. The code looks good. Just a few things:

Due to the changes involved, I would like a few more maintainers to review this in addition to the required 2.
The PR now sets failed state when you interrupt (CTRL-C) an install and upgrade
While testing manually I saw output as follows:

$ helm install foo mychart/ --debug

install.go:178: [debug] Original chart version: ""
install.go:195: [debug] CHART PATH: /Users/mhickey/tmp/helm-charts/mychart

^CRelease foo has been cancelled.
client.go:122: [debug] creating 3 resource(s)
Error: context canceled
helm.go:81: [debug] context canceled

$ helm upgrade issue-10058 issue-10058 --debug 

upgrade.go:139: [debug] preparing upgrade for issue-10058
upgrade.go:147: [debug] performing update for issue-10058
upgrade.go:319: [debug] creating upgraded release for issue-10058
client.go:203: [debug] checking 3 resources for changes
client.go:466: [debug] Looks like there are no changes for ServiceAccount "issue-10058"
client.go:466: [debug] Looks like there are no changes for Service "issue-10058"
^CRelease issue-10058 has been cancelled.
upgrade.go:420: [debug] warning: Upgrade "issue-10058" failed: context canceled
client.go:466: [debug] Looks like there are no changes for Deployment "issue-10058"
Error: UPGRADE FAILED: context canceled
helm.go:81: [debug] context canceled
UPGRADE FAILED
main.newUpgradeCmd.func2
	helm.sh/helm/v3/cmd/helm/upgrade.go:196
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/[email protected]/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/[email protected]/command.go:960
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/[email protected]/command.go:897
main.main
	helm.sh/helm/v3/cmd/helm/helm.go:80
runtime.main
	runtime/proc.go:225
runtime.goexit
	runtime/asm_amd64.s:1371

Should I have got a stack trace on the upgrade?

Moser-ss · 2021-08-27T00:21:15Z

Should I have got a stack trace on the upgrade?

That comes from the code in cmd/upgrade.go, where we return the error wrapped with the stack. In cmd/install.go, on the other hand, we just return the error without any wrapping.
I can change that behavior to make things more consistent.

To make the install comand consistent with upgrade comand when handling errors Signed-off-by: Stephane Moser <[email protected]>

hickeyma · 2021-08-30T16:36:07Z

@Moser-ss Testing again, it would seem that whether the stack trace appears depends on when you send the signal, rather than it always happening.

hickeyma

LGTM, thanks for working on this @Moser-ss

mattfarina · 2021-08-30T21:58:56Z

Note, a follow-up pull request could add this to the uninstall command.

mattfarina

It looks like I get the easy rubber stamp after @bacongobbler and @hickeyma did all the real reviewer work.

Moser-ss · 2021-08-31T08:55:46Z

Note, a follow-up pull request could add this to the uninstall command.

@mattfarina Do you mean adding the capability to handle SIGTERM to the uninstall command? And what would the behavior be?

In the case of the install and upgrade commands, we have a wait flag that gives us a bigger window of opportunity to press Ctrl+C, interrupt the installation, and leave the chart in the upgrading state.

I am not sure if I am correct, but in the case of the uninstall command we are not changing the state of the chart; we are just deleting all resources and history without using an intermediate state.

But if you clarify the behaviour you have in mind, I can work on that PR.

czy006 · 2022-01-22T02:52:21Z

Helm Version : 3.7.2
K8S Version: 1.22.5
Hi. If I use the --atomic flag with upgrade/install and press Ctrl+C (on Mac or Linux) to interrupt it (SIGINT) and exit, the next time I run helm status the status is stuck on pending-upgrade/pending-rollback, even though the operation failed. I think the right status is failed, because the release is no longer in a normal state. Is this discussion about solving that problem?

helm-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 30, 2020

Gumi22 mentioned this pull request Dec 31, 2020

helm upgrade > timeout on pre-upgrade hook > revision stuck in PENDING_UPGRADE and multiple DEPLOYED revisions arise soon #4558

Closed

Moser-ss force-pushed the feature-handle-SIGINT branch from 3ccc26f to 13d776c Compare January 1, 2021 18:16

helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 8, 2021

Moser-ss changed the title ~~WIP: Handle SIGTERMs during helm upgrade and helm install~~ Handle SIGTERMs during helm upgrade and helm install Jan 8, 2021

jdolitsky suggested changes Jan 25, 2021


Moser-ss force-pushed the feature-handle-SIGINT branch from b0de3f5 to 3c79749 Compare February 2, 2021 01:04

Moser-ss requested a review from jdolitsky February 2, 2021 19:47

hiddeco reviewed Mar 5, 2021


bacongobbler mentioned this pull request Mar 9, 2021

When a SIGTERM is sent to the Helm CLI during a deployment / upgrade, it leaves the cluster labels in a PENDING state, blocking further deployments #9446

Closed

hiddeco mentioned this pull request Mar 12, 2021

Helm upgrade failed: another operation (install/upgrade/rollback) is in progress fluxcd/helm-controller#149

Closed

joelanford mentioned this pull request Jun 16, 2021

[WIP] ROX-7352 unlock pending helm actions stackrox/helm-operator#13

Closed

4 tasks

bacongobbler mentioned this pull request Jun 30, 2021

[Helm 3] improper RBAC setup causes release to be stuck in pending-install status despite installing fine #7139

Closed

Handle SIGTERM

027cea4

Change the logic to release Upgrade to handle SIGTERMs Extract logic to 2 goroutine so it is possible to handle SIGTERMS and the release flow Fix go style Signed-off-by: Stephane Moser <[email protected]>

bacongobbler requested changes Jul 21, 2021


pkg/action/upgrade.go (outdated, resolved)

pkg/action/install_test.go (outdated, resolved)

pkg/action/install_test.go (outdated, resolved)

Refactor SIGTERM logic

c62ce12

Use context to handle SIGTERM in the cmd/helm instead of pkg/action Signed-off-by: Stephane Moser <[email protected]>

Moser-ss force-pushed the feature-handle-SIGINT branch from ec043c8 to c62ce12 Compare July 26, 2021 00:05

bacongobbler mentioned this pull request Aug 3, 2021

Helm v3.4 Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress #8987

Closed

bacongobbler requested changes Aug 3, 2021


cmd/helm/install.go (outdated, resolved)

cmd/helm/upgrade.go (outdated, resolved)

Resolve PR comments

4bc901c

Fix typos Remove condition arround time.Sleep Because a negative or zero duration causes Sleep to return immediately. Signed-off-by: Stephane Moser <[email protected]>

Moser-ss requested a review from bacongobbler August 6, 2021 17:10

bridgetkromhout added this to the 3.7.0 milestone Aug 19, 2021

hickeyma reviewed Aug 26, 2021


Wrap error

101370a

To make the install comand consistent with upgrade comand when handling errors Signed-off-by: Stephane Moser <[email protected]>

Moser-ss force-pushed the feature-handle-SIGINT branch from 7184b81 to 101370a Compare August 27, 2021 00:50

hickeyma approved these changes Aug 30, 2021


mattfarina approved these changes Aug 30, 2021


mattfarina merged commit accf82b into helm:main Aug 30, 2021

hickeyma mentioned this pull request Nov 16, 2021

fix(install/upgrade): Use buffered channel for signal notification to avoid signal loss #10347

Merged

3 tasks

bacongobbler mentioned this pull request Jan 21, 2022

Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress #10599

Closed

bacongobbler mentioned this pull request Feb 3, 2022

Proposal: initiate rollback if --atomic is set and SIGINT is received #8040

Closed

steved mentioned this pull request Oct 7, 2022

Helmfile does not handle termination signals when Helm is still running helmfile/helmfile#416

Closed
