Skip to content

Handle SIGTERMs during helm upgrade and helm install#9180

Merged
mattfarina merged 12 commits intohelm:mainfrom
Moser-ss:feature-handle-SIGINT
Aug 30, 2021
Merged

Handle SIGTERMs during helm upgrade and helm install#9180
mattfarina merged 12 commits intohelm:mainfrom
Moser-ss:feature-handle-SIGINT

Conversation

@Moser-ss
Copy link
Contributor

@Moser-ss Moser-ss commented Dec 30, 2020

What this PR does / why we need it:
This PR is open to try to fix issue #4558 and issue #8987. By handling the SIGTERM and throw an error that can be handled internally my intent is to mark the release as FAILED when the helm was interrupted or terminated
Special notes for your reviewer:
I still have some TODOS:

  • Implement the same logic to the install action
  • In case the flag atomic is passed do a rollback
  • Implement/modify tests
    I would like to have an opinion from the maintainers to understand if this approach makes sense from your point of view
    I tested in my machine ( a MacOS) and when the Ctrl +C sequence is hit the release is marked as failed

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

@helm-bot helm-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 30, 2020
@Moser-ss Moser-ss force-pushed the feature-handle-SIGINT branch from 3ccc26f to 13d776c Compare January 1, 2021 18:16
@helm-bot helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 8, 2021
@Moser-ss Moser-ss changed the title WIP: Handle SIGTERMs during helm upgrade and helm install Handle SIGTERMs during helm upgrade and helm install Jan 8, 2021
Copy link
Contributor

@jdolitsky jdolitsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello. This appears to be a useful change, but definitely introduces lots of new code with channels etc. that might cause unexpected issues.

If there any way to test these with unit tests? By inserting SIGTERM on a timer, for example?

@Moser-ss
Copy link
Contributor Author

Thanks for the suggestion.
I was thinking about how to add unit tests to this flow, I was hoping that could get some suggestions from the mantainers. I will look for the approach by using timer and SIGTERM

@bacongobbler
Copy link
Member

How do you suggest we handle cases where an upgrade is midway through completion and CTRL+C is invoked? If I understand this correctly, any hooks currently being executed (e.g. pre-upgrade) are not interrupted. There are a few scenarios I could perceive where an upgrade is cancelled halfway but the pre-upgrade hooks install/upgrade certain resources, and a second call to helm upgrade would fail as the previous upgrade did not complete. By introducing this behaviour, you could effectively enter a deadlock. It is also possible that providing the --atomic flag with this behaviour may enter a deadlock as well.

Have you tested for these scenarios? What was the outcome?

@PhilThurston
Copy link

@bacongobbler from my understanding CTRL+C sends the SIGINT signal so simply having different logic for SIGTERM vs SIGINT would work here.

@bacongobbler
Copy link
Member

bacongobbler commented Jan 25, 2021

Fair enough. Point still stands though... Just replace a CTRL+C with a hook that runs past the given timeout threshold (#4558). The same problem cases and issues are present... Just a different event triggering the stoppage.

@bacongobbler
Copy link
Member

bacongobbler commented Jan 25, 2021

Here's the case I'm trying to clarify here:

  1. I have a Helm chart.
  2. I introduce a hook that accidentally runs for 30 days instead of 30 minutes.
  3. I call helm upgrade --timeout 31m --wait ...
  4. (as per this PR) My release enters the FAILED state.
  5. I correct my mistake, changing the hook's execution time to 30 minutes.
  6. I call helm upgrade --timeout 31m --wait

Isn't this going to cause a deadlock? We haven't "cleaned up" or rolled back any of the changes that occurred during the failed release. We just marked it as FAILED.

Is this a situation where --atomic might be more practical?

@Moser-ss
Copy link
Contributor Author

How do you suggest we handle cases where an upgrade is midway through completion and CTRL+C is invoked?

The flows I have in my mind are the following:
In the case of an upgrade without --atomic flag just mark the release as failed with a message that it was marked as failed because helm received a SIGTERM or SIGINT, in this way it will not leave the release as pending

 ➜  helm git:(feature-handle-SIGINT)  ./bin/helm upgrade -i issue /Users/stephanemoser/tmp/issue-4558 --wait
^CError: UPGRADE FAILED: SIGTERM or SIGINT received, release failed
REVISION        UPDATED                         STATUS          CHART                   APP VERSION     DESCRIPTION

1               Mon Jan 25 23:11:09 2021        deployed        helm-to-fail-0.2.0                      Install complete

2               Mon Jan 25 23:13:19 2021        failed          helm-to-fail-0.2.0                      Upgrade "issue" failed: SIGTERM or SIGINT received, release failed

In case if an upgrade with the flag --atomic, it will rollback to the previous release using the same logic in case a release normally fail

 ➜  helm git:(feature-handle-SIGINT)  ./bin/helm upgrade -i issue /Users/stephanemoser/tmp/issue-4558 --atomic
^CError: UPGRADE FAILED: release issue failed, and has been rolled back due to atomic being set: SIGTERM or SIGINT received, release failed
3               Mon Jan 25 23:17:20 2021        superseded      helm-to-fail-0.2.0                      Upgrade complete
4               Mon Jan 25 23:20:42 2021        failed          helm-to-fail-0.2.0                      Upgrade "issue" failed: SIGTERM or SIGINT received, release failed
5               Mon Jan 25 23:20:52 2021        deployed        helm-to-fail-0.2.0                      Rollback to 3

@Moser-ss
Copy link
Contributor Author

Here's the case I'm trying to clarify here:

  1. I have a Helm chart.
  2. I introduce a hook that accidentally runs for 30 days instead of 30 minutes.
  3. I call helm upgrade --timeout 31m --wait ...
  4. (as per this PR) My release enters the FAILED state.
  5. I correct my mistake, changing the hook's execution time to 30 minutes.
  6. I call helm upgrade --timeout 31m --wait

I tried to reproduce this behavior using this example #4558 (comment)
The first time, it fails with the timeout, then I fix the job to be able to exit and it was capable to finish the release with success

@Moser-ss Moser-ss force-pushed the feature-handle-SIGINT branch from b0de3f5 to 3c79749 Compare February 2, 2021 01:04
@Moser-ss Moser-ss requested a review from jdolitsky February 2, 2021 19:47
@bacongobbler
Copy link
Member

bacongobbler commented Feb 3, 2021

The reason I'm asking is because this appears to be the only case where Helm can enter the PENDING_UPGRADE state during an upgrade.

If the intention is to return to the FAILED state, instead of introducing all this timeout/capture logic (which can be prone to deadlocks or unintended breaking behaviour), why not just enter the FAILED state until the upgrade succeeds? If Helm were to fail due to a timeout, it'd already be in the FAILED state and we don't have to add all this complex logic to the action handlers.

I'm also wondering if we should mark the PENDING_UPGRADE state as deprecated - it does not appear there's any specific rhyme or reason why it was introduced, why Helm should enter that state, or how to resolve a release stuck in that state (other than the workarounds described in earlier bug reports).

@Moser-ss
Copy link
Contributor Author

Moser-ss commented Feb 3, 2021

why not just enter the FAILED state until the upgrade succeeds?

I think I am not the right person to answer this question; I was waiting for some guidance from the maintainers about this topic.

From Helm User's experience, I notice that when an upgrade starts, helm defines the state of the release as PENDING_UPGRADE until it receives a timeout or a success.

In version 3.4.0 looks that helm starts to check if there is any release in PENDING_UPGRADE before starting the upgrade. I don't know if that was introduced to avoid concurrent upgrades, but in reality, if some external factor kills the helm execution, we are lock-in that `PENDING_UPGRADE state.

So my initial idea was to handle the SIGINT and SIGTERM to mark the release a FAILED, but then I found if I set the flag atomic, it will not revert to the previous successfully release, for that reason, I introduced the mutex to lock to flow, and the release is rollbacked

But I am open to suggestions of which will be the best approach to tackle the problem of releases been lock when helm is killed or timeout in a pre-hook

@Moser-ss
Copy link
Contributor Author

@bacongobbler / @jdolitsky Anything I should improve in this PR? Or should we have a different approach for this issue?

@bacongobbler
Copy link
Member

bacongobbler commented Feb 23, 2021

Sorry for not being totally clear. I'm asking you to research whether it would be more feasible to replace any reference to the PENDING_UPGRADE state with the FAILED state and report back with your findings.

I do not know which is the right approach, but if my intuition is right, moving forward with that behavior would be much simpler to maintain as you would not have to introduce all this control/timing logic in the first place. And the use cases where PENDING_UPGRADE is used vs. when the FAILED state are introduced should be identical.

@Moser-ss
Copy link
Contributor Author

Well, we have this PENDING_UPGRADE state to avoid corruption of the storage, or at least is what I understand from this PR that created the lock around the PENDIND_UPGRADE state.
If we accept that the only solution for releases that are blocked in the PENDING_X state is to do a rollback I can open PR to improve the documentation.

Copy link
Contributor

@hiddeco hiddeco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With an eye on projects that make use of pkg/action as a "Helm SDK" (without CLI bits), and that may have their own way of dealing with locks.

I have some concerns about the introduction of the sync.Mutex, channels, etc. directly into the package, instead of them e.g. being injected from higher up.

@kzap
Copy link

kzap commented Mar 11, 2021

if UPGRADE_PENDING is used as a lock so 2 upgrades don't happen at the same time and corrupt storage. Why not have a new state WAITING_STATUS or something that happens after the upgrade is complete but we are --wait

Then a new deployment can still proceed as long as its not possible for the corruption of storage during the actual upgrade process of helm applying all the manifests.

@Moser-ss
Copy link
Contributor Author

I have some concerns about the introduction of the sync.Mutex, channels, etc. directly into the package, instead of them e.g. being injected from higher up.

Any suggestion when the Helm CLI should start listing and handling OS signals?

@hiddeco
Copy link
Contributor

hiddeco commented Mar 13, 2021

Any suggestion when the Helm CLI should start listing and handling OS signals?

I am not an expert on Helm's architecture, but have a pretty good idea on how things are wired together due to having read most of the code.

My first concern about the sync.Mutex could be solved by moving it from the global var to the Install structure. By doing this, the lock only applies to a single release, and no longer gets in the way of SDK consumers that e.g. perform concurrent operations on different releases.

The entry point for starting to listen to OS signals is in my opinion https://github.com/helm/helm/tree/master/cmd/helm.

Change the logic to release Upgrade to handle SIGTERMs
Extract logic to 2 goroutine so it is possible to handle SIGTERMS and the release flow
Fix go style

Signed-off-by: Stephane Moser <[email protected]>
@bacongobbler
Copy link
Member

looking good! provided some feedback to help push things along. Let me know when you're finished with those changes and I will schedule another round of reviews.

@hiddeco
Copy link
Contributor

hiddeco commented Jul 22, 2021

Sorry about the late reply after getting pinged, I was on vacation and very busy the weeks prior to that.

The mutex on the action in combination with the latest comments from Matthew around the signal handler would address all my previous concerns 👍

Use context to handle SIGTERM in the cmd/helm instead of pkg/action

Signed-off-by: Stephane Moser <[email protected]>
@Moser-ss Moser-ss force-pushed the feature-handle-SIGINT branch from ec043c8 to c62ce12 Compare July 26, 2021 00:05
@Moser-ss
Copy link
Contributor Author

The logic to handle signals was moved from pkg/action to cmd/helm
A new function was created to Install and Upgrade types RunWithContext

I think I am duplicating code in install and upgrade but I didn't find a good place to add a function to set up the context and handle the signals

The tests were rewritten and are cleaner now that we are using context

I don't add any customization to the chart description, basically, it will receive the error message from the context

@bacongobbler / @jdolitsky the PR is ready for a new review

@LuckySB
Copy link

LuckySB commented Aug 2, 2021

And what about cases when the deployment process is interrupted not by a SIGTERM, but due to network loss

@bacongobbler
Copy link
Member

bacongobbler commented Aug 3, 2021

This looks much better. Thanks for taking the time to refactor!

@bacongobbler
Copy link
Member

And what about cases when the deployment process is interrupted not by a SIGTERM, but due to network loss

We could potentially tie into the context.Context by adding a WithTimeout. But that's out of this PR's current scope.

Fix typos
Remove condition arround time.Sleep
Because a negative or zero duration causes Sleep to return immediately.

Signed-off-by: Stephane Moser <[email protected]>
@Moser-ss Moser-ss requested a review from bacongobbler August 6, 2021 17:10
@bridgetkromhout bridgetkromhout added this to the 3.7.0 milestone Aug 19, 2021
@bridgetkromhout
Copy link
Member

Discussed in today's community call; an additional reviewer is needed due to the scope of this PR. Thanks!

Copy link
Contributor

@hickeyma hickeyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort @Moser-ss. The code looks good. Just a few things:

  • Due to the changes involved, I would like a few more maintainers to review this in addition to the required 2.
  • The PR now sets failed state when you interrupt (CTRL-C) an install and upgrade
  • While testing manually I saw output as follows:
$ helm install foo mychart/ --debug

install.go:178: [debug] Original chart version: ""
install.go:195: [debug] CHART PATH: /Users/mhickey/tmp/helm-charts/mychart

^CRelease foo has been cancelled.
client.go:122: [debug] creating 3 resource(s)
Error: context canceled
helm.go:81: [debug] context canceled

$ helm upgrade issue-10058 issue-10058 --debug 

upgrade.go:139: [debug] preparing upgrade for issue-10058
upgrade.go:147: [debug] performing update for issue-10058
upgrade.go:319: [debug] creating upgraded release for issue-10058
client.go:203: [debug] checking 3 resources for changes
client.go:466: [debug] Looks like there are no changes for ServiceAccount "issue-10058"
client.go:466: [debug] Looks like there are no changes for Service "issue-10058"
^CRelease issue-10058 has been cancelled.
upgrade.go:420: [debug] warning: Upgrade "issue-10058" failed: context canceled
client.go:466: [debug] Looks like there are no changes for Deployment "issue-10058"
Error: UPGRADE FAILED: context canceled
helm.go:81: [debug] context canceled
UPGRADE FAILED
main.newUpgradeCmd.func2
	helm.sh/helm/v3/cmd/helm/upgrade.go:196
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/[email protected]/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/[email protected]/command.go:960
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/[email protected]/command.go:897
main.main
	helm.sh/helm/v3/cmd/helm/helm.go:80
runtime.main
	runtime/proc.go:225
runtime.goexit
	runtime/asm_amd64.s:1371

Should I have got a stack trace on the upgrade?

@Moser-ss
Copy link
Contributor Author

Should I have got a stack trace on the upgrade?

Based on the code we have in the cmd/upgrade.go where we return the error wrap around the stack. On the other hand, in the cmd/install.go, we just send the error without any wrapping.
I can change that behavior to make things more consistent

To make the install comand consistent with upgrade comand when handling errors

Signed-off-by: Stephane Moser <[email protected]>
@Moser-ss Moser-ss force-pushed the feature-handle-SIGINT branch from 7184b81 to 101370a Compare August 27, 2021 00:50
@hickeyma
Copy link
Contributor

@Moser-ss Testing again, it would seem that stack trace will happen depending on when you you send the signal rather than always happening.

Copy link
Contributor

@hickeyma hickeyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this @Moser-ss

@mattfarina
Copy link
Collaborator

Note, a follow-up pull request could add this to the uninstall command.

Copy link
Collaborator

@mattfarina mattfarina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like I get the easy rubber stamp after @bacongobbler and @hickeyma did all the real reviewer work.

@mattfarina mattfarina merged commit accf82b into helm:main Aug 30, 2021
@Moser-ss
Copy link
Contributor Author

Note, a follow-up pull request could add this to the uninstall command.

@mattfarina Do you mean to add the capability to handle SIGTERM to uninstall command? And what will be the behavior?

In the case of install and upgrade commands, we have a wait flag that gives us a bigger opportunity window to press the Ctrl+C and with that interrupt the installation and left the chart in the upgrading state.

I am not sure if I am correct, but in the case of uninstalling command we are not changing the state of the chart, we are just deleting all resources and history without using an intermediate state.

But if you clarify the behaviour you have in mind I can work in that PR

@czy006
Copy link

czy006 commented Jan 22, 2022

Helm Version : 3.7.2
K8S Version: 1.22.5
Hi, if i using --atomic args to upgrade/install , and use command ctrl c (on mac or linux) to interrupt(SIGTERM) and exit it, next time use helm status to check the status can be see status is holding on pending-upgrade/pending-rollback,but it was failed. I think the right status is failed, Because it's not normal anymore. Is this discussion about solving the problem ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.