Handle SIGTERMs during helm upgrade and helm install#9180
Handle SIGTERMs during helm upgrade and helm install#9180mattfarina merged 12 commits intohelm:mainfrom
Conversation
3ccc26f to
13d776c
Compare
jdolitsky
left a comment
There was a problem hiding this comment.
Hello. This appears to be a useful change, but definitely introduces lots of new code with channels etc. that might cause unexpected issues.
If there any way to test these with unit tests? By inserting SIGTERM on a timer, for example?
|
Thanks for the suggestion. |
|
How do you suggest we handle cases where an upgrade is midway through completion and CTRL+C is invoked? If I understand this correctly, any hooks currently being executed (e.g. pre-upgrade) are not interrupted. There are a few scenarios I could perceive where an upgrade is cancelled halfway but the pre-upgrade hooks install/upgrade certain resources, and a second call to Have you tested for these scenarios? What was the outcome? |
|
@bacongobbler from my understanding CTRL+C sends the SIGINT signal so simply having different logic for SIGTERM vs SIGINT would work here. |
|
Fair enough. Point still stands though... Just replace a CTRL+C with a hook that runs past the given timeout threshold (#4558). The same problem cases and issues are present... Just a different event triggering the stoppage. |
|
Here's the case I'm trying to clarify here:
Isn't this going to cause a deadlock? We haven't "cleaned up" or rolled back any of the changes that occurred during the failed release. We just marked it as FAILED. Is this a situation where |
The flows I have in my mind are the following: In case if an upgrade with the flag |
I tried to reproduce this behavior using this example #4558 (comment) |
b0de3f5 to
3c79749
Compare
|
The reason I'm asking is because this appears to be the only case where Helm can enter the PENDING_UPGRADE state during an upgrade. If the intention is to return to the FAILED state, instead of introducing all this timeout/capture logic (which can be prone to deadlocks or unintended breaking behaviour), why not just enter the FAILED state until the upgrade succeeds? If Helm were to fail due to a timeout, it'd already be in the FAILED state and we don't have to add all this complex logic to the action handlers. I'm also wondering if we should mark the PENDING_UPGRADE state as deprecated - it does not appear there's any specific rhyme or reason why it was introduced, why Helm should enter that state, or how to resolve a release stuck in that state (other than the workarounds described in earlier bug reports). |
I think I am not the right person to answer this question; I was waiting for some guidance from the maintainers about this topic. From Helm User's experience, I notice that when an upgrade starts, helm defines the state of the release as In version 3.4.0 looks that helm starts to check if there is any release in So my initial idea was to handle the SIGINT and SIGTERM to mark the release a But I am open to suggestions of which will be the best approach to tackle the problem of releases been lock when helm is killed or timeout in a pre-hook |
|
@bacongobbler / @jdolitsky Anything I should improve in this PR? Or should we have a different approach for this issue? |
|
Sorry for not being totally clear. I'm asking you to research whether it would be more feasible to replace any reference to the PENDING_UPGRADE state with the FAILED state and report back with your findings. I do not know which is the right approach, but if my intuition is right, moving forward with that behavior would be much simpler to maintain as you would not have to introduce all this control/timing logic in the first place. And the use cases where PENDING_UPGRADE is used vs. when the FAILED state are introduced should be identical. |
|
Well, we have this PENDING_UPGRADE state to avoid corruption of the storage, or at least is what I understand from this PR that created the lock around the PENDIND_UPGRADE state. |
hiddeco
left a comment
There was a problem hiding this comment.
With an eye on projects that make use of pkg/action as a "Helm SDK" (without CLI bits), and that may have their own way of dealing with locks.
I have some concerns about the introduction of the sync.Mutex, channels, etc. directly into the package, instead of them e.g. being injected from higher up.
|
if UPGRADE_PENDING is used as a lock so 2 upgrades don't happen at the same time and corrupt storage. Why not have a new state WAITING_STATUS or something that happens after the upgrade is complete but we are Then a new deployment can still proceed as long as its not possible for the corruption of storage during the actual upgrade process of helm applying all the manifests. |
Any suggestion when the Helm CLI should start listing and handling OS signals? |
I am not an expert on Helm's architecture, but have a pretty good idea on how things are wired together due to having read most of the code. My first concern about the The entry point for starting to listen to OS signals is in my opinion https://github.com/helm/helm/tree/master/cmd/helm. |
Change the logic to release Upgrade to handle SIGTERMs Extract logic to 2 goroutine so it is possible to handle SIGTERMS and the release flow Fix go style Signed-off-by: Stephane Moser <[email protected]>
|
looking good! provided some feedback to help push things along. Let me know when you're finished with those changes and I will schedule another round of reviews. |
|
Sorry about the late reply after getting pinged, I was on vacation and very busy the weeks prior to that. The mutex on the action in combination with the latest comments from Matthew around the signal handler would address all my previous concerns 👍 |
Use context to handle SIGTERM in the cmd/helm instead of pkg/action Signed-off-by: Stephane Moser <[email protected]>
ec043c8 to
c62ce12
Compare
|
The logic to handle signals was moved from pkg/action to cmd/helm I think I am duplicating code in install and upgrade but I didn't find a good place to add a function to set up the context and handle the signals The tests were rewritten and are cleaner now that we are using context I don't add any customization to the chart description, basically, it will receive the error message from the context @bacongobbler / @jdolitsky the PR is ready for a new review |
|
And what about cases when the deployment process is interrupted not by a SIGTERM, but due to network loss |
|
This looks much better. Thanks for taking the time to refactor! |
We could potentially tie into the |
Fix typos Remove condition arround time.Sleep Because a negative or zero duration causes Sleep to return immediately. Signed-off-by: Stephane Moser <[email protected]>
|
Discussed in today's community call; an additional reviewer is needed due to the scope of this PR. Thanks! |
hickeyma
left a comment
There was a problem hiding this comment.
Thanks for the effort @Moser-ss. The code looks good. Just a few things:
- Due to the changes involved, I would like a few more maintainers to review this in addition to the required 2.
- The PR now sets
failedstate when you interrupt (CTRL-C) aninstallandupgrade - While testing manually I saw output as follows:
$ helm install foo mychart/ --debug
install.go:178: [debug] Original chart version: ""
install.go:195: [debug] CHART PATH: /Users/mhickey/tmp/helm-charts/mychart
^CRelease foo has been cancelled.
client.go:122: [debug] creating 3 resource(s)
Error: context canceled
helm.go:81: [debug] context canceled
$ helm upgrade issue-10058 issue-10058 --debug
upgrade.go:139: [debug] preparing upgrade for issue-10058
upgrade.go:147: [debug] performing update for issue-10058
upgrade.go:319: [debug] creating upgraded release for issue-10058
client.go:203: [debug] checking 3 resources for changes
client.go:466: [debug] Looks like there are no changes for ServiceAccount "issue-10058"
client.go:466: [debug] Looks like there are no changes for Service "issue-10058"
^CRelease issue-10058 has been cancelled.
upgrade.go:420: [debug] warning: Upgrade "issue-10058" failed: context canceled
client.go:466: [debug] Looks like there are no changes for Deployment "issue-10058"
Error: UPGRADE FAILED: context canceled
helm.go:81: [debug] context canceled
UPGRADE FAILED
main.newUpgradeCmd.func2
helm.sh/helm/v3/cmd/helm/upgrade.go:196
github.com/spf13/cobra.(*Command).execute
github.com/spf13/[email protected]/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/[email protected]/command.go:960
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/[email protected]/command.go:897
main.main
helm.sh/helm/v3/cmd/helm/helm.go:80
runtime.main
runtime/proc.go:225
runtime.goexit
runtime/asm_amd64.s:1371Should I have got a stack trace on the upgrade?
Based on the code we have in the cmd/upgrade.go where we return the error wrap around the stack. On the other hand, in the cmd/install.go, we just send the error without any wrapping. |
To make the install comand consistent with upgrade comand when handling errors Signed-off-by: Stephane Moser <[email protected]>
7184b81 to
101370a
Compare
|
@Moser-ss Testing again, it would seem that stack trace will happen depending on when you you send the signal rather than always happening. |
|
Note, a follow-up pull request could add this to the uninstall command. |
mattfarina
left a comment
There was a problem hiding this comment.
It looks like I get the easy rubber stamp after @bacongobbler and @hickeyma did all the real reviewer work.
@mattfarina Do you mean to add the capability to handle SIGTERM to uninstall command? And what will be the behavior? In the case of install and upgrade commands, we have a wait flag that gives us a bigger opportunity window to press the Ctrl+C and with that interrupt the installation and left the chart in the upgrading state. I am not sure if I am correct, but in the case of uninstalling command we are not changing the state of the chart, we are just deleting all resources and history without using an intermediate state. But if you clarify the behaviour you have in mind I can work in that PR |
|
Helm Version : 3.7.2 |
What this PR does / why we need it:
This PR is open to try to fix issue #4558 and issue #8987. By handling the SIGTERM and throw an error that can be handled internally my intent is to mark the release as
FAILEDwhen the helm was interrupted or terminatedSpecial notes for your reviewer:
I still have some TODOS:
I would like to have an opinion from the maintainers to understand if this approach makes sense from your point of view
I tested in my machine ( a MacOS) and when the Ctrl +C sequence is hit the release is marked as failed
If applicable: