Skip to content

Fix Shutdown deadlock in some scenarios#10044

Merged
Roasbeef merged 2 commits intolightningnetwork:masterfrom
ziggie1984:fix-shutdown-issue
Jul 8, 2025
Merged

Fix Shutdown deadlock in some scenarios#10044
Roasbeef merged 2 commits intolightningnetwork:masterfrom
ziggie1984:fix-shutdown-issue

Conversation

@ziggie1984
Copy link
Collaborator

@ziggie1984 ziggie1984 commented Jul 7, 2025

In case LND fails to startup correctly and fail before we would start the SubSwapper we would deadlock when shutting down the server here:

lnd/chanbackup/pubsub.go

Lines 206 to 207 in ff32e90

case s.manualUpdates <- update:
case <-s.quit:

Happend while fixing: #10041

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @ziggie1984, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical shutdown deadlock within the chanbackup component, specifically when the backup process might not have fully initialized. My changes enhance the SubSwapper's lifecycle management by introducing atomic state tracking and robust startup/shutdown logic, ensuring operations are only performed on a fully initialized component. I've also updated the release notes to reflect this important fix.

Highlights

  • Deadlock Fix: I've addressed a shutdown deadlock in the chanbackup component that could occur if the backuper had not fully started. This was achieved by refining the SubSwapper's lifecycle management.
  • Concurrency Control Improvement: I've replaced sync.Once with atomic.Bool for managing the SubSwapper's started and stopped states. This provides more explicit, robust, and thread-safe control over the component's initialization and termination.
  • State Validation: I've added checks to prevent operations like ManualUpdate on a SubSwapper that has not been successfully started. Additionally, the started state is now reset if the initial updateBackupFile() call fails during startup, preventing a partially initialized state.
  • Documentation: I've updated the release notes to include an entry for this fix, detailing the resolution of the shutdown deadlock.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a potential deadlock in the chanbackup.SubSwapper by refactoring the start/stop logic from sync.Once to atomic.Bool. This allows for more robust state management, including resetting the started state on a failed start and checking if the service is running before performing a manual update. The changes look good and address the described issue. I've found one potential issue in the new Stop() implementation that could lead to a deadlock and suggested a fix.

@ziggie1984 ziggie1984 force-pushed the fix-shutdown-issue branch 2 times, most recently from adc4c55 to 3c468e0 Compare July 7, 2025 08:20
@ziggie1984 ziggie1984 requested a review from starius July 7, 2025 08:21
@ziggie1984 ziggie1984 self-assigned this Jul 7, 2025
@ziggie1984
Copy link
Collaborator Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a potential shutdown deadlock by replacing sync.Once with atomic.Bool for managing the SubSwapper's lifecycle. This change allows for more robust state management, especially in startup failure scenarios. I've suggested one improvement to prevent a stopped SubSwapper from being restarted, which will make the state transitions more predictable.

@saubyk saubyk added this to the v0.19.2 milestone Jul 7, 2025
// Start starts the chanbackup.SubSwapper.
func (s *SubSwapper) Start() error {
var startErr error
s.started.Do(func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did the sync.Once usage mask things here? Mainly that w/ the public API you can't tell if something has actually run once or not?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we have the call here:

lnd/server.go

Line 2801 in b815109

err := s.chanSubSwapper.ManualUpdate(singles)

but if we never started the swapper (because we fail before) we will deadlock there and not shutdown cleanly.

@ziggie1984 ziggie1984 requested a review from Roasbeef July 7, 2025 22:17
Copy link
Collaborator

@starius starius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 💾
Great fix! I added small proposal to improve code readability.


// stopped tracks whether the SubSwapper has been stopped and ensures
// it can only be stopped once.
stopped atomic.Bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove stopped and just use quit state as an indication of closeness? These two variables seem to have the same meaning.

@ziggie1984 ziggie1984 force-pushed the fix-shutdown-issue branch from 3c468e0 to d9f57f5 Compare July 8, 2025 13:46
@ziggie1984 ziggie1984 requested review from bhandras and removed request for Roasbeef July 8, 2025 13:48
Copy link
Collaborator

@bhandras bhandras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉

return
}
// Ensure we only start the SubSwapper once.
if !s.started.CompareAndSwap(false, true) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: an even simpler way to do this is to just:

if s.started.Swap(true) {
    // already started
}

@ziggie1984 ziggie1984 force-pushed the fix-shutdown-issue branch from d9f57f5 to fa165e8 Compare July 8, 2025 15:06
@ziggie1984 ziggie1984 requested a review from starius July 8, 2025 15:07
Copy link
Collaborator

@starius starius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

}

s.wg.Add(1)
go s.backupUpdater()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ready is used only by Start and backupUpdater.
We can make it local to Start and pass to backupUpdater as an argument.

@ziggie1984 ziggie1984 force-pushed the fix-shutdown-issue branch from fa165e8 to a119e0a Compare July 8, 2025 17:09
@ziggie1984 ziggie1984 requested a review from Roasbeef July 8, 2025 17:10
@ziggie1984
Copy link
Collaborator Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix a deadlock that can occur during shutdown if the SubSwapper is not started correctly. The changes involve using an atomic.Bool to track the started state and adding checks to prevent operations on a non-started SubSwapper.

The core logic change in ManualUpdate correctly prevents the deadlock. The use of a ready channel in Start to wait for the backupUpdater goroutine is also a good improvement for robustness.

However, I've identified a critical race condition in the Start method's implementation. The check and set of the started flag are not atomic, which could lead to issues under concurrent calls. I've provided suggestions to address this.

The release note update is clear and appropriate.

@ziggie1984 ziggie1984 force-pushed the fix-shutdown-issue branch 2 times, most recently from 1b87eac to be2c0a0 Compare July 8, 2025 17:51
@ziggie1984
Copy link
Collaborator Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix a shutdown deadlock in SubSwapper by introducing an isActive flag to prevent updates when the swapper is not running. This is a good approach, but the current implementation has a race condition where ManualUpdate can still block if Stop is called concurrently. I've provided a suggestion to fix this race condition.

Additionally, I've pointed out a misleading comment that was updated and suggested a correction to improve code clarity.

@ziggie1984 ziggie1984 force-pushed the fix-shutdown-issue branch from be2c0a0 to 354585e Compare July 8, 2025 18:10
@ziggie1984 ziggie1984 requested review from bhandras and starius July 8, 2025 18:10
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🪖

@Roasbeef Roasbeef merged commit d40ac45 into lightningnetwork:master Jul 8, 2025
35 of 40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants