Skip to content

Fix leaving connection in a broken state after preliminary cancellation distributed queries#93029

Merged
azat merged 1 commit intoClickHouse:masterfrom
azat:fix-preliminary-dist-query-finish
Dec 25, 2025
Merged

Fix leaving connection in a broken state after preliminary cancellation distributed queries#93029
azat merged 1 commit intoClickHouse:masterfrom
azat:fix-preliminary-dist-query-finish

Conversation

@azat
Copy link
Member

@azat azat commented Dec 24, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix leaving connection in a broken state after preliminary cancellation distributed queries

The problem is that some places in the code (i.e.
RemoteSource::onUpdatePorts()), can call finish() even before sendQueryAsync() finishes sending the query, and so if after it will try to call finish() again (after sendQueryAsync() finishes) it will be no-op.

The problem pops up after #92807, since RemoteSource has these (anti-)pattern.

Fixes: #92807
Fixes: #93018

…on distributed queries

The problem is that some places in the code (i.e.
RemoteSource::onUpdatePorts()), can call finish() even before
sendQueryAsync() finishes sending the query, and so if after it will try
to call finish() again (after sendQueryAsync() finishes) it will be
no-op.

The problem pops up after ClickHouse#92807, since RemoteSource has these (anti-)pattern.
@clickhouse-gh
Copy link
Contributor

clickhouse-gh bot commented Dec 24, 2025

Workflow [PR], commit [9ac4c8b]

Summary:

job_name test_name status info comment
Stateless tests (amd_tsan, s3 storage, sequential, 2/2) failure
03708_flush_async_insert_queue_for_table FAIL
Integration tests (amd_tsan, 2/6) failure
test_storage_s3_queue/test_5.py::test_shutdown_order FAIL
BuzzHouse (amd_debug) failure
Logical error: 'Inconsistent AST formatting: the query: (STID: 1941-1bfa) FAIL issue
Finish Workflow failure
python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py failure

@azat azat force-pushed the fix-preliminary-dist-query-finish branch from f7243d1 to 9ac4c8b Compare December 24, 2025 20:48
@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Dec 24, 2025
Copy link
Member

@rienath rienath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for the quick fix and great explanation!

@rienath rienath self-assigned this Dec 24, 2025
@azat azat enabled auto-merge December 24, 2025 21:06
@azat azat added this pull request to the merge queue Dec 25, 2025
Merged via the queue into ClickHouse:master with commit 076f883 Dec 25, 2025
126 of 131 checks passed
@azat azat deleted the fix-preliminary-dist-query-finish branch December 25, 2025 08:12
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
Cherry pick #93029 to 25.8: Fix leaving connection in a broken state after preliminary cancellation distributed queries
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
…er preliminary cancellation distributed queries
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
Cherry pick #93029 to 25.10: Fix leaving connection in a broken state after preliminary cancellation distributed queries
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
…ter preliminary cancellation distributed queries
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
Cherry pick #93029 to 25.11: Fix leaving connection in a broken state after preliminary cancellation distributed queries
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
…ter preliminary cancellation distributed queries
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
Cherry pick #93029 to 25.12: Fix leaving connection in a broken state after preliminary cancellation distributed queries
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
…ter preliminary cancellation distributed queries
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Dec 25, 2025
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Dec 25, 2025
@robot-ch-test-poll robot-ch-test-poll added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Dec 25, 2025
clickhouse-gh bot added a commit that referenced this pull request Dec 25, 2025
Backport #93029 to 25.11: Fix leaving connection in a broken state after preliminary cancellation distributed queries
clickhouse-gh bot added a commit that referenced this pull request Dec 25, 2025
Backport #93029 to 25.12: Fix leaving connection in a broken state after preliminary cancellation distributed queries
azat added a commit that referenced this pull request Dec 25, 2025
Backport #93029 to 25.10: Fix leaving connection in a broken state after preliminary cancellation distributed queries
azat added a commit that referenced this pull request Dec 25, 2025
Backport #93029 to 25.8: Fix leaving connection in a broken state after preliminary cancellation distributed queries
AVMusorin pushed a commit to AVMusorin/ClickHouse that referenced this pull request Jan 28, 2026
The race condition causes a crash when reading from a disconnected
connection in distributed queries.

The `in` buffer:

```
(gdb) frame 1
(gdb) print this->in
$19 = {__ptr_ = 0x0, __cntrl_ = 0x0}
(gdb) print this->connected
$18 = false
```

But the same time:

```
(gdb) frame 2
    async_callback=...) at ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398
(gdb) print this->sent_query
$15 = true
(gdb) print this->cancelled
$16 = false
(gdb) print this->active_connection_count
$17 = 1
```

Current solution will not fix the issue, it just apply a check like in
`sendCancel()`.

Related PRs:
- ClickHouse#89740
- ClickHouse#92807
- ClickHouse#93029
batkovic75 pushed a commit to batkovic75/ClickHouse that referenced this pull request Jan 30, 2026
The race condition causes a crash when reading from a disconnected
connection in distributed queries.

The `in` buffer:

```
(gdb) frame 1
(gdb) print this->in
$19 = {__ptr_ = 0x0, __cntrl_ = 0x0}
(gdb) print this->connected
$18 = false
```

But the same time:

```
(gdb) frame 2
    async_callback=...) at ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398
(gdb) print this->sent_query
$15 = true
(gdb) print this->cancelled
$16 = false
(gdb) print this->active_connection_count
$17 = 1
```

Current solution will not fix the issue, it just apply a check like in
`sendCancel()`.

Related PRs:
- ClickHouse#89740
- ClickHouse#92807
- ClickHouse#93029
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-bugfix Pull request with bugfix, not backported by default pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo v25.8-must-backport v25.10-must-backport v25.11-must-backport v25.12-must-backport

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Connection left in a broken state (was: flaky test 03259_grouping_sets_aliases)

5 participants