
Avoid possible crash for distributed queries in case of cancellation#95466

Merged
azat merged 6 commits into ClickHouse:master from AVMusorin:fix-crashing-distributed-cancel
Jan 30, 2026

Conversation

AVMusorin (Contributor) commented Jan 28, 2026

Linked issues

Changelog category (leave one):

  • Critical Bug Fix (crash, data loss, RBAC) or LOGICAL_ERROR

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Avoid possible crash for distributed queries in case of cancellation

Details

```
(gdb) bt
#0  DB::readVarUInt (x=<optimized out>, istr=...) at ./src/IO/VarInt.h:98
#1  DB::Connection::receivePacket (this=0x7fb32ee2f418) at ./ci/tmp/build/./src/Client/Connection.cpp:1315
#2  0x000000001a4b71e8 in DB::MultiplexedConnections::receivePacketUnlocked(std::__1::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned int)>) (this=0x7fb268e99c40, async_callback=...)
    at ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398
#3  0x000000001a4b6efe in DB::MultiplexedConnections::receivePacket (this=0x7fb268e99c40) at ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:253
#4  0x00000000178df465 in DB::RemoteQueryExecutor::finish (this=0x7fb17401af98) at ./ci/tmp/build/./src/QueryPipeline/RemoteQueryExecutor.cpp:811
#5  0x000000001a6b747a in DB::ExecutingGraph::updateNode (this=0x7fb4889a9f80, pid=<optimized out>, queue=..., async_queue=...) at ./ci/tmp/build/./src/Processors/Executors/ExecutingGraph.cpp:253
#6  0x000000001a6b1be6 in DB::PipelineExecutor::executeStepImpl (this=0x7fa7149f3c10, thread_num=<optimized out>, cpu_slot=<optimized out>, yield_flag=0x0) at ./ci/tmp/build/./src/Processors/Executors/PipelineExecutor.cpp:371
#7  0x000000001a6b5c03 in DB::PipelineExecutor::executeSingleThread (this=0x7fa7149f3c10, thread_num=814337472, cpu_slot=0x0) at ./ci/tmp/build/./src/Processors/Executors/PipelineExecutor.cpp:279
#8  DB::PipelineExecutor::spawnThreads(std::__1::shared_ptr<DB::IAcquiredSlot>)::$_0::operator()() const (this=0x7fb22db706b0) at ./ci/tmp/build/./src/Processors/Executors/PipelineExecutor.cpp:565
```

The `in` buffer:

```
(gdb) frame 1
(gdb) print this->in
$19 = {__ptr_ = 0x0, __cntrl_ = 0x0}
(gdb) print this->connected
$18 = false
```

But at the same time:

```
(gdb) frame 2
  async_callback=...) at ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398
(gdb) print this->sent_query
$15 = true
(gdb) print this->cancelled
$16 = false
(gdb) print this->active_connection_count
$17 = 1
```

Ref: #89740
Ref: #92807
Ref: #93029

CC @kssenii @azat

kssenii and others added 2 commits January 28, 2026 19:50
The race condition causes a crash when reading from a disconnected
connection in distributed queries.

The `in` buffer:

```
(gdb) frame 1
(gdb) print this->in
$19 = {__ptr_ = 0x0, __cntrl_ = 0x0}
(gdb) print this->connected
$18 = false
```

But at the same time:

```
(gdb) frame 2
    async_callback=...) at ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398
(gdb) print this->sent_query
$15 = true
(gdb) print this->cancelled
$16 = false
(gdb) print this->active_connection_count
$17 = 1
```

The current solution does not fix the underlying issue; it just applies a check
similar to the one in `sendCancel()`.

Related PRs:
- ClickHouse#89740
- ClickHouse#92807
- ClickHouse#93029
@azat azat self-assigned this Jan 28, 2026
```cpp
/// If we already disconnected.
if (!in)
{
    throw Exception(ErrorCodes::NETWORK_ERROR, "Connection to {} is terminated", getDescription());
}
```
azat (Member) replied:
Ok, I guess we can add this sanity check, but it should be LOGICAL_ERROR, since it is a bug in the code.
But I still want to understand what this bug in the code is.

Also are you sure that it can still happen after other fixes (do you have #93029 in your version)? And if so, do you have logs for this query?

AVMusorin (Contributor, author) replied:

> Also are you sure that it can still happen after other fixes (do you have #93029 in your version)? And if so, do you have logs for this query?

Yes, on 25.10.4.104

AVMusorin (Contributor, author) replied:

> but it should be LOGICAL_ERROR

done

korowa (Contributor) commented Jan 29, 2026

It looks like it's caused by a failed RemoteQueryExecutor::cancel() call followed by RemoteQueryExecutor::finish():

  1. RemoteQueryExecutor::cancel() calls RemoteQueryExecutor::tryCancel(), which sets the was_cancelled flag and proceeds to MultiplexedConnections::sendCancel() -> Connection::sendCancel(), which throws NETWORK_ERROR.

  2. RemoteQueryExecutor::finish() calls tryCancel() again, which is now a no-op because the was_cancelled flag is already set, so the executor proceeds to read from the closed/invalidated connection (finished is still false, and it still sees a non-zero active connection count) and crashes there.

Checking the query logs for the queries and hostnames from the failures gives the following; these are roughly the last records in the query log (the coredump contains a segfault on the connection to <NODE_NAME>):

```sql
SELECT
    event_time,
    thread_id,
    message
FROM system.text_log
WHERE (event_date = '2026-01-27') AND (((query_id = '188E979838C520A2') AND ((message ILIKE '%<NODE_NAME>%') OR (logger_name ILIKE '%<NODE_NAME>%'))) OR ((level = 'Fatal') AND (event_time >= '2026-01-27 16:53:49')))
ORDER BY event_time_microseconds ASC
```

Query id: aac37c33-d4f0-45b4-9ef2-59535f0dea01

    ┌──────────event_time─┬─thread_id─┬─message────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
 1. │ 2026-01-27 16:53:29 │    711638 │ Sending external_roles with query: [] (0)                                                                                                                                                                                                                  │
 2. │ 2026-01-27 16:53:29 │    711638 │ Sent data for 2 scalars, total 2 rows in 5.2118e-05 sec., 37907 rows/sec., 72.00 B (1.30 MiB/sec.), compressed 0.47368421052631576 times to 152.00 B (2.74 MiB/sec.)                                                                                       │
 3. │ 2026-01-27 16:53:49 │    711755 │ Error occurs on cancellation.: Code: 210. DB::Exception: Connection to <NODE_NAME>:9000 terminated. (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):                                               ↴│
    │                     │           │↳                                                                                                                                                                                                                                                          ↴│
    │                     │           │↳0. ./ci/tmp/build/./src/Common/Exception.cpp:131: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000013d9069f                                                                                                                ↴│
    │                     │           │↳1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000c9dd20e                                                                                                                                                                             ↴│
    │                     │           │↳2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000c9dccc0                                                                                                                                                                              ↴│
    │                     │           │↳3. DB::Exception::Exception<String const&>(int, FormatStringHelperImpl<std::type_identity<String const&>::type>, String const&) @ 0x000000000dd3f46b                                                                                                      ↴│
    │                     │           │↳4. ./ci/tmp/build/./src/Client/Connection.cpp:1008: DB::Connection::sendCancel() @ 0x000000001a468bb7                                                                                                                                                     ↴│
    │                     │           │↳5. ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:283: DB::MultiplexedConnections::sendCancel() @ 0x000000001a4b77ab                                                                                                                              ↴│
    │                     │           │↳6. ./ci/tmp/build/./src/QueryPipeline/RemoteQueryExecutor.cpp:959: DB::RemoteQueryExecutor::tryCancel(char const*) @ 0x00000000178df773                                                                                                                   ↴│
    │                     │           │↳7. ./ci/tmp/build/./src/QueryPipeline/RemoteQueryExecutor.cpp:857: DB::RemoteQueryExecutor::cancel() @ 0x00000000178dfd3e                                                                                                                                 ↴│
    │                     │           │↳8. ./ci/tmp/build/./src/Processors/Sources/RemoteSource.cpp:242: DB::RemoteSource::onCancel() @ 0x000000001aa9ba53                                                                                                                                        ↴│
    │                     │           │↳9. ./src/Processors/IProcessor.cpp:73: DB::ExecutingGraph::cancel(bool) @ 0x000000001a6b977c                                                                                                                                                              ↴│
    │                     │           │↳10. ./ci/tmp/build/./src/Processors/Executors/PipelineExecutor.cpp:104: DB::PipelineExecutor::executeStepImpl(unsigned long, DB::IAcquiredSlot*, std::atomic<bool>*) @ 0x000000001a6b1ec6                                                                 ↴│
    │                     │           │↳11. ./ci/tmp/build/./src/Processors/Executors/PipelineExecutor.cpp:279: void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::PipelineExecutor::spawnThreads(std::shared_ptr<DB::IAcquiredSlot>)::$_0, void ()>>(std::__function::__policy_storage const*) @ 0x000000001a6b5c03↴│
    │                     │           │↳12. ./contrib/llvm-project/libcxx/include/__functional/function.h:716: ? @ 0x0000000013ee502b                                                                                                                                                             ↴│
    │                     │           │↳13. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*>(void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*&&)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x0000000013eec4a6↴│
    │                     │           │↳14. ./contrib/llvm-project/libcxx/include/__functional/function.h:716: ? @ 0x0000000013ee1f52                                                                                                                                                             ↴│
    │                     │           │↳15. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x0000000013ee9bda↴│
    │                     │           │↳16. start_thread @ 0x0000000000089134                                                                                                                                                                                                                     ↴│
    │                     │           │↳17. __clone3 @ 0x00000000001097dc                                                                                                                                                                                                                         ↴│
    │                     │           │↳ (version 25.10.4.104 (official build))                                                                                                                                                                                                                    │
 4. │ 2026-01-27 16:53:49 │    706418 │ ########## Short fault info ############                                                                                                                                                                                                                   │
 5. │ 2026-01-27 16:53:49 │    706418 │ (version 25.10.4.104 (official build), build id: D1085B22D012F1AB002A03277859D9462E762533, git hash: 5051266d584791b2d565e4ebc40bc7cdd4edd8f3, architecture: x86_64) (from thread 711612) Received signal 11                                               │
 6. │ 2026-01-27 16:53:49 │    706418 │ Signal description: Segmentation fault                                                                                                                                                                                                                     │
 7. │ 2026-01-27 16:53:49 │    706418 │ Address: 0x8. Access: read. Address not mapped to object.                                                                                                                                                                                                  │
 8. │ 2026-01-27 16:53:49 │    706418 │ Stack trace: 0x000000001a46d9fb 0x000000001a4b71e8 0x000000001a4b6efe 0x00000000178df465 0x000000001a6b747a 0x000000001a6b1be6 0x000000001a6b5c03 0x0000000013ee502b 0x0000000013eec4a6 0x0000000013ee1f52 0x0000000013ee9bda 0x00007fb51f0ff134 0x00007fb51f17f7dc │
 9. │ 2026-01-27 16:53:49 │    706418 │ ########################################                                                                                                                                                                                                                   │
10. │ 2026-01-27 16:53:49 │    706418 │ (version 25.10.4.104 (official build), build id: D1085B22D012F1AB002A03277859D9462E762533, git hash: 5051266d584791b2d565e4ebc40bc7cdd4edd8f3) (from thread 711612) (query_id: 188E979838C520A2) (query:                                                  ↴│
    │                     │           │↳        ) Received signal Segmentation fault (11)                                                                                                                                                                                                                │
11. │ 2026-01-27 16:53:49 │    706418 │ Address: 0x8. Access: read. Address not mapped to object.                                                                                                                                                                                                  │
12. │ 2026-01-27 16:53:49 │    706418 │ Stack trace: 0x000000001a46d9fb 0x000000001a4b71e8 0x000000001a4b6efe 0x00000000178df465 0x000000001a6b747a 0x000000001a6b1be6 0x000000001a6b5c03 0x0000000013ee502b 0x0000000013eec4a6 0x0000000013ee1f52 0x0000000013ee9bda 0x00007fb51f0ff134 0x00007fb51f17f7dc │
13. │ 2026-01-27 16:53:49 │    706418 │ 2.0. inlined from ./src/IO/VarInt.h:98: DB::readVarUInt(unsigned long&, DB::ReadBuffer&)                                                                                                                                                                   │
14. │ 2026-01-27 16:53:49 │    706418 │ 2. ./ci/tmp/build/./src/Client/Connection.cpp:1315: DB::Connection::receivePacket() @ 0x000000001a46d9fb                                                                                                                                                   │
15. │ 2026-01-27 16:53:49 │    706418 │ 3. ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398: DB::MultiplexedConnections::receivePacketUnlocked(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>) @ 0x000000001a4b71e8                   │
16. │ 2026-01-27 16:53:49 │    706418 │ 4. ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:253: DB::MultiplexedConnections::receivePacket() @ 0x000000001a4b6efe                                                                                                                            │
17. │ 2026-01-27 16:53:49 │    706418 │ 5. ./ci/tmp/build/./src/QueryPipeline/RemoteQueryExecutor.cpp:811: DB::RemoteQueryExecutor::finish() @ 0x00000000178df465                                                                                                                                  │
18. │ 2026-01-27 16:53:49 │    706418 │ 6. ./ci/tmp/build/./src/Processors/Executors/ExecutingGraph.cpp:253: DB::ExecutingGraph::updateNode(unsigned long, std::queue<DB::ExecutingGraph::Node*, boost::container::devector<DB::ExecutingGraph::Node*, AllocatorWithMemoryTracking<DB::ExecutingGraph::Node*>, void>>&, std::queue<DB::ExecutingGraph::Node*, boost::container::devector<DB::ExecutingGraph::Node*, AllocatorWithMemoryTracking<DB::ExecutingGraph::Node*>, void>>&) @ 0x000000001a6b747a │
19. │ 2026-01-27 16:53:49 │    706418 │ 7. ./ci/tmp/build/./src/Processors/Executors/PipelineExecutor.cpp:371: DB::PipelineExecutor::executeStepImpl(unsigned long, DB::IAcquiredSlot*, std::atomic<bool>*) @ 0x000000001a6b1be6                                                                   │
20. │ 2026-01-27 16:53:49 │    706418 │ 8.0. inlined from ./ci/tmp/build/./src/Processors/Executors/PipelineExecutor.cpp:279: DB::PipelineExecutor::executeSingleThread(unsigned long, DB::IAcquiredSlot*)                                                                                         │
21. │ 2026-01-27 16:53:49 │    706418 │ 8.1. inlined from ./ci/tmp/build/./src/Processors/Executors/PipelineExecutor.cpp:565: operator()
...

azat (Member) commented Jan 29, 2026

Egh, again it is onUpdatePorts...

@azat azat changed the title Avoid crash in Connection::receivePacket Avoid possible crash for distributed queries in case of cancellation Jan 29, 2026
azat (Member) commented Jan 29, 2026

I've pushed one more patch into your PR to prevent this LOGICAL_ERROR, and updated the changelog entry to make it user-friendly.

@azat azat added the can be tested Allows running workflows for external contributors label Jan 29, 2026
clickhouse-gh bot commented Jan 29, 2026

Workflow [PR], commit [33d49e1]

Summary:

| Job | Test | Status | Info |
| --- | --- | --- | --- |
| Stateless tests (amd_debug, parallel) | 02260_alter_compact_part_drop_nested_column | FAIL | cidb |
| Stateless tests (amd_debug, parallel) | Logical error: There was an error on A: B (probably it's a bug) (STID: 6337-3cad) | FAIL | cidb, issue |
| Stress test (amd_tsan) | Logical error: Block structure mismatch in A stream: different number of columns: (STID: 0993-2da1) | FAIL | cidb, issue |
| Stress test (amd_ubsan) | Logical error: Block structure mismatch in A stream: different number of columns: (STID: 0993-38e6) | FAIL | cidb, issue |
| Finish Workflow | python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py | failure | |

@clickhouse-gh clickhouse-gh bot added pr-critical-bugfix pr-must-backport Pull request should be backported intentionally. Use this label with great care! labels Jan 29, 2026
@azat azat enabled auto-merge January 30, 2026 09:32
@azat azat added this pull request to the merge queue Jan 30, 2026
Merged via the queue into ClickHouse:master with commit a97cc84 Jan 30, 2026
129 of 134 checks passed
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added pr-synced-to-cloud The PR is synced to the cloud repo pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR labels Jan 30, 2026
robot-ch-test-poll1 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #95466 to 25.3: Avoid possible crash for distributed queries in case of cancellation
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
robot-ch-test-poll1 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #95466 to 25.8: Avoid possible crash for distributed queries in case of cancellation
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
robot-ch-test-poll1 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #95466 to 25.10: Avoid possible crash for distributed queries in case of cancellation
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
robot-ch-test-poll1 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #95466 to 25.11: Avoid possible crash for distributed queries in case of cancellation
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
robot-ch-test-poll1 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #95466 to 25.12: Avoid possible crash for distributed queries in case of cancellation
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
robot-ch-test-poll1 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #95466 to 26.1: Avoid possible crash for distributed queries in case of cancellation
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
@AVMusorin AVMusorin deleted the fix-crashing-distributed-cancel branch January 30, 2026 10:55
@robot-clickhouse robot-clickhouse added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jan 30, 2026
clickhouse-gh bot added a commit that referenced this pull request Jan 30, 2026
Backport #95466 to 25.12: Avoid possible crash for distributed queries in case of cancellation
clickhouse-gh bot added a commit that referenced this pull request Jan 30, 2026
Backport #95466 to 26.1: Avoid possible crash for distributed queries in case of cancellation
azat added a commit that referenced this pull request Jan 30, 2026
Backport #95466 to 25.11: Avoid possible crash for distributed queries in case of cancellation
azat added a commit that referenced this pull request Jan 30, 2026
Backport #95466 to 25.10: Avoid possible crash for distributed queries in case of cancellation
azat added a commit that referenced this pull request Jan 30, 2026
Backport #95466 to 25.8: Avoid possible crash for distributed queries in case of cancellation