Avoid crash due to reading from remote server after disconnect in remote queries during cancellation#89740

Merged
azat merged 1 commit into ClickHouse:master from azat:dist-fix-cancel
Nov 10, 2025

Conversation

@azat
Member

@azat azat commented Nov 7, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Avoid crash due to reading from remote server after disconnect in remote queries during cancellation. Resolves #89468

Details

Previously, only sendCancel() was protected against disconnection, but RemoteQueryExecutor may still read from the connection after sendCancel(), and this leads to a NULL pointer dereference:

2025.11.07 14:05:20.159362 [ 73554 ] {b7ebe401-aa59-4306-9663-278657eac110} <Debug> ReadBuffer: ReadBuffer is canceled by the exception: Code: 241. DB::Exception: Query memory tracker: fault injected. Would use 2.00 MiB (attempt to allocate chunk of 1.00 MiB), maximum: 9.31 GiB. (MEMORY_LIMIT_EXCEEDED), Stack trace (when copying this message, always include the lines below):

0. ./ci/tmp/build/./src/Common/Exception.cpp:131: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000013c3c7df
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000c8902ce
2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000c88fd80
3. ./src/Common/Exception.h:141: DB::Exception::Exception<char const*, char const*, String, String, String>(int, FormatStringHelperImpl<std::type_identity<char const*>::type, std::type_identity<char const*>::type, std::type_identity<String>::type, std::type_identity<String>::type, std::type_identity<String>::type>, char const*&&, char const*&&, String&&, String&&, String&&) @ 0x0000000013cbcc21
4. ./ci/tmp/build/./src/Common/MemoryTracker.cpp:333: MemoryTracker::allocImpl(long, bool, MemoryTracker*, double) @ 0x0000000013cba10e
5. ./ci/tmp/build/./src/Common/MemoryTracker.cpp:476: MemoryTracker::allocImpl(long, bool, MemoryTracker*, double) @ 0x0000000013cb9dc4
6. ./ci/tmp/build/./src/Common/CurrentMemoryTracker.cpp:80: CurrentMemoryTracker::allocImpl(long, bool) @ 0x0000000013c0c079
7. ./src/Common/CurrentMemoryTracker.cpp:103: Allocator<false, false>::realloc(void*, unsigned long, unsigned long, unsigned long) @ 0x0000000013c096c7
8. ./src/Common/PODArray.h:167: void DB::PODArrayBase<1ul, 4096ul, Allocator<false, false>, 0ul, 0ul>::reallocPowerOfTwoElements<>(unsigned long) @ 0x0000000013cc7f0e
9. ./src/Common/PODArray.h:243: DB::CompressedReadBufferBase::readCompressedData(unsigned long&, unsigned long&, bool) @ 0x000000001af881d7
10. ./ci/tmp/build/./src/Compression/CompressedReadBuffer.cpp:10: non-virtual thunk to DB::CompressedReadBuffer::nextImpl() @ 0x000000001af87922
11. ./ci/tmp/build/./src/IO/ReadBuffer.cpp:96: DB::ReadBuffer::next() @ 0x0000000013d245ed
12. ./src/IO/ReadBuffer.h:81: DB::NativeReader::read() @ 0x000000001a4e2e4f
13. ./ci/tmp/build/./src/Client/Connection.cpp:1422: DB::Connection::receiveDataImpl(DB::NativeReader&) @ 0x000000001a2edb59
14. ./ci/tmp/build/./src/Client/Connection.cpp:1434: DB::Connection::receivePacket() @ 0x000000001a2ed154
15. ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398: DB::MultiplexedConnections::receivePacketUnlocked(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>) @ 0x000000001a335768
16. ./ci/tmp/build/./src/QueryPipeline/RemoteQueryExecutorReadContext.cpp:67: DB::RemoteQueryExecutorReadContext::Task::run(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>, std::function<void ()>) @ 0x000000001779498d
17. ./ci/tmp/build/./src/Common/AsyncTaskExecutor.cpp:89: void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>>(boost::context::detail::transfer_t) @ 0x0000000017793e83 (version 25.10.1.3832 (official build))

2025.11.07 14:05:20.159570 [ 28135 ] {} <Debug> ReadBuffer: ReadBuffer is canceled by the exception: Code: 210. DB::NetException: Connection reset by peer, while reading from socket (peer: [::ffff:127.0.0.1]:57678, local: [::ffff:127.0.0.10]:9000). (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):

0. ./ci/tmp/build/./src/Common/Exception.cpp:131: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000013c3c7df
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000c8902ce
2. ./src/Common/NetException.h:26: DB::NetException::NetException<String, String, String>(int, FormatStringHelperImpl<std::type_identity<String>::type, std::type_identity<String>::type, std::type_identity<String>::type>, String&&, String&&, String&&) @ 0x0000000013e6d433
3. ./ci/tmp/build/./src/IO/ReadBufferFromPocoSocket.cpp:83: DB::ReadBufferFromPocoSocketBase::socketReceiveBytesImpl(char*, unsigned long) @ 0x0000000013e6f17b
4. ./ci/tmp/build/./src/IO/ReadBufferFromPocoSocket.cpp:107: DB::ReadBufferFromPocoSocketBase::nextImpl() @ 0x0000000013e6fa75
5. ./ci/tmp/build/./src/IO/ReadBuffer.cpp:96: DB::ReadBuffer::next() @ 0x0000000013d245ed
6. ./src/IO/ReadBuffer.h:81: DB::TCPHandler::runImpl() @ 0x000000001a46b7b0
7. ./ci/tmp/build/./src/Server/TCPHandler.cpp:2836: DB::TCPHandler::run() @ 0x000000001a48ec99
8. ./ci/tmp/build/./base/poco/Net/src/TCPServerConnection.cpp:40: Poco::Net::TCPServerConnection::start() @ 0x000000001f54aa87
9. ./ci/tmp/build/./base/poco/Net/src/TCPServerDispatcher.cpp:115: Poco::Net::TCPServerDispatcher::run() @ 0x000000001f54af19
10. ./ci/tmp/build/./base/poco/Foundation/src/ThreadPool.cpp:205: Poco::PooledThread::run() @ 0x000000001f5116c7
11. ./base/poco/Foundation/src/Thread_POSIX.cpp:341: Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001f50fac1
12. ? @ 0x0000000000094ac3
13. ? @ 0x00000000001268c0 (version 25.10.1.3832 (official build))

2025.11.07 14:05:20.159568 [ 63583 ] {b7ebe401-aa59-4306-9663-278657eac110} <Trace> ReadFromParallelRemoteReplicasStep: (127.0.0.10:9000) Cancelling query because enough data has been read

BaseDaemon: Address: 0x8. Access: read. Address not mapped to object.
BaseDaemon: Stack trace: 0x000000001a2ebf7b 0x000000001a335768 0x000000001a33547e 0x000000001777d565 0x000000001a53597a 0x000000001a5300e6 0x000000001a534103 0x0000000013d908eb 0x0000000013d97d66 0x0000000013d8d812 0x0000000013d9549a 0x00007f521e393ac3 0x00007f521e4258c0
BaseDaemon: 2.0. inlined from ./src/IO/VarInt.h:98: DB::readVarUInt(unsigned long&, DB::ReadBuffer&)
BaseDaemon: 2. ./ci/tmp/build/./src/Client/Connection.cpp:1315: DB::Connection::receivePacket() @ 0x000000001a2ebf7b
BaseDaemon: 3. ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398: DB::MultiplexedConnections::receivePacketUnlocked(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>) @ 0x000000001a335768
BaseDaemon: 4. ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:253: DB::MultiplexedConnections::receivePacket() @ 0x000000001a33547e
BaseDaemon: 5. ./ci/tmp/build/./src/QueryPipeline/RemoteQueryExecutor.cpp:808: DB::RemoteQueryExecutor::finish() @ 0x000000001777d565

So let's simply throw in sendCancel() if the connection was lost.
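A minimal sketch of the idea, using hypothetical, heavily simplified stand-ins for the real DB::Connection (the names and members below are illustrative only, not the actual ClickHouse API): once the connection is lost, sendCancel() throws instead of returning, so the caller never proceeds to read through the already-destroyed `in` buffer:

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical, heavily simplified stand-ins for the real DB::Connection.
// Only the pieces needed to illustrate the fix are modeled here.
struct ReadBuffer {};

class Connection
{
public:
    bool connected = true;
    std::shared_ptr<ReadBuffer> in = std::make_shared<ReadBuffer>();

    void disconnect()
    {
        connected = false;
        in.reset();  // after this, any read through `in` is a null dereference
    }

    // The fix: fail fast instead of letting the caller proceed to
    // receivePacket() on a connection whose buffers are already gone.
    void sendCancel()
    {
        if (!connected)
            throw std::runtime_error("Attempt to cancel a query over a lost connection");
        // ... write the Cancel packet to the socket ...
    }
};
```

With this check, a cancellation racing with a disconnect surfaces as a regular exception in the executor rather than a crash.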

CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=89712&sha=9eebfa48a4c58a959ae9b57f3fa6e6eaa5bac61e&name_0=PR

@clickhouse-gh
Contributor

clickhouse-gh bot commented Nov 7, 2025

Workflow [PR], commit [0d8387c]

Summary of failing jobs:

  • Integration tests (amd_asan, old analyzer, 3/6): test_s3_access_headers/test.py::test_custom_access_header[test_access_over_custom_header] FAIL (cidb)
  • Integration tests (amd_binary, 5/5): test_s3_access_headers/test.py::test_custom_access_header[test_access_over_custom_header] FAIL (cidb)
  • Integration tests (arm_binary, distributed plan, 4/4): test_s3_access_headers/test.py::test_custom_access_header[test_access_over_custom_header] FAIL (cidb)
  • Integration tests (amd_tsan, 3/6): test_s3_access_headers/test.py::test_custom_access_header[test_access_over_custom_header] FAIL (cidb)
  • Integration tests (amd_tsan, 5/6): test_storage_nats/test_nats_core.py::test_nats_no_connection_at_startup_2 FAIL (cidb)
  • AST fuzzer (amd_ubsan): Logical error: 'Expected reading a single granule with 1384 rows, requested 3616 rows in Compact part in bucket 0'. FAIL (cidb)
  • Finish Workflow: python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py failure

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Nov 7, 2025
@azat azat requested a review from shankar-iyer November 7, 2025 17:53
@azat
Member Author

azat commented Nov 10, 2025

CI:

@azat azat added this pull request to the merge queue Nov 10, 2025
Merged via the queue into ClickHouse:master with commit afc4cac Nov 10, 2025
123 of 131 checks passed
@azat azat deleted the dist-fix-cancel branch November 10, 2025 12:47
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Nov 10, 2025
@robot-clickhouse robot-clickhouse added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Nov 24, 2025
robot-ch-test-poll4 added a commit that referenced this pull request Nov 24, 2025
Cherry pick #89740 to 25.8: Avoid crash due to reading from remote server after disconnect in remote queries during cancellation
robot-clickhouse added a commit that referenced this pull request Nov 24, 2025
…r after disconnect in remote queries during cancellation
robot-ch-test-poll4 added a commit that referenced this pull request Nov 24, 2025
Cherry pick #89740 to 25.9: Avoid crash due to reading from remote server after disconnect in remote queries during cancellation
robot-clickhouse added a commit that referenced this pull request Nov 24, 2025
…r after disconnect in remote queries during cancellation
robot-ch-test-poll4 added a commit that referenced this pull request Nov 24, 2025
Cherry pick #89740 to 25.10: Avoid crash due to reading from remote server after disconnect in remote queries during cancellation
robot-clickhouse added a commit that referenced this pull request Nov 24, 2025
…er after disconnect in remote queries during cancellation
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Nov 24, 2025
clickhouse-gh bot added a commit that referenced this pull request Nov 24, 2025
Backport #89740 to 25.10: Avoid crash due to reading from remote server after disconnect in remote queries during cancellation
rienath added a commit that referenced this pull request Nov 25, 2025
Backport #89740 to 25.8: Avoid crash due to reading from remote server after disconnect in remote queries during cancellation
rienath added a commit that referenced this pull request Nov 25, 2025
Backport #89740 to 25.9: Avoid crash due to reading from remote server after disconnect in remote queries during cancellation
@rienath rienath assigned rienath and shankar-iyer and unassigned shankar-iyer and rienath Dec 8, 2025
@rienath
Member

rienath commented Dec 8, 2025

Sorry, I thought I was typing in console

AVMusorin pushed a commit to AVMusorin/ClickHouse that referenced this pull request Jan 28, 2026
The race condition causes a crash when reading from a disconnected
connection in distributed queries.

The `in` buffer:

```
(gdb) frame 1
(gdb) print this->in
$19 = {__ptr_ = 0x0, __cntrl_ = 0x0}
(gdb) print this->connected
$18 = false
```

But at the same time:

```
(gdb) frame 2
    async_callback=...) at ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398
(gdb) print this->sent_query
$15 = true
(gdb) print this->cancelled
$16 = false
(gdb) print this->active_connection_count
$17 = 1
```

The current solution does not fix the underlying race; it just applies a check like
the one in `sendCancel()`.

Related PRs:
- ClickHouse#89740
- ClickHouse#92807
- ClickHouse#93029
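The inconsistent state captured in the gdb output above can be modeled in a minimal sketch (hypothetical names and members, not the real MultiplexedConnections API): the executor-side bookkeeping still reports an active replica while the connection object has already dropped its `in` buffer, so a guard is needed before any read:

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical minimal model of the state seen in gdb: the connection is
// gone, but the executor still thinks a replica is active.
struct ReadBuffer { unsigned char first_byte = 0; };

struct Connection
{
    bool connected = true;
    std::shared_ptr<ReadBuffer> in = std::make_shared<ReadBuffer>();

    void disconnect() { connected = false; in.reset(); }

    // Guard before touching `in`; without it, reading the packet type would
    // dereference a null pointer (the 0x8 fault in the BaseDaemon trace).
    unsigned char receivePacketType()
    {
        if (!connected || !in)
            throw std::runtime_error("Connection lost: cannot receive packet");
        return in->first_byte;
    }
};

struct Executor
{
    bool sent_query = true;           // matches $15 in the gdb output
    bool cancelled = false;           // matches $16
    int active_connection_count = 1;  // matches $17
};
```

The race is precisely that `disconnect()` and the executor's decision to keep reading are not synchronized, which is why a check alone narrows the window but does not close it.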
batkovic75 pushed a commit to batkovic75/ClickHouse that referenced this pull request Jan 30, 2026

Labels

  • pr-backports-created: Backport PRs are successfully created, it won't be processed by CI script anymore
  • pr-bugfix: Pull request with bugfix, not backported by default
  • pr-must-backport-synced: The `*-must-backport` labels are synced into the cloud Sync PR
  • pr-synced-to-cloud: The PR is synced to the cloud repo
  • v25.8-must-backport
  • v25.10-must-backport

Development

Successfully merging this pull request may close these issues.

[CI crash] Connection::receivePacket
