Avoid possible crash for distributed queries in case of cancellation#95466
Avoid possible crash for distributed queries in case of cancellation#95466azat merged 6 commits intoClickHouse:masterfrom
Conversation
The race condition causes a crash when reading from a disconnected
connection in distributed queries.
The `in` buffer:
```
(gdb) frame 1
(gdb) print this->in
$19 = {__ptr_ = 0x0, __cntrl_ = 0x0}
(gdb) print this->connected
$18 = false
```
But the same time:
```
(gdb) frame 2
async_callback=...) at ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398
(gdb) print this->sent_query
$15 = true
(gdb) print this->cancelled
$16 = false
(gdb) print this->active_connection_count
$17 = 1
```
Current solution will not fix the issue, it just apply a check like in
`sendCancel()`.
Related PRs:
- ClickHouse#89740
- ClickHouse#92807
- ClickHouse#93029
src/Client/Connection.cpp
Outdated
| /// If we already disconnected. | ||
| if (!in) | ||
| { | ||
| throw Exception(ErrorCodes::NETWORK_ERROR, "Connection to {} is terminated", getDescription()); |
There was a problem hiding this comment.
Ok, I guess we can add this sanity check, but it should be LOGICAL_ERROR, since it is a bug in the code
But I still want to understand the what is this bug in the code.
Also are you sure that it can still happen after other fixes (do you have #93029 in your version)? And if so, do you have logs for this query?
There was a problem hiding this comment.
Also are you sure that it can still happen after other fixes (do you have #93029 in your version)? And if so, do you have logs for this query?
Yes, on 25.10.4.104
There was a problem hiding this comment.
but it should be LOGICAL_ERROR
done
|
It looks like it's caused by failed RemoteQueryExecutor::cancel() call followed by RemoteQueryExecutor::finish():
Checking query logs for the queries + hostnames from the failures gives the following, and after these are +- last records of the query log (coredump contains segfault on connection to <NODE_NAME>): |
|
Egh, again it is |
|
I've pushed one more patch into your PR to prevent this |
|
Workflow [PR], commit [33d49e1] Summary: ❌
|
a97cc84
Cherry pick #95466 to 25.3: Avoid possible crash for distributed queries in case of cancellation
… in case of cancellation
Cherry pick #95466 to 25.8: Avoid possible crash for distributed queries in case of cancellation
… in case of cancellation
Cherry pick #95466 to 25.10: Avoid possible crash for distributed queries in case of cancellation
…s in case of cancellation
Cherry pick #95466 to 25.11: Avoid possible crash for distributed queries in case of cancellation
…s in case of cancellation
Cherry pick #95466 to 25.12: Avoid possible crash for distributed queries in case of cancellation
…s in case of cancellation
Cherry pick #95466 to 26.1: Avoid possible crash for distributed queries in case of cancellation
… in case of cancellation
Backport #95466 to 25.12: Avoid possible crash for distributed queries in case of cancellation
Backport #95466 to 26.1: Avoid possible crash for distributed queries in case of cancellation
Backport #95466 to 25.11: Avoid possible crash for distributed queries in case of cancellation
Backport #95466 to 25.10: Avoid possible crash for distributed queries in case of cancellation
Backport #95466 to 25.8: Avoid possible crash for distributed queries in case of cancellation
Linked issues
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Avoid possible crash for distributed queries in case of cancellation
Details
The
inbuffer:But the same time:
Ref: #89740
Ref: #92807
Ref: #93029
CC @kssenii @azat