
Avoid race condition for updating system.distribution_queue values#43406

Merged
tavplubix merged 1 commit into ClickHouse:master from azat:dist/async-insert-stat
Dec 5, 2022
Conversation

@azat
Member

@azat azat commented Nov 20, 2022

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Avoid race condition for updating system.distribution_queue values

Previously it was possible to have a race while updating files_count/bytes_count: INSERT updates those counters from one thread, while the same metrics are updated from the filesystem in a separate thread. Even though access is synchronized with a mutex, this avoids only the data race on the variables, not the logical race: it is possible that getFiles() from a separate thread will increment the counters, and addAndSchedule() will later increment them again.

Here you can find an example of this race: https://pastila.nl/?00950e00/41a3c7bbb0a7e75bd3f2922c58b02334

Note that I analyzed logs from a production system with lots of async Distributed INSERTs, and everything is OK there, even though the logs contain the following:

2022.11.20 02:21:15.459483 [ 11528 ] {} <Trace> v21.dist_out.DirectoryMonitor: Files set to 35 (was 34)
2022.11.20 02:21:15.459515 [ 11528 ] {} <Trace> v21.dist_out.DirectoryMonitor: Bytes set to 4035418 (was 3929008)
2022.11.20 02:21:15.819488 [ 11528 ] {} <Trace> v21.dist_out.DirectoryMonitor: Files set to 1 (was 2)
2022.11.20 02:21:15.819502 [ 11528 ] {} <Trace> v21.dist_out.DirectoryMonitor: Bytes set to 190072 (was 296482)

As you can see, the first update increases the counters and the next one decreases them (and 4035418-3929008 == 296482-190072).

Refs: #23885
Reported-by: @tavplubix
Signed-off-by: Azat Khuzhin <[email protected]>
@robot-ch-test-poll robot-ch-test-poll added the pr-not-for-changelog This PR should not be mentioned in the changelog label Nov 20, 2022
@tavplubix tavplubix self-assigned this Nov 21, 2022
@tavplubix
Member

Integration tests (asan) [6/6] - test_distributed_queries_stress is known to be flaky (#41776), but it's suspicious

AST fuzzer (tsan) - #43199 and #41850
Stateless tests (tsan) [2/5] - #43408
Stress test (asan) - OOM in dmesg
Stress test (msan) - #39174
Stress test (ubsan) - #39174

@azat
Member Author

azat commented Dec 3, 2022

Integration tests (asan) [6/6] - test_distributed_queries_stress is known to be flaky (#41776), but it's suspicious

Retries failed, but the first time it failed because the process got killed and the container had been destroyed, with no info in the logs about this (maybe OOM, but why?):

2022-11-20 15:06:43 [ 317 ] DEBUG : run container_id:roottestdistributedqueriesstress_node1_r1_1 detach:False nothrow:False cmd: ['bash', '-c', "echo 'select * from dist_two where key = 0;\n    select * from dist_two where key = 1;\n    select * from dist_two where key = 2;\n    select * from dist_two where key = 3;\n    select * from dist_two;' | clickhouse benchmark --concurrency=100 --cumulative --delay=0 --timelimit=3 --hedged_connection_timeout_ms=200 --connect_timeout_with_failover_ms=200 --connections_with_failover_max_tries=5 --async_socket_for_remote=0 --distributed_group_by_no_merge=2"] (cluster.py:1745, exec_in_container)
2022-11-20 15:06:43 [ 317 ] DEBUG : Command:['docker', 'exec', 'roottestdistributedqueriesstress_node1_r1_1', 'bash', '-c', "echo 'select * from dist_two where key = 0;\n    select * from dist_two where key = 1;\n    select * from dist_two where key = 2;\n    select * from dist_two where key = 3;\n    select * from dist_two;' | clickhouse benchmark --concurrency=100 --cumulative --delay=0 --timelimit=3 --hedged_connection_timeout_ms=200 --connect_timeout_with_failover_ms=200 --connections_with_failover_max_tries=5 --async_socket_for_remote=0 --distributed_group_by_no_merge=2"] (cluster.py:95, run_and_check)
2022-11-20 15:08:48 [ 317 ] DEBUG : Stderr:Loaded 5 queries. (cluster.py:105, run_and_check)
2022-11-20 15:08:48 [ 317 ] DEBUG : Exitcode:137 (cluster.py:107, run_and_check)

Note that 137 is SIGKILL (128 + 9).

parallel1_0_dockerd.log:time="2022-11-20T15:08:48.244758252Z" level=debug msg="Revoking external connectivity on endpoint roottestdistributedqueriesstress_node1_r1_1 (82dfd051d379869bf885f90745cd4b097c70cd04bd3b4f86e49096358112fc51)"
parallel1_0_dockerd.log:time="2022-11-20T15:08:48.445809392Z" level=debug msg="82dfd051d379869bf885f90745cd4b097c70cd04bd3b4f86e49096358112fc51 (1c6863e).deleteSvcRecords(roottestdistributedqueriesstress_node1_r1_1, 172.16.8.2, <nil>, true) updateSvcRecord sid:82dfd051d3798
parallel1_0_dockerd.log:time="2022-11-20T15:08:48.526045522Z" level=debug msg="Releasing addresses for endpoint roottestdistributedqueriesstress_node1_r1_1's interface on network roottestdistributedqueriesstress_default"

And then all docker exec calls fail.

@tavplubix tavplubix merged commit 52d3e54 into ClickHouse:master Dec 5, 2022
@azat azat deleted the dist/async-insert-stat branch December 6, 2022 13:05
@azat
Member Author

This patch introduces a bug: now these counters are not initialized on server start (not a huge deal, but still). I've rewritten the code completely to avoid readdir before each send (you may think of this readdir here as a synchronization method between INSERT and the thread that sends data to remote nodes) - #44922

azat added a commit to azat/ClickHouse that referenced this pull request Jan 5, 2023
In ClickHouse#43406 the metrics were broken for a clean start, since they were
not initialized from disk; metrics for broken files, however, had never been
initialized from disk.

Fix this and rework how DirectoryMonitor works with file system:
- do not iterate over directory before each send, do this only once on
  init, after the map of files will be updated from the INSERT
- call fs::create_directories() from the ctor for "broken" folder to
  avoid excessive calls
- cache "broken" paths

This patch also fixes a possible issue where current_batch could be processed
multiple times (the second time raising an exception), since an existing
current_batch.txt should be removed immediately after processing.

It also implicitly fixes logging that reported an incorrect number of files
in case of error (see ClickHouse#44907 for details).

Signed-off-by: Azat Khuzhin <[email protected]>
