
Add metrics for broken files during async distributed sends#23885

Merged
alexey-milovidov merged 2 commits into ClickHouse:master from azat:dist-broken-metrics
May 5, 2021

Conversation


@azat azat commented May 4, 2021

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add broken_data_files/broken_data_compressed_bytes columns to system.distribution_queue. Add a metric for the number of files for asynchronous insertion into Distributed tables that have been marked as broken (BrokenDistributedFilesToInsert).
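After this change, the new counters can be inspected with ordinary queries (a sketch: the column names come from this PR, and the aggregate metric is assumed to be exposed through the standard system.metrics table):

```sql
-- Per-shard broken-file counters for each Distributed table's queue:
SELECT database, table, broken_data_files, broken_data_compressed_bytes
FROM system.distribution_queue;

-- Server-wide total across all Distributed tables:
SELECT value
FROM system.metrics
WHERE metric = 'BrokenDistributedFilesToInsert';
```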

azat added 2 commits May 4, 2021 22:48
Number of files for asynchronous insertion into Distributed tables that have been marked as broken. This metric starts from 0 on server start. The number of files is summed across all shards.
@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label May 4, 2021
@alexey-milovidov alexey-milovidov self-assigned this May 5, 2021
@alexey-milovidov alexey-milovidov merged commit bf8c28a into ClickHouse:master May 5, 2021
@azat azat deleted the dist-broken-metrics branch May 5, 2021 18:23
azat added a commit to azat/ClickHouse that referenced this pull request Nov 20, 2022
Previously it was possible to have a race while updating
files_count/bytes_count, since INSERT updates those counters from one
thread while the same counters are updated from the filesystem in a
separate thread. Even though access is synchronized with a mutex, the
mutex avoids only the data race on the variables, not the logical race:
getFiles() from the separate thread can increment the counters, and
addAndSchedule() can later increment them again for the same files.

Here you can find an example of this race [1].

  [1]: https://pastila.nl/?00950e00/41a3c7bbb0a7e75bd3f2922c58b02334
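The logical race described above can be modeled with a small deterministic sketch (toy code, not ClickHouse's actual implementation; the class and method names here are hypothetical). Even with every counter update behind a mutex, the rescan-then-increment interleaving counts the same file twice:

```python
import threading

class Counters:
    """Toy model of the per-directory counters (hypothetical names)."""
    def __init__(self):
        self.lock = threading.Lock()
        self.files_count = 0

    def set_from_filesystem(self, files_on_disk):
        # Monitor-thread style update: recount what is on disk.
        with self.lock:
            self.files_count = len(files_on_disk)

    def increment(self):
        # INSERT-thread style update: bump for a newly written file.
        with self.lock:
            self.files_count += 1

disk = []
c = Counters()

disk.append("1.bin")        # 1. INSERT writes the file to disk first...
c.set_from_filesystem(disk) # 2. ...monitor rescans and already sees it...
c.increment()               # 3. ...INSERT then increments: double count.

print(c.files_count)  # 2, although only one file exists on disk
```

Each individual update is mutex-protected and therefore free of data races, yet the counter still ends up wrong: the bug is in the ordering of logical operations, which a mutex alone cannot fix.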

Note that I analyzed logs from a production system with lots of async
Distributed INSERTs and everything is OK there, even though the logs
contain the following:

    2022.11.20 02:21:15.459483 [ 11528 ] {} <Trace> v21.dist_out.DirectoryMonitor: Files set to 35 (was 34)
    2022.11.20 02:21:15.459515 [ 11528 ] {} <Trace> v21.dist_out.DirectoryMonitor: Bytes set to 4035418 (was 3929008)
    2022.11.20 02:21:15.819488 [ 11528 ] {} <Trace> v21.dist_out.DirectoryMonitor: Files set to 1 (was 2)
    2022.11.20 02:21:15.819502 [ 11528 ] {} <Trace> v21.dist_out.DirectoryMonitor: Bytes set to 190072 (was 296482)

As you can see, the first update increases the counters and the next
update decreases them back (and 4035418-3929008 == 296482-190072)
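The arithmetic behind that observation can be checked directly: the amount by which the byte counter was inflated equals the amount by which the later filesystem rescan corrected it.

```python
# Values taken from the DirectoryMonitor log excerpt above.
increase = 4035418 - 3929008  # "Bytes set to 4035418 (was 3929008)"
decrease = 296482 - 190072    # "Bytes set to 190072 (was 296482)"

print(increase, decrease)  # both are 106410
assert increase == decrease
```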

Refs: ClickHouse#23885
Reported-by: @tavplubix
Signed-off-by: Azat Khuzhin <[email protected]>
