Adjustable asynchronous insert timeouts#58486

Merged
jkartseva merged 9 commits into ClickHouse:master from jkartseva:adjustable-async-timeouts
Feb 2, 2024

Conversation

@jkartseva
Member

@jkartseva jkartseva commented Jan 4, 2024

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Implement auto-adjustment for asynchronous insert timeouts. The following settings are introduced: async_insert_poll_timeout_ms, async_insert_use_adaptive_busy_timeout, async_insert_busy_timeout_min_ms, async_insert_busy_timeout_max_ms, async_insert_busy_timeout_increase_rate, async_insert_busy_timeout_decrease_rate.

async_insert_busy_timeout_ms is aliased to async_insert_busy_timeout_max_ms.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@jkartseva jkartseva added the can be tested Allows running workflows for external contributors label Jan 4, 2024
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-improvement Pull request with some product improvements label Jan 4, 2024
@robot-ch-test-poll3
Contributor

robot-ch-test-poll3 commented Jan 4, 2024

This is an automated comment for commit 505b9ba with a description of existing statuses. It's updated for the latest CI run.

Successful checks

| Check name | Description | Status |
| --- | --- | --- |
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | ✅ success |
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with an instant-attach table | ✅ success |
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log by grepping for cmake. Use these options and follow the general build process | ✅ success |
| Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success |
| Docker server image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails even once, or runs too long, this check will be red. We don't allow flaky tests; read the doc | ✅ success |
| Install packages | Checks that the built packages are installable in a clean environment | ✅ success |
| Mergeable Check | Checks whether all other necessary checks are successful | ✅ success |
| Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ✅ success |
| SQLancer | Fuzzing tests that detect logical bugs with the SQLancer tool | ✅ success |
| Sqllogic | Runs clickhouse on the sqllogic test set against sqlite and checks that all statements pass | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc. | ✅ success |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc. | ✅ success |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |
| Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks whether the new server can start up successfully without errors, crashes, or sanitizer asserts | ✅ success |

| Check name | Description | Status |
| --- | --- | --- |
| CI running | A meta-check that indicates the running CI. Normally it's in success or pending state. A failed status indicates some problems with the PR | ⏳ pending |
| Integration tests | The integration tests report. In parentheses the package type is given, and in square brackets are the optional part/total tests | ❌ failure |

@jkartseva jkartseva marked this pull request as draft January 4, 2024 06:42
@antonio2368 antonio2368 self-assigned this Jan 4, 2024
Member

Why can't you set min to 0? Auto-adjustment should be capable of understanding that it's a bad idea to actually drive the timeout to 0. But even in that case it's fine to just start writing synchronously for these requests.

Member Author

@jkartseva jkartseva Jan 8, 2024

This can be done. I would still like to preserve milliseconds as the smallest time unit, without dispersing into microseconds.

Member

size_in_bytes?

Member Author

size_in_bytes corresponds to the entries with the same key.hash; they have already been inserted and accounted for by the metric. Here, we should only increase the metric by the size of the current entry (entry_data_size).
However, we should subtract the size of the previously inserted entries from the metric if the entries are to be processed synchronously.

Member

I'm not sure how it will work with numbers like min=1, max=5. Will it get stuck at 1 and 2 indefinitely?
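
The worry can be made concrete with a small sketch (integer milliseconds and a hypothetical increase rate of 1.2; this illustrates the rounding concern, not the PR's actual arithmetic):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the reviewer's concern: with integer milliseconds and a small
// range like [1, 5], a multiplicative increase can truncate back down.
// The rate value below is a hypothetical example.
uint64_t increaseTruncating(uint64_t timeout_ms, double rate)
{
    // Truncating the product back to integer milliseconds: 1 * 1.2 -> 1,
    // so small timeouts never grow unless rounding is handled carefully.
    return static_cast<uint64_t>(timeout_ms * (1.0 + rate));
}
```

With these numbers, a timeout of 1 ms maps to 1 ms and 2 ms maps to 2 ms, exactly the "stuck at 1 and 2" scenario; rounding up (or working in a finer unit internally) avoids it.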

Member

I don't get how it works. We update the timeout before we check the bytes/queries limits.
Will we update the busy timeout in this case as well? As I understand the algorithm, we don't need to batch more if there are enough bytes per batch.

Member Author

We do not mark the whole queue shard for immediate processing, but rather only the entries with the specific key.hash that have reached either the bytes or the number-of-queries limit:

```cpp
auto [it, inserted] = shard.iterators.try_emplace(key.hash);
if (inserted)
{
    auto now = std::chrono::steady_clock::now();
    auto timeout = now + Milliseconds{key.settings.async_insert_busy_timeout_ms};
    it->second = shard.queue.emplace(timeout, Container{key, std::make_unique<InsertData>()}).first;
}

auto queue_it = it->second;
auto & data = queue_it->second.data;
size_t entry_data_size = entry->chunk.byteSize();

assert(data);
data->size_in_bytes += entry_data_size;
data->entries.emplace_back(entry);
insert_future = entry->getFuture();

LOG_TRACE(log, "Have {} pending inserts with total {} bytes of data for query '{}'",
    data->entries.size(), data->size_in_bytes, key.query_str);

bool has_enough_bytes = data->size_in_bytes >= key.settings.async_insert_max_data_size;
bool has_enough_queries = data->entries.size() >= key.settings.async_insert_max_query_number && key.settings.async_insert_deduplicate;

/// Here we check whether we hit the limit on the maximum data size in the buffer.
/// And use the setting from the query context.
/// It works because queries with the same set of settings are already grouped together.
if (!flush_stopped && (has_enough_bytes || has_enough_queries))
{
    data_to_process = std::move(data);
```

Other entries that have fallen into the same queue shard will remain in the async queue.
Hence, I believe, we should adjust the timeout.

@jkartseva jkartseva force-pushed the adjustable-async-timeouts branch 6 times, most recently from 78fb04f to fb51320 Compare January 13, 2024 03:59
@jkartseva jkartseva force-pushed the adjustable-async-timeouts branch 12 times, most recently from 64a8343 to 4a7c239 Compare January 22, 2024 01:24
@jkartseva jkartseva marked this pull request as ready for review January 22, 2024 07:42
Member

We want this to be the default for async inserts, no?
Also, if there is a reason not to, there is a compatibility setting which can preserve the behaviour of older versions (check SettingsChangesHistory).

Member Author

I wanted this setting to be deployed and adopted gradually, so I added enable_async_insert_adaptive_busy_timeout with a kill switch in mind. We might meet a corner case not accounted for during testing.
Would allow_experimental... be a better name for the purpose? The setting can be marked as deprecated later on.

Member

Hm, I see this more as an implementation detail of async inserts, and a strictly better implementation.
From the user's perspective, they should only worry about whether they want to use async inserts or not. I'm okay with having this flag as a safety net if the new implementation misbehaves, but I can't see a reason why someone would want the old behavior.

Member Author

Ready for review. Test failures are not relevant; e.g., https://s3.amazonaws.com/clickhouse-test-reports/0/0cbadc9b52f5e728a397b3081c6887c0f4213856/stress_test__debug_.html is also failing on master.
I'll take a look at what might be wrong there.

Currently, the new feature is disabled by default; we may choose to enable it. I want to preserve the allow_experimental_<..> kill switch, though.

Member

more descriptive function name
also, move implementation of both functions to cpp?

Member

why don't we start from min timeout if adaptive is enabled?

Member Author

I've been considering a smooth transition from the state 'enable_async_insert_adaptive_busy_timeout=0' to '1'. async_insert_busy_timeout_ms is the timeout anticipated by a customer, and the adaptive algorithm treats it as the starting point. It minimizes the value if inserts are infrequent.

Member

is a timeout anticipated by a customer

If they touch the timeout setting for some reason; otherwise they don't anticipate anything as long as it works.

The problem here is that we would like to solve the timeout of the initial insert: if the user sets async_insert_busy_timeout_ms to some high value, it will take a long time even for a simple insert of one row.
IMO we should start from the minimal flush timeout and gradually grow from there if the insert rate increases.

Member

Shouldn't this be the same as async_insert_busy_timeout_ms?
In the current implementation, async_insert_busy_timeout_ms is the total time limit which forces a flush, i.e., the maximum time we can keep entries before flushing.
Also, what if a user changed async_insert_busy_timeout_ms to 2s? They would get errors about async_insert_busy_timeout_max_ms not being okay, and then they'd need to figure out what it is and what value to set it to.

Why not make async_insert_busy_timeout_ms an alias of async_insert_busy_timeout_max_ms to keep backward compatibility?

Member Author

This sounds good to me. Is there a way to alias two settings (async_insert_busy_timeout_ms and async_insert_busy_timeout_max_ms) so that they are tied to the same value?

Member Author

TIL ALIAS expression, will do.

@jkartseva jkartseva force-pushed the adjustable-async-timeouts branch from 4a7c239 to ff3408d Compare January 25, 2024 23:31
@jkartseva
Member Author

jkartseva commented Jan 26, 2024

The newly added metrics

(screenshot: graphs of the newly added metrics, captured 2024-01-25)

```sql
SELECT
  toStartOfInterval(event_time, INTERVAL {rounding:UInt32} SECOND)::INT AS t,
  avg(metric)
FROM (
  SELECT event_time, sum(CurrentMetric_AsynchronousInsertQueueBytes) AS metric
  FROM clusterAllReplicas(default, system.metric_log)
  WHERE event_date >= toDate(now() - {seconds:UInt32})
    AND event_time >= now() - {seconds:UInt32}
  GROUP BY event_time)
GROUP BY t
ORDER BY t WITH FILL STEP {rounding:UInt32} SETTINGS skip_unavailable_shards = 1
```

```sql
SELECT
  toStartOfInterval(event_time, INTERVAL {rounding:UInt32} SECOND)::INT AS t,
  avg(value)
FROM (
  SELECT event_time, avg(timeout_milliseconds) AS value
  FROM clusterAllReplicas(default, system.asynchronous_insert_log)
  WHERE event_date >= toDate(now() - {seconds:UInt32})
    AND event_time >= now() - {seconds:UInt32}
  GROUP BY event_time)
GROUP BY t
ORDER BY t WITH FILL STEP {rounding:UInt32} SETTINGS skip_unavailable_shards = 1
```

@jkartseva jkartseva force-pushed the adjustable-async-timeouts branch from 0a64193 to 7b37a27 Compare January 30, 2024 21:06
@clickhouse-ci

clickhouse-ci bot commented Jan 30, 2024

This is an automatic comment. The PR description does not match the template.

Please edit it accordingly.

The error is: Changelog entry required for category 'Improvement'

@jkartseva jkartseva force-pushed the adjustable-async-timeouts branch from 7b37a27 to da12bf7 Compare January 31, 2024 05:02
@antonio2368
Member

We can maybe add async_insert_use_adaptive_busy_timeout to random settings in clickhouse-test?

@jkartseva
Member Author

I'm confused about "New settings are not reflected in settings changes history (see new_settings.txt)", though I have updated settings.md

https://s3.amazonaws.com/clickhouse-test-reports/58486/da12bf77cfeceef9476c29c447ef82acacf6cb8d/upgrade_check__debug_.html

Do you know by any chance where this new_settings.txt is located? @antonio2368

@antonio2368
Member

@jkartseva that file is generated by a query during upgrade test run to tell you which new settings were not added to src/Core/SettingsChangesHistory.h
Go to the failed test and open new_settings.txt
https://s3.amazonaws.com/clickhouse-test-reports/58486/da12bf77cfeceef9476c29c447ef82acacf6cb8d/upgrade_check__debug_.html

@jkartseva
Member Author

We can maybe add async_insert_use_adaptive_busy_timeout to random settings in clickhouse-test?

I'll publish a follow-up PR.

@jkartseva jkartseva force-pushed the adjustable-async-timeouts branch from d51cfe5 to f776ec7 Compare February 1, 2024 05:06
@jkartseva
Member Author

Set version to 24.2 in settings history. It should be good to merge.

The current timeout for checking updates in the asynchronous queue is
equal to the timeout used for queue entry
(async_insert_busy_timeout_ms).
That means that, in the worst case, an entry spends twice the time of the
asynchronous timeout in the queue.
@jkartseva jkartseva force-pushed the adjustable-async-timeouts branch from f776ec7 to 8ac5ccd Compare February 1, 2024 08:28
Member

What will we do then for backward compatibility if someone set async_insert_busy_timeout_ms to a value less than the default async_insert_busy_timeout_min_ms?

Member Author

Yes, it slipped my mind to update it, thanks. In that case, the minimum value of the two is taken as the async timeout.
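
A minimal sketch of that backward-compatibility clamp (the function name is hypothetical; the actual code lives in the async insert queue):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// If a user sets async_insert_busy_timeout_ms below the default
// async_insert_busy_timeout_min_ms, the smaller of the two becomes
// the effective minimum, so pre-existing configurations keep flushing
// as quickly as they did before the adaptive algorithm was introduced.
uint64_t effectiveMinTimeoutMs(uint64_t user_busy_timeout_ms, uint64_t busy_timeout_min_ms)
{
    return std::min(user_busy_timeout_ms, busy_timeout_min_ms);
}
```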

Implement the algorithm described in ClickHouse#56783 for adaptive asynchronous
insert timeouts.
- The adaptive async insert timeout can take values within
[async_insert_busy_timeout_min_ms, async_insert_busy_timeout_max_ms].
- The initial value is set to async_insert_busy_timeout_min_ms.
- If the elapsed time since the most recent queue insert was
  greater than the maximum timeout, process the queue content immediately,
  and reduce the timeout.
- If the elapsed time was long enough (longer than a would-be decreased
  timeout), decrease the timeout.
- The adaptive timeout changes exponentially based on
  async_insert_busy_timeout_{increase|decrease}_rate.
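
The bullet points above can be sketched roughly as follows (a simplified illustration of the exponential adjustment clamped to [min, max]; the struct and the exact update rule are assumptions, not the ClickHouse implementation):

```cpp
#include <algorithm>
#include <cassert>

// Simplified sketch of an adaptive busy timeout bounded by
// [async_insert_busy_timeout_min_ms, async_insert_busy_timeout_max_ms].
struct AdaptiveBusyTimeout
{
    double min_ms;
    double max_ms;
    double increase_rate;  // cf. async_insert_busy_timeout_increase_rate
    double decrease_rate;  // cf. async_insert_busy_timeout_decrease_rate
    double current_ms;     // initialized to min_ms

    // Inserts arriving faster than the current timeout: grow the batching window.
    void increase() { current_ms = std::min(max_ms, current_ms * (1.0 + increase_rate)); }

    // A long enough gap since the last insert: shrink the batching window.
    void decrease() { current_ms = std::max(min_ms, current_ms / (1.0 + decrease_rate)); }
};
```

Repeated calls converge to the max under sustained load and back to the min when traffic subsides, which is the behavior the commit message describes.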

Fixes: ClickHouse#56783
Add AsynchronousInsertQueueSize and AsynchronousInsertQueueBytes
metrics to improve observability of asynchronous inserts.
The metrics do not account for tasks dispatched for immediate processing
(as opposed to, e.g., PendingAsyncInsert).

```
SELECT value
FROM system.metrics
WHERE metric IN ('AsynchronousInsertQueueSize', 'PendingAsyncInsert')

Query id: a711dd83-b48d-4ad5-8031-fa59b21a7c38

┌─value─┐
│    18 │
│    23 │
└───────┘
```

```
SELECT value
FROM system.metrics
WHERE metric IN ('AsynchronousInsertQueueSize', 'AsynchronousInsertQueueBytes')

Query id: b35a7ceb-2bb5-46ad-b301-e6cf03508699

┌─value─┐
│    28 │
│  1372 │
└───────┘
```
In addition to the time since the most recent insert,
consider the elapsed time between the two recent queue
flushes when decreasing the timeout or processing an
entry synchronously.
@jkartseva jkartseva force-pushed the adjustable-async-timeouts branch 3 times, most recently from 56f24fd to dda191d Compare February 1, 2024 23:58
@jkartseva jkartseva force-pushed the adjustable-async-timeouts branch from dda191d to 505b9ba Compare February 2, 2024 07:17
@jkartseva
Member Author

The adaptive timeout ramping up from the minimum could contradict the async_insert_deduplicate setting, whose intention is to eliminate duplicates in a batch. At the very least, synchronous queue flush should be disabled for the setting. I'll file a follow-up.

@antonio2368
Member

The adaptive timeout ramping up from the minimum could contradict the async_insert_deduplicate setting, whose intention is to eliminate duplicates in a batch. At the very least, synchronous queue flush should be disabled for the setting. I'll file a follow-up.

No, deduplication eliminates duplicates in general (or at least within the replicated_deduplication_window_for_async_inserts window).
Removing duplicates from a batch is just an optimization so we don't have to do it after verifying with Keeper.

@jkartseva jkartseva merged commit f41f4e4 into ClickHouse:master Feb 2, 2024
```python
fixed_tm_table_name,
fixed_tm_settings,
thread_num=15,
tasks=1000,
```
Member

@jkartseva can you help me understand why you check 1000 iterations with the fixed timeout, but compare that with 100 iterations of the adaptive timeout?
