
Add number of errors to ignore while choosing replicas#11669

Merged
akuzm merged 10 commits into ClickHouse:master from azat:distributed_replica_error_ignore
Jun 22, 2020

Conversation

@azat
Member

@azat azat commented Jun 14, 2020

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add number of errors to ignore while choosing replicas (distributed_replica_error_ignore)

Detailed description / Documentation draft:
This allows avoiding a switch to another replica in case of an error
(since the error can be temporary).

Refs: #10564

Details

Previous HEAD's:

  • 09f8f7227297989aaed45d8fc943eafa01bab779
  • c851407ffe0624f54567c71666a74874a8de8acf

@azat azat marked this pull request as draft June 14, 2020 22:18
@blinkov blinkov added the pr-improvement Pull request with some product improvements label Jun 14, 2020
@azat azat force-pushed the distributed_replica_error_ignore branch from 09f8f72 to c851407 Compare June 17, 2020 18:02
@azat azat marked this pull request as ready for review June 17, 2020 18:02
@akuzm
Contributor

akuzm commented Jun 19, 2020

FWIW, right now there is distributed_replica_error_cap, which is the upper limit on the error count, so how about renaming these settings (not sure what the policy for settings renames is):

  • distributed_replica_error_ignore -> distributed_replica_error_limit_low
  • distributed_replica_error_cap -> distributed_replica_error_limit_high

Renaming settings is almost never good -- we strive to be backwards compatible where possible.

Speaking about the names themselves, "cap" is a saturation limit for the number of errors we track for a replica. It's limited so that the error number decays in a sane time, and the replica is not considered broken forever. I think it's not complementary to your new setting, and they don't form any meaningful range, so giving them complementary names would be misleading. I'd name your new setting "distributed_replica_max_ignored_errors". Although it's kind of hard to tell from these names which limit is which...
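The distinction akuzm draws between the two limits can be sketched as follows. This is a hypothetical Python illustration; the names ERROR_CAP, MAX_IGNORED_ERRORS, and all the functions are mine, not the actual ClickHouse implementation:

```python
# "cap" saturates the tracked error count so it can decay back to zero
# in bounded time; "max ignored errors" is subtracted before comparing
# replicas, so a few transient errors do not demote a replica.

ERROR_CAP = 1000          # saturation limit (cf. distributed_replica_error_cap)
MAX_IGNORED_ERRORS = 2    # errors to forgive (cf. distributed_replica_max_ignored_errors)

def record_error(count: int) -> int:
    """Increment a replica's error count, saturating at the cap."""
    return min(count + 1, ERROR_CAP)

def effective_errors(count: int) -> int:
    """Error count used when choosing a replica: small counts are ignored."""
    return max(0, count - MAX_IGNORED_ERRORS)

def pick_replica(error_counts: dict) -> str:
    """Prefer the replica with the fewest non-ignored errors."""
    return min(error_counts, key=lambda r: effective_errors(error_counts[r]))
```

With MAX_IGNORED_ERRORS = 2, replicas with one or two recorded errors compare equal to a clean replica, which is the point of the new setting: the cap bounds the counter from above, while the ignore threshold forgives small counts from below, and the two do not form a range.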

@akuzm akuzm self-assigned this Jun 19, 2020
@azat
Member Author

azat commented Jun 19, 2020

Renaming settings is almost never good -- we strive to be backwards compatible where possible.

Will remember.

Speaking about the names themselves, "cap" is a saturation limit for the number of errors we track for a replica. It's limited so that the error number decays in a sane time, and the replica is not considered broken forever. I think it's not complementary to your new setting, and they don't form any meaningful range, so giving them complementary names would be misleading.

Yeah, I had similar thoughts when I was writing this. OK, it was a bad idea anyway.

I'd name your new setting "distributed_replica_max_ignored_errors". Although it's kind of hard to tell from these names which limit is which...

Ok, will rename

@azat azat force-pushed the distributed_replica_error_ignore branch 2 times, most recently from e525fd1 to 7ac46d7 Compare June 20, 2020 08:12
@azat azat force-pushed the distributed_replica_error_ignore branch from 7ac46d7 to 9bfda65 Compare June 20, 2020 08:21
At startup, the server loads its configuration files.

However, ConfigReloader does not know about the already loaded files (its
files list is empty), so it will always reload the configuration just
after the server starts (plus the 2-second reload timeout).

On configuration reload the clusters will be re-created, so some
internal state will be reset:
- error_count
- last_used (round_robin)

If the reload happens during the round_robin test, it will start
querying from the beginning, so let's issue a config reload just after
startup to avoid a reload in the middle of test execution.
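The workaround described in the commit message above amounts to a single statement issued right after the node starts (a sketch; the actual test drives this through the integration-test framework):

```sql
-- Issue an explicit reload right after the node starts, so that the
-- periodic ConfigReloader does not fire mid-test and reset the
-- clusters' error_count / last_used state.
SYSTEM RELOAD CONFIG;
```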
@azat
Member Author

azat commented Jun 21, 2020

ClickHouse build check — 11/17 builds are OK

Some CI issues
The only difference between b8ee2ea (everything is green) and bd45592 (this build fails) is the integration test anyway

@alexey-milovidov
Member

I'm not sure how admin is supposed to choose the value of this setting.

@nvartolomei
Contributor

nvartolomei commented Jun 21, 2020

I think a better way to balance things would be to track the rate of errors (instead of an absolute value) with a sliding window and allow a small variability in the rate of errors between replicas.

The ignore value added here seems to help only in the beginning, before the error counter reaches that value; after that the behaviour will be exactly the same as it currently is, until the counters come back to values less than this minimum constant.

2c

@alexey-milovidov
Member

@nvartolomei The calculated number of errors is already a sliding "window" with exponential smoothing.
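A minimal sketch of the exponential smoothing referred to here, assuming a simple half-life decay (the real accounting in ClickHouse differs in detail; the function name and default of 60 seconds, matching distributed_replica_error_half_life at the time of this PR, are illustrative):

```python
def decayed_error_count(count: float, seconds_passed: float,
                        half_life: float = 60.0) -> float:
    """Error count after exponential decay: the tracked count halves
    every `half_life` seconds, so old errors fade instead of
    accumulating forever (cf. distributed_replica_error_half_life)."""
    return count * 0.5 ** (seconds_passed / half_life)
```

This is why the absolute counter already behaves like a sliding window: an error recorded long ago contributes almost nothing to the current comparison between replicas.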

@azat
Member Author

azat commented Jun 21, 2020

I'm not sure how admin is supposed to choose the value of this setting.

via config.xml?

@alexey-milovidov
Member

What value will be good: 1 or 10 or 100?

@azat
Member Author

azat commented Jun 22, 2020

It depends, and it should be adjusted only after looking at system.clusters (which has the error_count rate) plus the server logs (for issues around distributed queries).

I would say that 100 definitely is not good, but 1-10 may be a good idea.

The rate of queries and the time for a replica's error_count to recover (distributed_replica_error_half_life) should also be taken into account.

(at first glance this can be enough)
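As a concrete illustration of the guidance above, the setting can be placed in a settings profile (the value 5 is just an example to be tuned against system.clusters and the server logs; distributed_replica_error_half_life is shown for context with its default of 60 seconds):

```xml
<profiles>
    <default>
        <!-- Forgive up to 5 accumulated errors per replica before
             load balancing starts to avoid it (example value). -->
        <distributed_replica_max_ignored_errors>5</distributed_replica_max_ignored_errors>
        <!-- Tracked errors halve every 60 seconds. -->
        <distributed_replica_error_half_life>60</distributed_replica_error_half_life>
    </default>
</profiles>
```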

@akuzm
Contributor

akuzm commented Jun 22, 2020

P.S. I try to keep PRs free of cleanups/refactoring to reduce the number of changes (lots of changes are hard to review), but it looks like this is not a problem for you, so I will consider deviating from this rule in some cases

That's generally a good idea if there is a non-trivial amount of refactoring. One way to do it is maintain logically separate changes in separate commits -- it works well with the github interface. For very big PRs, we can even merge some commits earlier, to have less conflicts. It's not very convenient for the developer though -- lots of fiddling with interactive rebase.

@azat
Member Author

azat commented Jun 22, 2020

That's generally a good idea if there is a non-trivial amount of refactoring.

And sometimes even trivial may be a problem (especially for backporting)

One way to do it is maintain logically separate changes in separate commits -- it works well with the github interface.

Well, keeping separate changes in separate commits is a rule of thumb, but this also does not solve issues with conflicts and so on.

lots of fiddling with interactive rebase.

ccache helps at least here :)

@akuzm akuzm merged commit e76941b into ClickHouse:master Jun 22, 2020
@azat azat deleted the distributed_replica_error_ignore branch June 23, 2020 16:39
azat added a commit to azat/ClickHouse that referenced this pull request Jun 29, 2020
Fixed in: ClickHouse#11669 bd45592 ("Fix
test_distributed_load_balancing flaps (due to config reload)")
