Fixed reschedule issue in Kafka#11149
Conversation
There was a problem hiding this comment.
Before
┌───────────────────────x─┬─count()─┐
│ 2020-05-22 19:01:04.089 │ 1000 │
│ 2020-05-22 19:01:04.656 │ 1000 │
│ 2020-05-22 19:01:05.225 │ 1000 │
│ 2020-05-22 19:01:05.788 │ 1000 │
│ 2020-05-22 19:01:06.356 │ 1000 │
│ 2020-05-22 19:01:06.932 │ 1000 │
│ 2020-05-22 19:01:07.517 │ 1000 │
│ 2020-05-22 19:01:08.095 │ 1000 │
│ 2020-05-22 19:01:08.659 │ 1000 │
│ 2020-05-22 19:01:09.230 │ 1000 │
│ 2020-05-22 19:01:09.801 │ 1000 │
│ 2020-05-22 19:01:10.374 │ 1000 │
│ 2020-05-22 19:01:10.936 │ 1000 │
│ 2020-05-22 19:01:11.519 │ 1000 │
│ 2020-05-22 19:01:12.092 │ 1000 │
│ 2020-05-22 19:01:12.663 │ 1000 │
│ 2020-05-22 19:01:13.236 │ 1000 │
│ 2020-05-22 19:01:13.808 │ 1000 │
│ 2020-05-22 19:01:14.383 │ 1000 │
│ 2020-05-22 19:01:14.956 │ 1000 │
└─────────────────────────┴─────────┘
After:
┌───────────────────────x─┬─count()─┐
│ 2020-05-22 19:08:07.700 │ 1000 │
│ 2020-05-22 19:08:07.781 │ 1000 │
│ 2020-05-22 19:08:07.845 │ 1000 │
│ 2020-05-22 19:08:07.912 │ 1000 │
│ 2020-05-22 19:08:07.979 │ 1000 │
│ 2020-05-22 19:08:08.047 │ 1000 │
│ 2020-05-22 19:08:08.116 │ 1000 │
│ 2020-05-22 19:08:08.180 │ 1000 │
│ 2020-05-22 19:08:08.238 │ 1000 │
│ 2020-05-22 19:08:08.300 │ 1000 │
│ 2020-05-22 19:08:08.367 │ 1000 │
│ 2020-05-22 19:08:08.436 │ 1000 │
│ 2020-05-22 19:08:08.505 │ 1000 │
│ 2020-05-22 19:08:08.575 │ 1000 │
│ 2020-05-22 19:08:08.647 │ 1000 │
│ 2020-05-22 19:08:08.705 │ 1000 │
│ 2020-05-22 19:08:08.763 │ 1000 │
│ 2020-05-22 19:08:08.827 │ 1000 │
│ 2020-05-22 19:08:08.892 │ 1000 │
│ 2020-05-22 19:08:08.964 │ 1000 │
└─────────────────────────┴─────────┘
|
Thank you! I was just about to make a PR fixing this same issue. In our case 500ms reschedule was limiting rate to about 300k rows per second. With a similar fix - 1-2M+/s |
azat
left a comment
There was a problem hiding this comment.
LGTM (merging with squashing is preferable)
I would also add details about the performance tests (and notes from the PR description) into the commit itself
And also maybe it worth updating title of the PR so that it includes the possible "performance" benefits.
|
Actually there are tests failures that looks related (well maybe not directly, but they should not fail), @filimonov can you take a look? |
|
Also seems that now the upstream cppkafka can be used (your PR is merged, and other build fixes already there too) |
|
@filimonov Do you have any ideas about test_storage_kafka/test.py::test_kafka_flush_by_block_size test fail? |
|
UPD: #11149 (comment) |
Not in that PR. Cppkafka has some other changes too. That PR is a bugfix, want to keep it compact and limit side effects. |
Changelog entry contains that. Commit messages - discussed on telegram (it's a nice practice, but has no sence if only one person will follow that, and anyway it can't replace PR description and documentation in PR/tasks etc) |
|
Hm. It looks like a real issue (minor), It can exit the poll loop prematurely by eof, w/o retries. One line fix. Will add one more test for that and fix it here too. UPD. I've messed the builds. Reproduces well. |
…hich were always applied.
f67ab80 to
efc0b97
Compare
|
Rebased on the newer master (was on the top of broken one), squashed. Test fixed, another bug discovered #11216 |
|
Linking error on clang-10 - unrelated. |
|
Did a performance tests: Test https://github.com/filimonov/ch-kafka-consume-perftest Format TSV Consuming time with single table with kafka_num_consumers=2 measured.
Related: #8917 (comment) |
|
Maybe worth checking with more partitions and compare it with old simple P.S. , something like this may be used to run 10 P.S. |
|
Same setup, same data, your test: |
|
Just like I thought - Kafka engine is still slow |
|
I agree that there are several places that can be done better. And the performance can be improved. But your test is not 100% fair. With commits, it gives [1.29M/s] for 2 consumers. |
Fixed reschedule issue in Kafka (cherry picked from commit 992f0df)
Fixed reschedule issue in Kafka (cherry picked from commit 992f0df)
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix Kafka performance issue related to reschedules based on limits, which were always applied.
Detailed description / Documentation draft:
It was checking limits before and always reschedule, leading to poor performance.
(it was able to consume only up to kafka_max_block_size rows and after that 0.5 second delay for rescheduling was happening, decreasing the performance).
Reported by @azat