Skip to content

Fixed reschedule issue in Kafka#11149

Merged
alexey-milovidov merged 2 commits intoClickHouse:masterfrom
filimonov:kafka_reschedule
May 27, 2020
Merged

Fixed reschedule issue in Kafka#11149
alexey-milovidov merged 2 commits intoClickHouse:masterfrom
filimonov:kafka_reschedule

Conversation

@filimonov
Copy link
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix Kafka performance issue related to reschedules based on limits, which were always applied.

Detailed description / Documentation draft:
It was checking limits before and always reschedule, leading to poor performance.
(it was able to consume only up to kafka_max_block_size rows and after that 0.5 second delay for rescheduling was happening, decreasing the performance).

Reported by @azat

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before

┌───────────────────────x─┬─count()─┐
│ 2020-05-22 19:01:04.089 │    1000 │
│ 2020-05-22 19:01:04.656 │    1000 │
│ 2020-05-22 19:01:05.225 │    1000 │
│ 2020-05-22 19:01:05.788 │    1000 │
│ 2020-05-22 19:01:06.356 │    1000 │
│ 2020-05-22 19:01:06.932 │    1000 │
│ 2020-05-22 19:01:07.517 │    1000 │
│ 2020-05-22 19:01:08.095 │    1000 │
│ 2020-05-22 19:01:08.659 │    1000 │
│ 2020-05-22 19:01:09.230 │    1000 │
│ 2020-05-22 19:01:09.801 │    1000 │
│ 2020-05-22 19:01:10.374 │    1000 │
│ 2020-05-22 19:01:10.936 │    1000 │
│ 2020-05-22 19:01:11.519 │    1000 │
│ 2020-05-22 19:01:12.092 │    1000 │
│ 2020-05-22 19:01:12.663 │    1000 │
│ 2020-05-22 19:01:13.236 │    1000 │
│ 2020-05-22 19:01:13.808 │    1000 │
│ 2020-05-22 19:01:14.383 │    1000 │
│ 2020-05-22 19:01:14.956 │    1000 │
└─────────────────────────┴─────────┘

After:

┌───────────────────────x─┬─count()─┐
│ 2020-05-22 19:08:07.700 │    1000 │
│ 2020-05-22 19:08:07.781 │    1000 │
│ 2020-05-22 19:08:07.845 │    1000 │
│ 2020-05-22 19:08:07.912 │    1000 │
│ 2020-05-22 19:08:07.979 │    1000 │
│ 2020-05-22 19:08:08.047 │    1000 │
│ 2020-05-22 19:08:08.116 │    1000 │
│ 2020-05-22 19:08:08.180 │    1000 │
│ 2020-05-22 19:08:08.238 │    1000 │
│ 2020-05-22 19:08:08.300 │    1000 │
│ 2020-05-22 19:08:08.367 │    1000 │
│ 2020-05-22 19:08:08.436 │    1000 │
│ 2020-05-22 19:08:08.505 │    1000 │
│ 2020-05-22 19:08:08.575 │    1000 │
│ 2020-05-22 19:08:08.647 │    1000 │
│ 2020-05-22 19:08:08.705 │    1000 │
│ 2020-05-22 19:08:08.763 │    1000 │
│ 2020-05-22 19:08:08.827 │    1000 │
│ 2020-05-22 19:08:08.892 │    1000 │
│ 2020-05-22 19:08:08.964 │    1000 │
└─────────────────────────┴─────────┘

@blinkov blinkov added the pr-bugfix Pull request with bugfix, not backported by default label May 22, 2020
Copy link
Member

@azat azat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good to me (left a few comments and also there are some coding style issues that will be reported by CI anyway)

@oandrew
Copy link
Contributor

oandrew commented May 22, 2020

Thank you! I was just about to make a PR fixing this same issue. In our case 500ms reschedule was limiting rate to about 300k rows per second. With a similar fix - 1-2M+/s

Copy link
Member

@azat azat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (merging with squashing is preferable)

I would also add details about the performance tests (and notes from the PR description) into the commit itself
And also maybe it worth updating title of the PR so that it includes the possible "performance" benefits.

@azat
Copy link
Member

azat commented May 23, 2020

Actually there are tests failures that looks related (well maybe not directly, but they should not fail), @filimonov can you take a look?

@azat
Copy link
Member

azat commented May 24, 2020

Also seems that now the upstream cppkafka can be used (your PR is merged, and other build fixes already there too)

@alexey-milovidov
Copy link
Member

@filimonov Do you have any ideas about test_storage_kafka/test.py::test_kafka_flush_by_block_size test fail?

@filimonov
Copy link
Contributor Author

filimonov commented May 25, 2020

Can't reproduce locally. Most probably the PR just pushed a strange balance of sleeps inside those Kafka tests, and some sleeps need to be reviewed/adjusted. Locally i have another test flapping now.

UPD: #11149 (comment)

@filimonov
Copy link
Contributor Author

Also seems that now the upstream cppkafka can be used (your PR is merged, and other build fixes already there too)

Not in that PR. Cppkafka has some other changes too. That PR is a bugfix, want to keep it compact and limit side effects.

@filimonov
Copy link
Contributor Author

filimonov commented May 25, 2020

I would also add details about the performance tests (and notes from the PR description) into the commit itself
And also maybe it worth updating title of the PR so that it includes the possible "performance" benefits.

Changelog entry contains that. Commit messages - discussed on telegram (it's a nice practice, but has no sence if only one person will follow that, and anyway it can't replace PR description and documentation in PR/tasks etc)

@filimonov
Copy link
Contributor Author

filimonov commented May 26, 2020

Hm. It looks like a real issue (minor), strange that it didn't hit me locally.

It can exit the poll loop prematurely by eof, w/o retries. One line fix.

Will add one more test for that and fix it here too.

UPD. I've messed the builds. Reproduces well.

@filimonov
Copy link
Contributor Author

filimonov commented May 27, 2020

Rebased on the newer master (was on the top of broken one), squashed.

Test fixed, another bug discovered #11216
(Will fix the bug in follow-up PR)

@filimonov
Copy link
Contributor Author

Linking error on clang-10 - unrelated.

2020-05-27 06:37:47 [112/249] Linking CXX executable src/Storages/tests/system_numbers
2020-05-27 06:37:47 FAILED: src/Storages/tests/system_numbers 
2020-05-27 06:37:47 : && /usr/bin/ccache /usr/bin/clang++-10  -g -fno-omit-frame-pointer -DSANITIZER -fsanitize=thread -fsanitize-blacklist=/build/tests/tsan_suppressions.txt -fdiagnostics-color=always -std=c++2a -fsized-deallocation  -gdwarf-aranges -pipe -msse4.1 -msse4.2 -mpopcnt   -Wall -Wno-unused-command-line-argument  -fdiagnostics-absolute-paths -Werror -Wextra -Wpedantic -Wno-vla-extension -Wno-zero-length-array -Wcomma -Wconditional-uninitialized -Wcovered-switch-default -Wdeprecated -Wembedded-directive -Wempty-init-stmt -Wextra-semi-stmt -Wextra-semi -Wgnu-case-range -Winconsistent-missing-destructor-override -Wnewline-eof -Wold-style-cast -Wrange-loop-analysis -Wredundant-parens -Wreserved-id-macro -Wshadow-field -Wshadow-uncaptured-local -Wshadow -Wstring-plus-int -Wundef -Wunreachable-code-return -Wunreachable-code -Wunused-exception-parameter -Wunused-macros -Wunused-member-function -Wzero-as-null-pointer-constant -Weverything -Wno-c++98-compat-pedantic -Wno-c++98-compat -Wno-c99-extensions -Wno-conversion -Wno-ctad-maybe-unsupported -Wno-deprecated-dynamic-exception-spec -Wno-disabled-macro-expansion -Wno-documentation-unknown-command -Wno-double-promotion -Wno-exit-time-destructors -Wno-float-equal -Wno-global-constructors -Wno-missing-prototypes -Wno-missing-variable-declarations -Wno-nested-anon-types -Wno-packed -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shift-sign-overflow -Wno-sign-conversion -Wno-switch-enum -Wno-undefined-func-template -Wno-unused-template -Wno-vla -Wno-weak-template-vtables -Wno-weak-vtables -O2 -g -DNDEBUG -O3  -fno-pie  -fuse-ld=lld -rdynamic -Wl,--gdb-index -Wl,-no-pie   -rdynamic src/CMakeFiles/clickhouse_malloc.dir/Common/malloc.cpp.o src/Storages/tests/CMakeFiles/system_numbers.dir/system_numbers.cpp.o  -o src/Storages/tests/system_numbers  src/libclickhouse_new_delete.a src/libdbms.a src/Storages/System/libclickhouse_storages_system.a src/libclickhouse_common_io.a src/libdbms.a contrib/llvm/llvm/lib/libLLVMOrcJIT.a contrib/llvm/llvm/lib/libLLVMJITLink.a contrib/llvm/llvm/lib/libLLVMExecutionEngine.a contrib/llvm/llvm/lib/libLLVMRuntimeDyld.a contrib/llvm/llvm/lib/libLLVMX86CodeGen.a contrib/llvm/llvm/lib/libLLVMX86Desc.a contrib/llvm/llvm/lib/libLLVMX86Info.a contrib/llvm/llvm/lib/libLLVMX86Utils.a contrib/llvm/llvm/lib/libLLVMAsmPrinter.a contrib/llvm/llvm/lib/libLLVMDebugInfoDWARF.a contrib/llvm/llvm/lib/libLLVMGlobalISel.a contrib/llvm/llvm/lib/libLLVMSelectionDAG.a contrib/llvm/llvm/lib/libLLVMMCDisassembler.a contrib/llvm/llvm/lib/libLLVMPasses.a contrib/llvm/llvm/lib/libLLVMCodeGen.a contrib/llvm/llvm/lib/libLLVMipo.a contrib/llvm/llvm/lib/libLLVMIRReader.a contrib/llvm/llvm/lib/libLLVMAsmParser.a contrib/llvm/llvm/lib/libLLVMLinker.a contrib/llvm/llvm/lib/libLLVMBitWriter.a contrib/llvm/llvm/lib/libLLVMInstrumentation.a contrib/llvm/llvm/lib/libLLVMScalarOpts.a contrib/llvm/llvm/lib/libLLVMAggressiveInstCombine.a contrib/llvm/llvm/lib/libLLVMInstCombine.a contrib/llvm/llvm/lib/libLLVMVectorize.a contrib/llvm/llvm/lib/libLLVMTransformUtils.a contrib/llvm/llvm/lib/libLLVMTarget.a contrib/llvm/llvm/lib/libLLVMAnalysis.a contrib/llvm/llvm/lib/libLLVMProfileData.a contrib/llvm/llvm/lib/libLLVMObject.a contrib/llvm/llvm/lib/libLLVMBitReader.a contrib/llvm/llvm/lib/libLLVMCore.a contrib/llvm/llvm/lib/libLLVMRemarks.a contrib/llvm/llvm/lib/libLLVMBitstreamReader.a contrib/llvm/llvm/lib/libLLVMMCParser.a contrib/llvm/llvm/lib/libLLVMMC.a contrib/llvm/llvm/lib/libLLVMBinaryFormat.a contrib/llvm/llvm/lib/libLLVMDebugInfoCodeView.a contrib/llvm/llvm/lib/libLLVMDebugInfoMSF.a contrib/llvm/llvm/lib/libLLVMSupport.a contrib/llvm/llvm/lib/libLLVMDemangle.a contrib/cppkafka-cmake/libcppkafka.a contrib/librdkafka-cmake/librdkafka.a contrib/libbtrie/libbtrie.a src/Common/Config/libclickhouse_common_config.a src/Dictionaries/Embedded/libclickhouse_dictionaries_embedded.a base/mysqlxx/libmysqlxx.a contrib/mariadb-connector-c/libmariadb/libmariadbclient.a contrib/icu-cmake/libicui18n.a contrib/icu-cmake/libicuuc.a contrib/icu-cmake/libicudata.a contrib/capnproto-cmake/libcapnpc.a contrib/capnproto-cmake/libcapnp.a contrib/capnproto-cmake/libkj.a contrib/arrow-cmake/libparquet_static.a contrib/arrow-cmake/libarrow_static.a contrib/boost-cmake/libboost_filesystem_internal.a contrib/lz4-cmake/liblz4.a contrib/zstd-cmake/libzstd.a contrib/flatbuffers/libflatbuffers.a contrib/arrow-cmake/libthrift_static.a contrib/boost-cmake/libboost_regex_internal.a contrib/avro-cmake/libavrocpp.a contrib/boost-cmake/libboost_iostreams_internal.a contrib/snappy/libsnappy.a contrib/openldap-cmake/libldap_r.a contrib/openldap-cmake/liblber.a src/Parsers/libclickhouse_parsers.a src/Common/ZooKeeper/libclickhouse_common_zookeeper.a src/libclickhouse_common_io.a contrib/boost-cmake/libboost_program_options_internal.a base/widechar_width/libwidechar_width.a contrib/double-conversion-cmake/libdouble-conversion.a contrib/ryu-cmake/libryu.a contrib/re2/libre2.a contrib/re2_st/libre2_st.a contrib/libcpuid-cmake/libcpuid.a contrib/croaring/libroaring.a contrib/libhdfs3-cmake/libhdfs3.a contrib/protobuf/libprotobuf.a contrib/libgsasl/liblibgsasl.a contrib/libxml2-cmake/liblibxml2.a contrib/aws-s3-cmake/libaws_s3.a contrib/aws-s3-cmake/libaws_s3_checksums.a contrib/curl-cmake/libcurl.a contrib/brotli-cmake/libbrotli.a src/Common/StringUtils/libstring_utils.a base/common/libcommon.a contrib/boost-cmake/libboost_system_internal.a contrib/cityhash102/libcityhash.a contrib/FastMemcpy/libFastMemcpy.a contrib/poco-cmake/Net/SSL/lib_poco_net_ssl.a contrib/poco-cmake/Net/lib_poco_net.a contrib/poco-cmake/Crypto/lib_poco_crypto.a contrib/openssl-cmake/libssl.a contrib/openssl-cmake/libcrypto.a contrib/poco-cmake/Util/lib_poco_util.a contrib/poco-cmake/JSON/lib_poco_json.a contrib/poco-cmake/JSON/lib_poco_json_pdjson.a contrib/poco-cmake/XML/lib_poco_xml.a contrib/poco-cmake/Foundation/lib_poco_foundation.a contrib/zlib-ng/libzlib.a contrib/poco-cmake/Foundation/lib_poco_foundation_pcre.a contrib/poco-cmake/XML/lib_poco_xml_expat.a contrib/replxx-cmake/libreplxx.a contrib/fmtlib-cmake/libfmt.a contrib/cctz-cmake/libcctz.a -Wl,--whole-archive contrib/cctz-cmake/libtzdata.a -Wl,--start-group base/glibc-compatibility/libglibc-compatibility.a contrib/libcxx-cmake/libcxx.a contrib/libcxxabi-cmake/libcxxabi.a contrib/libunwind-cmake/libunwind.a -Wl,--end-group -Wl,--no-whole-archive -nodefaultlibs /usr/lib/llvm-10/lib/clang/10.0.1/lib/linux/libclang_rt.builtins-x86_64.a  -lc -lm -lrt -lpthread -ldl && :
2020-05-27 06:37:47 ld.lld: error: duplicate symbol: __cxa_guard_abort

@alexey-milovidov alexey-milovidov merged commit 992f0df into ClickHouse:master May 27, 2020
@filimonov
Copy link
Contributor Author

filimonov commented May 27, 2020

Did a performance tests:

Test https://github.com/filimonov/ch-kafka-consume-perftest

Format TSV
Topic with 4 partitions, prefilled with 33554432 records.

Consuming time with single table with kafka_num_consumers=2 measured.

kafka_max_block_size consume_time (s, before) consume_speed (rec/sec, before) consume_time (s, that PR) consume_speed (rec/sec, that PR) improvement
65536 174 192842 99 338934 +75.76%
131072 130 258111 97 345922 +34.02%
262144 88 381300 64 524288 +37.50%
524288 66 508400 68 493448 -2.94%
1048576 65 516222 60 559241 +8.33%

Related: #8917 (comment)

@azat
Copy link
Member

azat commented May 27, 2020

Maybe worth checking with more partitions and compare it with old simple kafkacat, since it is unclear from the test what is the maximum throughput and kafkacat works more or less fast, even though it does not uses batching,especially if you run multiple instances of it?

P.S. , something like this may be used to run 10 kafkacat in parallel (using xargs) and track progress (using pv) yes | head -n10 | xargs -P10 -n1 -i kafkacat -X enable.auto.commit=false -X group.id=ch-test -C -b kf1.brk -t ch-test-topic -o beginning | pv -la | head -n200M

P.S. kafka_num_consumers ~= kafkacat processes

@filimonov
Copy link
Contributor Author

filimonov commented May 27, 2020

Same setup, same data, your test:
1 kafkacat: 1.29M/s
2 kafkacat: 2.32M/s
4 kafkacat: 3.66M/s

@azat
Copy link
Member

azat commented May 27, 2020

Just like I thought - Kafka engine is still slow

@filimonov
Copy link
Contributor Author

filimonov commented May 27, 2020

I agree that there are several places that can be done better. And the performance can be improved.

But your test is not 100% fair. With commits, it gives [1.29M/s] for 2 consumers.
If we will add parsing/writing compressed data to the disk etc, the result we have currently is not bad.

KochetovNicolai pushed a commit that referenced this pull request Jun 8, 2020
Fixed reschedule issue in Kafka

(cherry picked from commit 992f0df)
filimonov pushed a commit to filimonov/ClickHouse that referenced this pull request Jun 8, 2020
Fixed reschedule issue in Kafka

(cherry picked from commit 992f0df)
filimonov pushed a commit to filimonov/ClickHouse that referenced this pull request Jun 8, 2020
Fixed reschedule issue in Kafka

(cherry picked from commit 992f0df)
(cherry picked from commit 9176b78)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-docs-needed pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants