[WIP] tcmalloc allocator (new C++ version by google)#11590
azat wants to merge 27 commits into ClickHouse:master from
Conversation
So it is used, but slower; need to dig.
|
Why do we see something like https://github.com/abseil/abseil-cpp/blob/master/absl/base/internal/errno_saver.h in the flame graph? This function should be 100% inlined. Maybe it is accidentally compiled with wrong flags?
|
By the way, your flame graph is not from perf test, because it is:
Let's look at the flame graphs that are automatically built as a result of the perf test.
|
Maybe it is just slower on middle-sized allocations (from a few KB to a few MB)?
Looks like it is inlined -- the file is errno_saver, but the function is AbslInternalSpinLockDelay, something locking-related. And there is really a lot of this function; I guess there is heavy contention on some internal data structures.
|
Maybe we can create an isolated test case with a huge number of randomly distributed middle-sized allocations from multiple threads?
|
A simple test with threads shows that it is slower (2x), but the results should be analyzed more before making conclusions, so I'm still digging.
|
Please note that the original tcmalloc also was not working well without some tuning. Actually, it was working well but degraded after some release. For some reason, the scenario of middle-sized allocations is a "dark zone" for allocators; they just don't have it in their test suites (and the reason for degradation is usually more page faults due to less caching and earlier page deallocation/invalidation). We need to isolate this scenario and submit it to them. Many programs have their own allocation cache on top of the allocator... but it looks counterproductive, because the allocator should work well.
|
There is a feature request to add CMake to tcmalloc: google/tcmalloc#4 |
Yep, I'm going to submit this (and also some patches against upstream should be submitted too), but first this should be done in a cleaner way (and I need to dig into abseil more too).
Thanks, but this implementation is completely different.
Yep, I saw your realloc test (and actually my testing was based on it, but a little bit different)
Yep, or better investigate and share solution :) (if it is more or less simple)
And actually I'm not sure that the problem is only in the middle-sized allocations, but it is too early to make conclusions; I'm not finished yet.
|
tcmalloc uses half as much memory as jemalloc (in this perf test run):
To build this graph from performance test output:
To estimate the difference using the graph data:
bac1a58 to b830fb4
@alexey-milovidov AFAICS the problem is not in the middle-sized allocations but in allocations >~256K (this depends on the number of size classes for the fast path, which in turn depends on the page size used in tcmalloc; I tried 8K/256K pages, it does not help much), and it does not look like it can be fixed easily. I have some ideas left, but I no longer expect that it will work better than jemalloc. As the last steps:
|
|
I name an allocation "middle-sized" if it is less than what should go directly to mmap (64 MiB in my experience) but greater than a few KiB. We use these allocations a lot for columns and buffers. There is some code from a sibling project in Yandex to cache middle-sized allocations on top of the allocator. We can bring this code into ClickHouse... I have not tested it, but we can try.
We need to make a test case that will show when tcmalloc is worse than jemalloc and send it to Google as an issue.
|
BTW it's possible that the Linux kernel in our CI does not support "restartable sequences".
We can print Linux kernel version at server startup.
THP was working extremely badly at least a few years ago, so all the recommendations state that you should disable it (otherwise you get memory fragmentation when the program runs for a significantly long time).
How is it implemented? Looks very useful. Maybe we can use the same technique independently of tcmalloc?
We also use (and enable) sized deallocations in our builds.
Mostly not; most cases are 4K alignment for direct I/O.
Hope that you are not talking about LFAlloc (#5369)
I'm testing on 5.1.5. And the problem does not look like it is due to RSEQ, but due to locking overhead (and also maybe a heavier allocation code path in general, not sure).
That separate caching layer was implemented to mitigate the issue that LFAlloc does mmap too frequently :)
I'm seeing this stated everywhere without any details, but I'm not sure that it is still true for recent kernels (I saw some stuff around it on LKML).
That's true, but you need to pass the size of the object to the allocator, which is not done right now (AFAICS).
I thought that masking the MSB (for example, when I last looked at the rbtree in Linux, it uses the MSB to encode the color) could be a problem, but the minimal alignment is satisfied, so it should not be a problem. Indeed, 4K alignment is another issue.
Nice! :) |
Our new_delete.cpp does it for jemalloc.
Can be enabled with: -DENABLE_JEMALLOC=OFF -DENABLE_TCMALLOC_CPP=ON
Thus the target that is linked with clickhouse_new_delete will see every define and so on.
Thus a target that is linked with tcmalloc-cpp/clickhouse_new_delete can use internal tcmalloc code (that depends on abseil and/or its defines).
This will avoid an extra memset().
Since:
- this ClickHouse build uses a thread pool anyway
- the allocator can have some hooks/extra stuff that is done for each new thread
Since this is the last size in tcmalloc for which allocation can be done fast, and now jemalloc is slower than tcmalloc:
- tcmalloc: real 0m2.335s, user 0m28.804s, sys 0m0.010s
- jemalloc: real 0m2.567s, user 0m32.748s, sys 0m0.020s
/build/base/common/memory.h:14:5: error: 'USE_TCMALLOC_CPP' is not defined, evaluates to 0 [-Werror,-Wundef]
There is endless recursion from __ubsan_handle_dynamic_type_cache_miss_abort.
This reverts commit 2b9ef87a2b8dce246ed744db5fb9c8b18e97e603.
This should be in the INTERFACE part, otherwise it does not make any difference; besides, it is required only for the allocator-perf test.
This reverts commit ddd4e5cd62714aa14fa4601e17b0eae661269994.
This should fix SIGSEGV for TCMALLOC_4M_MAX_SIZE (doh, debugging an allocator without a sane stacktrace and without asan/valgrind is not that easy).

tcmalloc performance checklist:
pre-upstream checklist:
Details
- ../contrib/tcmalloc-cpp/tcmalloc/tcmalloc.cc:2270] tcmalloc::GetOwnership(ptr) != tcmalloc::MallocExtension::Ownership::kNotOwned
- noexcept (-DABSL_ALLOCATOR_NOTHROW)
- ReleaseMemoryToSystem (AFAIR there is some API for this in a newer version)?
- -fsized-deallocation (by default?)
- static_assert against __STDCPP_DEFAULT_NEW_ALIGNMENT__
- TCMallocInternalNew/TCMallocInternalDeleteSized -- does not make any difference for allocator-perf
- TCMALLOC_256K_PAGES -- does not make any difference in performance tests
- TCMALLOC_4M_MAX_SIZE

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Ability to use tcmalloc allocator (new C++ version by google)
Can be enabled with:
Suggested-by: @alexey-milovidov
Cc: @akuzm (perf test)
Details
HEAD: TCMALLOC_256K_PAGES, TCMALLOC_4M_MAX_SIZE (9 faster, 98 slower, 210 unstable)