Skip to content

[WIP] tcmalloc allocator (new C++ version by google)#11590

Closed
azat wants to merge 27 commits intoClickHouse:masterfrom
azat:tcmalloc-cpp
Closed

[WIP] tcmalloc allocator (new C++ version by google)#11590
azat wants to merge 27 commits intoClickHouse:masterfrom
azat:tcmalloc-cpp

Conversation

@azat
Copy link
Member

@azat azat commented Jun 10, 2020

tcmalloc performance checklist:

  • extract allocations pattern from the perf tests and write new size class for it

pre-upstream checklist:

Details
  • loaded too late (../contrib/tcmalloc-cpp/tcmalloc/tcmalloc.cc:2270] tcmalloc::GetOwnership(ptr) != tcmalloc::MallocExtension::Ownership::kNotOwned )
  • stable abseil-cpp (or the same as in tcmalloc bazel)
  • tcmalloc does not have tags
  • separate build with tcmalloc on CI
  • split into logical parts
  • uses patched version (see commits descriptions/comments for the reason)
  • tcmalloc RSEQ (restartable sequences) requires linux 4.18+
  • telemetry via MallocExtension (like statistics for jemalloc)
  • nallocx
  • terminate in case of allocation failure allows optimization (noexcept, -DABSL_ALLOCATOR_NOTHROW)
  • doc states that it depends from THP
  • background thread that calls ReleaseMemoryToSystem (AFAIR there is some API for this in newer version)?
  • GWP-ASan for debug builds?
  • use sized deallocations (optimized, clang does not enables -fsized-deallocation by default?)
  • report performance "issue" to upstream (if it will not be resolved)
  • aligned allocations, also check clickhouse code for possible dependency on alignment (hashtables? trees optimizations?) and add static_assert against __STDCPP_DEFAULT_NEW_ALIGNMENT__
  • tried sized alloc/dealloc via TCMallocInternalNew/TCMallocInternalDeleteSized -- does not makes any difference for allocator-perf
  • Logical Page Size (default 8k, try increase, TCMALLOC_256K_PAGES) -- does not makes any difference in performance tests
  • use front-end for allocations up to 4MB (TCMALLOC_4M_MAX_SIZE)

Changelog category (leave one):

  • Other

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Ability to use tcmalloc allocator (new C++ version by google)

Can be enabled with:

-DENABLE_JEMALLOC=OFF -DENABLE_TCMALLOC_CPP=ON

Suggested-by: @alexey-milovidov
Cc: @akuzm (perf test)

Details

HEAD:

  • 6d83a078e3f9caf169aa3074315b67c51a2c73e4 -- initial (1 errors, 8 faster, 232 slower, 215 unstable)
  • cd902e7d2b0c0c1183507560a6edab8a9d5b1b7a -- TCMALLOC_256K_PAGES
  • 19c4aa66e2c00c1bde2aa6b1d2e3ab9b0037c039 - TCMALLOC_4M_MAX_SIZE (9 faster, 98 slower, 210 unstable)
  • f694f412800cebf00ebe4941939d0bc5323c07e4 - same as above
  • 87a1002 - same as above + google/tcmalloc@3e01aa2

@azat azat marked this pull request as draft June 10, 2020 22:15
@blinkov blinkov added the pr-other Pull request with changes not fitting to other categories label Jun 10, 2020
@robot-clickhouse robot-clickhouse added the submodule changed At least one submodule changed in this PR. label Jun 10, 2020
@azat
Copy link
Member Author

azat commented Jun 11, 2020

Performance — 1 errors, 8 faster, 232 slower, 215 unstable

right.svg.zip

So it does used, but slower, need to dig

@alexey-milovidov
Copy link
Member

Why do we see something like https://github.com/abseil/abseil-cpp/blob/master/absl/base/internal/errno_saver.h in flame graph? This function should be 100% inlined. Maybe it is accidentially compiled with wrong flags?

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Jun 11, 2020

By the way, your flame graph is not from perf test, because it is:

  • reversed;
  • contains inline functions...

Let's look at flame graphs that are automatically built as a result of perftest.
(The link to .7z file at bottom)

@azat
Copy link
Member Author

azat commented Jun 11, 2020

Maybe it is accidentially compiled with wrong flags?

Yes this definitely should be verified

By the way, your flame graph is not from perf test, because it is:

It was built based on the output.7z from the performance test, using script by @akuzm (and updated to for TCMalloc)

@alexey-milovidov
Copy link
Member

svg.zip

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Jun 11, 2020

Maybe it is just slower on middle-sized allocations (from a few KB to a few MB)?

@akuzm
Copy link
Contributor

akuzm commented Jun 11, 2020

Why do we see something like https://github.com/abseil/abseil-cpp/blob/master/absl/base/internal/errno_saver.h in flame graph? This function should be 100% inlined. Maybe it is accidentially compiled with wrong flags?

Looks like it is inlined -- the file is errno_saver, but the function is AbslInternalSpinLockDelay, something locking-related. And there is really a lot of this function, I guess there is heavy contention on some internal data structures.

@alexey-milovidov
Copy link
Member

Maybe we can create an isolated test case with a huge number of randomly distributed middle sized allocations from miltiple threads?

@azat
Copy link
Member Author

azat commented Jun 12, 2020

Simple test with threads shows that it is slower (2x), but before making conclusions the results should be analyzed more first, so I'm still digging.

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Jun 13, 2020

Please note that the original tcmalloc also was not working well without some tuning:
ca9c4bb

Actually it was working well but degraded after some release.
This is very similar to what happened with jemalloc:
jemalloc/jemalloc#1347 (comment)

For some reason, the scenario of middle-sized allocations is "dark zone" for allocators, they just don't have it in their test suite (and the reason of degradation is usually - more page faults due to less caching, more early pages deallocation/invalidation).

We need to isolate this scenario and submit to them.

Many programs have their own allocation cache on top of allocator... but it looks counter productive, because allocator should work well.

@alexey-milovidov
Copy link
Member

There is a feature request to add CMake to tcmalloc: google/tcmalloc#4
As you have added it in this PR, you can also contribute it to upstream.

@azat
Copy link
Member Author

azat commented Jun 13, 2020

There is a feature request to add CMake to tcmalloc: google/tcmalloc#4
As you have added it in this PR, you can also contribute it to upstream.

Yep, I'm going to submit this (and also some patches against upstream should be submitted too), but first this should be done in a cleaner way (and I need to dig into abseil more too).

Please note that the original tcmalloc also was not working well without some tuning:
ca9c4bb

Thanks, but this implementation is completely different.

Actually it was working well but degraded after some release.
This is very similar to what happened with jemalloc:
jemalloc/jemalloc#1347 (comment)

Yep, I saw your realloc test (and actually my testing was based on it, but a little bit different)

We need to isolate this scenario and submit to them.

Yep, or better investigate and share solution :) (if it is more or less simple)

For some reason, the scenario of middle-sized allocations is "dark zone" for allocators, they just don't have it in their test suite (and the reason of degradation is usually - more page faults due to less caching, more early pages deallocation/invalidation).

And actually I'm not sure that the problem is only in the middle-sized, but it is too early to make conclusions, I'm not finished yet

@akuzm
Copy link
Contributor

akuzm commented Jun 16, 2020

tcmalloc uses twice less memory than jemalloc (in this perf test run):

tcmalloc-MemoryResident

To build this graph from performance test output:

export metric="MemoryResident" && clickhouse-local --query "create view l as select * from file('left-async-metric-log.tsv', TSVWithNamesAndTypes, 'event_date Date, event_time DateTime, name String, value Float64'); with (select min(event_time) from l) as min_time select l.event_time - min_time, l.value, r.value from l asof join file('right-async-metric-log.tsv', TSVWithNamesAndTypes, 'event_date Date, event_time DateTime, name String, value Float64') r on l.name = r.name and l.event_time <= r.event_time where name = '$metric' order by event_time" > 1.txt && gnuplot --persist -e "set datafile separator '\t'; set terminal png size 960,540; set xtics time format '%tH:%tM'; set title '$metric' offset 0,-3; set key left top; plot '1.txt' using 1:2 with lines title 'Left', '1.txt' using 1:3 with lines title 'Right'" > tcmalloc-$metric.png

To estimate the difference using the graph data:

$ clickhouse-local --query "with 0.05 as dd select median((right - left)/left) from file('1.txt', TSV, 'seconds int, left Float64, right Float64')"

-0.5157845029932705

@azat
Copy link
Member Author

azat commented Jun 21, 2020

Maybe it is just slower on middle-sized allocations (from a few KB to a few MB)?

@alexey-milovidov AFAICS the problem is not in the middle-sized but in allocations >~256K (depends on the number of classes for fast path, this depends on the size of pages is used in tcmalloc, tried 8k/256k not helps a lot), and it does not looks like it can be fixed easily.

I have some ideas left, but now I'm already not expecting that it will works better then jemalloc

As the last steps:

  • I can bring it into a more cleaner state to keep it in upstream (to update periodically and see if it will be better)
  • or just close it

@alexey-milovidov
Copy link
Member

I name allocation "middle sized" if it is less than what should go directly to mmap (64 MiB in my experience) but greater than a few KiB. We use these allocations a lot for columns and buffers.

There is some code from a sibling project in Yandex to cache middle-sized allocations on top of allocator. We can bring this code into ClickHouse... I did not tested it, but we can try.

@alexey-milovidov
Copy link
Member

As the last steps:

We need to make a test case that will show when tcmalloc is worse than jemalloc and send it to Google as an issue.

@alexey-milovidov
Copy link
Member

BTW it's possible that Linux kernel in our CI does not support "restartable sequences".

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Jun 21, 2020

tcmalloc RSEQ (restartable sequences) requires linux 4.18+

We can print Linux kernel version at server startup.

doc states that it depends from THP

THP was working extremely bad at least a few years ago, so all the reccomendations state that you should disable them (otherwise you get memory fragmentation when program runs for significantly long time).

GWP-ASan for debug builds?

How it is implemented? Looks very useful. Maybe we can use the same technique independently of tcmalloc?

use sized deallocations

We also use (and enable) sized deallocations in our builds.

aligned allocations, also check clickhouse code for possible dependency on alignment (hashtables? trees optimizations?)

Mostly not, the most cases are 4K alignment for direct IO.

@azat
Copy link
Member Author

azat commented Jun 21, 2020

There is some code from a sibling project in Yandex to cache middle-sized allocations on top of allocator. We can bring this code into ClickHouse... I did not tested it, but we can try.

Hope that you are not talking about LFAlloc (#5369)

BTW it's possible that Linux kernel in our CI does not support "restartable sequences".

I'm testing on 5.1.5

$ fgrep RSEQ /boot/config-$(uname -r)
CONFIG_RSEQ=y
CONFIG_HAVE_RSEQ=y

And the problem does not looks like due to RSEQ, but due to locking overhead (and also maybe more heavy allocation code path in general, not sure)

@alexey-milovidov
Copy link
Member

Hope that you are not talking about LFAlloc (#5369)

That separate caching layer was implemented to mitigate the issue that LFAlloc does mmap too frequently :)

@azat
Copy link
Member Author

azat commented Jun 21, 2020

THP was working extremely bad at least a few years ago, so all the reccomendations states that you should disable them (otherwise you get memory fragmentation when program runs for significantly long time).

I'm seeing this everywhere without any details, but not sure that it is true for recent kernels (I saw some stuff around it in LKML)

We also use (and enable) sized deallocations in our builds.

That's true, but you need to pass size of the object to the allocator, which is not done right now (AFAICS)

aligned allocations, also check clickhouse code for possible dependency on alignment (hashtables? trees optimizations?)

Mostly not, the most cases are 4K alignment for direct IO.

I though that masking MSB (for example when I last saw at rbtree in linux it uses MSB to encode the color) can be a problem, but minimal alignment satisfied, so should not be a problem

Indeed, 4k alignment is another issue

That separate caching layer was implemented to mitigate the issue that LFAlloc does mmap too frequently :)

Nice! :)

@alexey-milovidov
Copy link
Member

That's true, but you need to pass size of the object to the allocator, which is not done right now (AFAICS)

Our new_delete.cpp does it for jemalloc.

azat added 24 commits July 20, 2020 20:53
Can be enabled with:
  -DENABLE_JEMALLOC=OFF -DENABLE_TCMALLOC_CPP=ON
Thus the target that is linked with clickhouse_new_delete will see every
define and so on.
Thus target that is linked with tcmalloc-cpp/clickhouse_new_delete can
use internal tcmalloc code (that depends on abseil and/or it's defines).
This will avoid extra memset()
Since:
- this clickhouse uses thread pool anyway
- allocator can have some hooks/extra stuff that is done for each new
  thread
Since this is the last size in tcmalloc for which allocation can be done
fast, and now jemalloc is slower then tcmalloc:

- tcmalloc
  real    0m2.335s
  user    0m28.804s
  sys     0m0.010s

- jemalloc
  real    0m2.567s
  user    0m32.748s
  sys     0m0.020s
    /build/base/common/memory.h:14:5: error: 'USE_TCMALLOC_CPP' is not defined, evaluates to 0 [-Werror,-Wundef]
There is endless recursion from __ubsan_handle_dynamic_type_cache_miss_abort

This reverts commit 2b9ef87a2b8dce246ed744db5fb9c8b18e97e603.
This should be in the INTERFACE part, otherwise does not makes any
difference, and besides it is required only for allocator-perf test.

This reverts commit ddd4e5cd62714aa14fa4601e17b0eae661269994.
This should fix SIGSEGV for TCMALLOC_4M_MAX_SIZE (doh, debugging
allocator w/o sane stacktrace and w/o asan/valgrind is not that easy)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-other Pull request with changes not fitting to other categories submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants