[WIP] tcmalloc allocator (new C++ version by google) by azat · Pull Request #11590 · ClickHouse/ClickHouse

azat · 2020-06-10T22:14:54Z

`tcmalloc` performance checklist:

extract allocations pattern from the perf tests and write new size class for it

pre-upstream checklist:

Details

Changelog category (leave one):

Other

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Ability to use tcmalloc allocator (new C++ version by google)

Can be enabled with:

-DENABLE_JEMALLOC=OFF -DENABLE_TCMALLOC_CPP=ON

Suggested-by: @alexey-milovidov
Cc: @akuzm (perf test)

Details

HEAD:

6d83a078e3f9caf169aa3074315b67c51a2c73e4 -- initial (1 errors, 8 faster, 232 slower, 215 unstable)
cd902e7d2b0c0c1183507560a6edab8a9d5b1b7a -- TCMALLOC_256K_PAGES
19c4aa66e2c00c1bde2aa6b1d2e3ab9b0037c039 - TCMALLOC_4M_MAX_SIZE (9 faster, 98 slower, 210 unstable)
f694f412800cebf00ebe4941939d0bc5323c07e4 - same as above
87a1002 - same as above + google/tcmalloc@3e01aa2

docker/packager/packager

azat · 2020-06-11T08:31:13Z

Performance — 1 errors, 8 faster, 232 slower, 215 unstable

right.svg.zip

So it does used, but slower, need to dig

alexey-milovidov · 2020-06-11T12:11:40Z

Why do we see something like https://github.com/abseil/abseil-cpp/blob/master/absl/base/internal/errno_saver.h in flame graph? This function should be 100% inlined. Maybe it is accidentially compiled with wrong flags?

alexey-milovidov · 2020-06-11T12:13:33Z

By the way, your flame graph is not from perf test, because it is:

reversed;
contains inline functions...

Let's look at flame graphs that are automatically built as a result of perftest.
(The link to .7z file at bottom)

azat · 2020-06-11T12:18:09Z

Maybe it is accidentially compiled with wrong flags?

Yes this definitely should be verified

By the way, your flame graph is not from perf test, because it is:

It was built based on the output.7z from the performance test, using script by @akuzm (and updated to for TCMalloc)

alexey-milovidov · 2020-06-11T12:19:25Z

svg.zip

alexey-milovidov · 2020-06-11T12:21:21Z

Maybe it is just slower on middle-sized allocations (from a few KB to a few MB)?

akuzm · 2020-06-11T12:42:40Z

Why do we see something like https://github.com/abseil/abseil-cpp/blob/master/absl/base/internal/errno_saver.h in flame graph? This function should be 100% inlined. Maybe it is accidentially compiled with wrong flags?

Looks like it is inlined -- the file is errno_saver, but the function is AbslInternalSpinLockDelay, something locking-related. And there is really a lot of this function, I guess there is heavy contention on some internal data structures.

alexey-milovidov · 2020-06-11T12:47:43Z

Maybe we can create an isolated test case with a huge number of randomly distributed middle sized allocations from miltiple threads?

azat · 2020-06-12T23:25:57Z

Simple test with threads shows that it is slower (2x), but before making conclusions the results should be analyzed more first, so I'm still digging.

alexey-milovidov · 2020-06-13T17:41:36Z

Please note that the original tcmalloc also was not working well without some tuning:
ca9c4bb

Actually it was working well but degraded after some release.
This is very similar to what happened with jemalloc:
jemalloc/jemalloc#1347 (comment)

For some reason, the scenario of middle-sized allocations is "dark zone" for allocators, they just don't have it in their test suite (and the reason of degradation is usually - more page faults due to less caching, more early pages deallocation/invalidation).

We need to isolate this scenario and submit to them.

Many programs have their own allocation cache on top of allocator... but it looks counter productive, because allocator should work well.

alexey-milovidov · 2020-06-13T23:31:45Z

There is a feature request to add CMake to tcmalloc: google/tcmalloc#4
As you have added it in this PR, you can also contribute it to upstream.

azat · 2020-06-13T23:51:49Z

There is a feature request to add CMake to tcmalloc: google/tcmalloc#4
As you have added it in this PR, you can also contribute it to upstream.

Yep, I'm going to submit this (and also some patches against upstream should be submitted too), but first this should be done in a cleaner way (and I need to dig into abseil more too).

Please note that the original tcmalloc also was not working well without some tuning:
ca9c4bb

Thanks, but this implementation is completely different.

Actually it was working well but degraded after some release.
This is very similar to what happened with jemalloc:
jemalloc/jemalloc#1347 (comment)

Yep, I saw your realloc test (and actually my testing was based on it, but a little bit different)

We need to isolate this scenario and submit to them.

Yep, or better investigate and share solution :) (if it is more or less simple)

For some reason, the scenario of middle-sized allocations is "dark zone" for allocators, they just don't have it in their test suite (and the reason of degradation is usually - more page faults due to less caching, more early pages deallocation/invalidation).

And actually I'm not sure that the problem is only in the middle-sized, but it is too early to make conclusions, I'm not finished yet

akuzm · 2020-06-16T12:11:30Z

tcmalloc uses twice less memory than jemalloc (in this perf test run):

To build this graph from performance test output:

export metric="MemoryResident" && clickhouse-local --query "create view l as select * from file('left-async-metric-log.tsv', TSVWithNamesAndTypes, 'event_date Date, event_time DateTime, name String, value Float64'); with (select min(event_time) from l) as min_time select l.event_time - min_time, l.value, r.value from l asof join file('right-async-metric-log.tsv', TSVWithNamesAndTypes, 'event_date Date, event_time DateTime, name String, value Float64') r on l.name = r.name and l.event_time <= r.event_time where name = '$metric' order by event_time" > 1.txt && gnuplot --persist -e "set datafile separator '\t'; set terminal png size 960,540; set xtics time format '%tH:%tM'; set title '$metric' offset 0,-3; set key left top; plot '1.txt' using 1:2 with lines title 'Left', '1.txt' using 1:3 with lines title 'Right'" > tcmalloc-$metric.png

To estimate the difference using the graph data:

$ clickhouse-local --query "with 0.05 as dd select median((right - left)/left) from file('1.txt', TSV, 'seconds int, left Float64, right Float64')"

-0.5157845029932705

azat · 2020-06-21T09:16:31Z

Maybe it is just slower on middle-sized allocations (from a few KB to a few MB)?

@alexey-milovidov AFAICS the problem is not in the middle-sized but in allocations >~256K (depends on the number of classes for fast path, this depends on the size of pages is used in tcmalloc, tried 8k/256k not helps a lot), and it does not looks like it can be fixed easily.

I have some ideas left, but now I'm already not expecting that it will works better then jemalloc

As the last steps:

I can bring it into a more cleaner state to keep it in upstream (to update periodically and see if it will be better)
or just close it

alexey-milovidov · 2020-06-21T10:03:24Z

I name allocation "middle sized" if it is less than what should go directly to mmap (64 MiB in my experience) but greater than a few KiB. We use these allocations a lot for columns and buffers.

There is some code from a sibling project in Yandex to cache middle-sized allocations on top of allocator. We can bring this code into ClickHouse... I did not tested it, but we can try.

alexey-milovidov · 2020-06-21T10:06:00Z

As the last steps:

We need to make a test case that will show when tcmalloc is worse than jemalloc and send it to Google as an issue.

alexey-milovidov · 2020-06-21T10:17:31Z

BTW it's possible that Linux kernel in our CI does not support "restartable sequences".

alexey-milovidov · 2020-06-21T10:22:37Z

tcmalloc RSEQ (restartable sequences) requires linux 4.18+

We can print Linux kernel version at server startup.

doc states that it depends from THP

THP was working extremely bad at least a few years ago, so all the reccomendations state that you should disable them (otherwise you get memory fragmentation when program runs for significantly long time).

GWP-ASan for debug builds?

How it is implemented? Looks very useful. Maybe we can use the same technique independently of tcmalloc?

use sized deallocations

We also use (and enable) sized deallocations in our builds.

aligned allocations, also check clickhouse code for possible dependency on alignment (hashtables? trees optimizations?)

Mostly not, the most cases are 4K alignment for direct IO.

azat · 2020-06-21T10:22:37Z

There is some code from a sibling project in Yandex to cache middle-sized allocations on top of allocator. We can bring this code into ClickHouse... I did not tested it, but we can try.

Hope that you are not talking about LFAlloc (#5369)

BTW it's possible that Linux kernel in our CI does not support "restartable sequences".

I'm testing on 5.1.5

$ fgrep RSEQ /boot/config-$(uname -r)
CONFIG_RSEQ=y
CONFIG_HAVE_RSEQ=y

And the problem does not looks like due to RSEQ, but due to locking overhead (and also maybe more heavy allocation code path in general, not sure)

alexey-milovidov · 2020-06-21T10:25:29Z

Hope that you are not talking about LFAlloc (#5369)

That separate caching layer was implemented to mitigate the issue that LFAlloc does mmap too frequently :)

azat · 2020-06-21T10:33:35Z

THP was working extremely bad at least a few years ago, so all the reccomendations states that you should disable them (otherwise you get memory fragmentation when program runs for significantly long time).

I'm seeing this everywhere without any details, but not sure that it is true for recent kernels (I saw some stuff around it in LKML)

We also use (and enable) sized deallocations in our builds.

That's true, but you need to pass size of the object to the allocator, which is not done right now (AFAICS)

aligned allocations, also check clickhouse code for possible dependency on alignment (hashtables? trees optimizations?)

Mostly not, the most cases are 4K alignment for direct IO.

I though that masking MSB (for example when I last saw at rbtree in linux it uses MSB to encode the color) can be a problem, but minimal alignment satisfied, so should not be a problem

Indeed, 4k alignment is another issue

That separate caching layer was implemented to mitigate the issue that LFAlloc does mmap too frequently :)

Nice! :)

alexey-milovidov · 2020-06-21T10:36:30Z

That's true, but you need to pass size of the object to the allocator, which is not done right now (AFAICS)

Our new_delete.cpp does it for jemalloc.

Can be enabled with: -DENABLE_JEMALLOC=OFF -DENABLE_TCMALLOC_CPP=ON

Thus the target that is linked with clickhouse_new_delete will see every define and so on.

Thus target that is linked with tcmalloc-cpp/clickhouse_new_delete can use internal tcmalloc code (that depends on abseil and/or it's defines).

This will avoid extra memset()

Since: - this clickhouse uses thread pool anyway - allocator can have some hooks/extra stuff that is done for each new thread

Since this is the last size in tcmalloc for which allocation can be done fast, and now jemalloc is slower then tcmalloc: - tcmalloc real 0m2.335s user 0m28.804s sys 0m0.010s - jemalloc real 0m2.567s user 0m32.748s sys 0m0.020s

/build/base/common/memory.h:14:5: error: 'USE_TCMALLOC_CPP' is not defined, evaluates to 0 [-Werror,-Wundef]

There is endless recursion from __ubsan_handle_dynamic_type_cache_miss_abort This reverts commit 2b9ef87a2b8dce246ed744db5fb9c8b18e97e603.

This should be in the INTERFACE part, otherwise does not makes any difference, and besides it is required only for allocator-perf test. This reverts commit ddd4e5cd62714aa14fa4601e17b0eae661269994.

This should fix SIGSEGV for TCMALLOC_4M_MAX_SIZE (doh, debugging allocator w/o sane stacktrace and w/o asan/valgrind is not that easy)

azat marked this pull request as draft June 10, 2020 22:15

blinkov added the pr-other Pull request with changes not fitting to other categories label Jun 10, 2020

robot-clickhouse added the submodule changed At least one submodule changed in this PR. label Jun 10, 2020

akuzm reviewed Jun 10, 2020

View reviewed changes

docker/packager/packager Outdated Show resolved Hide resolved

alexey-milovidov mentioned this pull request Jun 13, 2020

Any plans to add cmake support? google/tcmalloc#4

Open

alexey-milovidov mentioned this pull request Jun 20, 2020

Add TCMALLOC_NO_NEW_DELETE_OVERRIDES/TCMALLOC_NO_C_ALLOC_OVERRIDES google/tcmalloc#38

Closed

azat force-pushed the tcmalloc-cpp branch 2 times, most recently from bac1a58 to b830fb4 Compare June 20, 2020 14:09

azat added 24 commits July 20, 2020 20:53

tcmalloc-cpp: support of new C++ version by google

15350fb

Can be enabled with: -DENABLE_JEMALLOC=OFF -DENABLE_TCMALLOC_CPP=ON

Use PUBLIC linkage for clickhouse_new_delete and allocator

a266fb2

Thus the target that is linked with clickhouse_new_delete will see every define and so on.

tcmalloc-cpp: Use PUBLIC linkage for tcmalloc-cpp and abseil targets

ca25d6a

Thus target that is linked with tcmalloc-cpp/clickhouse_new_delete can use internal tcmalloc code (that depends on abseil and/or it's defines).

tests/allocator: use malloc/free over realloc

7888321

This will avoid extra memset()

tests/allocator: do not create threads on fly, pre-create them instead

41dd5e1

Since: - this clickhouse uses thread pool anyway - allocator can have some hooks/extra stuff that is done for each new thread

tcmalloc-cpp: define USE_TCMALLOC_CPP

df42ffc

tcmalloc-cpp: cmake coding style

ddc17d9

tests/allocator: tune tcmalloc-cpp

6433143

tcmalloc-cpp: enable TCMALLOC_256K_PAGES

6561600

tcmalloc-cpp: tune tcmalloc parameters in clickhouse_common_new_delete

a20360e

tcmalloc-cpp: enable sized deallocation

41a5a27

tcmalloc-cpp: check against nullptr for sized deallocation

5462f1c

tests/allocator: use TCMallocInternalNew

0ece137

tcmalloc-cpp: use TCMallocInternalNewNothrow/TCMallocInternalNew

e3f4984

Bump tcmalloc-cpp and use TCMALLOC_4M_MAX_SIZE

d912fa2

tcmalloc-cpp: fix compilation under clang

08950a0

/build/base/common/memory.h:14:5: error: 'USE_TCMALLOC_CPP' is not defined, evaluates to 0 [-Werror,-Wundef]

abseil-cpp: add -Wno-gcc-compact

8430431

tcmalloc-cpp-cmake: allow build with undefined sanitizer

280a54f

tcmalloc-cpp: use TCMallocInternalDelete

dd97bb4

Revert "tcmalloc-cpp-cmake: allow build with undefined sanitizer"

07b5f00

There is endless recursion from __ubsan_handle_dynamic_type_cache_miss_abort This reverts commit 2b9ef87a2b8dce246ed744db5fb9c8b18e97e603.

Revert "abseil-cpp: add -Wno-gcc-compact"

8d7b6fa

This should be in the INTERFACE part, otherwise does not makes any difference, and besides it is required only for allocator-perf test. This reverts commit ddd4e5cd62714aa14fa4601e17b0eae661269994.

Bump tcmalloc-cpp

0439572

This should fix SIGSEGV for TCMALLOC_4M_MAX_SIZE (doh, debugging allocator w/o sane stacktrace and w/o asan/valgrind is not that easy)

Bump tcmalloc-cpp (google/tcmalloc@3e01aa2)

87a1002

azat force-pushed the tcmalloc-cpp branch from f694f41 to 87a1002 Compare July 20, 2020 18:13

alexey-milovidov closed this Aug 26, 2020

azat mentioned this pull request Aug 1, 2021

Add a technique similar to GWP-ASan #27039

Closed

azat mentioned this pull request Dec 9, 2021

Try experimental hualloc memory allocator from Yandex #31376

Closed

alexey-milovidov mentioned this pull request Jan 30, 2022

Umbrella task for memory allocators #34157

Closed

azat deleted the tcmalloc-cpp branch November 19, 2022 12:01

Conversation

azat commented Jun 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

tcmalloc performance checklist:

pre-upstream checklist:

Uh oh!

Uh oh!

azat commented Jun 11, 2020

Uh oh!

alexey-milovidov commented Jun 11, 2020

Uh oh!

alexey-milovidov commented Jun 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

azat commented Jun 11, 2020

Uh oh!

alexey-milovidov commented Jun 11, 2020

Uh oh!

alexey-milovidov commented Jun 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

akuzm commented Jun 11, 2020

Uh oh!

alexey-milovidov commented Jun 11, 2020

Uh oh!

azat commented Jun 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexey-milovidov commented Jun 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexey-milovidov commented Jun 13, 2020

Uh oh!

azat commented Jun 13, 2020

Uh oh!

akuzm commented Jun 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

azat commented Jun 21, 2020

Uh oh!

alexey-milovidov commented Jun 21, 2020

Uh oh!

alexey-milovidov commented Jun 21, 2020

Uh oh!

alexey-milovidov commented Jun 21, 2020

Uh oh!

alexey-milovidov commented Jun 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

azat commented Jun 21, 2020

Uh oh!

alexey-milovidov commented Jun 21, 2020

Uh oh!

azat commented Jun 21, 2020

Uh oh!

alexey-milovidov commented Jun 21, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

azat commented Jun 10, 2020 •

edited

Loading

`tcmalloc` performance checklist:

alexey-milovidov commented Jun 11, 2020 •

edited

Loading

alexey-milovidov commented Jun 11, 2020 •

edited

Loading

azat commented Jun 12, 2020 •

edited

Loading

alexey-milovidov commented Jun 13, 2020 •

edited

Loading

akuzm commented Jun 16, 2020 •

edited

Loading

alexey-milovidov commented Jun 21, 2020 •

edited

Loading