ORC-1767: [C++] Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter#2010
Conversation
…dd/StringColumnWriter::add/CharColumnWriter::add
|
Thanks for opening a PR! Could you please create a JIRA for it? |
|
@wgtmac could you help review this pr, thanks very much |
|
@taiyang-li Hi, thank you for opening a PR! |
@ffacs I had measured about this. From flamegraph we could see that |
Oh yes, std::map takes |
|
@ffacs do you know where can I add google benchmark in orc ? |
|
I added a benchmark $ ./build_gcc/src/Common/benchmarks/orc_string_dictionary
2024-08-26T15:00:09+08:00
Running ./build_gcc/src/Common/benchmarks/orc_string_dictionary
Run on (32 X 2100 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 1024 KiB (x16)
L3 Unified 11264 KiB (x2)
Load Average: 6.35, 7.05, 6.55
------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------------------
BM_writeStringDictionary<OldSortedStringDictionary, 10> 74094421 ns 74091713 ns 9
BM_writeStringDictionary<NewSortedStringDictionary, 10> 73200223 ns 73197770 ns 9
BM_writeStringDictionary<OldSortedStringDictionary, 100> 137097736 ns 137091909 ns 5
BM_writeStringDictionary<NewSortedStringDictionary, 100> 86304767 ns 86301101 ns 8
BM_writeStringDictionary<OldSortedStringDictionary, 1000> 237995798 ns 237986072 ns 3
BM_writeStringDictionary<NewSortedStringDictionary, 1000> 101562538 ns 101559063 ns 7
BM_writeStringDictionary<OldSortedStringDictionary, 10000> 397019465 ns 397002121 ns 2
BM_writeStringDictionary<NewSortedStringDictionary, 10000> 153278355 ns 153274828 ns 4
BM_writeStringDictionary<OldSortedStringDictionary, 100000> 787423737 ns 787397582 ns 1
BM_writeStringDictionary<NewSortedStringDictionary, 100000> 334958714 ns 334938666 ns 2
|
|
|
@ffacs any new requested changes ? |
ffacs
left a comment
There was a problem hiding this comment.
This path almost LGTM other than the typo. Thank you @taiyang-li !
…and support EncodedStringVectorBatch for StringColumnWriter Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter. Performance was measured in ClickHouse#15 original tests. Closes apache#2010 from taiyang-li/apache_improve_dict_write. Lead-authored-by: taiyang-li <[email protected]> Co-authored-by: 李扬 <[email protected]> Signed-off-by: ffacs <[email protected]>
…and support EncodedStringVectorBatch for StringColumnWriter Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter. Performance was measured in ClickHouse#15 original tests. Closes apache#2010 from taiyang-li/apache_improve_dict_write. Lead-authored-by: taiyang-li <[email protected]> Co-authored-by: 李扬 <[email protected]> Signed-off-by: ffacs <[email protected]>
What changes were proposed in this pull request?
Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter.
Performance was measured in ClickHouse#15
Why are the changes needed?
How was this patch tested?
original tests.
Was this patch authored or co-authored using generative AI tooling?