Improve string column dict encoding performance#15

Closed
taiyang-li wants to merge 2 commits into ClickHouse:main from taiyang-li:improve_dict_write

@taiyang-li commented Aug 20, 2024

What changes were proposed in this pull request?

Improve the write performance of dictionary-encoded string columns. In the benchmark below, elapsed time drops from 2.794 s to 1.423 s, a speedup of about 2x.
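
For readers unfamiliar with the technique being optimized, here is a minimal sketch of dictionary encoding for a string column. It is an illustration under stated assumptions, not the ORC C++ writer's code: the names `DictEncodedColumn`, `dictEncode`, and `valueAt` are hypothetical, and the real writer's data structures may differ. It requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration only; these are not classes from the ORC library.
struct DictEncodedColumn {
    std::vector<std::string> dictionary;  // distinct values, in first-seen order
    std::vector<uint32_t> indices;        // one dictionary index per input row
};

// Replace each row's string with the index of its entry in a dictionary
// of distinct values. Columns with few distinct values shrink a lot.
DictEncodedColumn dictEncode(const std::vector<std::string>& rows) {
    DictEncodedColumn out;
    out.indices.reserve(rows.size());
    std::unordered_map<std::string, uint32_t> lookup;
    for (const std::string& value : rows) {
        // try_emplace inserts only if the value has not been seen before.
        auto [it, inserted] = lookup.try_emplace(
            value, static_cast<uint32_t>(out.dictionary.size()));
        if (inserted) {
            out.dictionary.push_back(value);
        }
        out.indices.push_back(it->second);
    }
    return out;
}

// Recover the original string stored at a given row.
const std::string& valueAt(const DictEncodedColumn& col, std::size_t row) {
    assert(row < col.indices.size());
    return col.dictionary[col.indices[row]];
}
```

With the test query below, which produces at most 1000 distinct strings over 10 million rows, a scheme like this keeps the dictionary tiny while the per-row lookup sits on the hot path, which is presumably why the write path is worth optimizing.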

Why are the changes needed?

How was this patch tested?

I tested it with the ClickHouse branch from ClickHouse/ClickHouse#68591.

Query:

set output_format_orc_dictionary_key_size_threshold = 1;
select concat('gluten ', cast(rand()%1000 as String)) from numbers(10000000) into outfile 'dict.orc' truncate;

Before current optimization:

10000000 rows in set. Elapsed: 2.794 sec. Processed 10.00 million rows, 80.00 MB (3.58 million rows/s., 28.63 MB/s.)
Peak memory usage: 9.07 MiB.

After current optimization:

10000000 rows in set. Elapsed: 1.423 sec. Processed 10.00 million rows, 80.00 MB (7.02 million rows/s., 56.20 MB/s.)
Peak memory usage: 9.07 MiB.

…dd/StringColumnWriter::add/CharColumnWriter::add
@taiyang-li (author) commented:

I have opened a PR with these changes upstream: apache#2010

@taiyang-li (author) commented:

The other PR, contributed upstream, will probably be merged in a few days, so let's convert this PR to a draft.

@taiyang-li taiyang-li marked this pull request as draft August 28, 2024 02:11
@taiyang-li taiyang-li closed this Aug 28, 2024
ffacs pushed a commit to apache/orc that referenced this pull request Sep 3, 2024
…and support EncodedStringVectorBatch for StringColumnWriter

### What changes were proposed in this pull request?

Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter.
Performance was measured in ClickHouse#15

### Why are the changes needed?

### How was this patch tested?

Covered by the existing tests.

### Was this patch authored or co-authored using generative AI tooling?

Closes #2010 from taiyang-li/apache_improve_dict_write.

Lead-authored-by: taiyang-li <[email protected]>
Co-authored-by: 李扬 <[email protected]>
Signed-off-by: ffacs <[email protected]>
taiyang-li added a commit to taiyang-li/orc that referenced this pull request Sep 4, 2024
…and support EncodedStringVectorBatch for StringColumnWriter

Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter.
Performance was measured in ClickHouse#15

Covered by the existing tests.

Closes apache#2010 from taiyang-li/apache_improve_dict_write.

Lead-authored-by: taiyang-li <[email protected]>
Co-authored-by: 李扬 <[email protected]>
Signed-off-by: ffacs <[email protected]>