Skip to content

Add settings output_format_orc_dictionary_key_size_threshold to allow user to enable dict encoding for string column in ORC output format#68591

Merged
Avogar merged 13 commits intoClickHouse:masterfrom
bigo-sg:orc_dict_encode
Sep 17, 2024
Merged

Add settings output_format_orc_dictionary_key_size_threshold to allow user to enable dict encoding for string column in ORC output format#68591
Avogar merged 13 commits intoClickHouse:masterfrom
bigo-sg:orc_dict_encode

Conversation

@taiyang-li
Copy link
Contributor

@taiyang-li taiyang-li commented Aug 20, 2024

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add settings output_format_orc_dictionary_key_size_threshold to allow user to enable dict encoding for string column in ORC output format. It helps reduce the output orc file size and improve reading performance significantly.

TODO

  1. Write ORC file with string dict encoding enabled
set output_format_orc_dictionary_key_size_threshold = 1;
select concat('gluten ', cast(rand()%1000 as String)) from numbers(10000000) into outfile 'dict.orc' truncate;
10000000 rows in set. Elapsed: 2.794 sec. Processed 10.00 million rows, 80.00 MB (3.58 million rows/s., 28.63 MB/s.)
Peak memory usage: 9.07 MiB.
  1. Write ORC file with string dict encoding disabled
set output_format_orc_dictionary_key_size_threshold = 0; 
select concat('gluten ', cast(rand()%1000 as String)) from numbers(10000000) into outfile 'no-dict.orc' truncate;
10000000 rows in set. Elapsed: 1.222 sec. Processed 10.00 million rows, 80.00 MB (8.18 million rows/s., 65.46 MB/s.)
Peak memory usage: 9.07 MiB.
  1. Compare ORC file size
$ ll -rt *dict*
-rw-r--r-- 1 root root 22731823 Aug 20 15:42 no-dict.orc
-rw-r--r-- 1 root root 14921714 Aug 20 15:42 dict.orc
  1. Compare reading performance of both ORC files
select * from file('dict.orc') format Null;
0 rows in set. Elapsed: 0.166 sec. Processed 10.00 million rows, 14.92 MB (60.19 million rows/s., 89.82 MB/s.)

select * from file('no-dict.orc') format Null;
0 rows in set. Elapsed: 0.385 sec. Processed 10.00 million rows, 22.73 MB (25.96 million rows/s., 59.00 MB/s.)
  1. Conclusions
  • Writing string column with dict encoding is 2.3x slower than without dict encoding. But I improved its performance in Improve string column dict encoding performance orc#15. Hope for your review.
  • ORC files written with dict encoding is 1.5x smaller than without dict encoding
  • Reading string column with dict encoding is 2.3x faster than without dict encoding

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: All Builds
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Run only fuzzers related jobs (libFuzzer fuzzers, AST fuzzers, etc.)
  • Exclude: AST fuzzers

  • Do not test
  • Woolen Wolfdog
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@taiyang-li taiyang-li marked this pull request as draft August 20, 2024 08:01
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-feature Pull request with new product feature label Aug 20, 2024
@robot-clickhouse-ci-1
Copy link
Contributor

robot-clickhouse-ci-1 commented Aug 20, 2024

This is an automated comment for commit 4412946 with description of existing statuses. It's updated for the latest CI running

✅ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success

@taiyang-li taiyang-li marked this pull request as ready for review August 21, 2024 07:21
@Avogar Avogar self-assigned this Aug 21, 2024
@robot-ch-test-poll3 robot-ch-test-poll3 added pr-performance Pull request with some performance improvements and removed pr-feature Pull request with new product feature labels Aug 22, 2024
@taiyang-li
Copy link
Contributor Author

@devcrafter
Copy link
Member

devcrafter commented Sep 13, 2024

https://s3.amazonaws.com/clickhouse-test-reports/68591/0de3b1dacbc587e7f6789dec3fe92e276f332f56/stateful_tests__ubsan_.html I can't figure out why it failed

@taiyang-li Please update the PR with recent master. It was an issue in CI, - was fixed by #69483

@Avogar Avogar added this pull request to the merge queue Sep 17, 2024
Merged via the queue into ClickHouse:master with commit b21be2b Sep 17, 2024
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants