[SPARK-48518][CORE] Make LZF compression be able to run in parallel by yaooqinn · Pull Request #46858 · apache/spark

yaooqinn · 2024-06-04T07:09:18Z

What changes were proposed in this pull request?

This PR introduces a config that switches LZF compression to parallel mode by using PLZFOutputStream.

FYI, https://github.com/ning/compress?tab=readme-ov-file#parallel-processing
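A minimal sketch of how the new option might be turned on from application code. The key `spark.io.compression.lzf.parallel.enabled` is the name quoted in the follow-up commit message further down this page; `spark.io.compression.codec` is the existing Spark setting that selects the codec. The value name `conf` is only illustrative.

```scala
import org.apache.spark.SparkConf

// Select the LZF codec and opt in to the parallel output stream.
// Assumption: the option only matters when the codec in use is lzf.
val conf = new SparkConf()
  .set("spark.io.compression.codec", "lzf")
  .set("spark.io.compression.lzf.parallel.enabled", "true")
```

The same two keys could equally be passed with `--conf` on `spark-submit` or placed in `spark-defaults.conf`.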

Why are the changes needed?

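Improve performance. In the local benchmark below, compressing large objects in 7 threads takes roughly half the time of single-threaded compression (12 ms vs. 23 ms best time).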

[info] OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
[info] Apple M2 Max
[info] Compress large objects:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------------------------------------
[info] Compression 1024 array values in 7 threads                12             13           1          0.1       11788.2       1.0X
[info] Compression 1024 array values single-threaded             23             23           0          0.0       22512.7       0.5X

Does this PR introduce any user-facing change?

no

How was this patch tested?

benchmark

Was this patch authored or co-authored using generative AI tooling?

no

yaooqinn · 2024-06-04T07:11:10Z

core/benchmarks/LZFBenchmark-jdk21-results.txt

+AMD EPYC 7763 64-Core Processor
+Compress large objects:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------
+Compression 1024 array values in 1 threads                39             45           5          0.0       38475.4       1.0X


With GitHub standard action runners, it seems that we only get 1 thread.

hmm...

[info] OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 14.5
[info] Apple M2 Max
[info] Compress small objects:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------
[info] Compression 256000000 int values in parallel                548            550           2        467.0           2.1       1.0X
[info] Compression 256000000 int values single-threaded            522            523           1        490.5           2.0       1.1X
[info] Running benchmark: Compress large objects
[info] Running case: Compression 1024 array values in 8 threads
[info] Stopped after 123 iterations, 2009 ms
[info] Running case: Compression 1024 array values single-threaded
[info] Stopped after 83 iterations, 2003 ms
[info] OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 14.5
[info] Apple M2 Max
[info] Compress large objects:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------------------------------------
[info] Compression 1024 array values in 8 threads                12             16          13          0.1       11546.1       1.0X
[info] Compression 1024 array values single-threaded             23             24           1          0.0       22767.9       0.5X

I ran this benchmark locally, and the performance of "Compression 256000000 int values single-threaded" and "Compression 256000000 int values in parallel" is almost the same (522 ms vs. 548 ms best time).

I guess the rate is limited by the producer.

LuciferYang · 2024-06-04T07:23:18Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

      .intConf
      .createWithDefault(1)

+  private[spark] val IO_COMPRESSION_LZF_PARALLEL =


Should we add an explanation of this configuration to configuration.md?

thanks, addressed

LuciferYang

LGTM

yaooqinn · 2024-06-04T10:59:55Z

Thank you @LuciferYang

Merged to master

mridulm · 2024-06-04T13:51:26Z

core/src/main/scala/org/apache/spark/io/CompressionCodec.scala

  override def compressedOutputStream(s: OutputStream): OutputStream = {
-    new LZFOutputStream(s).setFinishBlockOnFlush(true)
+    if (parallelCompression) {
+      new PLZFOutputStream(s)


This creates a thread pool per compressedOutputStream, which can end up being quite expensive (the number of threads is the number of processors, plus some 'interesting' logic that tries to modulate it).

Did you get a chance to try this on some nontrivial jobs? I am very curious about the experience.
Given that this is turned off by default, I don't see any concerns with the change itself, though! It would be a good way to understand the impact.
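For context, the branch under discussion can be sketched as a standalone helper. This is an illustration under stated assumptions, not the merged code: the diff above is truncated after the `PLZFOutputStream` line, so the `else` branch here simply reuses the removed line, and the names `lzfOutputStream` and `parallel` are hypothetical. It assumes compress-lzf is on the classpath.

```scala
import java.io.OutputStream

import com.ning.compress.lzf.LZFOutputStream
import com.ning.compress.lzf.parallel.PLZFOutputStream

// Hypothetical helper mirroring the shape of the diff above.
def lzfOutputStream(s: OutputStream, parallel: Boolean): OutputStream = {
  if (parallel) {
    // Each PLZFOutputStream instance brings its own worker threads,
    // which is the per-stream cost raised in the comment above.
    new PLZFOutputStream(s)
  } else {
    // Pre-existing single-threaded path, copied from the removed line.
    new LZFOutputStream(s).setFinishBlockOnFlush(true)
  }
}
```

Because the flag defaults to off in this PR, existing jobs would keep taking the single-threaded path unless they opt in.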

…` by default

### What changes were proposed in this pull request?

This PR aims to enable `spark.io.compression.lzf.parallel.enabled` by default at Apache Spark 4.1.0.

### Why are the changes needed?

`spark.io.compression.lzf.parallel.enabled` was introduced at Apache Spark 4.0.0 and has been used stably so far. We can enable this by default.

- #46858

### Does this PR introduce _any_ user-facing change?

Yes for `LZF` users. The migration guide is updated.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52603 from dongjoon-hyun/SPARK-53896.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>


yaooqinn added 2 commits June 4, 2024 15:04

[SPARK-48518][CORE] Make LZF compression be able to run in parallel

57662d7

Add benchmark

d372b12

github-actions bot added the CORE label Jun 4, 2024

yaooqinn commented Jun 4, 2024


LuciferYang reviewed Jun 4, 2024


Add doc

3501b22

github-actions bot added the DOCS label Jun 4, 2024

LuciferYang approved these changes Jun 4, 2024


yaooqinn closed this in 90ee299 Jun 4, 2024

yaooqinn deleted the SPARK-48518 branch June 4, 2024 10:59

mridulm reviewed Jun 4, 2024


dongjoon-hyun mentioned this pull request Oct 14, 2025

[SPARK-53896][CORE] Enable spark.io.compression.lzf.parallel.enabled by default #52603

Closed

dongjoon-hyun mentioned this pull request Feb 13, 2026

[SPARK-55508][BUILD] Upgrade compress-lzf to 1.2.0 #54292

Closed

yaooqinn wants to merge 3 commits into apache:master from yaooqinn:SPARK-48518
