Igor Frenkel (caa8ee9c) at 18 Mar 07:08
Extract decompression.rb, add RSS pass to full run
... and 9 more commits
@onaaman these results are rough in the sense of their conclusions but fairly thorough in methodology. I used Claude for parts of this, so I haven't cross-checked everything, but I went down a rabbit hole of different tradeoffs in format and compression parameters.
The main constraints I was optimizing for: archives need to support resumable streaming so instances can start processing before the full download finishes; they need to be small for network transfer and storage; and Ruby process memory needs to be bounded and predictable so nodes in rails-background-jobs have predictable usage and don't OOM.
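To make the bounded-memory constraint concrete, here's a minimal sketch (not code from this MR) of processing NDJSON from a stream in fixed-size chunks: memory stays at roughly one chunk plus one partial line, regardless of archive size. The chunk size and `StringIO` stand-in are assumptions for illustration.

```ruby
require "stringio"

CHUNK = 64 * 1024

# Yield complete NDJSON lines as they become available, holding at most
# one chunk plus one partial line in memory at any time.
def each_ndjson_record(io, chunk_size: CHUNK)
  buffer = +""
  while (chunk = io.read(chunk_size))
    buffer << chunk
    # Emit every complete line currently in the buffer.
    while (newline = buffer.index("\n"))
      yield buffer.slice!(0..newline).chomp
    end
  end
  yield buffer unless buffer.empty? # trailing record without a final newline
end

# Tiny chunk size to demonstrate that records spanning chunk boundaries work.
io = StringIO.new(%({"id":1}\n{"id":2}\n{"id":3}))
records = []
each_ndjson_record(io, chunk_size: 4) { |line| records << line }
puts records.length # => 3
```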
With that in mind I looked at three formats: .zst.tar (compress each .ndjson individually with zstd, then tar them together), .tar.zst (tar the whole dataset directory, then compress the stream), and a single concatenated .ndjson where all records are merged into one file, with a _seq field injected as the first key of each record to track which sequence it came from.
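For the concatenated format, the `_seq` injection described above could look roughly like this (a hypothetical sketch, not the exporter's actual code; `concat_ndjson` and the sample records are made up):

```ruby
require "json"

# Merge records from several .ndjson sources into one stream, injecting
# _seq as the FIRST key so the originating sequence is recoverable.
def concat_ndjson(sources)
  out = +""
  sources.each_with_index do |lines, seq|
    lines.each do |line|
      record = JSON.parse(line)
      # Ruby hashes preserve insertion order, so _seq serializes first.
      out << JSON.generate({ "_seq" => seq }.merge(record)) << "\n"
    end
  end
  out
end

a = [%({"id":"CVE-1"}), %({"id":"CVE-2"})]
b = [%({"id":"GHSA-9"})]
puts concat_ndjson([a, b])
# {"_seq":0,"id":"CVE-1"}
# {"_seq":0,"id":"CVE-2"}
# {"_seq":1,"id":"GHSA-9"}
```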
Best results across seven purl types I sampled (advisories: pypi, maven, npm / licenses: go, maven, nuget, pypi):
| Format | Advisory ratio | License ratio |
|---|---|---|
| `.zst.tar`, `-19` | ~12x | ~14x |
| `.tar.zst`, `-19` | ~12x | ~14x |
| `.tar.zst`, `-19 --long=27` | 40–50x | 15–27x |
| concat `.ndjson.zst`, `-19 --long=27` | marginally better than above row | marginally better than above row |
I tried dictionaries for .zst.tar at four sizes (100KB, 1MB, 10MB, 20MB) with several sampling strategies. The idea was that since the dictionary is trained on individual .ndjson files and each file is compressed separately, the structural mismatch that hurts stream compression (where 512-byte tar headers are interspersed with content the dictionary never saw during training) shouldn't apply. Results were still consistently neutral or worse. Dictionary didn't add much for this data.
.tar.zst with zstd -19 --long=27 is the winner. Why those specific settings:
--long=27 sets a 128MB back-reference window, which is the main lever for advisory datasets. Advisory datasets are 50–75MB, so a 128MB window covers the whole thing: every record can reference any earlier one, which is why you get 40–50x instead of 11–12x. For license datasets (228MB–899MB) the window helps less but still adds a few x.

The concatenated format had slightly better ratios (a 1–2x improvement) but needs new exporter logic, a new instance parser, and format versioning, so it's not worth it for a marginal gain.
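The window arithmetic is easy to check against the dataset sizes above (sizes in MB are the measurements from this comment; the script itself is just illustration):

```ruby
# --long=N gives a 2**N-byte window; --long=27 => 128MB.
window_mb = (1 << 27) / (1024 * 1024)

# Which datasets fit entirely inside the window (and so can back-reference
# every earlier record)?
{ "advisory (max)" => 75, "license (min)" => 228, "license (max)" => 899 }.each do |name, mb|
  puts format("%-15s %4dMB  fits in %dMB window: %s", name, mb, window_mb, mb <= window_mb)
end
```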
licenses/go is an outlier at ~11x no matter what. Go pseudoversion strings (v0.0.0-YYYYMMDDHHMMSS-<12-char-hex-commit>) are basically incompressible entropy and there are a lot of them per record.
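A quick way to see why, using stdlib Zlib as a stand-in for zstd (an illustration, not one of the actual measurements): pseudoversion-like strings are mostly random hex, so they barely compress, while repeated structural JSON compresses extremely well.

```ruby
require "zlib"
require "securerandom"

# Pseudoversion-shaped strings: constant prefix + 12 random hex chars,
# mimicking go's v0.0.0-YYYYMMDDHHMMSS-<commit> format.
pseudoversions = Array.new(2_000) do
  "v0.0.0-20240101120000-#{SecureRandom.hex(6)}"
end.join("\n")

# Highly repetitive JSON of comparable size.
repetitive = %({"license":"MIT","source":"example"}\n) * 2_000

ratio = ->(s) { s.bytesize.to_f / Zlib.deflate(s, 9).bytesize }
puts format("pseudoversions: %.1fx, repetitive: %.1fx",
            ratio.(pseudoversions), ratio.(repetitive))
```

The random commit-hash suffix is irreducible entropy, so no compressor setting recovers much ratio on it.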
zstd -22 --ultra gets another 15–33% on license datasets, but compression takes a lot longer (30+ minutes on some instance types, though this was inside a Docker container). That's probably impractical for the exporter, which adds to the file daily or multiple times per day.
Code changes needed:

Exporter

- `tar -cf - <dataset>/ | zstd -19 --long=27 -f -o <output>.tar.zst`
- `.tar.gz` → `.tar.zst`: file extension, download URLs, content-type headers

Instance

- zstd-ruby gem (statically linked, no system zstd dependency on the instance)
- minitar gem for streaming tar reading. The standard library doesn't support streaming; it buffers the whole archive first, which defeats the point of predictable memory use.

Edit: this is the test repo I used, to hopefully make it easier to reproduce these results: https://gitlab.com/gitlab-org/secure/tests/ifrenkel/zstd-testing (the only variable is you need to sync data from the bucket to use as a dataset).
@onaaman could you do the initial review of this MR please? It's still in draft because there are some naming things to take care of, and there's a test suite in progress that will help automate most of the test cases. I'm hoping to have it ready by 3.18.
Igor Frenkel (79466445) at 18 Mar 06:32
Fix minor doc inaccuracies: flags.go description, example project n...
... and 6 more commits
Igor Frenkel (5ef95016) at 18 Mar 06:20
Seed missing dependency-resolution expectations
... and 6 more commits
Igor Frenkel (b8acffcb) at 18 Mar 06:02
Seed missing dependency-resolution expectations
... and 17 more commits
Igor Frenkel (32625aa3) at 18 Mar 03:11
refactor(redhat-csaf): replace CustomPut with Store interface (#648)
Igor Frenkel (32625aa3) at 18 Mar 02:57
refactor(redhat-csaf): replace CustomPut with Store interface (#648)
Igor Frenkel (fe77a37b) at 18 Mar 00:22
Add pom.xml with provided-scoped dependency
Igor Frenkel (e078e49a) at 18 Mar 00:21
Add pom.xml and Gemfile.lock
Igor Frenkel (c0928f4a) at 18 Mar 00:20
Add pom.xml with resolution disabled
Igor Frenkel (5e2b3fa6) at 18 Mar 00:18
Disable DS_INCLUDE_DEV_DEPENDENCIES
Igor Frenkel (fdd95cbd) at 18 Mar 00:17
Add pom.xml with test-scoped dependency
Igor Frenkel (82c34f42) at 18 Mar 00:15
Add pom.xml with MVN_CLI_OPTS set
Igor Frenkel (4f0757d6) at 18 Mar 00:14
Add project with deeply nested pom.xml
Igor Frenkel (85304d26) at 18 Mar 00:12
Add project with excluded subdirectory
Igor Frenkel (a56da277) at 18 Mar 00:10
Add multi-module Maven project
Igor Frenkel (12eebeeb) at 18 Mar 00:09
Add pom.xml with committed maven.graph.json