Igor Frenkel activity https://gitlab.com/ifrenkel 2026-03-18T07:08:37Z

tag:gitlab.com,2026-03-18:5215961725 Igor Frenkel pushed new project branch main at GitLab.org / Application Security Testing Stage / Tests / ifrenkel / zstd-testing 2026-03-18T07:08:37Z ifrenkel Igor Frenkel

Igor Frenkel (caa8ee9c) at 18 Mar 07:08

Extract decompression.rb, add RSS pass to full run

... and 9 more commits

tag:gitlab.com,2026-03-18:5215960418 Igor Frenkel created project GitLab.org / Application Security Testing Stage / Tests / ifrenkel / zstd-testing 2026-03-18T07:08:10Z ifrenkel Igor Frenkel

tag:gitlab.com,2026-03-18:5215940033 Igor Frenkel commented on issue #593080 at GitLab.org / GitLab 2026-03-18T07:00:10Z ifrenkel Igor Frenkel

@onaaman these results are rough in their conclusions but fairly thorough in methodology. I used Claude, so I haven't cross-checked everything. But I went down a rabbit hole of different tradeoffs in format and compression parameters.

The main constraints I was optimizing for: archives need to support resumable streaming so instances can start processing before the full download finishes, they need to be small for network transfer and storage, and ruby process memory needs to be bounded and predictable so nodes in rails-background-jobs have predictable usage and no OOM.

With that in mind I looked at three formats: .zst.tar (compress each .ndjson individually with zstd, then tar them together), .tar.zst (tar the whole dataset directory then compress the stream), and a single concatenated file where all records are merged into one file with a _seq field injected as the first key of each record to track which sequence it came from.
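To make the third format concrete, here is a minimal sketch of how records could be merged with `_seq` as the first key. The helper name `concat_with_seq`, the hash-of-strings input, and the integer sequence values are all assumptions for illustration; the actual exporter logic was never written.

```ruby
require "json"

# Hypothetical helper: merges per-sequence NDJSON content into one stream,
# injecting "_seq" as the first key of every record.
# `files` maps a sequence identifier to that file's NDJSON text.
def concat_with_seq(files)
  lines = files.flat_map do |seq, ndjson|
    ndjson.each_line(chomp: true).reject(&:empty?).map do |line|
      record = JSON.parse(line)
      # Building the hash with "_seq" first keeps it first in the output,
      # because Ruby hashes preserve insertion order.
      JSON.generate({ "_seq" => seq }.merge(record))
    end
  end
  lines.join("\n") + "\n"
end

puts concat_with_seq({ 1 => "{\"a\":1}\n", 2 => "{\"c\":3}\n" })
```

A parser on the instance side would then read `_seq` from each record to tell which sequence it belongs to, which is part of the new parser work this format would require.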

Best results across seven purl types I sampled (advisories: pypi, maven, npm / licenses: go, maven, nuget, pypi):

Format                             Advisory ratio                    License ratio
.zst.tar L19                       ~12x                              ~14x
.tar.zst L19                       ~12x                              ~14x
.tar.zst L19 --long=27             40–50x                            15–27x
concat .ndjson.zst L19 --long=27   marginally better than row above  marginally better than row above

I tried dictionaries for .zst.tar at four sizes (100KB, 1MB, 10MB, 20MB) with several sampling strategies. The idea was that since the dictionary is trained on individual .ndjson files and each file is compressed separately, the structural mismatch that hurts stream compression (where 512-byte tar headers are interspersed with content the dictionary never saw during training) shouldn't apply. Results were still consistently neutral or worse, so dictionaries aren't worth it for this data.

.tar.zst with zstd -19 --long=27 is the winner. Why those specific settings:

  • L19 roughly doubles the ratio over default L3 without much hit on compression speed
  • --long=27 sets a 128MB back-reference window, which is the main lever for advisory datasets. Advisory datasets are 50–75MB, so a 128MB window covers the whole thing - every record can reference any earlier one, which is why you get 40–50x instead of 11–12x. For license datasets (228MB - 899MB) the window helps less but still adds a few x.
  • 128MB is exactly zstd's default decompression cap, so instances don't need to pass any special flags. Decompression memory is predictable at around 100 - 180MB RSS delta during streaming regardless of dataset size.
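The window arithmetic behind these bullets can be checked quickly. This is a small sketch that treats the MB figures above as MiB (an assumption) and uses hypothetical helper names; `--long=N` implies a window of 2^N bytes.

```ruby
MIB = 1024 * 1024

# Window size in bytes implied by zstd --long=<window_log>.
def window_bytes(window_log)
  1 << window_log
end

# True when a dataset of the given size fits entirely inside the window,
# so any record can back-reference any earlier record.
def fits_in_window?(dataset_mib, window_log: 27)
  dataset_mib * MIB <= window_bytes(window_log)
end

puts window_bytes(27) / MIB   # 128
puts fits_in_window?(75)      # true  (largest advisory dataset)
puts fits_in_window?(228)     # false (smallest license dataset)
```

This matches the observation that advisory datasets benefit most: they fit in the window whole, while license datasets only see the most recent 128MB of history at any point.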

The concatenated format had slightly better ratios (1 to 2x improvement) but needs new exporter logic, a new instance parser, and format versioning, so not worth it for a marginal gain.

licenses/go is an outlier at ~11x no matter what. Go pseudoversion strings (v0.0.0-YYYYMMDDHHMMSS-<12-char-hex-commit>) are basically incompressible entropy and there are a lot of them per record.
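For a sense of where that entropy sits, here is a rough sketch that splits a pseudoversion of the form shown above into its parts. The regex only covers the `v0.0.0-...` shape quoted here, and the helper name and sample version string are made up for illustration.

```ruby
# Matches only the v0.0.0-YYYYMMDDHHMMSS-<12 hex> shape described above.
PSEUDO_VERSION = /\Av0\.0\.0-(\d{14})-([0-9a-f]{12})\z/

# Hypothetical helper: returns the parts of a pseudoversion, or nil if the
# string does not have that shape.
def parse_pseudo_version(version)
  m = PSEUDO_VERSION.match(version)
  return nil unless m

  {
    timestamp: m[1],
    commit_prefix: m[2],
    # Each hex character carries 4 bits taken from a commit hash.
    hash_bits: m[2].length * 4
  }
end

p parse_pseudo_version("v0.0.0-20240102030405-0123456789ab")
p parse_pseudo_version("v1.2.3")
```

Each such string carries 48 bits of commit hash that no earlier record helps predict, on top of a timestamp that varies from version to version.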

zstd -22 --ultra gets another 15–33% on license datasets, but compression takes a lot longer (30+ minutes on some instance types, though this was measured in a Docker container). This is probably impractical for an exporter that adds to the file every day or multiple times per day.

Code changes needed:

Exporter

  • Install zstd >= 1.3.2 (the first release with --long)
  • Change the archive command (to be implemented in Go): tar -cf - <dataset>/ | zstd -19 --long=27 -f -o <output>.tar.zst
  • Update file extensions (.tar.gz → .tar.zst), download URLs, and content-type headers

Instance

  • Add zstd-ruby gem (statically linked, no system zstd dependency on the instance)
  • Add minitar gem for streaming tar reading - the standard library doesn't support streaming; it buffers the whole archive first, which defeats the point of predictable memory use.
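To illustrate what streaming means for memory here, the sketch below walks a tar stream header by header and hands out entry data in bounded chunks. It is hand-rolled for illustration only and is not minitar's API: it reads just the name and size fields, skips checksum validation, and does not handle long-name extensions. The method name and chunk size are assumptions.

```ruby
require "stringio"

BLOCK = 512 # tar header and padding unit, in bytes

# Yields (entry_name, chunk) pairs while holding at most chunk_size bytes
# of entry data in memory at a time.
def each_tar_chunk(io, chunk_size: 64 * 1024)
  loop do
    header = io.read(BLOCK)
    # End of archive: EOF, a short read, or an all-zero block.
    break if header.nil? || header.bytesize < BLOCK || header.count("\0") == BLOCK

    name = header[0, 100].delete("\0")            # name field: bytes 0..99
    size = header[124, 12].delete("\0 ").to_i(8)  # size field: octal, bytes 124..135

    remaining = size
    while remaining > 0
      chunk = io.read([chunk_size, remaining].min)
      break if chunk.nil? # truncated archive
      yield name, chunk
      remaining -= chunk.bytesize
    end

    padding = (BLOCK - size % BLOCK) % BLOCK      # entry data is padded to 512 bytes
    io.read(padding) if padding > 0
  end
end
```

In the real pipeline `io` would be the decompressed zstd stream rather than a file on disk, so peak memory is governed by the zstd window plus one chunk, not by the archive size.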

Edit: this is the test repo I used to hopefully make it easier to reproduce these results: https://gitlab.com/gitlab-org/secure/tests/ifrenkel/zstd-testing (the only variable is you need to sync data from the bucket to use as a dataset).

tag:gitlab.com,2026-03-18:5215874407 Igor Frenkel commented on merge request !495 at GitLab.org / security-products / analyzers / Dependency Scanning 2026-03-18T06:33:18Z ifrenkel Igor Frenkel

@onaaman could you do the initial review of this MR please? It's still in draft because the MR is in draft and there are some naming things to take care of. In addition there's a test suite that will help with automation of most of the test cases. I'm hoping to have it ready by 3.18.

tag:gitlab.com,2026-03-18:5215872549 Igor Frenkel pushed to project branch main at GitLab.org / Application Security Testing Stage / Tests / ifrenkel / DS pipeline integration tests 2026-03-18T06:32:21Z ifrenkel Igor Frenkel

Igor Frenkel (79466445) at 18 Mar 06:32

Fix minor doc inaccuracies: flags.go description, example project n...

... and 6 more commits

tag:gitlab.com,2026-03-18:5215846661 Igor Frenkel pushed to project branch main at GitLab.org / Application Security Testing Stage / Tests / ifrenkel / DS pipeline integration tests 2026-03-18T06:20:01Z ifrenkel Igor Frenkel

Igor Frenkel (5ef95016) at 18 Mar 06:20

Seed missing dependency-resolution expectations

... and 6 more commits

tag:gitlab.com,2026-03-18:5215805504 Igor Frenkel pushed to project branch main at GitLab.org / Application Security Testing Stage / Tests / ifrenkel / DS pipeline integration tests 2026-03-18T06:02:58Z ifrenkel Igor Frenkel

Igor Frenkel (b8acffcb) at 18 Mar 06:02

Seed missing dependency-resolution expectations

... and 17 more commits

tag:gitlab.com,2026-03-18:5215789300 Igor Frenkel pushed to project branch ifrenkel/588765-add-service-mode at GitLab.org / security-products / analyzers / Dependency Scanning 2026-03-18T05:54:31Z ifrenkel Igor Frenkel

Igor Frenkel (3556fc4f) at 18 Mar 05:54

Remove service package README

... and 8 more commits

tag:gitlab.com,2026-03-18:5215512940 Igor Frenkel pushed to project branch main at GitLab.org / security-products / dependencies / trivy-db 2026-03-18T03:11:52Z ifrenkel Igor Frenkel

Igor Frenkel (32625aa3) at 18 Mar 03:11

refactor(redhat-csaf): replace CustomPut with Store interface (#648)

tag:gitlab.com,2026-03-18:5215487184 Igor Frenkel pushed to project branch main at Igor Frenkel / Trivy Db Mirror 1 2026-03-18T02:57:10Z ifrenkel Igor Frenkel

Igor Frenkel (32625aa3) at 18 Mar 02:57

refactor(redhat-csaf): replace CustomPut with Store interface (#648)

tag:gitlab.com,2026-03-18:5215239729 Igor Frenkel pushed new project branch dependency-resolution-1773792418-maven-scope-filtering at GitLab.org / Application Security Testing Stage / Tests / if... 2026-03-18T00:22:37Z ifrenkel Igor Frenkel

Igor Frenkel (fe77a37b) at 18 Mar 00:22

Add pom.xml with provided-scoped dependency

tag:gitlab.com,2026-03-18:5215237430 Igor Frenkel pushed new project branch dependency-resolution-1773792418-mixed-monorepo at GitLab.org / Application Security Testing Stage / Tests / ifrenkel ... 2026-03-18T00:21:01Z ifrenkel Igor Frenkel

Igor Frenkel (e078e49a) at 18 Mar 00:21

Add pom.xml and Gemfile.lock

tag:gitlab.com,2026-03-18:5215236707 Igor Frenkel pushed new project branch dependency-resolution-1773792418-resolution-disabled at GitLab.org / Application Security Testing Stage / Tests / ifre... 2026-03-18T00:20:29Z ifrenkel Igor Frenkel

Igor Frenkel (c0928f4a) at 18 Mar 00:20

Add pom.xml with resolution disabled

tag:gitlab.com,2026-03-18:5215234215 Igor Frenkel pushed to project branch dependency-resolution-1773792418-ds-include-dev-deps at GitLab.org / Application Security Testing Stage / Tests / ifren... 2026-03-18T00:18:54Z ifrenkel Igor Frenkel

Igor Frenkel (5e2b3fa6) at 18 Mar 00:18

Disable DS_INCLUDE_DEV_DEPENDENCIES

tag:gitlab.com,2026-03-18:5215232053 Igor Frenkel pushed new project branch dependency-resolution-1773792418-ds-include-dev-deps at GitLab.org / Application Security Testing Stage / Tests / ifre... 2026-03-18T00:17:29Z ifrenkel Igor Frenkel

Igor Frenkel (fdd95cbd) at 18 Mar 00:17

Add pom.xml with test-scoped dependency

tag:gitlab.com,2026-03-18:5215228560 Igor Frenkel pushed new project branch dependency-resolution-1773792418-mvn-cli-opts at GitLab.org / Application Security Testing Stage / Tests / ifrenkel / ... 2026-03-18T00:15:55Z ifrenkel Igor Frenkel

Igor Frenkel (82c34f42) at 18 Mar 00:15

Add pom.xml with MVN_CLI_OPTS set

tag:gitlab.com,2026-03-18:5215224375 Igor Frenkel pushed new project branch dependency-resolution-1773792418-ds-max-depth at GitLab.org / Application Security Testing Stage / Tests / ifrenkel / ... 2026-03-18T00:14:19Z ifrenkel Igor Frenkel

Igor Frenkel (4f0757d6) at 18 Mar 00:14

Add project with deeply nested pom.xml

tag:gitlab.com,2026-03-18:5215221658 Igor Frenkel pushed new project branch dependency-resolution-1773792418-ds-excluded-paths at GitLab.org / Application Security Testing Stage / Tests / ifrenk... 2026-03-18T00:12:44Z ifrenkel Igor Frenkel

Igor Frenkel (85304d26) at 18 Mar 00:12

Add project with excluded subdirectory

tag:gitlab.com,2026-03-18:5215218223 Igor Frenkel pushed new project branch dependency-resolution-1773792418-multi-module at GitLab.org / Application Security Testing Stage / Tests / ifrenkel / ... 2026-03-18T00:10:47Z ifrenkel Igor Frenkel

Igor Frenkel (a56da277) at 18 Mar 00:10

Add multi-module Maven project

tag:gitlab.com,2026-03-18:5215215404 Igor Frenkel pushed new project branch dependency-resolution-1773792418-lockfile-present at GitLab.org / Application Security Testing Stage / Tests / ifrenke... 2026-03-18T00:09:12Z ifrenkel Igor Frenkel

Igor Frenkel (12eebeeb) at 18 Mar 00:09

Add pom.xml with committed maven.graph.json