Igor Frenkel (caa8ee9c) at 18 Mar 07:08
Extract decompression.rb, add RSS pass to full run
... and 9 more commits
@onaaman these results are rough in the sense of their conclusions but fairly thorough in methodology. I used Claude for parts of this, so I haven't cross-checked everything, but I went down a rabbit hole of different tradeoffs in format and compression parameters.
The main constraints I was optimizing for: archives need to support resumable streaming so instances can start processing before the full download finishes; they need to be small for network transfer and storage; and Ruby process memory needs to be bounded and predictable so nodes in rails-background-jobs have predictable usage and don't OOM.
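To make the bounded-memory constraint concrete, here's a minimal sketch (not code from this MR) of processing NDJSON from a stream in fixed-size chunks: memory stays at roughly one chunk plus one partial line, regardless of archive size. The chunk size and `StringIO` stand-in are assumptions for illustration.

```ruby
require "stringio"

CHUNK = 64 * 1024

# Yield complete NDJSON lines as they become available, holding at most
# one chunk plus one partial line in memory at any time.
def each_ndjson_record(io, chunk_size: CHUNK)
  buffer = +""
  while (chunk = io.read(chunk_size))
    buffer << chunk
    # Emit every complete line currently in the buffer.
    while (newline = buffer.index("\n"))
      yield buffer.slice!(0..newline).chomp
    end
  end
  yield buffer unless buffer.empty? # trailing record without a final newline
end

# Tiny chunk size to demonstrate that records spanning chunk boundaries work.
io = StringIO.new(%({"id":1}\n{"id":2}\n{"id":3}))
records = []
each_ndjson_record(io, chunk_size: 4) { |line| records << line }
puts records.length # => 3
```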
With that in mind I looked at three formats: .zst.tar (compress each .ndjson individually with zstd, then tar them together), .tar.zst (tar the whole dataset directory, then compress the stream), and a single concatenated .ndjson where all records are merged into one file, with a _seq field injected as the first key of each record to track which sequence it came from.
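For the concatenated format, the `_seq` injection described above could look roughly like this (a hypothetical sketch, not the exporter's actual code; `concat_ndjson` and the sample records are made up):

```ruby
require "json"

# Merge records from several .ndjson sources into one stream, injecting
# _seq as the FIRST key so the originating sequence is recoverable.
def concat_ndjson(sources)
  out = +""
  sources.each_with_index do |lines, seq|
    lines.each do |line|
      record = JSON.parse(line)
      # Ruby hashes preserve insertion order, so _seq serializes first.
      out << JSON.generate({ "_seq" => seq }.merge(record)) << "\n"
    end
  end
  out
end

a = [%({"id":"CVE-1"}), %({"id":"CVE-2"})]
b = [%({"id":"GHSA-9"})]
puts concat_ndjson([a, b])
# {"_seq":0,"id":"CVE-1"}
# {"_seq":0,"id":"CVE-2"}
# {"_seq":1,"id":"GHSA-9"}
```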
Best results across seven purl types I sampled (advisories: pypi, maven, npm / licenses: go, maven, nuget, pypi):
| Format | Advisory ratio | License ratio |
|---|---|---|
| `.zst.tar`, `-19` | ~12x | ~14x |
| `.tar.zst`, `-19` | ~12x | ~14x |
| `.tar.zst`, `-19 --long=27` | 40–50x | 15–27x |
| concat `.ndjson.zst`, `-19 --long=27` | marginally better than above row | marginally better than above row |
I tried dictionaries for .zst.tar at four sizes (100KB, 1MB, 10MB, 20MB) with several sampling strategies. The idea was that since the dictionary is trained on individual .ndjson files and each file is compressed separately, the structural mismatch that hurts stream compression (where 512-byte tar headers are interspersed with content the dictionary never saw during training) shouldn't apply. Results were still consistently neutral or worse. Dictionary didn't add much for this data.
.tar.zst with zstd -19 --long=27 is the winner. Why those specific settings:
--long=27 sets a 128MB back-reference window, which is the main lever for advisory datasets. Advisory datasets are 50–75MB, so a 128MB window covers the whole thing: every record can reference any earlier one, which is why you get 40–50x instead of 11–12x. For license datasets (228MB–899MB) the window helps less but still adds a few x.

The concatenated format had slightly better ratios (a 1–2x improvement) but needs new exporter logic, a new instance parser, and format versioning, so it's not worth it for a marginal gain.
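The window arithmetic is easy to check against the dataset sizes above (sizes in MB are the measurements from this comment; the script itself is just illustration):

```ruby
# --long=N gives a 2**N-byte window; --long=27 => 128MB.
window_mb = (1 << 27) / (1024 * 1024)

# Which datasets fit entirely inside the window (and so can back-reference
# every earlier record)?
{ "advisory (max)" => 75, "license (min)" => 228, "license (max)" => 899 }.each do |name, mb|
  puts format("%-15s %4dMB  fits in %dMB window: %s", name, mb, window_mb, mb <= window_mb)
end
```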
licenses/go is an outlier at ~11x no matter what. Go pseudoversion strings (v0.0.0-YYYYMMDDHHMMSS-<12-char-hex-commit>) are basically incompressible entropy and there are a lot of them per record.
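A quick way to see why, using stdlib Zlib as a stand-in for zstd (an illustration, not one of the actual measurements): pseudoversion-like strings are mostly random hex, so they barely compress, while repeated structural JSON compresses extremely well.

```ruby
require "zlib"
require "securerandom"

# Pseudoversion-shaped strings: constant prefix + 12 random hex chars,
# mimicking go's v0.0.0-YYYYMMDDHHMMSS-<commit> format.
pseudoversions = Array.new(2_000) do
  "v0.0.0-20240101120000-#{SecureRandom.hex(6)}"
end.join("\n")

# Highly repetitive JSON of comparable size.
repetitive = %({"license":"MIT","source":"example"}\n) * 2_000

ratio = ->(s) { s.bytesize.to_f / Zlib.deflate(s, 9).bytesize }
puts format("pseudoversions: %.1fx, repetitive: %.1fx",
            ratio.(pseudoversions), ratio.(repetitive))
```

The random commit-hash suffix is irreducible entropy, so no compressor setting recovers much ratio on it.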
zstd -22 --ultra gets another 15–33% on license datasets, but compression takes a lot longer (30+ minutes on some instance types, though this was inside a Docker container). That's probably impractical for the exporter, which adds to the file daily or multiple times per day.
Code changes needed:

Exporter

- `tar -cf - <dataset>/ | zstd -19 --long=27 -f -o <output>.tar.zst`
- `.tar.gz` → `.tar.zst`: file extension, download URLs, content-type headers

Instance

- zstd-ruby gem (statically linked, no system zstd dependency on the instance)
- minitar gem for streaming tar reading. The standard library doesn't support streaming; it buffers the whole archive first, which defeats the point of predictable memory use.

Edit: this is the test repo I used, to hopefully make it easier to reproduce these results: https://gitlab.com/gitlab-org/secure/tests/ifrenkel/zstd-testing (the only variable is you need to sync data from the bucket to use as a dataset).
@onaaman could you do the initial review of this MR please? It's still in draft because there are some naming things to take care of, and there's a test suite in progress that will help automate most of the test cases. I'm hoping to have it ready by 3.18.
Igor Frenkel (79466445) at 18 Mar 06:32
Fix minor doc inaccuracies: flags.go description, example project n...
... and 6 more commits
Igor Frenkel (5ef95016) at 18 Mar 06:20
Seed missing dependency-resolution expectations
... and 6 more commits
Igor Frenkel (b8acffcb) at 18 Mar 06:02
Seed missing dependency-resolution expectations
... and 17 more commits
Igor Frenkel (32625aa3) at 18 Mar 03:11
refactor(redhat-csaf): replace CustomPut with Store interface (#648)
Igor Frenkel (32625aa3) at 18 Mar 02:57
refactor(redhat-csaf): replace CustomPut with Store interface (#648)
Igor Frenkel (fe77a37b) at 18 Mar 00:22
Add pom.xml with provided-scoped dependency
Igor Frenkel (e078e49a) at 18 Mar 00:21
Add pom.xml and Gemfile.lock
Igor Frenkel (c0928f4a) at 18 Mar 00:20
Add pom.xml with resolution disabled
Igor Frenkel (5e2b3fa6) at 18 Mar 00:18
Disable DS_INCLUDE_DEV_DEPENDENCIES
Igor Frenkel (fdd95cbd) at 18 Mar 00:17
Add pom.xml with test-scoped dependency
Igor Frenkel (82c34f42) at 18 Mar 00:15
Add pom.xml with MVN_CLI_OPTS set
Igor Frenkel (4f0757d6) at 18 Mar 00:14
Add project with deeply nested pom.xml
Igor Frenkel (85304d26) at 18 Mar 00:12
Add project with excluded subdirectory
Igor Frenkel (a56da277) at 18 Mar 00:10
Add multi-module Maven project
Igor Frenkel (12eebeeb) at 18 Mar 00:09
Add pom.xml with committed maven.graph.json