Jayakrishnan Mallissery activity https://gitlab.com/jmallissery 2026-03-18T20:58:42Z tag:gitlab.com,2026-03-18:5219472479 Jayakrishnan Mallissery commented on issue #592903 at GitLab.org / GitLab 2026-03-18T20:58:42Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau I did some digging into this.

Can you take a look and see if it makes sense?

tag:gitlab.com,2026-03-18:5219435349 Jayakrishnan Mallissery commented on issue #592903 at GitLab.org / GitLab 2026-03-18T20:46:13Z jmallissery Jayakrishnan Mallissery [email protected]

Validation in GKE CN environment

Environment: 300 namespaces, rotation_period=5m, max_parallel=3, 200m CPU limit.


1. O(N²) bug reproduced — v2.4.4

Created 300 namespaces with OIDC named keys and bootstrapped key material.

Wave 1  2026-03-17T18:31 UTC   903 warnings
Wave 2  2026-03-17T18:47 UTC   713 warnings
Total                         1,616 warnings

Sample log:

{"@level":"warn","@message":"error rotating OIDC keys","@module":"secrets.identity.identity_9239dadb","err":"context deadline exceeded"}
{"@level":"warn","@message":"error expiring OIDC public keys","@module":"secrets.identity.identity_9239dadb","err":"context deadline exceeded"}

All goroutines emit the same module ID (identity_9239dadb) — confirms the shared-state O(N²) behaviour where every goroutine iterates all namespaces.
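The scaling difference is easy to see with a toy model (schematic only; the function names are mine, not OpenBao's): in the shared-state case every per-namespace goroutine walks the full namespace list, so one rotation window costs N×N storage visits instead of N.

```go
package main

import "fmt"

// sharedStateTicks models the v2.4.4 behaviour: one rotation goroutine
// per namespace, each iterating over ALL namespaces, so a single
// rotation window costs n*n visits.
func sharedStateTicks(n int) int {
	visits := 0
	for g := 0; g < n; g++ { // one goroutine per namespace...
		for ns := 0; ns < n; ns++ { // ...each visiting every namespace
			visits++
		}
	}
	return visits
}

// perNamespaceTicks models the fixed behaviour: each goroutine touches
// only its own namespace, so a window costs n visits.
func perNamespaceTicks(n int) int {
	visits := 0
	for g := 0; g < n; g++ {
		visits++
	}
	return visits
}

func main() {
	fmt.Println(sharedStateTicks(300))  // 90000 visits per window
	fmt.Println(perNamespaceTicks(300)) // 300 visits per window
}
```

At 300 namespaces that is 90,000 iterations every 5-minute window under a 200m CPU limit, which plausibly blows the context deadline and matches the warning volume observed above.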


2. v2.5.1 fixes the cascade

Upgraded to v2.5.1, monitored across multiple rotation windows:

kubectl logs -n gitlab -l app.kubernetes.io/name=openbao \
  --all-containers --since=15m | grep '"@level":"warn"' | grep -c "error rotating"
# 0

Zero cascade warnings. O(N²) fix confirmed.


3. Regression — non-root namespace rotation silently broken in v2.5.1

After confirming zero warnings, checked whether JWKS kids were actually rotating:

test-ns-1    T=0: 72c430c2 bfef93d2   T+12m: 72c430c2 bfef93d2   # unchanged
test-ns-100  T=0: 715151ca aca1f43c   T+12m: 715151ca aca1f43c   # unchanged
test-ns-300  T=0: f497c59c 526b86e1   T+12m: f497c59c 526b86e1   # unchanged

No rotation across any of the 300 non-root namespaces after 12+ rotation windows. Silent — no errors or warnings.

Root cause: the v2.5.1 refactor introduced a double namespace prefix. oidcPeriodicFunc passes nsPath + "identity/oidc" to MatchingStorageByAPIPath, but the router already prepends nsPath internally:

# Non-root namespace (ns.Path = "test-ns-1/")
"test-ns-1/" + "test-ns-1/identity/oidc" → no radix tree match → nil → silent early return

Root namespace (ns.Path = "") is unaffected, which masked the regression initially.
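The masking is visible in a tiny model of the router's key construction (schematic; `lookupKey` is illustrative, not the real router API): for the root namespace the prepended prefix is empty, so the doubled path degenerates to the correct one.

```go
package main

import "fmt"

// lookupKey models the router prepending the namespace path to the
// supplied API path before consulting its radix tree.
func lookupKey(nsPath, apiPath string) string {
	return nsPath + apiPath
}

func main() {
	// Buggy v2.5.1 caller: passes nsPath + "identity/oidc" as the API path.
	// Root namespace (nsPath = ""): doubling an empty prefix is harmless.
	fmt.Println(lookupKey("", ""+"identity/oidc"))
	// Non-root namespace: the prefix is applied twice, so the radix
	// tree has no entry for the resulting key.
	fmt.Println(lookupKey("test-ns-1/", "test-ns-1/"+"identity/oidc"))
}
```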

Fix is a one-liner in vault/identity_store_oidc.go:

// v2.5.1 — buggy
s := i.router.MatchingStorageByAPIPath(ctx, nsPath+"identity/oidc")

// Fixed — router prepends nsPath internally
s := i.router.MatchingStorageByAPIPath(ctx, "identity/oidc")

Upstream bug filed: https://github.com/openbao/openbao/issues/2664


4. Patched binary verified — GKE (Mar 18)

Re-ran v2.4.4 baseline (972 warnings, 3 waves), then deployed the patched binary.

Warnings:

Initial burst (first tick, all 300 overdue): 29
All subsequent windows:                       0

The 29 initial failures are expected — all 300 namespaces had overdue NextRotation on pod start and fired simultaneously. No cascade. In production, namespaces are provisioned gradually so this burst won't occur.

JWKS kids rotating (test-ns-300, 28-minute window):

T=0:    ab661e85 950d5a31 49f327de
T+5m:   8e015056 49f327de 337c775c
T+10m:  337c775c 8e015056 85df2fe5
T+28m:  f94c0644 f6c44966 85df2fe5

6 rotations in 28 min — matches rotation_period=5m. Expired keys cleaned from JWKS confirmed.
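The kid comparison above can be automated with a small helper that parses a JWKS document and extracts its kid set (a sketch; the payload shape is the standard JWKS format, and the sample values are from the table above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kids extracts the "kid" field of every key in a JWKS document.
// Comparing the result across rotation windows shows whether keys
// are actually rotating.
func kids(jwks []byte) ([]string, error) {
	var doc struct {
		Keys []struct {
			Kid string `json:"kid"`
		} `json:"keys"`
	}
	if err := json.Unmarshal(jwks, &doc); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(doc.Keys))
	for _, k := range doc.Keys {
		out = append(out, k.Kid)
	}
	return out, nil
}

func main() {
	before := []byte(`{"keys":[{"kid":"ab661e85"},{"kid":"950d5a31"},{"kid":"49f327de"}]}`)
	ks, _ := kids(before)
	fmt.Println(ks) // [ab661e85 950d5a31 49f327de]
}
```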


Summary

                                     v2.4.4            v2.5.1            v2.5.1 + fix
context deadline exceeded warnings   1,616 (cascade)   0                 29 (one-time burst)
Non-root namespace rotation          Eventually        Silently broken   Continuously
Expired key cleanup                  Eventually        Broken            Working
Retry cascade                        Yes               No                No

Recommendation: v2.5.1 eliminates the context deadline exceeded cascade but introduces a silent regression — non-root namespace OIDC keys never rotate. Upgrade to v2.5.1 with the fix from https://github.com/openbao/openbao/issues/2664.

tag:gitlab.com,2026-03-18:5219027173 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T18:33:59Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau Done

  1. Documentation issue - #594034
  2. Production change request - https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/21589

If it looks good, we can close this issue.

tag:gitlab.com,2026-03-18:5219011678 Jayakrishnan Mallissery opened issue #594034: Document Openbao recovery key generation in admin docs at GitLab.org / GitLab 2026-03-18T18:28:46Z jmallissery Jayakrishnan Mallissery [email protected]

tag:gitlab.com,2026-03-18:5218265628 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T15:35:15Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau Instead of mentioning this in the troubleshooting section, why not document this rake task and the need to run it to generate recovery keys for self-managed? Maybe in the post-installation docs. WDYT?

Do we need to create an issue to track the tasks we need to do w.r.t. running the rake task in Staging and Production to generate recovery keys? Do we need to do that before GA?

Let me know your thoughts. Maybe I am missing something.

tag:gitlab.com,2026-03-18:5218220110 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T15:26:23Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau I verified in a CN environment that running the rake task indeed fixed the warnings

tag:gitlab.com,2026-03-18:5218199812 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T15:21:55Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau I reproduced the bug in a CN environment and observed the warnings. Running the rake task solved the issue: the warnings disappeared.

Second run of the rake task — results:

  • Output: "Cannot get key, key has already been retrieved." — task detected SecretShares == 1, skipped key generation, called cancel_rotate_recovery (DELETE to clean up in-memory rotation state), exited cleanly
  • DB state: unchanged — still 1 recovery-key row (79 bytes), still 1 active Rails key row
  • No duplicate rows, no new key generated
  • Warning remains gone after restart — core/recovery-key is still present in PostgreSQL, upgradeRecoveryKey() takes the happy path

The cancel_rotate_recovery call at the end of the second run is important — without it, the in-memory recoveryRotationConfig set by initRecoveryRotation would block any future POST sys/rotate/recovery/init with "rotation already in progress" until the pod restarts. The task handles this correctly.

tag:gitlab.com,2026-03-18:5218004175 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T14:44:19Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau I updated the comment a bit. It looks like we have only implemented recovery key generation, not rotation of the recovery key.

A proper second rotation (shares=1 → new key) requires a multi-step ceremony via POST sys/rotate/recovery/update:
vault/logical_system_rotate.go#L676-L743
This requires providing the existing plaintext shard from the Rails DB as authorization. This flow is not implemented in the rake task. The rake task is a one-shot bootstrap tool only.
tag:gitlab.com,2026-03-18:5217215116 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T12:06:40Z jmallissery Jayakrishnan Mallissery [email protected]
Why do we store the recovery key in the Rails DB as well? Is it for redundancy?

Why the Rails DB Copy Is Operationally Necessary

What each store actually contains

  ┌───────────────────────────────────────────┬────────────────────┬────────────────────────────────┬─────────────────────────┐
  │                   Store                   │      Content       │          Encrypted by          │      Accessible to      │
  ├───────────────────────────────────────────┼────────────────────┼────────────────────────────────┼─────────────────────────┤
  │ OpenBao PostgreSQL core/recovery-key      │ KMS-encrypted blob │ GCP KMS                        │ OpenBao internally only │
  ├───────────────────────────────────────────┼────────────────────┼────────────────────────────────┼─────────────────────────┤
  │ Rails DB secrets_management_recovery_keys │ Plaintext shard    │ Rails Active Record Encryption │ Operators via GitLab    │
  └───────────────────────────────────────────┴────────────────────┴────────────────────────────────┴─────────────────────────┘

These are not two copies of the same thing — they serve fundamentally different roles.


The plaintext is only revealed once

When the rake task calls POST sys/rotate/recovery/init and OpenBao generates the key, it:

  1. Returns the plaintext shard in the API response
  2. Immediately stores it KMS-encrypted in PostgreSQL

After that moment, the plaintext exists nowhere inside OpenBao. OpenBao's PostgreSQL copy can only be recovered by calling GCP KMS to decrypt it — which OpenBao does internally. Operators cannot extract it directly.

The rake task captures the plaintext at the only moment it is exposed and stores it in the Rails DB. This is the operator's only persistent access to the plaintext.
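The reveal-once behaviour can be modelled as follows (schematic; the type and field names are mine, and the "encryption" is a placeholder, not a KMS call):

```go
package main

import (
	"errors"
	"fmt"
)

// keyStore models OpenBao's behaviour: the plaintext shard is returned
// exactly once at generation time; afterwards only the encrypted blob
// exists inside the store.
type keyStore struct {
	encryptedBlob string // stands in for the KMS-encrypted core/recovery-key
	revealed      bool
}

// generate returns the plaintext a single time and keeps only an
// "encrypted" form. A second call yields no key, mirroring the
// "Cannot get key, key has already been retrieved." rake-task output.
func (s *keyStore) generate() (string, error) {
	if s.revealed {
		return "", errors.New("key has already been retrieved")
	}
	plaintext := "shard-plaintext"
	s.encryptedBlob = "kms(" + plaintext + ")" // plaintext discarded after this
	s.revealed = true
	return plaintext, nil
}

func main() {
	s := &keyStore{}
	p, err := s.generate()
	fmt.Println(p, err) // first call: plaintext returned, no error
	_, err = s.generate()
	fmt.Println(err) // second call: key has already been retrieved
}
```

If the caller does not persist the return value of that first call, the plaintext is gone for good, which is exactly the role the Rails DB fills.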


The chicken-and-egg problem this solves

To use the recovery key for any break-glass operation (generate root token, re-key), an operator must provide the plaintext to POST sys/rotate/recovery/update. OpenBao then:

  1. Reads its KMS-encrypted copy from PostgreSQL
  2. Decrypts it with GCP KMS
  3. Compares against the operator-provided plaintext via VerifyRecoveryKey

If the plaintext was never stored anywhere, operators have no way to produce it. You cannot get it out of OpenBao's PostgreSQL without GCP KMS — and if GCP KMS is available, you don't need the recovery key in the first place.

The Rails DB is the only durable, operator-accessible store of the plaintext.


It also enables future rotation

When a proper rotation is eventually implemented (the current rake task only handles the initial bootstrap), the flow would be:

  1. Fetch current plaintext from Rails DB
  2. POST sys/rotate/recovery/init → get nonce
  3. POST sys/rotate/recovery/update → provide current plaintext + nonce → get new plaintext
  4. Store new plaintext in Rails DB, deactivate old record

Without the Rails DB, step 1 is impossible — there is no way to retrieve the current plaintext to authorize the next rotation.
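Based on the endpoints named above, the ceremony's request bodies would look roughly like this (a sketch only; the field names follow the sys/rotate API shape discussed in this thread, and the helper functions and placeholder values are mine):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// initBody builds the payload for POST sys/rotate/recovery/init.
func initBody(shares, threshold int) []byte {
	b, _ := json.Marshal(map[string]int{
		"secret_shares":    shares,
		"secret_threshold": threshold,
	})
	return b
}

// updateBody builds the payload for POST sys/rotate/recovery/update,
// authorizing the rotation with the current plaintext shard (fetched
// from the Rails DB) and the nonce returned by init.
// (Field names assumed from the Vault-style rekey API.)
func updateBody(shard, nonce string) []byte {
	b, _ := json.Marshal(map[string]string{
		"key":   shard,
		"nonce": nonce,
	})
	return b
}

func main() {
	fmt.Println(string(initBody(1, 1)))
	fmt.Println(string(updateBody("<shard-from-rails-db>", "<nonce-from-init>")))
}
```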


Independent security boundary (secondary benefit)

The two stores use entirely separate encryption domains:

  • OpenBao PostgreSQL: protected by GCP KMS
  • Rails DB: protected by Rails Active Record Encryption (application-level key)

An attacker compromising one does not compromise the other. But this is a secondary benefit — the primary reason is operational necessity, not defence-in-depth.


Bottom line: The Rails DB is not redundancy. It is the only place the plaintext shard durably lives. Without it, the recovery key is effectively unusable — stored in OpenBao but unreachable by any operator.

tag:gitlab.com,2026-03-18:5217206609 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T12:05:05Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau Can you review ☝️ the analysis of this warning in the thread above?

  • It appears to me that the warning is genuine, because we did NOT run the one-time rake task that generates and stores the OpenBao recovery key.
  • We need to validate this hypothesis by running the rake task in Staging and checking that the warning goes away after the recovery keys are generated and stored.
  • We also need to document this step for self-managed in the installation docs, I think.
tag:gitlab.com,2026-03-18:5217189653 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T12:01:05Z jmallissery Jayakrishnan Mallissery [email protected]
Risk of Never Running the Rake Task
Scenario                                     Risk
Normal read/write secrets, CI/CD pipelines   None — KMS handles everything
Root token is lost                           Unrecoverable — cannot run bao operator generate-root
Need to change KMS provider                  Blocked — seal migration requires recovery key
Need to re-key the barrier                   Blocked
Every pod restart                            WARN logged — noisy, masks real warnings

The rake task is a required initial operational step. The warning is the system's signal that this step has been skipped.

tag:gitlab.com,2026-03-18:5217188434 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T12:00:48Z jmallissery Jayakrishnan Mallissery [email protected]
Warning Disappears After Rake Task Runs — Permanently

Once core/recovery-key exists in Cloud SQL PostgreSQL, upgradeRecoveryKey takes the happy path on every subsequent restart:

vault/seal_autoseal.go#L444-L451

pe, err := d.core.physical.Get(ctx, recoveryKeyPath)
// pe is NOT nil after rake task runs
if pe == nil { ... }   // ← never reached again
// check KMS key ID, re-encrypt if rotated → return nil → no error → no WARN

Cloud SQL persists across Cloud Run revisions, pod restarts, and scaling to zero. The warning disappears permanently on all subsequent starts.


tag:gitlab.com,2026-03-18:5217185075 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T12:00:04Z jmallissery Jayakrishnan Mallissery [email protected]
What Happens When the Rake Task Runs a Second Time — Nothing Changes

On the second run, existingRecoveryConfig.SecretShares is now 1. InitRotation still calls initRecoveryRotation first (setting an in-memory rotation config), but then the fast path check fails:

vault/rotate.go#L128-L157

if existingRecoveryConfig.SecretShares == 0 {
    // ← SKIPPED on second run, shares is now 1
}
return nil, nil   // rotation was started in memory but no key is generated

handleRotateInitPut falls through to handleRotateInitGet which returns a status response with no "keys" field:

vault/logical_system_rotate.go#L635-L660

The rake task sees no "keys" and cancels:

ee/lib/tasks/gitlab/secrets_management/openbao.rake#L100-L107

else
  puts "Cannot get key, key has already been retrieved."
  secrets_manager_client.cancel_rotate_recovery   # DELETE sys/rotate/recovery/init
  nil
end

The cancel_rotate_recovery call is essential, not just a courtesy. Because initRecoveryRotation was already called and set c.recoveryRotationConfig, if the DELETE fails, any subsequent call to POST sys/rotate/recovery/init will return "rotation already in progress" until the pod restarts. The net result when cancel succeeds: no PostgreSQL state changes, the warning continues firing as before.
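The in-memory lock behaviour can be sketched like this (schematic; not OpenBao's actual types, just the init/cancel semantics described above):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// rotationGuard models the in-memory recoveryRotationConfig: init sets
// state, a second init is rejected, and only cancel (or a pod restart)
// clears it.
type rotationGuard struct {
	mu         sync.Mutex
	inProgress bool
}

func (g *rotationGuard) Init() error { // POST sys/rotate/recovery/init
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inProgress {
		return errors.New("rotation already in progress")
	}
	g.inProgress = true
	return nil
}

func (g *rotationGuard) Cancel() { // DELETE sys/rotate/recovery/init
	g.mu.Lock()
	defer g.mu.Unlock()
	g.inProgress = false
}

func main() {
	g := &rotationGuard{}
	fmt.Println(g.Init()) // nil: first init succeeds
	fmt.Println(g.Init()) // rotation already in progress
	g.Cancel()
	fmt.Println(g.Init()) // nil again after cancel
}
```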

A proper second rotation (shares=1 → new key) requires a multi-step ceremony via POST sys/rotate/recovery/update:

vault/logical_system_rotate.go#L676-L743

This requires providing the existing plaintext shard from the Rails DB as authorization. This flow is not implemented in the rake task. The rake task is a one-shot bootstrap tool only.


tag:gitlab.com,2026-03-18:5217182758 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T11:59:30Z jmallissery Jayakrishnan Mallissery [email protected]
What Happens on Every Unseal — Recovery-Key-Related Steps in vault/core.go
1. Physical cache purged
2. Recovery config cache cleared    (forces re-read from DB on next access)
3. unsealer.unseal()
     → reads "core/hsm/barrier-unseal-keys" from PostgreSQL
     → GCP KMS decrypts it → gets barrier key
     → barrier key unlocks storage layer
4. if autoSeal: seal.UpgradeKeys()
     a. Encrypt("a")              → refreshes internal KMS key ID
     b. upgradeRecoveryKey()
          BEFORE rake task: pe == nil → error → WARN logged
          AFTER  rake task: pe exists → check KMS key ID → re-encrypt if rotated → nil (no-op)
     c. upgradeStoredKeys()
          → re-encrypts "core/hsm/barrier-unseal-keys" if KMS key version changed
5. StartHealthCheck()              → background goroutine pings KMS every 10 min
6. [INFO] core: post-unseal setup complete

The recovery key is never used during normal unsealing. It is only touched in step 4b to handle KMS key rotation.


tag:gitlab.com,2026-03-18:5217178468 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T11:58:28Z jmallissery Jayakrishnan Mallissery [email protected]
How Recovery is Managed in GitLab Secrets Manager — Two-Phase Model

Phase 1: Initialization with recovery_shares=0

OpenBao initializes with no recovery key. KMS-based auto-unseal works immediately. core/recovery-key is never written. Every pod restart triggers the WARN.

Phase 2: Bootstrap via Rake Task

ee/lib/tasks/gitlab/secrets_management/openbao.rake#L67-L113

task :recovery_key_retrieve, [] => :gitlab_environment do
  privileged_jwt = SecretsManagement::GlobalSecretsManagerJwt.new.encoded
  secrets_manager_client = SecretsManagement::SecretsManagerClient.new(jwt: privileged_jwt)

  result = secrets_manager_client.init_rotate_recovery
  if result["data"].key? "keys"
    key = result["data"]["keys"][0]
    # deactivate old key, store new key in Rails DB, mark active
  else
    puts "Cannot get key, key has already been retrieved."
    secrets_manager_client.cancel_rotate_recovery
  end
end

The task uses a privileged system-level JWT (not user/project scoped):

ee/lib/secrets_management/global_secrets_manager_jwt.rb#L26-L27

secrets_manager_scope: 'privileged',
sub: 'gitlab_secrets_manager'    # system UID, not a real user

It calls POST sys/rotate/recovery/init with shares=1, threshold=1:

ee/lib/secrets_management/secrets_manager_client.rb#L276-L283

OPENBAO_RECOVERY_SHARES_THRESHOLD = 1

def init_rotate_recovery
  recovery_values = {
    secret_shares: OPENBAO_RECOVERY_SHARES_THRESHOLD,    # = 1
    secret_threshold: OPENBAO_RECOVERY_SHARES_THRESHOLD  # = 1
  }
  make_request(:post, rotate_recovery_url, recovery_values)
end

Because existingRecoveryConfig.SecretShares == 0, OpenBao takes a special fast path — generates the key and returns it immediately without any multi-party ceremony:

vault/rotate.go#L129-L157

 if existingRecoveryConfig.SecretShares == 0 {
      // special path: no ceremony needed, generate and return immediately
      newRecoveryKey, result, err := c.generateKey(c.recoveryRotationConfig, true)
      ...
      c.performRecoveryRekey(ctx, newRecoveryKey)  // writes to PostgreSQL, updates config to shares=1
      return result, nil                           // returns key shard to caller
  }
  return nil, nil  // shares>0: key not returned immediately; falls through to handleRotateInitGet (status response, no "keys" field)

performRecoveryRekey writes two durable records to PostgreSQL:

vault/rekey.go#L743-L759

c.seal.SetRecoveryKey(ctx, newRootKey)                   // writes "core/recovery-key" (KMS-encrypted)
c.seal.SetRecoveryConfig(ctx, c.recoveryRotationConfig)  // updates "core/recovery-config" to shares=1

The rake task then stores the plaintext shard in GitLab's Rails database:

ee/app/models/secrets_management/recovery_key.rb

  class RecoveryKey < ApplicationRecord
    self.table_name = 'secrets_management_recovery_keys'
    encrypts :key              # Rails 7 Active Record Encryption (not attr_encrypted gem)
    validate :no_other_active  # maximum one active key at a time
  end

After the rake task, the shard lives in two places for different purposes:

Location                                          What's stored       Encrypted by                     Purpose
OpenBao PostgreSQL core/recovery-key              Recovery key blob   GCP KMS                          Used by OpenBao to verify the operator-provided shard during re-key
GitLab Rails DB secrets_management_recovery_keys  Plaintext shard     Rails Active Record Encryption   Operator's copy — provided as input to authorize re-key operations

tag:gitlab.com,2026-03-18:5217175411 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T11:57:42Z jmallissery Jayakrishnan Mallissery [email protected]
Role of GCP KMS

GCP KMS is the auto-seal backend. It encrypts and decrypts the barrier key so OpenBao can seal/unseal without human intervention.

Every pod restart — auto-unseal flow:

OpenBao starts
  → reads "core/hsm/barrier-unseal-keys" from PostgreSQL  (KMS-encrypted blob)
  → calls GCP KMS Decrypt API                             → gets barrier key (plaintext)
  → barrier key decrypts OpenBao's storage encryption layer
  → OpenBao is unsealed

Every UpgradeKeys call — KMS key rotation handling:

GCP KMS periodically rotates its key versions. UpgradeKeys re-encrypts both core/hsm/barrier-unseal-keys and core/recovery-key with the current KMS key version, implementing the NIST recommendation to re-encrypt after key rotation.
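The re-encrypt-on-rotation check amounts to the following (schematic; the types and the stand-in encrypt function are mine, not OpenBao's):

```go
package main

import "fmt"

// storedBlob pairs ciphertext with the KMS key version that produced it.
type storedBlob struct {
	keyVersion string
	ciphertext string
}

// upgradeIfRotated re-encrypts a blob when the KMS key version it was
// written under no longer matches the current one. encrypt stands in
// for the KMS round-trip (decrypt under the old version, encrypt under
// the new one). Returns the (possibly new) blob and whether it changed.
func upgradeIfRotated(b storedBlob, currentVersion string,
	encrypt func(plaintext, version string) string) (storedBlob, bool) {
	if b.keyVersion == currentVersion {
		return b, false // no-op: the happy path on most unseal cycles
	}
	plaintext := "decrypted(" + b.ciphertext + ")" // placeholder decrypt
	return storedBlob{
		keyVersion: currentVersion,
		ciphertext: encrypt(plaintext, currentVersion),
	}, true
}

func main() {
	enc := func(pt, v string) string { return "enc:" + v + ":" + pt }
	b := storedBlob{keyVersion: "v1", ciphertext: "blob"}
	_, changed := upgradeIfRotated(b, "v1", enc)
	fmt.Println(changed) // false: same KMS key version, nothing to do
	_, changed = upgradeIfRotated(b, "v2", enc)
	fmt.Println(changed) // true: KMS key rotated, blob re-encrypted
}
```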

The recovery key blob is also KMS-encrypted when stored:

vault/seal_autoseal.go#L382-L413

func (d *autoSeal) SetRecoveryKey(ctx context.Context, key []byte) error {
    blobInfo, err := d.Encrypt(ctx, key, nil)   // GCP KMS encrypts it
    ...
    be := &physical.Entry{Key: recoveryKeyPath, Value: value}
    d.core.physical.Put(ctx, be)                // stored in PostgreSQL
}

VerifyRecoveryKey also needs KMS to decrypt the stored blob for verification:

vault/seal_autoseal.go#L365-L380

func (d *autoSeal) VerifyRecoveryKey(ctx context.Context, key []byte) error {
    pt, err := d.getRecoveryKeyInternal(ctx)   // needs KMS to decrypt "core/recovery-key"
    ...
    if subtle.ConstantTimeCompare(key, pt) != 1 {
        return errors.New("recovery key does not match submitted values")
    }
}

Critical implication: GCP KMS failure = total outage regardless of whether a recovery key exists. The recovery key does not protect against KMS unavailability.