Jayakrishnan Mallissery activity https://gitlab.com/jmallissery 2026-03-18T20:58:42Z tag:gitlab.com,2026-03-18:5219472479 Jayakrishnan Mallissery commented on issue #592903 at GitLab.org / GitLab 2026-03-18T20:58:42Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau I did some digging into this.

Can you take a look and see if it makes sense?

tag:gitlab.com,2026-03-18:5219435349 Jayakrishnan Mallissery commented on issue #592903 at GitLab.org / GitLab 2026-03-18T20:46:13Z jmallissery Jayakrishnan Mallissery [email protected]

Validation in GKE CN environment

Environment: 300 namespaces, rotation_period=5m, max_parallel=3, 200m CPU limit.


1. O(N²) bug reproduced — v2.4.4

Created 300 namespaces with OIDC named keys and bootstrapped key material.

Wave 1  2026-03-17T18:31 UTC   903 warnings
Wave 2  2026-03-17T18:47 UTC   713 warnings
Total                         1,616 warnings

Sample log:

{"@level":"warn","@message":"error rotating OIDC keys","@module":"secrets.identity.identity_9239dadb","err":"context deadline exceeded"}
{"@level":"warn","@message":"error expiring OIDC public keys","@module":"secrets.identity.identity_9239dadb","err":"context deadline exceeded"}

All goroutines emit the same module ID (identity_9239dadb) — confirms the shared-state O(N²) behaviour where every goroutine iterates all namespaces.
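The scaling difference is easy to see with a toy model (schematic only; the function names are mine, not OpenBao's): in the shared-state case every per-namespace goroutine walks the full namespace list, so one rotation window costs N×N storage visits instead of N.

```go
package main

import "fmt"

// sharedStateTicks models the v2.4.4 behaviour: one rotation goroutine
// per namespace, each iterating over ALL namespaces, so a single
// rotation window costs n*n visits.
func sharedStateTicks(n int) int {
	visits := 0
	for g := 0; g < n; g++ { // one goroutine per namespace...
		for ns := 0; ns < n; ns++ { // ...each visiting every namespace
			visits++
		}
	}
	return visits
}

// perNamespaceTicks models the fixed behaviour: each goroutine touches
// only its own namespace, so a window costs n visits.
func perNamespaceTicks(n int) int {
	visits := 0
	for g := 0; g < n; g++ {
		visits++
	}
	return visits
}

func main() {
	fmt.Println(sharedStateTicks(300))  // 90000 visits per window
	fmt.Println(perNamespaceTicks(300)) // 300 visits per window
}
```

At 300 namespaces that is 90,000 iterations every 5-minute window under a 200m CPU limit, which plausibly blows the context deadline and matches the warning volume observed above.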


2. v2.5.1 fixes the cascade

Upgraded to v2.5.1, monitored across multiple rotation windows:

kubectl logs -n gitlab -l app.kubernetes.io/name=openbao \
  --all-containers --since=15m | grep '"@level":"warn"' | grep -c "error rotating"
# 0

Zero cascade warnings. O(N²) fix confirmed.


3. Regression — non-root namespace rotation silently broken in v2.5.1

After confirming zero warnings, checked whether JWKS kids were actually rotating:

test-ns-1    T=0: 72c430c2 bfef93d2   T+12m: 72c430c2 bfef93d2   # unchanged
test-ns-100  T=0: 715151ca aca1f43c   T+12m: 715151ca aca1f43c   # unchanged
test-ns-300  T=0: f497c59c 526b86e1   T+12m: f497c59c 526b86e1   # unchanged

No rotation across any of the 300 non-root namespaces after 12+ rotation windows. Silent — no errors or warnings.

Root cause: the v2.5.1 refactor introduced a double namespace prefix. oidcPeriodicFunc passes nsPath + "identity/oidc" to MatchingStorageByAPIPath, but the router already prepends nsPath internally:

# Non-root namespace (ns.Path = "test-ns-1/")
"test-ns-1/" + "test-ns-1/identity/oidc" → no radix tree match → nil → silent early return

Root namespace (ns.Path = "") is unaffected, which masked the regression initially.
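The masking is visible in a tiny model of the router's key construction (schematic; `lookupKey` is illustrative, not the real router API): for the root namespace the prepended prefix is empty, so the doubled path degenerates to the correct one.

```go
package main

import "fmt"

// lookupKey models the router prepending the namespace path to the
// supplied API path before consulting its radix tree.
func lookupKey(nsPath, apiPath string) string {
	return nsPath + apiPath
}

func main() {
	// Buggy v2.5.1 caller: passes nsPath + "identity/oidc" as the API path.
	// Root namespace (nsPath = ""): doubling an empty prefix is harmless.
	fmt.Println(lookupKey("", ""+"identity/oidc"))
	// Non-root namespace: the prefix is applied twice, so the radix
	// tree has no entry for the resulting key.
	fmt.Println(lookupKey("test-ns-1/", "test-ns-1/"+"identity/oidc"))
}
```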

Fix is a one-liner in vault/identity_store_oidc.go:

// v2.5.1 — buggy
s := i.router.MatchingStorageByAPIPath(ctx, nsPath+"identity/oidc")

// Fixed — router prepends nsPath internally
s := i.router.MatchingStorageByAPIPath(ctx, "identity/oidc")

Upstream bug filed: https://github.com/openbao/openbao/issues/2664


4. Patched binary verified — GKE (Mar 18)

Re-ran v2.4.4 baseline (972 warnings, 3 waves), then deployed the patched binary.

Warnings:

Initial burst (first tick, all 300 overdue): 29
All subsequent windows:                       0

The 29 initial failures are expected — all 300 namespaces had overdue NextRotation on pod start and fired simultaneously. No cascade. In production, namespaces are provisioned gradually so this burst won't occur.

JWKS kids rotating (test-ns-300, 28-minute window):

T=0:    ab661e85 950d5a31 49f327de
T+5m:   8e015056 49f327de 337c775c
T+10m:  337c775c 8e015056 85df2fe5
T+28m:  f94c0644 f6c44966 85df2fe5

6 rotations in 28 min — matches rotation_period=5m. Expired keys cleaned from JWKS confirmed.
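The kid comparison above can be automated with a small helper that parses a JWKS document and extracts its kid set (a sketch; the payload shape is the standard JWKS format, and the sample values are from the table above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kids extracts the "kid" field of every key in a JWKS document.
// Comparing the result across rotation windows shows whether keys
// are actually rotating.
func kids(jwks []byte) ([]string, error) {
	var doc struct {
		Keys []struct {
			Kid string `json:"kid"`
		} `json:"keys"`
	}
	if err := json.Unmarshal(jwks, &doc); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(doc.Keys))
	for _, k := range doc.Keys {
		out = append(out, k.Kid)
	}
	return out, nil
}

func main() {
	before := []byte(`{"keys":[{"kid":"ab661e85"},{"kid":"950d5a31"},{"kid":"49f327de"}]}`)
	ks, _ := kids(before)
	fmt.Println(ks) // [ab661e85 950d5a31 49f327de]
}
```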


Summary

                                     v2.4.4            v2.5.1            v2.5.1 + fix
context deadline exceeded warnings   1,616 (cascade)   0                 29 (one-time burst)
Non-root namespace rotation          Eventually        Silently broken   Continuously
Expired key cleanup                  Eventually        Broken            Working
Retry cascade                        Yes               No                No

Recommendation: v2.5.1 eliminates the context deadline exceeded cascade but introduces a silent regression — non-root namespace OIDC keys never rotate. Upgrade to v2.5.1 with the fix from https://github.com/openbao/openbao/issues/2664.

tag:gitlab.com,2026-03-18:5219027173 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T18:33:59Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau Done

  1. Documentation issue - #594034
  2. Production change request - https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/21589

If it looks good, we can close this issue.

tag:gitlab.com,2026-03-18:5219011678 Jayakrishnan Mallissery opened issue #594034: Document Openbao recovery key generation in admin docs at GitLab.org / GitLab 2026-03-18T18:28:46Z jmallissery Jayakrishnan Mallissery [email protected]

tag:gitlab.com,2026-03-18:5218265628 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T15:35:15Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau Instead of mentioning this in the troubleshooting section, why not document this rake task and the need to run it to generate recovery keys for self-managed? Maybe in the post-installation docs. WDYT?

Do we need to create an issue to track the tasks we need to do w.r.t. running the rake task in Staging and Production to generate recovery keys? Do we need to do that before GA?

Let me know your thoughts. Maybe I am missing something.

tag:gitlab.com,2026-03-18:5218220110 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T15:26:23Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau I verified in a CN environment that running the rake task indeed fixed the warnings

tag:gitlab.com,2026-03-18:5218199812 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T15:21:55Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau I reproduced the bug in a CN environment and observed the warnings. Running the rake task solved the issue: the warnings disappeared.

Second run of the rake task — results:

  • Output: "Cannot get key, key has already been retrieved." — task detected SecretShares == 1, skipped key generation, called cancel_rotate_recovery (DELETE to clean up in-memory rotation state), exited cleanly
  • DB state: unchanged — still 1 recovery-key row (79 bytes), still 1 active Rails key row
  • No duplicate rows, no new key generated
  • Warning remains gone after restart — core/recovery-key is still present in PostgreSQL, upgradeRecoveryKey() takes the happy path

The cancel_rotate_recovery call at the end of the second run is important — without it, the in-memory recoveryRotationConfig set by initRecoveryRotation would block any future POST sys/rotate/recovery/init with "rotation already in progress" until the pod restarts. The task handles this correctly.

tag:gitlab.com,2026-03-18:5218004175 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T14:44:19Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau I updated the comment a bit. It looks like we have only implemented recovery key generation, not rotation of the recovery key.

A proper second rotation (shares=1 → new key) requires a multi-step ceremony via POST sys/rotate/recovery/update:
vault/logical_system_rotate.go#L676-L743
This requires providing the existing plaintext shard from the Rails DB as authorization. This flow is not implemented in the rake task. The rake task is a one-shot bootstrap tool only.
tag:gitlab.com,2026-03-18:5217215116 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T12:06:40Z jmallissery Jayakrishnan Mallissery [email protected]
Why do we store the recovery key in the Rails DB as well? Is it for redundancy?

Why the Rails DB Copy Is Operationally Necessary

What each store actually contains

  ┌───────────────────────────────────────────┬────────────────────┬────────────────────────────────┬─────────────────────────┐
  │                   Store                   │      Content       │          Encrypted by          │      Accessible to      │
  ├───────────────────────────────────────────┼────────────────────┼────────────────────────────────┼─────────────────────────┤
  │ OpenBao PostgreSQL core/recovery-key      │ KMS-encrypted blob │ GCP KMS                        │ OpenBao internally only │
  ├───────────────────────────────────────────┼────────────────────┼────────────────────────────────┼─────────────────────────┤
  │ Rails DB secrets_management_recovery_keys │ Plaintext shard    │ Rails Active Record Encryption │ Operators via GitLab    │
  └───────────────────────────────────────────┴────────────────────┴────────────────────────────────┴─────────────────────────┘

These are not two copies of the same thing — they serve fundamentally different roles.


The plaintext is only revealed once

When the rake task calls POST sys/rotate/recovery/init and OpenBao generates the key, it:

  1. Returns the plaintext shard in the API response
  2. Immediately stores it KMS-encrypted in PostgreSQL

After that moment, the plaintext exists nowhere inside OpenBao. OpenBao's PostgreSQL copy can only be recovered by calling GCP KMS to decrypt it — which OpenBao does internally. Operators cannot extract it directly.

The rake task captures the plaintext at the only moment it is exposed and stores it in the Rails DB. This is the operator's only persistent access to the plaintext.
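The reveal-once behaviour can be modelled as follows (schematic; the type and field names are mine, and the "encryption" is a placeholder, not a KMS call):

```go
package main

import (
	"errors"
	"fmt"
)

// keyStore models OpenBao's behaviour: the plaintext shard is returned
// exactly once at generation time; afterwards only the encrypted blob
// exists inside the store.
type keyStore struct {
	encryptedBlob string // stands in for the KMS-encrypted core/recovery-key
	revealed      bool
}

// generate returns the plaintext a single time and keeps only an
// "encrypted" form. A second call yields no key, mirroring the
// "Cannot get key, key has already been retrieved." rake-task output.
func (s *keyStore) generate() (string, error) {
	if s.revealed {
		return "", errors.New("key has already been retrieved")
	}
	plaintext := "shard-plaintext"
	s.encryptedBlob = "kms(" + plaintext + ")" // plaintext discarded after this
	s.revealed = true
	return plaintext, nil
}

func main() {
	s := &keyStore{}
	p, err := s.generate()
	fmt.Println(p, err) // first call: plaintext returned, no error
	_, err = s.generate()
	fmt.Println(err) // second call: key has already been retrieved
}
```

If the caller does not persist the return value of that first call, the plaintext is gone for good, which is exactly the role the Rails DB fills.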


The chicken-and-egg problem this solves

To use the recovery key for any break-glass operation (generate root token, re-key), an operator must provide the plaintext to POST sys/rotate/recovery/update. OpenBao then:

  1. Reads its KMS-encrypted copy from PostgreSQL
  2. Decrypts it with GCP KMS
  3. Compares against the operator-provided plaintext via VerifyRecoveryKey

If the plaintext was never stored anywhere, operators have no way to produce it. You cannot get it out of OpenBao's PostgreSQL without GCP KMS — and if GCP KMS is available, you don't need the recovery key in the first place.

The Rails DB is the only durable, operator-accessible store of the plaintext.


It also enables future rotation

When a proper rotation is eventually implemented (the current rake task only handles the initial bootstrap), the flow would be:

  1. Fetch current plaintext from Rails DB
  2. POST sys/rotate/recovery/init → get nonce
  3. POST sys/rotate/recovery/update → provide current plaintext + nonce → get new plaintext
  4. Store new plaintext in Rails DB, deactivate old record

Without the Rails DB, step 1 is impossible — there is no way to retrieve the current plaintext to authorize the next rotation.
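Based on the endpoints named above, the ceremony's request bodies would look roughly like this (a sketch only; the field names follow the sys/rotate API shape discussed in this thread, and the helper functions and placeholder values are mine):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// initBody builds the payload for POST sys/rotate/recovery/init.
func initBody(shares, threshold int) []byte {
	b, _ := json.Marshal(map[string]int{
		"secret_shares":    shares,
		"secret_threshold": threshold,
	})
	return b
}

// updateBody builds the payload for POST sys/rotate/recovery/update,
// authorizing the rotation with the current plaintext shard (fetched
// from the Rails DB) and the nonce returned by init.
// (Field names assumed from the Vault-style rekey API.)
func updateBody(shard, nonce string) []byte {
	b, _ := json.Marshal(map[string]string{
		"key":   shard,
		"nonce": nonce,
	})
	return b
}

func main() {
	fmt.Println(string(initBody(1, 1)))
	fmt.Println(string(updateBody("<shard-from-rails-db>", "<nonce-from-init>")))
}
```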


Independent security boundary (secondary benefit)

The two stores use entirely separate encryption domains:

  • OpenBao PostgreSQL: protected by GCP KMS
  • Rails DB: protected by Rails Active Record Encryption (application-level key)

An attacker compromising one does not compromise the other. But this is a secondary benefit — the primary reason is operational necessity, not defence-in-depth.


Bottom line: The Rails DB is not redundancy. It is the only place the plaintext shard durably lives. Without it, the recovery key is effectively unusable — stored in OpenBao but unreachable by any operator.

tag:gitlab.com,2026-03-18:5217206609 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T12:05:05Z jmallissery Jayakrishnan Mallissery [email protected]

@fcatteau Can you review ☝️ the analysis of this warning in the thread above?

  • It appears to me that the warning is genuine, because we did NOT run the one-time rake task that generates and stores the OpenBao recovery key.
  • We need to validate this hypothesis by running the rake task in Staging and checking that the warning goes away after the recovery keys are generated and stored.
  • We also need to document this step for self-managed in the installation docs, I think.
tag:gitlab.com,2026-03-18:5217189653 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T12:01:05Z jmallissery Jayakrishnan Mallissery [email protected]
Risk of Never Running the Rake Task
Scenario                                     Risk
Normal read/write secrets, CI/CD pipelines   None — KMS handles everything
Root token is lost                           Unrecoverable — cannot run bao operator generate-root
Need to change KMS provider                  Blocked — seal migration requires recovery key
Need to re-key the barrier                   Blocked
Every pod restart                            WARN logged — noisy, masks real warnings

The rake task is a required initial operational step. The warning is the system's signal that this step has been skipped.

tag:gitlab.com,2026-03-18:5217188434 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T12:00:48Z jmallissery Jayakrishnan Mallissery [email protected]
Warning Disappears After Rake Task Runs — Permanently

Once core/recovery-key exists in Cloud SQL PostgreSQL, upgradeRecoveryKey takes the happy path on every subsequent restart:

vault/seal_autoseal.go#L444-L451

pe, err := d.core.physical.Get(ctx, recoveryKeyPath)
// pe is NOT nil after rake task runs
if pe == nil { ... }   // ← never reached again
// check KMS key ID, re-encrypt if rotated → return nil → no error → no WARN

Cloud SQL persists across Cloud Run revisions, pod restarts, and scaling to zero. The warning disappears permanently on all subsequent starts.


tag:gitlab.com,2026-03-18:5217185075 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T12:00:04Z jmallissery Jayakrishnan Mallissery [email protected]
What Happens When the Rake Task Runs a Second Time — Nothing Changes

On the second run, existingRecoveryConfig.SecretShares is now 1. InitRotation still calls initRecoveryRotation first (setting an in-memory rotation config), but then the fast path check fails:

vault/rotate.go#L128-L157

if existingRecoveryConfig.SecretShares == 0 {
    // ← SKIPPED on second run, shares is now 1
}
return nil, nil   // rotation was started in memory but no key is generated

handleRotateInitPut falls through to handleRotateInitGet which returns a status response with no "keys" field:

vault/logical_system_rotate.go#L635-L660

The rake task sees no "keys" and cancels:

ee/lib/tasks/gitlab/secrets_management/openbao.rake#L100-L107

else
  puts "Cannot get key, key has already been retrieved."
  secrets_manager_client.cancel_rotate_recovery   # DELETE sys/rotate/recovery/init
  nil
end

The cancel_rotate_recovery call is essential, not just a courtesy. Because initRecoveryRotation was already called and set c.recoveryRotationConfig, if the DELETE fails, any subsequent call to POST sys/rotate/recovery/init will return "rotation already in progress" until the pod restarts. The net result when cancel succeeds: no PostgreSQL state changes, the warning continues firing as before.
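The in-memory lock behaviour can be sketched like this (schematic; not OpenBao's actual types, just the init/cancel semantics described above):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// rotationGuard models the in-memory recoveryRotationConfig: init sets
// state, a second init is rejected, and only cancel (or a pod restart)
// clears it.
type rotationGuard struct {
	mu         sync.Mutex
	inProgress bool
}

func (g *rotationGuard) Init() error { // POST sys/rotate/recovery/init
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inProgress {
		return errors.New("rotation already in progress")
	}
	g.inProgress = true
	return nil
}

func (g *rotationGuard) Cancel() { // DELETE sys/rotate/recovery/init
	g.mu.Lock()
	defer g.mu.Unlock()
	g.inProgress = false
}

func main() {
	g := &rotationGuard{}
	fmt.Println(g.Init()) // nil: first init succeeds
	fmt.Println(g.Init()) // rotation already in progress
	g.Cancel()
	fmt.Println(g.Init()) // nil again after cancel
}
```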

A proper second rotation (shares=1 → new key) requires a multi-step ceremony via POST sys/rotate/recovery/update:

vault/logical_system_rotate.go#L676-L743

This requires providing the existing plaintext shard from the Rails DB as authorization. This flow is not implemented in the rake task. The rake task is a one-shot bootstrap tool only.


tag:gitlab.com,2026-03-18:5217182758 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T11:59:30Z jmallissery Jayakrishnan Mallissery [email protected]
What Happens on Every Unseal — Recovery-Key-Related Steps in vault/core.go
1. Physical cache purged
2. Recovery config cache cleared    (forces re-read from DB on next access)
3. unsealer.unseal()
     → reads "core/hsm/barrier-unseal-keys" from PostgreSQL
     → GCP KMS decrypts it → gets barrier key
     → barrier key unlocks storage layer
4. if autoSeal: seal.UpgradeKeys()
     a. Encrypt("a")              → refreshes internal KMS key ID
     b. upgradeRecoveryKey()
          BEFORE rake task: pe == nil → error → WARN logged
          AFTER  rake task: pe exists → check KMS key ID → re-encrypt if rotated → nil (no-op)
     c. upgradeStoredKeys()
          → re-encrypts "core/hsm/barrier-unseal-keys" if KMS key version changed
5. StartHealthCheck()              → background goroutine pings KMS every 10 min
6. [INFO] core: post-unseal setup complete

The recovery key is never used during normal unsealing. It is only touched in step 4b to handle KMS key rotation.


tag:gitlab.com,2026-03-18:5217178468 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T11:58:28Z jmallissery Jayakrishnan Mallissery [email protected]
How Recovery is Managed in GitLab Secrets Manager — Two-Phase Model

Phase 1: Initialization with recovery_shares=0

OpenBao initializes with no recovery key. KMS-based auto-unseal works immediately. core/recovery-key is never written. Every pod restart triggers the WARN.

Phase 2: Bootstrap via Rake Task

ee/lib/tasks/gitlab/secrets_management/openbao.rake#L67-L113

task :recovery_key_retrieve, [] => :gitlab_environment do
  privileged_jwt = SecretsManagement::GlobalSecretsManagerJwt.new.encoded
  secrets_manager_client = SecretsManagement::SecretsManagerClient.new(jwt: privileged_jwt)

  result = secrets_manager_client.init_rotate_recovery
  if result["data"].key? "keys"
    key = result["data"]["keys"][0]
    # deactivate old key, store new key in Rails DB, mark active
  else
    puts "Cannot get key, key has already been retrieved."
    secrets_manager_client.cancel_rotate_recovery
  end
end

The task uses a privileged system-level JWT (not user/project scoped):

ee/lib/secrets_management/global_secrets_manager_jwt.rb#L26-L27

secrets_manager_scope: 'privileged',
sub: 'gitlab_secrets_manager'    # system UID, not a real user

It calls POST sys/rotate/recovery/init with shares=1, threshold=1:

ee/lib/secrets_management/secrets_manager_client.rb#L276-L283

OPENBAO_RECOVERY_SHARES_THRESHOLD = 1

def init_rotate_recovery
  recovery_values = {
    secret_shares: OPENBAO_RECOVERY_SHARES_THRESHOLD,    # = 1
    secret_threshold: OPENBAO_RECOVERY_SHARES_THRESHOLD  # = 1
  }
  make_request(:post, rotate_recovery_url, recovery_values)
end

Because existingRecoveryConfig.SecretShares == 0, OpenBao takes a special fast path — generates the key and returns it immediately without any multi-party ceremony:

vault/rotate.go#L129-L157

 if existingRecoveryConfig.SecretShares == 0 {
      // special path: no ceremony needed, generate and return immediately
      newRecoveryKey, result, err := c.generateKey(c.recoveryRotationConfig, true)
      ...
      c.performRecoveryRekey(ctx, newRecoveryKey)  // writes to PostgreSQL, updates config to shares=1
      return result, nil                           // returns key shard to caller
  }
  return nil, nil  // shares>0: key not returned immediately; falls through to handleRotateInitGet (status response, no "keys" field)

performRecoveryRekey writes two durable records to PostgreSQL:

vault/rekey.go#L743-L759

c.seal.SetRecoveryKey(ctx, newRootKey)                   // writes "core/recovery-key" (KMS-encrypted)
c.seal.SetRecoveryConfig(ctx, c.recoveryRotationConfig)  // updates "core/recovery-config" to shares=1

The rake task then stores the plaintext shard in GitLab's Rails database:

ee/app/models/secrets_management/recovery_key.rb

  class RecoveryKey < ApplicationRecord
    self.table_name = 'secrets_management_recovery_keys'
    encrypts :key              # Rails 7 Active Record Encryption (not attr_encrypted gem)
    validate :no_other_active  # maximum one active key at a time
  end

After the rake task, the shard lives in two places for different purposes:

Location                                          What's stored       Encrypted by                     Purpose
OpenBao PostgreSQL core/recovery-key              Recovery key blob   GCP KMS                          Used by OpenBao to verify the operator-provided shard during re-key
GitLab Rails DB secrets_management_recovery_keys  Plaintext shard     Rails Active Record Encryption   Operator's copy — provided as input to authorize re-key operations

tag:gitlab.com,2026-03-18:5217175411 Jayakrishnan Mallissery commented on issue #592905 at GitLab.org / GitLab 2026-03-18T11:57:42Z jmallissery Jayakrishnan Mallissery [email protected]
Role of GCP KMS

GCP KMS is the auto-seal backend. It encrypts and decrypts the barrier key so OpenBao can seal/unseal without human intervention.

Every pod restart — auto-unseal flow:

OpenBao starts
  → reads "core/hsm/barrier-unseal-keys" from PostgreSQL  (KMS-encrypted blob)
  → calls GCP KMS Decrypt API                             → gets barrier key (plaintext)
  → barrier key decrypts OpenBao's storage encryption layer
  → OpenBao is unsealed

Every UpgradeKeys call — KMS key rotation handling:

GCP KMS periodically rotates its key versions. UpgradeKeys re-encrypts both core/hsm/barrier-unseal-keys and core/recovery-key with the current KMS key version, implementing the NIST recommendation to re-encrypt after key rotation.
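The re-encrypt-on-rotation check amounts to the following (schematic; the types and the stand-in encrypt function are mine, not OpenBao's):

```go
package main

import "fmt"

// storedBlob pairs ciphertext with the KMS key version that produced it.
type storedBlob struct {
	keyVersion string
	ciphertext string
}

// upgradeIfRotated re-encrypts a blob when the KMS key version it was
// written under no longer matches the current one. encrypt stands in
// for the KMS round-trip (decrypt under the old version, encrypt under
// the new one). Returns the (possibly new) blob and whether it changed.
func upgradeIfRotated(b storedBlob, currentVersion string,
	encrypt func(plaintext, version string) string) (storedBlob, bool) {
	if b.keyVersion == currentVersion {
		return b, false // no-op: the happy path on most unseal cycles
	}
	plaintext := "decrypted(" + b.ciphertext + ")" // placeholder decrypt
	return storedBlob{
		keyVersion: currentVersion,
		ciphertext: encrypt(plaintext, currentVersion),
	}, true
}

func main() {
	enc := func(pt, v string) string { return "enc:" + v + ":" + pt }
	b := storedBlob{keyVersion: "v1", ciphertext: "blob"}
	_, changed := upgradeIfRotated(b, "v1", enc)
	fmt.Println(changed) // false: same KMS key version, nothing to do
	_, changed = upgradeIfRotated(b, "v2", enc)
	fmt.Println(changed) // true: KMS key rotated, blob re-encrypted
}
```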

The recovery key blob is also KMS-encrypted when stored:

vault/seal_autoseal.go#L382-L413

func (d *autoSeal) SetRecoveryKey(ctx context.Context, key []byte) error {
    blobInfo, err := d.Encrypt(ctx, key, nil)   // GCP KMS encrypts it
    ...
    be := &physical.Entry{Key: recoveryKeyPath, Value: value}
    d.core.physical.Put(ctx, be)                // stored in PostgreSQL
}

VerifyRecoveryKey also needs KMS to decrypt the stored blob for verification:

vault/seal_autoseal.go#L365-L380

func (d *autoSeal) VerifyRecoveryKey(ctx context.Context, key []byte) error {
    pt, err := d.getRecoveryKeyInternal(ctx)   // needs KMS to decrypt "core/recovery-key"
    ...
    if subtle.ConstantTimeCompare(key, pt) != 1 {
        return errors.New("recovery key does not match submitted values")
    }
}

Critical implication: GCP KMS failure = total outage regardless of whether a recovery key exists. The recovery key does not protect against KMS unavailability.