@fcatteau I did some digging into this.
Can you take a look and see if it makes sense?
Environment: 300 namespaces, rotation_period=5m, max_parallel=3, 200m CPU limit.
Created 300 namespaces with OIDC named keys and bootstrapped key material.
| Wave | Time (UTC) | Warnings |
|---|---|---|
| Wave 1 | 2026-03-17T18:31 | 903 |
| Wave 2 | 2026-03-17T18:47 | 713 |
| Total | | 1,616 |
Sample log:
{"@level":"warn","@message":"error rotating OIDC keys","@module":"secrets.identity.identity_9239dadb","err":"context deadline exceeded"}
{"@level":"warn","@message":"error expiring OIDC public keys","@module":"secrets.identity.identity_9239dadb","err":"context deadline exceeded"}
All goroutines emit the same module ID (identity_9239dadb) — confirms the shared-state O(N²) behaviour where every goroutine iterates all namespaces.
Upgraded to v2.5.1, monitored across multiple rotation windows:
kubectl logs -n gitlab -l app.kubernetes.io/name=openbao \
--all-containers --since=15m | grep '"@level":"warn"' | grep -c "error rotating"
# 0
Zero cascade warnings. O(N²) fix confirmed.
After confirming zero warnings, checked whether JWKS kids were actually rotating:
test-ns-1 T=0: 72c430c2 bfef93d2 T+12m: 72c430c2 bfef93d2 # unchanged
test-ns-100 T=0: 715151ca aca1f43c T+12m: 715151ca aca1f43c # unchanged
test-ns-300 T=0: f497c59c 526b86e1 T+12m: f497c59c 526b86e1 # unchanged
No rotation across any of the 300 non-root namespaces after 12+ rotation windows. Silent — no errors or warnings.
Root cause: the v2.5.1 refactor introduced a double namespace prefix. oidcPeriodicFunc passes nsPath + "identity/oidc" to MatchingStorageByAPIPath, but the router already prepends nsPath internally:
# Non-root namespace (ns.Path = "test-ns-1/")
"test-ns-1/" + "test-ns-1/identity/oidc" → no radix tree match → nil → silent early return
Root namespace (ns.Path = "") is unaffected, which masked the regression initially.
The fix is a one-liner in vault/identity_store_oidc.go:
// v2.5.1 — buggy
s := i.router.MatchingStorageByAPIPath(ctx, nsPath+"identity/oidc")
// Fixed — router prepends nsPath internally
s := i.router.MatchingStorageByAPIPath(ctx, "identity/oidc")
Upstream bug filed: https://github.com/openbao/openbao/issues/2664
Re-ran v2.4.4 baseline (972 warnings, 3 waves), then deployed the patched binary.
Warnings:
Initial burst (first tick, all 300 overdue): 29
All subsequent windows: 0
The 29 initial failures are expected — all 300 namespaces had overdue NextRotation on pod start and fired simultaneously. No cascade. In production, namespaces are provisioned gradually so this burst won't occur.
JWKS kids rotating (test-ns-300, 28-minute window):
T=0: ab661e85 950d5a31 49f327de
T+5m: 8e015056 49f327de 337c775c
T+10m: 337c775c 8e015056 85df2fe5
T+28m: f94c0644 f6c44966 85df2fe5
6 rotations in 28 min — matches rotation_period=5m. Expired keys cleaned from JWKS confirmed.
| | v2.4.4 | v2.5.1 | v2.5.1 + fix |
|---|---|---|---|
| `context deadline exceeded` warnings | 1,616 (cascade) | 0 | 29 one-time burst |
| Non-root namespace rotation | Eventually | Silently broken | Continuously |
| Expired key cleanup | Eventually | Broken | Working |
| Retry cascade | Yes | No | No |
Recommendation: v2.5.1 eliminates the context deadline exceeded cascade but introduces a silent regression — non-root namespace OIDC keys never rotate. Upgrade to v2.5.1 with the fix from https://github.com/openbao/openbao/issues/2664.
@fcatteau Done
If it looks good, we can close this issue
@fcatteau Instead of mentioning this in the troubleshooting section, why not document this rake task and the need to run it to generate recovery keys for self-managed? Maybe in the post-installation docs. WDYT?
Do we need to create an issue to track the work of running the rake task in Staging and Production to generate recovery keys? Do we need to do that before GA?
Let me know your thoughts. Maybe I am missing something.
@fcatteau I verified in a CN environment that running the rake task indeed fixed the warnings
@fcatteau I reproduced the bug in a CN environment. Observed the warnings. Ran the rake task and it solved the issue. Warnings disappeared.
Second run of the rake task — results:
The cancel_rotate_recovery call at the end of the second run is important — without it, the in-memory recoveryRotationConfig set by initRecoveryRotation would block any future POST sys/rotate/recovery/init with "rotation already in progress" until the pod restarts. The task handles this correctly.
@fcatteau I updated the comment a bit. It looks like we have only implemented recovery key generation and not the rotation of the recovery key
A proper second rotation (shares=1 → new key) requires a multi-step ceremony via POST sys/rotate/recovery/update:
vault/logical_system_rotate.go#L676-L743
This requires providing the existing plaintext shard from the Rails DB as authorization. This flow is not implemented in the rake task. The rake task is a one-shot bootstrap tool only.
Why the Rails DB Copy Is Operationally Necessary
What each store actually contains
| Store | Content | Encrypted by | Accessible to |
|---|---|---|---|
| OpenBao PostgreSQL `core/recovery-key` | KMS-encrypted blob | GCP KMS | OpenBao internally only |
| Rails DB `secrets_management_recovery_keys` | Plaintext shard | Rails Active Record Encryption | Operators via GitLab |
These are not two copies of the same thing — they serve fundamentally different roles.
The plaintext is only revealed once
When the rake task calls POST sys/rotate/recovery/init and OpenBao generates the key, it returns the plaintext shard in the response exactly once, and persists only the KMS-encrypted blob (`core/recovery-key`) to PostgreSQL.
After that moment, the plaintext exists nowhere inside OpenBao. OpenBao's PostgreSQL copy can only be recovered by calling GCP KMS to decrypt it — which OpenBao does internally. Operators cannot extract it directly.
The rake task captures the plaintext at the only moment it is exposed and stores it in the Rails DB. This is the operator's only persistent access to the plaintext.
The chicken-and-egg problem this solves
To use the recovery key for any break-glass operation (generate root token, re-key), an operator must provide the plaintext to POST sys/rotate/recovery/update. OpenBao then decrypts its stored `core/recovery-key` blob via GCP KMS and compares it in constant time against the submitted shard (`VerifyRecoveryKey`).
If the plaintext was never stored anywhere, operators have no way to produce it. You cannot get it out of OpenBao's PostgreSQL without GCP KMS — and if GCP KMS is available, you don't need the recovery key in the first place.
The Rails DB is the only durable, operator-accessible store of the plaintext.
It also enables future rotation
When a proper rotation is eventually implemented (the current rake task only handles the initial bootstrap), the flow would be: (1) retrieve the current plaintext shard from the Rails DB, (2) submit it to POST sys/rotate/recovery/update to authorize generating a new key, (3) store the new shard in the Rails DB and deactivate the old record.
Without the Rails DB, step 1 is impossible — there is no way to retrieve the current plaintext to authorize the next rotation.
Independent security boundary (secondary benefit)
The two stores use entirely separate encryption domains: the OpenBao blob is encrypted by GCP KMS, while the Rails copy is encrypted with GitLab's Active Record Encryption keys.
An attacker compromising one does not compromise the other. But this is a secondary benefit — the primary reason is operational necessity, not defence-in-depth.
Bottom line: The Rails DB is not redundancy. It is the only place the plaintext shard durably lives. Without it, the recovery key is effectively unusable — stored in OpenBao but unreachable by any operator.
@fcatteau Can you review
| Scenario | Risk |
|---|---|
| Normal read/write secrets, CI/CD pipelines | None — KMS handles everything |
| Root token is lost | Unrecoverable — cannot run `bao operator generate-root` |
| Need to change KMS provider | Blocked — seal migration requires recovery key |
| Need to re-key the barrier | Blocked |
| Every pod restart | WARN logged — noisy, masks real warnings |
The rake task is a required initial operational step. The warning is the system's signal that this step has been skipped.
Once core/recovery-key exists in Cloud SQL PostgreSQL, upgradeRecoveryKey takes the happy path on every subsequent restart:
vault/seal_autoseal.go#L444-L451
pe, err := d.core.physical.Get(ctx, recoveryKeyPath)
// pe is NOT nil after rake task runs
if pe == nil { ... } // ← never reached again
// check KMS key ID, re-encrypt if rotated → return nil → no error → no WARN
Cloud SQL persists across Cloud Run revisions, pod restarts, and scaling to zero. The warning disappears permanently on all subsequent starts.
On the second run, existingRecoveryConfig.SecretShares is now 1. InitRotation still calls initRecoveryRotation first (setting an in-memory rotation config), but then the fast path check fails:
if existingRecoveryConfig.SecretShares == 0 {
// ← SKIPPED on second run, shares is now 1
}
return nil, nil // rotation was started in memory but no key is generated
handleRotateInitPut falls through to handleRotateInitGet which returns a status response with no "keys" field:
vault/logical_system_rotate.go#L635-L660
The rake task sees no "keys" and cancels:
ee/lib/tasks/gitlab/secrets_management/openbao.rake#L100-L107
else
puts "Cannot get key, key has already been retrieved."
secrets_manager_client.cancel_rotate_recovery # DELETE sys/rotate/recovery/init
nil
end
The cancel_rotate_recovery call is essential, not just a courtesy. Because initRecoveryRotation was already called and set c.recoveryRotationConfig, if the DELETE fails, any subsequent call to POST sys/rotate/recovery/init will return "rotation already in progress" until the pod restarts. The net result when cancel succeeds: no PostgreSQL state changes, the warning continues firing as before.
vault/core.go
1. Physical cache purged
2. Recovery config cache cleared (forces re-read from DB on next access)
3. unsealer.unseal()
→ reads "core/hsm/barrier-unseal-keys" from PostgreSQL
→ GCP KMS decrypts it → gets barrier key
→ barrier key unlocks storage layer
4. if autoSeal: seal.UpgradeKeys()
a. Encrypt("a") → refreshes internal KMS key ID
b. upgradeRecoveryKey()
BEFORE rake task: pe == nil → error → WARN logged
AFTER rake task: pe exists → check KMS key ID → re-encrypt if rotated → nil (no-op)
c. upgradeStoredKeys()
→ re-encrypts "core/hsm/barrier-unseal-keys" if KMS key version changed
5. StartHealthCheck() → background goroutine pings KMS every 10 min
6. [INFO] core: post-unseal setup complete
The recovery key is never used during normal unsealing. It is only touched in step 4b to handle KMS key rotation.
recovery_shares=0
OpenBao initializes with no recovery key. KMS-based auto-unseal works immediately. core/recovery-key is never written. Every pod restart triggers the WARN.
ee/lib/tasks/gitlab/secrets_management/openbao.rake#L67-L113
task :recovery_key_retrieve, [] => :gitlab_environment do
privileged_jwt = SecretsManagement::GlobalSecretsManagerJwt.new.encoded
secrets_manager_client = SecretsManagement::SecretsManagerClient.new(jwt: privileged_jwt)
result = secrets_manager_client.init_rotate_recovery
if result["data"].key? "keys"
key = result["data"]["keys"][0]
# deactivate old key, store new key in Rails DB, mark active
else
puts "Cannot get key, key has already been retrieved."
secrets_manager_client.cancel_rotate_recovery
end
end
The task uses a privileged system-level JWT (not user/project scoped):
ee/lib/secrets_management/global_secrets_manager_jwt.rb#L26-L27
secrets_manager_scope: 'privileged',
sub: 'gitlab_secrets_manager' # system UID, not a real user
It calls POST sys/rotate/recovery/init with shares=1, threshold=1:
ee/lib/secrets_management/secrets_manager_client.rb#L276-L283
OPENBAO_RECOVERY_SHARES_THRESHOLD = 1
def init_rotate_recovery
recovery_values = {
secret_shares: OPENBAO_RECOVERY_SHARES_THRESHOLD, # = 1
secret_threshold: OPENBAO_RECOVERY_SHARES_THRESHOLD # = 1
}
make_request(:post, rotate_recovery_url, recovery_values)
end
Because existingRecoveryConfig.SecretShares == 0, OpenBao takes a special fast path — generates the key and returns it immediately without any multi-party ceremony:
if existingRecoveryConfig.SecretShares == 0 {
// special path: no ceremony needed, generate and return immediately
newRecoveryKey, result, err := c.generateKey(c.recoveryRotationConfig, true)
...
c.performRecoveryRekey(ctx, newRecoveryKey) // writes to PostgreSQL, updates config to shares=1
return result, nil // returns key shard to caller
}
return nil, nil // shares>0: key not returned immediately; falls through to handleRotateInitGet (status response, no "keys" field)
performRecoveryRekey writes two durable records to PostgreSQL:
c.seal.SetRecoveryKey(ctx, newRootKey) // writes "core/recovery-key" (KMS-encrypted)
c.seal.SetRecoveryConfig(ctx, c.recoveryRotationConfig) // updates "core/recovery-config" to shares=1
The rake task then stores the plaintext shard in GitLab's Rails database:
ee/app/models/secrets_management/recovery_key.rb
class RecoveryKey < ApplicationRecord
self.table_name = 'secrets_management_recovery_keys'
encrypts :key # Rails 7 Active Record Encryption (not attr_encrypted gem)
validate :no_other_active # maximum one active key at a time
end
After the rake task, the shard lives in two places for different purposes:
| Location | What's stored | Encrypted by | Purpose |
|---|---|---|---|
| OpenBao PostgreSQL `core/recovery-key` | Recovery key blob | GCP KMS | Used by OpenBao to verify the operator-provided shard during re-key |
| GitLab Rails DB `secrets_management_recovery_keys` | Plaintext shard | Rails Active Record Encryption | Operator's copy — provided as input to authorize re-key operations |
GCP KMS is the auto-seal backend. It encrypts and decrypts the barrier key so OpenBao can seal/unseal without human intervention.
Every pod restart — auto-unseal flow:
OpenBao starts
→ reads "core/hsm/barrier-unseal-keys" from PostgreSQL (KMS-encrypted blob)
→ calls GCP KMS Decrypt API → gets barrier key (plaintext)
→ barrier key decrypts OpenBao's storage encryption layer
→ OpenBao is unsealed
Every UpgradeKeys call — KMS key rotation handling:
GCP KMS periodically rotates its key versions. UpgradeKeys re-encrypts both core/hsm/barrier-unseal-keys and core/recovery-key with the current KMS key version, implementing the NIST recommendation to re-encrypt after key rotation.
The recovery key blob is also KMS-encrypted when stored:
vault/seal_autoseal.go#L382-L413
func (d *autoSeal) SetRecoveryKey(ctx context.Context, key []byte) error {
blobInfo, err := d.Encrypt(ctx, key, nil) // GCP KMS encrypts it
...
be := &physical.Entry{Key: recoveryKeyPath, Value: value}
d.core.physical.Put(ctx, be) // stored in PostgreSQL
}
VerifyRecoveryKey also needs KMS to decrypt the stored blob for verification:
vault/seal_autoseal.go#L365-L380
func (d *autoSeal) VerifyRecoveryKey(ctx context.Context, key []byte) error {
pt, err := d.getRecoveryKeyInternal(ctx) // needs KMS to decrypt "core/recovery-key"
...
if subtle.ConstantTimeCompare(key, pt) != 1 {
return errors.New("recovery key does not match submitted values")
}
}
Critical implication: GCP KMS failure = total outage regardless of whether a recovery key exists. The recovery key does not protect against KMS unavailability.