@jtouchstone1 I have some answers but I'm still figuring this out.
I would suggest the admin area.
The SaaS platform would be the exception though.
We might not want this UI on SaaS anyway. Let's say we can turn that off on that platform.
We only generate the recovery key once. That would happen:
According to the issue, it should be displayed. This ensures that admins on Dedicated can get the key at any time. I'd like to discuss this with SREs familiar with OpenBao/Vault though.
According to the issue, GitLab would verify the recovery key. Admins don't need to type anything since the GitLab backend has the recovery key, and it can connect to the OpenBao API to verify it.
cc @jmallissery
@jmallissery Could you go ahead and create these two issues?
@pguinoiseau there, or any SRE familiar with Vault. Issue to be added to gitlab-org#19390.
Then we can close this very issue.
@jmallissery Right, that makes sense.
quoting !208680 (merged):
OpenBao provides more documentation, but the summary of it is that the recovery key can be retrieved once. Once retrieved, the endpoint returns an empty array. To my mind, this enforces two requirements:
- We should enforce that there is only one recovery key in the database.
- We should avoid losing the recovery key.
Also, the implementation and its spec make it clear that this is expected.
We can only generate a new recovery key when we reset the OpenBao database.
That's not so obvious in the docs though, in my opinion.
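To illustrate the two requirements above, here is a minimal sketch. It is not the actual GitLab implementation. `RecoveryKeyStore` and `persist_retrieved_keys` are hypothetical names, the in-memory store stands in for the database record, and it assumes the retrieval response is a list holding a single key the first time and an empty list afterwards.

```python
from typing import List, Optional


class RecoveryKeyAlreadyStored(Exception):
    """Raised when a second recovery key would overwrite the stored one."""


class RecoveryKeyStore:
    """Hypothetical in-memory stand-in for the single database record."""

    def __init__(self) -> None:
        self._key: Optional[str] = None

    def save(self, key: str) -> None:
        # Requirement 1: there is only ever one recovery key.
        if self._key is not None:
            raise RecoveryKeyAlreadyStored("a recovery key is already stored")
        self._key = key

    def get(self) -> Optional[str]:
        return self._key


def persist_retrieved_keys(store: RecoveryKeyStore, retrieved: List[str]) -> bool:
    """Store the key from a retrieval response.

    Returns True when a key was stored and False when the response was
    empty, which is what is expected once the key has been retrieved.
    """
    if not retrieved:
        # Requirement 2: an empty response must never erase the stored key.
        return False
    if len(retrieved) != 1:
        # Assumption: a single key is expected in the response.
        raise ValueError("expected exactly one recovery key")
    store.save(retrieved[0])
    return True
```

With this shape, a second retrieval attempt is a no-op instead of a way to lose or replace the stored key.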
@jmallissery Yes, I believe this should be documented in the post-install instructions of the admin docs. It could be repeated in the troubleshooting section; that would be a second entry point. That would be worth a small doc issue.
Do we need to create an issue to track the tasks that we need to do w.r.t. running the rake task in staging and production to generate recovery keys? Do we need to do that before GA?
Yes, let's create a Change Request issue. We need a separate confidential issue to discuss this with SREs anyways.
@jmallissery Thanks. Let's try running the rake task a second time, and see what happens. I'm adding that to the action items.
I've updated the user docs but with a different wording:
Value: Cannot be more than 10 KB (10,000 bytes).
Validation passes if the secret is exactly 10 KB.
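A minimal sketch of the boundary described by this wording, not the actual validation code. `MAX_SECRET_VALUE_BYTES` and `secret_value_within_limit` are hypothetical names, and measuring the size on the UTF-8 encoding is an assumption.

```python
MAX_SECRET_VALUE_BYTES = 10_000  # "10 KB (10,000 bytes)" in the wording above


def secret_value_within_limit(value: str) -> bool:
    """Return True when the value is at most 10,000 bytes.

    Assumption: the size is measured on the UTF-8 encoding of the value.
    The comparison is inclusive, so exactly 10,000 bytes passes.
    """
    return len(value.encode("utf-8")) <= MAX_SECRET_VALUE_BYTES
```

The inclusive comparison is the point of the wording fix: "cannot be more than" matches `<=`, whereas "must be less than" would wrongly suggest `<`.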
Fabien Catteau (b4408e44) at 18 Mar 14:29
Fix secret value size limit wording
Fabien Catteau (3e32356f) at 18 Mar 14:25
Document secret value size limit
@marcel.amirault Thanks for taking a look. I thought this would be an instance limit similar to https://docs.gitlab.com/administration/instance_limits/#size-of-commit-titles-and-descriptions for instance. In any case, I agree we should fix user documentation first since it's incorrect.
As mentioned in the related issue discussion, secrets that you try to expose in the job log are [masked], like other CI/CD variables, so we should clarify this. While here, it's worth pointing out that they act like file type variables, so they'd get exposed with cat, not echo.
Additionally, instead of using cat in the example, which is like showing people echo $MY_SECRET (risky example), let's put a more realistic fake example of using a command that accepts credentials from a file.
If you are a GitLab team member and only adding documentation, do not add any of the following labels:
~"frontend"
~"backend"
~"type::bug"
~"database"
These labels cause the MR to be added to code verification QA issues.
Documentation-related MRs should be reviewed by a Technical Writer for a non-blocking review, based on Documentation Guidelines and the Style Guide.
If you aren't sure which tech writer to ask, use roulette or ask in the #docs Slack channel.
Default behavior, say something like Default behavior when you close an issue.
Configuring GDK, say something like Configure GDK.
@jmallissery Thanks for sharing this. Overall that seems correct, though I would have to read thoroughly all that's been shared.
When a proper rotation is eventually implemented (the current rake task only handles the initial bootstrap), the flow would be:
My understanding is that the recovery_key_retrieve task rotates the recovery key and can be called a second time.
@jmallissery Thanks a lot for researching. Indeed this warning makes sense since we haven't generated the recovery key. Related issues:
I wouldn't run the Rake task on staging – at least not for testing purposes. To run a rake task on staging and production we would need to go through the change management process. See https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/
We could wait for the rake task to be ported to the GitLab UI but that would delay verification.
I suggest we test the Rake task on a GitLab CN or CNH deployment. WDYT?
Indeed we don't have documentation for the rake task. That would probably go under https://docs.gitlab.com/administration/secrets_manager/.
gitlab:secrets_management:openbao:recovery_key_retrieve rake task on a GitLab CN/CNH deployment. We expect the warning to no longer appear when the server starts.
None of this seems critical.
@jrandazzo I'm confirming the two problems reported here.
When extraVolumes and extraVolumeMounts are set by users, the predefined secrets for the unseal key and the audit token are no longer set in the OpenBao chart. There's a workaround though. See #592988 (comment 3169984641)
I need to open two issues for this.
@reprazent As discussed above, the openbao_core_active metric tells us if we have 1 or 0 active nodes. We should always have 1. Is that something we can leverage in service metrics? It's redundant with the liveness probe. Does it matter?
We can't get an apdex from HTTP requests since we don't have an error rate. But can we have an apdex based on the active state? Should we?
I think this is finally ready for dev.
@clemensbeck OK, so we would update the OpenBao chart to leverage what's already defined in https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/templates/_certificates.tpl. No need for additional values.
I assume there's no workaround for this b/c initContainers can't be set, right?
Thanks for sharing a workaround!
@clemensbeck Thanks! I finally get it.
static-unseal and http-audit secrets to volumes and volumeMounts. That doesn't collide with extraVolumes and extraVolumeMounts. https://gitlab.com/gitlab-org/cloud-native/charts/openbao/-/blob/406a95f3e3f85668e986fbf7704085cd7434db6c/templates/deployment.yaml#L76-112
generate is true.
generate to false. It prepares secrets for the static unseal key and the http audit token, and sets these using extraVolumes and extraVolumeMounts.
extraVolumes and extraVolumeMounts really.
NOTE: The updated https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/secrets-manager.jsonnet would look like this:
metricsCatalog.serviceDefinition(
  runwayArchetype(
    type='secrets-manager',
    team='pipeline_security',
    featureCategory='secrets_management'
  ) + {
    serviceLevelIndicators+: {
      openbao_requests: {
        userImpacting: true,
        featureCategory: 'secrets_management',
        requestRate: rateMetric(
          counter='secrets_manager_openbao_core_handle_request_count'
        ),
        significantLabels: [],
      },
    },
  }
)
For GET Hybrid the metric is openbao_core_handle_request_count. The Runway service and the OpenBao chart use a different prefix for metrics.
@reprazent Thanks again for your help. So we would follow these steps:
secrets-manager Runway service with SLI based on OpenBao metrics.
I can't find anything on testing what ends up in get-hybrid/config though.
How do we do that? Do we have dev docs for this?