Awesome work @marcogreg, apart from 1 nitpick the change LGTM and 2 of Duo's suggestion
I'm requesting a maintainer from @bmarjanovic to pottentially speed up the process here.
@bmarjanovic let us know if you won't have the bandwidth to take a look at this MR, thanks
Cells::ClaimsVerificationWorker to run verification per model, gated by a dynamic ops feature flag per model nameCells::ScheduleClaimsVerificationWorker as a cron job that enqueues a ClaimsVerificationWorker for every model registered via Cells::Claimable . This cronjob runs once every weekend 12am UTC for now to minimize database load.This worker calls Cells::Claims::VerificationService introduced in !226233, to backfill and reconcile changes between Rails and Topology Service.
gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team#468
In console, run Rails.application.eager_load! to ensure all models have been loaded
Enable the feature flags:
%w[
cells_claims_verification_worker_organizations_organization_model
cells_claims_verification_worker_project_model
cells_claims_verification_worker_namespace_model
cells_claims_verification_worker_user_model
cells_claims_verification_worker_key_model
cells_claims_verification_worker_email_model
cells_claims_verification_worker_gpg_key_model
cells_claims_verification_worker_redirect_route_model
cells_claims_verification_worker_route_model
cells_claims_verification_worker_service_desk_setting_model
].each { |flag| Feature.enable(flag) }
Run the Cells::ScheduleClaimsVerificationWorker
Cells::ScheduleClaimsVerificationWorker.new.perform
Check that the claims records are backfilled in topology service database:
gdk psql -d topology_service
SELECT * FROM claims;
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
nitpick: can we remove the _model suffix from the FF? They feel redudant and just having cells_claims_verification_worker_email reads better
@droarty2, we can't complete it by 10th July. We were initially expecting it to take a full quarter, i.e., till the end of July. Given that the milestone and quarter ends don't align very well. We went with 19.3.
I've updated the Due date and Milestone accordingly.
yeah the other reason why it's unlikely is because we run this rake task automatically as part of loading the DB, and since loading the DB happens before we start writing anything on the DB, this becomes really unlikely.
Tarun Khandelwal (2bffa9a3) at 19 Mar 02:03
update log to specify MINVALUE getting skipped.
Thanks @daveyleach for the work, just a few more comments.
question: why are we adding this incident, this was not on HTTP Router or Topology Service?
suggestion: Can we link / explain a bit on how these URLs are formed (managed_domain in the tenant model). and also link the upstream dedicated docs, given the user would also have to extract the creds to access these servers.
suggestion: Maybe we should link the config along with the ADR that helped us in finalizing these regions?
suggestion: Should we remove the k8s setup documentation from here? Given it's not really applicable for Cell pod access?
question: What do you mean by this? If it's how cloudflare environments are created and managed I think we should link the upstream documentation for that.
LGTM
Peer Approved
Hi @vyaklushin, we recently merged: gitlab-org/cells/topology-service!488 (merged), which adds support for classifying with claims in our TS.
We are still working on the documentation, but wanted to give you a heads up first.
PS: If you try this in prod right now it wouldn;t work as we have not backfilled the data yet, however that is something we would be doing in next couple of week, I'll give you a ping when that happens, thanks
@vglafirov overall looks good, I've few things that can be added / updated here.
All authenticated requests will have a
cell-1prefix added to the_gitlab_sessioncookie
This is not true; all the requests will go through the topology service, and not only authenticated requests, as we can see unauthenticated requests on staging also have that session prefix:
Should the above MR remove the region-specific override, given that we are adding in the global config?
worth linking to the first attempt and mention what changed since then
I think we have missed this from Steve's comment from below, can we add it in the description, mentioning that the previous attempt failed and from then we have done the following to make sure it doesn't happen this time? and load testing link that we have validated that TS can handle 40k RPS in case caching doesn't work?
/chatops run canary --disable --gprd
Should this be updated to /chatops gitlab run canary --disable --gprd in light of the recent announcement?
Remove
gitlab.com/*routing rule from CloudFlare configuration
Should we also link the config-mgmt link, given if someone runs TF apply this route will come back again?
Topology service logs
Given in this metric, the log volume be too high as soon as we enable this, should we add a filter to the logs saying level != info, so it's easier to see if any error happend?
ref: https://dashboards.gitlab.net/goto/efge6ngfv7u9sd?orgId=1
If you can get the optional field in quickly and start socializing the need to provide a value with Switchboard, EA, and Pubsec, then you can piggy back on https://gitlab.com/groups/gitlab-com/gl-infra/gitlab-dedicated/-/work_items/876 to make it mandatory
In the spirit of this, I ended up opening: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/tenant-model-schema/-/merge_requests/632 post, which we can start socializing the need ot start providing a value with the different teams (along with exposing it in the Switchboard UI).