@tkhandelwal3 Ugh. Can we limit by time? Seems that we should drop some of those as well?
@marcogreg Yes, please solve it before enabling the feature flag, but for the sake of iteration let's move on. Overall, long-running jobs are a big anti-pattern, as we also cannot see progress properly.
@marcogreg There are also some pipeline failures; I'm unsure whether they are related. Retried.
Cells::ClaimsVerificationWorker to run verification per model, gated by a dynamic ops feature flag per model name. This worker calls Cells::Claims::VerificationService, introduced in !226233, to backfill and reconcile changes between Rails and Topology Service.
Cells::ScheduleClaimsVerificationWorker as a cron job that enqueues a ClaimsVerificationWorker for every model registered via Cells::Claimable, with a 10-minute delay. This cron job runs once every Saturday, and the start time is randomized between 00:00 and 23:59 to avoid a thundering herd from all cells running this worker at once. This is done via Gitlab::Scheduling::ScheduleWithinWorker.
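The randomized weekly start can be pictured as picking one offset inside a 24-hour window. A minimal plain-Ruby sketch, assuming a uniform random offset; `random_offset_within_day` is a hypothetical helper for illustration and not the API of Gitlab::Scheduling::ScheduleWithinWorker:

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical helper: returns a delay in seconds in the range 0..86_399,
# so each cell would start at a different point within the day.
def random_offset_within_day(rng: Random.new)
  rng.rand(SECONDS_PER_DAY)
end

offset = random_offset_within_day
puts format('start at %02d:%02d', offset / 3600, (offset % 3600) / 60)
```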
gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team#468
In console, run Rails.application.eager_load! to ensure all models have been loaded
Enable the feature flags:
%w[
cells_claims_verification_worker_organizations_organization
cells_claims_verification_worker_project
cells_claims_verification_worker_namespace
cells_claims_verification_worker_user
cells_claims_verification_worker_key
cells_claims_verification_worker_email
cells_claims_verification_worker_gpg_key
cells_claims_verification_worker_redirect_route
cells_claims_verification_worker_route
cells_claims_verification_worker_service_desk_setting
].each { |flag| Feature.enable(flag) }
Run the Cells::ScheduleClaimsVerificationWorker
Cells::ScheduleClaimsVerificationWorker.new.perform
Check that the claims records are backfilled in topology service database:
gdk psql -d topology_service
SELECT * FROM claims;
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
@marcogreg I'm unsure about those recommendations, but what data set do we expect now, and how long will those jobs run today? Overall, anything that runs longer than about 10 minutes would be better retried and re-run later.
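One way to read "retry and re-run later" is a time budget: process batches until the budget is used up, then hand back a cursor so the remaining work can be re-enqueued. A rough plain-Ruby sketch under that assumption; `TimeBudget` and `process_with_budget` are hypothetical names, not existing GitLab classes:

```ruby
# Hypothetical sketch: tracks whether a fixed time budget has been used up.
class TimeBudget
  def initialize(max_seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @deadline = clock.call + max_seconds
  end

  def exceeded?
    @clock.call >= @deadline
  end
end

# Yields each batch until the budget is exceeded. Returns the index of the
# first unprocessed batch (to resume from later), or nil when all are done.
def process_with_budget(batches, budget, start_at: 0)
  batches.each_with_index do |batch, index|
    next if index < start_at
    return index if budget.exceeded?

    yield batch
  end

  nil
end

# Usage with a fake clock that advances one second per reading.
fake_time = -1
clock = -> { fake_time += 1 }
budget = TimeBudget.new(2, clock: clock)

processed = []
resume_at = process_with_budget([[1, 2], [3, 4], [5, 6]], budget) do |batch|
  processed.concat(batch)
end
```

In a real worker the returned index (or a record ID cursor) would be passed as an argument to the re-enqueued job, so each run stays short and progress is visible.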
@GitLabDuo If this worker takes 30 minutes, is there an issue with it? Like a TTL or something else?
@GitLabDuo But if the worker is executing for given args, it should not be interrupted. This worker might take a significant amount of time. Also, what is including_scheduled:?
@GitLabDuo What should it be here? Overall, is the deduplication based on args, or only on the class name?
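On the args-versus-class question: to my understanding, GitLab's Sidekiq deduplication derives its idempotency key from both the worker class name and the job arguments, so the same class with different args would not count as a duplicate. The plain-Ruby sketch below only illustrates that idea; `idempotency_key` is a hypothetical helper, not GitLab's implementation:

```ruby
require 'digest'
require 'json'

# Hypothetical helper: one key per (worker class, arguments) pair.
def idempotency_key(worker_class_name, args)
  Digest::SHA256.hexdigest("#{worker_class_name}:#{JSON.generate(args)}")
end

project_key   = idempotency_key('Cells::ClaimsVerificationWorker', ['Project'])
namespace_key = idempotency_key('Cells::ClaimsVerificationWorker', ['Namespace'])
repeat_key    = idempotency_key('Cells::ClaimsVerificationWorker', ['Project'])
```

Under this assumption, a second job for `Project` would be dropped as a duplicate while the job for `Namespace` would still run.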
OK, the rest can be a follow-up. Can you create an issue to figure out proper spacing between operations to avoid the thundering herd?
@tkhandelwal3 Fine, but we should at least ensure that within the same cell this process runs just once, if this is not already guaranteed.
Also, as part of scheduling the worker, we could run each subsequent model with a delay of a few minutes, for example 10 minutes.
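The per-model spacing suggested above could look roughly like this. A minimal sketch in plain Ruby; `staggered_delays` is a hypothetical helper, and the computed seconds are what a scheduler would pass to something like Sidekiq's `perform_in`:

```ruby
# Hypothetical helper: maps each model name to a delay in seconds so that
# consecutive models start `spacing_minutes` apart.
def staggered_delays(model_names, spacing_minutes: 10)
  model_names.each_with_index.map do |name, index|
    [name, index * spacing_minutes * 60]
  end.to_h
end

delays = staggered_delays(%w[Organizations::Organization Project Namespace])
delays.each { |name, seconds| puts "#{name}: start after #{seconds}s" }
```

The first model starts immediately and each later one waits another 10 minutes, which spreads the load within a cell but does not by itself address cells starting at the same moment.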
@GitLabDuo Is there some way to enforce it without ExclusiveLease? Maybe there is some method here that is part of the worker configuration?
@GitLabDuo Does it use exclusive locking to prevent scheduling across multiple nodes at the same time?
This makes us schedule all models at the same time, which creates extra pressure. Also, we will be scheduling this individually on each Sidekiq process handling cron jobs, so multiple times.
This is also not ideal, since we schedule those across all cells at the same time.
In general the pattern should be:
Does it make sense to retry those even if they run on a schedule anyway?