Furhan Shabir (efd2eb14) at 19 Mar 16:22
Furhan Shabir (d11e142d) at 19 Mar 16:21
Merge branch 'revert-2543ec35' into 'master'
... and 1 more commit
Revert "Merge branch 'chore/add-name-to-keda-triggers' into 'master'"
This reverts merge request !5284
Adding name to trigger didn't change external metrics name for KEDA scaler.
Please read the Contributing document and once you do, complete the following:
+1 to not paging on the 6h burn rate for Sidekiq and using a shorter (1 hour?) burn rate for alerting. We can then evaluate the new queuing apdex using the 1-hour burn rate.
Adding name to the trigger didn't change the external metrics name that we wanted to change in the first place.
Furhan Shabir (efd2eb14) at 19 Mar 06:34
Revert "Merge branch 'chore/add-name-to-keda-triggers' into 'master'"
Furhan Shabir (3c99fd0c) at 18 Mar 14:14
Furhan Shabir (2543ec35) at 18 Mar 14:14
Merge branch 'chore/add-name-to-keda-triggers' into 'master'
... and 1 more commit
Add name for the KEDA prometheus triggers for urgent-cpu-bound sidekiq shard
We suspect the default name, used by multiple sidekiq shards, is causing prometheus scrape failures and this trigger is often erroring out as a result.
Reference issue: gitlab-com/gl-infra/tenant-scale/tenant-services/team#373 (comment 3170674284)
Furhan Shabir (3c99fd0c) at 18 Mar 13:56
chore: Add name to KEDA prometheus triggers for urgent-cpu-bound
HPA scaling during peak hours has improved: flat-line saturation now lasts for shorter periods than it did in the week before we decreased concurrency, which is a good sign.
The horizontal scaler relies on CPU utilization and shard worker saturation. However, the scaler logs suggest that scaling was almost always driven by CPU utilization.
The Prometheus-metrics-based scaler is almost always erroring out:
Mimir credentials look alright, and the scaler logs give no explicit reason for the failure:
unable to get external metric gitlab/s1-prometheus/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: urgent-sidekiq-cpu-bound-v2,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: rpc error: code = Unknown desc = error when getting metric values metric:s1-prometheus encountered error
Since there is no explicit name for the KEDA scaling source, it defaults to s1-prometheus. The same default applies to all the other scalers (shards) too, which could be the cause of the Unknown errors.
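To make the shared default name concrete, here is a minimal sketch. It assumes KEDA builds the external metric name from the trigger's position in the ScaledObject and the scaler type (`s<index>-<type>`), which is consistent with the `s1-prometheus` in the error above. The helper names, the second shard name, and the trigger layout are hypothetical.

```python
# Sketch only: models the ASSUMED naming scheme s<index>-<scaler type>.
# These helpers and the shard layout are illustrative, not KEDA's actual code.

def external_metric_name(trigger_index: int, scaler_type: str) -> str:
    """Return the assumed default external metric name for one trigger."""
    return f"s{trigger_index}-{scaler_type}"


def prometheus_metric_names(triggers: list[str]) -> list[str]:
    """Name each prometheus trigger of a ScaledObject by its position."""
    return [
        external_metric_name(index, scaler_type)
        for index, scaler_type in enumerate(triggers)
        if scaler_type == "prometheus"
    ]


# Assumption: every shard declares a cpu trigger followed by a prometheus one.
shards = {
    "urgent-sidekiq-cpu-bound-v2": ["cpu", "prometheus"],
    "some-other-shard": ["cpu", "prometheus"],  # hypothetical shard name
}

for scaled_object, triggers in shards.items():
    # The metric name repeats across shards; only the ScaledObject differs.
    print(scaled_object, prometheus_metric_names(triggers))
```

Under this assumption, both shards yield `['s1-prometheus']`, so the metric name by itself does not tell them apart. In the logged request, the shard shows up only in the `scaledobject.keda.sh/name` label selector.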
Logged an issue to add GVL wait time measurements to sidekiq job logs: #384
Duplicate of #373
GVL metrics were enabled here, but they are not very useful in their current form: we only get the GVL wait time for a random job, and we can't compare it with that job's total duration to measure how much of the job was spent waiting.
We would need to add this metric to the job logs, which already have duration_s, to come up with a wait time percentage.
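Once both values sit on the same log line, the percentage could be computed roughly as follows. This is a sketch under assumptions: the job logs are taken to be JSON lines, `gvl_wait_s` is a hypothetical field name (only `duration_s` is known to exist), and the sample line is made up for illustration.

```python
import json
from typing import Optional


def gvl_wait_percentage(log_line: str, wait_field: str = "gvl_wait_s") -> Optional[float]:
    """Return GVL wait time as a percentage of job duration, or None if unknown.

    wait_field is a HYPOTHETICAL field name; duration_s already exists in the logs.
    """
    record = json.loads(log_line)
    duration = record.get("duration_s")
    wait = record.get(wait_field)
    # Skip lines that lack either value or report a non-positive duration.
    if duration is None or wait is None or duration <= 0:
        return None
    return 100.0 * wait / duration


# Made-up sample line, not real production data.
sample = '{"class": "ExampleWorker", "duration_s": 2.0, "gvl_wait_s": 0.5}'
print(gvl_wait_percentage(sample))  # 25.0
```

Returning `None` for incomplete lines keeps jobs without a GVL sample out of any aggregate instead of counting them as zero wait.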