Nick Duff activity https://gitlab.com/nduff 2026-03-18T15:06:15Z tag:gitlab.com,2026-03-18:5218120594 Nick Duff pushed to project branch main at Nick Duff / Alloy 2026-03-18T15:06:15Z nduff Nick Duff [email protected]

Nick Duff (3ecca296) at 18 Mar 15:06

Merge remote-tracking branch 'upstream/main'

... and 2 more commits

tag:gitlab.com,2026-03-18:5217518550 Nick Duff pushed to project branch main at Nick Duff / Alloy 2026-03-18T13:10:06Z nduff Nick Duff [email protected]

Nick Duff (810ed4df) at 18 Mar 13:10

Merge remote-tracking branch 'upstream/main'

... and 1 more commit

tag:gitlab.com,2026-03-18:5216140249 Nick Duff pushed to project branch main at Nick Duff / Alloy 2026-03-18T08:06:30Z nduff Nick Duff [email protected]

Nick Duff (b6e69a62) at 18 Mar 08:06

Merge remote-tracking branch 'upstream/main'

... and 1 more commit

tag:gitlab.com,2026-03-18:5215583490 Nick Duff approved merge request !1005: Enable autosync for Atlantis at GitLab.com / GitLab Infrastructure Team / ArgoCD / ArgoCD Applications 2026-03-18T03:59:17Z nduff Nick Duff [email protected]

What

Enable autosync for Atlantis.

Why

Last step of CR gitlab-com/gl-infra/production#21581

gitlab-com/gl-infra/production-engineering#28241

tag:gitlab.com,2026-03-18:5215571125 Nick Duff deleted project branch nduff/mimir-ingesters-moar at GitLab.com / GitLab Infrastructure Team / Kubernetes Workloads / GitLab Helmfiles 2026-03-18T03:51:20Z nduff Nick Duff [email protected]

Nick Duff (343c689b) at 18 Mar 03:51

tag:gitlab.com,2026-03-18:5215571020 Nick Duff pushed to project branch master at GitLab.com / GitLab Infrastructure Team / Kubernetes Workloads / GitLab Helmfiles 2026-03-18T03:51:15Z nduff Nick Duff [email protected]

Nick Duff (1c86fc4d) at 18 Mar 03:51

Merge branch 'nduff/mimir-ingesters-moar' into 'master'

... and 1 more commit

tag:gitlab.com,2026-03-18:5215571007 Nick Duff accepted merge request !10103: chore(mimir): increase ingester replicas at GitLab.com / GitLab Infrastructure Team / Kubernetes Workloads / GitLab ... 2026-03-18T03:51:15Z nduff Nick Duff [email protected]

What

  • Increase Ingester Replicas for Mimir

Why

We have had two recent saturation events due to the increase in peak metric usage.

This will increase the ingester count by 5 per zone. It could require more, but I'm hoping it buys enough time to audit and drop some of our expensive metrics.

See the linked issue for more details.

Related Incident: gitlab-com/gl-infra/observability/team#4516

tag:gitlab.com,2026-03-18:5215546813 Nick Duff approved merge request !983: Add Atlantis at GitLab.com / GitLab Infrastructure Team / ArgoCD / ArgoCD Applications 2026-03-18T03:33:58Z nduff Nick Duff [email protected]

What

Add Atlantis service.

Why

Migrated from Helmfile.

gitlab-com/gl-infra/production-engineering#28241

tag:gitlab.com,2026-03-18:5215528815 Nick Duff commented on issue #4516 at GitLab.com / GitLab Infrastructure Team / Observability / Observability Issue Tracker 2026-03-18T03:23:24Z nduff Nick Duff [email protected]

I've added a small increase to the ingesters here.

This will give 5 more per zone. Hopefully it is enough to work on point 3.

tag:gitlab.com,2026-03-18:5215526045 Nick Duff commented on issue #4488 at GitLab.com / GitLab Infrastructure Team / Observability / Observability Issue Tracker 2026-03-18T03:21:20Z nduff Nick Duff [email protected]

@reprazent I was looking at some of this data today following the recent Ingester OOM incident. I've scaled the Ingesters slightly, but it would be good to figure out what we can address on this side of the fence as well.

In the process I noticed these metrics are producing a large amount of data:

image

I had a look at where these are used: none of the recording rules are working properly, and querying the data is almost impossible because it needs to load more chunks than we currently permit.

I actually did something rather stupid in the process: I dropped some of the high-cardinality labels in this MR, only to realize during rollout that I had killed some of the unique dimensions, so we are now seeing some err-mimir-sample-duplicate-timestamp errors. Rather than fixing this straight away (as noted, these metrics have been pretty much impossible to query for a while), I would rather ask: do we need gitlab_database_connection_pool_(busy|dead) at all?

I can understand the usefulness to some degree, but we can also tell whether the pool is busy or exhausted on the server side. Not to mention that, since these metrics appear to have been struggling in recording rules and dashboards for a long time, they can't have been relied on much. Only narrowly scoped ad hoc queries would have succeeded, and those provide little value.

Keen to get your thoughts on it.
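To illustrate why dropping a label can trigger duplicate-timestamp rejections, here is a minimal Python sketch. It is not Mimir or Prometheus code: the label names (`host`, `class`), the sample values, and both helper functions are hypothetical, and it only models the general idea that two series which differed solely in a dropped label end up with the same label set and the same timestamp.

```python
from collections import defaultdict


def drop_labels(labels, labels_to_drop):
    """Return a copy of the label set without the dropped labels."""
    return {k: v for k, v in labels.items() if k not in labels_to_drop}


def find_collisions(samples, labels_to_drop):
    """Group samples by (label set after dropping, timestamp).

    Any group holding more than one sample represents series that were
    distinct before the drop but are indistinguishable afterwards.
    """
    groups = defaultdict(list)
    for labels, timestamp, value in samples:
        remaining = drop_labels(labels, labels_to_drop)
        key = (frozenset(remaining.items()), timestamp)
        groups[key].append(value)
    return {key: values for key, values in groups.items() if len(values) > 1}


# Hypothetical samples: (labels, timestamp, value). Label names are made up.
samples = [
    ({"__name__": "gitlab_database_connection_pool_busy", "host": "db-1", "class": "A"}, 1000, 3.0),
    ({"__name__": "gitlab_database_connection_pool_busy", "host": "db-1", "class": "B"}, 1000, 5.0),
    ({"__name__": "gitlab_database_connection_pool_busy", "host": "db-2", "class": "A"}, 1000, 1.0),
]

# Dropping "class" merges the first two series into one label set.
collisions = find_collisions(samples, {"class"})
```

Running a check like this against a sample of real series before rolling out a drop rule would show whether the remaining labels still identify each series uniquely.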

tag:gitlab.com,2026-03-18:5215515336 Nick Duff approved merge request !983: Add Atlantis at GitLab.com / GitLab Infrastructure Team / ArgoCD / ArgoCD Applications 2026-03-18T03:13:37Z nduff Nick Duff [email protected]

What

Add Atlantis service.

Why

Migrated from Helmfile.

gitlab-com/gl-infra/production-engineering#28241

tag:gitlab.com,2026-03-18:5215514166 Nick Duff commented on issue #21581 at GitLab.com / GitLab Infrastructure Team / Production 2026-03-18T03:12:46Z nduff Nick Duff [email protected]

@pguinoiseau LGTM!

tag:gitlab.com,2026-03-18:5215510553 Nick Duff opened merge request !10103: chore(mimir): increase ingester replicas at GitLab.com / GitLab Infrastructure Team / Kubernetes Workloads / GitLab He... 2026-03-18T03:10:06Z nduff Nick Duff [email protected]

What

  • Increase Ingester Replicas for Mimir

Why

We have had two recent saturation events due to the increase in peak metric usage.

This will increase the ingester count by 5 per zone. It could require more, but I'm hoping it buys enough time to audit and drop some of our expensive metrics.

See the linked issue for more details.

Related Incident: gitlab-com/gl-infra/observability/team#4516

tag:gitlab.com,2026-03-18:5215506822 Nick Duff commented on issue #4516 at GitLab.com / GitLab Infrastructure Team / Observability / Observability Issue Tracker 2026-03-18T03:07:37Z nduff Nick Duff [email protected]

Just a quick summary of this:

There were several ingesters across multiple AZs that experienced OOMKills.

This was due to a large increase in in-memory series.

For our largest tenant we can see the recent growth, with the high-water mark getting higher and higher:

image

source

So naturally, as we see more metrics over time, we are going to experience more pressure on our ingesters.

We have a few options for this:

  1. Enable blocks-storage.tsdb.early-head-compaction-min-in-memory-series
  • We previously looked at this but found it caused issues with some series that arrive with old timestamps. This requires further auditing.
  2. Scale Ingester Replicas
  • Easiest option for now to provide relief.
  • We also need to enable autoscaling for the ingesters, but this isn't trivial due to their stateful nature and the fact that they keep 6 hours of recent data. Ideally we would want to move to the new architecture first.
  3. Start to remove high-cardinality and unused metrics
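
For the last option, one way to shortlist candidates is to count series per metric name and then look at which labels contribute the most distinct values. The following is a minimal Python sketch under the assumption that the active series have already been exported as a list of label dictionaries; the function names and the sample data are hypothetical and are not part of any Mimir tooling.

```python
from collections import Counter, defaultdict


def series_per_metric(series_label_sets):
    """Count series per metric name, largest first."""
    counts = Counter(labels["__name__"] for labels in series_label_sets)
    return counts.most_common()


def label_cardinality(series_label_sets, metric):
    """Count distinct values per label for a single metric."""
    values = defaultdict(set)
    for labels in series_label_sets:
        if labels.get("__name__") != metric:
            continue
        for name, value in labels.items():
            if name != "__name__":
                values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


# Hypothetical export of active series (label sets only, no samples).
series = [
    {"__name__": "metric_a", "pod": "p1", "path": "/x"},
    {"__name__": "metric_a", "pod": "p1", "path": "/y"},
    {"__name__": "metric_a", "pod": "p2", "path": "/x"},
    {"__name__": "metric_b", "pod": "p1"},
]

ranking = series_per_metric(series)
per_label = label_cardinality(series, "metric_a")
```

Metrics at the top of `ranking` whose labels show high counts in `per_label`, and which no dashboard or recording rule uses, would be the first ones to review.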
tag:gitlab.com,2026-03-18:5215502622 Nick Duff pushed new project branch nduff/mimir-ingesters-moar at GitLab.com / GitLab Infrastructure Team / Kubernetes Workloads / GitLab Helmfiles 2026-03-18T03:04:44Z nduff Nick Duff [email protected]

Nick Duff (343c689b) at 18 Mar 03:04

chore(mimir): increase ingester replicas

tag:gitlab.com,2026-03-18:5215306119 Nick Duff deleted project branch nduff/metric-label-clean at GitLab.com / GitLab Infrastructure Team / Kubernetes Workloads / GitLab Helmfiles 2026-03-18T01:05:00Z nduff Nick Duff [email protected]

Nick Duff (26fd2b1c) at 18 Mar 01:05

tag:gitlab.com,2026-03-18:5215305971 Nick Duff pushed to project branch master at GitLab.com / GitLab Infrastructure Team / Kubernetes Workloads / GitLab Helmfiles 2026-03-18T01:04:52Z nduff Nick Duff [email protected]

Nick Duff (db175b5b) at 18 Mar 01:04

Merge branch 'nduff/metric-label-clean' into 'master'

... and 1 more commit

tag:gitlab.com,2026-03-18:5215305936 Nick Duff accepted merge request !10100: chore(prometheus): drop un-used labels from some metrics at GitLab.com / GitLab Infrastructure Team / Kubernetes Wor... 2026-03-18T01:04:50Z nduff Nick Duff [email protected]

What

  • Drops un-used labels from gitlab_database_connection_pool_(busy|dead)
  • Fixes a typo in an old drop rule

Why

Follow up to !10099 to enable for all envs.

tag:gitlab.com,2026-03-18:5215291559 Nick Duff opened merge request !10100: chore(prometheus): drop un-used labels from some metrics at GitLab.com / GitLab Infrastructure Team / Kubernetes Workl... 2026-03-18T00:55:42Z nduff Nick Duff [email protected]

What

  • Drops un-used labels from gitlab_database_connection_pool_(busy|dead)
  • Fixes a typo in an old drop rule

Why

Follow up to !10099 to enable for all envs.

tag:gitlab.com,2026-03-18:5215289083 Nick Duff pushed new project branch nduff/metric-label-clean at GitLab.com / GitLab Infrastructure Team / Kubernetes Workloads / GitLab Helmfiles 2026-03-18T00:54:24Z nduff Nick Duff [email protected]

Nick Duff (26fd2b1c) at 18 Mar 00:54

chore(prometheus): drop un-used labels from some metrics