Nick Duff (3ecca296) at 18 Mar 15:06
Merge remote-tracking branch 'upstream/main'
... and 2 more commits
Nick Duff (810ed4df) at 18 Mar 13:10
Merge remote-tracking branch 'upstream/main'
... and 1 more commit
Nick Duff (b6e69a62) at 18 Mar 08:06
Merge remote-tracking branch 'upstream/main'
... and 1 more commit
Enable autosync for Atlantis.
Last step of CR gitlab-com/gl-infra/production#21581
Nick Duff (343c689b) at 18 Mar 03:51
Nick Duff (1c86fc4d) at 18 Mar 03:51
Merge branch 'nduff/mimir-ingesters-moar' into 'master'
... and 1 more commit
We have had two recent saturation events due to the increase in peak metric usage.
This will increase the ingester count by 5 per zone. It could require more, but I'm hoping it buys enough time to audit and drop some of our expensive metrics.
See the linked issue for more details.
Related Incident: gitlab-com/gl-infra/observability/team#4516
I've added a small increase to the ingesters here.
This gives 5 more per zone; hopefully that is enough breathing room to work on point 3.
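For illustration only, a bump like this might look as follows in a Helm-style values file (the field names and counts here are hypothetical; the actual change lives in the MR itself):

```yaml
# Hypothetical sketch of the ingester scale-up; not the exact MR diff.
# The real change adds 5 replicas per zone to relieve in-memory series pressure.
ingester:
  zoneAwareReplication: true
  replicasPerZone: 25   # illustrative value; previous count + 5
```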
@reprazent I was looking at some of this data today following the recent Ingester OOM incident. I've scaled the Ingesters slightly, but it would be good to figure out what we can address on this side of the fence as well.
In the process I noticed these metrics are producing a large amount of data:
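(A quick way to confirm this, sketched as a PromQL query against the tenant in question; adjust the matcher as needed:)

```promql
# Count active series for the suspect metrics
count({__name__=~"gitlab_database_connection_pool_(busy|dead)"})
```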
I had a look at where these are used: none of the recording rules are working properly, and querying the data is almost impossible because it needs to load more chunks than we currently permit.
I also did something rather careless in the process: I dropped some of the high-cardinality labels in this MR, only to realize during rollout that I had removed some of the unique dimensions, so we are now seeing some err-mimir-sample-duplicate-timestamp errors. Instead of fixing this straight away (as noted, these metrics have been pretty much impossible to query for a while), I would rather raise the question: do we need gitlab_database_connection_pool_(busy|dead) at all?
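For context, the kind of label drop that triggers this looks roughly like the following Prometheus `metric_relabel_configs` fragment (illustrative only, not the exact change from the MR): when the dropped label was the only dimension distinguishing two series, the remaining series collide and the ingester rejects the duplicate samples.

```yaml
# Illustrative metric_relabel_configs fragment; label name is hypothetical.
# Dropping a label that uniquely distinguishes series makes them collide,
# producing err-mimir-sample-duplicate-timestamp on ingest.
metric_relabel_configs:
  - regex: "some_high_cardinality_label"
    action: labeldrop
```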
I can understand the usefulness to some degree, but we can also tell whether the pool is busy or exhausted from the server side. And since these metrics have been failing in recording rules and dashboards for a long time, they can't have been relied on much. Only tightly scoped ad-hoc queries would have succeeded, and those provide little value.
Keen to get your thoughts on it.
@pguinoiseau LGTM!
Just a quick summary of this:
There were several ingesters across multiple AZs that experienced OOMKills.
This was due to a large increase in in-memory series.
For our largest tenant we can see the recent growth, with the high-water mark getting higher and higher.
So naturally, as we ingest more metrics over time, we will see increasing pressure on our ingesters.
We have a few options for this:
`blocks-storage.tsdb.early-head-compaction-min-in-memory-series`
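As a rough sketch, enabling early head compaction might look like this in the ingester's Mimir config (key names follow the flag above; the threshold value is a placeholder, not a recommendation):

```yaml
# Illustrative Mimir ingester config enabling early TSDB head compaction
# when in-memory series exceed a threshold. Value is a placeholder.
blocks_storage:
  tsdb:
    early_head_compaction_min_in_memory_series: 1000000
```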
Nick Duff (343c689b) at 18 Mar 03:04
chore(mimir): increase ingester replicas
Nick Duff (26fd2b1c) at 18 Mar 01:05
Nick Duff (db175b5b) at 18 Mar 01:04
Merge branch 'nduff/metric-label-clean' into 'master'
... and 1 more commit
gitlab_database_connection_pool_(busy|dead)
Follow up to !10099 to enable for all envs.
Nick Duff (26fd2b1c) at 18 Mar 00:54
chore(prometheus): drop unused labels from some metrics