Nick Duff (3ecca296) at 18 Mar 15:06
Merge remote-tracking branch 'upstream/main'
... and 2 more commits
Nick Duff (810ed4df) at 18 Mar 13:10
Merge remote-tracking branch 'upstream/main'
... and 1 more commit
Nick Duff (b6e69a62) at 18 Mar 08:06
Merge remote-tracking branch 'upstream/main'
... and 1 more commit
Enable autosync for Atlantis.
Last step of CR gitlab-com/gl-infra/production#21581
Nick Duff (343c689b) at 18 Mar 03:51
Nick Duff (1c86fc4d) at 18 Mar 03:51
Merge branch 'nduff/mimir-ingesters-moar' into 'master'
... and 1 more commit
We have had two recent saturation events due to the increase in peak metric usage.
This will increase the ingester count by 5 per zone. It could require more, but I'm hoping it buys enough time to audit and drop some of our expensive metrics.
See the linked issue for more details.
Related Incident: gitlab-com/gl-infra/observability/team#4516
I've added a small increase to the ingesters here.
This gives 5 more per zone; hopefully that is enough breathing room to work on point 3.
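For illustration only, a bump like this might look as follows in a Helm-style values file (the field names and counts here are hypothetical; the actual change lives in the MR itself):

```yaml
# Hypothetical sketch of the ingester scale-up; not the exact MR diff.
# The real change adds 5 replicas per zone to relieve in-memory series pressure.
ingester:
  zoneAwareReplication: true
  replicasPerZone: 25   # illustrative value; previous count + 5
```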
@reprazent I was looking at some of this data today following the recent Ingester OOM incident. I've scaled the Ingesters slightly, but it would be good to figure out what we can address on this side of the fence as well.
In the process I noticed these metrics are producing a large amount of data:
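(A quick way to confirm this, sketched as a PromQL query against the tenant in question; adjust the matcher as needed:)

```promql
# Count active series for the suspect metrics
count({__name__=~"gitlab_database_connection_pool_(busy|dead)"})
```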
I had a look at where these are used: none of the recording rules are working properly, and querying the data is almost impossible because it needs to load more chunks than we currently permit.
I also did something rather careless in the process: I dropped some of the high-cardinality labels in this MR, only to realize during rollout that I had removed some of the unique dimensions, so we are now seeing some err-mimir-sample-duplicate-timestamp errors. Instead of fixing this straight away (as noted, these metrics have been pretty much impossible to query for a while), I would rather raise the question: do we need gitlab_database_connection_pool_(busy|dead) at all?
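For context, the kind of label drop that triggers this looks roughly like the following Prometheus `metric_relabel_configs` fragment (illustrative only, not the exact change from the MR): when the dropped label was the only dimension distinguishing two series, the remaining series collide and the ingester rejects the duplicate samples.

```yaml
# Illustrative metric_relabel_configs fragment; label name is hypothetical.
# Dropping a label that uniquely distinguishes series makes them collide,
# producing err-mimir-sample-duplicate-timestamp on ingest.
metric_relabel_configs:
  - regex: "some_high_cardinality_label"
    action: labeldrop
```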
I can understand the usefulness to some degree, but we can also tell whether the pool is busy or exhausted from the server side. And since these metrics have been failing in recording rules and dashboards for a long time, they can't have been relied on much. Only tightly scoped ad-hoc queries would have succeeded, and those provide little value.
Keen to get your thoughts on it.
@pguinoiseau LGTM!
Just a quick summary of this:
There were several ingesters across multiple AZs that experienced OOMKills.
This was due to a large increase in in-memory series.
For our largest tenant we can see the recent growth, with the high-water mark getting higher and higher.
So naturally, as we ingest more metrics over time, we will see increasing pressure on our ingesters.
We have a few options for this:
`blocks-storage.tsdb.early-head-compaction-min-in-memory-series`
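As a rough sketch, enabling early head compaction might look like this in the ingester's Mimir config (key names follow the flag above; the threshold value is a placeholder, not a recommendation):

```yaml
# Illustrative Mimir ingester config enabling early TSDB head compaction
# when in-memory series exceed a threshold. Value is a placeholder.
blocks_storage:
  tsdb:
    early_head_compaction_min_in_memory_series: 1000000
```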
Nick Duff (343c689b) at 18 Mar 03:04
chore(mimir): increase ingester replicas
Nick Duff (26fd2b1c) at 18 Mar 01:05
Nick Duff (db175b5b) at 18 Mar 01:04
Merge branch 'nduff/metric-label-clean' into 'master'
... and 1 more commit
gitlab_database_connection_pool_(busy|dead)
Follow up to !10099 to enable for all envs.
Nick Duff (26fd2b1c) at 18 Mar 00:54
chore(prometheus): drop unused labels from some metrics