The overall error rate has remained stable since initial infrastructure improvements were made to help mitigate the larger incidents last week, which had caused big spikes again before that:
The smaller spikes we saw yesterday and today around noon/afternoon line up with when we still saw increased saturation on our Sidekiq workers. (As well as the even flatter curve on the weekend aligning with less saturation due to generally lower load.) We're engaging with the infra team to understand the timeline for further improvements – one idea being explored was to separate out the CI-related Sidekiq workloads entirely to improve resiliency. But while this may sound like an obvious solution, there are complexities involved that may count against that approach.
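To illustrate the general idea (this is a minimal, self-contained sketch with hypothetical names, not how queue routing actually works in the GitLab codebase): CI-related workers would be routed to a dedicated queue, so a separate Sidekiq fleet could drain them without competing with other workloads for the same processes.

```ruby
# Hypothetical routing table: which worker classes count as CI-related.
# Names are illustrative assumptions, not real GitLab worker classes.
CI_WORKERS = %w[PipelineProcessWorker BuildFinishWorker].freeze

# Route CI workers to their own queue; everything else stays on the default.
# A dedicated Sidekiq process could then be started to consume only
# the "ci_dedicated" queue.
def queue_for(worker_class_name)
  CI_WORKERS.include?(worker_class_name) ? "ci_dedicated" : "default"
end
```

The trade-off hinted at above: once workloads are split, capacity can no longer be shared between queues, so each fleet has to be sized for its own peaks.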
On the application side itself, we're looking into more ways to use caching or pure optimizations to further reduce the amount of I/O operations during config processing, in order to decrease the likelihood of running into the contention issues in the first place – see !227483 for the most current efforts.
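As a rough sketch of the caching direction (hypothetical names, not the actual implementation in !227483): memoizing fetched include contents within a single config processing run means repeated references to the same file cost only one I/O operation.

```ruby
# Illustrative per-run cache for include contents. The fetcher is a callable
# that performs the expensive I/O; repeated lookups for the same location
# are served from memory.
class IncludeCache
  def initialize(fetcher)
    @fetcher = fetcher
    @store = {}
  end

  def fetch(location)
    # ||= means the fetcher runs at most once per location in this run.
    @store[location] ||= @fetcher.call(location)
  end
end
```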
Component and project includes are the most common usage patterns we see for the namespaces that are most affected. We've been mostly optimizing for component includes in our application-level efforts recently, and will begin looking into project includes next, based on what we've learned.
Finally, the increase of the GITLAB_CI_CONFIG_FETCH_TIMEOUT_SECONDS value has not actually taken effect yet – environment variables are a bit of an outdated pattern for adjusting something like this on GitLab.com; this one was mostly added to allow self-managed users to tune the value. I expect the increase to take effect within the next 24 hours. Hopefully Sidekiq remains as stable as it has been over the last few days, so we can more easily gauge the effect of the change. If it does cause a notable drop in error frequency and no other adverse effects, we can consider temporarily raising it even further.
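For anyone unfamiliar with this kind of knob, a minimal sketch of how an env-var-tunable fetch timeout can be applied (the default value and helper names here are assumptions, not GitLab's actual code):

```ruby
require "timeout"

# Assumed fallback when the env var is unset; not GitLab's real default.
DEFAULT_TIMEOUT = 30

# Read the tunable from the environment, falling back to the default.
def config_fetch_timeout
  Integer(ENV.fetch("GITLAB_CI_CONFIG_FETCH_TIMEOUT_SECONDS", DEFAULT_TIMEOUT))
end

# Wrap the (expensive) config fetch so it aborts after the configured timeout.
def fetch_config_with_timeout(&fetcher)
  Timeout.timeout(config_fetch_timeout, &fetcher)
end
```

Since the value is only read from the environment at process start-up in a typical deployment, changes like the one above take effect on the next rollout rather than immediately – consistent with the delay mentioned here.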
Thanks @ddieulivol – I'm not sure I'd even consider that an upside. Imo it's a bit of an antipattern to reference a non-static URL for content that is (mainly) static. Not that I don't see your perspective, but the flipside is that you forget this include even exists, and then when changing the file you break the pipeline that includes it. (Although this particular included file does have a comment that addresses this.)
Yes, using a versioned (aka published) component would be even better – and we did indeed recently add caching for those as well. (Also behind a flag, and caching might actually not even be the most efficient solution there, but either way – the fact that a published component version references an immutable git tag makes things a lot easier for any optimizations we're playing with.)
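A tiny sketch of why the immutability matters (hypothetical names): because a published component version points at a fixed git tag, a cache keyed on the (component, tag) pair never needs invalidation – the same key always resolves to the same content.

```ruby
# Illustrative cache keyed on an immutable (path, tag) pair. Since a tag
# never moves, entries can live indefinitely without any invalidation logic.
COMPONENT_CACHE = {}

def fetch_component(path, tag, &resolver)
  COMPONENT_CACHE[[path, tag]] ||= resolver.call
end
```

With mutable refs (branches, moving URLs) the same approach would need TTLs or explicit invalidation, which is exactly the complexity the immutable tag avoids.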
So I think in the bigger picture we should definitely try to get rid of any remote includes in the GitLab pipeline config. I was just happy that we even had one, as it allowed us to test the opt-in caching we built.
@sxuereb No worries, it's a pretty busy time for all of us I think.
As for the question from Slack a while back: "How many private groups (ideally top-level-groups) use any public component and what is that percentage?"
Apparently the answer is… three.
I remember Dov mentioning in the past that private component usage was much higher than public, but… that seems extreme. @furkanayhan @fabiopitino Does that number seem plausible to you?
If we need a tie-breaker, I'd also vote for Marcel's suggestion. While the MVC does indeed only create a CI-focused pipeline, I think the branding should win – as José said, we've traditionally not really made this distinction in similar situations.
Manuel Grabowski (4f193946) at 17 Mar 17:11
Update .gitlab-ci.yml file
Manuel Grabowski (b2d11ef3) at 17 Mar 17:11
Update .gitlab-ci.yml file
Manuel Grabowski (7d9e8c63) at 17 Mar 17:08
Update .gitlab-ci.yml file
Manuel Grabowski (e7965ee4) at 17 Mar 16:56
Save actual config fetch duration after final timeout check
Manuel Grabowski (caab93fa) at 17 Mar 14:55
Update .gitlab-ci.yml file
Ah, right – seeing the steps list locked me into thinking too technically, so I skipped a bit over the existing proposal. A project setting sounds like a fairly cheap first step we could take. If we wanted it enabled by default, we'd have to make sure to expose it via the API, so that people who automate project creation and rely on the old behavior could easily avoid breaking their workflows. We could consider splitting up the suggested Phase 1 even further and just starting with an opt-in setting.
To be documented in the future via #593134
"Pipeline Builder Agent" (without "AI" at the start)
I'd be in favor of that, yes. @veethika @csaez-ext @dhershkovitch Any concerns?
Thanks Furkan, in light of the recent Sidekiq issues this is a good perspective.
The late processing of workflow rules also came up recently with "CI/CD Pipeline with inputs ignores workflow rules" (gitlab#574807), so I had already been thinking about it a bit – how early can we meaningfully parse them? Variables would already have to be resolved, given that most people use variables in their rules – so we would need to integrate it into step 12 somehow? Or split up and reorganize step 12?