Chloe Liu (2f632e5d) at 17 Mar 22:09
Merge branch 'fix-webhook-auto-disable-flaky-test-18789' into 'master'
... and 1 more commit
Chloe Liu (c48a6a4c) at 17 Mar 22:08
The flaky test (webhook_events_spec.rb:108 — "hook is auto-disabled") had two problems:
Wrong assertion target. The test asserted alert_status == 'disabled' (permanently disabled), but permanent disable requires recent_failures > 39 (40 consecutive failures). The test only triggers 5 issues, so the hook can reach at most recent_failures == 4, which is temporarily_disabled — never disabled. The test was likely written when the threshold was lower and was not updated when it changed.
Race condition on async worker. The alert_status was checked with a single-shot webhook.reload! + expect(...).to eql(...). Since failures are processed asynchronously via WebHooks::LogExecutionWorker (protected by an ExclusiveLease), the failure count may not have been fully persisted at the time of the assertion. This explains why the test sometimes saw executable instead of any disabled state.
Stack trace from the issue confirms this: expected: "disabled", got: "executable".
temporarily_disabled instead of disabled — This matches what 4 consecutive failures actually produce given the current TEMPORARILY_DISABLED_FAILURE_THRESHOLD = 3 and PERMANENTLY_DISABLED_FAILURE_THRESHOLD = 39.alert_status check in eventually_eq with retry — Polls webhook.reload!.alert_status for up to 30 seconds, accommodating async worker processing time.eventually_be_truthy for event count — Assert at least 4 events arrived (not exactly 4), since timing may allow a 5th event before the hook is disabled.backoff! calls racing on the ExclusiveLease, where one call may see the hook as already temporarily disabled and skip incrementing.Resolves https://gitlab.com/gitlab-org/quality/test-failure-issues/-/work_items/18789
@carlad-gl Thanks for the update! Changes LGTM
The flaky test (webhook_events_spec.rb:108 — "hook is auto-disabled") had two problems:
Wrong assertion target. The test asserted alert_status == 'disabled' (permanently disabled), but permanent disable requires recent_failures > 39 (40 consecutive failures). The test only triggers 5 issues, so the hook can reach at most recent_failures == 4, which is temporarily_disabled — never disabled. The test was likely written when the threshold was lower and was not updated when it changed.
Race condition on async worker. The alert_status was checked with a single-shot webhook.reload! + expect(...).to eql(...). Since failures are processed asynchronously via WebHooks::LogExecutionWorker (protected by an ExclusiveLease), the failure count may not have been fully persisted at the time of the assertion. This explains why the test sometimes saw executable instead of any disabled state.
Stack trace from the issue confirms this: expected: "disabled", got: "executable".
temporarily_disabled instead of disabled — This matches what 4 consecutive failures actually produce given the current TEMPORARILY_DISABLED_FAILURE_THRESHOLD = 3 and PERMANENTLY_DISABLED_FAILURE_THRESHOLD = 39.alert_status check in eventually_eq with retry — Polls webhook.reload!.alert_status for up to 30 seconds, accommodating async worker processing time.eventually_be_truthy for event count — Assert at least 4 events arrived (not exactly 4), since timing may allow a 5th event before the hook is disabled.backoff! calls racing on the ExclusiveLease, where one call may see the hook as already temporarily disabled and skip incrementing.Resolves https://gitlab.com/gitlab-org/quality/test-failure-issues/-/work_items/18789
@SamWord Thanks for the clarification eventually_eq in this scenario because we're just polling on a same cached value with a 30s timeout. I'd suggest a direct eq assertion would be sufficient.
Chloe Liu (ccc19ffb) at 17 Mar 11:24
Merge branch 'bump-helm-kind-ci' into 'master'
... and 1 more commit
Chloe Liu (a141b806) at 17 Mar 11:24
Helm v4 is the recommended version for latest version of gitlab chart
Updates CI tool versions:
kindest/node:v1.34.0)
kind 0.31 ships with a default node image of kindest/node:v1.35.0 (Kubernetes v1.35.0). This caused kind cluster creation to fail in CI because the kubelet inside the kind node never becomes healthy in the Docker-in-Docker environment used by the docker+machine executor.
The failure manifests during kubeadm init at the wait-control-plane phase:
[kubelet-check] The kubelet is not healthy after 4m0.000442569s
Kubernetes v1.35.0 is very new and has compatibility issues with the nested DinD setup on CI runners. To resolve this, a --kind-image option has been added to the orchestrator, allowing the kind node image to be pinned independently of the kind binary version. The CI configuration pins kindest/node:v1.34.0 to maintain a working Kubernetes version while still benefiting from kind 0.31 improvements.
.gitlab/ci/version.yml: Bumps kind (0.29 → 0.31) and Helm (3.16 → 4.1) versions.gitlab/ci/test-on-cng/main.gitlab-ci.yml: Passes --kind-image kindest/node:v1.34.0 to the orchestratorqa/gems/gitlab-orchestrator/lib/gitlab/orchestrator/commands/subcommands/deployment.rb: Adds --kind-image CLI optionqa/gems/gitlab-orchestrator/lib/gitlab/orchestrator/lib/kind/cluster.rb: Accepts and passes the kind_image parameter to kind create cluster --image
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Helm v4 is the recommended version for latest version of gitlab chart
Updates CI tool versions:
kindest/node:v1.34.0)
kind 0.31 ships with a default node image of kindest/node:v1.35.0 (Kubernetes v1.35.0). This caused kind cluster creation to fail in CI because the kubelet inside the kind node never becomes healthy in the Docker-in-Docker environment used by the docker+machine executor.
The failure manifests during kubeadm init at the wait-control-plane phase:
[kubelet-check] The kubelet is not healthy after 4m0.000442569s
Kubernetes v1.35.0 is very new and has compatibility issues with the nested DinD setup on CI runners. To resolve this, a --kind-image option has been added to the orchestrator, allowing the kind node image to be pinned independently of the kind binary version. The CI configuration pins kindest/node:v1.34.0 to maintain a working Kubernetes version while still benefiting from kind 0.31 improvements.
.gitlab/ci/version.yml: Bumps kind (0.29 → 0.31) and Helm (3.16 → 4.1) versions.gitlab/ci/test-on-cng/main.gitlab-ci.yml: Passes --kind-image kindest/node:v1.34.0 to the orchestratorqa/gems/gitlab-orchestrator/lib/gitlab/orchestrator/commands/subcommands/deployment.rb: Adds --kind-image CLI optionqa/gems/gitlab-orchestrator/lib/gitlab/orchestrator/lib/kind/cluster.rb: Accepts and passes the kind_image parameter to kind create cluster --image
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
given helm v4 is literal recommendation from our chart so it's probably been tested by team developing chart itself
Good point! I don’t have anything particular in mind, so I’m good with merging this
@acunskis Thanks for working on this! Just one nitpick and also an overall question since we're bumping major helm version with breaking changes, have we verified all chart operations are compatible? We can go ahead and merge this if this is confirmed
Helm v4 is the recommended version for latest version of gitlab chart
Updates CI tool versions:
kindest/node:v1.34.0)
kind 0.31 ships with a default node image of kindest/node:v1.35.0 (Kubernetes v1.35.0). This caused kind cluster creation to fail in CI because the kubelet inside the kind node never becomes healthy in the Docker-in-Docker environment used by the docker+machine executor.
The failure manifests during kubeadm init at the wait-control-plane phase:
[kubelet-check] The kubelet is not healthy after 4m0.000442569s
Kubernetes v1.35.0 is very new and has compatibility issues with the nested DinD setup on CI runners. To resolve this, a --kind-image option has been added to the orchestrator, allowing the kind node image to be pinned independently of the kind binary version. The CI configuration pins kindest/node:v1.34.0 to maintain a working Kubernetes version while still benefiting from kind 0.31 improvements.
.gitlab/ci/version.yml: Bumps kind (0.29 → 0.31) and Helm (3.16 → 4.1) versions.gitlab/ci/test-on-cng/main.gitlab-ci.yml: Passes --kind-image kindest/node:v1.34.0 to the orchestratorqa/gems/gitlab-orchestrator/lib/gitlab/orchestrator/commands/subcommands/deployment.rb: Adds --kind-image CLI optionqa/gems/gitlab-orchestrator/lib/gitlab/orchestrator/lib/kind/cluster.rb: Accepts and passes the kind_image parameter to kind create cluster --image
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
--kind-image "${KIND_NODE_IMAGE}" \nitpick