John Cai (a9ede422) at 17 Mar 21:34
hook: Fix race condition in PostReceiveRegistry signal handling
... and 5 more commits
cc: @stanhu
Praefect provides high availability for Gitaly clusters by coordinating replication, routing, and failover. To ensure production reliability, we need to systematically test Praefect's behavior under various failure conditions.
Praefect is a gRPC proxy that sits between GitLab Rails and Gitaly nodes, managing routing, replication, and failover for a virtual storage. The tables below list the failure scenarios we want to exercise and the behavior we expect from the cluster.
**Single node failures**

| Scenario | Expected Behavior |
|---|---|
| Primary node goes offline | Failover to healthy secondary within ~3 health check cycles |
| Secondary node goes offline | Replication jobs queue; reads continue from remaining replicas |
| Node returns after failure | Reconciliation triggers; node catches up via replication |
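The failover trigger in the first row can be sketched as a counter over consecutive failed health checks. This is an illustrative sketch, not Praefect's actual health-check code; the three-failure threshold and the type names are assumptions for the example:

```go
package main

import "fmt"

// healthTracker marks a node unhealthy after a threshold of
// consecutive failed health checks. Illustrative only.
type healthTracker struct {
	failures  int
	threshold int
}

// observe records one health-check result and reports whether the
// node should now be considered unhealthy. Any success resets the
// failure counter.
func (h *healthTracker) observe(ok bool) bool {
	if ok {
		h.failures = 0
		return false
	}
	h.failures++
	return h.failures >= h.threshold
}

func main() {
	t := &healthTracker{threshold: 3}
	for i, ok := range []bool{true, false, false, false} {
		if t.observe(ok) {
			// Check 4 is the third consecutive failure.
			fmt.Printf("unhealthy after check %d; failover if primary\n", i+1)
		}
	}
}
```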
**Quorum loss**

| Scenario | Expected Behavior |
|---|---|
| >50% of Gitaly nodes offline | Virtual storage becomes unavailable or read-only |
| Exactly 50% nodes offline (split-brain) | No quorum reached; transactions abort |
| All secondaries offline | Writes succeed on primary only; replication queues |
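The quorum rows above come down to a strict-majority test, where exactly half the nodes is not enough. A minimal sketch (the function name is ours, not Praefect's):

```go
package main

import "fmt"

// hasQuorum reports whether enough nodes are healthy for cluster
// writes, using strict majority: exactly 50% is not a quorum,
// matching the split-brain row above. Illustrative sketch only.
func hasQuorum(healthy, total int) bool {
	return healthy*2 > total
}

func main() {
	fmt.Println(hasQuorum(2, 3)) // true: majority healthy
	fmt.Println(hasQuorum(2, 4)) // false: exactly 50%, no quorum
	fmt.Println(hasQuorum(1, 3)) // false: minority
}
```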
**Database (PostgreSQL) failures**

| Scenario | Expected Behavior |
|---|---|
| PostgreSQL completely unavailable | All routing fails; cluster unavailable |
| Database connection pool exhausted | Request timeouts; graceful degradation |
| Slow database queries (>30s) | Operations timeout; error propagation |
| Database briefly unavailable (<10s) | Automatic reconnection; minimal disruption |
**Network partitions**

| Scenario | Expected Behavior |
|---|---|
| Praefect ↔ Gitaly partition | Node marked unhealthy; failover if primary |
| Praefect ↔ PostgreSQL partition | Cluster operations fail |
| Praefect instance isolation | Remaining instances maintain quorum |
| Gitaly ↔ Gitaly partition | Replication fails; retried with backoff |
**Transaction voting failures**

| Scenario | Expected Behavior |
|---|---|
| Primary crashes mid-transaction | Transaction aborts; no partial commits |
| Secondary fails after voting | Replication job scheduled for recovery |
| Network delay causes vote timeout | Transaction aborts; client retry needed |
**Praefect high availability**

| Scenario | Expected Behavior |
|---|---|
| Single Praefect instance down | Other instances continue; health quorum maintained |
| Majority of Praefect instances down | Cannot elect new primaries; existing primaries continue |
| All Praefect instances restart simultaneously | Cluster recovers; re-establishes node health |
**Load and resource exhaustion**

| Scenario | Expected Behavior |
|---|---|
| Replication queue grows unbounded | Backpressure; oldest jobs processed first |
| High concurrent write load | Transaction timeouts; load shedding |
| Gitaly node disk full | Node fails health checks; failover |
Tooling:

- Chaos service (see `internal/praefect/service/server/info.go`)
- `testhelper.MustCreateGitalyClient()` patterns
- `testdb` package with connection manipulation
- toxiproxy or iptables rules for network simulation

Metrics:

- `gitaly_praefect_transactions_total` (success/failure breakdown)
- `gitaly_praefect_replication_jobs` (queue depth)
- `gitaly_praefect_node_latency_bucket` (health check latency)

So at a high level, this is essentially a message broker between system resource conditions being met, and whatever needs to respond to those events?
With our current adaptive limiter, would this sit between the watchers and the adaptive limiters?
@leeeee-gtlb per our conversation around Gitaly on k8s support, what if we introduced this idea of "early support" which essentially means you have to register in order to receive support if you're running production workloads on Gitaly on k8s?
This would still allow us to "GA" while gating the support load. Thoughts?
We are ready to declare Gitaly on K8s GA but with early support. This means customers must register to receive support on Gitaly on K8s.
John Cai (d7e3d125) at 17 Mar 19:21
Gitaly on K8s GA
Also, the diagram has a step "Git writes refs". Should there be a separate step where Git writes the objects? Does that come before or after the reference-transaction hook?
Yeah that could clarify things. I think objects get written before the reference-transaction hook.
Makes sense! Do you know what would happen if the secondaries were lagging behind in generation number?
Only secondaries that are up to date get included in the set of voters
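That rule — only up-to-date secondaries join the voting set — can be sketched as filtering storages by generation number. The types and names here are illustrative, not Praefect's:

```go
package main

import "fmt"

// eligibleVoters returns the storages whose generation matches the
// primary's; lagging replicas are excluded from the voting set.
// Illustrative sketch of the rule, not real Praefect code.
func eligibleVoters(generations map[string]int, primaryGen int) []string {
	var voters []string
	for storage, gen := range generations {
		if gen == primaryGen {
			voters = append(voters, storage)
		}
	}
	return voters
}

func main() {
	gens := map[string]int{"gitaly-1": 5, "gitaly-2": 5, "gitaly-3": 3}
	// gitaly-3 lags at generation 3 and is excluded.
	fmt.Println(len(eligibleVoters(gens, 5))) // prints 2
}
```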
@GitLabDuo can you give me an example in the codebase where it's formatted like this?
John Cai (d1d47aa7) at 17 Mar 14:27
Praefect configuration: Enable TLS + DNS
... and 591 more commits
@GitLabDuo I can't find anywhere in the codebase where it's formatted like that.
I believe we have a plan, but we are waiting for the upstream contribution, and then we can go ahead with @eric.p.ju's design.
great work @oli.campeau and @gl-gitaly!
UPDATE: The issues described above represent edge cases in Praefect's consistency model. They do not constitute data loss—Git objects remain durable, and replicas self-heal through reconciliation. Several of these issues are being addressed directly (see #7059+, PostReceiveHook invoked before write has been r... (gitaly#5406)).
While Raft remains the long-term vision for Gitaly, Praefect is still a viable HA solution.
@pks-gitlab will gitaly#7059 unblock this effort?
One way to address inconsistencies on crash is to ensure clean startup. Here's a possibility
```sql
-- One row per in-flight transaction; rows are removed once the
-- transaction commits or aborts, so the table is empty on clean shutdown.
CREATE TABLE pending_transactions (
    transaction_id  BIGINT PRIMARY KEY,
    repository_id   BIGINT NOT NULL REFERENCES repositories(repository_id) ON DELETE CASCADE,
    primary_storage TEXT NOT NULL,
    expected_voters TEXT[] NOT NULL,
    change_type     TEXT NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
Normal operation: a row is inserted before voting begins and deleted after `IncrementGeneration()` succeeds, so only in-flight transactions remain in the table.
On Praefect startup:
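One possible shape of that startup pass, sketched with hypothetical types mirroring the table's columns (real recovery would query `pending_transactions` and schedule work through Praefect's replication queue):

```go
package main

import "fmt"

// pendingTransaction mirrors a row of the proposed
// pending_transactions table.
type pendingTransaction struct {
	transactionID  int64
	repositoryID   int64
	primaryStorage string
	expectedVoters []string
}

// reconcileOnStartup turns every transaction that was still pending
// when Praefect crashed into replication jobs from the primary to
// each expected voter, so replicas converge before serving traffic.
// Illustrative sketch, not real Praefect code.
func reconcileOnStartup(pending []pendingTransaction) []string {
	var jobs []string
	for _, tx := range pending {
		for _, voter := range tx.expectedVoters {
			if voter == tx.primaryStorage {
				continue // primary already has the write
			}
			jobs = append(jobs, fmt.Sprintf("replicate repo %d: %s -> %s",
				tx.repositoryID, tx.primaryStorage, voter))
		}
	}
	return jobs
}

func main() {
	pending := []pendingTransaction{
		{1, 42, "gitaly-1", []string{"gitaly-1", "gitaly-2", "gitaly-3"}},
	}
	for _, j := range reconcileOnStartup(pending) {
		fmt.Println(j) // one job each for gitaly-2 and gitaly-3
	}
}
```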
Why This Works