John Cai activity https://gitlab.com/jcaigitlab 2026-03-17T21:34:20Z tag:gitlab.com,2026-03-17:5214835180 John Cai pushed to project branch jc/praefect-wait-for-voting-post-receive-hook at GitLab.org / Gitaly 2026-03-17T21:34:20Z jcaigitlab John Cai [email protected]

John Cai (a9ede422) at 17 Mar 21:34

hook: Fix race condition in PostReceiveRegistry signal handling

... and 5 more commits

tag:gitlab.com,2026-03-17:5214613673 John Cai commented on issue #7123 at GitLab.org / Gitaly 2026-03-17T20:22:18Z jcaigitlab John Cai [email protected]

cc: @stanhu

tag:gitlab.com,2026-03-17:5214611471 John Cai opened issue #7123: Implement Chaos Testing Framework for Praefect High Availability at GitLab.org / Gitaly 2026-03-17T20:21:29Z jcaigitlab John Cai [email protected]

Description

Praefect provides high availability for Gitaly clusters by coordinating replication, routing, and failover. To ensure production reliability, we need to systematically test Praefect's behavior under various failure conditions.

Background

Praefect is a gRPC proxy that sits between GitLab Rails and Gitaly nodes, managing:

  • Request routing (accessor vs mutator RPCs)
  • Primary election via SQL elector with quorum-based consensus
  • Replication queue processing
  • Reference transaction voting for strong consistency
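
To make the routing split concrete, here is a minimal sketch (the `Node` type and `route` function are invented for illustration and ignore the up-to-date generation checks the real router performs): accessor RPCs may go to any healthy node, while mutator RPCs must reach the primary:

```go
package main

import "fmt"

// Node is a hypothetical view of a Gitaly node as seen by the router.
type Node struct {
	Name    string
	Primary bool
	Healthy bool
}

// route picks a target node for an RPC. Mutator RPCs must hit the
// primary; accessor RPCs may hit any healthy node (primary included).
func route(nodes []Node, mutator bool) (string, error) {
	for _, n := range nodes {
		if !n.Healthy {
			continue
		}
		if !mutator || n.Primary {
			return n.Name, nil
		}
	}
	return "", fmt.Errorf("no routable node")
}

func main() {
	nodes := []Node{
		{Name: "gitaly-1", Primary: true, Healthy: true},
		{Name: "gitaly-2", Healthy: true},
	}
	target, _ := route(nodes, true)
	fmt.Println("mutator routed to", target) // gitaly-1, the primary
}
```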

Proposed Chaos Test Scenarios

1. Single Gitaly Node Failures

  • Primary node goes offline: failover to healthy secondary within ~3 health check cycles
  • Secondary node goes offline: replication jobs queue; reads continue from remaining replicas
  • Node returns after failure: reconciliation triggers; node catches up via replication

2. Quorum Loss Scenarios

  • >50% of Gitaly nodes offline: virtual storage becomes unavailable or read-only
  • Exactly 50% of nodes offline (split-brain): no quorum reached; transactions abort
  • All secondaries offline: writes succeed on primary only; replication queues
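
The quorum rows above come down to strict-majority arithmetic; a small sketch (not Praefect's actual elector code) of the rule that exactly half is not a quorum:

```go
package main

import "fmt"

// hasQuorum reports whether the reachable nodes form a strict majority
// of the total. Exactly half (a symmetric split) is NOT a quorum, which
// is why a 50/50 partition aborts transactions.
func hasQuorum(reachable, total int) bool {
	return 2*reachable > total
}

func main() {
	fmt.Println(hasQuorum(2, 3)) // true: majority of three
	fmt.Println(hasQuorum(2, 4)) // false: exactly half, split-brain risk
	fmt.Println(hasQuorum(1, 3)) // false: minority
}
```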

3. Database Failures

  • PostgreSQL completely unavailable: all routing fails; cluster unavailable
  • Database connection pool exhausted: request timeouts; graceful degradation
  • Slow database queries (>30s): operations timeout; error propagation
  • Database briefly unavailable (<10s): automatic reconnection; minimal disruption

4. Network Partition Scenarios

  • Praefect ↔️ single Gitaly partition: node marked unhealthy; failover if primary
  • Praefect ↔️ PostgreSQL partition: cluster operations fail
  • Praefect instance isolation: remaining instances maintain quorum
  • Gitaly ↔️ Gitaly partition during replication: replication fails; retried with backoff

5. Transaction/Consistency Failures

  • Primary crashes mid-transaction: transaction aborts; no partial commits
  • Secondary fails after voting: replication job scheduled for recovery
  • Network delay causes vote timeout: transaction aborts; client retry needed

6. Praefect Instance Failures

  • Single Praefect instance down: other instances continue; health quorum maintained
  • Majority of Praefect instances down: cannot elect new primaries; existing primaries continue
  • All Praefect instances restart simultaneously: cluster recovers; re-establishes node health

7. Resource Exhaustion

  • Replication queue grows unbounded: backpressure; oldest jobs processed first
  • High concurrent write load: transaction timeouts; load shedding
  • Gitaly node disk full: node fails health checks; failover

Acceptance Criteria

  • Chaos test framework integrated with existing test infrastructure
  • Each scenario has automated test coverage
  • Tests validate both failure behavior AND recovery
  • Metrics captured during chaos events (latency, error rates, recovery time)
  • Documentation of expected vs actual behavior for each scenario
  • CI/CD integration for periodic chaos testing

Suggested Implementation Approach

  1. Use existing chaos patterns - Praefect already has a Chaos service (see internal/praefect/service/server/info.go)
  2. Leverage testhelper infrastructure - Use testhelper.MustCreateGitalyClient() patterns
  3. Database chaos - Use testdb package with connection manipulation
  4. Network chaos - Consider tools like toxiproxy or iptables rules for network simulation
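
As a rough shape for the framework (the `Scenario` type and `Run` helper below are hypothetical, not existing Gitaly test helpers), each scenario could inject a fault, assert the expected degraded behavior, restore, then assert recovery:

```go
package main

import "fmt"

// Scenario is a hypothetical chaos scenario: inject a fault, verify
// the degraded behavior, restore the fault, then verify recovery.
type Scenario struct {
	Name    string
	Inject  func() error
	Verify  func() error // expected behavior while the fault is active
	Restore func() error
	Recover func() error // expected behavior after restoration
}

// Run executes one scenario end to end, attempting cleanup on failure
// so later scenarios start from a clean cluster.
func Run(s Scenario) error {
	if err := s.Inject(); err != nil {
		return fmt.Errorf("%s: inject: %w", s.Name, err)
	}
	if err := s.Verify(); err != nil {
		s.Restore() // best-effort cleanup
		return fmt.Errorf("%s: verify: %w", s.Name, err)
	}
	if err := s.Restore(); err != nil {
		return fmt.Errorf("%s: restore: %w", s.Name, err)
	}
	if err := s.Recover(); err != nil {
		return fmt.Errorf("%s: recover: %w", s.Name, err)
	}
	return nil
}

func main() {
	// Toy fault: a flag standing in for, e.g., a toxiproxy partition.
	down := false
	err := Run(Scenario{
		Name:   "primary offline",
		Inject: func() error { down = true; return nil },
		Verify: func() error {
			if !down {
				return fmt.Errorf("fault not active")
			}
			return nil
		},
		Restore: func() error { down = false; return nil },
		Recover: func() error {
			if down {
				return fmt.Errorf("not recovered")
			}
			return nil
		},
	})
	fmt.Println("err:", err)
}
```

Running Verify both during the fault and after restoration is what satisfies the "failure behavior AND recovery" acceptance criterion above.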

Key Metrics to Monitor

  • gitaly_praefect_transactions_total (success/failure breakdown)
  • gitaly_praefect_replication_jobs (queue depth)
  • gitaly_praefect_node_latency_bucket (health check latency)
  • Recovery time to healthy state
  • Data consistency verification post-recovery
tag:gitlab.com,2026-03-17:5214526544 John Cai commented on merge request !8510 at GitLab.org / Gitaly 2026-03-17T19:54:09Z jcaigitlab John Cai [email protected]

So at a high level, this is essentially a message broker between system resource conditions being met and whatever needs to respond to those events?

With our current adaptive limiter, would this sit between the watchers and the adaptive limiters?

tag:gitlab.com,2026-03-17:5214507594 John Cai pushed to project branch jc/praefect-wait-for-voting-post-receive-hook at GitLab.org / Gitaly 2026-03-17T19:46:46Z jcaigitlab John Cai [email protected]

John Cai (11a5ad5e) at 17 Mar 19:46

hook: Fix race condition in PostReceiveRegistry signal handling

... and 5 more commits

tag:gitlab.com,2026-03-17:5214444279 John Cai commented on merge request !227743 at GitLab.org / GitLab 2026-03-17T19:25:10Z jcaigitlab John Cai [email protected]

@leeeee-gtlb per our conversation around Gitaly on k8s support, what if we introduced this idea of "early support" which essentially means you have to register in order to receive support if you're running production workloads on Gitaly on k8s?

This would still allow us to "GA" while gating the support load. Thoughts?

tag:gitlab.com,2026-03-17:5214434011 John Cai opened merge request !227743: Draft: Gitaly on K8s GA at GitLab.org / GitLab 2026-03-17T19:22:14Z jcaigitlab John Cai [email protected]

What does this MR do and why?

We are ready to declare Gitaly on K8s GA, but with early support: customers must register to receive support for Gitaly on K8s.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

tag:gitlab.com,2026-03-17:5214432407 John Cai pushed new project branch jc/gitaly-k8s-ga at GitLab.org / GitLab 2026-03-17T19:21:38Z jcaigitlab John Cai [email protected]

John Cai (d7e3d125) at 17 Mar 19:21

Gitaly on K8s GA

tag:gitlab.com,2026-03-17:5214337588 John Cai commented on issue #5406 at GitLab.org / Gitaly 2026-03-17T18:50:36Z jcaigitlab John Cai [email protected]

Also, the diagram has a step Git writes refs. Should there be a separate step where Git is writing the objects? Does that come before or after the reference-transaction hook?

Yeah that could clarify things. I think objects get written before the reference-transaction hook.

Makes sense! Do you know what would happen if the secondaries were lagging behind in generation number?

Only secondaries that are up to date get included in the set of voters

tag:gitlab.com,2026-03-17:5213904190 John Cai commented on merge request !227341 at GitLab.org / GitLab 2026-03-17T16:50:51Z jcaigitlab John Cai [email protected]

@GitLabDuo can you give me an example in the codebase where it's formatted like this?

tag:gitlab.com,2026-03-17:5213215570 John Cai pushed to project branch jc/docs-on-dns-tls at GitLab.org / GitLab 2026-03-17T14:27:57Z jcaigitlab John Cai [email protected]

John Cai (d1d47aa7) at 17 Mar 14:27

Praefect configuration: Enable TLS + DNS

... and 591 more commits

tag:gitlab.com,2026-03-17:5213205491 John Cai commented on merge request !227341 at GitLab.org / GitLab 2026-03-17T14:26:15Z jcaigitlab John Cai [email protected]

@GitLabDuo I can't find anywhere in the codebase where it's formatted like that.

tag:gitlab.com,2026-03-17:5213115321 John Cai commented on epic #19936 at GitLab.org 2026-03-17T14:08:18Z jcaigitlab John Cai [email protected]

I believe we have a plan, but we are waiting for the upstream contribution; then we can go ahead with @eric.p.ju's design

tag:gitlab.com,2026-03-17:5213107764 John Cai commented on epic #20030 at GitLab.org 2026-03-17T14:06:55Z jcaigitlab John Cai [email protected]

great work @oli.campeau and @gl-gitaly!

tag:gitlab.com,2026-03-16:5209989492 John Cai commented on epic #8175 at GitLab.org 2026-03-16T20:30:02Z jcaigitlab John Cai [email protected]

UPDATE: The issues described above represent edge cases in Praefect's consistency model. They do not constitute data loss—Git objects remain durable, and replicas self-heal through reconciliation. Several of these issues are being addressed directly (see #7059+, PostReceiveHook invoked before write has been r... (gitaly#5406)).

While Raft remains the long-term vision for Gitaly, Praefect remains a viable HA solution.

tag:gitlab.com,2026-03-16:5209975936 John Cai pushed to project branch jc/praefect-wait-for-voting-post-receive-hook at GitLab.org / Gitaly 2026-03-16T20:24:44Z jcaigitlab John Cai [email protected]

John Cai (030021bb) at 16 Mar 20:24

hook: Fix race condition in PostReceiveRegistry signal handling

tag:gitlab.com,2026-03-16:5209930938 John Cai commented on epic #19936 at GitLab.org 2026-03-16T20:08:12Z jcaigitlab John Cai [email protected]

@pks-gitlab will gitaly#7059 unblock this effort?

tag:gitlab.com,2026-03-16:5209840708 John Cai commented on issue #3955 at GitLab.org / Gitaly 2026-03-16T19:36:18Z jcaigitlab John Cai [email protected]

One way to address inconsistencies on crash is to ensure a clean startup. Here's one possibility:

Pending Transaction Tracking for Praefect Crash Consistency

TL;DR

Record transaction intent in PostgreSQL before telling Gitalys to commit. On Praefect startup, check for any pending transactions and verify the actual state of replicas to reconcile the database.

New Table

CREATE TABLE pending_transactions (
    transaction_id   BIGINT PRIMARY KEY,
    repository_id    BIGINT NOT NULL REFERENCES repositories(repository_id) ON DELETE CASCADE,
    primary_storage  TEXT NOT NULL,
    expected_voters  TEXT[] NOT NULL,
    change_type      TEXT NOT NULL,
    created_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Flow

Normal operation:

  1. Transaction quorum reached
  2. INSERT pending transaction record
  3. Send COMMIT to Gitalys (they write to disk)
  4. Request finalizer runs IncrementGeneration()
  5. DELETE pending transaction record
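
The normal-operation ordering can be sketched as follows; the `insertPending`/`sendCommit`/`incrementGeneration`/`deletePending` stubs are placeholders for the real database and Gitaly calls, and only their ordering matters:

```go
package main

import "fmt"

// calls records the order of operations; the stubs below stand in for
// the real database and Gitaly clients.
var calls []string

func insertPending(txID int64) error       { calls = append(calls, "insert"); return nil }
func sendCommit(txID int64) error          { calls = append(calls, "commit"); return nil }
func incrementGeneration(txID int64) error { calls = append(calls, "increment"); return nil }
func deletePending(txID int64) error       { calls = append(calls, "delete"); return nil }

// commitTransaction performs steps 2-5 of the flow, run after quorum
// is reached. The pending record is written BEFORE commit is sent, so
// a crash at any point leaves either no record (nothing committed yet)
// or a record that the startup pass can reconcile against disk state.
func commitTransaction(txID int64) error {
	if err := insertPending(txID); err != nil {
		return err
	}
	if err := sendCommit(txID); err != nil {
		return err // record survives; startup recovery reconciles it
	}
	if err := incrementGeneration(txID); err != nil {
		return err
	}
	return deletePending(txID)
}

func main() {
	if err := commitTransaction(42); err != nil {
		panic(err)
	}
	fmt.Println(calls) // [insert commit increment delete]
}
```

Because the record is inserted before COMMIT is sent, every crash point maps onto one of the startup-recovery cases below.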

On Praefect startup:

  1. Query all pending transactions
  2. For each: verify actual replica states (e.g., compare HEAD refs across storages)
  3. Update database generation to reflect which replicas actually committed
  4. Queue replication jobs for any that didn't
  5. Delete the pending record
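
A minimal sketch of the per-transaction reconciliation step; the `PendingTx` type, `Reconcile` function, and OID comparison are invented for illustration (real code would go through the datastore layer):

```go
package main

import "fmt"

// PendingTx mirrors a row of the proposed pending_transactions table.
type PendingTx struct {
	ID             int64
	ExpectedVoters []string
}

// Reconcile decides, for one pending transaction, which replicas get a
// generation bump and which need a replication job, based on the refs
// actually observed on disk. committed maps storage name to the OID its
// target ref points at; want is the OID the transaction wrote.
func Reconcile(tx PendingTx, committed map[string]string, want string) (bump, replicate []string) {
	for _, storage := range tx.ExpectedVoters {
		if committed[storage] == want {
			bump = append(bump, storage)
		} else {
			replicate = append(replicate, storage)
		}
	}
	return bump, replicate
}

func main() {
	tx := PendingTx{ID: 1, ExpectedVoters: []string{"gitaly-1", "gitaly-2", "gitaly-3"}}
	// gitaly-3 crashed before writing the ref.
	state := map[string]string{"gitaly-1": "abc123", "gitaly-2": "abc123", "gitaly-3": "old999"}
	bump, replicate := Reconcile(tx, state, "abc123")
	fmt.Println(bump)      // [gitaly-1 gitaly-2]
	fmt.Println(replicate) // [gitaly-3]
}
```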

Why This Works

  • If Praefect crashes before the INSERT: no record exists, no inconsistency
  • If Praefect crashes after the INSERT but before COMMIT is sent: recovery finds the record, verifies that no Gitaly committed, and deletes the record
  • If Praefect crashes after COMMIT: recovery finds the record, detects which Gitalys committed, and updates generations accordingly

The database always ends up reflecting the true on-disk state.
tag:gitlab.com,2026-03-16:5209787900 John Cai opened issue #3955: Gitaly inconsistency on Praefect crash at GitLab.org / Gitaly 2026-03-16T19:20:03Z jcaigitlab John Cai [email protected]