Biren Shah activity

Biren Shah commented on issue #21587 at GitLab.com / GitLab Infrastructure Team / Production

2026-03-18T17:25:11Z

Thanks, @cmcfarland. I've updated the steps.

Biren Shah commented on issue #21587 at GitLab.com / GitLab Infrastructure Team / Production

2026-03-18T17:06:59Z

@gitlab-org/saas-platforms/change-review-leadership We are looking for Infrastructure Manager approval to execute this CR.

cc @alexives @rnienaber

Biren Shah opened issue #21587: [C2][GPRD] Add a new CI replica database to the CI patroni cluster in GPRD (node 110) at GitLab.com / GitLab Infrastructure T...

2026-03-18T17:05:00Z

Biren Shah commented on issue #21214 at GitLab.com / GitLab Infrastructure Team / Production

2026-03-18T01:00:27Z

Sharing some additional info for reference. I ran the following queries on patroni-main-v17-01-db-gstg to understand the logical_test_id_seq sequence better:

1. Check the sequence object:

SELECT
  c.relname,
  c.oid,
  pg_catalog.obj_description(c.oid, 'pg_class') AS description
FROM pg_class c
WHERE c.relname = 'logical_test_id_seq'
AND c.relkind = 'S';

       relname       |   oid    | description
---------------------+----------+-------------
 logical_test_id_seq | 60601280 |
(1 row)

2. Check the owning table and column:

SELECT
  seq.relname AS sequence_name,
  tab.relname AS table_name,
  attr.attname AS column_name
FROM pg_class seq
JOIN pg_depend dep ON dep.objid = seq.oid
JOIN pg_class tab ON dep.refobjid = tab.oid
JOIN pg_attribute attr ON attr.attrelid = tab.oid AND attr.attnum = dep.refobjsubid
WHERE seq.relname = 'logical_test_id_seq'
AND seq.relkind = 'S';

    sequence_name    |  table_name  | column_name
---------------------+--------------+-------------
 logical_test_id_seq | logical_test | id
(1 row)

The sequence is owned by the logical_test table's id column — so dropping the table would automatically drop the sequence too.

I also checked activity over the past year: pg_stat_user_tables_seq_scan{env="gstg", relname="logical_test"} shows a cumulative count of only 3-5 sequential scans and a recent rate of zero, indicating that the table has not been actively used. The table does not exist in production.

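To make the "small cumulative count, zero recent rate" distinction concrete, here is a minimal Python sketch of how a cumulative counter like seq_scan turns into an increase and a per-second rate. It is an illustration under stated assumptions: the (timestamp, value) samples are made up, and the helper names are hypothetical. Unlike PromQL's increase()/rate(), it does no extrapolation to the window edges.

```python
# Hypothetical sketch: how a cumulative counter such as
# pg_stat_user_tables_seq_scan turns into an "increase" and a per-second rate.
# Samples are (unix_ts, value) pairs; the data below is illustrative only.


def counter_increase(samples):
    """Total increase across samples, treating any drop as a counter reset.

    After a reset (e.g. pg_stat_reset()) the counter restarts at zero, so the
    post-reset value is counted as new increase. Unlike PromQL's increase(),
    no extrapolation to the window edges is done.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        total += curr - prev if curr >= prev else curr
    return total


def per_second_rate(samples):
    """Average per-second rate over the time span covered by the samples."""
    if len(samples) < 2:
        return 0.0
    window = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / window if window > 0 else 0.0


# A counter parked at a small cumulative value: non-zero total, zero rate.
flat = [(0, 4), (3600, 4), (7200, 4)]
print(counter_increase(flat), per_second_rate(flat))  # 0.0 0.0
```

Note that seq_scan only counts sequential scans; index scans are tracked separately (idx_scan), so both are worth checking before concluding a table is unused.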

Waiting for confirmation from the team before proceeding with the drop, but I think it's safe to remove.

Biren Shah commented on issue #21214 at GitLab.com / GitLab Infrastructure Team / Production

2026-03-18T00:30:24Z

@tkhandelwal3 Sorry for the delayed response; I was out sick and then on PTO, and missed the CR review request.

The logical_test_id_seq sequence is owned by a logical_test table, which suggests it is a leftover from one of the PG upgrade or logical replication validation exercises on staging.

I've asked the Data Engineering team to confirm who created it and why here: https://gitlab.com/gitlab-org/database-team/meetings/-/work_items/18#note_3168827548

To me, it looks like it can be safely removed, but let's wait for confirmation before proceeding.

@dbre-sre

Biren Shah commented on issue #21460 at GitLab.com / GitLab Infrastructure Team / Production

2026-03-10T23:38:27Z

Also did a quick search and found similar CRs that might be helpful as a reference:

Biren Shah commented on issue #21460 at GitLab.com / GitLab Infrastructure Team / Production

2026-03-10T23:34:52Z

@saadullah707 Not to nitpick, but since it's your first CR: for completeness and ease of execution, please move all the steps from "Monitor build of archived/delayed" through completion under "Change steps - steps to take to execute the change". Also, in the chef-client-disable command, remove the old reference that was copied from the upgrade CR:

sudo chef-client-disable "rebuild/restore - https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/20991"

Replace it with a reference to this CR instead.

Biren Shah commented on merge request !665 at GitLab.com / GitLab Infrastructure Team / db-migration

2026-03-10T21:43:10Z

First of all, this is really awesome, @vporalla. Great job! It will surely help, and we will refine it as we validate the automation from benchmarking to staging to production.

A few thoughts and questions:

Idempotency / Re-runnability

With Ansible, tasks are idempotent so we can safely re-run them. How does this work for the Python phases here? If we realize we missed a step or a step failed mid-phase, can we re-run the same phase without causing issues? For example, if phase4 fails halfway through, is it safe to re-run pgupgrade phase4 from the beginning, or do we need to manually undo some steps first? It would be good to document the re-run behaviour per phase — especially for the more destructive ones like switchover and DDL lock.
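One common pattern for re-runnable phases, sketched here in Python purely to illustrate the question (not how pgupgrade actually works), is to record each completed step in a state file and skip recorded steps on the next run. The phase/step names, the run_phase helper, and the JSON state-file layout are all hypothetical.

```python
# Hypothetical sketch: make a multi-step phase safe to re-run by recording each
# completed step in a JSON state file and skipping it on the next run. Phase and
# step names and the state-file layout are illustrative assumptions, not the
# actual pgupgrade implementation.
import json
import tempfile
from pathlib import Path


def run_phase(phase, steps, state_file):
    """Run (name, callable) steps in order, skipping steps already marked done.

    A step is marked done only after it returns without raising, so a failure
    mid-phase leaves that step and every later one eligible for a retry.
    Returns the names of the steps executed on this run.
    """
    path = Path(state_file)
    state = json.loads(path.read_text()) if path.exists() else {}
    done = set(state.get(phase, []))
    executed = []
    for name, func in steps:
        if name in done:
            continue  # completed on an earlier run; don't repeat it
        func()  # an exception here stops the phase before the step is recorded
        done.add(name)
        executed.append(name)
        state[phase] = sorted(done)
        path.write_text(json.dumps(state))  # persist after every step
    return executed


if __name__ == "__main__":
    attempts = {"flaky_step": 0}

    def flaky():
        attempts["flaky_step"] += 1
        if attempts["flaky_step"] == 1:
            raise RuntimeError("simulated mid-phase failure")

    steps = [("pre_checks", lambda: None), ("flaky_step", flaky), ("post_checks", lambda: None)]
    state_file = Path(tempfile.mkdtemp()) / "phase_state.json"
    try:
        run_phase("phase4", steps, state_file)
    except RuntimeError as exc:
        print("first run failed:", exc)
    print("re-run executed:", run_phase("phase4", steps, state_file))
```

This only skips steps that fully completed; the step that failed partway is re-run from its start, so each individual step (especially switchover and DDL lock) still has to tolerate being repeated, which is exactly what the per-phase documentation would need to spell out.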

Selective task execution

With Ansible we can use tags to selectively run specific tasks (e.g. --tags "pre-checks"). Is there an equivalent here? Can we run a specific step within a phase without running the whole phase? This would be very useful when a single step fails and we just want to retry that one step rather than the entire phase.
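As a rough illustration of what a tags-like equivalent could look like, here is a minimal argparse sketch with a repeatable --step flag. The flag, the PHASE_STEPS registry, and the step names are assumptions for discussion, not options the tool is known to have.

```python
# Hypothetical sketch of an Ansible-tags-style selector: a repeatable
# `--step NAME` flag limits a phase run to the named steps. The flag and the
# step registry below are illustrative assumptions, not existing options.
import argparse

PHASE_STEPS = {
    "phase4": ["pre_checks", "ddl_lock", "switchover", "post_checks"],
}


def select_steps(phase, only=None):
    """Return the phase's steps, filtered to `only` if given, in phase order."""
    steps = PHASE_STEPS[phase]
    if not only:
        return list(steps)
    unknown = set(only) - set(steps)
    if unknown:
        raise ValueError(f"unknown step(s) for {phase}: {sorted(unknown)}")
    return [s for s in steps if s in only]


def main(argv=None):
    parser = argparse.ArgumentParser(prog="pgupgrade")
    parser.add_argument("phase", choices=sorted(PHASE_STEPS))
    parser.add_argument("--step", action="append", dest="steps",
                        help="run only this step; repeat to select several")
    args = parser.parse_args(argv)
    for step in select_steps(args.phase, args.steps):
        print(f"would run {args.phase}:{step}")  # placeholder for real execution
    return 0


if __name__ == "__main__":
    main(["phase4", "--step", "switchover"])  # prints: would run phase4:switchover
```

Rejecting unknown step names up front is deliberate: a typo in a step name should fail loudly rather than silently run nothing.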

Manual command output / fallback

Thinking out loud here — would it be possible for the tool to optionally print the underlying commands it would execute for a given phase or step, without actually running them? Something like a --dry-run or --show-commands flag. The reason I ask is that as we rely more on this automation, the manual runbook will naturally become outdated and eventually unmaintained. Having the tool itself be the source of truth for "what commands does this step run" would be a great safety net if someone ever needs to execute a step manually in an emergency.
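One way to get that for free, sketched below under the assumption that steps shell out to commands, is to define each step's commands as data so the same definition can be executed or just printed. The Step class, run_step helper, and the example command arguments are hypothetical placeholders.

```python
# Hypothetical sketch: each step declares its commands as data (argv lists), so
# the same definition can either run them or just print them for a --dry-run /
# --show-commands style listing. Step names and commands are placeholders.
import shlex
import subprocess
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    commands: list = field(default_factory=list)  # each entry is an argv list


def run_step(step, dry_run=False, printer=print):
    """Print every command in copy-pasteable form; execute only if not dry_run."""
    for argv in step.commands:
        printer(f"[{step.name}] {shlex.join(argv)}")
        if not dry_run:
            subprocess.run(argv, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    example = Step("disable_chef", [["sudo", "chef-client-disable", "example reason"]])
    run_step(example, dry_run=True)  # prints: [disable_chef] sudo chef-client-disable 'example reason'
```

Because the printed form is produced with shlex.join, it can be pasted into a shell as-is, which is what makes the listing usable as an emergency manual runbook.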

upgrade_manifest.yml documentation

I'll read up on the manifest more carefully, but it would help to understand the validation behaviour for each field. I can see from the MR description that manifest.py has full type validation, which is great. It would be useful to have inline comments in the YAML itself, or a reference document, explaining the purpose and expected values of each field, particularly host lists, schedule times, and cluster names, so that anyone filling in the manifest can understand what's expected without digging into the code.
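A possible way to keep the documentation from drifting away from the validation is to define each field's type and help text in one place and generate the commented YAML skeleton from it. The sketch below is hypothetical: the field names only stand in for the host list, schedule time, and cluster name fields mentioned above, and are not the real upgrade_manifest.yml schema.

```python
# Hypothetical sketch: keep each manifest field's type and help text in one
# place, validate a loaded manifest against it, and render a commented YAML
# skeleton from the same definitions. Field names are illustrative assumptions,
# not the real upgrade_manifest.yml schema.
FIELDS = {
    "cluster_name": (str, "Patroni cluster to upgrade"),
    "hosts": (list, "Hostnames of all nodes in the cluster"),
    "switchover_time": (str, "Planned switchover start, ISO 8601 UTC; quote it so YAML keeps it a string"),
}


def validate(manifest):
    """Return a list of error messages; an empty list means the manifest is valid."""
    errors = []
    for name, (expected, _help) in FIELDS.items():
        if name not in manifest:
            errors.append(f"missing field: {name}")
        elif not isinstance(manifest[name], expected):
            got = type(manifest[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {got}")
    for name in sorted(manifest.keys() - FIELDS.keys()):
        errors.append(f"unknown field: {name}")
    return errors


def render_template():
    """Render a YAML skeleton in which every field carries its help text as a comment."""
    lines = []
    for name, (expected, help_text) in FIELDS.items():
        lines.append(f"# {help_text} ({expected.__name__})")
        lines.append(f"{name}: []" if expected is list else f"{name}:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_template())
```

The quoting note on the time field matters if the manifest is loaded with PyYAML: an unquoted ISO 8601 timestamp is parsed as a datetime, not a string.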

GitLab issue integration

The automatic posting of phase-completion comments to the ops CR, marked "Posted automatically by pg-upgrade-automation", is great: it gives a clean audit trail without any manual effort.
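For reference, a minimal sketch of how such a comment can be posted through the GitLab REST API notes endpoint (POST /projects/:id/issues/:issue_iid/notes) using only the standard library. The function names, the way the token is passed in, and the example base URL and project path are assumptions; only the footer text comes from the MR.

```python
# Minimal sketch of posting a phase-completion note through the GitLab REST API
# (POST /projects/:id/issues/:issue_iid/notes). The footer text comes from the
# MR; the function names and how the token is supplied are assumptions.
import json
import urllib.parse
import urllib.request

FOOTER = "Posted automatically by pg-upgrade-automation"


def build_note_request(base_url, project_path, issue_iid, body, token):
    """Build (but don't send) the request that adds a note to an issue."""
    project_id = urllib.parse.quote(project_path, safe="")  # "a/b" -> "a%2Fb"
    url = f"{base_url}/api/v4/projects/{project_id}/issues/{issue_iid}/notes"
    data = json.dumps({"body": f"{body}\n\n_{FOOTER}_"}).encode()
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )


def post_note(*args, **kwargs):
    """Send the request; urlopen raises urllib.error.HTTPError on an error status."""
    with urllib.request.urlopen(build_note_request(*args, **kwargs)) as resp:
        return json.load(resp)
```

Splitting request construction from sending keeps the audit-comment format easy to unit test without touching a real GitLab instance.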

Upgrade issue with captured outputs

It's great that you have the upgrade issue with all the captured outputs and validation from the db-benchmarking environment. It covers a lot of ground, so it will take the team some time to review everything thoroughly. A recorded demo would also be really valuable for future reference and onboarding.

Long-running tasks and session management

I assume you are running these phases in tmux, given long-running tasks like amcheck that can take many hours, and it sounds like the tool is designed with that in mind. Good to know that's the expected way to run it.

I'll add more detailed inline comments when I have free cycles, but wanted to share these thoughts early.