Stormatics – Excellence in PostgreSQL Services
https://stormatics.tech/

Thinking of PostgreSQL High Availability as Layers
https://stormatics.tech/blogs/thinking-of-postgresql-high-availability-as-layers
Mon, 09 Mar 2026 14:03:16 +0000
High availability for PostgreSQL is often treated as a single, big, dramatic decision: “Are we doing HA or not?”

That framing pushes teams into two extremes:

  • a “hero architecture” that costs a lot and still feels tense to operate, or
  • a minimalistic architecture that everyone hopes will just keep running.

A calmer way to design this is to treat HA and DR as layers. You start with a baseline, then add specific capabilities only when your RPO/RTO and budget justify them.

Let us walk through the layers from “single primary” to “multi-site DR posture”.

Start with outcomes

Before topology, align on three things:

1. Failure scope

    • A database host fails
    • A zone or data center goes away
    • A full region outage happens
    • Human error

2. RPO (Recovery Point Objective)

    • We can tolerate up to 15 minutes of data loss
    • We want close to zero

3. RTO (Recovery Time Objective)

    • We can be back in 30 minutes
    • We want service back in under 2 minutes

Here is my stance (and it saves money!): You get strong availability outcomes by layering in the right order.

Layer 0 – Single primary (baseline, no backups)

This is the baseline: one PostgreSQL primary in one site. All reads and writes go to it.

That is it. No replicas. No archiving. No backup flow in this model.

What you get:

  • simplicity
  • low cost
  • low operational overhead

What it means operationally:

  • Your “recovery plan” is effectively “rebuild and rehydrate from wherever you can” (which might be infrastructure snapshots, application-level rebuilds, or other ad hoc processes depending on your environment).
  • Your availability depends heavily on the stability of the underlying host, storage, and platform.

If you are running Layer 0, the best mindset is: keep it stable and observable.

  • solid monitoring (latency, errors, saturation)
  • sane maintenance (bloat, stats, connection hygiene)
  • predictable change management

Layer 0 is not a “bad” architecture. It is simply the baseline. The moment you want a reliable recovery posture, you move to Layer 1.

Layer 1 – Add offsite backups (your first real safety net)

Layer 1 keeps the same single primary in Site A, and adds backup storage in Site B.

This model introduces a defined recovery path.

What you gain:

  • You can lose the primary server and still recover your data.
  • You can meet an RPO that is “last successful backup” (which is often perfectly acceptable for many systems).

Practical ways teams implement this:

  • pgBackRest or Barman sending backups to object storage (often in another region/account)
  • retention policies that reflect compliance and business needs

An important point to note here – a backup is only as good as its ‘restorability’. If you can’t restore a backup, there is no point in taking one. Best practice is to run periodic drills to test the restore procedure, measure the time it takes, and verify the data it restores. 

Layer 2 – Add WAL archiving (PITR-ready recovery)

Layer 2 builds on Layer 1 by adding WAL archiving from Site A to Site B.

This is where recovery becomes precise and continuous.

Backups alone restore you to “the last backup.” WAL archiving lets you restore to a point in time.

What you gain:

  • PITR (Point-in-Time Recovery)
  • Tighter RPO
  • A clean response to human error

The habit that makes this layer valuable:

  • restore drills
  • timed drills
  • runbooks that a tired engineer can follow at 2 AM

Layer 2 is one of the highest-ROI layers in the entire model because it turns recovery into a controlled process rather than improvisation.
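Much of that discipline can be backed by a simple check: pg_stat_archiver tracks archiving successes and failures, so a failing archive_command is visible long before a restore drill. A minimal monitoring query:

```sql
-- Is archiving keeping up, and when did it last succeed or fail?
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;
```

A growing failed_count is an immediate signal that WAL is piling up on the primary.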

Layer 3 – Add a hot standby

Layer 3 keeps backups + WAL archiving, and adds a hot standby in Site A (often in a different zone or DC).

Primary → standby uses asynchronous streaming replication.

What you gain:

  • much faster RTO (fail over to the standby instead of rebuilding)
  • the option for load balancing (route read queries to the standby)
  • planned switchovers for maintenance that do not disrupt operations

Additional monitoring requirements:

  • replication lag
  • WAL generation rate
  • standby replay delay
  • failover readiness 
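Most of the items on that list can be read from pg_stat_replication on the primary. A sketch (column availability varies slightly by major version):

```sql
-- Per-standby replication health, as seen from the primary
SELECT application_name,
       state,
       sync_state,
       replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```

Alerting on replay_lag_bytes gives a simple, version-independent measure of how far behind each standby is.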

This is also where teams choose between:

  • disciplined manual failover
  • Auto failover using an HA manager

Either path works when it is tested and documented.

Layer 4 – Add synchronous replication

Layer 4 is where teams typically run a primary and multiple standbys, using:

  • synchronous replication for stronger data guarantees, and
  • asynchronous replication for flexibility and additional redundancy.

What you gain:

  • near-zero data loss for transactions protected by synchronous commit

What you accept:

  • added write latency
  • more explicit failure handling

An important part of the policy:

  • When the synchronous standby is unavailable, do you prefer continued writes (async mode) or do you prefer waiting until sync returns?

Teams that decide this up front operate Layer 4 calmly. Teams that leave it implicit tend to discover their “real” policy during an incident.
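That policy ultimately lands in configuration. A sketch of a quorum-based setup, with hypothetical standby names:

```sql
-- Wait for any one of the two named standbys to confirm each commit
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby1, standby2)';
SELECT pg_reload_conf();

-- Per-transaction escape hatch for low-value writes
SET synchronous_commit = local;
```

Writing the degraded-mode rule into the runbook next to this setting is what keeps Layer 4 calm during an incident.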

Layer 5 – Add a warm standby in Site B

Layer 5 is where you treat a second site as a true recovery location, adding regional redundancy. 

You keep your HA setup in Site A and maintain a warm standby in Site B, fed by backups and WAL archives that are continuously applied to the standby node.

What you gain:

  • a cleaner plan for site-level outages
  • a faster recovery path to Site B, reducing RTO

This layer also forces a useful reality check: DR is not only a database design. You also want:

  • routing (DNS/LB) that can switch cleanly
  • application configuration that supports failover
  • secrets and access that work in the DR site
  • rehearsed runbooks

When those pieces are ready, Layer 5 feels like a controlled switchover instead of a high-stress scramble.

Common gotchas that show up in production

These are the ones I see repeatedly:

  1. Backups exist; restore is untested. At best, this is Schrödinger’s backup – and you will only find out when there is an outage.
  2. WAL archiving is configured but not monitored. You want to make sure the consumer is consuming the files, so they don’t pile up on the producer. 
  3. Replication slots retain WAL longer than expected. This needs to be monitored, and you need to ask ‘why’. 
  4. Synchronous replication without a clear failure policy. Write the rule down, test it, and make it visible to the on-call team.
  5. Read traffic routed to standbys without thinking about staleness. Replica reads are great when you choose the right queries and accept the consistency model.
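Gotcha #3 in particular is easy to quantify. A sketch that shows how much WAL each slot is currently pinning:

```sql
-- WAL retained by each replication slot, largest first
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```

An inactive slot with a large retained_wal value is usually the ‘why’ behind a disk that keeps filling up.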

The post Thinking of PostgreSQL High Availability as Layers appeared first on Stormatics.

How PostgreSQL Scans Your Data
https://stormatics.tech/blogs/how-postgresql-scans-your-data
Thu, 05 Mar 2026 08:15:20 +0000

To understand how PostgreSQL scans data, we first need to understand how PostgreSQL stores it.

  • A table is stored as a collection of 8KB pages (by default) on disk.
  • Each page has a header, an array of item pointers (also called line pointers), and the actual tuple data growing from the bottom up.
  • Each tuple has its own header containing visibility info: xmin, xmax, cmin/cmax, and infomask bits.

There are different ways PostgreSQL can read data from disk. Depending on the query and available indexes, it can choose from several scan strategies:

  1. Sequential Scan 
  2. Index Scan
  3. Index-Only Scan
  4. Bitmap Index Scan

In this blog post, we’ll explore each of these scan types one by one.

Sequential Scan

The sequential scan is PostgreSQL’s brute-force access method. It reads every page of the table from block 0 to relpages - 1.

Step by Step

  1. PostgreSQL opens the primary physical file for the relation.
  2. Each 8 KB page is pulled into the shared buffer pool. If already cached, it’s a hit. Otherwise, Postgres reads it from disk. A lightweight pin prevents the buffer manager from evicting the page during reading.
  3. For each tuple, PostgreSQL compares xmin/xmax against the current transaction’s snapshot to determine whether the tuple is visible. Dead and in-progress tuples are skipped.
  4. Visible tuples are tested against the WHERE clause quals, and non-matching tuples are discarded.
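These steps are visible in the planner output. A sketch with an illustrative table; the numbers shown are typical shapes rather than real measurements:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'pending';

-- Seq Scan on orders  (cost=0.00..18334.00 rows=5000 width=97)
--   Filter: (status = 'pending'::text)
--   Rows Removed by Filter: 995000
--   Buffers: shared hit=4334 read=4000
```

The Buffers line separates cache hits from disk reads, and Rows Removed by Filter shows the work spent on non-matching tuples.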

Visibility Map Optimization

Normally, PostgreSQL must check each row’s visibility before returning it. Because of MVCC, a row might have been inserted, updated, or deleted by another transaction, so PostgreSQL verifies that the row is visible to the current snapshot. To make this faster, PostgreSQL maintains a visibility map. The visibility map stores a bit for each page in the table. If all rows on a page are visible to all transactions, that page is marked as all-visible in the visibility map. When this bit is set:

  • PostgreSQL knows every row on that page is visible.
  • It can skip the expensive per-row visibility checks.
  • It can move straight to applying the WHERE condition.

This significantly reduces CPU overhead during large scans.

Synchronized Scans


If multiple backends start a sequential scan on the same large table around the same time, PostgreSQL doesn’t let them all begin at page 0. Instead, it coordinates them using synchronized scans. When a second scan starts, it joins the first scan at its current position in the table and continues scanning from there. After reaching the end of the table, it wraps around to read the remaining pages from the beginning.

This approach reduces redundant work and makes better use of shared memory and disk bandwidth.

Parallel Seqscans

For very large tables, PostgreSQL can scan in parallel. A parallel leader process divides the table into block ranges. Multiple worker processes are launched, and each worker scans a different portion of the table.

When a worker finishes its assigned range, it requests another. This ensures that the work is evenly distributed and all CPU cores can participate in large scans to complete faster.
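A sketch of what this looks like in practice, assuming a hypothetical orders table large enough to trigger parallelism:

```sql
SHOW max_parallel_workers_per_gather;   -- default: 2
SHOW min_parallel_table_scan_size;      -- default: 8MB

EXPLAIN SELECT count(*) FROM orders;
-- Finalize Aggregate
--   ->  Gather  (Workers Planned: 2)
--         ->  Partial Aggregate
--               ->  Parallel Seq Scan on orders
```

Each worker aggregates its own block ranges, and the leader combines the partial results.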

When Does the Planner Choose seqscan?

PostgreSQL uses a cost model to estimate which plan will be cheapest. Two important settings influence this decision: seq_page_cost (default: 1.0) and random_page_cost (default: 4.0).

Sequential page reads are considered cheap and random page reads are assumed to be more expensive, so they’re given a higher cost. Because of this, a sequential scan is usually chosen when:

  • A large percentage of the table’s rows are expected to match
  • The table is small
  • The table already fits in memory
  • Using an index would require too many random page reads

In these cases, reading the table once from start to finish is estimated to be cheaper than jumping around via an index.

Index Scan

An index scan uses a separate structure to find matching rows. Instead of reading the entire table, PostgreSQL:

  1. Looks up matching values in the index
  2. Gets the physical locations of those rows
  3. Fetches only those rows from the table

The index and the table (heap) are two separate structures. The index stores keys and pointers, while the heap stores the actual row data.

What Is a TID?


Every index entry points to something called a TID (Tuple ID). A TID (also called an ItemPointer) is a small physical address made of:

  1. Block number (which heap page to read)
  2. Offset (which row slot inside that page)

Every index entry points to exactly one TID in the heap. Now let’s say you run:

SELECT * FROM users WHERE id = 42;

Here’s what happens during an index scan on id, step by step:

  1. PostgreSQL scans the index to find the entry where id = 42.
  2. From that index entry, it reads the TID.
  3. Using the block number in the TID, Postgres fetches the corresponding heap page into shared buffers (or uses it if it’s already cached).
  4. Inside that page, it uses the offset to locate the exact row.
  5. Before returning it, Postgres performs an MVCC visibility check (to make sure the row is visible to the current transaction).
  6. If the row is visible (and matches the query), PostgreSQL returns it.

So for each row fetched via an index scan, there is one I/O for the index page and another for the heap page.

HOT Chains: Avoiding Index Bloat

When a row is updated, PostgreSQL creates a new version of that row. If any indexed column changes, the index must also be updated. But if the indexed columns stay the same, PostgreSQL can avoid touching the index entirely. This is called a HOT (Heap-Only Tuple) update.

In a HOT update, the new row version is placed on the same heap page (if there’s space). The existing index entry still points to the original TID. Instead of creating a new index entry, PostgreSQL links the old tuple to the new version using an internal pointer (t_ctid). During an index scan, PostgreSQL fetches the heap page using the TID from the index. If that tuple was updated, it follows the chain (t_ctid) inside the same page to find the latest visible version and returns it.

Because no new index entry is created, the index stays smaller and cleaner. This reduces index bloat and makes updates cheaper when indexed columns don’t change.
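How often HOT actually kicks in is tracked per table, so the benefit is measurable. A sketch:

```sql
-- What fraction of updates avoided touching the indexes?
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_update_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;
```

A low hot_update_pct on a heavily updated table often means an indexed column is changing, or pages are too full to hold the new version (lowering fillfactor can help).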

Index-Only Scans

If all columns the query needs are in the index, PostgreSQL can skip the heap fetch entirely. But since the index has no visibility info, it consults the visibility map. If the VM bit for the corresponding heap page is set, the tuple is guaranteed visible and no heap fetch is needed. Tables with frequent VACUUM have more VM bits set, which is why index-only scans improve dramatically on well-vacuumed tables.

Normally, even when PostgreSQL uses an index, it still has to visit the heap to fetch the row. However, if all the columns required by the query are already stored in the index, PostgreSQL can return the result directly from the index without fetching the heap row. This is called an index-only scan. 

However, there’s one catch: indexes do not store visibility information. So PostgreSQL still needs to make sure the row is visible to the current transaction. To do that, it checks the visibility map. 

  • If the visibility map bit for the corresponding heap page is set, it means all rows on that page are visible to all transactions. In that case, PostgreSQL can safely return the data directly from the index, so no heap access is needed. 
  • If the visibility map bit is not set, PostgreSQL must fall back to fetching the heap page to verify visibility.

This is why index-only scans work much better on tables that are regularly vacuumed. Frequent VACUUM updates the visibility map, setting more pages to all-visible, which allows PostgreSQL to skip heap lookups more often.
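The effect is easy to observe in EXPLAIN ANALYZE, assuming a hypothetical index on users(email):

```sql
EXPLAIN (ANALYZE)
SELECT email FROM users WHERE email < 'b';

-- Index Only Scan using idx_users_email on users
--   Heap Fetches: 120
```

Heap Fetches counts the fallbacks to the heap for pages whose visibility map bit was not set; on a freshly vacuumed table it trends toward 0.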

Bitmap Scan

If too many rows match, a regular index scan becomes expensive because of excessive random I/O, but if not enough rows match to justify reading the entire table with a sequential scan, PostgreSQL chooses a third option: the Bitmap Index Scan.

It’s similar to an index scan, but instead of jumping to the heap immediately for every match, PostgreSQL splits the work into two phases.

Phase 1: Build the Bitmap (Bitmap Index Scan)

First, PostgreSQL scans the index. But instead of fetching rows right away, it collects all matching TIDs into an in-memory bitmap. The bitmap groups matches by heap page. Conceptually, it looks like this:

Page 3:    [0, 1, 0, 0, 1, 0, 1, 0]  ← rows 1, 4, 6 match

Page 8:    [1, 0, 0, 0, 0, 0, 0, 0]  ← row 0 matches

Page 472:  [0, 0, 1, 0, 0, 1, 0, 0]  ← rows 2, 5 match

Instead of visiting rows in random order, PostgreSQL now knows exactly which pages contain matches.

Phase 2: Read Heap Pages in Order (Bitmap Heap Scan)

Once the bitmap is built, PostgreSQL sorts the matching heap pages by block number and reads them in physical order. Each page is read at most once. Instead of jumping around like this:

Page 472 → Page 3 → Page 472 → Page 8801 → Page 3 → Page 8

It reads pages like this:

Page 3 → Page 8 → Page 472 → Page 8801

If multiple matching rows are on the same page, they are all processed together. This avoids repeatedly reading the same page from disk. Without this approach, a page might be loaded into shared buffers, evicted, and then loaded again later. Bitmap scans prevent that waste.

Exact vs Lossy Bitmap

When memory allows, PostgreSQL keeps an exact bitmap. That means it tracks the exact tuple positions inside each page. Later, it knows precisely which rows to fetch. But the bitmap is stored in memory, and it is limited by work_mem. If the bitmap grows too large, PostgreSQL switches to a lossy bitmap. In this mode, it no longer remembers exact tuple positions. It only remembers:

This page has at least one matching row.

Because it no longer knows exactly which rows match, PostgreSQL must re-check every row on those pages during the heap scan. In EXPLAIN ANALYZE, this shows up as a Recheck Cond line, and the Bitmap Heap Scan node reports Heap Blocks: exact=... lossy=... so you can see how many pages went lossy.

If we increase work_mem, PostgreSQL is more likely to use an exact bitmap and avoid this extra recheck work.

Combining Indexes

Bitmap scans can combine multiple indexes. For example:

SELECT * FROM orders
WHERE amount > 100 AND region = 'APAC';

PostgreSQL can:

  • Build one bitmap from the amount index
  • Build another bitmap from the region index
  • Combine them using a fast bitwise AND

Only pages present in both bitmaps are scanned. This is not possible with a regular index scan, which can only use one index at a time. Bitmap scans can also combine indexes with OR conditions using a bitwise OR. This ability to merge multiple index results is one of the main reasons bitmap scans exist.
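For the query above, the plan makes the combination explicit, assuming hypothetical indexes on amount and region:

```sql
EXPLAIN SELECT * FROM orders WHERE amount > 100 AND region = 'APAC';

-- Bitmap Heap Scan on orders
--   Recheck Cond: ((amount > 100) AND (region = 'APAC'::text))
--   ->  BitmapAnd
--         ->  Bitmap Index Scan on idx_orders_amount
--         ->  Bitmap Index Scan on idx_orders_region
```

Each Bitmap Index Scan produces a bitmap, and BitmapAnd intersects them before a single pass over the heap.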

Choosing the Right Plan

The PostgreSQL planner uses a cost model to pick between these access methods. Here is the rough decision landscape:

Selectivity             | Rows matched  | Likely plan          | Why
------------------------|---------------|----------------------|-----------------------------------------------
< 1%                    | Very few rows | Index Scan           | Random I/O cost acceptable at this scale
1–20%                   | Moderate rows | Bitmap Index Scan    | Sorting TIDs converts random I/O to sequential
> 20%                   | Most rows     | Seq Scan             | Index overhead exceeds benefit
Multi-column conditions | Any           | BitmapAnd / BitmapOr | Only mechanism that combines two indexes


The default random_page_cost = 4.0 was tuned for spinning disks. On SSDs, lower it to around 1.1. This tells the planner that random I/O is cheap, so it will favor index scans more aggressively. Understanding these internals lets us reason about query performance: why the planner made a decision, and how to guide it with random_page_cost, work_mem, or appropriate indexes.
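Since these are ordinary server settings, they can be adjusted and verified directly; a sketch for an SSD-backed cluster:

```sql
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
SHOW random_page_cost;
```

As with any planner knob, it is worth re-running EXPLAIN on a few representative queries before and after the change.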

The post How PostgreSQL Scans Your Data appeared first on Stormatics.

Fixing ORM Slowness by 80% with Strategic PostgreSQL Indexing
https://stormatics.tech/blogs/fixing-orm-slowness-by-80-with-strategic-postgresql-indexing
Fri, 20 Feb 2026 11:41:10 +0000

Modern applications heavily rely on ORMs (Object-Relational Mappers) for rapid development. While ORMs accelerate development, they often generate queries that are not fully optimized for database performance. In such environments, database engineers have limited control over query structure, leaving indexing and database tuning as the primary performance optimization tools.

In this article, I’ll share how we improved PostgreSQL performance dramatically for one of our enterprise customers by applying strategic indexing techniques, without modifying application queries.

The Challenge: High Read IOPS and Slow Query Performance

One of our customers experienced severe performance degradation, including:

  • High Read IOPS on the database
  • Slow page loads and delayed reports
  • Increasing database load during peak hours

After analyzing PostgreSQL configuration parameters, we confirmed that:

  • Memory parameters were properly tuned
  • Autovacuum was functioning correctly
  • Hardware resources were sufficient

However, performance issues persisted.

Since the application relied entirely on ORM-generated queries, rewriting queries was not an option. We needed a solution at the database level.

Root Cause Analysis: Excessive Sequential Scans

We analyzed PostgreSQL statistics using:

  • pg_stat_user_tables
  • pg_stat_user_indexes
  • pg_constraint
  • pg_index

We discovered extremely high sequential scans on large tables—some exceeding 41 million scans.

Sequential scans on large tables significantly increase disk I/O and slow query execution.

The primary reason: Missing indexes on foreign key columns and frequently filtered columns.

Strategy #1: Foreign Key Index Optimization

Why this matters

PostgreSQL does NOT automatically create indexes on foreign key columns.

Without these indexes, PostgreSQL must perform sequential scans when:

  • Joining tables
  • Filtering by foreign keys
  • Enforcing referential integrity

This is especially critical in ORM-based systems, where joins on foreign keys are extremely common.

How we identified missing FK indexes

We ran the following query to detect missing indexes on foreign key columns in large, frequently scanned tables:

WITH high_seq_tables AS (
SELECT
     st.relid,
     st.schemaname,
     st.relname AS table_name,
     st.seq_scan,
       pg_total_relation_size(st.relid) AS table_size_bytes
FROM pg_stat_user_tables st
WHERE st.seq_scan > 10000
   AND pg_total_relation_size(st.relid) >= 524288000
),
fk_columns AS (
SELECT
     con.conrelid,
     att.attname AS fk_column
FROM pg_constraint con
JOIN unnest(con.conkey) AS colnum(colnum) ON true
JOIN pg_attribute att
     ON att.attrelid = con.conrelid
    AND att.attnum = colnum.colnum
WHERE con.contype = 'f'
),
fk_index_check AS (
SELECT
     fk.conrelid,
     fk.fk_column,
     NOT EXISTS (
         SELECT 1
         FROM pg_index idx
         JOIN pg_attribute ia
             ON ia.attrelid = idx.indrelid
            AND ia.attnum = ANY(idx.indkey)
         WHERE idx.indrelid = fk.conrelid
           AND idx.indisvalid
           AND ia.attname = fk.fk_column
     ) AS is_missing_fk_index
FROM fk_columns fk
)
SELECT *
FROM fk_index_check
WHERE is_missing_fk_index = true;

Results after implementing FK indexes

After creating indexes on critical foreign key columns:

  • Sequential scans reduced by over 87%
  • Query response time improved by 60-80%
  • Disk read IOPS dropped significantly
  • Overall system responsiveness improved dramatically

This was the single most impactful optimization.

Strategy #2: Slow Query Driven Index Optimization

Instead of blindly indexing everything, we followed a targeted approach using slow query analysis.

Step 1: Enable Slow Query Logging

ALTER SYSTEM SET log_min_duration_statement = 420000;
SELECT pg_reload_conf();

This logs every statement that takes longer than 7 minutes (420,000 ms). Note that a session-level SET would only affect the current session; ALTER SYSTEM plus a configuration reload applies the threshold server-wide. You can adjust it based on your workload.

Step 2: Identify Query Patterns

From slow query logs, we identified common patterns:

Columns frequently used in:

  • WHERE clauses
  • JOIN conditions
  • GROUP BY operations
  • Aggregations (AVG, COUNT, SUM)

Step 3: Create Targeted Indexes

We created indexes on:

  • Foreign key columns
  • Join columns
  • Filter columns
  • Frequently aggregated columns

Example:

CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders_main(customer_id);

Index Usage Validation

After creating indexes, we verified usage via:

SELECT *
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

This helped ensure indexes were actually used and provided performance benefits.
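The flip side of this validation is spotting indexes that are never used at all, which cost write amplification and storage for no benefit. A sketch using the same view:

```sql
-- Indexes with zero scans since stats were last reset, largest first
SELECT schemaname,
       relname,
       indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Interpret the result with care: the counters reset with pg_stat_reset(), and some indexes exist to enforce uniqueness rather than to speed up lookups.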

Overall Performance Improvements

Metric                    | Improvement
--------------------------|----------------------
Sequential Scans          | ↓ 87%
Query Execution Time      | ↓ 60–80%
Disk Read IOPS            | ↓ 70%
Application Response Time | Dramatically improved

Key Lessons Learned

  • Always index foreign key columns in ORM-driven applications
  • Use PostgreSQL statistics views to identify missing indexes
  • Enable slow query logging to identify optimization opportunities
  • Create targeted indexes, not excessive indexes
  • Continuously monitor index usage and performance

Final Thoughts

When application-level optimization is limited due to ORM constraints, database-level indexing becomes the most powerful performance optimization tool.

Strategic indexing can dramatically improve PostgreSQL performance without modifying application code or upgrading hardware.

The post Fixing ORM Slowness by 80% with Strategic PostgreSQL Indexing appeared first on Stormatics.

PostgreSQL Materialized Views: When Caching Your Query Results Makes Sense (And When It Doesn’t)
https://stormatics.tech/blogs/postgresql-materialized-views-when-caching-your-query-results-makes-sense
Tue, 03 Feb 2026 09:17:40 +0000

The Pain and the Real Constraint

Your dashboard queries are timing out at 30 seconds. Your BI tool is showing spinners. Your users are refreshing the page, wondering if something’s broken.

You’ve indexed everything. You’ve tuned shared_buffers. You’ve rewritten the query three times. The problem isn’t bad SQL – it’s that you’re forcing PostgreSQL to aggregate, join, and scan millions of rows every single time someone opens that report.

Here’s a clear stance: repeated heavy computations are a design choice, not a badge of honour. If you’re running the same expensive calculation dozens of times a day, you’re choosing to do more work than necessary.

This post shows you how to turn one expensive query shape into a fast, indexed object with explicit freshness and operational control. Materialized views give you predictable reads when you’re willing to accept a refresh contract.

What a Materialized View Actually Is (and What It Is Not)

Definition in Plain Words

A materialized view is a physical relation that stores the result set of a query.

When you create one, PostgreSQL runs your query, writes the output to disk, and keeps it there until you tell it to refresh. That’s it. No magic. Just a snapshot you control.

Compare the Three Common Patterns

Let’s be precise about what you’re choosing:

View: Computed at read time, always current. PostgreSQL rewrites your query against the underlying tables every time. Zero staleness, full computation cost on every read.

Materialized view: Computed at refresh time, fast reads. You decide when to refresh. Reads are fast and predictable because they’re hitting stored data. Staleness is explicit and bounded by your refresh schedule.

Summary table: You own the update pipeline. Whether it’s ETL jobs, application code, or triggers—you’re writing the insert/update logic and managing incremental changes yourself.

Why the “Physical” Part Matters

Because it’s a physical relation, you can:

  • Index it like any table
  • Let the planner treat it like a table (predictable execution plans)
  • Pay for storage and refresh work in exchange for making reads fast and deterministic

You’re trading computation-on-read for computation-on-schedule. That’s the contract.

When Materialized Views Are a Strong Fit

The Best-Fit Workload Shapes

Materialized views work best when:

  1. Repeated reporting queries with stable patterns hit the same aggregations (BI dashboards, executive summaries, weekly rollups)
  2. Heavy joins and aggregations across large tables that don’t change second-by-second
  3. Precomputed metrics that are “fresh enough” on a schedule your business can accept

If your query shape is stable and your freshness requirement is measured in minutes or hours (not milliseconds), materialized views are worth evaluating.

A Concrete Motivating Example

Let’s use an e-commerce order revenue summary. You’re joining:

  • orders (10M rows)
  • order_items (40M rows)
  • products (500K rows)
  • customers (2M rows)

Your query aggregates revenue by product category, customer region, and week. It’s grouped, filtered by tenant, and sorted by revenue descending.

Baseline symptoms:

  • Execution time: 28 seconds
  • The query runs 40+ times per day (dashboard loads, exports, API calls)
  • Users complain, BI tool times out, someone opens a ticket

You know the query is expensive. The question is whether you want to keep paying that cost every single time.

Build It Step-by-Step (Copy/Paste SQL)

Start with the Baseline Query

Here’s the query you want to stop recomputing:

SELECT 
   p.category,
   c.region,
   DATE_TRUNC('week', o.order_date) AS week,
   COUNT(DISTINCT o.order_id) AS order_count,
   SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.tenant_id = 'acme_corp'
 AND o.order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY p.category, c.region, DATE_TRUNC('week', o.order_date)
ORDER BY total_revenue DESC;

Execution: 28 seconds. Buffers: scanning 80GB+ across four tables.

Create the Materialized View Safely

CREATE MATERIALIZED VIEW mv_order_revenue_summary AS
SELECT
   o.tenant_id,
   p.category,
   c.region,
   DATE_TRUNC('week', o.order_date) AS week,
   COUNT(DISTINCT o.order_id) AS order_count,
   SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY o.tenant_id, p.category, c.region, DATE_TRUNC('week', o.order_date)
WITH NO DATA;

Why WITH NO DATA?

It creates the structure without populating it immediately. This gives you control—you can add indexes first, then populate in a maintenance window. It’s a cleaner rollout.

Populate It

REFRESH MATERIALIZED VIEW mv_order_revenue_summary;

First refresh: 4.2 seconds. That’s the cost you’ll pay each time you refresh.

Index It Like a Production Object

Here’s where most performance wins happen. Your materialized view is a table—treat it like one.

-- Filter columns (tenant, date bucket)
CREATE INDEX idx_mv_revenue_tenant_week
ON mv_order_revenue_summary(tenant_id, week);

-- Common grouping dimensions
CREATE INDEX idx_mv_revenue_category
ON mv_order_revenue_summary(category);

CREATE INDEX idx_mv_revenue_region
ON mv_order_revenue_summary(region);

Critical guidance: Index design should match the read patterns of the MV consumers, not the base tables. Ask yourself: how will people query this snapshot?

Now your dashboard query becomes:

SELECT category, region, week, order_count, total_revenue
FROM mv_order_revenue_summary
WHERE tenant_id = 'acme_corp'
 AND week >= CURRENT_DATE - INTERVAL '90 days'
ORDER BY total_revenue DESC;

Execution: 180 milliseconds. Index scan, no joins, no aggregation.

You’ve turned a 28-second computation into a 180ms index lookup.

Refresh Strategies (and How to Choose)

Full Refresh (Simple, Predictable)

REFRESH MATERIALIZED VIEW mv_order_revenue_summary;

This rewrites the entire materialized view. While it’s running, reads are blocked. The view is locked until the refresh completes.

Operational notes:

  • Schedule off-peak if your refresh takes more than a few seconds
  • Treat it as a batch job with a runtime budget
  • Monitor duration as data grows

Concurrent Refresh (Keeps Reads Available)

REFRESH MATERIALIZED VIEW CONCURRENTLY mv_order_revenue_summary;

This builds the new snapshot in the background and swaps it in atomically. Reads stay available throughout.

Requirements:

  • A unique index that reflects the logical uniqueness of MV rows

For our example:

CREATE UNIQUE INDEX idx_mv_revenue_unique 
ON mv_order_revenue_summary(tenant_id, category, region, week);

Trade-offs (framed constructively):

  • Higher refresh overhead (PostgreSQL does more work to build and merge)
  • You need to design for uniqueness and accept slightly longer refresh times

I believe concurrent refresh is worth it when your materialized view serves user-facing queries and a 4-second lock would be visible.

Freshness as a Contract (What Teams That Do Well Always Define)

Define “fresh enough” in business terms:

  • Every 5 minutes for near-real-time dashboards
  • Hourly for internal reporting
  • Daily for executive summaries

Define operational SLOs:

  • Max refresh runtime: 10 seconds
  • Acceptable staleness window: up to 1 hour

Make staleness visible. Users should know when the data was last refreshed. We’ll cover how in the observability section.
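To make that contract checkable in application code, here is a minimal sketch of a staleness check against the SLOs above. The function name and threshold are illustrative, not part of PostgreSQL; `last_refresh_at` would come from the tracking approach covered in the observability section.

```python
from datetime import datetime, timedelta, timezone

# Illustrative staleness check. The SLO value mirrors the "up to 1 hour"
# example above; adjust to whatever your business contract says.
STALENESS_SLO = timedelta(hours=1)

def is_stale(last_refresh_at, now=None):
    """Return True when the snapshot has exceeded its staleness window."""
    now = now or datetime.now(timezone.utc)
    return (now - last_refresh_at) > STALENESS_SLO

now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(minutes=30), now))  # False: within the window
print(is_stale(now - timedelta(hours=2), now))     # True: contract breached
```

A check like this is what lets a dashboard show a "data as of …" banner instead of leaving users to guess.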

Scheduling Refreshes (Cron, pg_cron, K8s, Managed Cloud)

One Scheduler Owns Refresh

Avoid multiple sources triggering refresh. Choose one mechanism and stick with it.

Scheduling Options

OS cron + psql:

# /etc/cron.d/refresh-mv
0 * * * * postgres psql -d production -c "REFRESH MATERIALIZED VIEW CONCURRENTLY mv_order_revenue_summary;"

pg_cron (when available):

SELECT cron.schedule('refresh-revenue-mv', '0 * * * *', 
  $$REFRESH MATERIALIZED VIEW CONCURRENTLY mv_order_revenue_summary$$);

Kubernetes CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: refresh-mv-revenue
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: refresh
            image: postgres:16
            command:
            - psql
            - -c
            - "REFRESH MATERIALIZED VIEW CONCURRENTLY mv_order_revenue_summary;"
          restartPolicy: OnFailure

App-driven refresh (only when you control concurrency and backoff):

Use this sparingly. Application-triggered refreshes can lead to refresh storms if you’re not careful.
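If you do go the app-driven route, the sketch below shows one way to keep triggers from stampeding: a non-blocking lock plus a cooldown. This is an illustrative Python pattern, not a library API; `run_refresh` is a stand-in for whatever code actually executes the REFRESH statement.

```python
import threading
import time

class RefreshGuard:
    """At most one refresh at a time, and never more often than a cooldown."""

    def __init__(self, cooldown_s):
        self._lock = threading.Lock()
        self._cooldown_s = cooldown_s
        self._last_refresh = float("-inf")  # "never refreshed yet"

    def maybe_refresh(self, run_refresh):
        # Non-blocking: concurrent callers skip instead of queueing up.
        if not self._lock.acquire(blocking=False):
            return False
        try:
            if time.monotonic() - self._last_refresh < self._cooldown_s:
                return False  # too soon since the last refresh
            run_refresh()
            self._last_refresh = time.monotonic()
            return True
        finally:
            self._lock.release()

guard = RefreshGuard(cooldown_s=300)  # at most one refresh per 5 minutes
print(guard.maybe_refresh(lambda: None))  # True: first call runs
print(guard.maybe_refresh(lambda: None))  # False: inside the cooldown
```

Without a guard like this, every dashboard load can become a refresh trigger, and the refresh storm it causes is worse than the staleness it was trying to fix.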

Managed Cloud Realities (Practical Notes)

RDS, Azure Database for PostgreSQL, and Cloud SQL have different extension policies. pg_cron might not be available, or you might not have permissions to install it.

In that case, scheduling often moves to:

  • Cloud Scheduler (GCP)
  • EventBridge (AWS)
  • Azure Automation
  • Kubernetes CronJobs in the same environment

These work fine. The important part is having one source of truth for your refresh schedule.

Prevent Overlapping Refresh Runs

If your refresh takes 6 minutes and you schedule it every 5 minutes, you’ll have multiple refresh jobs competing.

Advisory lock pattern (conceptual):

DO $$
BEGIN
 IF pg_try_advisory_lock(12345) THEN
   REFRESH MATERIALIZED VIEW CONCURRENTLY mv_order_revenue_summary;
   PERFORM pg_advisory_unlock(12345);
 ELSE
   RAISE NOTICE 'Refresh already running, skipping';
 END IF;
END $$;

Refresh starts only if the lock is acquired, which avoids refresh storms during delays. One caveat: REFRESH MATERIALIZED VIEW CONCURRENTLY cannot run inside a transaction block, and a DO block is transactional, so this snippet illustrates the logic only. In practice, run the lock check and the refresh as separate statements from your scheduler script (the plain, non-concurrent variant does work inside a DO block).
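One practical detail: pg_try_advisory_lock takes a bigint key, and a hard-coded 12345 invites collisions once several jobs use the same trick. Deriving a stable key from the view name is one option; this helper is an illustrative sketch, not a PostgreSQL feature.

```python
import hashlib

def advisory_lock_key(name: str) -> int:
    """Derive a stable signed 64-bit key (bigint range) from an object name."""
    digest = hashlib.sha256(name.encode()).digest()
    # First 8 bytes of the hash, interpreted as a signed 64-bit integer.
    return int.from_bytes(digest[:8], "big", signed=True)

key = advisory_lock_key("mv_order_revenue_summary")
print(f"SELECT pg_try_advisory_lock({key});")
```

The same name always yields the same key, so every copy of the refresh job agrees on which lock to take without any shared configuration.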

Observability: Measure Refresh Cost and Staleness

Before/After Proof (What to Show in the Post)

Let’s put real numbers on this:

Baseline query time: 28 seconds
MV query time: 180 milliseconds
Refresh overhead: 4.2 seconds
Refresh cadence: Every hour

You’re doing 4.2 seconds of work every hour to save 28 seconds on each of 40+ queries per day. The math works.
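Assuming hourly refreshes (24 per day) and the measurements above, the arithmetic looks like this:

```python
# Back-of-the-envelope cost/benefit for the hourly-refresh example.
# All numbers come from the measurements quoted in this post.
REFRESH_COST_S = 4.2        # seconds per refresh
REFRESHES_PER_DAY = 24      # hourly cadence
BASELINE_QUERY_S = 28.0     # cost of recomputing on every read
MV_QUERY_S = 0.18           # cost of reading the snapshot
QUERIES_PER_DAY = 40        # conservative; the post says "40+"

daily_refresh_cost = REFRESH_COST_S * REFRESHES_PER_DAY            # ~100.8 s
daily_savings = (BASELINE_QUERY_S - MV_QUERY_S) * QUERIES_PER_DAY  # ~1112.8 s

print(f"refresh cost: {daily_refresh_cost:.1f}s/day")
print(f"read savings: {daily_savings:.1f}s/day")
print(f"net win:      {daily_savings - daily_refresh_cost:.1f}s/day")
```

Roughly 100 seconds of scheduled work buys back over 1,100 seconds of query time per day, before counting the reduced load on the rest of the system.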

EXPLAIN (ANALYZE, BUFFERS) for baseline:

GroupAggregate  (cost=2847392.18..2847395.32 rows=1 width=89) (actual time=27823.445..27823.451 rows=156 loops=1)
 Buffers: shared hit=9234 read=1847234
  ->  Sort  (cost=2847392.18..2847392.68 rows=200 width=57) (actual time=27811.234..27812.891 rows=8923456 loops=1)

After (querying the MV):

Index Scan using idx_mv_revenue_tenant_week on mv_order_revenue_summary  (cost=0.42..23.18 rows=156 width=89) (actual time=0.034..0.178 rows=156 loops=1)

  Buffers: shared hit=12

That’s the difference. You’ve moved the heavy lifting to a scheduled job.

Tracking “Last Refresh Time” (Correct Approach)

Important: pg_matviews does not store last_refresh. That column doesn’t exist.

Practical patterns:

  1. A small mv_refresh_log table updated by your refresh job:
CREATE TABLE mv_refresh_log (
   mv_name TEXT PRIMARY KEY,
   last_refresh_at TIMESTAMPTZ,
   refresh_duration_ms INTEGER
);

DO $$
DECLARE
  start_time TIMESTAMPTZ := clock_timestamp();
  end_time TIMESTAMPTZ;
  duration_ms INTEGER;
BEGIN
  -- Note: the CONCURRENTLY variant cannot run inside a transaction block,
  -- and a DO block is transactional, so use the plain REFRESH here (or do
  -- the timing in your scheduler script around the concurrent refresh).
  REFRESH MATERIALIZED VIEW mv_order_revenue_summary;
  end_time := clock_timestamp();
  duration_ms := EXTRACT(EPOCH FROM (end_time - start_time)) * 1000;

  INSERT INTO mv_refresh_log (mv_name, last_refresh_at, refresh_duration_ms)
  VALUES ('mv_order_revenue_summary', end_time, duration_ms)
  ON CONFLICT (mv_name)
  DO UPDATE SET last_refresh_at = EXCLUDED.last_refresh_at,
                refresh_duration_ms = EXCLUDED.refresh_duration_ms;
END $$;

Now you can expose freshness to users:

SELECT 
   mv_name,
   last_refresh_at,
   NOW() - last_refresh_at AS staleness,
   refresh_duration_ms
FROM mv_refresh_log
WHERE mv_name = 'mv_order_revenue_summary';
  2. Scheduler job history (pg_cron job run details, or platform logs)
  3. Logging refresh duration in PostgreSQL logs and aggregating in your observability stack

Watch the Impact on the Rest of the System

Monitor:

  • Refresh CPU and IO (does it spike? does it compete with writes?)
  • Temp file usage (large sorts/hashes during refresh can spill to disk)
  • Replica lag sensitivity (if refresh competes with write workload on the primary, replicas might fall behind)

Materialized view refreshes are queries. They use resources. Plan accordingly.

Performance Tuning That Consistently Pays Off

Indexes on the MV aligned to read patterns: We covered this. It’s the single biggest lever.

Ensure base tables stay healthy:

  • ANALYZE keeps statistics fresh
  • VACUUM and autovacuum prevent bloat
  • Bloated base tables make refresh slower

Reduce refresh work:

  • Simplify the MV query (do you really need all those joins?)
  • Pre-filter with partitions where relevant (if your base tables are partitioned by date, your refresh can scan fewer partitions)

Resource guardrails:

  • Schedule refresh during low-traffic windows
  • Understand sort and hash behavior in the refresh query (use EXPLAIN to spot large temp writes)

Trade-Offs (Framed as Design Decisions)

Every materialized view comes with trade-offs. Frame them as conscious decisions:

Freshness window: Staleness is a contract. You’re accepting data that’s up to X minutes or hours old. If you need second-by-second accuracy, you will get better outcomes with a different pattern—streaming aggregates, app-level cache with invalidation, or summary tables maintained incrementally.

Refresh cost and operational ownership: Someone has to own the refresh schedule, monitor it, and tune it as data grows.

Storage and backup footprint: You’re duplicating data. Factor this into disk capacity and backup windows.

Maintenance surface area: Another object to index, refresh, monitor, and document.

I believe the trade-offs are worth it when your workload matches the pattern. If it doesn’t, don’t force it.

Common Gotchas and Troubleshooting (Symptom → Cause → Fix)

MV Is Fast at First, Then Slows Down

Symptom: Queries against the MV start fast, then degrade over weeks.

Cause: Data growth + indexes not aligned to access patterns. As the MV grows, full scans become expensive.

Fix: Add indexes for filter columns and common joins. Run EXPLAIN ANALYZE on MV queries to confirm index usage.

Concurrent Refresh Fails

Symptom: REFRESH MATERIALIZED VIEW CONCURRENTLY errors out.

ERROR:  cannot refresh materialized view "public.mv_order_revenue_summary" concurrently

HINT:  Create a unique index with no WHERE clause on one or more columns of the materialized view.

Cause: Missing or incorrect unique index for logical uniqueness.

Fix: Identify the columns that make each row unique and create a unique index:

CREATE UNIQUE INDEX idx_mv_revenue_unique 

ON mv_order_revenue_summary(tenant_id, category, region, week);

Refresh Jobs Overlap

Symptom: Multiple refresh processes running at the same time, competing for resources.

Cause: Scheduler runs without a guardrail. Refresh duration exceeds the schedule interval.

Fix: Implement an advisory lock pattern (shown earlier) or increase the schedule interval.

Refresh Workload Disrupts Primary Workload

Symptom: Writes slow down during refresh. Replica lag spikes.

Cause: Refresh scheduled during peak hours or competing for IO/CPU.

Fix:

  • Move refresh to off-peak windows
  • Simplify the refresh query
  • Consider resource limits (statement timeout, work_mem tuning)

MV Results Surprise People

Symptom: Users report “wrong” data. Confusion about why numbers don’t match real-time queries.

Cause: Freshness contract not explicit and not visible.

Fix: Expose last_refresh_at in your application. Add a timestamp to the dashboard showing when data was last updated. Communicate the refresh schedule clearly.

Alternatives Worth Considering (and When They Win)

Normal views: Use when you have few reads and an always-current requirement. No storage overhead, no refresh jobs, but you pay computation cost on every read.

Summary tables + ETL: Use when incremental updates are feasible and you want full control over the update pipeline. More work to build, but you own the logic and can optimize for incremental changes.

Partitioning + indexing: Use when the real constraint is data layout and pruning. If your query scans too many partitions, materialized views won’t help. Fix the partitioning strategy first.

Cache layer (Redis, Memcached): Use when app-level latency goals dominate and you need sub-millisecond response times. Caches are great for key-value lookups, less so for complex aggregations.

Timescale continuous aggregates: Use when time-series rollups are the core pattern. Continuous aggregates in TimescaleDB handle incremental refresh automatically for time-bucketed data.

Pick the tool that fits the constraint. Materialized views are one option, not the only option.

Conclusion

Materialized views give you predictable reads, controlled computation, and clear freshness. You’re trading real-time accuracy for speed and resource efficiency – on purpose, with a contract.

When the pattern fits, the results are immediate: dashboard queries drop from 28 seconds to 180 milliseconds, users stop complaining, and your database does less work.

The operational commitment is real – you own the refresh schedule, the indexes, and the monitoring. But the alternative is running the same expensive query over and over, hoping the database can keep up.

The post PostgreSQL Materialized Views: When Caching Your Query Results Makes Sense (And When It Doesn’t) appeared first on Stormatics.

]]>
https://stormatics.tech/blogs/postgresql-materialized-views-when-caching-your-query-results-makes-sense/feed 0
Unlocking High-Performance PostgreSQL: Key Memory Optimizations https://stormatics.tech/blogs/unlocking-high-performance-postgresql-key-memory-optimizations https://stormatics.tech/blogs/unlocking-high-performance-postgresql-key-memory-optimizations#comments Thu, 29 Jan 2026 07:30:12 +0000 https://stormatics.tech/?p=28605 PostgreSQL can scale extremely well in production, but many deployments run on conservative defaults that are safe yet far from optimal. The crux of performance optimization is to understand what each setting really controls, how settings interact under concurrency, and how to verify impact with real metrics.

This guide walks through the two most important memory parameters:
- shared_buffers
- work_mem

The post Unlocking High-Performance PostgreSQL: Key Memory Optimizations appeared first on Stormatics.

]]>

PostgreSQL can scale extremely well in production, but many deployments run on conservative defaults that are safe yet far from optimal. The crux of performance optimization is to understand what each setting really controls, how settings interact under concurrency, and how to verify impact with real metrics.

This guide walks through the two most important memory parameters:

  • shared_buffers
  • work_mem 

shared_buffers

Let’s start with shared_buffers, because this is one of the most important concepts in PostgreSQL. When a client connects to PostgreSQL and asks for data, PostgreSQL does not read directly from disk and stream it back to the client. Instead, it first pulls the required data page into shared memory and then serves it from there. The same design applies to writes. When the client updates a row, PostgreSQL does not immediately write that change to disk. It loads the page into memory, updates it in RAM, and marks that page as dirty. Disk writes come later.

This design is intentional: reading and writing in memory are orders of magnitude faster than reading from or writing to disk, and it dramatically reduces random I/O overhead.

So what exactly is shared_buffers?

shared_buffers defines the size of the shared memory region that PostgreSQL uses as its internal buffer cache. All reads and writes go through it; disk interaction happens later, asynchronously, through background writing and checkpoints. shared_buffers is therefore the layer between the database processes and the disk.

Image Credits: https://shrturl.app/a2zfKi

By default, PostgreSQL sets shared_buffers to 128MB. That might be fine for local environments; however, it is not enough cache for real working sets, which means more disk reads, more I/O pressure, and less stable latency.

How do we size shared_buffers?

A common starting rule of thumb is:

If the server has more than 1GB RAM, start with 20–25% of total RAM on a dedicated PostgreSQL server and increase gradually if needed. Values above ~40% usually stop helping much. 

There’s a reason we don’t just set it to ‘as high as possible’. If you give PostgreSQL too much buffer cache, you can start competing with the OS page cache, and you can also increase the volume of dirty data that must be flushed during checkpoints, which can increase checkpoint pressure and write spikes.
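As a sketch of that rule of thumb (the helper and its cap are illustrative starting points, not official guidance):

```python
# Illustrative sizing helper for the rule above: start at 25% of RAM on a
# dedicated PostgreSQL server, and don't bother going much past ~40%.
def shared_buffers_mb(total_ram_gb, fraction=0.25):
    if total_ram_gb <= 1:
        return 128  # below ~1GB RAM, the 128MB default is a sane floor
    fraction = min(fraction, 0.40)  # beyond ~40%, gains usually flatten out
    return int(total_ram_gb * 1024 * fraction)

for ram in (1, 16, 64):
    print(f"{ram}GB RAM -> shared_buffers = {shared_buffers_mb(ram)}MB")
```

Whatever starting value you pick, validate it with the cache hit ratio checks below rather than treating the formula as the answer.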

One more important thing to remember is that shared_buffers is a postmaster-level parameter. That means PostgreSQL allocates it at startup, and changing it requires a server restart.

How do I know if my current value is good?

As database engineers, our job is to size shared_buffers correctly:

  • large enough to reduce disk reads
  • but not so large that it harms the OS cache or makes checkpoints heavier

Step 1: Look at the cache hit ratio

One simple way is to look at the cache hit ratio using pg_stat_database.

SELECT
  sum(blks_hit) / nullif(sum(blks_hit + blks_read), 0) AS cache_hit_ratio
FROM
  pg_stat_database;

If the cache hit ratio is close to 1, it means most reads are being served from memory, and this is generally what we want. If it’s low, it means PostgreSQL is doing more physical reads from disk, and that’s a signal to investigate.

Step 2: Verify it at the query level

To see whether a specific query is using cache, run:

EXPLAIN (ANALYZE, BUFFERS) 
SELECT …

In the output, look for:

  • buffer hits – served from shared_buffers
  • buffer reads – pulled from disk

If you run the same query again, the second run usually shows far more hits, because the pages are now already in shared_buffers.

Important Note

In large production workloads, not everything can or should fit in memory. You will see disk reads, and that’s normal. The goal isn’t that every read is a cache hit. The goal is:

  • Disk I/O shouldn’t be your bottleneck
  • Reads and writes should be smooth
  • Latency shouldn’t spike because the cache is too small or mis-sized

If you want deeper visibility into what is currently stored in shared_buffers or which tables are occupying memory, PostgreSQL gives you tools for that. Extensions like:

  • pg_buffercache
  • pageinspect

let you inspect shared buffers directly and understand memory usage patterns.

work_mem

After shared_buffers, the next memory parameter we need to focus on is work_mem.

And this is probably the most dangerous memory setting in PostgreSQL if you don’t fully understand how it works – not because it’s bad, but because it multiplies quietly. Many production outages caused by out-of-memory errors can be traced back to a misunderstanding of work_mem.

work_mem defines the limit or the maximum amount of memory allocated for executing operations such as:

  • Sorting, for operations like ORDER BY, DISTINCT, and GROUP BY
  • Hash joins, which build in-memory hash tables
  • Set operations like UNION, INTERSECT, and EXCEPT
  • Building the bitmap arrays used by bitmap scans

This parameter affects the efficiency of query execution and the overall performance of the database. It’s important to note that work_mem is allocated per operation, not per PostgreSQL session. This is a crucial detail, as a single SQL query can perform multiple sorting or join operations, each of which consumes its own area of memory. Some of these operations can be parallelized by PostgreSQL, and when that happens, each parallel worker can use up to work_mem per operation. If an operation runs sequentially, it can use up to work_mem. But if the same operation runs under a Gather node with, say, five parallel workers, then that single operation can consume:

5 × work_mem

This is exactly how databases run out of memory, even when the application hasn’t changed, because work_mem multiplies across:

  • Parallel workers
  • Multiple memory-intensive operations in a query
  • Concurrent queries running at the same time

This is why the most important thing to remember is that work_mem is per operation, and it can be used multiple times inside a single query and across many concurrent queries.
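To see how quickly this multiplies, here is the worst-case arithmetic with illustrative numbers:

```python
# Worst-case planner memory from the multiplication described above: each
# memory-hungry node may use up to work_mem in *every* parallel worker, and
# concurrent queries stack on top. All inputs here are illustrative.
def worst_case_mb(work_mem_mb, ops_per_query, workers_per_op, concurrent_queries):
    return work_mem_mb * ops_per_query * workers_per_op * concurrent_queries

# A modest-looking 64MB work_mem: 3 sorts/hashes per query,
# 5 parallel workers, 20 concurrent sessions.
print(worst_case_mb(64, 3, 5, 20), "MB")  # 19200 MB, i.e. ~18.75 GB
```

A setting that looks harmless per operation can, in the worst case, claim tens of gigabytes, which is exactly how "nothing changed" systems hit OOM.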

How do we tune work_mem?

By default, PostgreSQL sets work_mem to 4MB. For many simple OLTP workloads with high concurrency, this is actually fine. But for analytical or reporting queries, 4MB is often too small. 


If work_mem is too small, PostgreSQL starts spilling to disk, and you’ll typically see:

  • Temporary files being created
  • Sorts switching to disk-based algorithms
  • Increased disk I/O and latency spikes

If work_mem is too large, it can cause memory pressure or, worse, OOM kills.

We can measure if work_mem needs tuning using:

EXPLAIN (ANALYZE, BUFFERS)
SELECT …

In the output, look for:

  • Sort Method: external merge
  • temp file usage
  • temp reads and writes
  • disk usage reported in the plan

These signals show exactly which operations are memory-bound, and those are the places worth tuning. The good thing about work_mem is that it is not a postmaster-level parameter, so you can tune it:

  • per session
  • per role
  • per transaction

For systems with less than 64 GB of RAM, you can start with:

work_mem = 0.25% of total system RAM

On smaller systems, this translates to ~3 MB per GB of RAM. On smaller machines, concurrency and parallelism are usually limited, so this sizing is aggressive enough to reduce unnecessary disk spills without creating memory risk. On large machines, however, scaling work_mem linearly with RAM becomes dangerous: parallel queries, many concurrent sessions, and multiple operations per query multiply memory usage. So for larger systems (>64GB), we can switch to a more conservative formula:

work_mem = max(162MB, 0.125% of RAM + 80MB)

This approach does two important things:

  • It still allows work_mem to grow with system size
  • But it slows down the growth rate as RAM increases. In other words, it avoids giving every query an unnecessarily large memory area just because the machine is big.
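Both starting-point formulas can be expressed as one helper (the constants are exactly the ones quoted above; treat the output as a starting point, not a target):

```python
# The two starting-point formulas from this post, combined into one helper.
def work_mem_mb(total_ram_gb):
    ram_mb = total_ram_gb * 1024
    if total_ram_gb < 64:
        return 0.0025 * ram_mb               # 0.25% of RAM (~3 MB per GB)
    return max(162, 0.00125 * ram_mb + 80)   # conservative growth above 64GB

for ram in (8, 32, 64, 256):
    print(f"{ram:>4}GB RAM -> work_mem starting point: {work_mem_mb(ram):.0f}MB")
```

Note how the two branches meet: at 64GB the conservative formula also yields ~162MB, so there is no jump at the threshold, only a slower growth rate beyond it.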

Conclusion

Start with conservative, safe defaults. Measure behavior using real metrics like EXPLAIN (ANALYZE, BUFFERS) and system statistics. Tune selectively, especially for high-impact queries, instead of applying aggressive global changes.


The post Unlocking High-Performance PostgreSQL: Key Memory Optimizations appeared first on Stormatics.

]]>
https://stormatics.tech/blogs/unlocking-high-performance-postgresql-key-memory-optimizations/feed 2
Unused Indexes In PostgreSQL: Risks, Detection, And Safe Removal https://stormatics.tech/blogs/unused-indexes-in-postgresql-risks-detection-and-safe-removal https://stormatics.tech/blogs/unused-indexes-in-postgresql-risks-detection-and-safe-removal#respond Tue, 27 Jan 2026 09:57:39 +0000 https://stormatics.tech/?p=28597 Indexes exist to speed up data access. They allow PostgreSQL to avoid full table scans, significantly reducing query execution time for read-heavy workloads.

From real production experience, we have observed that well-designed, targeted indexes can improve query performance by 5× or more, especially on large transactional tables.
However, indexes are not free.
And in this blog, we are going to discuss what issues unused indexes can cause, and how to remove them safely from production systems with a rollback plan.

The post Unused Indexes In PostgreSQL: Risks, Detection, And Safe Removal appeared first on Stormatics.

]]>

Indexes exist to speed up data access. They allow PostgreSQL to avoid full table scans, significantly reducing query execution time for read-heavy workloads.

From real production experience, we have observed that well-designed, targeted indexes can improve query performance by 5× or more, especially on large transactional tables.

However, indexes are not free.

And in this blog, we are going to discuss what issues unused indexes can cause, and how to remove them safely from production systems with a rollback plan.

1. Why Unused Large Indexes Become a Long-Term Problem

Over time, unused indexes can silently degrade database performance. Below are some of the most common issues they cause in production systems.

1.1. Slower INSERT, UPDATE, And DELETE Operations

Every write operation must update all indexes on a table, including those that are never used by queries.

1.2. Increased Vacuum And Autovacuum Overhead

Indexes accumulate dead tuples just like tables. These must be vacuumed, increasing I/O usage and extending vacuum runtimes.

1.3. Longer Maintenance Windows

Operations such as VACUUM and REINDEX take longer as the number and size of indexes grow.

1.4. Disk Space Waste And Cache Pollution

Large unused indexes consume disk space and can evict useful data from shared buffers, reducing cache efficiency.

Because of these reasons, it is always recommended to periodically identify and safely remove unused indexes from production systems, but only through a controlled and well-validated process.

2. How To Safely Drop Unused Indexes In PostgreSQL

Below is a step-by-step, production-safe checklist that should be followed before dropping any index.

2.1. Check When System Statistics Were Last Reset

If statistics were reset recently, an index may appear unused even though it is actively required by workloads.

SELECT
  datname,
  stats_reset
FROM pg_stat_database
WHERE datname = current_database();

An older stats_reset timestamp (or NULL, meaning statistics were never reset) provides more confidence in index usage data.
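As an illustrative guard (the 30-day threshold is a judgment call, not a PostgreSQL rule), you might only trust zero-scan counts when statistics have been accumulating long enough to cover weekly and monthly workload cycles:

```python
from datetime import datetime, timedelta, timezone

def stats_trustworthy(stats_reset, now=None, min_age=timedelta(days=30)):
    """Only trust 'idx_scan = 0' once stats cover your workload cycles."""
    if stats_reset is None:        # never reset: full history is available
        return True
    now = now or datetime.now(timezone.utc)
    return (now - stats_reset) >= min_age

now = datetime(2026, 1, 31, tzinfo=timezone.utc)
print(stats_trustworthy(None, now))                      # True
print(stats_trustworthy(now - timedelta(days=45), now))  # True
print(stats_trustworthy(now - timedelta(days=3), now))   # False
```

If your workload includes quarterly or year-end jobs, widen the threshold accordingly; an index used once a quarter still looks "unused" for months at a time.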

2.2. Check Whether The Index Backs Any Constraint

A large index can appear unused in statistics, but must not be dropped if it enforces a PRIMARY, UNIQUE, or FOREIGN KEY constraint. 

PostgreSQL uses these indexes to guarantee data integrity and will not allow them to be dropped unless the constraint itself is explicitly removed.

SELECT
  i.relname AS index_name,
  c.conname AS constraint_name,
  c.contype AS constraint_type,
  c.conrelid::regclass AS table_name
FROM pg_constraint c
JOIN pg_class i ON i.oid = c.conindid
WHERE i.relname = '<IDX_NAME>';

If this query returns rows, the index cannot be dropped without removing the constraint it backs.

2.3. Check Index Usage Statistics

This confirms whether PostgreSQL’s query planner has used the index during query execution.

SELECT
  s.indexrelname AS index_name,
  s.relname AS table_name,
  s.idx_scan,
  s.idx_tup_read,
  s.idx_tup_fetch
FROM pg_stat_user_indexes s
WHERE s.indexrelname = '<IDX_NAME>';

All three counts (idx_scan, idx_tup_read, idx_tup_fetch) should be 0 before you treat the index as unused.

3. Rollback Preparation

Before dropping any index, always capture its definition so it can be recreated quickly if needed.

SELECT pg_get_indexdef('<IDX_NAME>'::regclass) AS create_index_sql;

Store this output as part of your rollback plan.

4. Drop The Index Safely

Using DROP INDEX CONCURRENTLY avoids blocking reads and writes on the table, making it safe for production environments.

DROP INDEX CONCURRENTLY <IDX_NAME>;

If performance issues are observed after dropping the index, the rollback plan can be used to recreate the index concurrently without impacting availability.

5. Final thoughts

Dropping unused indexes can deliver meaningful performance and maintenance benefits, but only when done carefully. 

Never rely on statistics alone; always validate constraints, understand workload patterns, and prepare a rollback plan.

In production systems, correctness and stability must always take priority over cleanup speed.

The post Unused Indexes In PostgreSQL: Risks, Detection, And Safe Removal appeared first on Stormatics.

]]>
https://stormatics.tech/blogs/unused-indexes-in-postgresql-risks-detection-and-safe-removal/feed 0
PostgreSQL on Kubernetes vs VMs: A Technical Decision Guide https://stormatics.tech/blogs/postgresql-on-kubernetes-vs-vms-a-technical-decision-guide https://stormatics.tech/blogs/postgresql-on-kubernetes-vs-vms-a-technical-decision-guide#respond Tue, 20 Jan 2026 11:01:24 +0000 https://stormatics.tech/?p=28483 If your organization is standardizing on Kubernetes, this question shows up fast:
“Should PostgreSQL run on Kubernetes too?”
The worst answers are the confident ones:
“Yes, because everything else is on Kubernetes.”
“No, because databases are special.”

The post PostgreSQL on Kubernetes vs VMs: A Technical Decision Guide appeared first on Stormatics.

]]>

If your organization is standardizing on Kubernetes, this question shows up fast:

“Should PostgreSQL run on Kubernetes too?”

The worst answers are the confident ones:

  • “Yes, because everything else is on Kubernetes.”
  • “No, because databases are special.”

Both are lazy. The right answer depends on what you’re optimizing for: delivery velocity, platform consistency, latency predictability, operational risk, compliance constraints, and, most importantly, who is on-call when things go sideways.

I have seen PostgreSQL run very well on Kubernetes. I’ve also seen teams pay a high “complexity tax” for benefits they never actually used. This post is an attempt to give you a technical evaluation you can use to make a decision that fits your environment.

Start with the real question: are you running a database, or building a database platform?

This is the cleanest framing I have found:

  • Running a database: You have a small number of production clusters that are business-critical. You want predictable performance, understandable failure modes, straightforward upgrades, and clean runbooks.
  • Building a database platform: You want self-service provisioning, standardized guardrails, GitOps workflows, multi-tenancy controls, and a repeatable API so teams can spin up PostgreSQL clusters without opening tickets.

Kubernetes shines in the second world. VMs shine in the first.

Yes, you can do either on either platform. But the default fit differs.

A neutral comparison model: 6 dimensions that actually matter

Here is a practical rubric you can use in architecture reviews.

If you want a quick decision shortcut:

If your main goal is self-service and standardization, Kubernetes is compelling. If your main goal is predictable performance and a lower operational surface area, VMs or bare metal are compelling.

What Kubernetes adds (and why it’s both good and risky)

Kubernetes wasn’t designed primarily for databases. It was designed for scheduling workloads, handling health checks, rolling updates, and service discovery. PostgreSQL can run well there, but you typically stack multiple control layers:

  • Stateful identity and scheduling
  • Persistent volumes
  • CSI/storage drivers
  • Operators for lifecycle management
  • Sidecars for backups/metrics/log shipping

That’s not inherently bad. It’s powerful. But each layer is another thing to understand, upgrade, monitor, and debug. There is also the ‘agony of choice’ when selecting the operator for lifecycle management. There are quite a few available, and none are perfect. 

The biggest Kubernetes “gotcha” for PostgreSQL isn’t that it doesn’t work. It’s that when something goes wrong, the failure analysis can shift from “what is Postgres doing?” to “which Kubernetes subsystem is influencing Postgres right now?”

A very common pattern: a performance incident that starts as “write latency spiked” turns out to be tied to eviction behavior, scheduling pressure, or storage-layer hiccups. Those are solvable problems, but only if you already have deep Kubernetes operational maturity.

What VMs give you (and what they don’t)

VMs are boring in the best way: fewer abstraction layers between PostgreSQL and the hardware.

That usually means:

  • More predictable latency (especially disk + network)
  • Easier kernel-level tuning (huge pages, I/O scheduler, NUMA considerations)
  • Simpler operational failure analysis (“the host is slow” is a real thing you can measure and act on)
  • More straightforward incident response for teams that already have VM/host tooling

But VM isn’t “free” either. The cost shows up in different places:

  • Slower provisioning and less self-service
  • More configuration drift risk (“snowflake servers”)
  • More manual day-2 operations unless you build good automation
  • Higher discipline required for patching, backups, and failover testing

The platform might be simpler; the process still needs maturity.

The performance reality: storage and network decide more than “K8s vs VM”

Most “Postgres on Kubernetes is slow” stories are really one of these:

  1. The storage class wasn’t suited for database workloads.
  2. CPU throttling or noisy neighbor effects were introduced through cgroups / limits / oversubscription.
  3. Network paths became less predictable (overlay, MTU issues, cross-zone routing).
  4. Failover / restart behavior wasn’t tested under real load.

Storage: the durability and jitter problem

PostgreSQL is very sensitive to storage behavior because it relies heavily on fsync semantics, WAL throughput, and predictable latency for sync writes. On bare metal or a well-provisioned VM, you can often get very stable performance by:

  • Using fast SSD/NVMe
  • Separating WAL and data volumes when appropriate
  • Benchmarking with fio and Postgres tools (pg_test_fsync) before you commit to architecture

On Kubernetes, you can do this too, but you must be intentional:

  • Prefer storage classes built for sustained IOPS and latency stability (not just “it supports PVCs”)
  • Validate snapshot/restore behavior end-to-end (because snapshots that exist but can’t restore correctly are theatre)
  • Consider dedicated node pools and careful volume placement if you’re chasing low jitter

Network: the “multi-region makes everything harder” lesson

Replication lag is a good example of why network matters more than platform ideology. In one benchmark study [1] (single-region vs multi-region), average replication lag in single-region was around a few milliseconds, while multi-region averaged tens of milliseconds with occasional spikes under load. The big takeaway: geography and network dominate lag behavior far more than whether you run inside a pod or on a VM.

So if your decision is driven by “we want multi-region active-active,” focus on replication architecture and network reality first. Kubernetes won’t save you from physics.
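Replication lag is also directly observable, whichever platform you pick. A minimal check on the primary (PostgreSQL 10+) breaks lag down by stage:

```sql
-- Run on the primary: per-standby lag broken down by stage.
-- write_lag / flush_lag / replay_lag are intervals; sustained growth
-- under load is the signal to watch, not a single snapshot.
SELECT application_name,
       client_addr,
       state,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
```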

Reliability and HA: Kubernetes gives you rescheduling, not correctness

A controversial statement that’s still true:

Kubernetes gives you rescheduling. PostgreSQL needs correctness.

If a Postgres pod dies, Kubernetes will restart it. Great. But high availability for PostgreSQL is about:

  • avoiding split brain
  • promoting the right node at the right time
  • fencing the old primary
  • ensuring replicas are consistent
  • ensuring client traffic shifts cleanly
  • ensuring backups and restore paths are proven

Kubernetes can help you automate that with mature operators. VMs can help you automate it with mature HA tooling (Patroni/repmgr + a DCS + load balancers, etc.). In both cases, correctness comes from your HA design, your fencing strategy, and your tests, not from the platform’s marketing.
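As a small illustration of "correctness over rescheduling", the role of each node is something you can script and alert on. This is a monitoring sketch only, not a substitute for proper fencing:

```sql
-- Run on every node in the cluster; exactly one should report 'primary'.
-- Two nodes reporting 'primary' at once is the split-brain condition
-- your fencing strategy must make impossible.
SELECT CASE WHEN pg_is_in_recovery() THEN 'replica' ELSE 'primary' END AS role,
       pg_last_wal_replay_lsn() AS last_replayed_lsn;  -- NULL on the primary
```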

When Kubernetes is a strong fit for PostgreSQL

Kubernetes becomes a very rational choice when:

1. You already run a mature Kubernetes platform

  • You have stable storage classes
  • You have strong observability
  • You have SREs who understand scheduling, disruption, and capacity planning

2. You want an internal “Postgres-as-a-service” model

  • Developers request databases via an API or self-service portal and get guardrails by default
  • Standardized backups, monitoring, parameter baselines, and security policies

3. You need many isolated Postgres clusters

  • Multi-tenant environments where per-tenant isolation is valuable
  • Frequent creation/destruction of clusters (CI, preview environments, ephemeral staging)

4. Your org operates with GitOps discipline

  • Declarative config changes
  • Reviewable diffs
  • Automated drift detection

In these cases, the platform benefits can outweigh the complexity, because you’re actually using the platform benefits.

When VMs are a stronger fit

VMs tend to be the better choice when:

1. Your Postgres cluster is “crown jewel” infrastructure

  • Latency-sensitive OLTP
  • Predictable I/O behavior matters more than provisioning speed

2. You don’t have Kubernetes specialists on-call

  • The fastest path to reliability is fewer moving parts, not more automation

3. You’re running a small number of large databases

  • Dedicated instances, tuned for workload
  • Scaling is mostly vertical and carefully planned

4. You need tight control over kernel + host settings

  • NUMA behavior, huge pages, I/O scheduling, direct-attached NVMe, etc.

If you’re in this world, “boring infrastructure” is a feature.

Two reference architectures you can copy

Option A: Kubernetes with an operator (platform-oriented)

Key design choices:

  • Use a mature Postgres operator for day-2 operations (backups, failover, upgrades)
  • Use dedicated node pools for Postgres
  • Use pod anti-affinity so replicas land on different nodes
  • Use PodDisruptionBudgets so maintenance doesn’t take you down
  • Keep backups off-cluster (object storage) and run restore drills

And your operator-managed cluster spec should include:

  • explicit resource requests
  • storage class selection
  • monitoring enablement
  • backup configuration
  • replication settings

Option B: VMs with Patroni (database-runbook oriented)

Key design choices:

  • 3-node cluster (1 primary, 2 replicas)
  • Patroni for HA with a DCS (etcd/Consul)
  • HAProxy for routing writes to primary and reads to replicas (optional)
  • PgBouncer for connection pooling
  • pgBackRest (or similar) for backups and PITR
  • Monitoring stack: node metrics + Postgres metrics + log analysis

This model is widely understood, auditable, and tends to fail in more predictable ways.

Common gotchas (the ones that create 2am incidents)

Kubernetes gotchas

1. CPU limits causing throttling

You can meet “CPU request” but still get throttled under burst if limits are too tight.

2. Pod evictions during load

Especially if PDBs, priorities, and eviction policies aren’t designed for stateful workloads.

3. Storage that looks fast on paper but has latency spikes

Sustained performance is what matters, not peak IOPS marketing.

4. Backups that exist but restores that fail

Test restores on a schedule as a drill, not during an incident.

5. Operator upgrades as a hidden dependency

Your database lifecycle now depends on the operator lifecycle.

VM gotchas

1. Unvalidated failover

You “have HA” but haven’t practiced it under load with real application behavior.

2. Backup confidence without restore drills

The only backup that matters is the one you restored successfully.

3. Configuration drift

Two replicas that aren’t actually identical are a slow-motion outage.

4. Noisy neighbor on shared hypervisors

“It’s on a VM” doesn’t mean you own the underlying contention story.

5. OS patching and reboots without a runbook

Routine maintenance becomes risky without clear procedures.

The punchline: choose the platform that matches your org’s operating model

My take is simple:

  • Kubernetes is excellent when you’re building a database platform.
  • VMs are excellent when you’re running a database.

Both can be production-grade. Both can be disasters. The difference is whether your organization is set up to operate the platform you choose.

If you want one practical recommendation that avoids regret, this is it:

Run dev/test Postgres on Kubernetes if it helps delivery speed. Run production Postgres where you can guarantee predictable storage, clear failure modes, and strong operational ownership. That might be Kubernetes, or it might not.

Related 

[1] Benchmark Study on Replication Lag in PostgreSQL using Single Region and Multi-Region Architectures

[2] Is Your PostgreSQL Deployment Production Grade?

[3] From Downtime to Reliability: How Stormatics Solved High Availability Issues for a Middle Eastern Government’s Critical Cybersecurity Operations 

[4] Clustering in PostgreSQL: Because One Database Server is Never Enough (and neither is two)

[5] Database in Kubernetes: Is that a good idea?

[6] Databases on K8s — Really? [Part 1] [Part 2] [Part 3] [Part 4]

Frequently Asked Questions (FAQs)

Q. What defines a PostgreSQL Database Cluster and its physical structure?

A cluster is a collection of databases managed by a single server instance, rooted in the base directory. Each database resides in a subdirectory named after its OID, containing specific files for its tables and indexes.

Q. How does PostgreSQL map a table to its file on disk?

PostgreSQL uses a numeric identifier called relfilenode stored in pg_class to point to the actual disk file. This value usually starts matching the OID but changes during operations like TRUNCATE or REINDEX to reference new physical files.
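You can watch this mapping yourself (my_table is a placeholder; re-run after a TRUNCATE and the relfilenode changes while the OID stays fixed):

```sql
-- relfilenode vs. OID, plus the actual file path under the data directory.
SELECT oid,
       relfilenode,
       pg_relation_filepath(oid) AS file_path
FROM pg_class
WHERE relname = 'my_table';
```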

Q. How does PostgreSQL store tables larger than 1GB?

To manage large relations efficiently, PostgreSQL splits tables and indexes into separate files called segments once they exceed 1GB. This prevents single files from becoming unwieldy for the operating system while accommodating data growth.

Q. What is the Visibility Map used for?

The Visibility Map tracks pages that contain no dead tuples, indicating they are visible to all transactions. This allows the VACUUM process to skip stable pages entirely and focus only on areas requiring cleanup, saving significant I/O resources.

Q. What are tablespaces and why use them?

Tablespaces allow you to store database objects on physical disks separate from the main data directory. This enables performance tuning by isolating high-traffic tables on faster storage hardware while distributing I/O load across multiple devices.

The post PostgreSQL on Kubernetes vs VMs: A Technical Decision Guide appeared first on Stormatics.

]]>
https://stormatics.tech/blogs/postgresql-on-kubernetes-vs-vms-a-technical-decision-guide/feed 0
PostgreSQL Column Limits https://stormatics.tech/blogs/postgresql-column-limits https://stormatics.tech/blogs/postgresql-column-limits#respond Tue, 23 Dec 2025 08:03:36 +0000 https://stormatics.tech/?p=27709 If you’ve ever had a deployment fail with “tables can have at most 1600 columns”, you already know this isn’t an academic limit. It shows up at the worst time: during a release, during a migration, or right when a customer escalation is already in flight.
But here’s the more common reality: most teams never hit 1,600 columns; they hit the consequences of wide tables first.

The post PostgreSQL Column Limits appeared first on Stormatics.

]]>

The 1,600‑column ceiling, and the real production problems that show up long before it

If you’ve ever had a deployment fail with “tables can have at most 1600 columns”, you already know this isn’t an academic limit. It shows up at the worst time: during a release, during a migration, or right when a customer escalation is already in flight.

But here’s the more common reality: most teams never hit 1,600 columns; they hit the consequences of wide tables first:

  • Query latency creeps up (more I/O per row, less cache efficiency)
  • WAL volume and replication lag increase (updates write new row versions)
  • Backups get bigger, restores get slower
  • Schema changes become scarier (locks + rewrites + “what else will break?”)

This post explains what PostgreSQL’s column limits really are, how to spot risk early, and what the limits imply for your schema design and operations.

“We didn’t design a 1,600‑column table, but PostgreSQL says we did”

This problem tends to arrive in one of three ways:

1. The table got wide naturally

  • You add “just one more feature flag” every sprint
  • You support more integrations, more optional attributes, more “customer‑specific” fields
  • Your ORM makes it easy to keep shoving columns into a single “God” table

2. You dropped columns, so you think you’re safe
You look at \d my_table and see 400 columns. But PostgreSQL refuses to add column #401 because dropped columns still count toward the limit. That’s not folklore. It’s documented behavior. (Ref: https://www.postgresql.org/docs/current/limits.html)

3. You didn’t hit the table limit, you hit the query limit
Even if no single table has 1,600 columns, a SELECT * across several wide tables (or a wide view) can hit the result set column limit (1,664) and fail in surprising ways. ORM-generated queries are famous for this. (Ref: https://www.postgresql.org/docs/current/limits.html)

And when you are accountable for uptime and customer SLAs, or for scaling without a dedicated PostgreSQL team, this results in either firefighting or expensive “just scale the instance” decisions.
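The dropped-column trap in #2 above is easy to confirm directly (my_table is a placeholder):

```sql
-- Dropped columns stay in pg_attribute with attisdropped = true and names
-- like "........pg.dropped.5........"; they still count toward the
-- 1,600-column limit until the table is rewritten.
SELECT attnum, attname, attisdropped
FROM pg_attribute
WHERE attrelid = 'my_table'::regclass
  AND attnum > 0
ORDER BY attnum;
```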

What PostgreSQL actually limits, and why it matters

1) The hard limits you should know

From PostgreSQL’s own limits table: (Ref: https://www.postgresql.org/docs/current/limits.html)

  • Columns per table: 1,600
  • Columns in a result set: 1,664 
  • Columns per index: 32
  • Identifier length: 63 bytes
  • Field size: 1 GB

The “hidden” piece that bites wide tables is this:

PostgreSQL stores rows in fixed-size pages (commonly 8 kB) and does not allow a row (tuple) to span multiple pages. Large values can be moved out-of-line via TOAST, but there’s still per-column overhead and an in-row pointer cost. (Ref: https://www.postgresql.org/docs/current/storage-toast.html)

That’s why PostgreSQL documents the 1,600 column limit with a blunt footnote: even if 1,600 columns are allowed, your row might not fit, depending on data types and overhead.

2) Why “wide tables” hurt before you hit the limit

Even with far fewer than 1,600 columns, wide tables create predictable production pain:

  • Fewer rows per page → more pages read
    Wider tuples reduce tuple density. That means more heap pages for the same number of rows, which raises I/O and buffer churn.
  • Updates become more expensive
    PostgreSQL uses MVCC, so updates create new tuple versions. Wider rows can mean more WAL, more vacuum work, and more bloat risk (especially if HOT updates can’t happen due to index coverage).
  • Indexes get harder
    You can’t just “index everything.” You’re limited to 32 columns per index, and wide tables tempt teams into massive composite indexes that don’t age well.
  • Schema evolution gets fragile
    Add/drop cycles plus ORMs can push you toward limits. And dropped columns don’t disappear the way many people expect.

3) Audit your schema in 10 minutes

Run these in psql (or from your app’s SQL console). They’re safe and read-only.

A) Find your widest tables (by column count)
SELECT
 n.nspname AS schema,
 c.relname AS table,
 count(*) FILTER (WHERE a.attnum > 0 AND NOT a.attisdropped) AS live_columns,
 count(*) FILTER (WHERE a.attisdropped) AS dropped_columns,
 count(*) FILTER (WHERE a.attnum > 0) AS total_attnums
FROM pg_attribute a
JOIN pg_class c      ON c.oid = a.attrelid
JOIN pg_namespace n  ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','p')   -- r=table, p=partitioned table
 AND n.nspname NOT IN ('pg_catalog','information_schema')
GROUP BY 1,2
ORDER BY total_attnums DESC
LIMIT 50;

What you’re watching for:

  • High total_attnums even if live_columns looks fine (this is the dropped-column trap).
  • Tables drifting past 200–400 columns (a smell worth reviewing), even if you’ll never approach 1,600.

B) Identify “mostly NULL” columns (classic wide-table driver)
SELECT
 attname,
 null_frac,
 avg_width
FROM pg_stats
WHERE schemaname = 'public'
 AND tablename  = 'your_table'
ORDER BY null_frac DESC, avg_width DESC
LIMIT 30;

If you see lots of columns with null_frac close to 1.0, you’re probably paying storage + maintenance overhead for attributes almost no rows use.

C) Get a rough “row width” estimate (planner stats)
SELECT
 relname,
 n_live_tup,
 n_dead_tup,
 pg_size_pretty(pg_relation_size(relid)) AS heap_size
FROM pg_stat_user_tables
ORDER BY pg_relation_size(relid) DESC
LIMIT 20;

This doesn’t directly give row width, but it helps you correlate “wide table” candidates with bloat risk and maintenance cost.
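If you want a rough actual average tuple width rather than a size proxy, measuring the row value itself works (your_table is a placeholder; the sample keeps it cheap on big tables, and you can drop it for small ones):

```sql
-- Approximate average on-disk row size over a ~1% block sample.
-- pg_column_size on the whole row returns the composite datum size,
-- so treat this as an estimate, not an exact heap tuple length.
SELECT avg(pg_column_size(t.*)) AS avg_row_bytes
FROM your_table AS t TABLESAMPLE SYSTEM (1);
```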

Fix patterns that work (and how to choose)

If you’re close to the limit, or simply feeling the pain of wide tables, there are a few fixes that scale cleanly.

Pattern 1: Vertical partitioning (hot/cold split)

This is the most common “production-friendly” fix.

Keep the hot, frequently used columns in the main table.
Move rarely used or optional columns into a 1:1 side table.

CREATE TABLE orders (
 id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
 customer_id  bigint NOT NULL,
 status       text NOT NULL,
 created_at   timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE orders_ext (
 id           bigint PRIMARY KEY REFERENCES orders(id) ON DELETE CASCADE,
 marketing    jsonb,
 notes        text,
 metadata     jsonb
);

Why this works:

  • Most queries touch the hot table only → fewer pages, better cache behavior
  • You still keep relational integrity where it matters
  • The app can join only when needed

A small trick that helps adoption: create a view for legacy queries:

CREATE VIEW orders_v AS
SELECT o.*, e.marketing, e.notes, e.metadata
 FROM orders o
  LEFT JOIN orders_ext e USING (id);

Pattern 2: Use JSONB for sparse attributes (with clear guardrails)

If attributes are truly dynamic or customer-specific, JSONB can be a good fit.

  • Use JSONB for “sparse” / evolving fields
  • Keep core relational fields as proper columns
  • Add GIN indexes only where query patterns justify them 

Example:

ALTER TABLE orders_ext
 ADD COLUMN attrs jsonb;
CREATE INDEX orders_ext_attrs_gin
  ON orders_ext USING gin (attrs);
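The GIN index only pays off when queries use operators it supports, such as containment. A hypothetical lookup (the key/value pair is illustrative):

```sql
-- @> (containment) is answerable from the GIN index on attrs;
-- check the plan to confirm a Bitmap Index Scan is actually used.
EXPLAIN (ANALYZE)
SELECT id
FROM orders_ext
WHERE attrs @> '{"plan": "enterprise"}'::jsonb;
```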

Pattern 3: Stop doing “column per X” models

If you’re doing column-per-tenant, column-per-question, column-per-feature-flag-at-scale… that’s the design that eventually breaks.

A safer relational shape is usually:

  • one row per entity
  • one row per attribute/value (only if truly needed)
  • or JSONB for the long tail
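Where the attribute/value route is genuinely needed, the shape is a narrow side table rather than ever-more columns. A minimal sketch (table and column names are illustrative, keyed to the earlier orders example):

```sql
-- One row per (entity, attribute) instead of one column per attribute.
-- The table stays narrow no matter how many attribute kinds appear later.
CREATE TABLE entity_attributes (
  entity_id  bigint NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
  attr_key   text   NOT NULL,
  attr_value text,
  PRIMARY KEY (entity_id, attr_key)
);
```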

Pattern 4: Fix the dropped-column problem (requires a rewrite)

If you are blocked because dropped columns count toward the limit, the real fix is a table rewrite.

The common operational approach:

  1. Create a new table with only the columns you actually want
  2. Backfill in batches
  3. Dual-write (trigger or application) during migration window
  4. Swap over (rename tables or switch a view)

This is also where careful ALTER TABLE planning matters. PostgreSQL improved add-column behavior (constant defaults don’t rewrite the table from v11 onward), but many schema operations still take strong locks and need planning. (Ref: https://www.postgresql.org/docs/current/ddl-alter.html)
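Steps 1, 2, and 4 above can be sketched in SQL (my_table is a placeholder; step 3, dual-write, lives in a trigger or the application, and batch boundaries depend on your key distribution):

```sql
-- 1) New table with only the live columns; LIKE copies no dropped-column baggage.
CREATE TABLE my_table_new
  (LIKE my_table INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);

-- 2) Backfill in batches (repeat with advancing id ranges).
INSERT INTO my_table_new
SELECT * FROM my_table
WHERE id BETWEEN 1 AND 100000
ON CONFLICT DO NOTHING;

-- 4) Swap under a short lock once backfill and dual-write have converged.
BEGIN;
ALTER TABLE my_table     RENAME TO my_table_old;
ALTER TABLE my_table_new RENAME TO my_table;
COMMIT;
```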

Implications

SLA risk and infrastructure spend

If you care about SLA risk and cloud spend, wide tables translate directly to:

  • Higher I/O → bigger instances as the “quick fix”
  • Replication/backup windows stretching → higher incident risk
  • Surprise deployment failures due to limits → escalations

This is a predictable cost and reliability leak. Fixing schema width reduces both spend and incident rate.

Scaling without building a PostgreSQL team

If you care about scaling without building a PostgreSQL team, wide tables can be a governance issue:

  • ORMs can silently create schema debt
  • “Just add columns” compounds, until it becomes a production constraint
  • Design choices today become migration pain later

Put a simple guardrail in place now (schema audit + thresholds). Avoid a high-risk rewrite later.

Comparison to other DBMS

If you care about capability parity + safe operations, the key reassurance is:

You can manage this like any enterprise RDBMS: know the limits, design for evolution, and plan rewrites safely.

Reactive firefighting

If you care about not getting trapped in reactive firefighting, wide tables are a reliability tax:

  • More vacuum pressure
  • More bloat
  • Slower root-cause cycles
  • Higher blast radius for schema change

A quick, reproducible lab so that you can see the impact yourself

If you want something concrete, run this in a dev database to see how width changes storage and plans.

1) Create a narrow table and a wide table

DROP TABLE IF EXISTS t_narrow;
DROP TABLE IF EXISTS t_wide;
CREATE TABLE t_narrow (
 id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
 created_at timestamptz NOT NULL DEFAULT now(),
 payload text
);
CREATE TABLE t_wide (LIKE t_narrow INCLUDING ALL);
DO $$
DECLARE
 i int;
BEGIN
 FOR i IN 1..200 LOOP
   EXECUTE format('ALTER TABLE t_wide ADD COLUMN c%s text;', lpad(i::text, 3, '0'));
 END LOOP;
END $$;

2) Insert some data (lots of NULLs in the wide table)

INSERT INTO t_narrow(payload)
SELECT repeat('x', 200)
FROM generate_series(1, 200000);
INSERT INTO t_wide(payload)
SELECT repeat('x', 200)
FROM generate_series(1, 200000);

3) Compare sizes

SELECT
 't_narrow' AS table,
 pg_size_pretty(pg_relation_size('t_narrow')) AS heap
UNION ALL
SELECT
  't_wide',
  pg_size_pretty(pg_relation_size('t_wide'));

Then run:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t_narrow WHERE id = 150000;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t_wide WHERE id = 150000;

You’re watching for:

  • More buffer hits/reads for the wide table
  • More page churn over time
  • “It’s NULL so it’s free” turning out to be false in practice

And if you want the underlying “why,” PostgreSQL’s own docs spell out the constraints: rows must fit on a single page, TOAST stores large values out-of-line with an in-row pointer, and dropped columns still count toward the column limit.

Troubleshooting: common gotchas (and quick fixes)

  1. “But we only have 300 columns.”
    Check dropped columns (attisdropped), they still count.
  2. “We didn’t hit 1,600 columns, but the query fails.”
    You may be hitting the 1,664 result-column limit (wide view / ORM SELECT * join). Fix by selecting only needed columns or returning a structured payload (JSON) for the long tail.
  3. “Inserts fail with row-too-big errors.”
    Remember: tuples can’t span pages. TOAST helps with large varlena values, but fixed-size columns and in-row overhead still matter. 
  4. “We’ll just add a massive composite index.”
    You’re capped at 32 columns per index by default, and very wide indexes are operationally expensive anyway. 
  5. “We can fix it with ALTER TABLE in production.”
    Some ALTER TABLE operations are fast now (constant defaults since v11), but locks still matter. Plan migrations with expand/contract patterns to keep production stable. 

Frequently Asked Questions (FAQs)

Q. Why does PostgreSQL reject adding new columns even though my table has far fewer than 1,600 visible columns?

PostgreSQL internally counts dropped columns toward the maximum limit because they remain in the system catalog until a table rewrite occurs. You are likely hitting the ceiling because your table has accumulated too many “zombie” columns from previous schema changes that are invisible to standard inspection commands.

Q. How do wide tables hurt performance before any limit is reached?

Wide tables increase I/O by reducing the number of rows that fit on a single memory page, causing cache inefficiency and significantly higher read latency. They also inflate WAL volume during updates and complicate maintenance tasks like vacuuming and backups due to the increased data footprint per row.

Q. What is the recommended fix for a table approaching the column limit?

Implement vertical partitioning by moving core, frequently accessed columns to a main table and offloading sparse or optional attributes to a 1:1 side table or a JSONB column. This improves cache locality for critical queries while still retaining the flexibility to handle dynamic customer attributes without schema bloat.

Q. How do I reclaim the column slots used by dropped columns?

The only way to reset the internal column counter is to rewrite the table completely rather than just dropping columns. The standard approach involves creating a new table with the desired clean schema, backfilling data in batches, and performing a table swap, which clears out the hidden columns permanently.

Q. Why do queries fail even when no single table is near 1,600 columns?

PostgreSQL enforces a separate hard limit of 1,664 columns for the final result set of any query or view. If your application or ORM generates broad queries across multiple wide tables, the combined column count of the joined output can easily breach this threshold and cause immediate failure.

The post PostgreSQL Column Limits appeared first on Stormatics.

]]>
https://stormatics.tech/blogs/postgresql-column-limits/feed 0
The Road to Deploy a Production-Grade, Highly Available System with Open-Source Tools https://stormatics.tech/blogs/deploy-production-grade-highly-available-system-with-open-source-tools https://stormatics.tech/blogs/deploy-production-grade-highly-available-system-with-open-source-tools#respond Wed, 17 Dec 2025 10:48:58 +0000 https://stormatics.tech/?p=27663 Everyone wants high availability, and that’s completely understandable. When an app goes down, users get frustrated, business stops, and pressure builds.

But here’s the challenge: high availability often feels like a big monster. Many people think, If I need to set up high availability, I must master every tool involved. And there’s another common belief too: Open-source tools are not enough for real HA, so I must buy paid tools.

The post The Road to Deploy a Production-Grade, Highly Available System with Open-Source Tools appeared first on Stormatics.

]]>

Everyone wants high availability, and that’s completely understandable. When an app goes down, users get frustrated, business stops, and pressure builds.

But here’s the challenge: high availability often feels like a big monster. Many people think, If I need to set up high availability, I must master every tool involved. And there’s another common belief too: Open-source tools are not enough for real HA, so I must buy paid tools.

These assumptions make high availability seem far more complex than it really is, and in this series, we are going to address it.

This Is a 2-Part Series.

  • Part 1 (this one): We will lay the foundation by answering the most important questions you should consider before going hands-on with HA systems.
  • Part 2: We will go fully hands-on. I will walk through the architecture diagram, the tool stack, and provide the exact commands and step-by-step instructions to deploy the cluster based on your requirements.

The “Number of Nines”, RTO, and RPO

These are the foundations of a high-availability cluster. If you understand them and answer them clearly, you are already very close to building your HA setup.

Imagine you have a main site (your primary system). Things are working fine. Life is good. But one day, a disaster happens, maybe a server fails, a region goes down, or your database crashes.

At that moment, three questions decide everything.

1) How much downtime can you accept?

This is where the number of nines comes in (like 99.9% uptime, 99.99%, and so on). More nines usually mean less downtime, but also more effort and architectural cost.

Here’s a simple table to help you choose the right architecture based on your needs.

Target uptime (“nines”), the downtime it allows per year, a setup that usually fits, and what you must be ready for:

  • 99% (~3.65 days/year): single node + solid backups + tested restore. Backups and restore drills matter more than fancy tooling.
  • 99.9% (~8.8 hours/year): 2-node setup + proper witness node (same region). Witness helps avoid split-brain and supports clean failover decisions inside one region.
  • 99.99% (~52.6 minutes/year): multi-region (2 regions). To reach 4 nines, you usually need to survive a full region failure, so you move beyond single-region design.
  • 99.999% (~5.3 minutes/year): active-active / multi-master style setup. This level is extremely hard. It often needs multi-master/active-active patterns and very mature operations; many teams use specialized (often paid) solutions.
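The downtime budgets above follow directly from the arithmetic, and you can reproduce them at any SQL prompt:

```sql
-- Allowed downtime per year for each availability target,
-- using a 365.25-day year (525,960 minutes).
SELECT uptime_pct,
       round(((100 - uptime_pct) / 100) * 365.25 * 24 * 60, 1) AS minutes_down_per_year
FROM (VALUES (99.0), (99.9), (99.99), (99.999)) AS t(uptime_pct);
```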

2) How fast do you want to recover?

RTO (Recovery Time Objective) is how long you are willing to wait before your system is back up after a failure.

If your RTO is 5 minutes, it means that when your main database crashes, your failover database must take over and be serving traffic within 5 minutes. If it takes 10 minutes, you have missed your RTO.

Important

One can confuse the number of nines with RTO because both relate to downtime, but they measure different things. A system can have high uptime, but a longer RTO:

Imagine a website is up 99.99% of the time, but if it crashes, it takes 30 minutes to recover. Most of the year, it’s available, but when it fails, recovery is slower.

Or a very short RTO but slightly lower uptime:

Now imagine a website might be up 99.9% of the time, but if it crashes, it recovers within 5 minutes. It goes down slightly more often, but when it does, users are back online quickly.

RTO should always fit within your number of nines target.

3) How much data loss can you tolerate?

RPO (Recovery Point Objective) is how much data you can afford to lose if things go wrong.

In simple words: “If we go back in time, how far back is acceptable?”

If your RPO is 30 seconds, it means that in a failure, the most data you can afford to lose is the last 30 seconds of transactions.
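Once streaming replication is in place, your effective RPO at any moment is observable on the standby:

```sql
-- Run on the replica: how far behind the primary the applied data is.
-- Caveat: on an idle primary this interval grows without any data at risk,
-- so evaluate it under normal write load.
SELECT now() - pg_last_xact_replay_timestamp() AS approx_replay_delay;
```

If this delay routinely exceeds your RPO under normal load, the architecture cannot meet the objective, no matter what the failover tooling promises.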

Once You Answer These, The Path Becomes Clear

Then you can choose an architecture that matches your needs, based on:

  • Your downtime limit
  • Your recovery time and your data loss tolerance.

The main takeaway from this first part is that you don’t need to be an expert to start. What matters is asking the right questions and understanding your business’s operational needs.

What’s Coming in Part 2

In Part 2, we will deploy an architecture that delivers near-zero RTO and RPO with 99.99% high availability, all without requiring deep technical expertise, and with open-source tools only.

 

The post The Road to Deploy a Production-Grade, Highly Available System with Open-Source Tools appeared first on Stormatics.

]]>
https://stormatics.tech/blogs/deploy-production-grade-highly-available-system-with-open-source-tools/feed 0
PostgreSQL, MongoDB, and what “cannot scale” really means https://stormatics.tech/blogs/postgresql-mongodb-and-what-cannot-scale-really-means https://stormatics.tech/blogs/postgresql-mongodb-and-what-cannot-scale-really-means#respond Mon, 08 Dec 2025 15:58:36 +0000 https://stormatics.tech/?p=27621 Last week, I read The Register’s coverage of MongoDB CEO Chirantan “CJ” Desai telling analysts that a “super-high growth AI company … switched from PostgreSQL to MongoDB because PostgreSQL could not just scale.”
I believe you can show the value of your own technology without tearing down another. That is really what this post is about.

The post PostgreSQL, MongoDB, and what “cannot scale” really means appeared first on Stormatics.

]]>

Last week, I read The Register’s coverage of MongoDB CEO Chirantan “CJ” Desai telling analysts that a “super-high growth AI company … switched from PostgreSQL to MongoDB because PostgreSQL could not just scale.” (The Register)

I believe you can show the value of your own technology without tearing down another. That is really what this post is about.

I run a company that lives inside PostgreSQL production issues every day. We deal with performance, HA and DR design, cost optimisation, and the occasional “we are down, please help” call from a fintech or SaaS platform. My perspective comes from that hands-on work, not from a quarterly earnings script.

Scaling stories are always more complex than a soundbite

When a CEO says, “This AI customer could not scale with PostgreSQL,” a lot of details disappear:

  • What did the schema and access patterns look like?
  • How was the application managing connections and transactions?
  • What was running underneath? Cloud managed PostgreSQL, a home-rolled cluster, or something else?
  • Were vertical and horizontal scaling options fully explored?
  • Did the team have PostgreSQL specialists involved, or was everything on default settings?

None of that reduces MongoDB’s strengths. Document databases are excellent for certain workloads: highly variable document shapes, rapid iteration, and development teams that think in JSON first. MongoDB has earned its place.

My concern is with the narrative that PostgreSQL, as a technology, “cannot scale,” especially given that it is the most popular database among professional developers and has overtaken MongoDB in the DB-Engines ranking over the last decade. (DB-Engines)

The method for calculating this ranking is detailed here: https://db-engines.com/en/ranking_definition

Every database has strengths. Every database has trade-offs. Real-world results come from matching those strengths to the workload and from running the system with care.

What PostgreSQL offers for scaling

PostgreSQL scaling is a spectrum. When someone says “it does not scale,” the real question is usually “which dimension of scale, under which design?”

1. Vertical scale: one node, serious throughput

A single well-configured PostgreSQL instance on modern hardware handles:

  • Hundreds of thousands of transactions per second
  • Tens of terabytes of data on a single node

This is not marketing copy; PostgreSQL consultants have demonstrated these ranges repeatedly in the field.

In practice, many SaaS and fintech workloads have lived entirely within this vertical envelope for years. A typical high-end OLTP node may look like:

# Excerpt from postgresql.conf on a busy OLTP system
shared_buffers       = '32GB'
effective_cache_size = '96GB'
work_mem             = '64MB'
maintenance_work_mem = '4GB'
max_connections      = 500

wal_level            = replica
max_wal_senders      = 20
synchronous_commit   = on

Layer that on top of fast NVMe storage, tuned Linux parameters (huge pages, I/O schedulers, TCP settings), and connection pooling (PgBouncer or Pgpool-II), and you already have a very capable single node.
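To make the pooling piece concrete, here is a minimal PgBouncer configuration sketch in transaction-pooling mode; the database name and limits are illustrative placeholders, not recommendations:

; pgbouncer.ini -- minimal sketch; names and limits are illustrative
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr       = 0.0.0.0
listen_port       = 6432
pool_mode         = transaction
max_client_conn   = 2000
default_pool_size = 50

With transaction pooling, thousands of client connections can share a few dozen server connections, which is how a max_connections of 500 on the server can serve a much larger application fleet.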

2. Horizontal scale: scale-out patterns that teams use today

When the single-node envelope is not enough, PostgreSQL offers multiple, well-understood patterns:

Read scaling with replicas

  • Native streaming replication provides multiple read replicas.
  • Connection routing at the application or proxy layer directs reads to replicas and writes to the primary.
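Routing logic needs a reliable way to tell a replica from the primary. One common check, shown here as a minimal sketch, is PostgreSQL's built-in pg_is_in_recovery() function:

-- Returns true on a streaming replica, false on the primary;
-- proxies and health checks commonly use this to route writes.
SELECT pg_is_in_recovery();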

Partitioning and sharding

Start with native partitioning:

CREATE TABLE events (
   tenant_id    bigint,
   created_at   timestamptz,
   payload      jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_q4
   PARTITION OF events
   FOR VALUES FROM ('2025-10-01') TO ('2026-01-01');

This reduces index and table bloat, keeps hot data in smaller structures, and improves cache efficiency. From there, you can shard by tenant or region using extensions such as Citus, which turn PostgreSQL into a distributed cluster while keeping SQL semantics.
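To sketch the sharding step: assuming a Citus cluster is already set up (an assumption about your environment, not part of stock PostgreSQL), a table like events can be distributed across worker nodes by tenant with a single call:

-- Hypothetical example; assumes a running Citus coordinator and workers
CREATE EXTENSION citus;
SELECT create_distributed_table('events', 'tenant_id');

Queries that filter on tenant_id are then routed to the shard holding that tenant, while the application continues to speak ordinary SQL.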

Specialised distributed PostgreSQL services

The ecosystem now includes fully managed, distributed PostgreSQL or PostgreSQL-compatible systems:

  • AWS Aurora
  • Google AlloyDB
  • Microsoft Azure HorizonDB
  • Third-party systems such as PGD, TimescaleDB, YugabyteDB, CockroachDB

Even The Register article that quoted the “could not scale” remark listed these services as answers to concerns about PostgreSQL scalability. (The Register)

These platforms exist precisely because organisations want PostgreSQL semantics and ecosystem benefits at a very large scale.

None of this is science fiction. It is what many of us run for clients every day.

AI workloads are still workloads

A lot of the rhetoric here anchors on “AI workloads,” as if that label alone demands a completely different class of database.

When you strip away the hype, most AI-heavy platforms combine:

  • High-volume event ingestion (traces, actions, telemetry)
  • Vector search over embeddings
  • Metadata and configuration storage
  • Analytical queries over usage and performance

PostgreSQL handles these patterns very well when you apply the right architecture:

  • The pgvector extension provides vector types and indexes.
  • Time-series and event tables benefit from partitioning and, if appropriate, time-series extensions. (Tiger Data)
  • Read replicas and distributed variants cover high-volume read access and multi-region requirements.
  • Strong transactional semantics and mature tooling simplify the “critical path” parts of AI platforms, such as billing, entitlements, and configuration.
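For instance, a minimal pgvector setup looks like the following; the table name and dimension are illustrative, and HNSW indexes require pgvector 0.5.0 or later:

-- Assumes the pgvector extension is available on the server
CREATE EXTENSION vector;

CREATE TABLE docs (
   id        bigserial PRIMARY KEY,
   embedding vector(3)   -- real systems typically use e.g. 1536 dimensions
);

CREATE INDEX ON docs USING hnsw (embedding vector_l2_ops);

-- Nearest-neighbour search by Euclidean distance
SELECT id FROM docs ORDER BY embedding <-> '[1,2,3]' LIMIT 10;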

MongoDB also effectively supports AI workloads when the document model aligns with the system’s needs. The real design work lies in balancing transactional consistency, query patterns, schema evolution, and operational comfort for the team.

“AI workload” is a description of behaviour, not a free pass to declare one database category the winner.

What I see in real PostgreSQL production environments

Because I run a PostgreSQL-only services firm, I tend to see systems when they are under pressure:

  • A fintech running a three-node PostgreSQL cluster sustaining high write rates and achieving 99.99% availability with automated failover and tested DR.
  • A last-mile delivery platform built on Odoo and PostgreSQL, where careful indexing, parameter tuning, and query refactoring improved throughput by several orders of magnitude without a move to any new database.
  • Multi-region architectures where we measured replication behaviour across regions and tuned topology, sync levels, and application retry logic until lag and failure modes stayed inside business SLAs.

In each case, PostgreSQL scaled just fine once the architecture matched the problem:

  • Hot paths isolated into lean tables.
  • Background jobs separated from request paths.
  • Connection limits enforced with pooling.
  • Replication configured deliberately instead of left at defaults.

The blocker was rarely PostgreSQL as a core technology. The blocker was design, implementation, or operational discipline.

Where MongoDB is a strong choice

To keep this honest: there are scenarios where MongoDB is a very reasonable first choice, even in systems that already use PostgreSQL:

  • Highly polymorphic document payloads with frequent structural changes, where strict relational modelling would slow teams down.
  • Use cases dominated by document-level access with minimal cross-document joins.
  • Teams with deep MongoDB experience and an existing operational toolchain that they trust.

Choosing MongoDB in those scenarios is smart engineering.

My issue is with narratives that imply PostgreSQL is fundamentally unable to scale, rather than saying, “For this workload and this team, MongoDB was a better fit.”

Our industry benefits greatly when leaders frame their success that way.

Scaling is an engineering discipline, not a brand attribute

I believe responsible comparisons look more like this:

1. Start from the workload

  • Request rate, throughput, and latency targets
  • Data model, relationships, and query patterns
  • Consistency and durability requirements

2. Consider team capabilities

  • SQL and relational experience
  • Existing operational muscle around PostgreSQL, MongoDB, or both
  • Appetite for running distributed systems

3. Match patterns, then products

  • Vertical scale, replicas, and partitioning may cover years of growth.
  • When you truly outgrow those, consider distributed PostgreSQL services or specialised databases, including MongoDB, that address the specific gap.

4. Benchmark honestly

  • Use the same hardware class, realistic schemas, and realistic queries.
  • Measure p95 and p99 latencies, plan stability, failure behaviour, and operational effort, not only headline QPS.

This is where meaningful decisions happen, far away from any sentence that says “X cannot scale.”
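As a starting point for that kind of honest benchmark, pgbench ships with PostgreSQL; the database name and run parameters below are illustrative, not a recommended workload:

# Initialise a test database at scale factor 100 (roughly 1.5 GB)
pgbench -i -s 100 benchdb

# 32 clients, 8 worker threads, 5 minutes, progress report every 10 seconds
pgbench -c 32 -j 8 -T 300 -P 10 benchdb

Synthetic pgbench numbers are only a baseline; follow them with runs against your own schema and queries before drawing conclusions.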

For CTOs and Database Engineers reading this

If you are responsible for a PostgreSQL deployment and you are now second-guessing your choices because of a quote from an earnings call, you are exactly who I am writing for.

A few practical questions you can ask yourself before considering a move from PostgreSQL to anything else:

  • Have we exhausted vertical scaling on a well-tuned primary with fast storage?
  • Are we using read replicas where they make sense?
  • Is our schema designed for our access patterns, or is the database carrying application-layer problems?
  • Have we explored partitioning for our largest, hottest tables?
  • Are we testing and measuring, or reacting to anecdotes?
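To start answering the partitioning question, a catalog query along these lines (standard PostgreSQL system catalogs) surfaces the largest tables:

-- Ten largest tables by total size, including indexes and TOAST
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;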

If the answer to several of these is “no,” you still have a lot of PostgreSQL headroom.

My closing thought

You can show MongoDB’s value without framing PostgreSQL as a dead end. You can show PostgreSQL’s value without dismissing MongoDB.

As practitioners, we owe it to the people who depend on these systems to keep the conversation grounded in architecture, workload patterns, and operational reality, rather than headlines.

The post PostgreSQL, MongoDB, and what “cannot scale” really means appeared first on Stormatics.

]]>
https://stormatics.tech/blogs/postgresql-mongodb-and-what-cannot-scale-really-means/feed 0