SQLpassion https://www.sqlpassion.at SQLpassion provides high-quality SQL Server Consulting services for clients around Europe & the US Mon, 02 Mar 2026 11:50:31 +0000 en-US hourly 1 https://wordpress.org/?v=5.6.17 Don’t Miss Out – SQL Server 2025 Unleashed Training Starts Next Monday! https://www.sqlpassion.at/archive/2026/03/02/dont-miss-out-sql-server-2025-unleashed-training-starts-next-monday/?utm_source=rss&utm_medium=rss&utm_campaign=dont-miss-out-sql-server-2025-unleashed-training-starts-next-monday https://www.sqlpassion.at/archive/2026/03/02/dont-miss-out-sql-server-2025-unleashed-training-starts-next-monday/#respond Mon, 02 Mar 2026 10:19:24 +0000 https://www.sqlpassion.at/?p=12343 Don’t Miss Out – SQL Server 2025 Unleashed Training Starts Next Monday! + read more

Next Monday, March 9, 2026, my one-day live online training SQL Server 2025 Unleashed: AI, Performance & Beyond will take place! This hands-on training is designed to help DBAs, developers, and consultants deeply understand the new features in SQL Server 2025 – like Vector Search, AI integration, Optimized Locking, and JSON support.

If you’re still on the fence: there are open seats available, but they’re filling up quickly. Don’t miss your chance to:

  • Learn what’s new and relevant in SQL Server 2025
  • Integrate AI and Vector Search into your own database applications
  • Explore practical use cases around the new performance and T-SQL improvements
  • Achieve High Availability and Disaster Recovery with SQL Server 2025

👉 Secure your spot today before it’s too late: https://www.SQLpassion.at/live-trainings/sql-server-2025-unleashed/

See you online next Monday!

Thanks,

-Klaus

PostgreSQL Quickie #2: Installation in Virtual Machines https://www.sqlpassion.at/archive/2026/02/23/postgresql-quickie-2-installation-in-virtual-machines/?utm_source=rss&utm_medium=rss&utm_campaign=postgresql-quickie-2-installation-in-virtual-machines https://www.sqlpassion.at/archive/2026/02/23/postgresql-quickie-2-installation-in-virtual-machines/#respond Mon, 23 Feb 2026 17:21:34 +0000 https://www.sqlpassion.at/?p=12337 PostgreSQL Quickie #2: Installation in Virtual Machines + read more

A few weeks ago, I uploaded my second PostgreSQL Quickie to YouTube. This time I’m talking about installing PostgreSQL in Virtual Machines. If you are interested in learning how to transition your SQL Server knowledge seamlessly to PostgreSQL, I highly recommend checking out my live training about it, which runs from April 20 – 21, 2026.

The Linux Page Cache and PostgreSQL https://www.sqlpassion.at/archive/2026/02/17/the-linux-page-cache-and-postgresql/?utm_source=rss&utm_medium=rss&utm_campaign=the-linux-page-cache-and-postgresql https://www.sqlpassion.at/archive/2026/02/17/the-linux-page-cache-and-postgresql/#comments Tue, 17 Feb 2026 09:37:04 +0000 https://www.sqlpassion.at/?p=12281 The Linux Page Cache and PostgreSQL + read more

PostgreSQL performance on Linux is often discussed in terms of SQL tuning, indexes, and query plans. Yet many real-world performance problems originate much lower in the stack: inside the Linux page cache and its writeback policy. PostgreSQL deliberately relies on the kernel for file caching and writeback, which means that kernel misconfiguration can silently undermine otherwise well-tuned database systems.

This blog posting explains how the Linux page cache works, how PostgreSQL integrates with it, and – most importantly – how incorrect page cache and dirty page settings can negatively affect PostgreSQL workloads.

The Linux Page Cache: Deferred I/O by Design

On Linux, the page cache is the kernel’s in-memory representation of file-backed data. Any read or write performed through normal system calls interacts with this cache. Pages are typically 4 KiB in size and are tracked globally by the virtual memory subsystem.

When data is read, the page cache acts as a classic cache: if the data is present and valid, the kernel serves it directly from memory. When data is written, however, Linux behaves differently. The kernel copies the data into the page cache, marks the affected pages as dirty, and immediately returns control to the calling process. Persistence is deferred.

This deferred-write model is fundamental to Linux I/O performance. It allows the kernel to coalesce writes, reorder them, and schedule I/O efficiently. The cost is that memory and disk can diverge significantly, and the kernel must actively manage how much dirty data it allows to accumulate.

Dirty Pages and the Two Critical Thresholds

Linux controls dirty memory using two related thresholds.

The first threshold is governed by vm.dirty_background_ratio. Once the amount of dirty memory exceeds this percentage of total RAM, the kernel starts background writeback. Dedicated flusher threads begin writing dirty pages to disk asynchronously. At this stage, applications are not slowed down; the kernel is merely trying to keep dirty memory from growing without bound.

The second threshold is defined by vm.dirty_ratio. This is a hard limit. When the amount of dirty memory exceeds this value, the kernel stops being polite. Processes that generate new dirty pages are actively throttled. Writes may block, and processes may be forced to sleep until enough dirty pages have been written back.

The distinction matters greatly. vm.dirty_background_ratio controls when cleaning starts. vm.dirty_ratio controls when applications are forced to wait. Misconfiguring either can have serious consequences for write-heavy workloads.
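Both thresholds can be inspected at runtime without any extra tooling; a minimal sketch on Linux (default values vary per distribution):

```shell
# Current thresholds, in percent of reclaimable memory (defaults vary per distribution)
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio

# Amount of dirty data currently waiting for writeback
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

Note that if the byte-based counterparts (vm.dirty_background_bytes and vm.dirty_bytes) are set to a non-zero value, they take precedence over the corresponding ratio settings.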

PostgreSQL Caching: Shared Buffers on Top of the Kernel

PostgreSQL maintains its own cache, known as shared buffers. These buffers store table and index pages and implement database-specific logic such as MVCC visibility, locking, and WAL coordination. Shared buffers are essential for correctness and concurrency, but they are not intended to replace the kernel’s page cache.

When PostgreSQL writes a modified page, it does so using standard system calls. The data flows into the Linux page cache, where it becomes dirty. PostgreSQL relies on carefully placed fsync() calls to ensure durability, but it does not dictate when the kernel writes individual data pages to disk.

As a result, PostgreSQL effectively operates with two layers of caching: shared buffers in userspace and the page cache in the kernel. This is intentional. PostgreSQL delegates readahead, writeback batching, and I/O scheduling to Linux, while focusing on transactional correctness itself.

The downside is that PostgreSQL is directly exposed to the kernel’s dirty page policy. When the page cache is misconfigured, PostgreSQL backends pay the price.

How Page Cache Misconfiguration Hurts PostgreSQL

Many PostgreSQL performance issues attributed to “slow disks” or “bad checkpoints” are, in reality, symptoms of poor page cache configuration.

If vm.dirty_background_ratio is set too high, background writeback starts very late. During write-heavy operations – such as bulk inserts, VACUUM, or index creation – dirty pages accumulate rapidly in memory. The system may appear fast initially because writes return immediately. Eventually, however, the dirty set grows so large that writeback cannot keep up. When vm.dirty_ratio is finally reached, PostgreSQL backend processes are suddenly throttled by the kernel. Queries stall, latency spikes appear, and throughput collapses in bursts.

If vm.dirty_ratio itself is set excessively high, the problem becomes even worse. The kernel allows massive amounts of dirty memory to build up, sometimes tens of gigabytes on large systems. When writeback eventually catches up – often during checkpoints or periods of reduced activity – it does so aggressively. This results in long fsync() times, I/O saturation, and unpredictable response times, precisely the opposite of what a database workload needs.

On the other hand, setting dirty limits too low has its own dangers. If vm.dirty_background_ratio or vm.dirty_ratio is overly restrictive, PostgreSQL backends are throttled almost continuously. Write throughput drops, CPU cores go idle while waiting on I/O, and overall system efficiency suffers. The database feels sluggish even though storage bandwidth may not be fully utilized.

The most subtle failure mode is instability. Poorly chosen dirty limits can cause PostgreSQL to oscillate between fast and stalled phases, making performance hard to predict and even harder to diagnose.

Best Practices for Kernel Dirty Page Configuration

On modern database servers with large amounts of RAM, percentage-based dirty limits are often a trap. Percentages scale with memory size, not with storage throughput. Adding more RAM should not automatically allow vastly more dirty data.

For PostgreSQL systems, it is usually safer to configure absolute limits using vm.dirty_background_bytes and vm.dirty_bytes. This keeps writeback behavior predictable and independent of RAM size. Background writeback should start early enough to run continuously, and the hard limit should be low enough that the kernel can drain dirty pages without long stalls.
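As a sketch of what such a configuration might look like: the values below are illustrative assumptions only and must be sized against the sustained write throughput of your storage, not copied verbatim:

```
# /etc/sysctl.d/99-postgresql.conf -- illustrative values, not a recommendation
# Non-zero *_bytes values take precedence over the corresponding *_ratio settings
vm.dirty_background_bytes = 268435456    # start background writeback at ~256 MiB
vm.dirty_bytes = 1073741824              # hard-throttle writers at ~1 GiB
```

Settings placed in /etc/sysctl.d/ are applied at boot; running sysctl --system reloads them immediately.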

Equally important is aligning kernel behavior with PostgreSQL configuration. Checkpoint frequency, checkpoint completion targets, and background writer activity must be considered together with dirty page limits. Tuning one layer in isolation often makes things worse, not better.
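On the PostgreSQL side, the settings that interact most directly with kernel writeback are the checkpoint and background writer parameters; a sketch of the relevant postgresql.conf knobs (the values shown are illustrative assumptions):

```
# postgresql.conf -- illustrative values only
checkpoint_timeout = 15min           # maximum interval between checkpoints
checkpoint_completion_target = 0.9   # spread checkpoint I/O across the interval
max_wal_size = 4GB                   # WAL volume that can trigger an early checkpoint
bgwriter_lru_maxpages = 200          # pages the background writer may flush per round
```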

Summary

PostgreSQL’s performance on Linux is inseparable from the kernel’s page cache behavior. Shared buffers and the page cache form a cooperative system, and kernel parameters such as vm.dirty_background_ratio and vm.dirty_ratio define how smoothly that cooperation works.

Misconfigured dirty page limits can turn fast storage into unpredictable latency, stall database backends, and make performance tuning feel like guesswork. Correctly configured, the same mechanisms provide steady writeback, stable latency, and predictable throughput. Understanding and testing these interactions is therefore not optional – it is a core skill for running PostgreSQL reliably on Linux.

Thanks for reading,

-Klaus

Where is My Clustered Index in PostgreSQL? https://www.sqlpassion.at/archive/2026/02/11/where-is-my-clustered-index-in-postgresql/?utm_source=rss&utm_medium=rss&utm_campaign=where-is-my-clustered-index-in-postgresql https://www.sqlpassion.at/archive/2026/02/11/where-is-my-clustered-index-in-postgresql/#comments Wed, 11 Feb 2026 11:51:29 +0000 https://www.sqlpassion.at/?p=12289 Where is My Clustered Index in PostgreSQL? + read more

Every SQL Server professional eventually reaches a point where storage internals stop being abstract. You learn that a table is not just a logical container but a physical structure, and that structure is defined by the clustered index. Rows live at the leaf level of a B-tree, the clustering key defines physical order, and every nonclustered index ultimately points back to that key. Once this model settles in, performance tuning feels almost intuitive. Range scans behave predictably, bookmark lookups make sense, and fragmentation becomes an expected consequence of how data is stored.

That is why the first encounter with PostgreSQL often feels disorienting. You look for clustered indexes because, in your experience, a serious database engine must have them. You eventually discover the CLUSTER command and assume you have found the equivalent. Then you realize it is not persistent, not automatic, and not enforced. At that moment, PostgreSQL feels incomplete, as if a fundamental optimization mechanism were missing.

The truth is that PostgreSQL is not missing clustered indexes. It deliberately chose a different foundation.

👉 If you want to learn more about SQL Server 2025 and PostgreSQL, I highly recommend having a look at my upcoming online trainings.

Why PostgreSQL Refuses to Store Data Inside an Index

In SQL Server, the clustered index is the table. The data itself is stored inside the B-tree, and the physical order of rows is inseparable from the index definition. This makes physical ordering a core design feature, but it also means that every write operation must preserve that structure. Inserts, updates, and deletes all interact directly with the tree, which is why page splits, fragmentation, and index rebuilds are normal operational concerns.

PostgreSQL does not work this way. Tables are stored as heaps, which are essentially unordered collections of data pages. Indexes are separate structures that contain pointers to rows rather than the rows themselves. These pointers reference physical locations known as tuple IDs. The index answers the question “how do I find this row?” but not “where should this row live physically?”
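These tuple IDs are visible from SQL as the hidden system column ctid, which encodes a (block number, offset) pair; a quick way to see the heap addressing for yourself, here against the catalog table pg_class:

```sql
-- ctid is the physical (block, offset) location that index entries point to
SELECT ctid, relname
FROM pg_class
LIMIT 3;
```

Note that ctid values are not stable: an UPDATE creates a new row version at a new ctid, which is exactly where the MVCC model described next comes into play.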

This distinction becomes critical once you consider PostgreSQL’s concurrency model. PostgreSQL is built around multi-version concurrency control, and that choice dominates every other storage decision. When a row is updated, PostgreSQL does not overwrite it. Instead, it creates a new version of the row and leaves the old version in place until it is no longer visible to any transaction. Reads and writes proceed without blocking each other, even under heavy concurrency.

A true clustered index would be fundamentally at odds with this approach. If every update creates a new row version, maintaining strict physical order would require constant rebalancing of the index structure. Page splits would be frequent, optimizations like HOT updates would no longer work, and vacuum would need to physically reorder data instead of merely cleaning up dead tuples. The cost would be prohibitive.

Rather than compromise MVCC, PostgreSQL rejects the idea that physical row order should be permanently tied to an index.

The CLUSTER Command and the Illusion of Familiarity

The CLUSTER command often gives SQL Server professionals false hope. PostgreSQL can reorder a table based on an index, and after running CLUSTER, sequential scans can indeed become faster. For a brief moment, the table behaves as if it were clustered.

The key difference is that PostgreSQL makes no attempt to preserve this order. CLUSTER rewrites the table once and then steps away. As soon as new rows are inserted or existing rows are updated, the physical layout begins to drift. PostgreSQL does not consider this drift a problem, because it does not treat physical order as a stable optimization target.
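The one-shot nature is visible in the syntax itself; a minimal sketch, assuming a hypothetical table Orders with an index Orders_OrderDate_idx:

```sql
-- Rewrites the table once in index order; the ordering is NOT maintained afterwards
CLUSTER Orders USING Orders_OrderDate_idx;

-- CLUSTER without arguments re-clusters all tables that were clustered before
CLUSTER;
```

Be aware that CLUSTER takes an ACCESS EXCLUSIVE lock on the table while it rewrites it, which is another reason PostgreSQL treats it as a maintenance operation rather than a storage model.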

From a SQL Server perspective, this feels like an unfinished feature. From PostgreSQL’s perspective, it is a maintenance operation, not a storage model. Physical order can be imposed temporarily when it helps, but it is never guaranteed and never enforced.

This distinction is subtle but essential. PostgreSQL optimizes for predictable behavior under concurrency and sustained write load, not for preserving a particular physical layout over time.

👉 If you want to learn more about SQL Server 2025 and PostgreSQL, I highly recommend having a look at my upcoming online trainings.

Letting Go of Clustered Indexes and Moving Forward

Many SQL Server professionals only realize in hindsight how much operational complexity clustered indexes introduce. Choosing the wrong clustering key can affect a system for years. Index rebuilds become routine maintenance. Fragmentation management and latch contention are accepted as part of life. PostgreSQL avoids these problems by design, even though it asks you to give up something that feels fundamental.

This is why asking for the PostgreSQL equivalent of a clustered index usually leads to frustration. There is no equivalent, because PostgreSQL does not want one. Instead, it encourages different ways of thinking about performance. Sometimes that means relying on logical access paths rather than physical order. Sometimes it means using partitioning to reduce the size of the data you scan. Sometimes it means exploiting natural data correlation with different index types.

The transition is not easy, especially if clustered indexes have been central to your tuning strategies for years. But once you stop trying to recreate SQL Server’s storage model and start working with PostgreSQL’s assumptions, the system becomes coherent. Concurrency feels smoother, write performance becomes more predictable, and maintenance becomes less dramatic.

The absence of clustered indexes is not a limitation. It is an architectural decision – one that only reveals its value after you accept that PostgreSQL is solving a different set of problems, in a different way.

Thanks for reading,

-Klaus

Vector Capabilities with pgvector in PostgreSQL https://www.sqlpassion.at/archive/2026/04/02/vector-capabilities-with-pgvector-in-postgresql/?utm_source=rss&utm_medium=rss&utm_campaign=vector-capabilities-with-pgvector-in-postgresql https://www.sqlpassion.at/archive/2026/04/02/vector-capabilities-with-pgvector-in-postgresql/#respond Wed, 04 Feb 2026 11:26:48 +0000 https://www.sqlpassion.at/?p=12297 Vector Capabilities with pgvector in PostgreSQL + read more

With SQL Server 2025, Microsoft introduces vector support as a native engine feature: a built-in VECTOR(dim) data type, dedicated vector functions, and an approximate vector index tightly integrated into the optimizer. From a SQL Server point of view, this feels natural—vectors become another first-class datatype, queried via explicit functions such as VECTOR_DISTANCE or VECTOR_SEARCH, and backed by a specialized index structure.

In PostgreSQL, the same capability is delivered through pgvector, an extension rather than a core feature. However, pgvector is not a thin add-on. It integrates deeply into PostgreSQL’s planner and executor. Instead of introducing new query syntax, it extends existing SQL concepts: expressions, operators, ordering, and indexes.

👉 If you want to learn more about SQL Server 2025 and PostgreSQL, I highly recommend having a look at my upcoming online trainings.

pgvector vs. SQL Server 2025

In SQL Server 2025, vector operations are explicit and function-driven. Distance is computed through scalar functions, and approximate nearest-neighbor search is initiated through a dedicated construct. The query text clearly communicates intent: this is a vector operation, and the engine handles it specially.

pgvector takes a different approach. It introduces a new data type, vector(n), and a set of operators that compute distance between vectors. The most important of these operators is <->.

The <-> operator represents the distance between two vectors. Conceptually, it fills the same role as VECTOR_DISTANCE in SQL Server 2025, but it is expressed as an operator, not a function. That distinction is crucial:

  • <-> returns a numeric value
  • PostgreSQL can sort by it
  • PostgreSQL can use an index to satisfy that sort

A typical pgvector query therefore looks like this:

SELECT DocumentID, Title FROM Documents
ORDER BY Embedding <-> '[0.2,0.4,0.6]'::VECTOR
LIMIT 10;

From a SQL Server perspective, this is equivalent to ordering by a computed scalar expression and applying TOP (10). The difference is that PostgreSQL’s planner understands that certain indexes (HNSW or IVFFlat) can satisfy this ordering efficiently. There is no direct analogue to <-> in SQL Server 2025, because SQL Server expresses distance through functions rather than operators. Functionally, however, both systems are solving the same problem: rank rows by similarity and return the top K.

HNSW in pgvector

The most commonly used index type in pgvector is HNSW, short for Hierarchical Navigable Small World graph. Understanding HNSW is essential to understanding pgvector’s performance characteristics.

Exact nearest-neighbor search in high-dimensional space does not scale. Computing the distance from a query vector to every row is equivalent to a full table scan with an expensive computation. HNSW exists to make this problem tractable at hundreds of thousands or millions of vectors.

HNSW is an approximate nearest-neighbor (ANN) algorithm. It trades perfect accuracy for dramatically lower latency and predictable performance, which is exactly what workloads such as semantic search, recommendation systems, and RAG pipelines require. HNSW organizes vectors into a layered graph:

  • Upper layers are sparse and allow long “jumps” across the vector space
  • Lower layers are denser and allow fine-grained local navigation
  • A query starts at the top layer, greedily walks the graph, and descends layer by layer until it reaches a small candidate set near the query vector

In pgvector, HNSW is exposed as an index access method. When you create an HNSW index, you are telling PostgreSQL: this ordering by <-> can be approximated efficiently using a graph-based structure:

CREATE INDEX documents_embedding_hnsw
ON Documents
USING hnsw (embedding vector_l2_ops);

The operator class (vector_l2_ops) defines what <-> means for this index – in this case, Euclidean (L2) distance.
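L2 is only one of the supported metrics; pgvector also ships a cosine distance operator, <=>, paired with its own operator class. A sketch reusing the Documents table from above:

```sql
-- Cosine distance: pair the <=> operator with the vector_cosine_ops operator class
CREATE INDEX documents_embedding_hnsw_cosine
ON Documents
USING hnsw (Embedding vector_cosine_ops);

SELECT DocumentID, Title FROM Documents
ORDER BY Embedding <=> '[0.2,0.4,0.6]'::VECTOR
LIMIT 10;
```

The index can only accelerate the operator its operator class defines, so the operator in the ORDER BY clause must match the operator class used at index creation time.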

A pgvector example

This walkthrough mirrors how a SQL Server professional would evaluate any new indexing feature: establish a baseline, add an index, and compare execution plans. Let’s enable the extension in the first step:

CREATE EXTENSION IF NOT EXISTS vector;

This registers the vector(n) type and the distance operators, including <->.

In the next step we use a simple multi-tenant table, because filtered nearest-neighbor queries are where ANN behavior really matters. We also insert some test data.

-- Create a simple Documents table
CREATE TABLE documents
(
    DocumentID 	BIGSERIAL PRIMARY KEY,
    TenantID  	INT  	  NOT NULL,
    Title       TEXT      NOT NULL,
    Embedding   VECTOR(3) NOT NULL
);

-- Insert some data
INSERT INTO Documents (TenantID, Title, Embedding)
SELECT
    (gs % 100) + 1,
    format('doc-%s', gs),
    ARRAY[
        random()::real,
        random()::real,
        random()::real
    ]::vector(3)
FROM generate_series(1, 200000) AS gs;

-- Update the statistics
ANALYZE Documents;

Without a vector index, PostgreSQL must scan the entire table and compute distances for every row.

-- A simple query
EXPLAIN (ANALYZE)
SELECT DocumentID, Title FROM Documents
ORDER BY Embedding <-> '[0.2,0.4,0.6]'::VECTOR
LIMIT 10;

For a SQL Server DBA, this should look like a classic “compute and sort” plan – exactly what you would expect without a supporting index.

On my system, this query runs for about 30 ms. Let’s now create a supporting HNSW index for this query:

-- Create a supporting index
CREATE INDEX documents_embedding_hnsw
ON Documents
USING hnsw (embedding vector_l2_ops);

Re-running the same query now typically shows an index-driven plan with dramatically lower execution time and I/O.

As you can see, the query now runs in only about 1 ms, which is quite an impressive improvement.

👉 If you want to learn more about SQL Server 2025 and PostgreSQL, I highly recommend having a look at my upcoming online trainings.

Summary

SQL Server 2025 and PostgreSQL with pgvector approach vector search from different angles. SQL Server emphasizes native types and explicit vector functions. PostgreSQL emphasizes composability: vectors become sortable values, <-> becomes a first-class ordering operator, and HNSW becomes another index access method the planner can reason about.

For a SQL Server DBA or developer, pgvector is best understood not as a foreign concept, but as PostgreSQL extending its relational model into high-dimensional space. Execution plans still matter. Statistics still matter. Filters still matter. The tools are different, but the discipline is the same.

Thanks for reading,

-Klaus

Don’t Miss Out – SQL Server Query Tuning Fundamentals Starts Next Monday! https://www.sqlpassion.at/archive/2026/02/02/dont-miss-out-sql-server-query-tuning-fundamentals-starts-next-monday/?utm_source=rss&utm_medium=rss&utm_campaign=dont-miss-out-sql-server-query-tuning-fundamentals-starts-next-monday https://www.sqlpassion.at/archive/2026/02/02/dont-miss-out-sql-server-query-tuning-fundamentals-starts-next-monday/#respond Mon, 02 Feb 2026 14:35:42 +0000 https://www.sqlpassion.at/?p=12321 Don’t Miss Out – SQL Server Query Tuning Fundamentals Starts Next Monday! + read more

Next Monday, February 9, 2026, my one-day live online training SQL Server Query Tuning Fundamentals will take place! This hands-on session is designed to help DBAs, developers, and consultants deeply understand how SQL Server executes queries and how you can systematically improve performance with smarter indexing and execution-plan analysis.

If you’re still on the fence: there are open seats available, but they’re filling up quickly. Don’t miss your chance to:

  • Learn how SQL Server accesses data and chooses execution plans
  • Understand core query operators and performance pitfalls
  • Speak confidently about tuning and optimization – not guessing

👉 Secure your spot today before it’s too late: https://www.SQLpassion.at/live-trainings/sql-server-query-tuning-fundamentals/

See you online next Monday!

Thanks,

-Klaus

Spatial Data with PostGIS in PostgreSQL https://www.sqlpassion.at/archive/2026/01/26/spatial-data-with-postgis-in-postgresql/?utm_source=rss&utm_medium=rss&utm_campaign=spatial-data-with-postgis-in-postgresql https://www.sqlpassion.at/archive/2026/01/26/spatial-data-with-postgis-in-postgresql/#respond Mon, 26 Jan 2026 11:13:31 +0000 https://www.sqlpassion.at/?p=12260 Spatial Data with PostGIS in PostgreSQL + read more

Developers coming from SQL Server often approach PostgreSQL with the assumption that spatial functionality will look and behave more or less the same. After all, both systems expose geometry and geography types, both support distance calculations, and both offer spatial indexes. That assumption is only partially correct.

While PostgreSQL with PostGIS covers everything SQL Server Spatial can do, it also introduces new compositional concepts that SQL Server simply does not have. Understanding those concepts is the key to using PostGIS effectively instead of trying to replicate SQL Server patterns verbatim.

This blog posting walks through a small but complete PostGIS example and explains how each piece maps to what SQL Server developers already know—while highlighting the PostgreSQL-specific ideas along the way.

Spatial Support as an Extension

In SQL Server, spatial support is part of the engine. You never “enable” it; geometry and geography are always available. PostgreSQL takes a different approach. Spatial functionality is delivered via an extension that must exist on the system and be enabled per database:

-- Check if the PostGIS extension is installed on the system
SELECT * FROM pg_available_extensions WHERE name = 'postgis';

-- Create the extension in the current database
CREATE EXTENSION IF NOT EXISTS postgis;

-- Check the PostGIS version
SELECT postgis_version();

Conceptually, this is not a limitation but a design choice. PostgreSQL keeps the core engine small and pushes domain-specific logic into native extensions. PostGIS is one of the most mature examples of this philosophy: it is not a wrapper or add-on, but deeply integrated native code.

For SQL Server developers, the important mental shift is that features are opt-in per database, not global.

Geometry and Geography: Same Names, More Explicit Design

At first glance, PostGIS looks reassuringly familiar. You still get geometry for planar calculations and geography for spheroidal, meter-based calculations. The difference becomes apparent when you look at how the two are related:

-- Create a new table with the GEOMETRY and GEOGRAPHY data types
CREATE TABLE PointOfInterests
(
    PoiID      	BIGSERIAL PRIMARY KEY,
    Name        TEXT NOT NULL,
    Category    TEXT NOT NULL,
    Geom        GEOMETRY(POINT, 4326) NOT NULL,
    Geog        GEOGRAPHY(POINT, 4326) GENERATED ALWAYS AS (GEOM::GEOGRAPHY) STORED,
    CreatedAt   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

In SQL Server, GEOMETRY and GEOGRAPHY are separate columns with no inherent relationship. If you store both, it is your responsibility to keep them consistent. PostgreSQL allows you to declare that relationship explicitly using a generated column. Here, Geog is always derived from Geom. It cannot drift, it cannot be forgotten during inserts, and it is always correct by definition. This is a small feature, but it fundamentally changes how safely you can model spatial data.

Also worth noting for SQL Server developers:

  • BIGSERIAL is PostgreSQL’s shorthand for an auto-incrementing bigint
  • TIMESTAMPTZ always stores timestamps in UTC internally, unlike SQL Server’s datetimeoffset semantics

Inserting Spatial Data

Inserting data looks very similar to SQL Server, with one important difference: PostGIS requires you to be explicit about SRIDs.

-- Insert some Points of Interests
INSERT INTO PointOfInterests (Name, Category, Geom) VALUES
('Stephansdom', 'sight',   ST_SetSRID(ST_MakePoint(16.373819, 48.208174), 4326)),
('Prater',      'park',    ST_SetSRID(ST_MakePoint(16.404954, 48.216778), 4326)),
('HBF Wien',    'station', ST_SetSRID(ST_MakePoint(16.375000, 48.185000), 4326)),
('Schönbrunn',  'sight',   ST_SetSRID(ST_MakePoint(16.312222, 48.184516), 4326)),
('Donauinsel',  'park',    ST_SetSRID(ST_MakePoint(16.410000, 48.240000), 4326));

SQL Server will happily let you insert spatial data without enforcing SRID consistency. PostGIS treats SRIDs as first-class metadata and uses them aggressively in validation, transformations, and indexing. This strictness pays off later when queries become more complex.

Spatial Indexing

Indexing is where architectural differences start to matter.

-- Create a spatial index
CREATE INDEX poi_geom_gix   ON PointOfInterests USING GIST (geom);

-- Update the statistics
ANALYZE PointOfInterests;

SQL Server spatial indexes are grid-based and require careful configuration to avoid pathological plans. PostGIS relies on GiST indexes, which behave like generalized R-trees and integrate naturally with the PostgreSQL query planner. The ANALYZE step is explicit in PostgreSQL. Statistics are not always refreshed automatically in development environments, and PostGIS queries are particularly sensitive to selectivity estimates.
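Whether the GiST index is actually chosen can be verified the same way you would check an execution plan in SQL Server; a sketch against the table above (on a table this small, the planner may still prefer a sequential scan):

```sql
-- Look for an Index Scan / Bitmap Index Scan on poi_geom_gix in the plan output
-- 0.05 is in SRID units (degrees here), since geometry distances are planar
EXPLAIN (ANALYZE)
SELECT Name
FROM PointOfInterests
WHERE ST_DWithin(Geom, ST_SetSRID(ST_MakePoint(16.37, 48.21), 4326), 0.05);
```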

Distance Calculations

Distance queries using geography will feel immediately familiar:

-- Distance calculation between 2 Point of Interests
SELECT
  	a.Name AS from_name,
  	b.Name AS to_name,
  	ST_Distance(a.geog, b.geog) AS distance_m
FROM PointOfInterests a
JOIN PointOfInterests b ON a.Name = 'Stephansdom' AND b.Name = 'Schönbrunn';

This maps almost one-to-one to STDistance() in SQL Server. The key takeaway is that PostGIS makes no attempt to hide whether you are working in planar or spheroidal space—you choose explicitly via the data type.
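That explicit choice is easy to demonstrate: running ST_Distance against both columns of the table above returns degrees for the geometry column and meters for the geography column:

```sql
-- Geometry in SRID 4326: planar distance in degrees
-- Geography: spheroidal distance in meters
SELECT
    ST_Distance(a.Geom, b.Geom) AS distance_degrees,
    ST_Distance(a.Geog, b.Geog) AS distance_m
FROM PointOfInterests a
JOIN PointOfInterests b ON a.Name = 'Stephansdom' AND b.Name = 'Schönbrunn';
```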

CROSS JOIN LATERAL

The next query introduces something that does not exist in SQL Server:

-- All Point of Interests within a radius
SELECT
	p.PoiID,
  	p.Name,
  	p.Category,
  	ST_Distance(p.Geog, c.Geog) AS distance_m
FROM PointOfInterests p
CROSS JOIN LATERAL
(
  SELECT Geog FROM PointOfInterests WHERE Name = 'Stephansdom'
) c
WHERE ST_DWithin(p.Geog, c.Geog, 4000)
ORDER BY distance_m;

CROSS JOIN LATERAL allows a subquery to reference columns from the current row of the outer query. In SQL Server, you would typically simulate this using variables, CTEs, or APPLY. PostgreSQL makes it a first-class relational operator.

Think of LATERAL as a per-row function invocation. The subquery is executed logically once per outer row, but the planner is free to optimize it aggressively. This feature enables expressive, composable spatial queries without procedural workarounds.
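This composability shines in top-N-per-row queries. As a sketch against the PointOfInterests table above, the following finds the nearest other POI for every row – a pattern that would need CROSS APPLY in SQL Server:

```sql
-- For each POI, the lateral subquery runs logically once per outer row
SELECT
  p.Name,
  n.Name AS nearest_name,
  n.distance_m
FROM PointOfInterests p
CROSS JOIN LATERAL
(
  SELECT p2.Name, ST_Distance(p.Geog, p2.Geog) AS distance_m
  FROM PointOfInterests p2
  WHERE p2.PoiID <> p.PoiID
  ORDER BY distance_m
  LIMIT 1
) n;
```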

Nearest Neighbor Search and the <-> Operator

The final query demonstrates one of PostGIS’s most powerful features:

-- Nearest Neighbor search
SELECT
  	p2.PoiID,
  	p2.Name,
  	p2.Category,
  	ST_Distance(p2.geog, ref.geog) AS distance_m
FROM PointOfInterests p2
CROSS JOIN (SELECT Geog, Geom FROM PointOfInterests WHERE Name = 'Stephansdom') ref
WHERE p2.name <> 'Stephansdom'
ORDER BY p2.Geom <-> ref.Geom
LIMIT 3;

The <-> operator performs K-nearest-neighbor (KNN) ordering using the GiST index. This is not a function call – it is an operator that the planner understands deeply. SQL Server has no equivalent operator. Nearest-neighbor queries there usually involve manual distance calculations combined with TOP and ORDER BY, often resulting in expensive plans.

In PostGIS, <-> allows the database to:

  • Use the spatial index for ordering
  • Avoid computing exact distances for all rows
  • Return nearest neighbors efficiently and predictably

This is a fundamental difference in how spatial querying is expressed.

Summary

For SQL Server developers, PostGIS is not just “SQL Server Spatial with more functions.” It is a system that encourages relational composition over procedural workarounds. Features like generated columns, LATERAL joins, and index-aware operators (<->) enable spatial queries that remain declarative, readable, and performant as complexity grows.


Thanks for reading,

-Klaus

]]>
https://www.sqlpassion.at/archive/2026/01/26/spatial-data-with-postgis-in-postgresql/feed/ 0
BRIN Indexes in PostgreSQL https://www.sqlpassion.at/archive/2026/01/19/brin-indexes-in-postgresql/?utm_source=rss&utm_medium=rss&utm_campaign=brin-indexes-in-postgresql https://www.sqlpassion.at/archive/2026/01/19/brin-indexes-in-postgresql/#respond Mon, 19 Jan 2026 10:18:29 +0000 https://www.sqlpassion.at/?p=12236 BRIN Indexes in PostgreSQL + read more

]]>
When SQL Server professionals start working seriously with PostgreSQL, most of the learning curve feels comfortable. Tables behave as expected, transactions are familiar, and B-tree indexes look reassuringly similar to what you have used for years in SQL Server. Then you encounter BRIN indexes.

At first glance, they seem almost reckless: no row pointers, no precise navigation, and an explicit acceptance of false positives. And yet, on very large tables, BRIN indexes often deliver performance gains that would require clustered indexes, partitioning, or even columnstore indexes in SQL Server. To understand why, we need to look not only at what BRIN does, but how it works internally.

👉 If you want to learn more about SQL Server 2025 and PostgreSQL, I highly recommend having a look at my upcoming online trainings.

From Precision to Probability

SQL Server indexing is built around precision. A nonclustered index maps keys to row locators. A clustered index defines the physical layout of the table itself. Performance tuning is about choosing the correct key order and keeping fragmentation under control.

PostgreSQL can do all of that too. But it also embraces a different idea: sometimes you don’t need to know where a row is – it is enough to know where rows cannot be. BRIN, short for Block Range Index, is the embodiment of that idea.

PostgreSQL stores tables as collections of 8 KB heap pages. A BRIN index groups these pages into block ranges, typically 128 heap pages per range (about 1 MB of data). For each range, PostgreSQL stores only summary information, most commonly the minimum and maximum value of the indexed column. Crucially, a BRIN index does not store page IDs or row pointers.

This is a fundamental difference compared to B-tree indexes and SQL Server nonclustered indexes. There is no list of heap pages associated with a range. Instead, the relationship between a BRIN index entry and the table is implicit. The position of a BRIN index tuple is the address of the block range.

  • Range 0 corresponds to heap pages 0 – 127
  • Range 1 corresponds to heap pages 128 – 255
  • Range N corresponds to heap pages N × pages_per_range through (N+1) × pages_per_range – 1

No pointers are stored because none are needed. PostgreSQL derives the heap pages mathematically. This design choice is one of the main reasons BRIN indexes are so small and so fast.
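The arithmetic is simple enough to verify directly. With the default of 128 pages per range, the range number and page boundaries for any heap page follow from integer division (a standalone sketch, no BRIN index required):

```sql
-- Deriving the block range for a heap page with pages_per_range = 128
SELECT
  heap_page,
  heap_page / 128               AS range_number,
  (heap_page / 128) * 128       AS first_page_in_range,
  (heap_page / 128) * 128 + 127 AS last_page_in_range
FROM (VALUES (0), (127), (128), (5000)) AS t(heap_page);
```

Heap page 5000, for example, falls into range 39, which spans pages 4992 through 5119.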

What Happens During Query Execution

When a query arrives with a predicate such as a timestamp range, PostgreSQL scans the BRIN index sequentially. For each block range, it compares the query predicate with the stored minimum and maximum values. If the predicate lies completely outside the range, PostgreSQL can skip that entire block range without reading a single heap page. If the predicate might match, the range is marked as a candidate and only those heap pages are scanned.

False positives are expected and acceptable. The cost of checking a few extra pages is negligible compared to scanning the entire table. Because the mapping between index entry and heap pages is implicit, the overhead is minimal: no pointer chasing, no tree traversal, and excellent cache locality.

A Concrete Example: Large Time-Series Data

Consider a classic logging or event table: append-only, ordered by time, and very large. This is exactly the kind of workload BRIN was designed for.

-- Create a simple table
CREATE TABLE Events
(
  ID        BIGINT 		GENERATED ALWAYS AS IDENTITY,
  Ts        TIMESTAMPTZ NOT NULL,
  DeviceID  INT         NOT NULL,
  Payload   TEXT        NOT NULL
);

-- Insert some test data
INSERT INTO Events (Ts, DeviceID, Payload)
SELECT
  TIMESTAMPTZ '2025-01-01 00:00:00+00' + MAKE_INTERVAL(secs => g),
  (g % 1000) + 1,
  MD5(g::TEXT)
FROM GENERATE_SERIES(1, 20000000) AS g;

-- Update Statistics
ANALYZE Events;

With 20 million rows inserted in strictly increasing timestamp order, the physical layout of the table aligns perfectly with time-based queries. A query that filters a single hour still has to examine a large portion of the table without an index. The following query takes around 200ms on my system:

-- Parallel Sequential Scan
-- Execution Time: 200ms
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM Events
WHERE
	Ts >= TIMESTAMPTZ '2025-06-01 10:00:00+00'
	AND Ts <  TIMESTAMPTZ '2025-06-01 11:00:00+00';

Now, let’s add a BRIN index:

-- Create a BRIN Index
CREATE INDEX idx_Events_Ts_Brin
ON Events
USING BRIN (Ts)
WITH (pages_per_range = 128);

-- Update Statistics
ANALYZE Events;

Running the same query again shows a different execution plan. PostgreSQL uses the BRIN index to identify a small number of relevant block ranges and reads only those heap pages. The buffer statistics make the benefit obvious: far fewer pages are touched. The query runs now for about 2ms:

-- Bitmap Index/Heap Scan
-- Execution Time: 2ms
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM Events
WHERE
	Ts >= TIMESTAMPTZ '2025-06-01 10:00:00+00'
	AND Ts <  TIMESTAMPTZ '2025-06-01 11:00:00+00';

Even more interesting is the size comparison:

-- The BRIN index is very small
SELECT
  pg_size_pretty(pg_relation_size('Events'))             AS heap_size,
  pg_size_pretty(pg_relation_size('idx_Events_Ts_Brin')) AS brin_size;

The BRIN index is tiny: only around 72 KB, compared to 1611 MB for the whole heap table. On a table with tens of millions of rows, a BRIN index is often measured in kilobytes or a few megabytes.

Why This Feels Unfamiliar to SQL Server Experts

From a SQL Server perspective, this approach feels unusual. A clustered index on a timestamp column would give similar range-scan performance, but at the cost of a large index structure and ongoing maintenance. Partitioning could reduce scan scope, but introduces administrative complexity. Columnstore indexes use segment-level metadata with MIN and MAX values, which is conceptually close to BRIN, but they come with a very different execution engine and workload profile.

BRIN can be thought of as automatic, ultra-lightweight, intra-table partition pruning. It achieves much of the benefit of these SQL Server techniques with a fraction of the overhead.

The Trade-Offs: When BRIN Fails

BRIN’s efficiency depends on data correlation. If values are randomly distributed, the minimum and maximum values of block ranges overlap heavily. In that case, most ranges become candidates and PostgreSQL ends up scanning large portions of the table anyway.

BRIN is also unsuitable for point lookups. Queries that search for a specific ID or require high precision still need B-tree indexes. BRIN is designed to reduce the search space, not to pinpoint rows.

Another subtle drawback is cost estimation. Because BRIN is coarse by nature, the planner may misestimate selectivity, especially on smaller tables. In those cases, PostgreSQL may correctly decide that a sequential scan is cheaper.

The key insight for SQL Server professionals is that BRIN is not a replacement for B-tree indexes. It is a complementary tool. In PostgreSQL, it is common – and powerful – to combine precise B-tree indexes for OLTP access with BRIN indexes for large-scale analytical scans on the same table. Each index serves a different purpose, and together they often outperform more complex designs.
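As a sketch on the Events table from above (the B-tree index name here is made up), that combination looks like this:

```sql
-- Precise B-tree for OLTP point lookups (hypothetical index name)
CREATE INDEX idx_Events_ID_BTree ON Events (ID);

-- Served by the B-tree: pinpoint a single row
SELECT * FROM Events WHERE ID = 4711;

-- Served by the BRIN index: coarse pruning over a large time range
SELECT COUNT(*) FROM Events
WHERE Ts >= TIMESTAMPTZ '2025-06-01 00:00:00+00'
  AND Ts <  TIMESTAMPTZ '2025-06-02 00:00:00+00';
```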

Summary

BRIN indexes highlight a deeper philosophical difference between PostgreSQL and SQL Server. SQL Server emphasizes precision and carefully engineered index structures. PostgreSQL, with BRIN, embraces the idea that an index can be imperfect as long as it is cheap and effective at scale.

Once you internalize that, BRIN stops feeling strange. It starts to feel like a pragmatic optimization that trades precision for simplicity – and wins.

For SQL Server professionals moving into PostgreSQL, understanding BRIN is more than learning a new index type. It is learning a new way of thinking about performance.

👉 If you want to learn more about SQL Server 2025 and PostgreSQL, I highly recommend having a look at my upcoming online trainings.

Thanks for your time,

-Klaus

]]>
https://www.sqlpassion.at/archive/2026/01/19/brin-indexes-in-postgresql/feed/ 0
GIN Indexes in PostgreSQL https://www.sqlpassion.at/archive/2026/01/12/gin-indexes-in-postgresql/?utm_source=rss&utm_medium=rss&utm_campaign=gin-indexes-in-postgresql https://www.sqlpassion.at/archive/2026/01/12/gin-indexes-in-postgresql/#respond Mon, 12 Jan 2026 12:52:06 +0000 https://www.sqlpassion.at/?p=12209 GIN Indexes in PostgreSQL + read more

]]>
If you come from SQL Server (like in my case), PostgreSQL indexing can feel familiar at first – B-tree indexes exist, composite indexes exist, covering indexes exist. And then you run into queries like this:

WHERE payload @> '{"type":"payment","status":"failed"}'

or this:

WHERE tsv @@ plainto_tsquery('postgresql')

At that point, most SQL Server developers ask two questions:

  1. What are these operators?
  2. Why does PostgreSQL need a completely different index type for this?

This blog posting answers both questions – and shows why GIN indexes exist, what problems they solve, and how they compare to modern SQL Server (including SQL Server 2025 with native JSON indexes).

👉 If you want to learn more about SQL Server 2025 and PostgreSQL, I highly recommend having a look at my upcoming online trainings.

First Things First: PostgreSQL Indexes Are Operator-Driven

PostgreSQL uses a fundamentally different indexing philosophy than SQL Server. In SQL Server, indexes are:

  • Column-based
  • Value-based
  • Optimized for equality, range, and order

In PostgreSQL, indexes are:

  • operator-based
  • designed around how data is queried, not just how it is stored

This is why PostgreSQL has operators that look unfamiliar – but are actually very explicit.

Understanding the Two Key Operators

Before talking about GIN, you must understand what these operators mean.

The @> operator means “contains”:

payload @> '{"type":"payment"}'

This means: “Does the JSON document in payload contain at least this key/value pair?”

It does not require:

  • Identical JSON
  • Identical ordering
  • Identical structure beyond the keys provided
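You can verify these semantics directly with two constant expressions – the left-hand document below has an extra key and a different key order, yet containment still holds:

```sql
-- Containment ignores key order and extra keys
SELECT '{"status":"failed","type":"payment","user_id":4711}'::jsonb
       @> '{"type":"payment","status":"failed"}'::jsonb;   -- true

-- But it is structural: a missing key means no containment
SELECT '{"type":"payment"}'::jsonb
       @> '{"type":"payment","status":"failed"}'::jsonb;   -- false
```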

This is very different from SQL Server’s traditional JSON functions, which historically required explicit extraction. Before SQL Server 2025, you typically wrote:

JSON_VALUE(payload, '$.type') = 'payment'

And indexed that via:

  • Persisted computed columns
  • Or filtered indexes

SQL Server 2025 improves this by introducing:

  • Native JSON data type
  • JSON indexes

However, those indexes still operate on paths and values, not on arbitrary JSON containment semantics. PostgreSQL’s @> operator answers a higher-level question: “Does this document logically include this structure?”

On the other hand, the @@ operator is used for full-text search:

tsv @@ plainto_tsquery('postgresql')

This means: “Does this document’s token vector match this text query?”

Think of it as:

  • Not LIKE
  • Not string comparison
  • But linguistic matching

SQL Server developers should compare this to Full-Text Search predicates, not to LIKE ‘%text%’.
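A small sketch makes the difference concrete: thanks to stemming, the query term 'indexing' matches the word 'indexes', which no substring search would find:

```sql
-- Linguistic matching: 'indexing' and 'indexes' both stem to 'index'
SELECT to_tsvector('english', 'PostgreSQL GIN indexes explained')
       @@ plainto_tsquery('english', 'indexing');            -- true

-- A substring search has no notion of word forms
SELECT 'PostgreSQL GIN indexes explained' LIKE '%indexing%'; -- false
```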

The Need for GIN Indexes

B-tree indexes work when:

  • One row = one indexed value
  • Comparisons are equality or range based
  • Ordering matters

A GIN (Generalized Inverted Index) is an inverted index. Instead of storing:

Row -> Value

It stores:

Value -> Many Rows

Each searchable element inside a column becomes its own index key. This is why GIN is ideal for:

  • JSONB
  • Arrays
  • Tags
  • Full-Text tokens

A Concrete Example

Let’s create a simple table that stores payment processing information:

-- Create a simple table
CREATE TABLE Events
(
    ID	        BIGINT GENERATED ALWAYS AS IDENTITY,
    OccurredAt TIMESTAMPTZ NOT NULL,
    Payload     JSONB NOT NULL
);

The column Payload will store JSON documents like the following:

{
  "type": "payment",
  "status": "failed",
  "user_id": 4711,
  "tags": ["stripe", "europe"]
}

Let’s insert some test data:

-- Insert 1 million rows
INSERT INTO Events (OccurredAt, Payload)
SELECT
    now() - (random() * interval '30 days'),
    jsonb_build_object(
        'type',
            CASE
                WHEN r < 0.80 THEN 'payment'
                WHEN r < 0.95 THEN 'login'
                ELSE 'signup'
            END,
        'status',
            CASE
                WHEN r < 0.80 THEN
                    CASE WHEN random() < 0.002 THEN 'failed' ELSE 'success' END
                ELSE
                    CASE WHEN random() < 0.01  THEN 'failed' ELSE 'success' END
            END,
        'user_id', (random() * 5000000)::int,
        'region',  CASE WHEN random() < 0.6 THEN 'eu' ELSE 'us' END,
        'provider', CASE WHEN random() < 0.5 THEN 'stripe' ELSE 'paypal' END,
        'tags',
            CASE
                WHEN random() < 0.20 THEN jsonb_build_array('retry','europe')
                WHEN random() < 0.40 THEN jsonb_build_array('stripe','europe')
                WHEN random() < 0.60 THEN jsonb_build_array('paypal','us')
                ELSE jsonb_build_array('mobile','web')
            END
    )
FROM (
    SELECT random() AS r
    FROM generate_series(1, 1000000)
) s;

-- Update Statistics
ANALYZE events;

And now we have the following query that we want to optimize:

-- Execution Time: around 60ms
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM Events
WHERE Payload @> '{"type":"payment","status":"failed"}'::JSONB;

Without any index in place, this query runs for about 60ms on my system.

Let’s create a GIN index:

-- Create a GIN index
CREATE INDEX idx_EventsPayload_GIN
ON Events
USING GIN(Payload);

-- Update Statistics
ANALYZE events;

That single statement indexes:

  • Every JSON key
  • Every JSON value
  • Every array element

This is something SQL Server JSON indexes – even in SQL Server 2025 – do not model the same way, because they remain path-centric rather than containment-centric.

When we now run the query again, the query planner uses the GIN index, and the query finishes in around 18ms.

But GIN indexes also have some negative side-effects:

  • Larger than B-Trees
  • Slow down Inserts and Updates
  • Do not support ordering

That last point is very critical – and leads directly to RUM.

What is RUM?

RUM is an extension, not a built-in index type. It extends GIN by storing:

  • Token positions
  • Ranking metadata
  • Optional ordering hints
CREATE EXTENSION rum;

CREATE INDEX idx_Documents_RUM
ON Documents
USING RUM (tsv rum_tsvector_ops);

Now PostgreSQL can:

  • Filter
  • Rank
  • and partially order results using the index
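As a sketch against the Documents table from the snippet above (its ID column is assumed here), a ranked search can then filter with @@ and order by RUM's distance operator <=>, which the index can serve directly:

```sql
-- Filter with @@, rank with ts_rank, order via the RUM distance operator
SELECT
  ID,
  ts_rank(tsv, plainto_tsquery('postgresql')) AS rank
FROM Documents
WHERE tsv @@ plainto_tsquery('postgresql')
ORDER BY tsv <=> plainto_tsquery('postgresql')
LIMIT 10;
```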

Summary

Yes, SQL Server 2025 significantly improves JSON support. But PostgreSQL’s GIN model was never about JSON alone – it is about indexing meaning, not just values. GIN exists because modern data is:

  • Multi-valued
  • Semi-structured
  • Queried semantically

RUM exists because once filtering is cheap, relevance matters. If you approach PostgreSQL with this mindset, GIN indexes stop being “weird” – and start feeling inevitable.

👉 If you want to learn more about SQL Server 2025 and PostgreSQL, I highly recommend having a look at my upcoming online trainings.

Thanks for your time,

-Klaus

]]>
https://www.sqlpassion.at/archive/2026/01/12/gin-indexes-in-postgresql/feed/ 0
Level Up Your Skills in 2026: Live Online Trainings for Database & Infrastructure Professionals https://www.sqlpassion.at/archive/2026/01/07/level-up-your-skills-in-2026-live-online-trainings-for-database-infrastructure-professionals/?utm_source=rss&utm_medium=rss&utm_campaign=level-up-your-skills-in-2026-live-online-trainings-for-database-infrastructure-professionals https://www.sqlpassion.at/archive/2026/01/07/level-up-your-skills-in-2026-live-online-trainings-for-database-infrastructure-professionals/#respond Wed, 07 Jan 2026 10:21:45 +0000 https://www.sqlpassion.at/?p=12187 Level Up Your Skills in 2026: Live Online Trainings for Database & Infrastructure Professionals + read more

]]>
The first half of 2026 is packed with live, instructor-led online trainings designed for professionals who want to go deeper – not just collect features, but truly understand how things work and how to use them effectively in production.

Whether you are a SQL Server professional, expanding into PostgreSQL, Linux, automation, or even systems programming with Rust, this training lineup is built to help you grow with confidence and clarity.

All sessions are live, hands-on, and focused on real-world scenarios – with plenty of room for questions, discussion, and practical insights you can apply immediately.

SQL Server Query Tuning Fundamentals

📅 February 9, 2026 · 1 Day (Live Online)

Execution plans are the key to SQL Server performance – but only if you truly understand them.

This training is all about building a systematic approach to query tuning. Instead of guessing or trial-and-error, you’ll learn how SQL Server actually accesses data, how the query optimizer thinks, and how to translate execution plans into concrete tuning actions.

You will learn:

  • Data access & index fundamentals
  • Execution plans made practical and readable
  • Core query operators explained clearly
  • Advanced performance pitfalls seen in real systems

👉 Learn more and register for SQL Server Query Tuning Fundamentals

Perfect if you know SQL Server – but want to become truly dangerous with performance tuning.

SQL Server 2025 Unleashed: AI, Performance & Beyond

📅 March 9, 2026 · 1 Day (Live Online)

SQL Server 2025 is more than just another version – it introduces meaningful advances in AI integration, performance, and high availability.

In this session, we focus on what actually matters in practice: which features are relevant, how they work internally, and when they make sense to use.

You will learn:

  • AI features & vector search use cases
  • Performance improvements and optimizer enhancements
  • Intelligent Query Processing updates
  • TempDB improvements
  • High availability & disaster recovery enhancements

👉 Learn more and register for SQL Server 2025 Unleashed: AI, Performance & Beyond

Ideal for DBAs and developers who want to stay ahead – without the marketing fluff.

PostgreSQL for the SQL Server Professional

📅 April 20 – 21, 2026 · 2 Days (Live Online)

PostgreSQL is not “SQL Server with different syntax” – and treating it that way leads to frustration and poor performance.

This training is specifically designed for experienced SQL Server professionals who want to understand PostgreSQL on its own terms – while leveraging their existing knowledge.

You will learn:

  • PostgreSQL architecture & internals
  • PostgreSQL vs. SQL Server (conceptual and practical differences)
  • Query plans and execution strategies
  • Data access methods & indexing
  • MVCC and transaction handling
  • Write-Ahead Logging (WAL)
  • High availability & disaster recovery concepts

👉 Learn more and register for PostgreSQL for the SQL Server Professional

If PostgreSQL is becoming part of your future, this training shortens the learning curve dramatically.

Linux Fundamentals for the Windows Professional

📅 May 18, 2026 · 1 Day (Live Online)

Moving from Windows to Linux doesn’t mean starting over – but it does require a new mental model.

This training focuses on how Linux really works in production environments, explained from a Windows professional’s perspective – clear, structured, and practical.

You will learn:

  • The Linux mindset and design philosophy
  • Filesystems, users, permissions, and security
  • Software installation and package management
  • CPU, memory, and disk I/O fundamentals
  • Performance analysis and troubleshooting
  • Proven best practices for Linux systems

👉 Learn more and register for Linux Fundamentals for the Windows Professional

A fast, confidence-building entry into Linux for real-world environments.

Database Infrastructure Automation with Terraform & Ansible

📅 June 22 – 23, 2026 · 2 Days (Live Online)

Manual setups don’t scale. Automation does.

This training shows you how to build reproducible, enterprise-grade database environments – from infrastructure provisioning to fully configured SQL Server and PostgreSQL systems.

You will learn:

  • Infrastructure as Code (IaC) principles
  • Terraform vs. Ansible – when to use which
  • VM provisioning (on-premises & cloud)
  • Automated OS installations
  • Ansible fundamentals
  • Automated Active Directory setups
  • Database installation & configuration workflows

👉 Learn more and register for Database Infrastructure Automation with Terraform & Ansible

Essential for DBAs and infrastructure engineers who want reliability, speed, and consistency.

Mastering Memory Safety & Performance with Rust

📅 June 25, 2026 · 1 Day (Live Online)

The Rust programming language offers something rare: low-level control with high-level safety.

In this one-day training, you’ll understand why Rust’s ownership model works, how it prevents entire classes of bugs at compile time, and how you can write fast, predictable, and safe code – without a garbage collector.

You will learn:

  • Memory safety without GC
  • Ownership, borrowing, and lifetimes – demystified
  • Zero-cost abstractions and performance predictability
  • Compile-time detection of data races
  • Rust’s powerful tooling ecosystem

👉 Learn more and register for Mastering Memory Safety & Performance with Rust

Perfect for engineers who care about correctness, performance, and long-term maintainability.

Ready to Invest in Your Skills?

These trainings are designed to help you think deeper, work smarter, and build systems you truly understand – from query execution and operating systems to automation and safe systems programming.

  • 📌 Seats are limited to keep sessions interactive.
  • 📌 Live delivery with real-world examples and open discussion.

👉 Explore the trainings and secure your spot for 2026 now.

Thanks for your time,

-Klaus

]]>
https://www.sqlpassion.at/archive/2026/01/07/level-up-your-skills-in-2026-live-online-trainings-for-database-infrastructure-professionals/feed/ 0