MySQL vs PostgreSQL Internals (Part 2) — MVCC (Multi-version Concurrency Control)
Zhao Song, 2026-03-10

In the previous post, I took a detailed look at how MySQL and PostgreSQL differ in their buffer pool design and implementation. In this post, I will continue with a detailed comparison of their MVCC implementations.

The Role of MVCC

MVCC (Multi-version Concurrency Control) is a common mechanism used in transactional databases to resolve read–write conflicts. The core idea is that when a transaction modifies data, it does not overwrite the original data directly. Instead, it preserves the previous version while creating a new version of the record. As a result, all historical versions of a record are retained in the database.

The key benefit is that read and write transactions on the same record no longer need to block each other. Even if a write transaction has modified a record but not yet committed, a read transaction can still directly read the version that is visible to it from the historical versions.

The simplified principle is illustrated below:

image-1

The record with primary key PK is modified three times. Each modification creates a new version, so all historical versions of the record are retained in the database.

So what is the benefit of retaining all historical versions? Consider the following example:

image-2

Three write transactions A, B, and C modify the same PK sequentially on the timeline, while three read transactions X, Y, and Z interleave with them and read the PK. Without MVCC, read transactions must block until write transactions commit and release locks. With MVCC, when read transaction X attempts to read the PK, the state of the PK is:

  1. Write transaction A inserted PK: ‘aaa’ and has committed
  2. Write transaction B modified it to PK: ‘bbb’ but has not yet committed

Under RC (Read Committed) and RR (Repeatable Read) isolation levels, PK: ‘bbb’ is not visible to transaction X. Because MVCC preserves the old version, transaction X can directly read the visible version PK: ‘aaa’ and ignore the currently running write transaction B. This greatly improves read–write concurrency.

As an essential capability of databases, MVCC is supported by both MySQL and PostgreSQL. Fundamentally, they both achieve the behavior described above, but their designs and implementations make different trade-offs.

In the following sections, I will compare their implementations in detail across three aspects:

  1. Organization of multiple versions
  2. Visibility checks for multiple versions
  3. Garbage collection of old versions

1. Organization of Multiple Versions

PostgreSQL

In PostgreSQL, a tuple and all its historical versions reside in the heap, as shown below:

image-3

The leaf nodes of the nbtree index store the PK fields of the tuple and point to the actual location of the tuple in the heap (TID). Following the Heap TID leads to the tuple, which contains the full data including PK fields and value fields.

Each index tuple points to the oldest version in its corresponding HOT chain. When the tuple is modified, the old version is not changed. Instead, a new version is created with the same PK fields but different value fields. The ctid field of the old tuple points to the location of the next version. The latest version’s ctid points to itself.

In short, every version of a PostgreSQL tuple is a complete tuple containing all fields. Starting from the index tuple, the version chain is linked from old to new through the ctid field.
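
The chain structure can be sketched with a toy model. This is a sketch only: field names mirror PostgreSQL's ctid concept, not its actual C structs, and real tuples live inside heap pages.

```python
class HeapTuple:
    """Toy model of one heap tuple version (illustrative, not PostgreSQL's struct)."""
    def __init__(self, tid, pk, value):
        self.tid = tid        # (page_no, offset): this version's location
        self.pk = pk
        self.value = value
        self.ctid = tid       # the latest version's ctid points to itself

def update(heap, old, new_tid, new_value):
    """Create a new full version; the old version's ctid now links to it."""
    new = HeapTuple(new_tid, old.pk, new_value)
    old.ctid = new_tid        # chain runs from old to new
    heap[new_tid] = new
    return new

def walk_chain(heap, head):
    """Follow ctid from oldest to newest, stopping when ctid points to itself."""
    versions, cur = [], head
    while True:
        versions.append(cur.value)
        if cur.ctid == cur.tid:
            return versions
        cur = heap[cur.ctid]

heap = {}
v1 = HeapTuple((1, 1), "pk", "aaa"); heap[v1.tid] = v1
v2 = update(heap, v1, (1, 2), "bbb")
v3 = update(heap, v2, (1, 3), "ccc")
print(walk_chain(heap, v1))   # ['aaa', 'bbb', 'ccc']
```

Note that every version here carries the full payload, which is exactly the first contrast with MySQL's undo-based scheme described below.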

It is important to note that the version chain may become ‘broken’, as shown below:

image-4

The PK fields remain unchanged, but the value fields are modified multiple times. Each modification generates a new full version stored in the same heap page as the old version, as shown for versions 1, 2, and 3. Because they reside in the same heap page, advancing through the chain via ctid is very cheap (no additional heap page lock is required). This is the type of chain that can be efficiently traversed for version lookups, known as a HOT chain in PostgreSQL.

A HOT chain requires two conditions:

  1. The new version can fit into the same heap page
  2. The modified columns do not include any indexed columns

If either condition is not met, the HOT chain breaks.
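
The two conditions can be expressed as a single predicate. This is a sketch of the rule only (PostgreSQL's actual decision is made inside heap_update() and involves more detail, such as fillfactor and toasting):

```python
def hot_update_possible(new_tuple_size, page_free_space, modified_cols, indexed_cols):
    """A HOT update is possible only if the new version fits in the same
    heap page and no indexed column is modified (sketch of the rule)."""
    fits_in_page = new_tuple_size <= page_free_space
    touches_index = bool(set(modified_cols) & set(indexed_cols))
    return fits_in_page and not touches_index

# Modifying only a non-indexed value column, with room on the page: HOT applies.
assert hot_update_possible(100, 500, {"value"}, {"pk"}) is True
# Modifying an indexed column breaks the HOT chain even if space is available.
assert hot_update_possible(100, 500, {"pk"}, {"pk"}) is False
# No room on the same page also breaks the chain.
assert hot_update_possible(600, 500, {"value"}, {"pk"}) is False
```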

As modifications continue, starting from version 4, the original heap page (heap page 1) can no longer accommodate the new tuple. The new tuple is therefore stored in another page (heap page 2). Although version 3’s ctid still points to version 4, the HOT chain effectively ends at version 3.

When traversing a HOT chain, the reader will not follow ctid across heap pages. Instead, it stops at the end of the chain. The reason is that both the latest and historical versions of tuples reside in heap pages. If the reader followed ctid from page 1 to page 2, it would hold a read lock on heap page 1 and attempt to acquire a read lock on heap page 2. Because both pages are heap pages and there is no defined lock ordering between them, another backend might hold the lock on page 2 and attempt to acquire the lock on page 1, leading to a deadlock.

At this point, the HOT chain is considered broken.

How does PostgreSQL transition from version 3 to version 4 then?

The implementation is to insert a new index tuple into the nbtree index with the same PK fields pointing to version 4, starting a new HOT chain. If a read operation traverses the version chain and finds versions 1, 2, and 3 all invisible, it stops the current HOT chain traversal and returns to the index layer. It then proceeds to the next index tuple (the one pointing to version 4).

This design leads to an interesting and somewhat counterintuitive behavior: multiple index tuples with identical PK fields may coexist in the nbtree index.

MySQL

In MySQL, data is stored directly in the clustered index (B+Tree) leaf nodes. This is an important difference from PostgreSQL: MySQL does not have a heap.

The second difference is that the record stored in the clustered index and its historical versions reside in different places. Historical versions are not stored in the clustered index. Instead, old values are stored in undo records in the undo space. When needed, historical versions are reconstructed by applying the undo records to the current record.

The third difference is that an undo record does not store a full copy of the record. It only stores the old values of the columns modified in the operation.

The fourth difference is the direction of the version chain. In MySQL, the clustered index always stores the latest version. Before a record is modified, the old values of the columns being changed (together with the PK fields) are copied to the undo space, and the record is updated in place.

As shown below:

image-5

The clustered index record contains two system fields: TRX_ID and ROLL_PTR.

  • TRX_ID records the transaction ID that last modified the record and is used for MVCC visibility checks.
  • ROLL_PTR links the version chain.

Similar to PostgreSQL’s ctid, ROLL_PTR links versions together, but the direction is opposite: ctid points from old to new, while ROLL_PTR points from new to old.

In the figure, the record was modified three times:

  1. Field 2 was modified
  2. Field 2 was modified again
  3. Field 3 was modified

Therefore, the clustered index record stores the latest version after the three modifications. Through ROLL_PTR, it points to the previous version stored in the undo space (the version before Field 3 was modified), and so on.
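
Reconstructing an old version can be sketched as repeatedly overlaying undo records (each storing only the old values of the modified columns) while walking ROLL_PTR from new to old. The dictionary shapes and field names below are illustrative, not InnoDB's actual record or undo formats:

```python
def reconstruct(record, undo_log, target_trx_id):
    """Walk ROLL_PTR from the latest record back in time, applying each
    undo record's old column values, until the requested version is reached."""
    rec = dict(record)                      # latest version in the clustered index
    while rec["TRX_ID"] != target_trx_id:
        undo = undo_log[rec["ROLL_PTR"]]    # undo record for the previous version
        rec.update(undo["old_values"])      # restore only the modified columns
        rec["TRX_ID"] = undo["TRX_ID"]
        rec["ROLL_PTR"] = undo["ROLL_PTR"]
    return rec

# Latest record after three modifications (field2 twice, then field3).
record = {"pk": 1, "field2": "y2", "field3": "z1", "TRX_ID": 30, "ROLL_PTR": 2}
undo_log = {
    2: {"old_values": {"field3": "z0"}, "TRX_ID": 20, "ROLL_PTR": 1},
    1: {"old_values": {"field2": "y1"}, "TRX_ID": 10, "ROLL_PTR": None},
}
old = reconstruct(record, undo_log, target_trx_id=10)
assert old["field2"] == "y1" and old["field3"] == "z0"
```

The key point the sketch captures: the clustered index always holds the newest version, and older versions exist only as deltas that must be applied in sequence.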

Summary

The differences in version organization between PostgreSQL and MySQL can be summarized in three contrasts:

  1. Versions mixed together vs latest version and historical versions stored in different spaces
  2. Old versions contain full tuples vs old versions mainly store the primary key and old values of modified columns
  3. Version chain ordered from old to new vs from new to old

2. Visibility Checks for Multiple Versions

Once multiple versions exist, the next question is: how does a read transaction determine which version it should see?

This is the core of MVCC: visibility checks.

To determine visibility, the database must establish an order among transactions. Taking the RR isolation level as an example, when a transaction begins, it must know which write transactions are currently active in the system. All modifications produced by those active transactions are invisible to the read transaction. Only modifications from transactions that were already committed at that moment are visible.

Therefore, if the database can define an order among write transactions, it becomes straightforward to perform visibility checks.

Most databases achieve this by using a globally increasing transaction ID. When a write transaction is created, it obtains the current maximum transaction ID plus one. This naturally orders write transactions.

Once transaction IDs exist, each data modification can be tagged with the transaction ID that produced it. A read transaction, when created, obtains the list of currently active write transactions. Later, when reading data, it simply compares the transaction ID recorded on the data with this list and applies the visibility rules to determine whether the data is visible.

Both PostgreSQL and MySQL follow this approach.

PostgreSQL

As mentioned earlier, transaction IDs are critical. In PostgreSQL, the globally increasing transaction ID is called nextXid.

Each write transaction obtains the latest value when it starts.

image-6

Transaction A is created first and obtains xid 7. It inserts PK: ‘aaa’. The tuple records this through the xmin field, which stores the inserter’s transaction ID (7).

After transaction A commits, transaction B is created and obtains xid 11. It updates the record to ‘bbb’. Following the multi-version rule, transaction B does not overwrite the tuple inserted by A. Instead, it creates a new version. The old tuple’s xmax is set to 11, indicating that transaction 11 has “deleted” this tuple version. The new tuple records xmin = 11. The old tuple’s ctid points to the new tuple.

Transaction C proceeds similarly.

image-7

Thus, each PostgreSQL tuple contains two fields recording the related transactions:

  • xmin : the inserter
  • xmax : the deleter

With the global transaction ID (nextXid) and the transaction tags (xmin, xmax) on each tuple, the next requirement is the snapshot used by read transactions for visibility checks.

image-8

At the top of the figure are the globally increasing transaction IDs and the currently active write transactions. The next ID to allocate is 16. Among all assigned IDs, transactions ≤7 have already committed. Between 8 and 15, some have committed, and the currently active write transactions are 8, 11, 12, and 14.

If a read transaction starts now, it obtains a snapshot:

  • xmin : the smallest active transaction ID (8)
  • xmax : the next transaction ID to allocate (16)
  • xids[] : the list of active transactions

With this snapshot, it can determine whether a tuple’s transaction tag is visible. For example:

  • If a tuple has xmin = 7, it is visible to the snapshot.
  • If a tuple has xmin = 14, it is not visible.

Now that we know how to determine the visibility of transaction tags, the final question is how to determine whether a tuple itself is visible, given that it has both xmin and xmax.

The core principle is:

A tuple is visible to a snapshot if its inserter (xmin) is visible and its deleter (xmax) is not visible.

image-9

The process is:

  1. Check xmin. If xmin is not visible, the tuple is invisible.
  2. If xmin is visible, check xmax.
  3. If xmax is also visible, the tuple has been deleted in the snapshot and is therefore invisible.
  4. If xmax is not visible, the tuple is visible.
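
Both checks can be sketched together. This is a simplification of PostgreSQL's XidInMVCCSnapshot() and HeapTupleSatisfiesMVCC(): hint bits and subtransactions are ignored, and any non-active xid below xmax is treated as committed (aborted transactions are not modeled):

```python
def xid_visible(xid, snapshot):
    """An xid is visible if it committed before the snapshot was taken."""
    if xid >= snapshot["xmax"]:        # started after the snapshot: invisible
        return False
    if xid < snapshot["xmin"]:         # older than every active txn: visible
        return True
    return xid not in snapshot["xids"] # in between: visible unless still active

def tuple_visible(tup, snapshot):
    """A tuple is visible if its inserter is visible and its deleter is not."""
    if not xid_visible(tup["xmin"], snapshot):
        return False
    if tup["xmax"] is not None and xid_visible(tup["xmax"], snapshot):
        return False                   # deleted in this snapshot's past
    return True

# Snapshot from the example: active = {8, 11, 12, 14}, next xid to assign = 16.
snap = {"xmin": 8, "xmax": 16, "xids": {8, 11, 12, 14}}
assert xid_visible(7, snap) is True       # committed before any active txn
assert xid_visible(14, snap) is False     # still active
assert tuple_visible({"xmin": 7, "xmax": 14}, snap) is True   # delete not yet visible
assert tuple_visible({"xmin": 7, "xmax": 9}, snap) is False   # deleter 9 committed
```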

Finally, the following figure shows the process of locating a tuple visible to a snapshot starting from the nbtree index:

image-10

MySQL

MySQL also has a globally increasing transaction ID called next_trx_id_or_no.

image-11

In the example, three transactions modify the same record three times. Both the clustered index record and the undo records contain a TRX_ID field. This field is the transaction tag used by MySQL. The TRX_ID records which transaction created that version of the record.

Unlike PostgreSQL, a record in MySQL has only one transaction tag, TRX_ID, rather than two. The reason will be explained later.

Next, consider the visibility check.

image-12

MySQL’s ReadView is extremely similar to PostgreSQL’s snapshot and serves the same purpose. The only difference is that MySQL’s ReadView contains an additional field: m_creator_trx_id.

This field is necessary because the transaction that creates the ReadView is itself included in m_ids[] (since it is an active transaction). Without m_creator_trx_id, the transaction would not be able to see its own modifications. It also handles cases where a read transaction is promoted to a write transaction.

Aside from this, the visibility rules are almost identical.

Given the visibility rules, determining whether a record is visible to a ReadView becomes straightforward:

image-13

The process is simple: check whether the record’s TRX_ID is visible to the ReadView. Unlike PostgreSQL, MySQL does not need two separate checks for xmin and xmax.
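
The ReadView check can be sketched as follows. Field names follow InnoDB's ReadView (m_up_limit_id is the smallest active trx id, m_low_limit_id the next id to assign), but the logic is a simplification and ignores aborted transactions:

```python
def changes_visible(trx_id, view):
    """Single-tag visibility check of a record's TRX_ID against a ReadView (sketch)."""
    if trx_id == view["m_creator_trx_id"]:
        return True                         # a transaction sees its own changes
    if trx_id < view["m_up_limit_id"]:
        return True                         # committed before the view was created
    if trx_id >= view["m_low_limit_id"]:
        return False                        # started after the view was created
    return trx_id not in view["m_ids"]      # in between: visible unless active

view = {"m_creator_trx_id": 15, "m_up_limit_id": 8,
        "m_low_limit_id": 16, "m_ids": {8, 11, 12, 14}}
assert changes_visible(15, view) is True    # own write transaction
assert changes_visible(7, view) is True     # committed before any active txn
assert changes_visible(11, view) is False   # active when the view was created
```

The first branch is exactly the role of m_creator_trx_id described above: without it, the creating transaction's own TRX_ID would fall into the active range and be judged invisible.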

Finally, the process of finding a version visible to a ReadView starting from the B+Tree is shown below:

image-14

Summary

PostgreSQL and MySQL use highly similar visibility mechanisms. The primary difference is that PostgreSQL stores two transaction tags (xmin and xmax) on each tuple, requiring two checks. MySQL stores only one (TRX_ID), requiring only one check.

Why is this the case? The fundamental reason is the direction of the version chain:

  1. PostgreSQL’s version chain goes from old to new. In theory, xmin alone would be sufficient because it records the inserter. However, when traversing the chain, the reader cannot stop immediately after finding a visible insert because the next version might also be visible. The reader must continue until it finds the first version whose insert is invisible. The previous version is then the visible version. This means at least one extra step is required. PostgreSQL therefore stores the insert transaction of the next version as the deleter (xmax) of the current version, avoiding that extra traversal. Additionally, xmax is required for DELETE operations where no next version exists.
  2. MySQL’s version chain goes from new to old. Once the latest version is found to be invisible, the reader simply moves to the previous version until it finds the first visible one. No additional step is required.
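
The consequence of chain direction can be sketched with two traversal loops over the same version history. Visibility is reduced here to membership in a set of committed xids, purely for illustration:

```python
def newest_visible_old_to_new(chain, visible):
    """PostgreSQL-style old->new traversal: cannot stop at the first visible
    version; must look ahead until the next version's inserter is invisible."""
    result = None
    for version in chain:                 # chain ordered oldest -> newest
        if version["xmin"] in visible:
            result = version              # keep going: a newer one may qualify
        else:
            break
    return result

def newest_visible_new_to_old(chain, visible):
    """MySQL-style new->old traversal: the first visible version is the answer."""
    for version in reversed(chain):       # walk newest -> oldest
        if version["xmin"] in visible:
            return version
    return None

chain = [{"xmin": 7, "v": "aaa"}, {"xmin": 11, "v": "bbb"}, {"xmin": 13, "v": "ccc"}]
visible = {7, 11}                         # txns 7 and 11 committed before the snapshot
assert newest_visible_old_to_new(chain, visible)["v"] == "bbb"
assert newest_visible_new_to_old(chain, visible)["v"] == "bbb"
```

Both loops find the same version, but the old-to-new loop needed the look-ahead step; storing the next version's inserter as xmax is how PostgreSQL folds that look-ahead into a single per-tuple check.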

3. Garbage Collection of Multiple Versions

The next core problem in MVCC is garbage collection of historical versions. Historical versions are not always needed.

Because global transaction IDs advance linearly, the snapshots (or ReadViews) of read transactions also move forward. A historical version can be safely purged when:

No active snapshot or ReadView in the system still needs that version (i.e., all snapshots can already see the newer version that replaced it).

This is where PostgreSQL and MySQL differ most significantly.


PostgreSQL

PostgreSQL reclaims historical versions through the Vacuum backend.

image-15

PostgreSQL uses GlobalVisState to track purge boundaries. It contains two variables:

maybe_needed

This is the minimum value among all backend transaction IDs and the xmin values of their snapshots. Backend transaction IDs must be considered because a backend may have started a write transaction and obtained an xid but not yet created a snapshot. That xid still forms a lower bound that cannot be crossed.

All tuple xmax values (deleters) are compared against maybe_needed. If xmax is smaller than maybe_needed, the deleter is visible to all backends and snapshots, meaning the tuple is globally deleted and can be safely purged.

definitely_needed

This is the xmin of the latest snapshot taken by the Vacuum backend. Any tuple whose xmax is greater than or equal to definitely_needed is invisible to the Vacuum backend and cannot be purged.

These two values define the continuous upper bound that can be purged and the lower bound that cannot. For tuples whose xmax falls between these bounds, Vacuum may need to refresh maybe_needed and re-evaluate, since the snapshot used by Vacuum might be outdated. Because refreshing is expensive, PostgreSQL optimizes this by checking whether RecentXmin has advanced. If it has not changed, refreshing is skipped.
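
The per-tuple decision can be sketched as a three-way classification (a simplification of PostgreSQL's GlobalVisTestIsRemovableXid(); the refresh of maybe_needed is omitted):

```python
def removable(xmax, maybe_needed, definitely_needed):
    """Classify a dead tuple's deleter xid against the two purge boundaries."""
    if xmax < maybe_needed:
        return "purge"        # deleter visible to every snapshot: safe to remove
    if xmax >= definitely_needed:
        return "keep"         # not even visible to Vacuum's own snapshot
    return "recheck"          # grey zone: refresh maybe_needed and re-evaluate

assert removable(5,  maybe_needed=8, definitely_needed=12) == "purge"
assert removable(13, maybe_needed=8, definitely_needed=12) == "keep"
assert removable(10, maybe_needed=8, definitely_needed=12) == "recheck"
```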

With these rules, the workflow of the Vacuum backend is:

image-16

  1. Scan all heap tuples and determine whether they can be purged using GlobalVisState. Collect purgable tuples into a set.
  2. Scan all index tuples and check whether they reference heap tuples in the purge set. If so, delete those index tuples.
  3. Rescan the pages containing the dead tuples collected in step 1 and reclaim them (setting their line pointers to LP_UNUSED).

This process involves extensive scanning of both heap and index structures, which can be expensive. PostgreSQL mitigates this cost with several optimizations:

  1. Visibility map – allows the first scan to skip pages where all tuples are visible.
  2. HOT pruning – during normal reads of heap pages, PostgreSQL opportunistically removes dead tuples through heap_page_prune(), reducing the workload of Vacuum.
  3. LP_REDIRECT – when intermediate versions in a HOT chain are removed, the head line pointer is redirected to the surviving tuple instead of being marked unused, so existing index tuples can still locate the correct tuple without index updates.

MySQL

MySQL takes a different approach.

All undo records (historical versions) are grouped by the transactions that produced them. These transactions are then organized according to their global commit order (forming a min-heap).

With this ordering, MySQL can quickly identify the undo records belonging to the earliest committed transaction, which are typically the closest candidates for purging.

The purge thread compares the transaction number (trx_no) of the earliest transaction in the history list with m_low_limit_no from the purge view.

  • If trx_no < m_low_limit_no, all active ReadViews can see this transaction’s commit, so its undo records are no longer needed and can be safely purged.
  • Otherwise, it cannot be purged. Since it is the earliest transaction, later ones cannot be purged either, so the purge process stops and waits.
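
The purge loop can be sketched with a min-heap keyed by trx_no. The data structures here are illustrative, not InnoDB's actual history list:

```python
import heapq

def purge(history, m_low_limit_no):
    """Pop committed transactions in commit order (ascending trx_no) and purge
    their undo records while trx_no < m_low_limit_no; stop at the first
    transaction that some active ReadView might still need."""
    purged = []
    while history and history[0][0] < m_low_limit_no:
        trx_no, undo_records = heapq.heappop(history)
        purged.extend(undo_records)
    return purged

history = []
heapq.heappush(history, (20, ["undo-B"]))   # Trx B committed first (trx_no 20)
heapq.heappush(history, (25, ["undo-A"]))   # Trx A committed later (trx_no 25)
assert purge(history, m_low_limit_no=25) == ["undo-B"]   # A's records must wait
```

Because the heap is ordered by commit number, the loop stops as soon as the front transaction is still needed, which is exactly the early-exit behavior described above.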

An important optimization is that transactions are ordered by commit order rather than creation order.

Sorting by creation order would be safe because the earliest transaction must be purged first. However, it has a drawback: if the earliest transaction does not commit for a long time, later transactions that have already committed cannot be purged even if they are no longer needed.

For example:

  1. Trx A is created and modifies record R1 from ‘111’ to ‘222’
  2. Trx B is created and modifies record R2 from ‘aaa’ to ‘bbb’
  3. Read-only Trx X starts. Since Trx B has not committed, X sees R2 as ‘aaa’
  4. Trx B commits
  5. Trx X commits

If transactions were ordered by creation time, Trx A would come before Trx B. Because Trx A has not committed, purge would be blocked and Trx B’s undo records could not be purged, even though no ReadView needs them anymore.

By ordering transactions by commit time instead, MySQL can purge Trx B’s undo records immediately after its commit.

This is an important optimization. Notably, trx_id and trx_no both come from the same global variable: next_trx_id_or_no.

The workflow is shown below:

image-17

The purge thread first clones the oldest active ReadView in the system. The m_low_limit_no in this ReadView represents the smallest trx_no that was still committing when the view was created. All transactions with smaller trx_no values have already committed.

In the undo space, committed transactions’ undo records are linked together in the history list in commit order (ascending trx_no). The purge thread simply compares m_low_limit_no with the smallest trx_no in the history list to determine whether purging is possible.

Summary

Garbage collection of historical versions is a major implementation difference between PostgreSQL and MySQL.

In fact, it reflects their different design philosophies. This difference was already visible in the previous post discussing buffer pools.

MySQL tends to favor precise control and ordered structures, such as the LRU list and flush list, which allow it to quickly identify the oldest pages that can be evicted or flushed. Similarly, undo purge maintains ordered historical versions so that the oldest purgeable undo records can be quickly located.

PostgreSQL, on the other hand, tends to rely more on global scanning mechanisms, both in shared buffers and in Vacuum. In the buffer pool case, the cost of global scanning is relatively low because it scans descriptor arrays in memory. However, Vacuum must scan heap and index disk pages (although visibility maps can skip many all-visible pages). For frequently updated tables, the amount of scanning can still be substantial.

MySQL vs PostgreSQL Internals (Part 1) – Buffer Pool
Zhao Song, 2026-02-16

The debate over “MySQL vs PostgreSQL, which one is better?” has been around for a long time. As two outstanding representatives of open-source OLTP databases, I personally don’t think one overwhelmingly dominates the other. Transactional database theory has been stable for decades; both systems are practical implementations built under the same theoretical framework.

The differences mainly come from different trade-offs made during engineering practice. I’ve always believed that database development is the art of trade-offs. So I’m planning a series that compares MySQL and PostgreSQL from the perspective of kernel design and implementation, focusing on the different trade-offs they make when pursuing similar goals.

As the first article in this series, I’ll start with the design and implementation differences of the Buffer Pool.

Comparison Dimensions

The Buffer Pool in MySQL and the corresponding module in PostgreSQL (commonly referred to as Shared Buffers) are critical subsystems. Their primary job is to cache on-disk data pages in memory to minimize disk I/O as much as possible, and they are therefore a major factor in relational database performance.

In essence, it is a huge hash table:

  • The key corresponds to a specific on-disk data page.
  • The value is a pointer (or index) to the in-memory representation of that page.

In the following sections, I compare MySQL and PostgreSQL buffer pool designs from these aspects:

  1. Hash table structure and implementation
  2. Eviction policy for old pages and its implementation
  3. Dirty page flushing strategy and its implementation

1. Hash Table

MySQL

image-1

MySQL’s buffer pool is not backed by a single hash table; it uses multiple hash tables. As illustrated conceptually:

  1. Multiple buf_pool_t instances shard one large buffer pool. Each buf_pool_t maintains its own hash table.

  2. The hash key is (space_id, page_no), identifying a specific page within a data file (tablespace). During lookup:

    • First, it computes a hash using (space_id, page_no >> 6) to locate the corresponding buf_pool_t instance.
    • Why shift page_no >> 6? Because MySQL tries to place 64 consecutive pages under the same space_id into the same buf_pool_t. This helps in two ways:
      • During reads, it enables read-ahead (prefetching contiguous pages).
      • During flushing, it increases the chance to flush contiguous dirty pages together, improving I/O utilization.
    • After locating the buf_pool_t, it computes a hash over the full key (space_id, page_no) to find the target cell in that instance’s hash table.
      • Pages with the same hash value are chained in that cell.
      • The lookup then traverses the chain and compares keys to find the target page.
  3. The hash table stores only pointers to the corresponding page objects (buf_page_t). The actual buf_block_t objects and page frames live in a large memory region.

    image-1

    • MySQL splits the page memory into multiple chunks (buf_chunk_t).
    • Each chunk is a contiguous block of memory.
    • The first part stores per-page metadata (buf_block_t) for the pages in that chunk.
    • The second part stores the actual 16KB page frames.
    • The mapping between buf_block_t and the actual page frame is done via the frame pointer in buf_block_t.
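
The two-level lookup can be sketched as follows. The hash function and shard count are illustrative; only the page_no >> 6 grouping mirrors the scheme described above:

```python
N_INSTANCES = 8
PAGES_PER_GROUP = 64   # 2**6, hence the page_no >> 6 shift

def buf_pool_index(space_id, page_no):
    """64 consecutive pages of one tablespace map to the same instance."""
    return hash((space_id, page_no >> 6)) % N_INSTANCES

def lookup(buf_pools, space_id, page_no):
    """Pick the instance, then probe its hash table with the full key
    (the per-cell chain walk is folded into the dict lookup here)."""
    instance = buf_pools[buf_pool_index(space_id, page_no)]
    return instance.get((space_id, page_no))

buf_pools = [dict() for _ in range(N_INSTANCES)]
# Pages 0..63 of tablespace 1 all land in the same instance,
# which is what enables read-ahead and contiguous flushing.
idx0 = buf_pool_index(1, 0)
assert all(buf_pool_index(1, p) == idx0 for p in range(64))
buf_pools[idx0][(1, 7)] = "frame-for-page-7"
assert lookup(buf_pools, 1, 7) == "frame-for-page-7"
```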

PostgreSQL

image-1

Conceptually (as illustrated):

  1. PostgreSQL also shards the shared buffer mapping, with a similar idea.

  2. It first hashes the key (tablespaceOid, dbOid, relNumber, forkNum, blockNum) to obtain a bucket number.

  3. Then it uses bucket_number >> 8 to locate the directory entry in the first-level mapping, i.e., the segment (dir).

  4. Each segment contains 256 buckets, so after finding the segment, it uses bucket_number % 256 to locate the bucket within the segment.

  5. It then traverses the bucket chain, comparing keys one by one to find the page.

  6. All page frames are stored in one contiguous memory region, as an array: BufferBlocks[].

    image-1

    • Each page is 8KB.
    • PostgreSQL does not split this region into chunks like MySQL does; all pages are stored together.
    • Metadata for pages is stored separately in another array: BufferDescriptors[].
    • Both arrays have the same number of elements, equal to the total number of buffers/pages.
    • The indices align one-to-one: it is straightforward to locate the actual page frame from the metadata by index.
    • The hash table stores buf_id, which is the index into both BufferDescriptors[] and BufferBlocks[].
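
The one-to-one alignment of the two arrays can be sketched as follows (field names and the buffer-tag tuple are illustrative):

```python
N_BUFFERS = 4
PAGE_SIZE = 8192

# Metadata and page frames live in two parallel arrays of equal length;
# a single buf_id indexes both.
BufferDescriptors = [{"tag": None, "refcount": 0, "usage_count": 0}
                     for _ in range(N_BUFFERS)]
BufferBlocks = [bytearray(PAGE_SIZE) for _ in range(N_BUFFERS)]

def frame_for(buf_id):
    """The hash table stores buf_id; the frame is found by plain indexing."""
    return BufferBlocks[buf_id]

# The shared buffer mapping resolves a page key to a buf_id.
buffer_mapping = {("tbsp", "db", "rel", 0, 42): 3}
buf_id = buffer_mapping[("tbsp", "db", "rel", 0, 42)]
BufferDescriptors[buf_id]["tag"] = ("tbsp", "db", "rel", 0, 42)
assert len(frame_for(buf_id)) == PAGE_SIZE
```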

Summary: Both MySQL and PostgreSQL implement fairly standard hash-table-based page lookup; there isn’t a fundamental difference there. The biggest difference is that MySQL splits pages into chunks, which makes it easier to dynamically resize the buffer pool by adding/removing chunks.

2. Eviction Policy for Old Pages (Aging) and Implementation

MySQL

image-1

MySQL maintains page aging information in a direct way: pages in the hash table are also linked into an LRU doubly-linked list. Each page’s buf_page_t::LRU is the list node that links the page into the LRU list.

  • The LRU head points to the most recently accessed page.
  • The LRU tail points to the least recently accessed page.

Each time a page is found via hash lookup, MySQL moves the page to the head of the LRU list via buf_page_t::LRU. Over time, pages that are not accessed drift toward the tail. When memory is insufficient and an old page must be evicted, the tail provides a fast candidate.

Of course, that is the conceptual LRU behavior. MySQL adds an important optimization, because the above design has a major problem: if requests perform table scans, a large number of pages enter the LRU and can overwrite/destroy the existing hot/cold information. To avoid scan workloads disrupting the LRU, MySQL splits the LRU list.

Roughly ~37.5% from the tail, it maintains a midpoint:

  • To the left is the young area: the true hot region.
  • To the right is the old area: a screening region for newly loaded pages.

All new pages loaded from disk are initially inserted at the midpoint, i.e., the head of the old list. Since it is close to the tail, such pages are more likely to be evicted quickly. If a page is accessed again before it is evicted, MySQL does not immediately promote it to the young region. Instead, it records the first access time, and the page’s position stays unchanged. Only when it is accessed again, and the elapsed time since the first access exceeds innodb_old_blocks_time (default 1 second), will it be promoted to the LRU head (young region). As a result, pages introduced by full table scans typically stay in the old area for less than 1 second and are evicted quickly, without polluting the hot working set in the young region.
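
The promotion rule can be sketched as follows. Time handling is simplified to millisecond integers; only the innodb_old_blocks_time default (1000 ms) comes from the description above:

```python
OLD_BLOCKS_TIME_MS = 1000   # innodb_old_blocks_time default (1 second)

def on_access(page, now_ms):
    """Decide what happens when a page is accessed. A page in the old area is
    promoted only on a second access at least OLD_BLOCKS_TIME_MS after the
    first; the first access merely records its timestamp."""
    if page["area"] == "young":
        return "move_to_lru_head"
    if page["first_access_ms"] is None:
        page["first_access_ms"] = now_ms       # first touch: stay in old area
        return "stay_in_old"
    if now_ms - page["first_access_ms"] >= OLD_BLOCKS_TIME_MS:
        page["area"] = "young"                 # survived the screening window
        return "move_to_lru_head"
    return "stay_in_old"

page = {"area": "old", "first_access_ms": None}
assert on_access(page, 0) == "stay_in_old"          # a table scan touches it once
assert on_access(page, 200) == "stay_in_old"        # re-touch within 1s: no promotion
assert on_access(page, 1500) == "move_to_lru_head"  # genuinely hot: promoted
```

This captures why scan-only pages never leave the old area: a scan touches each page in a burst and then never again, so the time threshold is never crossed.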

When a user thread needs to read a disk page but the buffer pool is full, it evicts an old page from the LRU tail and uses that frame to load the needed page. But eviction is not that trivial. Below is the concrete eviction procedure when a user thread needs a new page:

First attempt (n_iterations == 0)

  1. First, try the free list. If a free page is found, return it. Otherwise:
  2. If try_LRU_scan == true, it indicates a partial LRU scan is allowed. Scan from the tail forward, at most 100 pages.
    • If an evictable page is found, reset it and move it to the free list, then return to step 1 and retry.
    • If no evictable page is found, set try_LRU_scan = false to tell other user threads that partial LRU scanning is ineffective, so they should skip partial scans and go directly to the single-page flush path.
  3. Notify the page cleaner thread that free pages are insufficient and it should accelerate cleaning.
  4. Scan forward from the tail.
    • If a clean evictable page is found, evict it directly.
    • Otherwise, locate the first dirty page that can be flushed; perform a synchronous flush of that single page; then add it to the free list and proceed to the next attempt.

Second attempt (n_iterations == 1)

  1. Same as first attempt step 1.
  2. Perform a full LRU list scan starting from the tail, searching for an evictable page; if found, move it to the free list and return to step 1 to retry. If that fails:
  3. Same as first attempt step 3.
  4. Same as first attempt step 4.

Third and subsequent attempts (n_iterations > 1)

  1. Same as first attempt step 1.
  2. Same as second attempt step 2.
  3. Same as first attempt step 3.
  4. Sleep for 10ms.
  5. Same as first attempt step 4.

One more detail worth mentioning: the LRU scan does not always start from the tail for every thread. Each buf_pool_t maintains a global scan cursor lru_scan_itr (type LRUItr). After a thread finishes scanning, it leaves the cursor at its current position, and the next thread continues scanning from there, avoiding multiple threads repeatedly scanning the same region. Only when the cursor is empty/invalid, or still within the old region (meaning the previous scan did not progress far enough), will it be reset back to the tail. In addition, single-page flushing (step 4) uses another independent cursor single_scan_itr; these two cursors do not interfere with each other.

PostgreSQL

image-1

PostgreSQL does not maintain a global LRU list like MySQL does, but that doesn’t mean it does not perform LRU-style eviction. It simply takes another path.

All page metadata lives in the BufferDescriptors[] array. Each BufferDescriptor has two fields representing the current usage state of its corresponding page:

  • refcount: how many backends are currently using (pinning) the page
  • usage_count: the accumulated number of accesses to the page (capped at 5; when the page is accessed via a ring-buffer strategy, it is incremented only if it is currently 0, capping it at 1)

Whenever a backend accesses a page via the hash table, it increments both refcount and usage_count. When the backend is done with the page, it only decrements refcount. Therefore, usage_count serves as an approximate LRU weight (but not unbounded, it stops increasing once it reaches 5).

When a backend tries to load a page from disk but finds no free page, it starts a clock sweep: it traverses BufferDescriptors circularly. If a buffer is not currently used by any backend (refcount == 0), it decrements usage_count (cooling down the LRU weight) and continues sweeping. Eventually it finds a buffer where both refcount == 0 and usage_count == 0, and that buffer becomes the victim for eviction.
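
The clock sweep can be sketched as follows (a simplification of PostgreSQL's StrategyGetBuffer(); the real code bounds the sweep and handles the case where every buffer is pinned):

```python
def clock_sweep(descriptors, start=0):
    """Circularly scan buffer descriptors; cool down unpinned buffers by
    decrementing usage_count, and return the index of the first buffer with
    refcount == 0 and usage_count == 0 as the eviction victim."""
    n = len(descriptors)
    i = start
    while True:
        d = descriptors[i % n]
        if d["refcount"] == 0:
            if d["usage_count"] == 0:
                return i % n               # victim found
            d["usage_count"] -= 1          # cool down and keep sweeping
        i += 1

descs = [{"refcount": 1, "usage_count": 3},   # pinned: never a victim
         {"refcount": 0, "usage_count": 1},
         {"refcount": 0, "usage_count": 0}]
assert clock_sweep(descs) == 2
assert descs[1]["usage_count"] == 0           # buffer 1 was cooled on the way
```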

Of course, this alone is still insufficient to prevent LRU pollution from one-time full scans. PostgreSQL has its own optimization: introducing a local ring buffer.

Each backend has its own local ring buffer: essentially a fixed-length array of buffer IDs. A buffer ID points to a page slot in the global BufferDescriptors. The ring buffer limits how many global buffers the backend consumes at once, so eviction is more likely to happen within the ring buffer itself, reducing pollution of the global shared buffers.

More concretely, suppose a backend is performing a sequential scan and the upper layer marks the operation to use the ring buffer. When reading pages via the hash table:

  • If the backend’s local ring buffer is not full, it stores the buffer ID into the ring buffer.
  • As reading continues, the ring buffer becomes full.
  • After it is full, when it needs to read the next page:
    • It checks the page at the ring buffer’s current cursor position.
    • If that buffer is not used by other backends (refcount == 0 and usage_count <= 1), it reuses it directly: evict and load the next page into it.
    • If that buffer is currently used by other backends, it falls back to searching in BufferDescriptors for another available buffer to load the next page, and then replaces the current ring entry with the new buffer ID.
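The ring-buffer decision above can be sketched as follows. The names here are invented for illustration; the real logic is PostgreSQL's buffer access strategy code in freelist.c:

```c
#include <stdint.h>

#define RING_EMPTY (-1)

typedef struct {
    uint32_t refcount;
    uint32_t usage_count;
} BufState;

typedef struct {
    int *ring;    /* fixed-length array of buffer IDs */
    int  size;
    int  current; /* cursor into the ring */
} RingBuffer;

/* Returns a buffer ID to reuse, or RING_EMPTY when the caller must
 * take a buffer from the shared pool instead. */
int ring_get_buffer(RingBuffer *rb, BufState *bufs) {
    rb->current = (rb->current + 1) % rb->size;
    int buf_id = rb->ring[rb->current];
    if (buf_id == RING_EMPTY)
        return RING_EMPTY;   /* ring not full yet */
    BufState *buf = &bufs[buf_id];
    if (buf->refcount == 0 && buf->usage_count <= 1)
        return buf_id;       /* quiet buffer: evict and reuse in place */
    return RING_EMPTY;       /* in use elsewhere: fall back to shared pool */
}

/* After falling back, the backend overwrites the current ring slot
 * with the ID of the buffer it actually obtained. */
void ring_replace(RingBuffer *rb, int new_buf_id) {
    rb->ring[rb->current] = new_buf_id;
}
```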

Here you can see the different approaches MySQL and PostgreSQL take for the same scenario. MySQL introduces an “old/young” split in the global LRU list as a general strategy to prevent pollution. PostgreSQL’s ring buffer is essentially also an “old area”, but it relies on higher-level operation tagging: only scan-heavy operations such as VACUUM, sequential scan, bulk insert, etc., will use the ring buffer.

Below is the complete procedure PostgreSQL uses to find a free buffer when a backend needs one:

  1. Determine whether to use the ring buffer. If yes, inspect the buffer at the ring’s current cursor position:

    a. If the slot has not been used before, the ring is not yet full; go to step 2.

    b. Otherwise the ring is full. If the buffer is not used by any backend (refcount == 0 and usage_count <= 1), it can be reused immediately, return this buffer.

    c. If the buffer is used by other backends, fall back to step 2 to find a buffer from the global pool; after success, replace the current ring entry with the newly found buffer ID.

  2. Check the free list. If a buffer is available, return it.

  3. Start clock sweep: traverse from nextVictimBuffer (the current sweep cursor in BufferDescriptors):

    • If refcount != 0, skip.
    • Otherwise, if usage_count != 0, decrement it (cooling down) and continue.
    • Otherwise, the buffer is evictable. If it is not dirty, return it immediately. If it is dirty, flush it and then return it.
    • Advance nextVictimBuffer accordingly.

Summary: MySQL and PostgreSQL are similar in essence: both are LRU-like. MySQL chooses to implement an explicit LRU list for more precise eviction, at the cost of additional overhead to maintain the list. PostgreSQL uses reference counting plus usage_count as an approximate LRU, avoiding the locking overhead of maintaining a true LRU list but losing precision. This is the result of different trade-offs. Another notable difference: when a MySQL foreground thread tries to find a free page, it tends to prefer evicting old pages that are not dirty first; PostgreSQL’s sweep does not have an explicit priority between dirty and clean pages in the same sense.

3. Dirty Page Flushing Strategy and Implementation

Earlier we mentioned that MySQL user threads and PostgreSQL backends may flush a single dirty page when searching for a free page (single-page flush). However, such foreground single-page flushing is only an emergency measure when no free page is available.

For normal bulk flushing, both MySQL and PostgreSQL have dedicated background threads/processes. The goal is to flush dirty pages in advance and evict old pages so that foreground threads can quickly find free pages.

Background flushing has two goals:

  1. LRU flush: flush old pages in advance based on foreground free-page pressure, reducing foreground wait time for free pages.
  2. Checkpoint flush: flush dirty pages associated with the oldest WAL LSN to advance the checkpoint, purge old WAL, and reduce crash recovery time.

MySQL

image-1

image-1

In MySQL (InnoDB), background flushing is performed by page cleaner threads, consisting of one coordinator and N workers.

Coordinator

  1. Sleep for ~1 second, or be woken by a foreground thread.
  2. Check whether work is needed (sync flush / adaptive / idle). If yes:
  3. Dynamically calculate the number of dirty pages to flush in the next batch: n_pages.
  4. Pass n_pages to all workers and wake them up. Each worker is responsible for one buf_pool_t slot. The coordinator itself also works as worker 0.
  5. Wait for all workers to finish.

Worker

  1. Wait to be woken by the coordinator.
  2. Locate the assigned buf_pool_t slot.
  3. LRU flush: scan from the LRU tail forward, scanning at most srv_LRU_scan_depth pages.
    • If a page is clean and not being used, move it directly from the LRU list to the free list.
    • If a page can be flushed, initiate asynchronous I/O; after I/O completes, move it into the free list.
    • Stop early if the free list length reaches srv_LRU_scan_depth.
  4. Checkpoint flush: scan from the flush list tail forward and flush continuously until:
    • the number of flushed pages satisfies the quota assigned by the coordinator, or
    • the WAL LSN advances to the target LSN assigned by the coordinator.
  5. Finish and report to the coordinator.

Now, step 3 in the coordinator is adaptive: it calculates the flush workload and the target LSN advancement. The logic is as follows:

a. Based on dirty page percentage (get_pct_for_dirty())

Compute dirty_pct, the percentage of dirty pages in the buffer pool:

  • If innodb_max_dirty_pages_pct_lwm (low watermark) is set and dirty_pct >= lwm, start progressive flushing and return the percentage of io_capacity as: dirty_pct * 100 / (max_dirty_pages_pct + 1)
  • If no low watermark is set, but dirty_pct >= innodb_max_dirty_pages_pct (high watermark), flush at 100% io_capacity.
  • Otherwise, do not flush based on dirty ratio (return 0).
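This branching can be sketched as a small function. It is a simplified model of get_pct_for_dirty(); the parameter names mirror the settings described above, not the exact InnoDB source:

```c
/* Returns the percentage of io_capacity to flush at, based only on
 * the dirty-page ratio. A lwm of 0 means no low watermark is set. */
double get_pct_for_dirty(double dirty_pct, double lwm, double max_dirty_pct) {
    if (lwm > 0) {                       /* low watermark configured */
        if (dirty_pct >= lwm)            /* progressive flushing */
            return dirty_pct * 100.0 / (max_dirty_pct + 1);
        return 0.0;
    }
    if (dirty_pct >= max_dirty_pct)
        return 100.0;                    /* flush at full io_capacity */
    return 0.0;
}
```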

b. Based on redo log age (get_pct_for_lsn(age))

Compute checkpoint age: age = current_lsn - oldest_lsn

  • If age < innodb_adaptive_flushing_lwm (default 10% of redo log capacity), no adaptive flushing needed (return 0).
  • If age exceeds the low watermark: age_factor = age * 100 / limit_for_dirty_page_age, and the percentage of io_capacity returned is: (max_io_capacity / io_capacity) * age_factor * sqrt(age_factor) / 7.5

This is a super-linear growth curve: as redo space approaches exhaustion, flushing ramps up aggressively.

Combined calculation (set_flush_target_by_lsn())

Take: pct_total = max(pct_for_dirty, pct_for_lsn)

Then compute the target LSN: target_lsn = oldest_lsn + lsn_avg_rate * 3 (i.e., advance by 3× the recent average redo generation rate; buf_flush_lsn_scan_factor = 3)

Then traverse each buffer pool instance’s flush list and count the number of pages whose oldest_modification <= target_lsn. Call this number pages_for_lsn (pages that must be flushed to advance checkpoint to target_lsn).

Finally, take the average of three estimates:

n_pages = (PCT_IO(pct_total) + page_avg_rate + pages_for_lsn) / 3

Where:

  • PCT_IO(pct_total) is the I/O demand estimated from dirty ratio / redo age.
  • page_avg_rate is the recent actual average flushing rate (moving average across multiple iterations).
  • pages_for_lsn is the precise demand obtained from scanning the flush list.

Averaging these three makes the flushing rate smoother and avoids abrupt oscillation. n_pages is capped by srv_max_io_capacity.
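The averaging and capping step can be sketched as follows (names mirror the text; a simplified model, not the set_flush_target_by_lsn() source):

```c
/* Average the three demand estimates and cap by srv_max_io_capacity. */
unsigned long compute_n_pages(unsigned long pct_io,        /* PCT_IO(pct_total) */
                              unsigned long page_avg_rate, /* recent flush rate */
                              unsigned long pages_for_lsn, /* flush list scan */
                              unsigned long max_io_capacity) {
    unsigned long n = (pct_io + page_avg_rate + pages_for_lsn) / 3;
    if (n > max_io_capacity)
        n = max_io_capacity;
    return n;
}
```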

If redo pressure is high (pct_for_lsn > 30), the per-instance flush quota is weighted by how many pages in each instance’s flush list need flushing; otherwise, it is evenly distributed across instances.

Sync Flush mode

When redo log space is extremely tight (checkpoint cannot keep up with redo generation), log_sync_flush_lsn() returns non-zero and the coordinator enters sync flush mode:

  • It no longer sleeps for 1 second; it starts the next iteration immediately.
  • n_pages is set directly to pages_for_lsn (no averaging), with a lower bound of srv_io_capacity.
  • It loops until redo pressure is relieved.

Idle flushing

When the server is idle (no user activity) and the 1-second sleep times out, the coordinator does not run the adaptive algorithm. Instead, it flushes in the background using innodb_idle_flush_pct percent of innodb_io_capacity (default 100%), keeping the buffer pool clean.

PostgreSQL

PostgreSQL also has both LRU flush and checkpoint flush, but unlike MySQL’s unified page cleaner, PostgreSQL separates responsibilities:

  • bgwriter handles LRU flush
  • checkpointer handles checkpoint flush

1. bgwriter

image-1

The goal of bgwriter is to predict the upcoming demand for free buffers based on historical and current pressure, and try to free enough buffers before backends are forced into heavy clock sweep work (i.e., flush dirty pages that are otherwise reusable victims).

The overall flow:

  1. Collect historical info from clock sweep, including:

    • strategy_buf_id: the current backend clock sweep position
    • strategy_passes: how many full sweeps have been completed
    • recent_alloc: how many buffers have been allocated by backends since the last bgwriter recycle
  2. Compare bgwriter’s current position next_to_clean with clock sweep’s strategy_buf_id, and determine how far ahead it is:

    • bufs_to_lap: number of buffers bgwriter must scan for next_to_clean to “lap” (catch up to) strategy_buf_id.
      • Case 1: same pass, bgwriter ahead → bufs_to_lap is the remaining distance to lap.
      • Case 2: same pass, bgwriter behind → set next_to_clean to strategy_buf_id, set bufs_to_lap = NBuffers, effectively reset bgwriter.
      • Case 3: bgwriter already one full pass ahead → bufs_to_lap may be negative, meaning bgwriter has scanned everything it can scan; no need to scan in this round.
    • bufs_ahead = NBuffers - bufs_to_lap (how many buffers bgwriter is ahead of sweep)
  3. Based on the history above, compute how many buffers clock sweep needs to scan to find one free buffer, i.e. scans_per_alloc. Maintain an exponential moving average: smoothed_density += (scans_per_alloc - smoothed_density) / 16;

  4. Maintain smoothed_alloc similarly:

    • If smoothed_alloc < recent_alloc, set smoothed_alloc = recent_alloc (fast attack).
    • Otherwise decay slowly using EMA: smoothed_alloc += (recent_alloc - smoothed_alloc) / 16; (slow decay)
  5. Compute the prediction for the next round:

    • upcoming_alloc_est = smoothed_alloc * bgwriter_lru_multiplier (predict upcoming allocations)
    • Estimate how many reusable buffers exist in the region bgwriter is ahead: reusable_buffers_est = bufs_ahead / smoothed_density
    • Ensure minimum progress: min_scan_buffers = NBuffers / (120s / 200ms); then: upcoming_alloc_est = max(upcoming_alloc_est, min_scan_buffers + reusable_buffers_est)

    This “minimum progress” ensures that even if the system is idle, bgwriter will scan the entire buffer pool in about 120 seconds, continuously cleaning dirty pages.

  6. Scan from next_to_clean. For each buffer, bgwriter only considers buffers with refcount == 0 and usage_count == 0 (truly reusable candidates). It skips buffers in use or recently used. If a candidate is dirty, it flushes it synchronously. Stop scanning when any of these is met:

    • bufs_to_lap reaches 0 (caught up to clock sweep)
    • reusable_buffers reaches upcoming_alloc_est (freed enough reusable buffers)
    • num_written reaches bgwriter_lru_maxpages (default 100) to avoid excessive I/O in one round

After one scan round, bgwriter sleeps for bgwriter_delay (default 200ms) before the next iteration. If bufs_to_lap == 0 and recent_alloc == 0 (no allocation activity), bgwriter enters hibernation and sleeps longer, until a backend needing buffers wakes it via its latch.
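Steps 3–5 above can be condensed into a small sketch. This is a simplified model of the smoothing in BgBufferSync(); names mirror the text, and the state is reduced to the two moving averages:

```c
typedef struct {
    double smoothed_density; /* buffers the sweep scans per allocation */
    double smoothed_alloc;   /* smoothed allocations per round */
} BgwState;

/* Predicts how many reusable buffers bgwriter should try to have
 * ready for the next round. */
double predict_upcoming_alloc(BgwState *st, double recent_alloc,
                              double scans_per_alloc, double bufs_ahead,
                              double lru_multiplier,
                              double min_scan_buffers) {
    /* exponential moving average over scan density */
    st->smoothed_density += (scans_per_alloc - st->smoothed_density) / 16.0;
    /* fast attack, slow decay over the allocation rate */
    if (st->smoothed_alloc < recent_alloc)
        st->smoothed_alloc = recent_alloc;
    else
        st->smoothed_alloc += (recent_alloc - st->smoothed_alloc) / 16.0;

    double upcoming = st->smoothed_alloc * lru_multiplier;
    double reusable_est = bufs_ahead / st->smoothed_density;
    double floor_est = min_scan_buffers + reusable_est;
    return upcoming > floor_est ? upcoming : floor_est;
}
```

The "fast attack, slow decay" asymmetry means a sudden allocation burst immediately raises the prediction, while a quiet period only lowers it gradually.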

2. checkpointer

image-1

The goal of checkpointer is to flush all dirty pages up to a consistency point, forming a checkpoint. This advances WAL recycling and reduces how much WAL must be replayed during crash recovery. Unlike bgwriter, checkpointer does not care whether a page was recently used; it must flush all pages that were dirty at checkpoint start.

Trigger conditions: in the main loop, checkpointer triggers a checkpoint when any of the following occurs:

  • Time since last checkpoint exceeds checkpoint_timeout (default 5 minutes)
  • WAL volume exceeds max_wal_size and backends notify checkpointer
  • User manually runs CHECKPOINT
  • Shutdown checkpoint during server shutdown

Detailed procedure:

  1. Scan and collect dirty buffers: traverse all NBuffers BufferDescriptors. For each dirty page, set the BM_CHECKPOINT_NEEDED flag, and collect its identity info into CkptBufferIds[] (tablespace OID, relation number, fork number, block number, etc.). Note: only pages that are already dirty at checkpoint start are included. Pages that become dirty during the checkpoint are not included and will be handled in the next checkpoint.

  2. Sort: sort CkptBufferIds[] by (tablespace, relation, fork, block). This clusters pages from the same file and orders them by increasing block number, converting random I/O into more sequential patterns as much as possible.

  3. Build tablespace-level progress tracking: traverse the sorted array and group by tablespace. For each tablespace, build a CkptTsStatus structure tracking total pages to flush and current progress. Put all tablespaces into a binary heap (min-heap), ordered by flush progress.

  4. Balanced flushing across tablespaces: repeatedly pop the tablespace with the lowest progress from the heap, flush its next dirty page (via SyncOneBuffer), update its progress, then re-heapify. The purpose is to spread writes evenly across tablespaces (possibly on different disks), instead of flushing one tablespace completely before another. Unlike bgwriter, checkpointer calls SyncOneBuffer with skip_recently_used = false, meaning it will flush buffers with BM_CHECKPOINT_NEEDED regardless of recent usage.

  5. Write throttling: after flushing each page, call CheckpointWriteDelay() to throttle. The goal is to finish flushing within: checkpoint_completion_target (default 0.9) × checkpoint_timeout. The logic compares:

    • flush progress (flushed pages / total),
    • elapsed time progress,
    • WAL progress.
    • If flush progress is ahead of both time progress and WAL progress (IsCheckpointOnSchedule == true), sleep 100ms.
    • If lagging behind, do not sleep and flush at full speed.
    • In IMMEDIATE mode (e.g., shutdown checkpoint) or under urgent checkpoint requests, do not throttle.

    This spreads checkpoint I/O across the entire checkpoint window and avoids I/O spikes.

  6. Writeback coalescing: if not using O_DIRECT, similar to bgwriter, use WritebackContext to collect tags for flushed pages. After accumulating enough, batch-call IssuePendingWritebacks(), sort and coalesce adjacent blocks, and use posix_fadvise to hint the kernel to write back OS cache pages to disk. After checkpoint completion, force one more IssuePendingWritebacks() to ensure all pending writebacks are issued.
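The on-schedule test in step 5 can be sketched as follows. This is an illustrative model, not the server's actual IsCheckpointOnSchedule(); target_duration stands in for checkpoint_completion_target × checkpoint_timeout:

```c
/* Returns nonzero when flush progress leads both elapsed-time progress
 * and WAL progress, i.e. the checkpoint is ahead of schedule and the
 * caller may sleep (~100ms) to spread out the I/O. */
int checkpoint_on_schedule(double flushed, double total_to_flush,
                           double elapsed, double target_duration,
                           double wal_written, double wal_target) {
    double flush_progress = flushed / total_to_flush;
    double time_progress  = elapsed / target_duration;
    double wal_progress   = wal_written / wal_target;
    return flush_progress >= time_progress &&
           flush_progress >= wal_progress;
}
```

When this returns false the checkpoint is lagging, and flushing proceeds at full speed until progress catches up again.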

Summary: Although the implementations differ significantly, both MySQL and PostgreSQL aim to pre-clean pages in the background so that foreground threads can quickly find free pages. PostgreSQL’s bgwriter predicts upcoming buffer allocation demand from foreground activity; MySQL’s page cleaner reacts to dirty page pressure and redo log age.

From an engineering perspective, their differences largely come down to the trade-off between linked lists and arrays:

  • With linked lists, MySQL can precisely obtain LRU ordering and dirty-page ordering from old to new. This greatly improves precision in eviction and flushing decisions. In particular, for checkpoint flushing, it can directly take the oldest dirty pages from the flush list tail to advance checkpoint quickly. The trade-off is the cost of maintaining those lists.
  • PostgreSQL sacrifices some precision and scans arrays instead, avoiding the additional overhead of maintaining linked lists. It is also worth noting that PostgreSQL’s checkpoint flushing emphasizes balanced progress across tablespaces rather than globally prioritizing the oldest dirty pages to advance checkpoint in small steps.
]]>
Zhao Song
Visualizing MySQL BLOB Internals Directly from MySQL Data Files (.ibd)2026-02-08T00:00:00+00:002026-02-08T00:00:00+00:00https://kernelmaker.github.io/blob-ibdninjaIn a previous post, I explored how MySQL implements partial updates and multi-versioning for BLOB columns internally.

To better see what actually happens inside the data files, I’ve added a new feature to ibdNinja, an interactive BLOB inspection mode:

--inspect-blob

This feature is designed as an extension of ibdNinja’s existing inspection workflow, allowing you to drill down from high-level structures to the actual BLOB data stored on disk.

How it works:

Step 1

Use ibdNinja’s existing features to parse, extract, and print information from a MySQL .ibd file at the table, index, page, and record levels. Once you’ve located a record you want to dive deeper into, note its page number and record number.

Step 2

Pass those identifiers to --inspect-blob:

ibdNinja -f <table.ibd> --inspect-blob <page_no>,<record_no>

to start an interactive inspection of the BLOB field in that record.

image-1

As shown above, ibdNinja will:

  1. Traverse the external BLOB page chain
  2. Reconstruct the version chain introduced by partial updates
  3. Visualize the complete on-disk layout of the BLOB across all versions

From there, you can choose any version and:

  1. Hex-print or dump the full value for binary BLOBs (images, raw binary data, etc.)
  2. Decode JSON BLOBs (MySQL JSON is still a BLOB internally) into readable text, or inspect the raw MySQL-encoded JSON in hex

If some historical versions have already been purged, ibdNinja will detect that and clearly report it.

If you’re into MySQL data file internals, or knee-deep in development, debugging, or production issues, give ibdNinja a try, dig under the hood — and consider bug reports part of the feature set.

]]>
Zhao Song
A POC on optimizing MySQL’s unique index insertion path2026-01-25T00:00:00+00:002026-01-25T00:00:00+00:00https://kernelmaker.github.io/unique-index-pocA few months ago, I wrote a post about a possible optimization in MySQL’s unique index insertion path. As illustrated there, the idea is to reduce the current 3 B+Tree searches into 1 B+Tree search plus a scan on the leaf page (or leaf level), in order to avoid the overhead of repeatedly traversing the tree. This weekend, I implemented a quick proof-of-concept on MySQL 8.0.45 and measure the effect.

image-1

1. Setup:

Table with 200K rows, a VARCHAR(700) unique key (latin1), creating a tall B-tree:

CREATE TABLE t1 (
 id INT PRIMARY KEY AUTO_INCREMENT,
 uk_col VARCHAR(700) NOT NULL,
 UNIQUE KEY uk_idx (uk_col)
) ENGINE=InnoDB CHARACTER SET latin1;

2. Test procedure:

  • Insert 100 TARGET rows with prefix “TARGET_ROW_”
  • Start a blocker transaction (START TRANSACTION WITH CONSISTENT SNAPSHOT) to prevent purge
  • Delete the 100 TARGET rows (creates delete-marked records)
  • Re-insert the same 100 TARGET rows, this triggers the duplicate-check path, since delete-marked records with the same unique key exist
  • Instrument row_ins_sec_index_entry_low() with timing around each B-tree search.
  • Run the benchmark twice: once with the original path, reset metrics, then with the optimized path

3. Results:

Original path (3 B-tree searches):

  • Search1: ~6,508 ns
  • Search2: ~5,649 ns
  • Search3: ~2,498 ns
  • Total: ~14,656 ns

Optimized path (1 B-tree search + inline scan):

  • Search1: ~7,272 ns
  • Inline: ~3,118 ns
  • Total: ~10,390 ns

Improvement: ~29.1% reduction in search-path time
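As a quick arithmetic check, the ~29.1% figure follows directly from the two totals:

```c
/* Percentage reduction from before -> after. */
double reduction_pct(double before_ns, double after_ns) {
    return (before_ns - after_ns) * 100.0 / before_ns;
}
/* reduction_pct(14656.0, 10390.0) is ~29.1 */
```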

This test focuses specifically on the unique index insertion path (row_ins_sec_index_entry_low()), comparing the cost of the original three searches with the optimized “one search + inline scan” approach. In this local scope, the saving is close to 30%, which matches the intuition of collapsing three tree traversals into one.

4. However, when evaluating the overall benefit, there are a few important considerations:

  • In a single-row insert, how large is this part relative to the whole insert path? If its share is small, the end-to-end gain will be diluted. In my tests, when measuring the full insert path, the improvement drops to single-digit percentages.

  • Under concurrent workloads, each of the three B-tree searches holds page latches. This is one of the key factors affecting scalability. Reducing this section by ~30% also shortens latch holding time, so the benefit may be more visible in parallel scenarios.

  • While implementing the POC, I also realized that this optimization is not a silver bullet. There are cases that still need to fall back to the original path, although there are ways to minimize how often that happens.

These are just the numbers from a quick POC. If this direction turns out to be meaningful, it would still require much more careful design, implementation, and testing.

Bug #118363

]]>
Zhao Song
MySQL BLOB Internals - Partial Update Implementation and Multi-Versioning2025-12-01T00:00:00+00:002025-12-01T00:00:00+00:00https://kernelmaker.github.io/mysql-blobIn this blog, I would like to introduce the implementation of BLOB and BLOB partial update in MySQL, and explain how the current design works together with the MVCC module to support multi-version control for BLOB columns.

1. Background

Before going into the details, I would like to briefly introduce two important concepts that are closely related to this topic.

1. Basic Principles of MySQL MVCC (Multi-Version Concurrency Control)

MySQL supports snapshot reads. Each read transaction reads data based on a certain snapshot, so even if other write transactions modify the data during the execution of a read transaction, the read transaction will always see the version it is supposed to see.

The underlying mechanism is that a write transaction directly updates the data in place on the primary key record. However, before the update happens, the old value of the field to be modified is copied into the undo space. At the same time, there is a ROLL_PTR field in the row that points to the exact location in the undo space where the old value (the undo log record) is stored.

image-1

As shown in the figure above, there is a row in the primary key index that contains three fields. Suppose a write transaction is modifying Field 2. It will first copy the original value of Field 2 into the undo space, and then overwrite Field 2 directly in the row. After that, two important system fields of the row are updated:

  • TRX_ID is set to the ID of the current write transaction and is used later by read transactions to determine visibility.
  • ROLL_PTR points to the exact location in the undo space where the old value of the modified field is stored, and is used to reconstruct the previous version of the row when needed.

After the update is finished, if a previously existing read transaction reads this row again, it will find, based on the TRX_ID, that the row has been modified by a later write transaction. Therefore, the current version of the row is not visible to this read transaction. It must roll back to the previous version. At this point, it uses the ROLL_PTR to locate the old value in the undo space, applies it to the current row, and thus reconstructs the version that it is supposed to see.

2. Basic Implementation of MySQL BLOB

The primary key record in MySQL contains the values of all fields and is stored in the clustered index. However, BLOB columns are an exception. Since they are usually very large, MySQL stores their data in separate data pages called external pages.

A BLOB value is split into multiple parts and stored sequentially across multiple external pages. These pages are linked together in order, like a linked list. So how does the primary key record locate the corresponding BLOB data stored in those external pages? For each BLOB column, the clustered record stores a reference (lob::ref_t). This ref_t contains some metadata about the column and a pointer to the first external page where the BLOB data starts.
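Conceptually, the reference can be pictured as a small fixed-size structure. The struct below is illustrative only; on disk InnoDB stores lob::ref_t as a compact big-endian byte range inside the record, not as a C struct, and the exact field layout belongs to the InnoDB source:

```c
#include <stdint.h>

/* Conceptual model of the external reference stored in the clustered
 * record for a BLOB column (illustrative field names). */
typedef struct {
    uint32_t space_id; /* tablespace holding the external pages */
    uint32_t page_no;  /* first external page of the BLOB */
    uint32_t offset;   /* offset within that page */
    uint64_t length;   /* total stored BLOB length */
} lob_ref_t;
```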

image-2

When reading the row, MySQL first locates the row via the primary key index, then follows this reference to find the external pages and reconstructs the full BLOB value by copying the data from those pages.

This is a very straightforward and intuitive design, simple and sufficient. It is also exactly how BLOB was implemented in older versions of MySQL.

3. A “Thought Exercise”

Based on the two points above, here is a question:

How is MVCC implemented for BLOB in MySQL?

The intuitive answer is as follows: the lob::ref_t stored in the primary key record follows the same MVCC rules. Every time a BLOB column is updated, the old BLOB value is read out, modified, and then the entire modified BLOB is written into newly allocated external pages. The corresponding lob::ref_t in the primary key record is overwritten with the new reference. At the same time, following the MVCC mechanism, the old lob::ref_t is copied into the undo space.

image-3

After the modification, the situation looks like this (as shown in the figure): the undo space stores the lob::ref_t that points to the old BLOB value, while the lob::ref_t in the primary key record points to the new value.

This is exactly how older versions of MySQL worked. The next question is:

What are the pros and cons of this design?

The advantage is that the undo log only needs to record the lob::ref_t, and it does not need to store the entire old BLOB value.

The disadvantage is that no matter how small the change to the BLOB is, even if only a single byte is modified, the entire modified BLOB still has to be written into newly allocated external pages. BLOB columns are usually very large, so if each update only changes a very small portion, this design introduces a lot of extra I/O and space overhead.

A typical example is JSON. Internally, MySQL stores JSON as BLOB. Usually, updates to JSON are local and small. However, with the old design, each small partial update still requires reading the entire JSON, modifying a part of it, and then inserting the whole value back again. This is obviously very heavy.

So how to solve this problem? MySQL introduced BLOB partial update to address it.

2. Implementation of BLOB Partial Update

MySQL optimized the format of the external pages used to store BLOB data and redesigned the original simple linked-list structure:

image-4

  1. Each external page now has a corresponding index entry.
  2. These index entries are organized as a linked list and stored in the BLOB first page. (If there are too many index entries to fit, they are stored in separate BLOB index pages.)
  3. Under normal circumstances, these index entries are linked together in order, just like the external pages in the old implementation.
  4. To support partial updates, MySQL changes the granularity of BLOB updates from the whole BLOB to individual external pages. Only the external pages involved in the current modification are updated. The modified external page is copied into a new page and updated there, while the other external pages remain unchanged.

Then the question becomes: how can MySQL make sure that it can read the correct new and old BLOB values? The answer is that the new external page and the old external page share the same logical position in the index entry list. In other words, at this specific position in the list, there are now two versions, version 1 and version 2. Which one is used is determined by the version number recorded in the current lob::ref_t. The idea is illustrated in the figure below.

image-5

In summary, MySQL transforms the original external-page linked list into a linked list of index entries. For each index entry in this list, if the corresponding external page is modified, a new version of the index entry is created at the same horizontal position to point to the new version of that external page. Essentially, this introduces multi-versioning for external pages.
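The version selection just described can be sketched as a walk down one vertical chain of index entries: pick the newest entry whose version is not newer than the version recorded in the reader's lob::ref_t. Names here are illustrative:

```c
#include <stddef.h>

/* One node in the vertical (version) dimension of the index entry list. */
typedef struct IndexEntry {
    unsigned version;          /* version that created this entry */
    unsigned data_page_no;     /* external page this entry points to */
    struct IndexEntry *older;  /* next entry down the version chain */
} IndexEntry;

/* Walk down the chain until an entry visible to ref_version is found;
 * returns NULL if every version at this position is newer. */
const IndexEntry *resolve_version(const IndexEntry *newest,
                                  unsigned ref_version) {
    const IndexEntry *e = newest;
    while (e != NULL && e->version > ref_version)
        e = e->older;
    return e;
}
```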

Special Case: BLOB Small Changes

The implementation described above is not the whole story. MySQL makes a practical trade-off between creating a new index entry (which requires copying the entire external page) and copying only the modified portion into the undo space.

For BLOB small-change scenarios, when the modification to a blob is smaller than 100 bytes, MySQL does not create a new index entry and link it into the version chain for that page. Instead, it modifies the page in place. Following MVCC principles, the portion to be modified is first written into the undo space before the in-place update happens.

image-6

It is worth noting that in this case, the lob::ref_t stored in the primary key record does not advance its base version number. It shares the same base as the previous version. When a read transaction needs to read the previous version, it first constructs the latest BLOB value based on the lob::ref_t and the index entry list. Then, following the MVCC logic, it finds that the TRX_ID indicates that this version is not visible. At this point, it follows the ROLL_PTR to the undo space, where the old value of the modified external page is stored. By applying that old data back onto the current value, the complete and correct historical BLOB value can be reconstructed.

In this scenario, the recovery process is a combination of two steps:

  1. First, the version corresponding to the lob::ref_t is reconstructed via the index entry version chain.
  2. Then, the version visible to the current transaction is reconstructed via the ROLL_PTR chain.

Index Entry Details

Index entries are the key to the implementation of BLOB partial update. To make them easier to understand, I drew the following diagram to illustrate the logical relationships among index entries. It is a two-dimensional linked list. The horizontal dimension represents the sequential position when assembling the full BLOB value. The vertical dimension represents multiple versions at the same position. Each time the page at that position is modified, a new node is added vertically.

image-7

Of course, this is only a logical model. The physical layout is not organized exactly like this. Each BLOB has a BLOB first page. This page stores a portion of the BLOB data (the initial part) and 10 index entries. Each index entry corresponds to one BLOB data page. When all 10 index entries are used up, a new BLOB index page is allocated, and additional index entries are allocated from there. In reality, the index entries distributed across the BLOB first page and the BLOB index pages are linked together to form the logical structure shown in the diagram above.

image-8

]]>
Zhao Song
SIMD in Vector Search - “Hand-Tuned SIMD vs Compiler Auto-Vectorization”2025-09-08T00:00:00+00:002025-09-08T00:00:00+00:00https://kernelmaker.github.io/simdSIMD (Single instruction, multiple data) is often one of the key optimization techniques in vector search. In particular, when computing the distance between two vectors, SIMD can transform what was originally a one-dimensional-at-a-time calculation into 8- or 16-dimensions-at-a-time, significantly improving performance.

Here, as I mentioned in previous posts, MariaDB and pgvector take different approaches:

  1. MariaDB: directly implements distance functions using SIMD instructions.
  2. pgvector: implements distance functions in a naive way and relies on compiler optimization (-ftree-vectorize) for vectorization.

To better understand the benefits of SIMD vectorization, and to compare these two approaches, I ran a series of benchmarks — and discovered some surprising performance results along the way.

1. Test Environment and Method

Environment

  1. AWS EC2: c5.4xlarge, 16 vCPUs, 32 GiB memory
  2. Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
  3. gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

Method

  1. First, I implemented 4 different squared L2 distance (L2sq) functions (i.e., Euclidean distance without the square root):

    • Naive L2sq implementation
    static inline double l2sq_naive_f32(const float* a, const float* b, size_t n) {
      float acc = 0.f;
      for (size_t i = 0; i < n; ++i) { float d = a[i] - b[i]; acc += d * d; }
      return (double)acc;
    }
    
    • Naive high-precision L2sq (converting float to double before computation)
    static inline double l2sq_naive_f64(const float* a, const float* b, size_t n) {
      double acc = 0.0;
      for (size_t i = 0; i < n; ++i) { double d = (double)a[i] - (double)b[i]; acc += d * d; }
      return acc;
    }
    
    • SIMD (AVX2) L2sq implementation, computing 8 dimensions at a time
    // Reference: simSIMD
    SIMSIMD_PUBLIC void simsimd_l2sq_f32_haswell(simsimd_f32_t const *a,
                                                 simsimd_f32_t const *b,
                                                 simsimd_size_t n,
                                                 simsimd_distance_t *result) {
       
        __m256 d2_vec = _mm256_setzero_ps();
        simsimd_size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 a_vec = _mm256_loadu_ps(a + i);
            __m256 b_vec = _mm256_loadu_ps(b + i);
            __m256 d_vec = _mm256_sub_ps(a_vec, b_vec);
            d2_vec = _mm256_fmadd_ps(d_vec, d_vec, d2_vec);
        }
       
        simsimd_f64_t d2 = _simsimd_reduce_f32x8_haswell(d2_vec);
        for (; i < n; ++i) {
            float d = a[i] - b[i];
            d2 += d * d;
        }
       
        *result = d2;
    }
    SIMSIMD_INTERNAL simsimd_f64_t _simsimd_reduce_f32x8_haswell(__m256 vec) {
        // Convert the lower and higher 128-bit lanes of the input vector to double precision
        __m128 low_f32 = _mm256_castps256_ps128(vec);
        __m128 high_f32 = _mm256_extractf128_ps(vec, 1);
       
        // Convert single-precision (float) vectors to double-precision (double) vectors
        __m256d low_f64 = _mm256_cvtps_pd(low_f32);
        __m256d high_f64 = _mm256_cvtps_pd(high_f32);
       
        // Perform the addition in double-precision
        __m256d sum = _mm256_add_pd(low_f64, high_f64);
        return _simsimd_reduce_f64x4_haswell(sum);
    }
    SIMSIMD_INTERNAL simsimd_f64_t _simsimd_reduce_f64x4_haswell(__m256d vec) {
        // Reduce the double-precision vector to a scalar
        // Horizontal add the first and second double-precision values, and third and fourth
        __m128d vec_low = _mm256_castpd256_pd128(vec);
        __m128d vec_high = _mm256_extractf128_pd(vec, 1);
        __m128d vec128 = _mm_add_pd(vec_low, vec_high);
       
        // Horizontal add again to accumulate all four values into one
        vec128 = _mm_hadd_pd(vec128, vec128);
       
        // Convert the final sum to a scalar double-precision value and return
        return _mm_cvtsd_f64(vec128);
    }
    
    • SIMD (AVX-512) L2sq implementation, computing 16 dimensions at a time
    // Reference: simSIMD
    SIMSIMD_PUBLIC void simsimd_l2sq_f32_skylake(simsimd_f32_t const *a,
                                                 simsimd_f32_t const *b,
                                                 simsimd_size_t n,
                                                 simsimd_distance_t *result) {
        __m512 d2_vec = _mm512_setzero();
        __m512 a_vec, b_vec;
       
    simsimd_l2sq_f32_skylake_cycle:
        if (n < 16) {
            __mmask16 mask = (__mmask16)_bzhi_u32(0xFFFFFFFF, n);
            a_vec = _mm512_maskz_loadu_ps(mask, a);
            b_vec = _mm512_maskz_loadu_ps(mask, b);
            n = 0;
        }
        else {
            a_vec = _mm512_loadu_ps(a);
            b_vec = _mm512_loadu_ps(b);
            a += 16, b += 16, n -= 16;
        }
        __m512 d_vec = _mm512_sub_ps(a_vec, b_vec);
        d2_vec = _mm512_fmadd_ps(d_vec, d_vec, d2_vec);
        if (n) goto simsimd_l2sq_f32_skylake_cycle;
       
        *result = _simsimd_reduce_f32x16_skylake(d2_vec);
    }
    ......
    
  2. I generated a dataset of 10,000 float vectors (dimension = 1024, 64B aligned) and one target vector. Then, for the following 5 scenarios, I searched for the vector with the closest L2sq distance to the target. Each distance computation was repeated 16 times (to create a CPU-intensive workload), and each scenario was executed 5 times, taking the median runtime to eliminate random fluctuations:

    1. SIMD L2sq implementation
    2. Naive L2sq implementation
    3. Naive L2sq with compiler vectorization disabled (-fno-tree-vectorize -fno-builtin -fno-lto -Wno-cpp -Wno-pragmas)
    4. Naive high-precision L2sq implementation
    5. Naive high-precision L2sq with compiler vectorization disabled
  3. Compile with AVX2 (-O3 -mavx2 -mfma -mf16c -mbmi2) and run the 5 scenarios.

  4. Compile with AVX-512 (-O3 -mavx512f -mavx512dq -mavx512bw -mavx512vl -mavx512cd -mfma -mf16c -mbmi2) and run the 5 scenarios again.

2. Results and Analysis

image-1

Expected results:

  1. SIMD L2sq implementations are much faster than others, and AVX-512 outperforms AVX2 since it processes 16 dimensions at once instead of 8.

  2. Under AVX2, naive L2sq (178.385ms) is faster than naive high-precision L2sq (183.973ms), because the latter incurs float→double conversion overhead.

  3. Under both AVX2 and AVX-512, naive implementations with compiler vectorization disabled perform the worst, since they are forced into scalar execution.

Unexpected Results

In addition to the expected results above, some surprising findings appeared:

  1. For naive L2sq, AVX-512 performance (208.822ms) was actually slower than AVX2 (178.385ms).
  2. With AVX-512, naive L2sq was slower than naive high-precision L2sq.

Both deserve deeper analysis.

(1) Why was naive L2sq with AVX-512 slower than with AVX2?

Although this was a naive implementation, with -O3 we would expect the compiler to auto-vectorize. However, the vectorized result generated by the compiler was far worse than our manual SIMD implementation, and AVX-512 even performed worse than AVX2.

To investigate further, I used objdump to examine the AVX2 and AVX-512 binaries for l2sq_naive_f32().

  • Under AVX2:

    0000000000007090 <_ZL19l2sq_naive_f32PKfS0_m>:
         ... ...
         70b7:       48 c1 ee 03             shr    rsi,0x3
         70bb:       48 c1 e6 05             shl    rsi,0x5
         70bf:       90                      nop
         70c0:       c5 fc 10 24 07          vmovups ymm4,YMMWORD PTR [rdi+rax*1]
         70c5:       c5 dc 5c 0c 01          vsubps ymm1,ymm4,YMMWORD PTR [rcx+rax*1]
         70ca:       48 83 c0 20             add    rax,0x20
         70ce:       c5 f4 59 c9             vmulps ymm1,ymm1,ymm1
           
         70d2:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
         70d6:       c5 f0 c6 d9 55          vshufps xmm3,xmm1,xmm1,0x55
         70db:       c5 f0 c6 d1 ff          vshufps xmm2,xmm1,xmm1,0xff
         70e0:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
         70e4:       c5 f0 15 d9             vunpckhps xmm3,xmm1,xmm1
         70e8:       c4 e3 7d 19 c9 01       vextractf128 xmm1,ymm1,0x1
         70ee:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
         70f2:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
         70f6:       c5 f0 c6 d1 55          vshufps xmm2,xmm1,xmm1,0x55
         70fb:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
         70ff:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
         7103:       c5 f0 15 d1             vunpckhps xmm2,xmm1,xmm1
         7107:       c5 f0 c6 c9 ff          vshufps xmm1,xmm1,xmm1,0xff
         710c:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
         7110:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
         ... ...
    

    The compiler did use vector instructions (vmovups, vsubps, vmulps) to compute L2sq in groups of 8 floats. But when folding the 8 results horizontally into xmm0, it extracted elements using vshufps, vunpckhps, vextractf128, etc., and then added them one by one with scalar vaddss. Worse, this folding happened in every iteration.

    image-2

    This per-iteration horizontal reduction became the bottleneck. Instead, like the manual SIMD implementation, it should have accumulated vector results across the whole loop and performed just one horizontal reduction at the end.

  • Under AVX-512:

        a057:       48 c1 ee 04             shr    rsi,0x4
        a05b:       48 c1 e6 06             shl    rsi,0x6
        a05f:       90                      nop
        a060:       62 f1 7c 48 10 2c 07    vmovups zmm5,ZMMWORD PTR [rdi+rax*1]
        a067:       62 f1 54 48 5c 0c 01    vsubps zmm1,zmm5,ZMMWORD PTR [rcx+rax*1]
        a06e:       48 83 c0 40             add    rax,0x40
        a072:       62 f1 74 48 59 c9       vmulps zmm1,zmm1,zmm1
          
        a078:       c5 f0 c6 e1 55          vshufps xmm4,xmm1,xmm1,0x55
        a07d:       c5 f0 c6 d9 ff          vshufps xmm3,xmm1,xmm1,0xff
        a082:       62 f3 75 28 03 d1 07    valignd ymm2,ymm1,ymm1,0x7
        a089:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
        a08d:       c5 fa 58 c4             vaddss xmm0,xmm0,xmm4
        a091:       c5 f0 15 e1             vunpckhps xmm4,xmm1,xmm1
        a095:       c5 fa 58 c4             vaddss xmm0,xmm0,xmm4
        a099:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
        a09d:       62 f3 7d 28 19 cb 01    vextractf32x4 xmm3,ymm1,0x1
        a0a4:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
        a0a8:       62 f3 75 28 03 d9 05    valignd ymm3,ymm1,ymm1,0x5
        a0af:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
        a0b3:       62 f3 75 28 03 d9 06    valignd ymm3,ymm1,ymm1,0x6
        a0ba:       62 f3 7d 48 1b c9 01    vextractf32x8 ymm1,zmm1,0x1
        a0c1:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
        a0c5:       c5 f0 c6 d9 55          vshufps xmm3,xmm1,xmm1,0x55
        a0ca:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
        a0ce:       c5 f0 c6 d1 ff          vshufps xmm2,xmm1,xmm1,0xff
        a0d3:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
        a0d7:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
        a0db:       c5 f0 15 d9             vunpckhps xmm3,xmm1,xmm1
        a0df:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
        a0e3:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
        a0e7:       62 f3 7d 28 19 ca 01    vextractf32x4 xmm2,ymm1,0x1
        a0ee:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
        a0f2:       62 f3 75 28 03 d1 05    valignd ymm2,ymm1,ymm1,0x5
        a0f9:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
        a0fd:       62 f3 75 28 03 d1 06    valignd ymm2,ymm1,ymm1,0x6
        a104:       62 f3 75 28 03 c9 07    valignd ymm1,ymm1,ymm1,0x7
        a10b:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
        a10f:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
    

    The first part similarly used vector instructions to compute 16 values at a time. But folding 16 results was even more complex and expensive, involving vshufps, valignd, vunpckhps, vextractf32x4, vextractf32x8, etc. This additional complexity canceled out the gains from processing 16 dimensions per iteration, which explains why AVX-512 was slower.

(2) Why was naive float L2sq slower than naive high-precision L2sq under AVX-512?

In theory, high-precision L2sq should be slower because of its float→double conversions. So why was it faster in this test?

Looking at the disassembly of l2sq_naive_f64:

000000000000a280 <_ZL19l2sq_naive_f64PKfS0_m>:
    a280:       f3 0f 1e fa             endbr64
    a284:       48 85 d2                test   rdx,rdx
    a287:       74 37                   je     a2c0 <_ZL19l2sq_naive_f64_oncePKfS0_m+0x40>
    a289:       c5 e0 57 db             vxorps xmm3,xmm3,xmm3
    a28d:       31 c0                   xor    eax,eax
    a28f:       c5 e9 57 d2             vxorpd xmm2,xmm2,xmm2
    a293:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
    a298:       c5 e2 5a 04 87          vcvtss2sd xmm0,xmm3,DWORD PTR [rdi+rax*4]
    a29d:       c5 e2 5a 0c 86          vcvtss2sd xmm1,xmm3,DWORD PTR [rsi+rax*4]
    a2a2:       c5 fb 5c c1             vsubsd xmm0,xmm0,xmm1
    a2a6:       48 83 c0 01             add    rax,0x1
    a2aa:       c5 fb 59 c0             vmulsd xmm0,xmm0,xmm0
    a2ae:       c5 eb 58 d0             vaddsd xmm2,xmm2,xmm0
    a2b2:       48 39 c2                cmp    rdx,rax
    a2b5:       75 e1                   jne    a298 <_ZL19l2sq_naive_f64_oncePKfS0_m+0x18>
    a2b7:       c5 eb 10 c2             vmovsd xmm0,xmm2,xmm2
    a2bb:       c3                      ret
    a2bc:       0f 1f 40 00             nop    DWORD PTR [rax+0x0]
    a2c0:       c5 e9 57 d2             vxorpd xmm2,xmm2,xmm2
    a2c4:       c5 eb 10 c2             vmovsd xmm0,xmm2,xmm2
    a2c8:       c3                      ret
    a2c9:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
  • The code is much shorter than the float version.
  • Although it includes scalar float→double conversions (vcvtss2sd) and computes one dimension at a time, it avoids the complex and costly 16-element horizontal folding.

In other words, even with the conversion overhead, the simpler scalar path was still faster than the float version with vector folding. The compiler likely chose the conservative scalar path here, avoiding vectorization.

(3) How to Improve Naive L2sq for Better Compiler Vectorization?

The reason for horizontal folding is likely that the compiler strictly follows IEEE 754 semantics, preserving the exact order of floating-point additions. This prevents the compiler from reordering additions into vectorized accumulations.

To relax this, we can explicitly allow reassociation:

static inline double l2sq_naive_f32(const float* a, const float* b, size_t n) {
    float acc = 0.f;
    #pragma omp simd reduction(+:acc)
    for (size_t i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        acc += d * d;
    }
    return (double)acc;
}

And compile with -fopenmp-simd to enable this directive.

Running again shows a significant improvement: compiler auto-vectorization now achieves performance close to manual SIMD implementations. Using -ffast-math also works.

image-3

3. Summary

  1. SIMD significantly improves distance computation performance.
  2. Hand-written SIMD implementations perform best.
  3. For naive implementations, allowing reassociation (via #pragma omp simd reduction(+:acc) or appropriate subsets of -ffast-math) is the key to approaching hand-written SIMD performance. Under strict IEEE semantics, the compiler conservatively generates per-iteration folding, which creates slow paths where AVX-512 does not necessarily have an advantage.
]]>
Zhao Song
Is pgvector breaking PostgreSQL’s Repeatable Read isolation?2025-08-11T00:00:00+00:002025-08-11T00:00:00+00:00https://kernelmaker.github.io/pgvector_rrThis thought hit me on the way to work today: (The table ‘items’ has an HNSW index on the vector column ‘embedding’)

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
……

Can we really say this SELECT is repeatable read safe❓

I used to assume pgvector, as a PostgreSQL extension, naturally inherits Postgres’s transactional guarantees — but after thinking it through, that might not be the case.

PostgreSQL MVCC relies on 3 assumptions:

  1. Indexes are append-only: Write operations only insert new index entries — never update or delete them.
  2. The heap stores version history: Each row’s versions are retained for snapshot-based visibility checks.
  3. VACUUM coordinates cleanup: It purges dead heap tuples and their corresponding index entries together.

This works well with native ordered indexes like nbtree. For example:

  1. A REPEATABLE READ transaction performs the same SELECT twice.
  2. Between them, a new row B is inserted.
  3. In the second SELECT, B appears in the index scan but is filtered out after a heap visibility check.

So, the query still returns the same results — consistent with REPEATABLE READ.

image-1

But HNSW behaves differently…

When inserting a new vector B:

  1. B searches the graph to find neighbors.
  2. Selected neighbors (say, T) update their neighbor lists to include B.
  3. If T’s list is full, HNSW re-selects top-k neighbors — possibly evicting an existing node like D.

Here’s the issue: T’s neighbor list is modified — breaking assumption #1. Now, suppose a REPEATABLE READ transaction had previously discovered D via T. In its second identical query, it may no longer reach D, simply because D was evicted from T’s neighbor list. At the same time, the newly inserted B is now reachable — but is correctly rejected due to heap visibility checks.

Root cause:

  1. The HNSW index breaks MVCC’s immutability assumption: It performs in-place modifications to graph nodes during insertions.
  2. No versioning in HNSW index: There’s no way to preserve historical neighbor lists for concurrent transactions.

Even though I prefer pgvector’s low-level, native integration (at the same level as nbtree), MariaDB’s design may provide better transactional isolation here. Its HNSW index is implemented as a separate InnoDB table — which naturally supports MVCC, including versioned index “rows.”

This question came to mind today — I reached a tentative conclusion through some code review and thought experiments. Haven’t verified this with a test case yet, so feel free to correct me if I’m wrong.

🤔 BTW, lately, I’ve been comparing how vector search is implemented in transactional databases vs dedicated vector databases by reading through their code. It’s exciting to see traditional databases embracing new trends — but what do you think: Do transactions bring real value to vector search, or are they more of a burden in practice? And what about the other way around?

Discussion

This post has sparked some discussion on LinkedIn, with two main points being raised:

  1. HNSW is approximate search by nature, so strict Repeatable Read isn’t required.
  2. PostgreSQL doesn’t currently guarantee identical results in all cases anyway (e.g., non-unique indexes with SELECT ... ORDER BY ... LIMIT ...), because different execution plans can produce different result orders.

I’m not convinced by either of these arguments:

  1. Approximate search is an inherent trade-off in the vector search domain. It’s unrelated to PostgreSQL’s ACID guarantees, and using vector search shouldn’t be a reason to compromise on them.
  2. The core issue here isn’t about result order — it’s about the result set itself. Query plan variability doesn’t explain this away, because even if we strictly control every runtime condition to ensure identical execution plans, HNSW can still produce different result sets (not just differently ordered sets) due to the root cause I described above.
]]>
Zhao Song
Exploring the Internals of pgvector2025-07-14T00:00:00+00:002025-07-14T00:00:00+00:00https://kernelmaker.github.io/pgvectorpgvector brings vector similarity search capabilities to PostgreSQL. Thanks to PostgreSQL’s flexible extension architecture, pgvector can focus purely on implementing the vector index: defining the vector data type, implementing index operations like insert, search, and vacuum, managing index page and tuple formats, and handling WAL logging. PostgreSQL handles the rest — index file and page buffer management, MVCC, crash recovery, and more. Because of this, pgvector is able to implement its vector index at a low level, on par with PostgreSQL’s built-in nbtree index.

This is one of the key differences from MariaDB’s vector index. Due to MariaDB’s pluggable storage engine architecture and engineering trade-offs, its vector index is implemented at the server layer through the storage engine’s transactional interface. The storage engine itself is unaware of the vector index — it just sees a regular table. Curious about the internals of the MariaDB vector index? Take a look at my previous posts here[1] and here[2].

pgvector supports two types of vector indexes: HNSW and IVFFlat. This post focuses on the HNSW implementation, particularly its concurrency control mechanisms — a topic that has received relatively little attention.

1. Key Interfaces

  1. hnswinsert: This is the core interface for inserting into the index. Its implementation closely follows the HNSW paper, aligning with the INSERT, SEARCH-LAYER, and SELECT-NEIGHBORS-HEURISTIC algorithms (Algorithms 1, 2, and 4). One difference I noticed is that pgvector omits the extendCandidates step from SELECT-NEIGHBORS-HEURISTIC.

  2. hnswgettuple: This is the search interface. It invokes GetScanItems to perform the HNSW search, which aligns closely with Algorithm 5 (K-NN-SEARCH) from the paper. In iterative scans, GetScanItems not only returns the best candidates but also retains the discarded candidates — those rejected during neighbor selection. Once all the best candidates are exhausted, it revisits some of these discarded candidates at layer 0 for additional search rounds. This continues until hnsw_max_scan_tuples is exceeded, after which the remaining discarded candidates are returned as-is and the scan ends.

  3. hnswbulkdelete: This is the vacuum-related interface. It’s the heaviest part of pgvector, involving three steps:
    1. Scanning all index pages to collect invalid tuples.

    2. RepairGraph, the most complex step, removes invalid tuples from the graph and their neighbors, then repairs the graph to ensure correctness. This requires scanning all pages again and performing multiple search-layer and select-neighbors operations.

    3. Safely deleting the invalid tuples from the index, again requiring a full scan of all pages.

    As you can see, the full traversal of index pages three times plus extensive HNSW operations make this function very heavy.

  4. hnswbuild: This handles index creation. It uses similar logic to hnswinsert, but goes through PostgreSQL’s index build APIs. Notably, it supports concurrent builds. Initially, it builds the index in memory; only when the memory limit is exceeded does it flush pages to disk and switch to on-disk insertions. WAL logging is disabled throughout the build phase.

2. Concurrency Control

In an earlier post, I analyzed the concurrency control design of MariaDB’s vector index. There, read-write concurrency is supported through InnoDB’s MVCC, but write-write concurrency is not.

pgvector goes further by supporting true write-write concurrency. Let’s dive into how pgvector handles concurrency between hnswinsert and hnswbulkdelete.

It introduces multiple lock types:

  1. Page Locks: These are PostgreSQL’s standard buffer locks, used to protect individual pages. Pages in pgvector fall into three categories:

    • Meta Page: Stores index metadata.

    • Element Pages: Store index tuples.

    • Entrypoint Page: A special element page containing the HNSW entrypoint.

  2. HNSW_UPDATE_LOCK: A read-write lock used in pgvector to protect critical sections that span a larger scope than a single page lock.
    • Most inserts (hnswinsert) acquire it in shared mode, allowing concurrent inserts.

    • If an insert needs to update the entrypoint, it upgrades to exclusive mode to ensure only one insert can modify the entrypoint.

    • hnswbulkdelete briefly acquires and immediately releases the lock in exclusive mode after collecting invalid index tuples and just before RepairGraph, ensuring that all in-flight inserts have completed. Otherwise, concurrent inserts might reintroduce the invalid elements being removed, making them neighbors again.

  3. HNSW_SCAN_LOCK: Similar to HNSW_UPDATE_LOCK, but used to coordinate between hnswbulkdelete and hnswgettuple.

All locking operations in hnswinsert and hnswbulkdelete are well-structured. The diagram below shows detailed lock scopes in both implementations, where solid lines indicate exclusive locks and dashed lines indicate shared locks. I won’t go into all the details here — I may write a separate post covering the implementation specifics — but the diagram clearly illustrates that exclusive HNSW_UPDATE_LOCK usage is infrequent. Most operations acquire it in shared mode and hold short-lived page locks only as needed, keeping contention low.

image-1

What about deadlocks? The answer is simple: as shown in the diagram, in most cases only one page buffer lock is held at a time, eliminating the risk of circular dependencies. In rare cases, both an element page and one of its neighbor pages (also an element page) may be locked simultaneously. However, since pgvector maintains a globally unique current insert page, even these scenarios remain safe.

3. Summary

pgvector is another textbook example of how to integrate vector search into a traditional OLTP database. Its implementation is elegant, closely aligned with the original HNSW paper.

Compared to MariaDB’s vector index, it stands out for its fine-grained concurrency control. However, it lacks the SIMD-level optimizations that MariaDB has introduced for better performance.

A deeper comparison between pgvector and MariaDB’s vector index internals would be an interesting future topic.

If you’re interested in performance benchmarks comparing pgvector and MariaDB, Mark Callaghan ran detailed tests; check them out here.

]]>
Zhao Song
A Potential Optimization for Inserting into MySQL Unique Indexes2025-06-05T00:00:00+00:002025-06-05T00:00:00+00:00https://kernelmaker.github.io/opt-uniq-indexIn my previous blog, I discussed how many B+Tree searches MySQL performs when inserting a record, detailing the process for inserting into both the primary index and unique secondary indexes.

I believe there is room for optimization in the unique secondary index insertion process. In my previous post, I already reported one such optimization, which has since been verified by the MySQL team. In this post, I’d like to discuss another optimization I recently proposed.

image-1

Currently, when inserting an entry into a unique secondary index, MySQL performs the following steps:

  1. It first does a B+Tree search using all columns of the entry to locate the leaf page insertion point. If it finds a record with matching index columns, it suggests a potential duplicate, so it proceeds to the next step.
  2. It does another B+Tree search using only the index columns to locate the first record with matching index columns. It then iterates (using next) to check for actual duplicates, considering that records marked as deleted don’t count as duplicates. If no duplicate is found, it continues to the next step.
  3. It performs a final B+Tree search using all columns of the entry to find the insertion point and then inserts the entry.

As you can see, this process involves 3 separate B+Tree searches. This is mainly because unique secondary indexes in InnoDB cannot be updated in place; they rely on a delete-mark and insert approach. Since multiple records can share the same index column values (including deleted-marked ones), MySQL has to perform these extra checks to ensure uniqueness.

Each of these B+Tree searches acquires at least one page latch at each tree level, which can become a concurrency bottleneck, especially during page splits or merges.

How can we optimize this?

I believe we can reduce the number of B+Tree searches for unique secondary indexes. Specifically, we could skip the initial B+Tree search that uses all columns as the search pattern. The revised process would be:

  1. Use a B+Tree search with only the index columns to locate the first record with matching index columns. Then iterate (using next) through subsequent records to confirm whether an actual duplicate exists, while also identifying the final insertion point (including comparing the full entry columns when needed). If no duplicate is found, we can directly insert the entry at the determined insertion point.

This approach would reduce the number of B+Tree searches from 3 to 1, significantly reducing the chances of concurrency conflicts. All duplicate checks would happen within one or just a few adjacent leaf pages, making the lock granularity much smaller. Importantly, even in the worst case, the number of entry-record comparisons wouldn’t exceed what the current implementation requires.

I’ve already submitted a report with this idea to the MySQL team. I’m hoping it can generate some interesting discussions around this optimization.

]]>
Zhao Song
Reviewing the Internals of MariaDB’s Vector Index 22025-05-26T00:00:00+00:002025-05-26T00:00:00+00:00https://kernelmaker.github.io/mariadb-vector-2image-1

1. What optimizations does MariaDB apply to vector indexes?

1.1 SIMD Acceleration

  • To improve performance, vector fields in the records are quantized from raw float arrays into fixed-point representations (scale + int16_t arrays) using normalization and quantization. These int16_t arrays allow fast dot product computation using AVX2 or AVX512 instructions, which significantly speeds up vector distance calculations.
  • When expanding neighbors for a given node, MariaDB uses a PatternedSimdBloomFilter to efficiently skip previously visited neighbors. This filter groups visited memory addresses in batches of 8 and uses SIMD to accelerate the matching process.

1.2 Node Caching to Reduce Storage Engine Access

  • Each table’s TABLE_SHARE structure holds an MHNSW_Share object, which contains a global cache shared across sessions (since TABLE_SHARE is global).

  • The cache improves read performance but introduces additional locking overhead, which is worth a closer look. Three types of locks are used to manage concurrency:

    • cache_lock: guards the entire cache structure.
    • node_lock[8]: partitions node-level locks to reduce contention on cache_lock. The thread first uses cache_lock to locate the node, then grabs node_lock[x] for fine-grained protection, allowing cache_lock to be released right after.
    • commit_lock: a read-write lock that ensures consistency during writes. Readers hold the read lock throughout the query to prevent concurrent cache modifications. Writers acquire the write lock during commit, invalidate the cache, bump the version number, and notify any ongoing reads (executing between hlindex_read_first() and hlindex_read_next()) to switch to the new cache generated by the writer.

    Observations:

    • In pure read workloads, this caching scheme works well, and the two-tier locking mechanism (cache_lock + node_lock) minimizes contention.
    • In read-write workloads, however, every write invalidates the entire cache, making it less effective for concurrent readers.

2. What Is the Transaction Isolation on the Vector Index?

(This section refers specifically to vector indexes on InnoDB.)

1. Read Committed

(tx_isolation = ‘READ COMMITTED’)

  • Consistent reads remain unchanged.
  • Locking reads, however, become stronger: the shared lock (S-lock) acquired on the entry-point node blocks writes completely, effectively elevating isolation to Serializable.

2. Repeatable Read

(tx_isolation = ‘REPEATABLE READ’)

  • Consistent reads remain unchanged.
  • Locking reads now acquire a next-key S-lock on the entry-point node, which again blocks concurrent writes, behaving like Serializable isolation.

Interestingly, under the Repeatable Read isolation level, the correctness of locking reads is not guaranteed by InnoDB’s next-key locks or gap locks. Gap protection is defined for ordered indexes, so the concept of a “gap” does not really exist in a vector index structure. However, locking reads in this case effectively behave like Serializable, which still satisfies the requirements of Repeatable Read.

Another notable quirk: the cache layer disrupts normal locking behavior. If a node is found in the cache, no InnoDB-level lock is acquired. Locking only happens on cache misses. This makes the locking behavior somewhat unpredictable under high cache hit rates.


3. Conclusion

Based on the reviews in my last post and this one, I believe MariaDB’s current implementation of vector indexes offers an excellent case study of how to integrate vector search in a relational database. It achieves a strong balance between engineering complexity, performance, and applicability.

Looking forward to seeing even more powerful iterations in the future!

]]>
Zhao Song