MVCC (Multi-version Concurrency Control) is a common mechanism used in transactional databases to resolve read–write conflicts. The core idea is that when a transaction modifies data, it does not overwrite the original data directly. Instead, it preserves the previous version while creating a new version of the record. As a result, all historical versions of a record are retained in the database.
The key benefit is that read and write transactions on the same record no longer need to block each other. Even if a write transaction has modified a record but not yet committed, a read transaction can still directly read the version that is visible to it from the historical versions.
The simplified principle is illustrated below:

When the record with primary key PK is modified three times, each modification creates a new version, so all historical versions of the record are retained in the database.
So what is the benefit of retaining all historical versions? Consider the following example:

Three write transactions A, B, and C modify the same PK sequentially on the timeline, while three read transactions X, Y, and Z interleave with them and read the PK. Without MVCC, read transactions must block until write transactions commit and release locks. With MVCC, when read transaction X attempts to read the PK, the state of the PK is:
Under RC (Read Committed) and RR (Repeatable Read) isolation levels, PK: ‘bbb’ is not visible to transaction X. Because MVCC preserves the old version, transaction X can directly read the visible version PK: ‘aaa’ and ignore the currently running write transaction B. This greatly improves read–write concurrency.
As an essential capability of databases, MVCC is supported by both MySQL and PostgreSQL. Fundamentally, they both achieve the behavior described above, but their designs and implementations make different trade-offs.
In the following sections, I will compare their implementations in detail across three aspects:

The leaf nodes of the nbtree index store the PK fields of the tuple and point to the actual location of the tuple in the heap (TID). Following the Heap TID leads to the tuple, which contains the full data including PK fields and value fields.
Each index tuple points to the oldest version in its corresponding HOT chain. When the tuple is modified, the old version is not changed. Instead, a new version is created with the same PK fields but different value fields. The ctid field of the old tuple points to the location of the next version. The latest version’s ctid points to itself.
In short, every version of a PostgreSQL tuple is a complete tuple containing all fields. Starting from the index tuple, the version chain is linked from old to new through the ctid field.
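To make the old-to-new linkage concrete, here is a minimal Python sketch (not PostgreSQL code; the tuple layout, `Tuple`, and `follow_chain` are illustrative assumptions) of walking a version chain through ctid:

```python
# Toy model of a heap: tuples addressed by TID, chained old-to-new via ctid.
class Tuple:
    def __init__(self, tid, value):
        self.tid = tid          # (page_no, offset): location in the heap
        self.value = value
        self.ctid = tid         # latest version: ctid points to itself

def follow_chain(heap, start_tid):
    """Walk the version chain from the version the index tuple points at."""
    versions = []
    tid = start_tid
    while True:
        tup = heap[tid]
        versions.append(tup.value)
        if tup.ctid == tid:     # reached the latest version
            return versions
        tid = tup.ctid          # old tuple's ctid points to the next version

# Build a 3-version chain; the index tuple points at the oldest version.
heap = {}
v1 = Tuple((1, 1), "aaa"); heap[v1.tid] = v1
v2 = Tuple((1, 2), "bbb"); heap[v2.tid] = v2
v3 = Tuple((1, 3), "ccc"); heap[v3.tid] = v3
v1.ctid = v2.tid
v2.ctid = v3.tid

print(follow_chain(heap, (1, 1)))   # ['aaa', 'bbb', 'ccc']
```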
It is important to note that the version chain may become ‘broken’, as shown below:

The PK fields remain unchanged, but the value fields are modified multiple times. Each modification generates a new full version stored in the same heap page as the old version, as shown for versions 1, 2, and 3. Because they reside in the same heap page, advancing through the chain via ctid is very cheap (no additional heap page lock is required). This is the type of chain that can be efficiently traversed for version lookups, known as a HOT chain in PostgreSQL.
A HOT chain requires two conditions: the update must not modify any indexed column, and the new version must fit in the same heap page as the old one. If either condition is not met, the HOT chain breaks.
As modifications continue, starting from version 4, the original heap page (heap page 1) can no longer accommodate the new tuple. The new tuple is therefore stored in another page (heap page 2). Although version 3’s ctid still points to version 4, the HOT chain effectively ends at version 3.
When traversing a HOT chain, the reader will not follow ctid across heap pages. Instead, it stops at the end of the chain. The reason is that both the latest and historical versions of tuples reside in heap pages. If the reader followed ctid from page 1 to page 2, it would hold a read lock on heap page 1 and attempt to acquire a read lock on heap page 2. Because both pages are heap pages and there is no defined lock ordering between them, another backend might hold the lock on page 2 and attempt to acquire the lock on page 1, leading to a deadlock.
At this point, the HOT chain is considered broken.
How does PostgreSQL transition from version 3 to version 4 then?
The implementation is to insert a new index tuple into the nbtree index with the same PK fields pointing to version 4, starting a new HOT chain. If a read operation traverses the version chain and finds versions 1, 2, and 3 all invisible, it stops the current HOT chain traversal and returns to the index layer. It then proceeds to the next index tuple (the one pointing to version 4).
This design leads to an interesting and somewhat counterintuitive behavior: multiple index tuples with identical PK fields may coexist in the nbtree index.
In MySQL, data is stored directly in the clustered index (B+Tree) leaf nodes. This is an important difference from PostgreSQL: MySQL does not have a heap.
The second difference is that the record stored in the clustered index and its historical versions reside in different places. Historical versions are not stored in the clustered index. Instead, old values are stored in undo records in the undo space. When needed, historical versions are reconstructed by applying the undo records to the current record.
The third difference is that an undo record does not store a full copy of the record. It only stores the old values of the columns modified in the operation.
The fourth difference is the direction of the version chain. In MySQL, the clustered index always stores the latest version. Before a record is modified, the old values of the columns being changed (together with the PK fields) are copied to the undo space, and the record is updated in place.
As shown below:

The clustered index record contains two system fields: TRX_ID and ROLL_PTR.
- TRX_ID records the transaction ID that last modified the record and is used for MVCC visibility checks.
- ROLL_PTR links the version chain. Similar to PostgreSQL's ctid, ROLL_PTR links versions together, but the direction is opposite: ctid points from old to new, while ROLL_PTR points from new to old.
In the figure, the record was modified three times:
Therefore, the clustered index record stores the latest version after the three modifications. Through ROLL_PTR, it points to the previous version stored in the undo space (the version before Field 3 was modified), and so on.
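To make the new-to-old reconstruction concrete, here is a toy Python sketch (not InnoDB code; the undo-record layout, field names, and `reconstruct_versions` are illustrative assumptions). Each undo record stores only the changed columns, so older versions are rebuilt incrementally:

```python
def reconstruct_versions(record, undo_log, roll_ptr):
    """Walk ROLL_PTR new-to-old, restoring only the columns each undo saved."""
    versions = []
    version = dict(record)
    while roll_ptr is not None:
        undo = undo_log[roll_ptr]
        version.update(undo["old_values"])    # undo holds only changed columns
        version["TRX_ID"] = undo["old_trx_id"]
        versions.append(dict(version))
        roll_ptr = undo["roll_ptr"]           # points to the next older undo
    return versions

# Current clustered-index record after three modifications.
record = {"PK": 1, "f1": "x3", "f2": "y2", "TRX_ID": 30}
undo_log = {
    2: {"old_values": {"f1": "x2"}, "old_trx_id": 20, "roll_ptr": 1},
    1: {"old_values": {"f2": "y1"}, "old_trx_id": 10, "roll_ptr": 0},
    0: {"old_values": {"f1": "x1"}, "old_trx_id": 5,  "roll_ptr": None},
}
for v in reconstruct_versions(record, undo_log, roll_ptr=2):
    print(v)
```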
The differences in version organization between PostgreSQL and MySQL can be summarized in three contrasts:
Once multiple versions exist, the next question is: how does a read transaction determine which version it should see?
This is the core of MVCC: visibility checks.
To determine visibility, the database must establish an order among transactions. Taking the RR isolation level as an example, when a transaction begins, it must know which write transactions are currently active in the system. All modifications produced by those active transactions are invisible to the read transaction. Only modifications from transactions that were already committed at that moment are visible.
Therefore, if the database can define an order among write transactions, it becomes straightforward to perform visibility checks.
Most databases achieve this by using a globally increasing transaction ID. When a write transaction is created, it obtains the current maximum transaction ID plus one. This naturally orders write transactions.
Once transaction IDs exist, each data modification can be tagged with the transaction ID that produced it. A read transaction, when created, obtains the list of currently active write transactions. Later, when reading data, it simply compares the transaction ID recorded on the data with this list and applies the visibility rules to determine whether the data is visible.
Both PostgreSQL and MySQL follow this approach.
As mentioned earlier, transaction IDs are critical. In PostgreSQL, the globally increasing transaction ID is called nextXid.
Each write transaction obtains the latest value when it starts.

Transaction A is created first and obtains xid 7. It inserts PK: ‘aaa’. The tuple records this through the xmin field, which stores the inserter’s transaction ID (7).
After transaction A commits, transaction B is created and obtains xid 11. It updates the record to ‘bbb’. Following the multi-version rule, transaction B does not overwrite the tuple inserted by A. Instead, it creates a new version. The old tuple’s xmax is set to 11, indicating that transaction 11 has “deleted” this tuple version. The new tuple records xmin = 11. The old tuple’s ctid points to the new tuple.
Transaction C proceeds similarly.

Thus, each PostgreSQL tuple contains two fields recording the related transactions:
- xmin: the inserter
- xmax: the deleter

With the global transaction ID (nextXid) and the transaction tags (xmin, xmax) on each tuple, the next requirement is the snapshot used by read transactions for visibility checks.

At the top of the figure are the globally increasing transaction IDs and the currently active write transactions. The next ID to allocate is 16. Among all assigned IDs, transactions ≤7 have already committed. Between 8 and 15, some have committed, and the currently active write transactions are 8, 11, 12, and 14.
If a read transaction starts now, it obtains a snapshot:
- xmin: the smallest active transaction ID (8)
- xmax: the next transaction ID to allocate (16)
- xids[]: the list of active transactions

With this snapshot, it can determine whether a tuple's transaction tag is visible. For example:

- xmin = 7: visible to the snapshot.
- xmin = 14: not visible.

Now that we know how to determine the visibility of transaction tags, the final question is how to determine whether a tuple itself is visible, given that it has both xmin and xmax.
The core principle is:
A tuple is visible to a snapshot if its inserter (xmin) is visible and its deleter (xmax) is not visible.

The process is:
1. Check xmin. If xmin is not visible, the tuple is invisible.
2. If xmin is visible, check xmax.
3. If xmax is also visible, the tuple has been deleted in the snapshot and is therefore invisible.
4. If xmax is not visible, the tuple is visible.

Finally, the following figure shows the process of locating a tuple visible to a snapshot starting from the nbtree index:

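The rules above can be condensed into a small Python sketch. This is a simplification of PostgreSQL's visibility logic with function names of my own; it ignores hint bits, subtransactions, and command IDs:

```python
def xid_visible(xid, snapshot):
    """Is the work of transaction `xid` visible to `snapshot`?"""
    if xid is None:                  # e.g. xmax unset: no deleter at all
        return False
    if xid < snapshot["xmin"]:       # committed before all active transactions
        return True
    if xid >= snapshot["xmax"]:      # allocated after the snapshot was taken
        return False
    return xid not in snapshot["xids"]   # in window: visible unless active

def tuple_visible(tup, snapshot):
    """Visible iff the inserter is visible and the deleter is not."""
    return xid_visible(tup["xmin"], snapshot) and \
        not xid_visible(tup["xmax"], snapshot)

# The snapshot from the example: active transactions 8, 11, 12, 14.
snap = {"xmin": 8, "xmax": 16, "xids": {8, 11, 12, 14}}
print(tuple_visible({"xmin": 7,  "xmax": None}, snap))  # True
print(tuple_visible({"xmin": 7,  "xmax": 14},  snap))   # True: deleter active
print(tuple_visible({"xmin": 14, "xmax": None}, snap))  # False: inserter active
```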
MySQL also has a globally increasing transaction ID called next_trx_id_or_no.

In the example, three transactions modify the same record three times. Both the clustered index record and the undo records contain a TRX_ID field. This field is the transaction tag used by MySQL. The TRX_ID records which transaction created that version of the record.
Unlike PostgreSQL, a record in MySQL has only one transaction tag, TRX_ID, rather than two. The reason will be explained later.
Next, consider the visibility check.

MySQL’s ReadView is extremely similar to PostgreSQL’s snapshot and serves the same purpose. The only difference is that MySQL’s ReadView contains an additional field: m_creator_trx_id.
This field is necessary because the transaction that creates the ReadView is itself included in m_ids[] (since it is an active transaction). Without m_creator_trx_id, the transaction would not be able to see its own modifications. It also handles cases where a read transaction is promoted to a write transaction.
Aside from this, the visibility rules are almost identical.
Given the visibility rules, determining whether a record is visible to a ReadView becomes straightforward:

The process is simple: check whether the record’s TRX_ID is visible to the ReadView. Unlike PostgreSQL, MySQL does not need two separate checks for xmin and xmax.
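A corresponding ReadView-style sketch follows. The bounds m_up_limit_id and m_low_limit_id are InnoDB ReadView fields not introduced above, so treat the whole function as an illustrative assumption rather than InnoDB's actual changes_visible():

```python
def changes_visible(trx_id, view):
    """Single-tag visibility check against a ReadView-like structure."""
    if trx_id == view["m_creator_trx_id"]:   # a trx sees its own changes
        return True
    if trx_id < view["m_up_limit_id"]:       # committed before all active trxs
        return True
    if trx_id >= view["m_low_limit_id"]:     # allocated after view creation
        return False
    return trx_id not in view["m_ids"]       # in window: visible unless active

view = {"m_up_limit_id": 8, "m_low_limit_id": 16,
        "m_ids": {8, 11, 12, 14}, "m_creator_trx_id": 12}
print(changes_visible(7, view))    # True
print(changes_visible(12, view))   # True: own modification
print(changes_visible(14, view))   # False: another active transaction
```

Only one tag (TRX_ID) is checked per version; invisibility simply means the reader must follow ROLL_PTR to an older version.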
Finally, the process of finding a version visible to a ReadView starting from the B+Tree is shown below:

PostgreSQL and MySQL use highly similar visibility mechanisms. The primary difference is that PostgreSQL stores two transaction tags (xmin and xmax) on each tuple, requiring two checks. MySQL stores only one (TRX_ID), requiring only one check.
Why is this the case? The fundamental reason is the direction of the version chain:
In PostgreSQL's old-to-new chain, xmin alone would in principle be sufficient, because it records the inserter. However, when traversing the chain, the reader cannot stop immediately after finding a visible insert, because the next version might also be visible. The reader must continue until it finds the first version whose insert is invisible; the previous version is then the visible one. This means at least one extra step is required. PostgreSQL therefore stores the insert transaction of the next version as the deleter (xmax) of the current version, avoiding that extra traversal. Additionally, xmax is required for DELETE operations, where no next version exists.

The next core problem in MVCC is garbage collection of historical versions. Historical versions are not always needed.
Because global transaction IDs advance linearly, the snapshots (or ReadViews) of read transactions also move forward. A historical version can be safely purged when:
No active snapshot or ReadView in the system still needs that version (i.e., all snapshots can already see the newer version that replaced it).
This is where PostgreSQL and MySQL differ most significantly.
PostgreSQL reclaims historical versions through the Vacuum backend.

PostgreSQL uses GlobalVisState to track purge boundaries. It contains two variables:
maybe_needed
This is the minimum value among all backend transaction IDs and the xmin values of their snapshots. Backend transaction IDs must be considered because a backend may have started a write transaction and obtained an xid but not yet created a snapshot. That xid still forms a lower bound that cannot be crossed.
All tuple xmax values (deleters) are compared against maybe_needed. If xmax is smaller than maybe_needed, the deleter is visible to all backends and snapshots, meaning the tuple is globally deleted and can be safely purged.
definitely_needed
This is the xmin of the latest snapshot taken by the Vacuum backend. Any tuple whose xmax is greater than or equal to definitely_needed is invisible to the Vacuum backend and cannot be purged.
These two values define the continuous upper bound that can be purged and the lower bound that cannot. For tuples whose xmax falls between these bounds, Vacuum may need to refresh maybe_needed and re-evaluate, since the snapshot used by Vacuum might be outdated. Because refreshing is expensive, PostgreSQL optimizes this by checking whether RecentXmin has advanced. If it has not changed, refreshing is skipped.
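The two boundaries can be expressed as a toy classification function. The thresholds follow the text; the function name and shape are my own, not PostgreSQL's:

```python
def can_purge(xmax, maybe_needed, definitely_needed):
    """Classify a dead tuple by its deleter xid against the purge bounds."""
    if xmax < maybe_needed:
        return "purge"      # deleter visible to every backend and snapshot
    if xmax >= definitely_needed:
        return "keep"       # deleter invisible even to Vacuum's own snapshot
    return "recheck"        # gray zone: refresh maybe_needed and re-evaluate

print(can_purge(5,  maybe_needed=8, definitely_needed=12))  # purge
print(can_purge(10, maybe_needed=8, definitely_needed=12))  # recheck
print(can_purge(13, maybe_needed=8, definitely_needed=12))  # keep
```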
With these rules, the workflow of the Vacuum backend is:

1. Scan the heap and check each dead tuple against GlobalVisState. Collect purgable tuples into a set.
2. Scan the indexes and remove the index tuples pointing to those heap tuples.
3. Remove the heap tuples themselves and mark their line pointers as reusable (LP_UNUSED).

This process involves extensive scanning of both heap and index structures, which can be expensive. PostgreSQL mitigates this cost with several optimizations:
- The visibility map lets Vacuum skip pages known to contain only all-visible tuples.
- Ordinary reads and updates opportunistically prune dead tuples within a page via heap_page_prune(), reducing the workload of Vacuum.

MySQL takes a different approach.
All undo records (historical versions) are grouped by the transactions that produced them. These transactions are then organized according to their global commit order (forming a min-heap).
With this ordering, MySQL can quickly identify the undo records belonging to the earliest committed transaction, which are typically the closest candidates for purging.
The purge thread compares the transaction number (trx_no) of the earliest transaction in the history list with m_low_limit_no from the purge view.
If trx_no < m_low_limit_no, all active ReadViews can see this transaction's commit, so its undo records are no longer needed and can be safely purged.

An important optimization is that transactions are ordered by commit order rather than creation order.
Sorting by creation order would also be safe, since purge always processes the earliest transaction in the list first. However, it has a drawback: if the earliest transaction does not commit for a long time, later transactions that have already committed cannot be purged even if they are no longer needed.
For example:
If transactions were ordered by creation time, Trx A would come before Trx B. Because Trx A has not committed, purge would be blocked and Trx B’s undo records could not be purged, even though no ReadView needs them anymore.
By ordering transactions by commit time instead, MySQL can purge Trx B’s undo records immediately after its commit.
This is an important optimization. Notably, trx_id and trx_no both come from the same global variable: next_trx_id_or_no.
The workflow is shown below:

The purge thread first clones the oldest active ReadView in the system. The m_low_limit_no in this ReadView represents the smallest trx_no that was still committing when the view was created. All transactions with smaller trx_no values have already committed.
In the undo space, committed transactions’ undo records are linked together in the history list in commit order (ascending trx_no). The purge thread simply compares m_low_limit_no with the smallest trx_no in the history list to determine whether purging is possible.
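The purge decision can be sketched with a min-heap keyed by trx_no. The structures and names here are illustrative, not InnoDB's actual history list:

```python
import heapq

def purgeable(history_heap, m_low_limit_no):
    """Pop every committed trx whose commit no precedes the oldest ReadView."""
    purged = []
    while history_heap and history_heap[0][0] < m_low_limit_no:
        trx_no, undo_recs = heapq.heappop(history_heap)
        purged.append((trx_no, undo_recs))
    return purged

# Committed transactions in the history list, keyed by commit order (trx_no).
history = [(9, ["undo-a"]), (12, ["undo-b"]), (15, ["undo-c"])]
heapq.heapify(history)

result = purgeable(history, m_low_limit_no=13)
print(result)    # trx_no 9 and 12 can be purged; 15 must stay
```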
Garbage collection of historical versions is a major implementation difference between PostgreSQL and MySQL.
In fact, it reflects their different design philosophies. This difference was already visible in the previous post discussing buffer pools.
MySQL tends to favor precise control and ordered structures, such as the LRU list and flush list, which allow it to quickly identify the oldest pages that can be evicted or flushed. Similarly, undo purge maintains ordered historical versions so that the oldest purgeable undo records can be quickly located.
PostgreSQL, on the other hand, tends to rely more on global scanning mechanisms, both in shared buffers and in Vacuum. In the buffer pool case, the cost of global scanning is relatively low because it scans descriptor arrays in memory. However, Vacuum must scan heap and index disk pages (although visibility maps can skip many all-visible pages). For frequently updated tables, the amount of scanning can still be substantial.
The differences mainly come from different trade-offs made during engineering practice. I’ve always believed that database development is the art of trade-offs. So I’m planning a series that compares MySQL and PostgreSQL from the perspective of kernel design and implementation, focusing on the different trade-offs they make when pursuing similar goals.
As the first article in this series, I’ll start with the design and implementation differences of the Buffer Pool.
The Buffer Pool in MySQL and the corresponding module in PostgreSQL (commonly referred to as Shared Buffers) are critical subsystems. Their primary job is to cache on-disk data pages in memory to minimize disk I/O as much as possible, and they are therefore a major factor in relational database performance.
In essence, it is a huge hash table:
In the following sections, I compare MySQL and PostgreSQL buffer pool designs from these aspects:

MySQL’s buffer pool is not backed by a single hash table; it uses multiple hash tables. As illustrated conceptually:
Multiple buf_pool_t instances shard one large buffer pool. Each buf_pool_t maintains its own hash table.
The hash key is (space_id, page_no), identifying a specific page within a data file (tablespace). During lookup:
1. It first hashes (space_id, page_no >> 6) to locate the corresponding buf_pool_t instance. Why page_no >> 6? Because MySQL tries to place 64 consecutive pages under the same space_id into the same buf_pool_t. This helps in two ways:
2. After locating the target buf_pool_t, it computes a hash over the full key (space_id, page_no) to find the target cell in that instance’s hash table.
The hash table stores only pointers to the corresponding page objects (buf_page_t). The actual buf_block_t objects and page frames live in a large memory region.
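The two-step lookup can be sketched as follows. The hash functions and instance count are placeholders, not InnoDB's actual hashing:

```python
N_INSTANCES = 8

def pool_instance(space_id, page_no):
    """Pick a buf_pool_t: 64 consecutive pages map to the same instance."""
    return hash((space_id, page_no >> 6)) % N_INSTANCES

def page_cell(space_id, page_no, n_cells=4096):
    """Inside the chosen instance, hash the full key to find the cell."""
    return hash((space_id, page_no)) % n_cells

# Pages 0..63 of tablespace 5 all land in the same instance:
instances = {pool_instance(5, p) for p in range(64)}
print(len(instances))   # 1
```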

- The memory region is divided into chunks (buf_chunk_t).
- Each chunk holds the control blocks (buf_block_t) for the pages in that chunk.
- The mapping between a buf_block_t and the actual page frame is done via the frame pointer in buf_block_t.

Conceptually (as illustrated):
PostgreSQL also shards the shared buffer mapping, with a similar idea.
It first hashes the key (tablespaceOid, dbOid, relNumber, forkNum, blockNum) to obtain a bucket number.
Then it uses bucket_number >> 8 to locate the directory entry in the first-level mapping, i.e., the segment (dir).
Each segment contains 256 buckets, so after finding the segment, it uses bucket_number % 256 to locate the bucket within the segment.
It then traverses the bucket chain, comparing keys one by one to find the page.
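The two-level addressing (a directory of 256-bucket segments) reduces to two arithmetic steps. A minimal sketch, with the layout simplified to the shift and modulo described above:

```python
SEG_SIZE = 256

def locate(bucket_number):
    """Map a bucket number to (directory entry, slot within the segment)."""
    segment = bucket_number >> 8        # index into the directory (dir)
    slot = bucket_number % SEG_SIZE     # bucket within that segment
    return segment, slot

print(locate(0))      # (0, 0)
print(locate(300))    # (1, 44)
```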
All page frames are stored in one contiguous memory region, as an array: BufferBlocks[].

- Page metadata lives in a parallel array: BufferDescriptors[].
- Each buffer is identified by buf_id, which is the index into both BufferDescriptors[] and BufferBlocks[].

Summary: Both MySQL and PostgreSQL implement fairly standard hash-table-based page lookup; there isn’t a fundamental difference there. The biggest difference is that MySQL splits pages into chunks, which makes it easier to dynamically resize the buffer pool by adding/removing chunks.

MySQL maintains page aging information in a direct way: pages in the hash table are also linked into an LRU doubly-linked list. Each page’s buf_page_t::LRU is the list node that links the page into the LRU list.
Each time a page is found via hash lookup, MySQL moves the page to the head of the LRU list via buf_page_t::LRU. Over time, pages that are not accessed drift toward the tail. When memory is insufficient and an old page must be evicted, the tail provides a fast candidate.
Of course, that is the conceptual LRU behavior. MySQL adds an important optimization because the naive design has a major problem: a table scan pushes a large number of pages into the LRU, wiping out the accumulated hot/cold information. To avoid scan workloads disrupting the LRU, MySQL splits the LRU list.
It maintains a midpoint roughly 37.5% of the list length from the tail:
All new pages loaded from disk are initially inserted at the midpoint, i.e., the head of the old list. Since it is close to the tail, such pages are more likely to be evicted quickly. If a page is accessed again before it is evicted, MySQL does not immediately promote it to the young region. Instead, it records the first access time, and the page’s position stays unchanged. Only when it is accessed again, and the elapsed time since the first access exceeds innodb_old_blocks_time (default 1 second), will it be promoted to the LRU head (young region). As a result, pages introduced by full table scans typically stay in the old area for less than 1 second and are evicted quickly, without polluting the hot working set in the young region.
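The midpoint rule can be modeled with a toy structure. This is a simplification (timing, list handling, and class names are my own; real InnoDB also moves already-young pages around):

```python
OLD_BLOCKS_TIME = 1.0   # seconds: default innodb_old_blocks_time

class MidpointLRU:
    def __init__(self):
        self.young, self.old = [], []
        self.first_access = {}

    def load(self, page, now):
        """New pages enter at the midpoint, i.e. the head of the old list."""
        self.old.insert(0, page)
        self.first_access[page] = now

    def access(self, page, now):
        """Promote to the young head only after innodb_old_blocks_time."""
        if page in self.old:
            if now - self.first_access[page] > OLD_BLOCKS_TIME:
                self.old.remove(page)
                self.young.insert(0, page)
            # else: position unchanged (young-region moves omitted here)

lru = MidpointLRU()
lru.load("scan-page", now=0.0)
lru.access("scan-page", now=0.5)   # too soon: stays in the old region
lru.load("hot-page", now=0.0)
lru.access("hot-page", now=2.0)    # promoted to the young region
print(lru.young, lru.old)          # ['hot-page'] ['scan-page']
```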
When a user thread needs to read a disk page but the buffer pool is full, it evicts an old page from the LRU tail and uses that frame to load the needed page. But eviction is not that trivial. Below is the concrete eviction procedure when a user thread needs a new page:
1. First check the free list; if a free page is available, use it.
2. If try_LRU_scan == true, a partial LRU scan is allowed: scan from the tail forward, at most 100 pages, looking for an evictable page.
3. If the partial scan finds nothing, set try_LRU_scan = false to tell other user threads that partial LRU scanning is ineffective, so they should skip partial scans and go directly to the single-page flush path.
4. Flush a single dirty page from the LRU tail (single-page flush) and reuse its frame.

One more detail worth mentioning: the LRU scan does not always start from the tail for every thread. Each buf_pool_t maintains a global scan cursor lru_scan_itr (type LRUItr). After a thread finishes scanning, it leaves the cursor at its current position, and the next thread continues scanning from there, avoiding multiple threads repeatedly scanning the same region. Only when the cursor is empty/invalid, or still within the old region (meaning the previous scan did not progress far enough), will it be reset back to the tail. In addition, single-page flushing (step 4) uses another independent cursor single_scan_itr; these two cursors do not interfere with each other.

PostgreSQL does not maintain a global LRU list like MySQL does, but that doesn’t mean it does not perform LRU-style eviction. It simply takes another path.
All page metadata lives in the BufferDescriptors[] array. Each BufferDescriptor has two fields representing the current usage state of its corresponding page:
- refcount: how many backends are currently using (pinning) the page
- usage_count: the accumulated number of accesses to the page, capped at 5. (When accessed via a ring-buffer strategy, it is only incremented if it is currently 0, limiting it to 1.)

Whenever a backend accesses a page via the hash table, it increments both refcount and usage_count. When the backend is done with the page, it only decrements refcount. Therefore, usage_count serves as an approximate LRU weight (but not an unbounded one; it stops increasing once it reaches 5).
When a backend tries to load a page from disk but finds no free page, it starts a clock sweep: it traverses BufferDescriptors circularly. If a buffer is not currently used by any backend (refcount == 0), it decrements usage_count (cooling down the LRU weight) and continues sweeping. Eventually it finds a buffer where both refcount == 0 and usage_count == 0, and that buffer becomes the victim for eviction.
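The sweep reduces to a small loop. A minimal sketch of the rules just described (real PostgreSQL does this with atomics on BufferDescriptors; the dict layout is an assumption):

```python
def clock_sweep(descriptors, start=0):
    """Circularly scan until a buffer with refcount==0, usage_count==0."""
    n = len(descriptors)
    i = start
    while True:
        d = descriptors[i]
        if d["refcount"] == 0:
            if d["usage_count"] == 0:
                return i            # victim found
            d["usage_count"] -= 1   # cool down the LRU weight, keep sweeping
        i = (i + 1) % n

bufs = [{"refcount": 1, "usage_count": 5},   # pinned: skipped
        {"refcount": 0, "usage_count": 1},   # cooled down to 0
        {"refcount": 0, "usage_count": 0}]   # victim
victim = clock_sweep(bufs)
print(victim)   # 2
```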
Of course, this alone is still insufficient to prevent LRU pollution from one-time full scans. PostgreSQL has its own optimization: introducing a local ring buffer.
Each backend has its own local ring buffer: essentially a fixed-length array of buffer IDs. A buffer ID points to a page slot in the global BufferDescriptors. The ring buffer limits how many global buffers the backend consumes at once, so eviction is more likely to happen within the ring buffer itself, reducing pollution of the global shared buffers.
More concretely, suppose a backend is performing a sequential scan and the upper layer marks the operation to use the ring buffer. When reading pages via the hash table:
- If the buffer at the ring's current cursor position is not in use (refcount == 0 and usage_count <= 1), it reuses it directly: evict and load the next page into it.
- Otherwise, it falls back to the global BufferDescriptors to find another available buffer to load the next page, and then replaces the current ring entry with the new buffer ID.

Here you can see the different approaches MySQL and PostgreSQL take for the same scenario. MySQL introduces an “old/young” split in the global LRU list as a general strategy to prevent pollution. PostgreSQL’s ring buffer is essentially also an “old area”, but it relies on higher-level operation tagging: only scan-heavy operations such as VACUUM, sequential scan, bulk insert, etc., will use the ring buffer.
Below is the complete procedure PostgreSQL uses to find a free buffer when a backend needs one:
Determine whether to use the ring buffer. If yes, inspect the buffer at the ring’s current cursor position:
a. If it has not been used before, the ring is not full yet, go to step 2.
b. Otherwise the ring is full. If the buffer is not used by any backend (refcount == 0 and usage_count <= 1), it can be reused immediately, return this buffer.
c. If the buffer is used by other backends, fall back to step 2 to find a buffer from the global pool; after success, replace the current ring entry with the newly found buffer ID.
Check the free list. If a buffer is available, return it.
Start clock sweep: traverse from nextVictimBuffer (the current sweep cursor in BufferDescriptors):
- If refcount != 0, skip.
- If usage_count != 0, decrement it (cooling down) and continue.
- Otherwise, return this buffer as the victim, advancing nextVictimBuffer accordingly.

Summary: MySQL and PostgreSQL are similar in essence: both are LRU-like. MySQL chooses to implement an explicit LRU list for more precise eviction, at the cost of additional overhead to maintain the list. PostgreSQL uses reference counting plus usage_count as an approximate LRU, avoiding the locking overhead of maintaining a true LRU list but losing precision. This is the result of different trade-offs. Another notable difference: when a MySQL foreground thread tries to find a free page, it tends to prefer evicting old pages that are not dirty first; PostgreSQL’s sweep does not have an explicit priority between dirty and clean pages in the same sense.
Earlier we mentioned that MySQL user threads and PostgreSQL backends may flush a single dirty page when searching for a free page (single-page flush). However, such foreground single-page flushing is only an emergency measure when no free page is available.
For normal bulk flushing, both MySQL and PostgreSQL have dedicated background threads/processes. The goal is to flush dirty pages in advance and evict old pages so that foreground threads can quickly find free pages.
Background flushing has two goals:


In MySQL (InnoDB), background flushing is performed by page cleaner threads, consisting of one coordinator and N workers.
The coordinator computes the adaptive flush target n_pages, distributes n_pages to all workers, and wakes them up. Each worker is responsible for one buf_pool_t slot; the coordinator itself also works as worker 0. Each worker then performs flush-list flushing for its buf_pool_t slot, as well as LRU flushing, scanning at most srv_LRU_scan_depth pages from the tail to keep the free list filled to roughly srv_LRU_scan_depth.

The computation of n_pages in the coordinator is adaptive: it calculates the flush workload and the target LSN advancement. The logic is as follows:
First, the dirty-page factor (get_pct_for_dirty()). Compute dirty_pct, the percentage of dirty pages in the buffer pool:

- If innodb_max_dirty_pages_pct_lwm (low watermark) is set and dirty_pct >= lwm, start progressive flushing and return the percentage of io_capacity as:
dirty_pct * 100 / (max_dirty_pages_pct + 1)
- If dirty_pct >= innodb_max_dirty_pages_pct (high watermark), flush at 100% io_capacity.

Second, the redo-age factor (get_pct_for_lsn(age)). Compute the checkpoint age:

age = current_lsn - oldest_lsn
- If age < innodb_adaptive_flushing_lwm (default 10% of redo log capacity), no adaptive flushing is needed (return 0).
- If age exceeds the low watermark, compute:
age_factor = age * 100 / limit_for_dirty_page_age
and return the percentage of io_capacity as:
(max_io_capacity / io_capacity) * age_factor * sqrt(age_factor) / 7.5

This is a super-linear growth curve: as redo space approaches exhaustion, flushing ramps up aggressively.
Finally, combine the two factors (set_flush_target_by_lsn()). Take:
pct_total = max(pct_for_dirty, pct_for_lsn)
Then compute the target LSN:
target_lsn = oldest_lsn + lsn_avg_rate * 3
(i.e., advance by 3× the recent average redo generation rate; buf_flush_lsn_scan_factor = 3)
Then traverse each buffer pool instance’s flush list and count the number of pages whose oldest_modification <= target_lsn. Call this number pages_for_lsn (pages that must be flushed to advance checkpoint to target_lsn).
Finally, take the average of three estimates:
n_pages = (PCT_IO(pct_total) + page_avg_rate + pages_for_lsn) / 3
Where:
- PCT_IO(pct_total) is the I/O demand estimated from the dirty ratio / redo age.
- page_avg_rate is the recent actual average flushing rate (a moving average across multiple iterations).
- pages_for_lsn is the precise demand obtained from scanning the flush list.

Averaging these three makes the flushing rate smoother and avoids abrupt oscillation. n_pages is capped by srv_max_io_capacity.
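The arithmetic can be illustrated with a small sketch. It restates the formulas from the text; all concrete numbers are invented for illustration, and `pct_for_lsn` / `n_pages` are my names, not InnoDB's exact functions:

```python
import math

def pct_for_lsn(age, limit_for_dirty_page_age, max_io_capacity, io_capacity,
                adaptive_flushing_lwm):
    """Redo-age factor: super-linear in age once past the low watermark."""
    if age < adaptive_flushing_lwm:
        return 0
    age_factor = age * 100 / limit_for_dirty_page_age
    return (max_io_capacity / io_capacity) * age_factor \
        * math.sqrt(age_factor) / 7.5

def n_pages(pct_io_total, page_avg_rate, pages_for_lsn, max_io_capacity):
    """Average the three estimates, capped by srv_max_io_capacity."""
    return min((pct_io_total + page_avg_rate + pages_for_lsn) // 3,
               max_io_capacity)

pct = pct_for_lsn(age=600, limit_for_dirty_page_age=1000,
                  max_io_capacity=400, io_capacity=200,
                  adaptive_flushing_lwm=100)
print(pct)   # grows super-linearly with age

pages = n_pages(pct_io_total=200, page_avg_rate=150, pages_for_lsn=250,
                max_io_capacity=400)
print(pages)   # 200
```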
If redo pressure is high (pct_for_lsn > 30), the per-instance flush quota is weighted by how many pages in each instance’s flush list need flushing; otherwise, it is evenly distributed across instances.
When redo log space is extremely tight (checkpoint cannot keep up with redo generation), log_sync_flush_lsn() returns non-zero and the coordinator enters sync flush mode:
In this mode, n_pages is set directly to pages_for_lsn (no averaging), with a lower bound of srv_io_capacity.

When the server is idle (no user activity) and the 1-second sleep times out, the coordinator does not run the adaptive algorithm. Instead, it flushes in the background using innodb_idle_flush_pct percent of innodb_io_capacity (default 100%), keeping the buffer pool clean.
PostgreSQL also has both LRU flush and checkpoint flush, but unlike MySQL’s unified page cleaner, PostgreSQL separates responsibilities:
- bgwriter handles LRU flush
- checkpointer handles checkpoint flush
The goal of bgwriter is to predict the upcoming demand for free buffers based on historical and current pressure, and try to free enough buffers before backends are forced into heavy clock sweep work (i.e., flush dirty pages that are otherwise reusable victims).
The overall flow:
Collect historical info from clock sweep, including:
strategy_buf_id: the current backend clock sweep positionstrategy_passes: how many full sweeps have been completedrecent_alloc: how many buffers have been allocated by backends since the last bgwriter recycleCompare bgwriter’s current position next_to_clean with clock sweep’s strategy_buf_id, and determine how far ahead it is:
bufs_to_lap: number of buffers bgwriter must scan for next_to_clean to “lap” (catch up to) strategy_buf_id.
- bufs_to_lap is the remaining distance for next_to_clean to lap strategy_buf_id.
- If clock sweep has overtaken bgwriter, reset next_to_clean to strategy_buf_id and set bufs_to_lap = NBuffers, effectively resetting bgwriter.
- bufs_to_lap may be negative, meaning bgwriter has scanned everything it can scan; no need to scan in this round.
- bufs_ahead = NBuffers - bufs_to_lap (how many buffers bgwriter is ahead of the sweep)

Based on the history above, compute how many buffers clock sweep needs to scan to find one free buffer, i.e. scans_per_alloc. Maintain an exponential moving average:
smoothed_density += (scans_per_alloc - smoothed_density) / 16;
Maintain smoothed_alloc similarly:
- If smoothed_alloc < recent_alloc, set smoothed_alloc = recent_alloc (fast attack).
- Otherwise, smoothed_alloc += (recent_alloc - smoothed_alloc) / 16; (slow decay)

Compute the prediction for the next round:
- upcoming_alloc_est = smoothed_alloc * bgwriter_lru_multiplier (predict upcoming allocations)
- reusable_buffers_est = bufs_ahead / smoothed_density
- min_scan_buffers = NBuffers / (120s / 200ms)
Then:
upcoming_alloc_est = max(upcoming_alloc_est, min_scan_buffers + reusable_buffers_est)

This "minimum progress" ensures that even if the system is idle, bgwriter will scan the entire buffer pool in about 120 seconds, continuously cleaning dirty pages.
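Putting the smoothing and prediction formulas above together, here is a simplified sketch (field and parameter names mirror the description, not the actual PostgreSQL source):

```c
#include <assert.h>

/* Sketch of bgwriter's prediction step; names follow the text above. */
typedef struct {
    double smoothed_density;   /* EMA of scans needed per allocation */
    double smoothed_alloc;     /* EMA of recent allocations */
} BgwState;

static int predict_upcoming_alloc(BgwState *st,
                                  double scans_per_alloc,
                                  int recent_alloc,
                                  int bufs_ahead,
                                  int nbuffers,
                                  double lru_multiplier)
{
    /* slow-moving EMA of buffer-scan density */
    st->smoothed_density += (scans_per_alloc - st->smoothed_density) / 16;

    /* fast attack, slow decay for the allocation rate */
    if (st->smoothed_alloc < recent_alloc)
        st->smoothed_alloc = recent_alloc;
    else
        st->smoothed_alloc += (recent_alloc - st->smoothed_alloc) / 16;

    double upcoming_alloc_est = st->smoothed_alloc * lru_multiplier;
    double reusable_buffers_est = bufs_ahead / st->smoothed_density;
    /* minimum progress: full sweep in ~120s at one round per 200ms */
    double min_scan_buffers = nbuffers / (120.0 / 0.2);

    if (upcoming_alloc_est < min_scan_buffers + reusable_buffers_est)
        upcoming_alloc_est = min_scan_buffers + reusable_buffers_est;
    return (int)upcoming_alloc_est;
}
```

The fast-attack branch matters: a sudden burst of allocations raises the estimate immediately, while quiet periods only decay it gradually.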
Scan from next_to_clean. For each buffer, bgwriter only considers buffers with refcount == 0 and usage_count == 0 (truly reusable candidates). It skips buffers in use or recently used. If a candidate is dirty, it flushes it synchronously. Stop scanning when any of these is met:
- bufs_to_lap reaches 0 (caught up to clock sweep)
- reusable_buffers reaches upcoming_alloc_est (freed enough reusable buffers)
- num_written reaches bgwriter_lru_maxpages (default 100), to avoid excessive I/O in one round

After one scan round, bgwriter sleeps for bgwriter_delay (default 200ms) before the next iteration. If bufs_to_lap == 0 and recent_alloc == 0 (no allocation activity), bgwriter enters hibernation and sleeps longer, until a backend needing buffers wakes it via its latch.
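The scan round and its three stop conditions can be modeled roughly like this (a toy model with assumed field names, not the real buffer descriptor layout):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a buffer descriptor. */
typedef struct { int refcount; int usage_count; int dirty; } Buf;

enum { BGWRITER_LRU_MAXPAGES = 100 };

/* Returns the number of buffers "flushed" in one round. */
static int bgwriter_scan_round(Buf *bufs, size_t nbuffers,
                               size_t *next_to_clean,
                               int bufs_to_lap,
                               int upcoming_alloc_est)
{
    int num_written = 0, reusable = 0;
    while (bufs_to_lap-- > 0 &&                      /* stop: caught up      */
           reusable < upcoming_alloc_est &&          /* stop: freed enough   */
           num_written < BGWRITER_LRU_MAXPAGES) {    /* stop: I/O budget hit */
        Buf *b = &bufs[*next_to_clean];
        *next_to_clean = (*next_to_clean + 1) % nbuffers;
        if (b->refcount == 0 && b->usage_count == 0) {
            /* truly reusable candidate; flush synchronously if dirty */
            if (b->dirty) { b->dirty = 0; num_written++; }
            reusable++;
        } /* else: in use or recently used, skip */
    }
    return num_written;
}
```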

The goal of checkpointer is to flush all dirty pages up to a consistency point, forming a checkpoint. This advances WAL recycling and reduces how much WAL must be replayed during crash recovery. Unlike bgwriter, checkpointer does not care whether a page was recently used; it must flush all pages that were dirty at checkpoint start.
Trigger conditions: in the main loop, checkpointer triggers a checkpoint when any of the following occurs:
- checkpoint_timeout elapses (default 5 minutes)
- WAL grows toward max_wal_size and backends notify checkpointer
- a manual CHECKPOINT command

Detailed procedure:
Scan and collect dirty buffers: traverse all NBuffers BufferDescriptors. For each dirty page, set the BM_CHECKPOINT_NEEDED flag, and collect its identity info into CkptBufferIds[] (tablespace OID, relation number, fork number, block number, etc.).
Note: only pages that are already dirty at checkpoint start are included. Pages that become dirty during the checkpoint are not included and will be handled in the next checkpoint.
Sort: sort CkptBufferIds[] by (tablespace, relation, fork, block). This clusters pages from the same file and orders them by increasing block number, converting random I/O into more sequential patterns as much as possible.
Build tablespace-level progress tracking: traverse the sorted array and group by tablespace. For each tablespace, build a CkptTsStatus structure tracking total pages to flush and current progress. Put all tablespaces into a binary heap (min-heap), ordered by flush progress.
Balanced flushing across tablespaces: repeatedly pop the tablespace with the lowest progress from the heap, flush its next dirty page (via SyncOneBuffer), update its progress, then re-heapify.
The purpose is to spread writes evenly across tablespaces (possibly on different disks), instead of flushing one tablespace completely before another.
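A minimal illustration of the balanced-flushing idea (the real checkpointer keeps the CkptTsStatus entries in a binary min-heap; this sketch uses a linear minimum scan for brevity):

```c
#include <assert.h>
#include <stddef.h>

/* Per-tablespace flush progress, analogous to CkptTsStatus. */
typedef struct { int to_flush; int flushed; } TsProgress;

/* Pick the tablespace with the lowest relative progress; flushing its next
 * dirty page and re-picking spreads writes evenly across tablespaces. */
static int pick_lowest_progress(TsProgress *ts, size_t n)
{
    int best = -1;
    double best_p = 2.0;   /* progress is always in [0, 1] */
    for (size_t i = 0; i < n; i++) {
        if (ts[i].flushed >= ts[i].to_flush)
            continue;                              /* tablespace done */
        double p = (double)ts[i].flushed / ts[i].to_flush;
        if (p < best_p) { best_p = p; best = (int)i; }
    }
    return best;   /* -1 when every tablespace is fully flushed */
}
```

Each SyncOneBuffer call would be followed by incrementing `flushed` for the chosen tablespace and picking again, which is exactly what the heap pop/re-heapify cycle achieves more efficiently.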
Unlike bgwriter, checkpointer calls SyncOneBuffer with skip_recently_used = false, meaning it will flush buffers with BM_CHECKPOINT_NEEDED regardless of recent usage.
Write throttling: after flushing each page, call CheckpointWriteDelay() to throttle. The goal is to finish flushing within:
checkpoint_completion_target (default 0.9) × checkpoint_timeout.
The logic compares:
the actual flush progress against the target progress for the elapsed time; if the checkpoint is ahead of schedule (IsCheckpointOnSchedule == true), sleep 100ms. This spreads checkpoint I/O across the entire checkpoint window and avoids I/O spikes.
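The pacing comparison might look like this in simplified form (names and signature are illustrative; the real IsCheckpointOnSchedule also accounts for WAL-volume-based progress, not only elapsed time):

```c
#include <assert.h>

/* Ahead of schedule if the fraction of pages already written exceeds the
 * fraction of the (target-scaled) checkpoint window that has elapsed. */
static int is_checkpoint_on_schedule(int pages_written, int pages_total,
                                     double elapsed_secs,
                                     double checkpoint_timeout_secs,
                                     double completion_target)
{
    double done = (double)pages_written / pages_total;
    double budget = completion_target * checkpoint_timeout_secs;
    double elapsed_frac = elapsed_secs / budget;
    return done > elapsed_frac;   /* true => throttle (sleep 100ms) */
}
```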
Writeback coalescing: if not using O_DIRECT, similar to bgwriter, use WritebackContext to collect tags for flushed pages. After accumulating enough, batch-call IssuePendingWritebacks(), sort and coalesce adjacent blocks, and use posix_fadvise to hint the kernel to write back OS cache pages to disk. After checkpoint completion, force one more IssuePendingWritebacks() to ensure all pending writebacks are issued.
Summary: Although the implementations differ significantly, both MySQL and PostgreSQL aim to pre-clean pages in the background so that foreground threads can quickly find free pages. PostgreSQL’s bgwriter predicts upcoming buffer allocation demand from foreground activity; MySQL’s page cleaner reacts to dirty page pressure and redo log age.
From an engineering perspective, their differences largely come down to the trade-off between linked lists and arrays.
To better see what actually happens inside the data files, I’ve added a new feature to ibdNinja, an interactive BLOB inspection mode:
--inspect-blob
This feature is designed as an extension of ibdNinja’s existing inspection workflow, allowing you to drill down from high-level structures to the actual BLOB data stored on disk.
Use ibdNinja’s existing features to parse, extract, and print information from a MySQL .ibd file at the table, index, page, and record levels.
Once you’ve located a record you want to dive deeper into, note its page number and record number.
Pass those identifiers to --inspect-blob:
ibdNinja -f <table.ibd> --inspect-blob <page_no>,<record_no>
to start an interactive inspection of the BLOB field in that record.

As shown above, ibdNinja will:
From there, you can choose any version and:
If some historical versions have already been purged, ibdNinja will detect that and clearly report it.
If you’re into MySQL data file internals, or knee-deep in development, debugging, or production issues, give ibdNinja a try, dig under the hood — and consider bug reports part of the feature set.
Table with 200K rows, a VARCHAR(700) unique key (latin1), creating a tall B-tree:
CREATE TABLE t1 (
id INT PRIMARY KEY AUTO_INCREMENT,
uk_col VARCHAR(700) NOT NULL,
UNIQUE KEY uk_idx (uk_col)
) ENGINE=InnoDB CHARACTER SET latin1;
Search1: ~7,272 ns
Inline: ~3,118 ns
Improvement: ~29.1% reduction in search-path time
This test focuses specifically on the unique index insertion path (row_ins_sec_index_entry_low()), comparing the cost of the original three searches with the optimized “one search + inline scan” approach. In this local scope, the saving is close to 30%, which matches the intuition of collapsing three tree traversals into one.
In a single-row insert, how large is this part relative to the whole insert path? If its share is small, the end-to-end gain will be diluted. In my tests, when measuring the full insert path, the improvement drops to single-digit percentages.
Under concurrent workloads, each of the three B-tree searches holds page latches. This is one of the key factors affecting scalability. Reducing this section by ~30% also shortens latch holding time, so the benefit may be more visible in parallel scenarios.
While implementing the POC, I also realized that this optimization is not a silver bullet. There are cases that still need to fall back to the original path, although there are ways to minimize how often that happens.
These are just the numbers from a quick POC. If this direction turns out to be meaningful, it would still require much more careful design, implementation, and testing.
Before going into the details, I would like to briefly introduce two important concepts that are closely related to this topic.
MySQL supports snapshot reads. Each read transaction reads data based on a certain snapshot, so even if other write transactions modify the data during the execution of a read transaction, the read transaction will always see the version it is supposed to see.
The underlying mechanism is that a write transaction directly updates the data in place on the primary key record. However, before the update happens, the old value of the field to be modified is copied into the undo space. At the same time, there is a ROLL_PTR field in the row that points to the exact location in the undo space where the old value (the undo log record) is stored.

As shown in the figure above, there is a row in the primary key index that contains three fields. Suppose a write transaction is modifying Field 2. It will first copy the original value of Field 2 into the undo space, and then overwrite Field 2 directly in the row. After that, two important system fields of the row are updated:
After the update is finished, if a previously existing read transaction reads this row again, it will find, based on the TRX_ID, that the row has been modified by a later write transaction. Therefore, the current version of the row is not visible to this read transaction. It must roll back to the previous version. At this point, it uses the ROLL_PTR to locate the old value in the undo space, applies it to the current row, and thus reconstructs the version that it is supposed to see.
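As a toy illustration of the visibility decision described above (a heavy simplification of InnoDB's actual ReadView logic; names are mine):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A snapshot taken when the read transaction started. */
typedef struct {
    uint64_t up_limit_id;    /* txn ids below this are definitely visible   */
    uint64_t low_limit_id;   /* txn ids at/above this are definitely hidden */
    const uint64_t *active;  /* txns that were running at snapshot time     */
    size_t n_active;
} ReadView;

/* Is a row version written by trx_id visible to this snapshot?
 * If not, the reader follows ROLL_PTR into the undo space and retries
 * with the older version's TRX_ID. */
static int trx_visible(const ReadView *v, uint64_t trx_id)
{
    if (trx_id < v->up_limit_id) return 1;
    if (trx_id >= v->low_limit_id) return 0;
    for (size_t i = 0; i < v->n_active; i++)
        if (v->active[i] == trx_id) return 0;  /* was still running: hidden */
    return 1;
}
```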
The primary key record in MySQL contains the values of all fields and is stored in the clustered index. However, BLOB columns are an exception. Since they are usually very large, MySQL stores their data in separate data pages called external pages.
A BLOB value is split into multiple parts and stored sequentially across multiple external pages. These pages are linked together in order, like a linked list. So how does the primary key record locate the corresponding BLOB data stored in those external pages? For each BLOB column, the clustered record stores a reference (lob::ref_t). This ref_t contains some metadata about the column and a pointer to the first external page where the BLOB data starts.
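A rough sketch of what such a reference carries (field names here are hypothetical; InnoDB's on-disk lob::ref_t packs similar information: space id, first external page, version, length):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout of an external-BLOB reference stored in the
 * clustered record, in the spirit of lob::ref_t. */
typedef struct {
    uint32_t space_id;   /* tablespace holding the external pages        */
    uint32_t page_no;    /* first external page of the BLOB chain        */
    uint32_t version;    /* base version number used for partial updates */
    uint64_t length;     /* total BLOB length in bytes                   */
} blob_ref_t;

/* How many external pages a BLOB of `length` bytes occupies, assuming a
 * hypothetical fixed payload per page. */
static uint32_t blob_pages_needed(uint64_t length, uint32_t payload_per_page)
{
    return (uint32_t)((length + payload_per_page - 1) / payload_per_page);
}
```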

When reading the row, MySQL first locates the row via the primary key index, then follows this reference to find the external pages and reconstructs the full BLOB value by copying the data from those pages.
This is a very straightforward and intuitive design, simple and sufficient. It is also exactly how BLOB was implemented in older versions of MySQL.
Based on the two points above, here is a question:
How is MVCC implemented for BLOB in MySQL?
The intuitive answer is as follows: the lob::ref_t stored in the primary key record follows the same MVCC rules. Every time a BLOB column is updated, the old BLOB value is read out, modified, and then the entire modified BLOB is written into newly allocated external pages. The corresponding lob::ref_t in the primary key record is overwritten with the new reference. At the same time, following the MVCC mechanism, the old lob::ref_t is copied into the undo space.

After the modification, the situation looks like this (as shown in the figure): the undo space stores the lob::ref_t that points to the old BLOB value, while the lob::ref_t in the primary key record points to the new value.
This is exactly how older versions of MySQL worked. The next question is:
What are the pros and cons of this design?
The advantage is that the undo log only needs to record the lob::ref_t, and it does not need to store the entire old BLOB value.
The disadvantage is that no matter how small the change to the BLOB is, even if only a single byte is modified, the entire modified BLOB still has to be written into newly allocated external pages. BLOB columns are usually very large, so if each update only changes a very small portion, this design introduces a lot of extra I/O and space overhead.
A typical example is JSON. Internally, MySQL stores JSON as BLOB. Usually, updates to JSON are local and small. However, with the old design, each small partial update still requires reading the entire JSON, modifying a part of it, and then inserting the whole value back again. This is obviously very heavy.
So how to solve this problem? MySQL introduced BLOB partial update to address it.
MySQL optimized the format of the external pages used to store BLOB data and redesigned the original simple linked-list structure:

Then the question becomes: how can MySQL make sure that it can read the correct new and old BLOB values? The answer is that the new external page and the old external page share the same logical position in the index entry list. In other words, at this specific position in the list, there are now two versions, version 1 and version 2. Which one is used is determined by the version number recorded in the current lob::ref_t. The idea is illustrated in the figure below.

In summary, MySQL transforms the original external-page linked list into a linked list of index entries. For each index entry in this list, if the corresponding external page is modified, a new version of the index entry is created at the same horizontal position to point to the new version of that external page. Essentially, this introduces multi-versioning for external pages.
The implementation described above is not the whole story. MySQL makes a practical trade-off between creating a new index entry (which requires copying the entire external page) and copying only the modified portion into the undo space.
For BLOB small-change scenarios, when the modification to a blob is smaller than 100 bytes, MySQL does not create a new index entry and link it into the version chain for that page. Instead, it modifies the page in place. Following MVCC principles, the portion to be modified is first written into the undo space before the in-place update happens.

It is worth noting that in this case, the lob::ref_t stored in the primary key record does not advance its base version number. It shares the same base as the previous version. When a read transaction needs to read the previous version, it first constructs the latest BLOB value based on the lob::ref_t and the index entry list. Then, following the MVCC logic, it finds that the TRX_ID indicates that this version is not visible. At this point, it follows the ROLL_PTR to the undo space, where the old value of the modified external page is stored. By applying that old data back onto the current value, the complete and correct historical BLOB value can be reconstructed.
In this scenario, the recovery process is a combination of two steps: rolling back through the index-entry version list to the visible version, and applying the old page data saved in the undo space onto the current value.
Index entries are the key to the implementation of BLOB partial update. To make them easier to understand, I drew the following diagram to illustrate the logical relationships among index entries. It is a two-dimensional linked list. The horizontal dimension represents the sequential position when assembling the full BLOB value. The vertical dimension represents multiple versions at the same position. Each time the page at that position is modified, a new node is added vertically.

Of course, this is only a logical model. The physical layout is not organized exactly like this. Each BLOB has a BLOB first page. This page stores a portion of the BLOB data (the initial part) and 10 index entries. Each index entry corresponds to one BLOB data page. When all 10 index entries are used up, a new BLOB index page is allocated, and additional index entries are allocated from there. In reality, the index entries distributed across the BLOB first page and the BLOB index pages are linked together to form the logical structure shown in the diagram above.
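The two-dimensional index-entry list can be modeled as follows (a logical sketch with invented names, not InnoDB's actual structures):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* `next` advances horizontally (sequential position within the BLOB);
 * `older` walks vertically through prior versions at the same position. */
typedef struct IndexEntry {
    uint32_t data_page_no;     /* BLOB data page this entry points to */
    uint32_t version;          /* version that created this entry     */
    struct IndexEntry *next;   /* next position in the BLOB           */
    struct IndexEntry *older;  /* previous version at this position   */
} IndexEntry;

/* At one position, pick the newest entry whose version is no newer than
 * the version recorded in the lob::ref_t being read. */
static IndexEntry *visible_version(IndexEntry *e, uint32_t ref_version)
{
    while (e && e->version > ref_version)
        e = e->older;
    return e;
}
```

Assembling the full BLOB then means walking `next` horizontally and, at each position, calling something like `visible_version` with the reader's version.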

Here, as I mentioned in previous posts, MariaDB and pgvector take different approaches:
-ftree-vectorize) for vectorization.

To better understand the benefits of SIMD vectorization, and to compare these two approaches, I ran a series of benchmarks — and discovered some surprising performance results along the way.
Environment
Method
First, I implemented 4 different squared L2 distance (L2sq) functions (i.e., Euclidean distance without the square root):
static inline double l2sq_naive_f32(const float* a, const float* b, size_t n) {
float acc = 0.f;
for (size_t i = 0; i < n; ++i) { float d = a[i] - b[i]; acc += d * d; }
return (double)acc;
}
static inline double l2sq_naive_f64(const float* a, const float* b, size_t n) {
double acc = 0.0;
for (size_t i = 0; i < n; ++i) { double d = (double)a[i] - (double)b[i]; acc += d * d; }
return acc;
}
// Reference: simSIMD
SIMSIMD_PUBLIC void simsimd_l2sq_f32_haswell(simsimd_f32_t const *a,
simsimd_f32_t const *b,
simsimd_size_t n,
simsimd_distance_t *result) {
__m256 d2_vec = _mm256_setzero_ps();
simsimd_size_t i = 0;
for (; i + 8 <= n; i += 8) {
__m256 a_vec = _mm256_loadu_ps(a + i);
__m256 b_vec = _mm256_loadu_ps(b + i);
__m256 d_vec = _mm256_sub_ps(a_vec, b_vec);
d2_vec = _mm256_fmadd_ps(d_vec, d_vec, d2_vec);
}
simsimd_f64_t d2 = _simsimd_reduce_f32x8_haswell(d2_vec);
for (; i < n; ++i) {
float d = a[i] - b[i];
d2 += d * d;
}
*result = d2;
}
SIMSIMD_INTERNAL simsimd_f64_t _simsimd_reduce_f32x8_haswell(__m256 vec) {
// Convert the lower and higher 128-bit lanes of the input vector to double precision
__m128 low_f32 = _mm256_castps256_ps128(vec);
__m128 high_f32 = _mm256_extractf128_ps(vec, 1);
// Convert single-precision (float) vectors to double-precision (double) vectors
__m256d low_f64 = _mm256_cvtps_pd(low_f32);
__m256d high_f64 = _mm256_cvtps_pd(high_f32);
// Perform the addition in double-precision
__m256d sum = _mm256_add_pd(low_f64, high_f64);
return _simsimd_reduce_f64x4_haswell(sum);
}
SIMSIMD_INTERNAL simsimd_f64_t _simsimd_reduce_f64x4_haswell(__m256d vec) {
// Reduce the double-precision vector to a scalar
// Horizontal add the first and second double-precision values, and third and fourth
__m128d vec_low = _mm256_castpd256_pd128(vec);
__m128d vec_high = _mm256_extractf128_pd(vec, 1);
__m128d vec128 = _mm_add_pd(vec_low, vec_high);
// Horizontal add again to accumulate all four values into one
vec128 = _mm_hadd_pd(vec128, vec128);
// Convert the final sum to a scalar double-precision value and return
return _mm_cvtsd_f64(vec128);
}
// Reference: simSIMD
SIMSIMD_PUBLIC void simsimd_l2sq_f32_skylake(simsimd_f32_t const *a,
simsimd_f32_t const *b,
simsimd_size_t n,
simsimd_distance_t *result) {
__m512 d2_vec = _mm512_setzero();
__m512 a_vec, b_vec;
simsimd_l2sq_f32_skylake_cycle:
if (n < 16) {
__mmask16 mask = (__mmask16)_bzhi_u32(0xFFFFFFFF, n);
a_vec = _mm512_maskz_loadu_ps(mask, a);
b_vec = _mm512_maskz_loadu_ps(mask, b);
n = 0;
}
else {
a_vec = _mm512_loadu_ps(a);
b_vec = _mm512_loadu_ps(b);
a += 16, b += 16, n -= 16;
}
__m512 d_vec = _mm512_sub_ps(a_vec, b_vec);
d2_vec = _mm512_fmadd_ps(d_vec, d_vec, d2_vec);
if (n) goto simsimd_l2sq_f32_skylake_cycle;
*result = _simsimd_reduce_f32x16_skylake(d2_vec);
}
......
I generated a dataset of 10,000 float vectors (dimension = 1024, 64B aligned) and one target vector. Then, for the following 5 scenarios, I searched for the vector with the closest L2sq distance to the target. Each distance computation was repeated 16 times (to create a CPU-intensive workload), and each scenario was executed 5 times, taking the median runtime to eliminate random fluctuations:
Compile with compiler vectorization disabled (-fno-tree-vectorize -fno-builtin -fno-lto -Wno-cpp -Wno-pragmas) and run the 5 scenarios.
Compile with AVX2 (-O3 -mavx2 -mfma -mf16c -mbmi2) and run the 5 scenarios.
Compile with AVX-512 (-O3 -mavx512f -mavx512dq -mavx512bw -mavx512vl -mavx512cd -mfma -mf16c -mbmi2) and run the 5 scenarios again.

SIMD L2sq implementations are much faster than others, and AVX-512 outperforms AVX2 since it processes 16 dimensions at once instead of 8.
Under AVX2, naive L2sq (178.385ms) is faster than naive high-precision L2sq (183.973ms), because the latter incurs float→double conversion overhead.
Under both AVX2 and AVX-512, naive implementations with compiler vectorization disabled perform the worst, since they are forced into scalar execution.
In addition to the expected results above, some surprising findings appeared:
Both deserve deeper analysis.
(1) Why was naive L2sq with AVX-512 slower than with AVX2?
Although this was a naive implementation, with -O3 we would expect the compiler to auto-vectorize. However, the vectorized result generated by the compiler was far worse than our manual SIMD implementation, and AVX-512 even performed worse than AVX2.
To investigate further, I used objdump to examine the AVX2 and AVX-512 binaries for l2sq_naive_f32().
Under AVX2:
0000000000007090 <_ZL19l2sq_naive_f32PKfS0_m>:
... ...
70b7: 48 c1 ee 03 shr rsi,0x3
70bb: 48 c1 e6 05 shl rsi,0x5
70bf: 90 nop
70c0: c5 fc 10 24 07 vmovups ymm4,YMMWORD PTR [rdi+rax*1]
70c5: c5 dc 5c 0c 01 vsubps ymm1,ymm4,YMMWORD PTR [rcx+rax*1]
70ca: 48 83 c0 20 add rax,0x20
70ce: c5 f4 59 c9 vmulps ymm1,ymm1,ymm1
70d2: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1
70d6: c5 f0 c6 d9 55 vshufps xmm3,xmm1,xmm1,0x55
70db: c5 f0 c6 d1 ff vshufps xmm2,xmm1,xmm1,0xff
70e0: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3
70e4: c5 f0 15 d9 vunpckhps xmm3,xmm1,xmm1
70e8: c4 e3 7d 19 c9 01 vextractf128 xmm1,ymm1,0x1
70ee: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3
70f2: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2
70f6: c5 f0 c6 d1 55 vshufps xmm2,xmm1,xmm1,0x55
70fb: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1
70ff: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2
7103: c5 f0 15 d1 vunpckhps xmm2,xmm1,xmm1
7107: c5 f0 c6 c9 ff vshufps xmm1,xmm1,xmm1,0xff
710c: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2
7110: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1
... ...
The compiler did use vector instructions (vmovups, vsubps, vmulps) to compute L2sq in groups of 8 floats. But when folding the 8 results horizontally into xmm0, it extracted elements using vshufps, vunpckhps, vextractf128, etc., and then added them one by one with scalar vaddss. Worse, this folding happened in every iteration.

This per-iteration horizontal reduction became the bottleneck. Instead, like the manual SIMD implementation, it should have accumulated vector results across the whole loop and performed just one horizontal reduction at the end.
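The same fix can be shown in portable scalar C: keep several independent accumulators alive across the whole loop and fold them exactly once at the end. This mimics what the hand-written SIMD version does with a single vector register of partial sums (the code below is my illustration, not taken from simSIMD):

```c
#include <assert.h>
#include <stddef.h>

/* L2sq with 8 independent accumulators (one "lane" each) and a single
 * horizontal reduction after the loop, instead of folding per iteration. */
static double l2sq_multi_acc(const float *a, const float *b, size_t n)
{
    float acc[8] = {0};
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (size_t j = 0; j < 8; j++) {
            float d = a[i + j] - b[i + j];
            acc[j] += d * d;               /* per-lane accumulation only */
        }
    float tail = 0.f;                      /* scalar remainder, like simSIMD */
    for (; i < n; ++i) { float d = a[i] - b[i]; tail += d * d; }

    double sum = tail;                     /* one horizontal reduction */
    for (size_t j = 0; j < 8; j++) sum += acc[j];
    return sum;
}
```

Note that this changes the order of floating-point additions relative to the strict left-to-right loop, which is precisely the reassociation the compiler is not allowed to perform under strict IEEE semantics (discussed further below in part (3)).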
Under AVX-512:
a057: 48 c1 ee 04 shr rsi,0x4
a05b: 48 c1 e6 06 shl rsi,0x6
a05f: 90 nop
a060: 62 f1 7c 48 10 2c 07 vmovups zmm5,ZMMWORD PTR [rdi+rax*1]
a067: 62 f1 54 48 5c 0c 01 vsubps zmm1,zmm5,ZMMWORD PTR [rcx+rax*1]
a06e: 48 83 c0 40 add rax,0x40
a072: 62 f1 74 48 59 c9 vmulps zmm1,zmm1,zmm1
a078: c5 f0 c6 e1 55 vshufps xmm4,xmm1,xmm1,0x55
a07d: c5 f0 c6 d9 ff vshufps xmm3,xmm1,xmm1,0xff
a082: 62 f3 75 28 03 d1 07 valignd ymm2,ymm1,ymm1,0x7
a089: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1
a08d: c5 fa 58 c4 vaddss xmm0,xmm0,xmm4
a091: c5 f0 15 e1 vunpckhps xmm4,xmm1,xmm1
a095: c5 fa 58 c4 vaddss xmm0,xmm0,xmm4
a099: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3
a09d: 62 f3 7d 28 19 cb 01 vextractf32x4 xmm3,ymm1,0x1
a0a4: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3
a0a8: 62 f3 75 28 03 d9 05 valignd ymm3,ymm1,ymm1,0x5
a0af: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3
a0b3: 62 f3 75 28 03 d9 06 valignd ymm3,ymm1,ymm1,0x6
a0ba: 62 f3 7d 48 1b c9 01 vextractf32x8 ymm1,zmm1,0x1
a0c1: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3
a0c5: c5 f0 c6 d9 55 vshufps xmm3,xmm1,xmm1,0x55
a0ca: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2
a0ce: c5 f0 c6 d1 ff vshufps xmm2,xmm1,xmm1,0xff
a0d3: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1
a0d7: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3
a0db: c5 f0 15 d9 vunpckhps xmm3,xmm1,xmm1
a0df: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3
a0e3: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2
a0e7: 62 f3 7d 28 19 ca 01 vextractf32x4 xmm2,ymm1,0x1
a0ee: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2
a0f2: 62 f3 75 28 03 d1 05 valignd ymm2,ymm1,ymm1,0x5
a0f9: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2
a0fd: 62 f3 75 28 03 d1 06 valignd ymm2,ymm1,ymm1,0x6
a104: 62 f3 75 28 03 c9 07 valignd ymm1,ymm1,ymm1,0x7
a10b: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2
a10f: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1
The first part similarly used vector instructions to compute 16 values at a time. But folding 16 results was even more complex and expensive, involving vshufps, valignd, vunpckhps, vextractf32x4, vextractf32x8, etc. This additional complexity canceled out the gains from processing 16 dimensions per iteration, which explains why AVX-512 was slower.
(2) Why was naive float L2sq slower than naive high-precision L2sq under AVX-512?
Theoretically, high-precision L2sq should be slower because of float→double conversions. So why was it faster?
Looking at the disassembly of l2sq_naive_f64:
000000000000a280 <_ZL19l2sq_naive_f64PKfS0_m>:
a280: f3 0f 1e fa endbr64
a284: 48 85 d2 test rdx,rdx
a287: 74 37 je a2c0 <_ZL19l2sq_naive_f64_oncePKfS0_m+0x40>
a289: c5 e0 57 db vxorps xmm3,xmm3,xmm3
a28d: 31 c0 xor eax,eax
a28f: c5 e9 57 d2 vxorpd xmm2,xmm2,xmm2
a293: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
a298: c5 e2 5a 04 87 vcvtss2sd xmm0,xmm3,DWORD PTR [rdi+rax*4]
a29d: c5 e2 5a 0c 86 vcvtss2sd xmm1,xmm3,DWORD PTR [rsi+rax*4]
a2a2: c5 fb 5c c1 vsubsd xmm0,xmm0,xmm1
a2a6: 48 83 c0 01 add rax,0x1
a2aa: c5 fb 59 c0 vmulsd xmm0,xmm0,xmm0
a2ae: c5 eb 58 d0 vaddsd xmm2,xmm2,xmm0
a2b2: 48 39 c2 cmp rdx,rax
a2b5: 75 e1 jne a298 <_ZL19l2sq_naive_f64_oncePKfS0_m+0x18>
a2b7: c5 eb 10 c2 vmovsd xmm0,xmm2,xmm2
a2bb: c3 ret
a2bc: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
a2c0: c5 e9 57 d2 vxorpd xmm2,xmm2,xmm2
a2c4: c5 eb 10 c2 vmovsd xmm0,xmm2,xmm2
a2c8: c3 ret
a2c9: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
Although this version converts every element from float to double (vcvtss2sd) and computes one dimension at a time, it avoids the complex and costly 16-element horizontal folding. In other words, even with the conversion overhead, the simpler scalar path was still faster than the float version with vector folding. The compiler likely chose the conservative scalar path here, avoiding vectorization.
(3) How to Improve Naive L2sq for Better Compiler Vectorization?
The reason for horizontal folding is likely that the compiler strictly follows IEEE 754 semantics, preserving the exact order of floating-point additions. This prevents the compiler from reordering additions into vectorized accumulations.
To relax this, we can explicitly allow reassociation:
static inline double l2sq_naive_f32(const float* a, const float* b, size_t n) {
float acc = 0.f;
#pragma omp simd reduction(+:acc)
for (size_t i = 0; i < n; ++i) {
float d = a[i] - b[i];
acc += d * d;
}
return (double)acc;
}
And compile with -fopenmp-simd to enable this directive.
Running again shows a significant improvement: compiler auto-vectorization now achieves performance close to manual SIMD implementations. Using -ffast-math also works.

Explicitly allowing floating-point reassociation (via #pragma omp simd reduction(+:acc) or appropriate subsets of -ffast-math) is the key to approaching hand-written SIMD performance. Under strict IEEE semantics, the compiler conservatively generates per-iteration folding, which creates slow paths where AVX-512 does not necessarily have an advantage.

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
……
Can we really say this SELECT is repeatable read safe❓
I used to assume pgvector, as a PostgreSQL extension, naturally inherits Postgres’s transactional guarantees — but after thinking it through, that might not be the case.
This works well with native ordered indexes like nbtree. For example:
So, the query still returns the same results — consistent with REPEATABLE READ.

When inserting a new vector B:
Here’s the issue: T’s neighbor list is modified — breaking assumption #1. Now, suppose a REPEATABLE READ transaction had previously discovered D via T. In its second identical query, it may no longer reach D, simply because D was evicted from T’s neighbor list. At the same time, the newly inserted B is now reachable — but is correctly rejected due to heap visibility checks.
This question came to mind today — I reached a tentative conclusion through some code review and thought experiments. Haven’t verified this with a test case yet, so feel free to correct me if I’m wrong.
🤔 BTW, lately, I’ve been comparing how vector search is implemented in transactional databases vs dedicated vector databases by reading through their code. It’s exciting to see traditional databases embracing new trends — but what do you think: Do transactions bring real value to vector search, or are they more of a burden in practice? And what about the other way around?
This post has sparked some discussion on LinkedIn, with two main points being raised:
SELECT ... ORDER BY ... LIMIT ...), because different execution plans can produce different result orders.

I’m not convinced by either of these arguments:
This is one of the key differences from MariaDB’s vector index. Due to MariaDB’s pluggable storage engine architecture and engineering trade-offs, its vector index is implemented at the server layer through the storage engine’s transactional interface. The storage engine itself is unaware of the vector index — it just sees a regular table. Curious about the internals of the MariaDB vector index? Take a look at my previous posts here[1] and here[2].
pgvector supports two types of vector indexes: HNSW and IVFFlat. This post focuses on the HNSW implementation, particularly its concurrency control mechanisms — a topic that has received relatively little attention.
hnswinsert: This is the core interface for inserting into the index. Its implementation closely follows the HNSW paper, aligning with the INSERT, SEARCH-LAYER, and SELECT-NEIGHBORS-HEURISTIC algorithms (Algorithms 1, 2, and 4). One difference I noticed is that pgvector omits the extendCandidates step from SELECT-NEIGHBORS-HEURISTIC.
hnswgettuple: This is the search interface. It invokes GetScanItems to perform the HNSW search, which aligns closely with Algorithm 5 (K-NN-SEARCH) from the paper. In iterative scans, GetScanItems not only returns the best candidates but also retains the discarded candidates — those rejected during neighbor selection. Once all the best candidates are exhausted, it revisits some of these discarded candidates at layer 0 for additional search rounds. This continues until hnsw_max_scan_tuples is exceeded, after which the remaining discarded candidates are returned as-is and the scan ends.
Scanning all index pages to collect invalid tuples.
RepairGraph, the most complex step, removes invalid tuples from the graph and their neighbors, then repairs the graph to ensure correctness. This requires scanning all pages again and performing multiple search-layer and select-neighbors operations.
Safely deleting the invalid tuples from the index, again requiring a full scan of all pages.
As you can see, traversing all index pages three times, plus the extensive HNSW operations, makes this function very heavy.
In an earlier post, I analyzed the concurrency control design of MariaDB’s vector index. There, read-write concurrency is supported through InnoDB’s MVCC, but write-write concurrency is not.
pgvector goes further by supporting true write-write concurrency. Let’s dive into how pgvector handles concurrency between hnswinsert and hnswbulkdelete.
It introduces multiple lock types:
Page Locks These are PostgreSQL’s standard buffer locks and are used to protect individual pages. Pages in pgvector fall into three categories:
Meta Page: Stores index metadata.
Element Pages: Store index tuples.
Entrypoint Page: A special element page containing the HNSW entrypoint.
Most inserts (hnswinsert) acquire the HNSW_UPDATE_LOCK in shared mode, allowing concurrent inserts.
If an insert needs to update the entrypoint, it upgrades to exclusive mode to ensure only one insert can modify the entrypoint. hnswbulkdelete briefly acquires and immediately releases the lock in exclusive mode after collecting invalid index tuples and just before RepairGraph, ensuring that all in-flight inserts have completed. Otherwise, concurrent inserts might reintroduce the invalid elements being removed, making them neighbors again.
All locking operations in hnswinsert and hnswbulkdelete are well-structured. The diagram below shows detailed lock scopes in both implementations, where solid lines indicate exclusive locks and dashed lines indicate shared locks. I won’t go into all the details here — I may write a separate post covering the implementation specifics — but the diagram clearly illustrates that exclusive HNSW_UPDATE_LOCK usage is infrequent. Most operations acquire it in shared mode and hold short-lived page locks only as needed, keeping contention low.
What about deadlocks? The answer is simple: as shown in the diagram, in most cases only one page buffer lock is held at a time, eliminating the risk of circular dependencies. In rare cases, both an element page and one of its neighbor pages (also an element page) may be locked simultaneously. However, since pgvector maintains a globally unique current insert page, even these scenarios remain safe.
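The deadlock-freedom argument can be made concrete. A deadlock needs a cycle of threads each holding one lock while waiting for another; holding at most one page lock at a time makes such a cycle impossible. For the rare two-page case, the classic textbook remedy is a globally consistent acquisition order, sketched below; pgvector’s actual safety argument rests on the globally unique current insert page instead, but the cycle-prevention reasoning is the same. Plain Python locks stand in for buffer locks here.

```python
import threading

# Toy buffer locks, one per page id.
page_locks = {pid: threading.Lock() for pid in range(4)}

def touch_page(pid):
    # Common case: at most one page lock held at a time -> no cycle
    # of waiters can ever form.
    with page_locks[pid]:
        pass  # read or modify the page

def link_neighbor(element_pid, neighbor_pid):
    # Rare case: two element pages locked together. Acquiring them in
    # a globally consistent order (here: by page id) rules out the
    # circular wait a deadlock would require.
    first, second = sorted((element_pid, neighbor_pid))
    with page_locks[first]:
        with page_locks[second]:
            pass  # update neighbor lists on both pages
```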
pgvector is another textbook example of how to integrate vector search into a traditional OLTP database. Its implementation is elegant, closely aligned with the original HNSW paper.
Compared to MariaDB’s vector index, it stands out for its fine-grained concurrency control. However, it lacks the SIMD-level optimizations that MariaDB has introduced for better performance.
A deeper comparison between pgvector and MariaDB’s vector index internals would be an interesting future topic.
If you’re interested in performance benchmarks comparing pgvector and MariaDB, Mark Callaghan has done detailed tests; check them out here.
I believe there is room for optimization in the unique secondary index insertion process. In my previous post, I already reported one such optimization, which has since been verified by the MySQL team. In this post, I’d like to discuss another optimization I recently proposed.

Currently, when inserting an entry into a unique secondary index, MySQL performs the following steps:
It then traverses subsequent records (via next) to check for actual duplicates, considering that records marked as deleted don’t count as duplicates. If no duplicate is found, it continues to the next step.

As you can see, this process involves 3 separate B+Tree searches. This is mainly because unique secondary indexes in InnoDB cannot be updated in place; they rely on a delete-mark-and-insert approach. Since multiple records can share the same index column values (including delete-marked ones), MySQL has to perform these extra checks to ensure uniqueness.
Each of these B+Tree searches acquires at least one page latch at each tree level, which can become a concurrency bottleneck, especially during page splits or merges.
I believe we can reduce the number of B+Tree searches for unique secondary indexes. Specifically, we could skip the initial B+Tree search that uses all columns as the search pattern. The revised process would be:
We would then scan (via next) through subsequent records to confirm whether an actual duplicate exists, while also identifying the final insertion point (including comparing the full entry columns when needed). If no duplicate is found, we can directly insert the entry at the determined insertion point.

This approach would reduce the number of B+Tree searches from 3 to 1, significantly reducing the chance of concurrency conflicts. All duplicate checks would happen within one or just a few adjacent leaf pages, making the lock granularity much smaller. Importantly, even in the worst case, the number of entry-record comparisons wouldn’t exceed what the current implementation requires.
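The proposed single-search flow can be sketched against a toy leaf page. This is a hypothetical model, not InnoDB’s B+Tree code: the leaf is just a sorted Python list of records carrying a unique key and a delete mark, and one binary search plays the role of the single B+Tree descent.

```python
import bisect

def insert_unique(leaf, entry):
    """Insert `entry` unless a live record with the same unique key exists.

    leaf  : sorted list of dicts with 'key' and 'deleted' fields
    entry : dict with 'key' and 'deleted' fields
    Returns True if inserted, False if a live duplicate was found.
    """
    keys = [r["key"] for r in leaf]
    # The single "B+Tree search": position at the first record whose
    # unique key is >= the new entry's key.
    pos = bisect.bisect_left(keys, entry["key"])

    # Forward scan (the "next" traversal): records sharing the key but
    # marked as deleted do not count as duplicates. The scan also lands
    # on the final insertion point.
    i = pos
    while i < len(leaf) and leaf[i]["key"] == entry["key"]:
        if not leaf[i]["deleted"]:
            return False            # live duplicate found
        i += 1

    # No live duplicate: insert directly at the point found during the scan.
    leaf.insert(i, entry)
    return True
```

In the real engine the scan may cross into an adjacent leaf page, but the point stands: both the duplicate check and the insertion point come out of one descent.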
I’ve already submitted a report with this idea to the MySQL team. I’m hoping it can generate some interesting discussions around this optimization.
Vectors are converted to int16_t arrays using normalization and quantization. These int16_t arrays allow fast dot product computation using AVX2 or AVX512 instructions, which significantly speeds up vector distance calculations.

A PatternedSimdBloomFilter is used to efficiently skip previously visited neighbors. This filter groups visited memory addresses in batches of 8 and uses SIMD to accelerate the matching process.

Each table’s TABLE_SHARE structure holds an MHNSW_Share object, which contains a global cache shared across sessions (since TABLE_SHARE is global).
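The normalize-then-quantize idea can be sketched in a few lines. The scale factor and exact scheme below are assumptions for illustration only; MariaDB’s actual quantization differs in its details, and the scalar loop here is what AVX2/AVX512 would vectorize with many int16 multiply-accumulates per instruction.

```python
import math

SCALE = 32767  # map the unit-normalized range [-1, 1] onto int16 (assumed scheme)

def quantize(vec):
    """Normalize a float vector to unit length, then scale to int16."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [int(round(x / norm * SCALE)) for x in vec]

def dot_i16(a, b):
    # Integer multiply-accumulate: the loop SIMD hardware accelerates.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Because both vectors are unit-normalized before quantization,
    # the integer dot product approximates cosine similarity.
    return dot_i16(quantize(a), quantize(b)) / (SCALE * SCALE)
```

The payoff is that distance computation stays entirely in integer arithmetic, at the cost of a small, bounded quantization error.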
The cache improves read performance but introduces additional locking overhead, which is worth a closer look. Three types of locks are used to manage concurrency:
cache_lock: guards the entire cache structure.

node_lock[8]: an array of node-level locks that reduces contention on cache_lock. A thread first uses cache_lock to locate the node, then grabs node_lock[x] for fine-grained protection, allowing cache_lock to be released right after.

commit_lock: a read-write lock that ensures consistency during writes. Readers hold the read lock throughout the query to prevent concurrent cache modifications. Writers acquire the write lock during commit, invalidate the cache, bump the version number, and notify any ongoing reads (executing between hlindex_read_first() and hlindex_read_next()) to switch to the new cache generated by the writer.

Observations: the layered locking scheme (cache_lock + node_lock) minimizes contention.

(This section refers specifically to vector indexes on InnoDB.)
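The cache_lock + node_lock[8] layering can be modeled in a few lines. The names mirror the post, but the code is an illustrative Python model, not MariaDB’s implementation: the point is that the global lock is held only for the lookup, while per-node work proceeds under one of eight striped locks.

```python
import threading

class NodeCache:
    """Toy model of a shared node cache with striped per-node locks."""
    N_STRIPES = 8

    def __init__(self):
        self.cache_lock = threading.Lock()
        self.node_lock = [threading.Lock() for _ in range(self.N_STRIPES)]
        self.nodes = {}

    def with_node(self, node_id, fn):
        # Short critical section: only the lookup holds cache_lock.
        with self.cache_lock:
            node = self.nodes.setdefault(node_id, {"id": node_id})
            stripe = hash(node_id) % self.N_STRIPES
        # Per-node work runs under its stripe's lock; nodes mapping to
        # other stripes remain fully concurrent.
        with self.node_lock[stripe]:
            return fn(node)
```

Striping trades a small chance of false sharing (two nodes hashing to the same stripe) for a fixed, cache-friendly number of locks.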
(tx_isolation = ‘READ COMMITTED’)
(tx_isolation = ‘REPEATABLE READ’)
Interestingly, under the Repeatable Read isolation level, the correctness of locking reads is not guaranteed by InnoDB’s next-key locks or gap locks. Since gap protection applies to ordered indexes, the concept of a “gap” does not really exist in a vector index structure. However, locking reads in this case effectively behave like Serializable, which still satisfies the requirements of Repeatable Read.
Another notable quirk: the cache layer disrupts normal locking behavior. If a node is found in the cache, no InnoDB-level lock is acquired. Locking only happens on cache misses. This makes the locking behavior somewhat unpredictable under high cache hit rates.
Based on the reviews in my last post and this one, I believe MariaDB’s current implementation of vector indexes offers an excellent case study of how to integrate vector search in a relational database. It achieves a strong balance between engineering complexity, performance, and applicability.
Looking forward to seeing even more powerful iterations in the future!