On 12th of March 2026, Robert Haas committed patch: Add pg_plan_advice contrib module. Provide a facility that (1) can be used to stabilize certain plan choices so that the planner cannot reverse course without authorization and (2) can be used by knowledgeable users to insist on plan choices contrary to what the planner believes … Continue reading "Waiting for PostgreSQL 19 – Add pg_plan_advice contrib module."
The previous article showed that PostgreSQL 18 makes optimizer statistics portable, but left one gap open:
It's not worth trying to inject relpages as the planner checks the actual file size and scales it proportionally.
The planner doesn't trust pg_class.relpages. It calls smgrnblocks() to read the actual number of 8KB pages from disk. Your table is 74 pages on disk but pg_class.relpages says 123,513? The planner uses the ratio to scale reltuples down to match the actual file size. The selectivity ratios stay correct, plan shapes mostly survive, but the absolute cost estimates are off.
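The mismatch is easy to observe from SQL; a quick sketch (the table name is a placeholder):

```sql
-- Compare the catalog's page count with the file's actual size
SELECT relname,
       relpages,                                  -- catalog value
       reltuples,                                 -- catalog value
       pg_relation_size(oid) /
         current_setting('block_size')::int AS actual_pages
FROM pg_class
WHERE relname = 'my_table';                       -- placeholder name
```

When `relpages` and `actual_pages` diverge, the planner trusts the latter and rescales `reltuples` accordingly.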
For debugging a single query, that's usually fine. For automated regression testing where you compare EXPLAIN costs across runs, it breaks things. A cost threshold of 2× means something different when the baseline was computed from fake-scaled numbers.
As part of my work on RegreSQL, I'm happy to announce the pg_regresql extension, which fixes this by hooking directly into the planner.
Why the planner ignores relpages
When PostgreSQL's planner calls get_relation_info() in plancat.c, it delegates to estimate_rel_size() which ends up in table_block_relation_estimate_size() in tableam.c. There, the actual page count comes from the storage manager:
curpages = RelationGetNumberOfBlocks(rel);
The function then computes a tuple density from pg_class (reltuples / relpages) and multiplies it by curpages to estimate tuples. So pg_class.reltuples isn't ignored, it's scaled to match the real file size. The reasoning is sound for normal operation: the catalog might be stale, but the file system is always current.
The same applies to indexes. The planner reads their actual sizes from disk too.
What pg_regresql does
The extension hooks into get_relation_info_hook, a planner callback that runs after PostgreSQL reads the physical file stats. The hook replaces the file-based numbers with the values stored in pg_class:
Recently, during an Oracle to PostgreSQL migration for an enterprise customer, while designing the cutover runbook we were evaluating the steps to reset sequence values to match the source, so that every new value requested via nextval() is a new one and does not cause a transactional failure.
This is one of the critical steps in any database migration: in most cases a sequence's LAST_VALUE is not migrated to the target implicitly, and sequence values need to match the source so that new transactions never fail and the application works after cutover.
We used ora2pg’s SEQUENCE_VALUES export to generate DDL commands to set the last values of sequences.
ora2pg’s SEQUENCE_VALUES export generates ALTER SEQUENCE ... START WITH. That command does not reset the sequence.
Expected: next nextval() = 1494601. Actual: sequence continues from wherever it was.
Silently wrong. The kind that surfaces as a “balance mismatch” as part of post-cutover steps.
Understanding the Different Sequence Reset Options
Any sequence in PostgreSQL has three internal fields:
start_value → reference point for a bare RESTART
last_value → where the counter actually is (nextval reads this)
is_called → has last_value been consumed yet?
We have a couple of options to reset last_value for a sequence.
The following table summarizes all options in comparison with START WITH.
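To make the difference concrete, a sketch using the target value mentioned above (the sequence name is hypothetical):

```sql
-- Option 1: setval() — next nextval() returns 1494601
SELECT setval('orders_id_seq', 1494601, false);

-- Option 2: ALTER SEQUENCE ... RESTART — same effect, DDL form
ALTER SEQUENCE orders_id_seq RESTART WITH 1494601;

-- What ora2pg generated — changes start_value only, does NOT move the counter
ALTER SEQUENCE orders_id_seq START WITH 1494601;
```

The third form is why the cutover check failed silently: START WITH only updates metadata used by a future bare RESTART.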
I’ve been busy with an internal project at work, but have responded to a few pg_clickhouse reports for a couple crashes and vulnerabilities, thanks to pen testing and a community security report. These changes drive the release of v0.1.5 today.
Welcome to Part three of our series for building a High Availability Postgres cluster using Patroni! Part one focused entirely on establishing the DCS using etcd to provide the critical DCS backbone for the cluster, and part two added Patroni and Postgres to the software stack. While it's entirely possible to stop at that point and use the cluster as-is, there's one more piece that will make it far more functional overall.

New connections need a way to reach the primary node easily and consistently. Patroni provides a REST interface for interrogating each node for its status, making it a perfect match for any software or load-balancer layer compatible with HTTP checks. Part three focuses on adding HAProxy to fill that role, completing the cluster with a routing layer.

Hopefully you still have the three VMs where you installed etcd, Postgres, and Patroni. We will need those VMs for the final stage, so if you haven't already gone through the steps in parts one and two, come back when you're ready.

Otherwise, let's complete the cluster!
What HAProxy adds
HAProxy is one of the most common HTTP proxies available, but it also has a hidden superpower: it can transparently redirect raw TCP connections as well. This means it can also act as a proxy for any kind of service such as Postgres. Here's how it works:
HAProxy connects to the Patroni REST interface and gets the status for the "/" URL.
Patroni will only respond with a "200 OK" status on the primary node. All other nodes will produce a "500" error of some kind.
HAProxy marks nodes that respond with errors as unhealthy.
All connections get routed to the only "healthy" node: the primary for the cluster.
Of course that's not the end of it; the Patroni REST API is incredibly powerful, as it provides multiple additional endpoints. For example a check against:
/replica will succeed if the node is a healthy streaming replica of the primary, a good match for offloading intensive read queries from the primary node.
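A minimal sketch of the corresponding HAProxy configuration, assuming three nodes with Patroni's REST API on its default port 8008 (addresses and ports are illustrative):

```
# Route all client connections on port 5000 to the current primary
listen postgres_primary
    bind *:5000
    mode tcp
    option httpchk GET /
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg1 10.0.0.1:5432 check port 8008
    server pg2 10.0.0.2:5432 check port 8008
    server pg3 10.0.0.3:5432 check port 8008
```

A second `listen` block checking `/replica` instead of `/` would route read-only traffic to the healthy replicas.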
Yes, it was St. Patrick’s Day, and also Illinois Primaries, and the weather was beyond bad, but we still had a good crowd!
Pizza always comes first :), because nobody is going to go hungry! Whether you stay for Postgres or not is up to you, so I am assuming that when people come and stay, it’s not just for the pizza.
There was a time when High Availability in PostgreSQL came with an implicit assumption: if something important happened, an administrator could log into the server, inspect the state of the cluster, and run the command that steadied the ship. That assumption is fading fast. In many modern enterprises, direct OS-level access is no longer part of the operating model. SSH is locked down, bastion access is tightly controlled, and every administrative pathway is examined through the lens of zero-trust security.
And yet—High Availability doesn’t wait.
That shift creates a very real operational question for database teams: how do you maintain control of a PostgreSQL HA environment when the traditional control surface has been deliberately removed?
This is where the refreshed vision for efm_extension becomes interesting. It is an open-source PostgreSQL extension, released under the PostgreSQL License, designed to expose EDB Failover Manager (EFM) operations directly through SQL—bringing operational control into a governed, auditable layer.
The HA Landscape: Patroni, repmgr, and EFM
PostgreSQL High Availability has never been a one-size-fits-all story.
Some teams lean toward Patroni, embracing distributed coordination and cloud-native patterns. Others prefer repmgr, valuing its simplicity and DBA-centric workflows. And then there is EDB Failover Manager (EFM)—a mature, enterprise-grade solution designed to monitor streaming replication clusters and orchestrate failover with predictability and control.
Each of these tools reflects a different philosophy. But they share one quiet assumption:
Operational control happens at the OS level.
And that assumption is exactly where modern security models push back.
The Real Problem: Control Without Access
In today’s enterprise environments, responsibilities are deliberately separated.
On 10th of March 2026, Álvaro Herrera committed patch: Introduce the REPACK command REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single command. Because this functionality is completely different from regular VACUUM, having it separate from VACUUM makes it easier for users to understand; as for CLUSTER, the term is heavily … Continue reading "Waiting for PostgreSQL 19 – Introduce the REPACK command"
PostgreSQL major version upgrades are one of those tasks that every DBA has to deal with regularly. They are routine — but they are also full of small, potentially dangerous details that can turn a straightforward maintenance window into an incident. Having performed hundreds of upgrades across different environments over the years, I want to share a comprehensive, practical guide to upgrading from PostgreSQL 17 to 18, with particular focus on what has changed and what has finally improved in the upgrade process itself.
This article is based on my PGConf.EU 2024 talk, updated to cover the PostgreSQL 17→18 upgrade path and the significant improvements that landed in version 18. This time of year we usually recommend that our customers upgrade: the current release, 18.3, is stable enough.
Why Upgrades Matter
Let me start with a reality check. PostgreSQL major versions are supported for five years. If you are running a version that is past its end-of-life, you are exposed to unpatched security vulnerabilities and bugs that the community will never fix. But even within the support window, newer versions bring performance improvements, new features, and better tooling. The question is not whether to upgrade, but how to do it safely and with minimal downtime.
The upgrade itself is not rocket science. The tricky part is the combination of small details — replication slots, extension compatibility, configuration drift between old and new clusters, statistics collection, and the behavior of your connection pooler during the switchover. Any one of these can bite you if you are not paying attention.
Preparation
Read the release notes. If you want to jump from, let's say, version 13 to version 18, read them all. There could be some manual steps you need to take.
Plan and test. Upgrades are simple and straightforward, even major upgrades. The problem is in the small details, which are easy to overlook. Collation, checksums, extensions, a forgotten checkpoint, or customized statistics targets could make your li
I'm planning to hold a single hacking workshop for April and May combined, covering Masahiko Sawada's talk, Breaking away from FREEZE and Wraparound, given at PGCon 2022. If you're interested in joining us, please sign up using this form and I will send you an invite to one of the sessions. Thanks to Sawada-san for agreeing to join us.
On 4th of March 2026, Amit Kapila committed patch: Allow table exclusions in publications via EXCEPT TABLE. Extend CREATE PUBLICATION ... FOR ALL TABLES to support the EXCEPT TABLE syntax. This allows one or more tables to be excluded. The publisher will not send the data of excluded tables to the subscriber. To … Continue reading "Waiting for PostgreSQL 19 – Allow table exclusions in publications via EXCEPT TABLE."
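The new syntax from the commit message, sketched (the table name is hypothetical):

```sql
-- Publish every table except a noisy audit table (PostgreSQL 19+)
CREATE PUBLICATION pub_all FOR ALL TABLES EXCEPT TABLE audit_log;
```

Rows from `audit_log` are never sent to subscribers of `pub_all`.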
Most RAG systems were built for a specific workload: abundant reads, relatively few writes, and a document corpus that doesn't change much. That model made sense for early retrieval pipelines, but it doesn't reflect how production agent systems actually behave. In practice, multiple agents are constantly writing new observations, updating shared memory, and regenerating embeddings, often at the same time. The storage layer that worked fine for document search starts showing cracks under that kind of pressure.

The failures that result aren't always obvious. Systems stay online, but answers drift. One agent writes a knowledge update while another is mid-query, reading a half-committed state. The same question asked twice returns different answers. Embeddings exist in the index with no corresponding source text. These symptoms get blamed on the model, but the model isn't the problem. The storage layer is serving up an inconsistent state, and no amount of prompt engineering can fix that.

This isn't a new class of problem. Databases have been solving concurrent write correctness for decades, and PostgreSQL offers guarantees that meet those agent memory needs.
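One of those guarantees is atomicity; a sketch of what it buys agent memory (the table, columns, and pgvector embedding type are assumptions for illustration):

```sql
-- Text and embedding commit together: a concurrent reader sees the old
-- pair or the new pair, never a half-committed mix
BEGIN;
UPDATE agent_memory
   SET source_text = 'observed: deploy v2 rolled back',
       embedding   = '[0.12, -0.04, 0.31]'::vector
 WHERE memory_id = 42;
COMMIT;
```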
What RAG Systems Are Missing Today
RAG systems depend on memory that evolves over time, but most current architectures were designed for static document search rather than stateful reasoning, creating fundamental correctness, consistency, and reproducibility problems in production environments.
Stateless Retrieval Problems and Solutions
Most RAG pipelines treat retrieval as a stateless search over embeddings and documents. The system pulls the top matching chunks with no awareness of how memory has evolved, what the agent's current session context is, or where a piece of information sits on a timeline. For static document search, that limitation rarely matters. For agent memory, where knowledge changes continuously, it is a real problem.

Without stateful awareness, retrieval starts mixing facts from different points in time. One query might retrieve yester[...]
Like much of the world, I have been exploring capabilities and realities of LLMs and other generative tools for a while now. I am focused on using the technology with the framing of my technology-focused work, plus my other common scoping on data privacy and ethics. I want basic coding help (SQL, Python, Docker, PowerShell, DAX), ideation, writing boilerplate code, and leveraging existing procedures. Naturally, I want this available offline in a private and secure environment. I have been focused on running a local LLM with RAG capabilities and having control over what data goes where, and how it is used. Especially data about my conversations with the generative LLM.
This post collects my notes on what my expectations and goals are, and outlines the components I am using currently, and thoughts on my path forward.
The 23rd edition of the Southern California Linux Expo, or SCaLE 23x, took place from March 5-8, 2026, in Pasadena, California. It was another fantastic community-run event with talks you don't get to hear anywhere else, and that incredible open-source community spirit.
While I didn't broadcast any live streams from the conference floor this year, I did end up catching some great talks (fortunately everything's recorded!) and having some deeply rewarding hallway track conversations. A highlight was catching up with folks from LPI, the legendary Jon "maddog" Hall, and Henrietta Dombrovskaya. We had a great discussion around our ongoing PostgreSQL Compatibility initiative: We are continuing to define what "Postgres Compatible" truly means to prevent market confusion and ensure a reliable "standard" for users. If you are interested in contributing to this effort, come join the conversation on our new Discord server:
Postgres once again had a stellar presence at SCaLE with a dedicated PostgreSQL track. We had an entire lineup of trainings and talks, plus a PostgreSQL Booth in the expo hall and the always-popular SCaLE 23x PostgreSQL Ask Me Anything session. A massive thank you to the organizers of PostgreSQL@SCaLE, the trainers, speakers, and all the volunteers who made it happen!
On 26th of February 2026, Andrew Dunstan committed patch: Add non-text output formats to pg_dumpall pg_dumpall can now produce output in custom, directory, or tar formats in addition to plain text SQL scripts. When using non-text formats, pg_dumpall creates a directory containing: - toc.glo: global data (roles and tablespaces) in custom format - map.dat: … Continue reading "Waiting for PostgreSQL 19 – Add non-text output formats to pg_dumpall"
pgNow is a lightweight PostgreSQL diagnostic tool developed by Redgate that provides quick visibility into database performance without requiring agents or complex setup. It connects directly to a PostgreSQL instance and delivers real-time insights into query workloads, active sessions, index usage, configuration health, and vacuum activity, helping DBAs quickly identify performance bottlenecks. Because it runs as a simple desktop application, pgNow is particularly useful for quick troubleshooting and point-in-time diagnostics when a full monitoring platform is not available.
The tool is currently free to use, and its development is actively maintained by Redgate, with potential future enhancements expected as the project evolves. It analyzes workload behavior using PostgreSQL system views and extensions such as pg_stat_activity and pg_stat_statements.
Prerequisites
Enable pg_stat_statements in PostgreSQL
Most PostgreSQL distributions already include the pg_stat_statements extension. You only need to enable it in shared_preload_libraries and create the extension in the database.

Create the extension:

CREATE EXTENSION pg_stat_statements;

Verify the extension:
Configure Statement Tracking:
Set the tracking level to capture all statements:
ALTER SYSTEM SET pg_stat_statements.track = 'all';

Reload the configuration:

SELECT pg_reload_conf();
Verify the setting:
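A minimal way to check that the new value is active after the reload:

```sql
SHOW pg_stat_statements.track;
-- expected: all
```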
CREATE USER monitor_user WITH PASSWORD '123#abc';
-- Grant connection permission
GRANT CONNECT ON DATABASE myoddodb TO monitor_user;
-- Grant usage on schema (adjust schema name if needed)
GRANT USAGE ON SCHEMA public TO monitor_user;
-- Grant specific table permissions for monitoring
GRANT SELECT ON pg_stat_activity TO monitor_user;
GRANT SELECT ON pg_stat_database TO monitor_user;
GRANT SELECT ON pg_stat_all_tables TO monitor_user;
GRANT SELECT ON pg_stat_user_tables TO monitor_user;
The COMMENT command has been in Postgres for decades. It allows text descriptions to be attached to almost any database object. During its long history, it was mostly seen as a nice-to-have addition to database schemas, allowing administrators and developers to more easily understand the schema. Tools like pgAdmin allow you to assign and view comments on database objects.
Now, in the AI era, there is something else that needs to understand database schemas — MCP clients. Without database object comments, MCP clients can only use the database schemas, object names, and constraints. With database comments, database users can supply valuable information to allow MCP clients to more effectively match schema objects to user requests and potentially generate better SQL queries. If database users don't want to add such comments, it might be possible for generative AI to create appropriate comments, perhaps by analyzing data in the tables.
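For example, a couple of comments of the kind an MCP client could read from the catalog (the schema objects are hypothetical):

```sql
COMMENT ON TABLE orders IS
  'Customer orders; one row per checkout, immutable after fulfillment';
COMMENT ON COLUMN orders.status IS
  'Lifecycle state: pending, shipped, or cancelled';
```

Clients can then retrieve these via obj_description()/col_description() or the information_schema views when building context for query generation.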
It started as a humble personal project, a few years back. The objective was to convert all my PostgreSQL notes and learning into an automatic diagnostic tool, such that even a new DBA can easily spot problems. The idea was simple: a tool which doesn’t need any installation but does all possible analysis and […]
If you’ve been paying attention to the technology landscape recently, you’ve probably noticed that AI is everywhere. New frameworks, new terminology, and a dizzying array of acronyms and jargon: LLM, RAG, embeddings, vector databases, MCP, and more.
Honestly, it’s been difficult to figure out where to start. Many tutorials either dive deep into machine learning theory (Bayesian transforms?) or hide everything behind a single API call to a hosted model. Neither approach really explains how these systems actually work.
Recently I spent some time experimenting with the pgEdge AI tooling after hearing Shaun Thomas’ talk at a PrairiePostgres meetup. He talked about how to set up the various components of an AI chatbot system, starting from ingesting documents into a Postgres database, vectorizing the text, setting up a RAG and then an MCP server.
When I got home I wanted to try it out for myself – props to the pgEdge team for making it all free and open source! What surprised me most was not just that everything worked, but how easy it was to get a complete AI retrieval pipeline running locally. More importantly, it turned out to be one of the clearest ways I’ve found to understand how modern AI systems are constructed behind the scenes. Thanks so much, Shaun!
The pgEdge AI Components
The pgEdge AI ecosystem provides several small tools that fit together naturally. I’ll go through them quickly here.
Doc Converter – The doc-converter normalizes documents into a format that is easy to process downstream. Whether the input is PDF, HTML, Markdown, or plain text, the converter produces clean text output suitable for ingestion.
Vectorizer – The vectorizer handles the process of converting text chunks into embeddings. These embeddings are numeric representations of text that capture semantic meaning. Once generated, they can be stored inside PostgreSQL using pgvector and queried with similarity search.
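Stored that way, retrieval becomes a similarity query; a sketch assuming a hypothetical chunks table with a three-dimensional pgvector column (real embedding models produce hundreds or thousands of dimensions):

```sql
-- Nearest chunks by cosine distance to a query embedding
SELECT id, content
FROM chunks
ORDER BY embedding <=> '[0.12, 0.05, -0.33]'::vector
LIMIT 5;
```

The top rows become the context handed to the LLM in the RAG step.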
This is the third and final post in a series covering the new AI functionality in pgAdmin 4. In the first post, I covered LLM configuration and the AI-powered analysis reports, and in the second, I introduced the AI Chat agent for natural language SQL generation. In this post, I'll walk through the AI Insights feature, which brings LLM-powered analysis to PostgreSQL EXPLAIN plans.Anyone who has spent time optimising PostgreSQL queries knows that reading EXPLAIN output is something of an acquired skill. pgAdmin has long provided a graphical EXPLAIN viewer that makes the plan tree easier to navigate, along with analysis and statistics tabs that surface key metrics, but interpreting what you're seeing and deciding what to do about it still requires a solid understanding of the query planner's behaviour. The AI Insights feature aims to bridge that gap by providing an expert-level analysis of your query plans, complete with actionable recommendations.
Where to Find It
AI Insights appears as a fourth tab in the EXPLAIN results panel, alongside the existing Graphical, Analysis, and Statistics tabs. It's only visible when an LLM provider has been configured, so if you don't see it, check that you've set up a provider in Preferences (as described in the first post). The tab header simply reads 'AI Insights'.

To use it, run a query with EXPLAIN (or EXPLAIN ANALYZE for the most useful results, since actual execution timings give the AI much more to work with), and then click on the AI Insights tab. The analysis starts automatically when you switch to the tab, or you can trigger it manually with the Analyze button.
What the Analysis Provides
The AI Insights analysis produces three sections:
Summary
A concise paragraph providing an overall assessment of the query plan's performance characteristics. This gives you a quick sense of whether the plan is generally healthy or has significant issues worth investigating. For well-optimised queries, the summary will confirm that the plan looks reasonable; for problematic o[...]
In the very early days of my career, an incident made me realise that performing my job irresponsibly would affect me adversely, not just because it might affect my position, but because it could affect my life in other ways too. I was part of a team that produced software used by a financial institution where I held my account. A bug in the software caused a failure which made several accounts, including my bank account, inaccessible! Fortunately I wasn't the one who introduced that bug, and neither was any other software engineer working on the product. It had simply crept through the cracks that the age-old software had developed as it went through many improvements. Something that happens to all architectures, software or otherwise, in the world. That was an enlightening and eye-opening experience. But professional karma is not always bad; many times it's good. When the humble work I do to earn my living also improves my life, it gives me immense satisfaction. It means that it's also improving billions of lives that way across the globe.
When I was studying for my post-graduation at IIT Bombay, I often travelled by train, local and intercity. The online ticketing system for long-distance trains was still in its early stages. Local train tickets were still issued at stations, and getting one required standing in a long queue. Fast forward to today: you can buy a local train ticket on a mobile app or at a kiosk at the station by paying online through UPI. On my recent trip to IIT Bombay I bought such a ticket using GPay in a few seconds. And you know what, UPI uses PostgreSQL as an OLTP database in its system. I didn't have to go through the same experience, thanks to the same education and the work I am doing. Students studying at my alma mater no longer have to go through that painful experience, thanks to many PostgreSQL contributors who once were students and might have had similar painful experiences in their own lives.
In PGConf.India, Koji Annoura, who is a Graph database expert talked about o
I previously blogged about ensuring that the “ON CONFLICT” directive is used in order to avoid vacuum from having to do additional work. I also later demonstrated the characteristics of how the use of the MERGE statement will accomplish the same thing.
Now in another recent customer case, I was chasing down why the application was invoking 10s of thousands of Foreign Key and Constraint violations per day and I began to wonder, if these kinds of errors also caused additional vacuum as described in those previous blogs. Sure enough it DEPENDS.
Let’s set up a quick test to demonstrate:
/* Create related tables: */
CREATE TABLE public.uuid_product_value (
id int PRIMARY KEY,
pkid text,
value numeric,
product_id int,
effective_date timestamp(3)
);
CREATE TABLE public.uuid_product (
product_id int PRIMARY KEY
);
ALTER TABLE uuid_product_value
ADD CONSTRAINT uuid_product_value_product_id_fk
FOREIGN KEY (product_id)
REFERENCES uuid_product (product_id) ON DELETE CASCADE;
/* Insert some mocked up data */
INSERT INTO public.uuid_product VALUES (
generate_series(0,200));
INSERT INTO public.uuid_product_value VALUES (
generate_series(0,10000),
gen_random_uuid()::text,
random()*1000,
ROUND(random()*100),
current_timestamp(3));
/* Vacuum Analyze Both tables */
VACUUM (VERBOSE, ANALYZE) uuid_product;
VACUUM (VERBOSE, ANALYZE) uuid_product_value;
/* Verify that there are no dead tuples: */
SELECT
schemaname,
relname,
n_live_tup,
n_dead_tup
FROM
pg_stat_all_tables
WHERE
relname in ('uuid_product_value', 'uuid_product');
schemaname | relname | n_live_tup | n_dead_tup
------------+--------------------+------------+------------
public | uuid_product_value | 10001 | 0
public
Welcome to Part two of our series about building a High Availability Postgres cluster using Patroni! Part one focused entirely on establishing the DCS using etcd, providing the critical layer that Patroni uses to store metadata and guarantee its leadership token uniqueness across the cluster.

With this solid foundation, it's now time to build the next layer in our stack: Patroni itself. Patroni does the job of managing the Postgres service and provides a command interface for node administration and monitoring. Technically the Patroni cluster is complete at the end of this article, but stick around for part three where we add the routing layer that brings everything together.

Hopefully you still have the three VMs where you installed etcd. Those will be the same place where everything else happens, so if you haven’t already gone through the steps in part one, come back when you’re ready.

Otherwise, let’s get started!
Installing Postgres
The Postgres community site has an incredibly thorough page dedicated to installation on various platforms. For the sake of convenience, this guide includes a simplified version of the Debian instructions. Perform these steps on all three servers.

Start by setting up the PGDG repository:

Then install your favorite version of Postgres. For the purposes of this guide, we’re also going to stop Postgres and drop the initial cluster the Postgres package creates. Patroni will recreate all of this anyway, and it should be in control.

It’s also important to completely disable the default Postgres service since Patroni will be in charge:

Finally, install the version of Patroni included in the PGDG repositories. This should be available on supported platforms like Debian and RedHat variants, but if it isn’t, you may have to resort to the official installation instructions.

Once that command completes, we should have three fresh VMs ready for configuration.
Configuring Patroni the easy way
The Debian Patroni package provides a tool called that transforms a Patroni template into a configur[...]
Most PostgreSQL tuning advice that folks chase is quick fixes, not understanding what made the planner choose one path or join over another, more optimal one.
Tuning should not start with running ANALYZE on the tables involved in the query, but with understanding what is causing the issue and why the planner is not choosing the optimal path on its own.
Most fixes we reach for in SQL tuning are:
Add an index. Rewrite the query. Bump work_mem. Done.
Except it’s not done. The same problem comes back, different query, different table, same confusion.
The Real Problem
A slow query is a symptom. Statistics, DDL, query style, and PG version are the actual culprits.
Before you touch anything, you need to answer five questions — in order:
Find it — which query actually hurts the most right now?
Read the plan — what is the planner doing and where is it wrong?
Check statistics — is the planner even working with accurate data?
Check the DDL — is your schema helping or hiding the answer?
Check GUCs & version — are the defaults silently working against you?
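Question three is where most surprises live; a sketch of how to check it (the table and predicate are illustrative):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'pending';
-- Compare "rows=<estimate>" with "actual rows=<count>" on each plan node:
-- a large gap points at stale or missing statistics, not a missing index.
```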
5-Dimension SQL Tuning Framework
Most developers skip straight to question two. Many skip to indexes without asking any question at all.
What I Covered at PGConf India 2026
I presented this framework at PGConf India yesterday, to a room full of developers and DBAs, with sharp questions and a lot of “I’ve hit exactly this” moments.
The slides cover core foundations for approaching Query Tuning and production gotchas including partition pruning, SARGability, CTE fences, and correlated column statistics.
This extension provides transparent column-level encryption using custom PostgreSQL datatypes so developers can read and write encrypted columns without changing their SQL queries.
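To give a feel for the idea, a purely hypothetical sketch; the type name and behavior shown are illustrative, not the extension's actual API:

```sql
CREATE TABLE customers (
    id  bigint PRIMARY KEY,
    ssn encrypted_text        -- hypothetical custom type; stored encrypted
);

-- Reads and writes look like ordinary SQL; encryption is transparent
INSERT INTO customers VALUES (1, '123-45-6789');
SELECT ssn FROM customers WHERE id = 1;
```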
And perhaps the most human part of this project is this:
the idea for this project started back in 2016.
It stayed with me for years as one of those engineering ideas that never quite leaves your mind — the thought that PostgreSQL itself could enforce encryption at the column level.
Now I’ve finally decided to release it.
This is the first public version. It’s a starting point — useful, practical, and hopefully something the PostgreSQL community can explore and build upon.
Why This Matters
Encryption conversations often focus first on infrastructure.
We encrypt disks.
We use TLS connections.
We protect credentials.
All of these are important.
But once data is inside the database, a different question matters:
What happens if someone gains access to the database itself?
When using AWS RDS Proxy, the goal is to achieve connection multiplexing – many client connections share a much smaller pool of backend PostgreSQL connections, giving more resources per connection and keeping query execution running smoothly.
However, if the proxy detects that a session has changed internal state in a way it cannot safely track, it pins the client connection to a specific backend connection. Once pinned, that connection can never be multiplexed again. This was the case with a recent database I worked on.
In this case, we observed the following:
extremely high CPU usage
relatively high LWLock wait times
OOM killer activity on the database, maybe once every day or two
thousands of active connections
What was strange about it all was that the queries involved were relatively simple, with at most one join.
Finding the Pinning Source
To get to the root cause, one option was to look in pg_stat_statements. However, that approach had two problems:
Getting a clean snapshot of the statistics while thousands of queries were being actively processed would be tricky.
pg_stat_statements normalizes queries and does not expose the values passed to parameter placeholders.
Instead, to see the actual parameters, we briefly enabled log_statement = 'all'. This immediately surfaced something interesting in the logs, which could be downloaded and reviewed at my own pace.
What we saw were statements like SELECT set_config($2,$1,$3) with parameters related to JIT configuration – that was the first real clue.
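For reference, the shape of that call with the placeholders filled in with illustrative values (the actual parameter names and values came from the driver, not the application):

```sql
-- set_config(setting_name, new_value, is_local): session-level when
-- is_local is false — exactly the kind of session state RDS Proxy
-- cannot safely track, so it pins the connection
SELECT set_config('jit', 'off', false);
```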
Getting to the Bottom
After tracing the behavior through the stack, the root cause turned out to be surprisingly indirect. The application created new connections through SQLAlchemy’s asyncpg dialect, and we needed to drill down into that driver’s behavior.
Step 1 – Reviewing how SQLAlchemy registers JSON codecs
During connection initialization, SQLAlchemy runs an on_connect hook: