rjuju's home

Extracting SQL from WAL? (part 2)

2023-12-20T03:04:10+00:00

In the previous article of this series, we saw how to extract WAL records related to the exact SQL commands we want, INSERTs on heap tables, and what the structure of those records was. In this article we will focus on the heap specific information contained in those records and how to extract SQL queries from them.

INSERT data

At the end of the previous article, we could locate the various xl_heap_insert records from the WAL stream. From there, we extracted some metadata about the file’s physical location (tablespace oid, database oid and relation filenode among other things) and the data that was inserted itself.

As a reminder, here’s an extract of the code responsible for generating the WAL records for an INSERT, in the heap_insert() function, focusing on the interesting data:

void
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
			int options, BulkInsertState bistate)
{
[...]
		xl_heap_header xlhdr;
[...]
		xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
		xlhdr.t_infomask = heaptup->t_data->t_infomask;
		xlhdr.t_hoff = heaptup->t_data->t_hoff;
[...]
		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
		XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
		/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
		XLogRegisterBufData(0,
							(char *) heaptup->t_data + SizeofHeapTupleHeader,
							heaptup->t_len - SizeofHeapTupleHeader);
[...]

Two entries are registered: an xl_heap_header, which contains some metadata about the tuple extracted from the tuple header, and the data part of a HeapTuple. Let’s look at those in detail.

Page layout

First of all, let’s quickly see how postgres stores tables and indexes on disk. I will only cover the basics that will be helpful for the rest of the article. If you want to dig more into this topic, there are tons of resources available. You can refer to this entry point in the code, and I otherwise recommend looking at the section about it on “The internals of postgres” website.

A good general introduction is the documentation, which comes with a diagram of the layout that I include here:

Each tuple and index piece of data that postgres stores on disk is stored into a Page, which is by default 8kB. Each page starts with a header that contains some metadata about the page and ends with an optional “special area”, which can contain additional information specific to the component of postgres that will use this page.

In between is the actual data. The beginning of the data part is an array of ItemId, in ascending order, and the end of the data part are the items themselves (which will be the tuples in case of heap table pages), stored in the reverse order from the ItemId. Unless the page is totally full, there will be an empty space between the last ItemId and the first item (the pd_lower and pd_upper offset in the Page metadata).

Here’s the ItemId definition:

typedef struct ItemIdData
{
	unsigned lp_off:15,  /* offset to tuple (from start of page) */
		 lp_flags:2, /* state of line pointer, see below */
		 lp_len:15;  /* byte length of tuple */
} ItemIdData;

As you can see it holds the location of the item in the page, minimal metadata and the length of the item.
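To make the line pointer a bit more concrete, here is a minimal standalone sketch (not postgres code) that reuses the bitfield above to check whether an item fits in a page, and to compute the free space between the ItemId array and the items. The PAGE_SIZE constant and the lp_is_usable() and page_free_space() helpers are names made up for this example; LP_NORMAL mirrors the value used in itemid.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 8192          /* default page size, assumed for this sketch */
#define LP_NORMAL 1             /* line pointer is in use and points to an item */

/* Same shape as the ItemIdData bitfield shown above */
typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of line pointer */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

/* Hypothetical helper: is this a normal item that lies entirely in the page? */
bool
lp_is_usable(const ItemIdData *lp)
{
    return lp->lp_flags == LP_NORMAL &&
           lp->lp_len > 0 &&
           lp->lp_off + lp->lp_len <= PAGE_SIZE;
}

/*
 * Hypothetical helper: free space between the end of the ItemId array
 * (pd_lower) and the start of the items (pd_upper).
 */
unsigned
page_free_space(uint16_t pd_lower, uint16_t pd_upper)
{
    return pd_upper > pd_lower ? (unsigned) (pd_upper - pd_lower) : 0;
}
```

For instance, a 40-byte item stored at offset 8152 ends exactly at the end of an 8kB page, so lp_is_usable() accepts it, while the same item with a length of 41 would be rejected.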

HeapTuple

The largest part stored in the record is the tuple itself. As the historic and default access method to store tuples is called heap, the struct that holds the tuple is called HeapTuple. Any custom Table Access Method can use a different struct to store what it needs for its specific implementation, but it will then also use a custom resource manager to generate specific WAL records.

Here’s the definition of a HeapTuple:

typedef struct HeapTupleData
{
	uint32		t_len;		/* length of *t_data */
	ItemPointerData t_self;		/* SelfItemPointer */
	Oid		t_tableOid;	/* table the tuple came from */
#define FIELDNO_HEAPTUPLEDATA_DATA 3
	HeapTupleHeader t_data;		/* -> tuple header and data */
} HeapTupleData;

It starts with some metadata, which isn’t stored on disk but generated or retrieved from somewhere else when the struct is read from disk. Indeed, there wouldn’t be much value storing the relation’s oid for each tuple on disk. The length of the tuple is stored on disk, as it’s a necessary piece of information, and is retrieved from the associated ItemId that we saw just before.

After that follows the “real” data, which is what is stored in the item part of the Page. It’s again split in 2 parts: the tuple header, which I will cover a bit later, and the tuple data.

The tuple data is the physical on-disk representation of the tuple. It was designed to be as space efficient as possible, so accessing individual fields is a bit complex and CPU intensive. Let’s see the most important parts of this design. First, the tuple data is defined like this:

struct HeapTupleHeaderData
{
[...]
	/* ^ - 23 bytes - ^ */

#define FIELDNO_HEAPTUPLEHEADERDATA_BITS 5
	bits8		t_bits[FLEXIBLE_ARRAY_MEMBER];	/* bitmap of NULLs */

	/* MORE DATA FOLLOWS AT END OF STRUCT */
};

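You probably know or have heard that in postgres, NULL attributes don’t use any storage. Indeed, if an attribute is NULL there won’t be anything in the “data section”, and the bit for its attribute number in the t_bits bitmap will be cleared (a set bit means that a value is present, and the bitmap is only there at all if the tuple has at least one NULL attribute).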

Then, a lot of data types have a variable size (which is internally referred to as varlena). So, to save space postgres doesn’t store the offset of each attribute in the HeapTuple and just stores them next to each other (according to the datatype alignment rules) in a big chunk of memory.

This is indeed efficient, but unless your tuple only contains non-null fixed-size attributes, the only way to access a specific attribute is to walk through all the previous ones, skipping the NULL attributes and computing the position of the next one by reading the length of each variable-length value. This process is called tuple deforming: it takes a tuple as input and outputs two arrays, one with the datums and one with the null flags, both indexed by the attribute number (0 based). The opposite operation (transforming an array of datums and an array of null flags into a tuple) is unsurprisingly called tuple forming. If you want to read a bit more about those operations, the underlying functions are called heap_deform_tuple() and heap_form_tuple().
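The walk described above can be sketched with a deliberately simplified, standalone model. This is not what heap_deform_tuple() does in detail: the ToyAttr struct, the toy_att_isnull() and toy_deform() functions, and the toy variable-length format (one length byte followed by the payload, never aligned) are all assumptions made for the example. The only parts taken from postgres are the ideas: a set bit in the bitmap means "value present", fixed-size values are aligned, and each offset depends on everything stored before it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal attribute description */
typedef struct ToyAttr
{
    int     attlen;     /* > 0: fixed size in bytes, -1: variable length */
    int     attalign;   /* required alignment in bytes (1, 2, 4 or 8) */
} ToyAttr;

/* A cleared bit means NULL, a set bit means that a value is stored */
bool
toy_att_isnull(int att, const uint8_t *bits)
{
    return !(bits[att >> 3] & (1 << (att & 0x07)));
}

/*
 * Walk the data area and record, for each attribute, either a NULL flag or
 * the offset of its first byte.  "bits" may be NULL when the tuple has no
 * NULL attribute at all.
 */
void
toy_deform(const ToyAttr *attrs, int natts,
           const uint8_t *bits, const uint8_t *data,
           size_t *offsets, bool *isnull)
{
    size_t  off = 0;

    for (int i = 0; i < natts; i++)
    {
        if (bits != NULL && toy_att_isnull(i, bits))
        {
            isnull[i] = true;
            offsets[i] = 0;
            continue;           /* NULLs use no space in the data area */
        }
        isnull[i] = false;

        if (attrs[i].attlen > 0)
        {
            size_t  align = (size_t) attrs[i].attalign;

            /* round the offset up to the next multiple of the alignment */
            off = (off + align - 1) & ~(align - 1);
            offsets[i] = off;
            off += (size_t) attrs[i].attlen;
        }
        else
        {
            /* toy varlena: the length byte tells how far to skip */
            offsets[i] = off;
            off += 1 + data[off];
        }
    }
}
```

With a 4-byte integer, a 2-byte toy text value, a NULL and another 4-byte integer, the offsets come out as 0, 4 and 8: the last integer cannot start at byte 7 because of its alignment, and it could not be located at all without first reading the length of the text value.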

Note that tuple deforming is one of the operations that can be JITted, and there are some optimisations on the tuple deforming operation. Postgres supports “partial” deforming and will avoid deforming the full tuple when possible, stopping at the last attribute that the query is referencing, and will cache the offset of the latest attribute that has been deformed. But that can only help to some extent, so it’s always a good idea to mark columns as NOT NULL when possible, put all the columns with fixed-length attributes at the beginning of the tuples (with the NOT NULL first), ideally grouped by alignment size to avoid wasting a few bytes of padding, and put the most frequently accessed columns of variable length datatype next. All of that will help speed up tuple deforming as much as possible.

Tuple header

The first part of the stored data is an xl_heap_header struct. It’s just a shorter version of the real tuple header that only contains some part of it, the rest of the header being available elsewhere in the WAL record or just not needed otherwise. Doing it this way can save a few bytes for each insert in the WAL, which is always a good thing. Its definition is:

typedef struct xl_heap_header
{
	uint16		t_infomask2;
	uint16		t_infomask;
	uint8		t_hoff;
} xl_heap_header;

t_infomask2 and t_infomask are two bitmaps that contain information about the tuple. You may have heard about hint bits: those two fields contain the tuple-level hint bits.

Let’s look at their details in htup_details.h:

struct HeapTupleHeaderData
{
[...]
	/* Fields below here must match MinimalTupleData! */

#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK2 2
	uint16		t_infomask2;	/* number of attributes + various flags */

#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK 3
	uint16		t_infomask;		/* various flag bits, see below */

#define FIELDNO_HEAPTUPLEHEADERDATA_HOFF 4
	uint8		t_hoff;			/* sizeof header incl. bitmap, padding */

	/* ^ - 23 bytes - ^ */
[...]
};

/*
 * information stored in t_infomask2:
 */
#define HEAP_NATTS_MASK			0x07FF	/* 11 bits for number of attributes */
/* bits 0x1800 are available */
#define HEAP_KEYS_UPDATED		0x2000	/* tuple was updated and key cols
										 * modified, or tuple deleted */
#define HEAP_HOT_UPDATED		0x4000	/* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE			0x8000	/* this is heap-only tuple */

#define HEAP2_XACT_MASK			0xE000	/* visibility-related bits */
[...]
/*
 * information stored in t_infomask:
 */
#define HEAP_HASNULL			0x0001	/* has null attribute(s) */
#define HEAP_HASVARWIDTH		0x0002	/* has variable-width attribute(s) */
[...]
#define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
#define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN		(HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */
#define HEAP_XMAX_INVALID		0x0800	/* t_xmax invalid/aborted */
[...]

We can see a few bits useful for the tuple deforming. For instance, we see that 11 bits of t_infomask2 are used to store the actual number of attributes stored in this tuple. Adding a new column in a table doesn’t always require a full table rewrite, and in that case those bits are critical to know when to stop looking for additional attributes when accessing tuples stored before the column was added. There’s also information on whether the tuple contains any NULL or variable-length datatype attribute. The rest of the hint bits are a clever use of the available space to handle various SQL operations, MVCC rules, HOT updates and other low level optimisations.
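As a small illustration, here is a standalone sketch of how the deforming-related bits can be read back from an xl_heap_header. The masks are the ones quoted above; the header_natts(), header_has_nulls(), header_has_varwidth() and null_bitmap_len() helpers are names invented for this example, not postgres functions.

```c
#include <stdbool.h>
#include <stdint.h>

#define HEAP_NATTS_MASK     0x07FF  /* 11 bits for number of attributes */
#define HEAP_HASNULL        0x0001  /* has null attribute(s) */
#define HEAP_HASVARWIDTH    0x0002  /* has variable-width attribute(s) */

typedef struct xl_heap_header
{
    uint16_t    t_infomask2;
    uint16_t    t_infomask;
    uint8_t     t_hoff;
} xl_heap_header;

/* Number of attributes physically stored in the tuple */
int
header_natts(const xl_heap_header *hdr)
{
    return hdr->t_infomask2 & HEAP_NATTS_MASK;
}

/* Is there a null bitmap to consult while deforming? */
bool
header_has_nulls(const xl_heap_header *hdr)
{
    return (hdr->t_infomask & HEAP_HASNULL) != 0;
}

/* Does the tuple contain any variable-width attribute? */
bool
header_has_varwidth(const xl_heap_header *hdr)
{
    return (hdr->t_infomask & HEAP_HASVARWIDTH) != 0;
}

/* Size in bytes of a null bitmap with one bit per attribute */
int
null_bitmap_len(int natts)
{
    return (natts + 7) / 8;
}
```

A t_infomask2 of 0x8003, for example, describes a heap-only tuple with 3 stored attributes: the flag bits live above the 11 bits reserved for the attribute count, so masking is enough to separate them.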

Tuple descriptors

Now that we covered some internals of the HeapTuple, it seems much easier to reach our goal: transform the INSERT WAL records into plain SQL statements. We know that we just have to deform the tuples to retrieve the values and the NULL attributes, and generating the SQL statements around them isn’t hard. But here comes the second reason why we need a proper data directory to do so, and why the lack of DDL is important.

As you probably guessed by now, one critical piece of information needed for the tuple deforming operation is the table structure declaration. Indeed, the HeapTuple is just a big chunk of memory, and without the list of columns, data types, and the types details, it’s impossible to interpret it. If your model doesn’t change too much it’s probably possible to do without and instead generate some kind of mapping manually based on what you know about the history of the instance. Be careful if you go this way: any discrepancy between the original and generated data types can lead to bogus output in the best case, or crash your whole instance in the worst case. But in my case I had the guarantee that no DDL happened since the incident, and the other data directory was available, so I could just rely on it.

Postgres handles the table structure declaration using another struct, called TupleDesc, for tuple descriptor. Its definition is:

typedef struct TupleDescData
{
	int	     natts;	/* number of attributes in the tuple */
	Oid	     tdtypeid;	/* composite type ID for tuple type */
	int32	     tdtypmod;	/* typmod for tuple type */
	int	     tdrefcount;/* reference count, or -1 if not counting */
	TupleConstr *constr;	/* constraints, or NULL if none */
	/* attrs[N] is the description of Attribute Number N+1 */
	FormData_pg_attribute attrs[FLEXIBLE_ARRAY_MEMBER];
} TupleDescData;

In our case the most interesting members are the number of attributes (natts) and the array of pg_attribute records (attrs). Those are also useful for the SQL generation part, as we can retrieve the columns from it. Note also that postgres will generate a TupleDesc automatically when you internally open a relation.

Let’s recapitulate. We have the record data, the filename contains the physical file location information that we can use to retrieve the actual relation, we know how to get the tuple descriptor for this relation and we can use it to deform the tuple and get the values from it. We have almost everything we need to generate the SQL queries.

The only remaining detail is that the values we get from the tuple deforming operation are in their physical representation, and we need to emit their textual representation. Again, that’s not a problem as each data type has a dedicated function for that, called type output function, available in pg_type.typoutput.

Extracting SQL from the INSERT records

Now is time for the fun part where we just need to put everything together to finish the project!

I chose to write it as an extension to be able to add and remove it easily from a production server. I also chose to minimize the amount of C code and rely on plpgsql functions when possible. It’s faster to write and plpgsql is also way safer.

I only wrote a single pg_decode_record() C function, that takes as input a record as a bytea, the tablespace oid and the relation filenode and emits the underlying SQL query. I wrote an extra pg_decode_all_records() function in plpgsql that uses existing pg_ls_dir() and pg_read_binary_file() to retrieve the files and record, and split_part() to extract the metadata from the filename.
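To illustrate the filename part, here is a standalone C sketch of the same splitting that split_part() does on the SQL side. It assumes file names shaped like 0000000F-46367520.1663.16384.66743.0_main, as produced later in this article: the two hexadecimal halves of the record LSN, then the tablespace oid, database oid and relfilenode. Reading the last two components as a block number and a fork name is an assumption, and the SavedRecordName struct and parse_saved_record_name() function are names invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the metadata encoded in a saved record name */
typedef struct SavedRecordName
{
    uint64_t    lsn;            /* record location in the WAL */
    unsigned    spc_oid;        /* tablespace oid */
    unsigned    db_oid;         /* database oid */
    unsigned    relfilenode;    /* relation filenode */
    unsigned    block;          /* assumed: block number */
    char        fork[16];       /* assumed: fork name, e.g. "main" */
} SavedRecordName;

/* Returns true and fills *out if the name has the expected shape */
bool
parse_saved_record_name(const char *name, SavedRecordName *out)
{
    unsigned    hi;
    unsigned    lo;
    int         nfields;

    nfields = sscanf(name, "%8x-%8x.%u.%u.%u.%u_%15s",
                     &hi, &lo,
                     &out->spc_oid, &out->db_oid, &out->relfilenode,
                     &out->block, out->fork);
    if (nfields != 7)
        return false;

    /* an LSN is a 64-bit value, displayed as two 32-bit hexadecimal halves */
    out->lsn = ((uint64_t) hi << 32) | lo;
    return true;
}
```

Only the tablespace oid and the relfilenode are needed by pg_decode_record(); the other fields are kept here for completeness.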

I’m attaching the resulting extension to this article so you can see the whole implementation and adapt it if needed, and will just quickly describe the main parts here as we already covered the underlying elements. I’m also only showing here a simplified version to avoid too many implementation details.

First, I look for a matching relation oid in the pg_class catalog for the given tablespace and relfilenode, open the found relation with the weakest lock possible, make a copy of the tuple descriptor and start generating the SQL query with the qualified relation name. As in a normal application, you need to make sure that the identifiers are properly quoted to generate working queries:

PGDLLEXPORT Datum
pg_decode_record(PG_FUNCTION_ARGS)
{
    bytea      *record = PG_GETARG_BYTEA_PP(0);
    Oid         spcOid = PG_GETARG_OID(1);
    Oid         relNumber = PG_GETARG_OID(2);
    /* palloc'd and initialized buffer that will hold the generated query */
    StringInfo  buf = makeStringInfo();
[...]
    /* Get the relation oid from the tablespace oid and relfilenode */
    relid = get_spc_relnumber_relid(spcOid, relNumber);

    relation = table_open(relid, AccessShareLock);
    tupdesc = CreateTupleDescCopy(RelationGetDescr(relation));

    /* Start generating the SQL query */
    appendStringInfo(buf, "INSERT INTO %s.%s",
    		 quote_identifier(get_namespace_name(RelationGetNamespace(relation))),
    		 quote_identifier(RelationGetRelationName(relation)));

The next part extracts the data from the record and generates a HeapTuple with just enough information to be correctly deformed:

    /* mimic heap_xlog_insert */
    /* the bytea may have a short varlena header, so use the _ANY variants */
    data = VARDATA_ANY(record);
    datalen = VARSIZE_ANY_EXHDR(record);
[...]
    htup = &tbuf.hdr;
[...]
    htup->t_hoff = xlhdr.t_hoff;

    /* build a fake tuple with the bare minimum to deform it */
    tuple = (HeapTuple) palloc0(HEAPTUPLESIZE + VARSIZE_ANY(record));
    tuple->t_data = htup;
    tuple->t_len = VARSIZE_ANY(record);
    ItemPointerSetInvalid(&(tuple->t_self));
    tuple->t_tableOid = relid;

For the next step, we just need to allocate the 2 arrays needed for the deforming and call heap_deform_tuple():

    values = palloc0(sizeof(Datum) * tupdesc->natts);
    isnull = palloc0(sizeof(bool) * tupdesc->natts);
    heap_deform_tuple(tuple, tupdesc, values, isnull);

Now that we have all the elements, we just need to iterate over the list of columns in the tuple descriptor, output a NULL if needed, otherwise find the type output function, call it for our value, and output it in the query after escaping it:

    /* append the values */
    appendStringInfoString(buf, " VALUES (");
    for (i = 0; i < tupdesc->natts; i++)
    {
    	char	   *value = NULL;
    	Oid			typoutput;
    	bool		typisvarlena;

    	if (i > 0)
    		appendStringInfoString(buf, ", ");

    	if (isnull[i])
    	{
    		appendStringInfoString(buf, "NULL");
    		continue;
    	}

    	getTypeOutputInfo(TupleDescAttr(tupdesc, i)->atttypid,
    					  &typoutput, &typisvarlena);

    	value = OidOutputFunctionCall(typoutput, values[i]);
    	value = quote_literal_cstr(value);

    	appendStringInfo(buf, "%s", value);

    	pfree(value);
    }
    appendStringInfoString(buf, ");");
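If you wonder what the escaping step amounts to, here is a simplified standalone sketch of the quoting rules that quote_literal_cstr() applies: the value is wrapped in single quotes, embedded single quotes and backslashes are doubled, and the literal gets an E prefix when it contains a backslash. The toy_quote_literal() function is a name invented for this example and uses malloc() rather than palloc(); it is only meant to show the rules, not to replace the real function.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated, quoted copy of src usable as an SQL literal,
 * or NULL if the allocation fails.  The caller must free() the result.
 */
char *
toy_quote_literal(const char *src)
{
    size_t      len = strlen(src);
    /* worst case: every byte doubled, plus E, two quotes and the final NUL */
    char       *result = malloc(len * 2 + 4);
    char       *dst = result;

    if (result == NULL)
        return NULL;

    /* backslashes are only special inside an escape string literal */
    if (strchr(src, '\\') != NULL)
        *dst++ = 'E';

    *dst++ = '\'';
    for (; *src != '\0'; src++)
    {
        /* double quotes and backslashes so they are read back literally */
        if (*src == '\'' || *src == '\\')
            *dst++ = *src;
        *dst++ = *src;
    }
    *dst++ = '\'';
    *dst = '\0';

    return result;
}
```

This is why the bytea values emitted for the TOAST chunks later in this article look like E'\\x...': the textual output of a bytea starts with a backslash, which has to be doubled.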

Once done, we just need to properly close the relation and return the generated query to the caller:

	table_close(relation, NoLock);

	PG_RETURN_TEXT_P(cstring_to_text(buf->data));
}

And that’s all you need for the basic scenario! The real implementation has a bit more code for various other cases, like very basic TOAST table support, but is still unlikely to correctly handle any weird corner cases that can happen in the wild.

Basic usage

We can finally see the result of all the hard work in this article and the previous one! I will be using a simple scenario, first saving the current WAL position to only keep the records generated afterwards, then removing all the data from the table (without changing its relfilenode) to make sure that we don’t read anything from the table itself.

-- Get the current WAL location
rjuju=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 F/46349E80
(1 row)

rjuju=# CREATE EXTENSION pg_decode_record;
CREATE EXTENSION

rjuju=# CREATE TABLE decode_record(id integer, val text storage external);
CREATE TABLE

rjuju=# INSERT INTO decode_record
  SELECT 1, 'simple test';
INSERT 0 1

-- Force a full-page write
rjuju=# CHECKPOINT;
CHECKPOINT

rjuju=# INSERT INTO decode_record
  SELECT 2, 'full-page write';
INSERT 0 1

rjuju=# INSERT INTO decode_record
  SELECT 3, 'a bit big '||string_agg(random()::text, ' ') FROM generate_series(1, 10);
INSERT 0 1

rjuju=# INSERT INTO decode_record
  SELECT 4, 'way bigger '||string_agg(random()::text, ' ') FROM generate_series(1, 120);
INSERT 0 1

-- Check the heap table size and underlying TOAST table size
rjuju=# SELECT oid::regclass::text, pg_size_pretty(pg_relation_size(oid)),
  reltoastrelid::regclass::text, pg_size_pretty(pg_relation_size(reltoastrelid))
  FROM pg_class
  WHERE relname = 'decode_record';
      oid      | pg_size_pretty |      reltoastrelid      | pg_size_pretty
---------------+----------------+-------------------------+----------------
 decode_record | 8192 bytes     | pg_toast.pg_toast_66731 | 8192 bytes
(1 row)

rjuju=# DELETE FROM decode_record;
DELETE 4

-- Make sure we remove all records and physically empty the tables
rjuju=# VACUUM decode_record;
VACUUM

rjuju=# SELECT oid::regclass::text, pg_size_pretty(pg_relation_size(oid)),
  reltoastrelid::regclass::text, pg_size_pretty(pg_relation_size(reltoastrelid))
  FROM pg_class
  WHERE relname = 'decode_record';
      oid      | pg_size_pretty |      reltoastrelid      | pg_size_pretty
---------------+----------------+-------------------------+----------------
 decode_record | 0 bytes        | pg_toast.pg_toast_66737 | 0 bytes
(1 row)

Ok, we should have a few records generated in the WAL corresponding to data we definitely lost in the table. Let’s extract the INSERT records using the custom pg_waldump we created in the previous article:

$ mkdir -p /tmp/pg_decode_record
$ pg_waldump --start "F/46349E80" --save-records /tmp/pg_decode_record
[...]
$ ls -l /tmp/pg_decode_record
0000000F-46367520.1663.16384.66743.0_main
0000000F-46367660.1663.16384.66743.0_main
0000000F-46367738.1663.16384.66743.0_main
0000000F-46367868.1663.16384.66746.0_main
0000000F-46368130.1663.16384.66746.0_main
0000000F-46368300.1663.16384.66743.0_main

You might wonder why there are 6 records extracted while we only inserted 4 rows. That’s because the last record was big enough to be TOASTed using 2 chunks, and as far as the WAL is concerned that’s 3 separate INSERTs in 2 different tables. Let’s see that in detail using the extension to decode the records (truncating the output as some rows are quite big):

rjuju=# SELECT substr(v, 1, 95)
    FROM pg_decode_all_records('/tmp/pg_decode_record') f(v);
                                          substr
-------------------------------------------------------------------------------------------
 INSERT INTO public.decode_record (id, val) VALUES ('1', 'simple test');
 INSERT INTO public.decode_record (id, val) VALUES ('2', 'full-page write');
 INSERT INTO public.decode_record (id, val) VALUES ('3', 'a bit big 0.5356172842583808 0.3...'
 INSERT INTO pg_toast.pg_toast_66810 VALUES ('66815', '0', E'\\x7761792062696767657220302e...'
 INSERT INTO pg_toast.pg_toast_66810 VALUES ('66815', '1', E'\\x3337383137353120302e303439...'
 INSERT INTO public.decode_record (id, val) VALUES ('4', /* toast pointer 66815 */);
(6 rows)

(note: I slightly edited the output to make it smaller and have correct syntax highlighting, the real extension will emit the real table name in a comment in case of INSERT in a TOAST table)

We see the first normal records properly decoded, whether they’re in a full-page image or not. The last record is indeed split into 3 different INSERTs, 2 in the TOAST table and 1 in the heap table.

As I mentioned earlier I only added very minimal support for TOAST tables, as I didn’t have any information about the customer tables and whether they would hit that case or not, or how often. The last insert isn’t a valid statement as the 2nd value is missing, but we can manually extract the value from the INSERT statements in the TOAST table and therefore fix the normal INSERT. For instance, using the first few bytes that we can see in the first chunk:

rjuju=# SELECT encode(E'\\x7761792062696767657220302e', 'escape');
-[ RECORD 1 ]---------
encode | way bigger 0.

The data is there, it just needs a bit of manual processing to get it.
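The same manual step can be done outside of the database. As a minimal sketch, the following standalone C function decodes the hexadecimal bytea output format (a \x prefix followed by two hex digits per byte) back into raw bytes. The hex_digit() and decode_bytea_hex() names are made up for this example, and the sketch assumes uncompressed data, as in the scenario above.

```c
#include <stddef.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if the character isn't one */
static int
hex_digit(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/*
 * Decode a "\x..." bytea literal into out, which is NUL-terminated for
 * convenience.  Returns the number of decoded bytes, or -1 on malformed
 * input or if out is too small.
 */
int
decode_bytea_hex(const char *hex, char *out, size_t outsz)
{
    size_t      n = 0;

    if (outsz == 0 || strncmp(hex, "\\x", 2) != 0)
        return -1;

    for (hex += 2; hex[0] != '\0'; hex += 2)
    {
        int     hi = hex_digit(hex[0]);
        int     lo = hex_digit(hex[1]);    /* -1 if the string ends here */

        if (hi < 0 || lo < 0 || n + 1 >= outsz)
            return -1;
        out[n++] = (char) ((hi << 4) | lo);
    }
    out[n] = '\0';

    return (int) n;
}
```

Feeding it the first 13 bytes of the first chunk, \x7761792062696767657220302e, gives back the same "way bigger 0." prefix as the encode() call above. The chunks then only need to be concatenated in chunk sequence order to rebuild the full value.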

To be totally fair, I also cheated a bit in that example by making sure that the data will be TOASTed but not compressed, so it’s very easy to manually retrieve the raw value from the extra INSERTs in the TOAST tables. It wouldn’t be very hard to have all of that working transparently, but I simply didn’t have the need. If you’re interested in that, I’d recommend looking at the detoast_attr() function in src/backend/access/common/detoast.c and all underlying code to see how you can manually decompress data. You would then only need to store the detoasted (and potentially decompressed) value referenced by the toast’s chunk_id locally, and emit it in the query instead of the currently emitted comment.

Conclusion

I hope you enjoyed those two articles and learned a bit about the WAL infrastructure and the way pages and tuples work internally.

If you missed it in the article, here is the link for the full extension.

I want to emphasize again that all the code I showed here is only a quick proof of concept designed for one narrow use case, and it should be used with care. My goal here wasn’t to show state of the art code but rather to show one possible way to quickly come up with a plan to salvage data in case of a production incident. If you’re unfortunately confronted with a similar problem, or some other major accident, I hope you will find some valuable resources and a starting point to come up with your own dedicated solution!

Extracting SQL from WAL? (part 2) was originally published by Julien Rouhaud at rjuju's home on December 20, 2023.

Extracting SQL from WAL? (part 1)

2023-12-06T03:04:10+00:00

Is it actually possible to extract SQL commands from WAL generated in “replica” wal_level?

The answer is usually no, the “logical” wal_level exists for a reason after all, and you shouldn’t expect some kind of miracle here.

But in this series of articles you will see that if some conditions are met you can still manage to extract some information, and how to do it. This first article focuses on the WAL records and how to extract the ones you want, while the next one will show how to try to extract the information contained in those records.

Some context

This article is based on some work I did a few months ago to help a customer recover some data after an incident. It’s not a perfect solution and mostly a set of quick hacks I did to come up with something able to retrieve data in a few hours of work only, but I hope sharing details about it and some methodology can be helpful if you ever get in a similar situation. You will probably need to adapt it to your needs, with yet other hacks, but it should give you a good start. It can otherwise be of some interest if you want to know a bit more about the WAL records internals and some associated infrastructure.

The incident

Due to a series of unfortunate events, one of their HA clusters ended up in a split-brain situation for some time before being reinitialised, which entirely removed one of the data directories. After that, only the WALs that were generated on that instance were available, those being in “replica” wal_level, and nothing else.

One possibility to try to recover the data would be to restore a physical backup, if any, replay archived WALs until the last transaction before the removed node is promoted (assuming those are still available) and then replay the WALs generated on that newly promoted node. Once there you still need to look at each row of each table of each database and compare it to yet another instance restored from the same backup to approximately the same time as this one. That’s clearly not ideal as it will likely require many days or even weeks of tedious hard work to do so, and will consume a lot of resources along the way. Is there a way to do better?

After a quick discussion, it turned out that there were a few elements that made some recovery from the WALs themselves possible (more on why later):

One of the data directories was still available
The customer guaranteed that no DDL happened since the incident
Only INSERTs happened during the split-brain

WALs & Physical replication

As you probably know, postgres physical replication works by sending an exact copy of the modified binary raw data to the various standby servers, in a continuous stream of WAL records. As a consequence, those records don’t really know much about the database objects they reference, and nothing about the SQL queries that generated them. So what do they really contain? Let’s see what’s inside the WAL records generated for an INSERT into a normal heap relation.

WAL records

First of all, you have to know that the WAL records are split into Resource Managers (declared in src/include/access/rmgrlist.h), each being responsible for a specific part of postgres (heap tables, indexes, vacuum…). They’re identified by a numeric identifier and often referred to as an rmid, for resource manager identifier.

Each of those resource managers can handle various operations, which are internally called opcodes. Here we’re interested in the WAL records generated while operating on standard heap tables, and especially during INSERTs. This resource manager is a bit particular as it’s split into 2 different rmid: RM_HEAP_ID and RM_HEAP2_ID. This is only an implementation detail, as each resource manager can only handle a limited number of opcodes; everything is the same otherwise.

If you’re curious, here’s the definition of the main WAL record in the source code and a bit of details on the exact layout in the files:

/*
 * The overall layout of an XLOG record is:
 *		Fixed-size header (XLogRecord struct)
 *		XLogRecordBlockHeader struct
 *		XLogRecordBlockHeader struct
 *		...
 *		XLogRecordDataHeader[Short|Long] struct
 *		block data
 *		block data
 *		...
 *		main data
 * [...]
 */
typedef struct XLogRecord
{
	uint32		xl_tot_len;		/* total len of entire record */
	TransactionId xl_xid;		/* xact id */
	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
	uint8		xl_info;		/* flag bits, see below */
	RmgrId		xl_rmid;		/* resource manager for this record */
	/* 2 bytes of padding here, initialize to zero */
	pg_crc32c	xl_crc;			/* CRC for this record */

	/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */

} XLogRecord;

and a block data header:

/*
 * Header info for block data appended to an XLOG record.
 *
 * 'data_length' is the length of the rmgr-specific payload data associated
 * with this block. It does not include the possible full page image, nor
 * XLogRecordBlockHeader struct itself.
 *
 * Note that we don't attempt to align the XLogRecordBlockHeader struct!
 * So, the struct must be copied to aligned local storage before use.
 */
typedef struct XLogRecordBlockHeader
{
	uint8		id;				/* block reference ID */
	uint8		fork_flags;		/* fork within the relation, and flags */
	uint16		data_length;	/* number of payload bytes (not including page
								 * image) */

	/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
	/* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */
	/* BlockNumber follows */
} XLogRecordBlockHeader;

Everything here is very generic as it’s used by all the resource managers. One important bit though is the mention of a RelFileLocator after the block header if the record contains information about a different relation from the previous block, whatever it was (which is the meaning of BKPBLOCK_SAME_REL). This is of course important information for us.

typedef struct RelFileLocator
{
	Oid			spcOid;			/* tablespace */
	Oid			dbOid;			/* database */
	RelFileNumber relNumber;	/* relation */
} RelFileLocator;

But here’s a first reason why you need a proper data directory to do anything with the WALs: this doesn’t contain the schema name and table name, or even the table oid, but the tablespace oid, database oid and relfilenode, which is what the WAL actually needs to identify a physical relation file (which is itself split into multiple files, the exact fork and segment are deduced using other information). So if any table rewrite (e.g. a VACUUM FULL) happened since the WAL records were generated, you won’t be able to identify which relation a record is about, unless of course you find a way to map the current relfilenode to the one before the table rewrite.
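As an illustration of how the fork_flags byte of the block header drives what follows it, here is a small standalone sketch. The BKPBLOCK_* values mirror the ones defined in xlogrecord.h, while the block_fork(), block_has_image(), block_has_data() and block_has_own_locator() helpers are names invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define BKPBLOCK_FORK_MASK  0x0F    /* low bits: fork number */
#define BKPBLOCK_HAS_IMAGE  0x10    /* block data is an XLogRecordBlockImage */
#define BKPBLOCK_HAS_DATA   0x20    /* rmgr-specific payload is present */
#define BKPBLOCK_WILL_INIT  0x40    /* redo will re-init the page */
#define BKPBLOCK_SAME_REL   0x80    /* RelFileLocator omitted, same as previous */

/* Fork number the block belongs to (0 is the main fork) */
int
block_fork(uint8_t fork_flags)
{
    return fork_flags & BKPBLOCK_FORK_MASK;
}

/* Does an XLogRecordBlockImageHeader follow the block header? */
bool
block_has_image(uint8_t fork_flags)
{
    return (fork_flags & BKPBLOCK_HAS_IMAGE) != 0;
}

/* Is there any payload (here: xl_heap_header + tuple data) for this block? */
bool
block_has_data(uint8_t fork_flags)
{
    return (fork_flags & BKPBLOCK_HAS_DATA) != 0;
}

/* Is a RelFileLocator stored for this block, or inherited from the previous one? */
bool
block_has_own_locator(uint8_t fork_flags)
{
    return (fork_flags & BKPBLOCK_SAME_REL) == 0;
}
```

A single-block record such as a plain heap INSERT has no previous block to inherit from, so its only block reference carries its own RelFileLocator.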

Heap INSERT WAL records

Now that we saw a bit of the general WAL structures, let’s focus on the data specific to an INSERT. If you’re not really familiar with the internals, one easy way to locate the code related to a specific command is to look at the functions associated to a resource manager. Let’s look at the RM_HEAP_ID information in src/include/access/rmgrlist.h:

/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)

We here have the name of the actual functions responsible for many operations (the exact list will vary depending on the postgres major version, I’m here using the list in postgres 17).

The redo function is the name of the function that applies an RM_HEAP_ID record, the desc function is the one that emits the info you see in pg_waldump, the identify function returns a string describing the opcode and so on. Let’s look at heap_identify():

const char *
heap_identify(uint8 info)
{
	const char *id = NULL;

	switch (info & ~XLR_INFO_MASK)
	{
		case XLOG_HEAP_INSERT:
			id = "INSERT";
			break;
[...]
	}

	return id;
}
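The masking done in that switch can be sketched in isolation. The low 4 bits of xl_info are reserved for the WAL machinery itself (XLR_INFO_MASK), and the heap resource manager uses the remaining bits for its opcode plus an "init page" flag. The constants below mirror the ones in xlogrecord.h and heapam_xlog.h; the rmgr_info_bits(), heap_info_is_insert() and heap_info_inits_page() helpers are names invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define XLR_INFO_MASK           0x0F    /* bits reserved for the WAL machinery */
#define XLOG_HEAP_INSERT        0x00
#define XLOG_HEAP_OPMASK        0x70    /* bits holding the heap opcode */
#define XLOG_HEAP_INIT_PAGE     0x80    /* the record also (re)initializes the page */

/* Keep only the bits that belong to the resource manager */
uint8_t
rmgr_info_bits(uint8_t info)
{
    return (uint8_t) (info & ~XLR_INFO_MASK);
}

/* Is this heap record an INSERT, with or without page initialization? */
bool
heap_info_is_insert(uint8_t info)
{
    return (info & XLOG_HEAP_OPMASK) == XLOG_HEAP_INSERT;
}

/* Was the target page initialized from scratch by this record? */
bool
heap_info_inits_page(uint8_t info)
{
    return (info & XLOG_HEAP_INIT_PAGE) != 0;
}
```

Note that this check is only meaningful once you know that xl_rmid is RM_HEAP_ID, as every resource manager numbers its own opcodes independently.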

We now know that the opcode we’re interested in is XLOG_HEAP_INSERT. A quick git grep in the tree will lead you to src/backend/access/heap/heapam.c, more precisely the heap_insert function. The interesting bit is located in the “XLOG stuff” block. I will show here an extract focusing on the bit we will need:

void
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
			int options, BulkInsertState bistate)
{
[...]
	/* XLOG stuff */
	if (RelationNeedsWAL(relation))
	{
		xl_heap_insert xlrec;
		xl_heap_header xlhdr;
		XLogRecPtr	recptr;
		Page		page = BufferGetPage(buffer);
		uint8		info = XLOG_HEAP_INSERT;
		int			bufflags = 0;
[...]
		xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
		xlrec.flags = 0;
[...]
		XLogBeginInsert();
		XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);

		xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
		xlhdr.t_infomask = heaptup->t_data->t_infomask;
		xlhdr.t_hoff = heaptup->t_data->t_hoff;

		/*
		 * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
		 * write the whole page to the xlog, we don't need to store
		 * xl_heap_header in the xlog.
		 */
		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
		XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
		/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
		XLogRegisterBufData(0,
							(char *) heaptup->t_data + SizeofHeapTupleHeader,
							heaptup->t_len - SizeofHeapTupleHeader);
[...]
		recptr = XLogInsert(RM_HEAP_ID, info);

		PageSetLSN(page, recptr);
	}

We see here that this function is as expected inserting an RM_HEAP_ID record, with an XLOG_HEAP_INSERT opcode. There are 2 data parts associated with this record: the header of the tuple that’s being inserted and the tuple itself.
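
To give an idea of what those 2 data parts look like for a consumer, here is a small Python sketch that splits them apart. It assumes the two registered chunks are stored back to back, a little-endian machine, and the sizes I know for these structs (5 bytes for xl_heap_header, 23 bytes for the fixed part of a tuple header). The function name and the returned keys are mine.

```python
import struct

SIZE_OF_HEAP_HEADER = 5          # uint16 t_infomask2 + uint16 t_infomask + uint8 t_hoff
SIZE_OF_HEAP_TUPLE_HEADER = 23   # offsetof(HeapTupleHeaderData, t_bits)
HEAP_NATTS_MASK = 0x07FF         # low 11 bits of t_infomask2: number of attributes


def split_insert_block_data(block_data):
    """Split the block data of a heap INSERT record into header fields and payload."""
    if len(block_data) < SIZE_OF_HEAP_HEADER:
        raise ValueError("block data too short to contain an xl_heap_header")

    # "<" means little-endian without padding, so exactly 5 bytes are read.
    t_infomask2, t_infomask, t_hoff = struct.unpack_from("<HHB", block_data, 0)

    # The payload starts at t_bits (the fixed 23 bytes are not logged), so the
    # user data begins t_hoff - 23 bytes into it.
    payload = block_data[SIZE_OF_HEAP_HEADER:]
    data_offset = t_hoff - SIZE_OF_HEAP_TUPLE_HEADER
    if data_offset < 0 or data_offset > len(payload):
        raise ValueError("inconsistent t_hoff")

    return {
        "natts": t_infomask2 & HEAP_NATTS_MASK,
        "t_infomask": t_infomask,
        "t_hoff": t_hoff,
        "bitmap_and_padding": payload[:data_offset],
        "user_data": payload[data_offset:],
    }
```

With a single int4 column and no NULL bitmap, t_hoff is 24, so there is one byte of padding followed by the 4 bytes of the value.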

That’s great! At this point we know how to identify what relation an INSERT is about and the content of that INSERT. Let’s see how to filter those records from the WALs.

Extracting and filtering WAL records

Parsing the postgres WALs isn’t that complicated but still requires knowing quite a bit more than what I showed here. Writing such code is possible but wait, don’t we already have a tool shipped with postgres which is designed to do exactly that? Yes there sure is, it’s pg_waldump.

Rather than writing something similar, couldn’t we simply teach pg_waldump to filter the records we’re interested in and save them somewhere so that we can later process them and generate SQL queries? This way we can then also benefit from all options in pg_waldump like specifying the starting and/or ending LSN or filtering a specific resource manager, without the need to worry about most of the WAL implementation details and only focusing on the few functions provided by postgres necessary for our needs. Let’s see how to implement that.

The main source file is src/bin/pg_waldump/pg_waldump.c. Skipping most of the unrelated code, we can see that there’s a main loop that takes care of reading each record one by one, optionally filters them and then does something with them depending on how the tool was executed. I will again show an extract to focus on the most relevant part only:

	for (;;)
	{
[...]
		/* try to read the next record */
		record = XLogReadRecord(xlogreader_state, &errormsg);
[...]
		/* apply all specified filters */
		if (config.filter_by_rmgr_enabled &&
			!config.filter_by_rmgr[record->xl_rmid])
			continue;

[...]

		/* perform any per-record work */
		if (!config.quiet)
		{
			if (config.stats == true)
			{
				XLogRecStoreStats(&stats, xlogreader_state);
				stats.endptr = xlogreader_state->EndRecPtr;
			}
			else
				XLogDumpDisplayRecord(&config, xlogreader_state);
		}

		/* save full pages if requested */
		if (config.save_fullpage_path != NULL)
			XLogRecordSaveFPWs(xlogreader_state, config.save_fullpage_path);

		/* check whether we printed enough */
		config.already_displayed_records++;
		if (config.stop_after_records > 0 &&
			config.already_displayed_records >= config.stop_after_records)
			break;
	}

That’s quite simple: pg_waldump reads the records one by one until it needs to stop, ignores the records that the user asked to discard and then takes action on the remaining ones. We can see that there’s already an option to save full page images, so it definitely looks like we could just add something similar there, but for all records.

First, we will need to provide a way to identify the relation the INSERT is about. That’s the RelFileLocator, and we already know that it can be found just after the XLogRecordBlockHeader. Postgres provides a function to retrieve this information, and a bit more, named XLogRecGetBlockTagExtended(). Here is its description:

/*
 * Returns information about the block that a block reference refers to,
 * optionally including the buffer that the block may already be in.
 *
 * If the WAL record contains a block reference with the given ID, *rlocator,
 * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
 * returns true.  Otherwise returns false.
 */
bool
XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
						   RelFileLocator *rlocator,
						   ForkNumber *forknum,
						   BlockNumber *blknum,
						   Buffer *prefetch_buffer)

We need to provide the record - pg_waldump already retrieves it for us - and the block_id. The block_id, or block reference, is simply an offset in the array of data that the WAL record contains. If you look a bit above in this article, you will see that we already know that heap_insert() only uses a hardcoded 0 block_id: this is the first argument in the various XLogRegisterXXX() function calls.

Next we need to retrieve the actual WAL record data, the tuple header and the tuple itself. This one is a bit trickier, as the record can either be found in a simple WAL record or in a full-page record. We need to check for a simple WAL record first. The associated function is XLogRecGetBlockData():

/*
 * Returns the data associated with a block reference, or NULL if there is
 * no data (e.g. because a full-page image was taken instead). The returned
 * pointer points to a MAXALIGNed buffer.
 */
char *
XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)

As noted in the comment, if the function returns NULL (and sets len to 0) then the data may be in a full-page image instead (or the data could be missing entirely). If that’s the case we need to retrieve the full-page image, and then locate the tuple the INSERT was about and extract it in the same format as a simple WAL record.

Postgres provides a function to extract the full-page image: RestoreBlockImage():

/*
 * Restore a full-page image from a backup block attached to an XLOG record.
 *
 * Returns true if a full-page image is restored, and false on failure with
 * an error to be consumed by the caller.
 */
bool
RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)

which is straightforward to use: just provide the record and the block identifier and you get the full-page image if found. However, there’s no function available to extract a tuple from a full-page image. Indeed postgres can simply overwrite the whole block with the full-page image as it contains the latest version of the block at the time it was generated, but in our case we definitely don’t want to emit an INSERT statement for every already existing tuple in the block!

Fortunately, even when we get a full-page image, our record still contains a main data area. If you look up at the heap_insert() function, that’s the call to XLogRegisterData(), and as you see here it contains an xl_heap_insert struct. And the first member of this struct, offnum, is actually the position of the tuple in the page which is exactly what we need!

With all of that, it’s just a matter of accessing the tuple header and tuple at the correct place among all the tuples present in the page, and saving them the way they would be if it were a simple WAL record. If you’re wondering how exactly it should be done, you can always look at how postgres itself does it when it needs to return a specific tuple and adapt that code to your need. The functions responsible for that are heapgetpage() and heapgettup(), located in the src/backend/access/heap/heapam.c file we already mentioned.
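
The patch does this in C inside pg_waldump, but the idea is easier to show with a short Python sketch working on the restored 8kB image. It assumes a little-endian machine (which matters for the line pointer bitfield), the default block size, and the usual struct offsets. The function name is mine, and it deliberately ignores anything that isn’t a normal line pointer.

```python
import struct

BLCKSZ = 8192
SIZE_OF_PAGE_HEADER = 24         # line pointers start right after the page header
ITEM_ID_SIZE = 4                 # one ItemIdData per tuple
SIZE_OF_HEAP_TUPLE_HEADER = 23   # offsetof(HeapTupleHeaderData, t_bits)
LP_NORMAL = 1                    # line pointer is in use and points to a tuple


def tuple_from_full_page_image(page, offnum):
    """Rebuild "xl_heap_header + tuple data" for the tuple at offnum (1-based)."""
    if len(page) != BLCKSZ:
        raise ValueError("expected a full restored page")
    if offnum < 1:
        raise ValueError("offset numbers start at 1")

    # pd_lower (offset 12 in the page header) marks the end of the line pointers.
    (pd_lower,) = struct.unpack_from("<H", page, 12)
    lp_pos = SIZE_OF_PAGE_HEADER + (offnum - 1) * ITEM_ID_SIZE
    if lp_pos + ITEM_ID_SIZE > pd_lower:
        raise ValueError("offnum is past the last line pointer")

    # ItemIdData is a 32-bit bitfield: lp_off:15, lp_flags:2, lp_len:15
    # (layout assumed for a little-endian build).
    (raw,) = struct.unpack_from("<I", page, lp_pos)
    lp_off = raw & 0x7FFF
    lp_flags = (raw >> 15) & 0x3
    lp_len = raw >> 17
    if lp_flags != LP_NORMAL or lp_len < SIZE_OF_HEAP_TUPLE_HEADER:
        raise ValueError("line pointer does not reference a normal tuple")

    tup = bytes(page[lp_off:lp_off + lp_len])
    # t_infomask2, t_infomask and t_hoff sit at offsets 18, 20 and 22.
    t_infomask2, t_infomask, t_hoff = struct.unpack_from("<HHB", tup, 18)

    # Same layout as what heap_insert() registers for a simple record.
    header = struct.pack("<HHB", t_infomask2, t_infomask, t_hoff)
    return header + tup[SIZE_OF_HEAP_TUPLE_HEADER:]
```

The returned bytes can then go through the exact same decoding path as a record that didn’t need a full-page image.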

We now have the information about the physical file location and the record itself that we will need to transmit to another program to decode it. The best way to do that is to simply save the record as-is in a binary file, and use the file name to transmit the metadata. I chose the following pattern to name the produced files:

LSN.TABLESPACE_OID.DATABASE_OID.RELFILENODE.FORKNAME

It will be trivial for the consumer to parse it and extract the required metadata. One thing to note is that I don’t put the rmid or the opcode here as I’m only emitting the one kind of record I’m interested in and discarding everything else. If that’s not your case you should definitely remember to add those to the filename pattern.
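
As an illustration, a consumer could parse such a name with something like the following Python sketch. The exact text format used for the LSN isn’t described here, so I simply treat it as an opaque string that contains no dot, which is an assumption on my side, as are the names used in the code.

```python
from typing import NamedTuple

VALID_FORKS = ("main", "fsm", "vm", "init")


class SavedRecordName(NamedTuple):
    lsn: str              # kept as-is, its exact format is not assumed
    tablespace_oid: int
    database_oid: int
    relfilenode: int
    fork: str


def parse_saved_record_name(filename):
    """Parse LSN.TABLESPACE_OID.DATABASE_OID.RELFILENODE.FORKNAME."""
    parts = filename.split(".")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name: {filename!r}")
    lsn, spc, db, rel, fork = parts
    if fork not in VALID_FORKS:
        raise ValueError(f"unknown fork name: {fork!r}")
    # int() raises ValueError by itself if an oid is not numeric.
    return SavedRecordName(lsn, int(spc), int(db), int(rel), fork)
```

If you add the rmid or the opcode to the pattern, it’s only a matter of adding fields to the tuple and adjusting the expected number of parts.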

Since this requires a bit of code to implement, I won’t detail it here but you can find the full result in the patch for pg_waldump that I’m attaching to this article, which implements this as a new --save-records option.

To conclude, let me also remind you that a compiled version of pg_waldump will only work for a single major postgres version. In my case, I had to work with postgres 11, so you can find the patch for this version here, but if needed I also rebased it against the current commit on the master branch, which can be found here.

What’s next?

This is the end of this first article. We saw some details on the postgres WAL infrastructure, with a full example for the case of a plain INSERT on a heap table. We also learned where to look to find where other WAL records are generated and to see more details about the implementation.

We also checked how pg_waldump is working and how to adapt it for our need, with a provided complete patch for both postgres 11 and the current dev version (postgres 17). Again, I’d like to remind you that all this work is only at a proof-of-concept stage, it’s definitely not polished and I’m sure that there are many problems that would need to be fixed. One obvious example of such a problem is that we’re saving all the INSERTs we find in the logs but we don’t check if the transaction they’re in eventually committed. It would be possible to fix that but it would require extra code, so as is it’s up to the users to double check that as needed. Overall it was enough to recover the needed data so I didn’t pursue any more work on it.

In the next article we will see some usage of this new --save-records option, and also how to read those records and decode them to generate plain INSERT queries. Stay tuned!

Extracting SQL from WAL? (part 1) was originally published by Julien Rouhaud at rjuju's home on December 06, 2023.

Queryid reporting in plpgsql_check

2020-11-17T02:42:33+00:00

plpgsql_check version 1.14.0 was just released and brings some improvements for performance diagnostics.

Thanks a lot to Pavel Stěhule for the awesome plpgsql_check extension and the help for implementing the queryid reporting in v1.14!

plpgsql_check: static code analysis and more

PostgreSQL supports procedural code for many languages, the most popular one probably being plpgsql.

Even if that language allows you to write raw SQL statements, any function written in that language is still a black box as far as PostgreSQL is concerned, which means that PostgreSQL won’t perform a lot of checks to verify code quality, typos or any other problem related to code development. That’s where the plpgsql_check extension comes into play.

If you write any plpgsql code, this extension will be your best friend, as it brings so many cool features. The major feature is static code analysis, which can detect many bugs, security / SQL injection issues and even possible performance issues by detecting implicit casts that could prevent PostgreSQL from using indexes, and much more.

It also brings a simple yet very useful code profiler.

How to track down performance issue in plpgsql code?

As I mentioned above, plpgsql code is a black box as far as PostgreSQL is concerned. The direct consequence is that the performance diagnostic possibilities are quite limited.

Using core PostgreSQL, the only option is using pg_stat_user_functions (which requires track_functions to be set to pl or all). It’ll show the number of times each function has been called, and how long the execution took including and excluding nested functions. Unfortunately, this view can only help you track down which function is slow, but not why, as you don’t get any per-instruction metric.

You can somewhat work around that limitation using the contrib extension pg_stat_statements. This extension is one of the most popular ones as far as performance diagnostics are concerned, and gives you a lot of data on query performance (including planning counters and WAL counters since PostgreSQL 13).

The only problem is that it can be quite tricky to match pg_stat_statements entries with your plpgsql code, as there’s no way to directly identify which queries are run inside your plpgsql code.

plpgsql_check code profiler

Another alternative is to use a plpgsql code profiler. There are multiple extensions that bring this feature, and I personally chose plpgsql_check, as it perfectly suited my need: simple to set up and use, all the performance information I needed and the possibility to use it either on a per-connection basis or globally when configuring the extension in shared_preload_libraries. Thanks to this profiler, you can finally get performance metrics at the statement level inside plpgsql code:

total execution time, that is the cumulated execution time for all the statements in the source code line
average execution time, that is the total execution time divided by the number of statements in the source code line
maximum execution time, per statement
number of rows processed, per statement

With this information, it becomes quite easy to track down the slow part of your functions. Here’s a simplistic example:

=# SELECT lineno, cmds_on_row, total_time, avg_time, max_time, source
  FROM plpgsql_profiler_function_tb('pltest()');
 lineno | cmds_on_row | total_time | avg_time |     max_time     |                        source
--------+-------------+------------+----------+------------------+-------------------------------------------------------
      1 |      <NULL> |     <NULL> |   <NULL> | <NULL>           |
      2 |      <NULL> |     <NULL> |   <NULL> | <NULL>           | DECLARE
      3 |      <NULL> |     <NULL> |   <NULL> | <NULL>           |     num bigint;
      4 |      <NULL> |     <NULL> |   <NULL> | <NULL>           |     _tbl text = 'pg_class';
      5 |           1 |      0.085 |    0.085 | {0.085}          | BEGIN
      6 |           1 |      0.504 |    0.504 | {0.504}          |     drop table if exists meh;
      7 |           1 |       0.81 |     0.81 | {0.81}           |     CREATE TABLE meh(id integer);
      8 |           1 |      0.362 |    0.362 | {0.362}          |     EXECUTE 'SELECT COUNT(*) FROM ' || _tbl INTO num;
      9 |           2 |    1000.84 |   500.42 | {0.349,1000.491} |     delete from meh; PERFORM pg_sleep(1);
     10 |           1 |          0 |        0 | {0}              |     RETURN num;
     11 |      <NULL> |     <NULL> |   <NULL> | <NULL>           | END;
(11 rows)

In this example, we can see immediately that the slowdown comes from source code line n°9, which has a total execution time of 1s. Using the max_time field, we see that it’s because of the 2nd statement. As we also have the source code available in the view, we can immediately see the problematic query, which here is a simple call to pg_sleep(1).
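
If the relationship between those columns isn’t obvious, here is a tiny Python sketch that recomputes the counters of line 9 from its two per-statement timings, following the definitions given above. The function name and the rounding to 3 decimals are mine, the profiler itself may round differently.

```python
def summarize_line(statement_times_ms):
    """Recompute the per-line counters from the per-statement timings."""
    total = sum(statement_times_ms)
    count = len(statement_times_ms)
    # Index (0-based) of the statement with the highest timing on the line.
    slowest = max(range(count), key=lambda i: statement_times_ms[i])
    return {
        "cmds_on_row": count,
        "total_time": round(total, 3),
        "avg_time": round(total / count, 3),
        "slowest_statement": slowest + 1,   # reported 1-based
    }


# Timings taken from the max_time column of line 9 in the example above.
line_9 = summarize_line([0.349, 1000.491])
```

Here line_9 contains a total_time of 1000.84 and an avg_time of 500.42, like in the view, and points at the 2nd statement as the slow one.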

So far so good. But with a less naive example the cause of slow execution might be less obvious, and it could be handy to rely on all the available extensions to get more information: pg_stat_statements for general counters, pg_stat_kcache for CPU and disk usage counters, pg_wait_sampling for wait events and so on.

But how to match the plpgsql statement with entries in those extensions?

Exposing queryid in plpgsql_check profiler

Indeed, those extensions identify queries using a query identifier, computed by pg_stat_statements. You could try to manually find the related entry using the query text stored by pg_stat_statements, but it may not always be possible. What if the query is dynamic SQL or using unqualified names?

The solution here is quite simple: since the plpgsql_check profiler already shows per-statement information, also report the statement’s underlying queryid.

This is now available with version 1.14.0. Using the previous naive example, here’s what we now see:

=# SELECT lineno, max_time, queryids, source
  FROM plpgsql_profiler_function_tb('pltest()');
 lineno |     max_time     |                 queryids                  |                        source
--------+------------------+-------------------------------------------+-------------------------------------------------------
      1 | <NULL>           | <NULL>                                    |
      2 | <NULL>           | <NULL>                                    | DECLARE
      3 | <NULL>           | <NULL>                                    |     num bigint;
      4 | <NULL>           | <NULL>                                    |     _tbl text = 'pg_class';
      5 | {0.085}          | <NULL>                                    | BEGIN
      6 | {0.504}          | {NULL}                                    |     drop table if exists meh;
      7 | {0.81}           | {NULL}                                    |     CREATE TABLE meh(id integer);
      8 | {0.362}          | {-7484655548452190292}                    |     EXECUTE 'SELECT COUNT(*) FROM ' || _tbl INTO num;
      9 | {0.349,1000.491} | {8162364748417812595,6729783856403017864} |     delete from meh; PERFORM pg_sleep(1);
     10 | {0}              | <NULL>                                    |     RETURN num;
     11 | <NULL>           | <NULL>                                    | END;
(11 rows)

You’re now only a JOIN away from matching your plpgsql profile data with the data from your favorite extensions!

Limitations

There are unfortunately some limitations.

Due to pg_stat_statements implementation, queryid for DDL queries is not exposed outside the extension, so plpgsql_check can’t retrieve it.

When using dynamic SQL, there might be many queries involved:

the query text itself will be generated using SQL statement(s)
the parameters, if any, will also be resolved running SQL statement(s)
if the query text depends on some parameters, you can end up with multiple different top level queries

plpgsql_check will only report the top level query identifier, and if multiple different queries are generated only the query identifier of the first one will be reported.

Even with those limitations I still hope that this new feature will be helpful.

What’s next?

Due to current plpgsql implementation, when a dynamic SQL statement is executed the query identifier is not visible outside plpgsql itself. It means that retrieving the query identifier in that case is a bit costly, as plpgsql_check has to do some of the work that plpgsql is doing:

generate the final query string
parse the query string
call the parse analysis step (this is where the query identifier is generated)

Of course the query itself won’t be executed or even planned, but those extra steps might add non negligible overhead, especially when the dynamic SQL is executing very short OLTP-style queries.

So plpgsql should be modified to be able to report the query identifier of all statements, whether static or dynamic, so external modules can access the information easily and without any additional overhead. Ideally, this could also be available in plpgsql code using a GET [ CURRENT ] DIAGNOSTICS command, so users can also use it as they need.

Queryid reporting in plpgsql_check was originally published by Julien Rouhaud at rjuju's home on November 17, 2020.

New in pg13: WAL monitoring

2020-04-07T15:46:15+00:00

Write-Ahead Logs are a critical part of PostgreSQL, as they ensure data durability. While there are multiple configuration parameters, there was no easy way to monitor WAL activity, or what is generating it.

New infrastructure to track WAL activity

commit df3b181499b40523bd6244a4e5eb554acb9020ce
Author: Amit Kapila <[email protected]>
Date:   Sat Apr 4 10:02:08 2020 +0530

    Add infrastructure to track WAL usage.

    This allows gathering the WAL generation statistics for each statement
    execution.  The three statistics that we collect are the number of WAL
    records, the number of full page writes and the amount of WAL bytes
    generated.

    This helps the users who have write-intensive workload to see the impact
    of I/O due to WAL.  This further enables us to see approximately what
    percentage of overall WAL is due to full page writes.

    In the future, we can extend this functionality to allow us to compute the
    the exact amount of WAL data due to full page writes.

    This patch in itself is just an infrastructure to compute WAL usage data.
    The upcoming patches will expose this data via explain, auto_explain,
    pg_stat_statements and verbose (auto)vacuum output.

    Author: Kirill Bychik, Julien Rouhaud
    Reviewed-by: Dilip Kumar, Fujii Masao and Amit Kapila
    Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com

With this new infrastructure, each backend will track various information about WAL generation: the number of WAL records, the size of WAL generated and the number of full page images generated. It also makes sure that parallel queries, both DML and utility statements (for now only CREATE INDEX and VACUUM) are correctly handled.

Per-query WAL activity with pg_stat_statements

commit 6b466bf5f2bea0c89fab54eef696bcfc7ecdafd7
Author: Amit Kapila <[email protected]>
Date:   Sun Apr 5 07:34:04 2020 +0530

    Allow pg_stat_statements to track WAL usage statistics.

    This commit adds three new columns in pg_stat_statements output to
    display WAL usage statistics added by commit df3b181499.

    This commit doesn't bump the version of pg_stat_statements as the
    same is done for this release in commit 17e0328224.

    Author: Kirill Bychik and Julien Rouhaud
    Reviewed-by: Julien Rouhaud, Fujii Masao, Dilip Kumar and Amit Kapila
    Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com

This basically exposes the mentioned new information about WAL activity in pg_stat_statements, so per (user, database, normalized query). Here is an example:

=# CREATE TABLE t1 (id integer);
CREATE TABLE

=# INSERT INTO t1 SELECT 1;
INSERT 0 1

=# UPDATE t1 SET id = 2 WHERE id = 1;
UPDATE 1

=# CHECKPOINT;
CHECKPOINT

=# DELETE FROM t1 WHERE id = 2;
DELETE 1
=# SELECT query, wal_records, wal_bytes, wal_num_fpw
   FROM pg_stat_statements
   WHERE query LIKE 'UPDATE%' OR query LIKE 'DELETE%';
                query                | wal_records | wal_bytes | wal_num_fpw
-------------------------------------+-------------+-----------+-------------
 DELETE FROM t1 WHERE id = $1        |           1 |       155 |           1
 UPDATE t1 SET id = $1 WHERE id = $2 |           1 |        69 |           0
(2 rows)

I simply inserted a row, updated it and deleted it. Now, looking specifically at the UPDATE and the DELETE, the numbers can be surprising.

When inserting a row, we indeed expect a single WAL record and some WAL bytes for the new row, with some overhead due to internal implementation.

Now, if you’re familiar with PostgreSQL MVCC implementation, you should know that doing a DELETE should only write a transaction id in the xmax field (this documentation page is a good introduction on that subject). So why is writing a 4B field (the size of the recorded xmax field), even with some overhead, generating more than twice the amount of WAL that was required to update a full row? That’s because the DELETE caused a full page write. This is a side effect of performing a CHECKPOINT before the DELETE. To guarantee data consistency (and if the full_page_writes parameter isn’t deactivated), any block modified for the first time after a CHECKPOINT completion will be fully logged, rather than logging only the delta.

You’ll also note that the full page didn’t generate 8kB of data as you could expect. This isn’t because of wal_compression, as I didn’t activate it, but because the page is almost empty. Indeed, as an optimization, any “hole” in a page, as long as it’s a standard page, can be safely skipped in the WAL. If you’re curious, this is done in the XLogRecordAssemble() function. Here’s the relevant extract:

static XLogRecData *
XLogRecordAssemble(RmgrId rmid, uint8 info,
				   XLogRecPtr RedoRecPtr, bool doPageWrites,
				   XLogRecPtr *fpw_lsn, int *num_fpw)
{
[...]
		/*
		 * If needs_backup is true or WAL checking is enabled for current
		 * resource manager, log a full-page write for the current block.
		 */
		include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;

		if (include_image)
		{
			Page		page = regbuf->page;
			uint16		compressed_len = 0;

			/*
			 * The page needs to be backed up, so calculate its hole length
			 * and offset.
			 */
			if (regbuf->flags & REGBUF_STANDARD)
			{
				/* Assume we can omit data between pd_lower and pd_upper */
				uint16		lower = ((PageHeader) page)->pd_lower;
				uint16		upper = ((PageHeader) page)->pd_upper;

				if (lower >= SizeOfPageHeaderData &&
					upper > lower &&
					upper <= BLCKSZ)
				{
					bimg.hole_offset = lower;
					cbimg.hole_length = upper - lower;
				}
				else
				{
					/* No "hole" to remove */
					bimg.hole_offset = 0;
					cbimg.hole_length = 0;
				}
			}
            [...]
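
To get a feeling for how much this saves, the same test can be transcribed in a few lines of Python. This is only a sketch of the hole computation shown above, assuming the default 8kB block size and a 24 bytes page header; the function names are mine and it ignores compression and all the record overhead.

```python
BLCKSZ = 8192                   # default block size
SIZE_OF_PAGE_HEADER_DATA = 24   # offsetof(PageHeaderData, pd_linp)


def page_hole(pd_lower, pd_upper):
    """Return (hole_offset, hole_length) using the same checks as the C extract."""
    if (pd_lower >= SIZE_OF_PAGE_HEADER_DATA and
            pd_upper > pd_lower and
            pd_upper <= BLCKSZ):
        return pd_lower, pd_upper - pd_lower
    # No "hole" to remove
    return 0, 0


def image_bytes_logged(pd_lower, pd_upper):
    """Number of page bytes kept in the full-page image once the hole is skipped."""
    _, hole_length = page_hole(pd_lower, pd_upper)
    return BLCKSZ - hole_length
```

For instance, with a page holding only 2 line pointers and 2 small tuples (pd_lower = 32 and pd_upper = 8128), image_bytes_logged(32, 8128) is 96: less than a hundred bytes of page content are kept instead of the full 8192.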

WAL activity in EXPLAIN (and auto_explain)

A new WAL option is available in the EXPLAIN command, and similarly an auto_explain.log_wal option for auto_explain, to display the same counters. In TEXT mode, only the non-zero counters are shown, similarly to other counters. For instance:

=# EXPLAIN (ANALYZE, WAL, COSTS OFF) UPDATE t1 SET id = 1 WHERE id = 1;
                           QUERY PLAN
----------------------------------------------------------------
 Update on t1 (actual time=0.181..0.181 rows=0 loops=1)
   WAL:  records=1  bytes=68
   ->  Seq Scan on t1 (actual time=0.074..0.080 rows=1 loops=1)
         Filter: (id = 1)
 Planning Time: 0.274 ms
 Execution Time: 0.381 ms
(6 rows)

WAL activity in autovacuum logs

And finally, if an autovacuum is logging its activity (when reaching the log_autovacuum_min_duration threshold), the same information will be logged. For instance, after inserting 100k records in the same table, deleting half of them and running a CHECKPOINT, here’s the output I get:

LOG:  automatic vacuum of table "rjuju.public.t1": index scans: 0
	pages: 0 removed, 443 remain, 0 skipped due to pins, 0 skipped frozen
	tuples: 50000 removed, 50001 remain, 0 are dead but not yet removable, oldest xmin: 496
	buffer usage: 912 hits, 3 misses, 448 dirtied
	avg read rate: 0.084 MB/s, avg write rate: 12.485 MB/s
	system usage: CPU: user: 0.17 s, system: 0.00 s, elapsed: 0.28 s
	WAL usage: 1330 records, 445 full page writes, 2197104 bytes

This new log output is in my opinion very important, especially when it comes to anti-wraparound / FREEZE vacuum. Indeed, by nature an anti-wraparound VACUUM is more likely to touch blocks that weren’t modified for a long time as it’s targeting tuples that have been visible for more than 200M transactions (by default). Even though it’s only setting a flag bit to mark the tuple as frozen, if that block wasn’t modified since the last CHECKPOINT, this bit will be amplified to a full page image which is way more data.

With this new feature, it’s now possible to really monitor the WAL generation, which will help to better tune your instances!

New in pg13: WAL monitoring was originally published by Julien Rouhaud at rjuju's home on April 07, 2020.

New in pg13: Monitoring the query planner

2020-04-04T12:06:15+00:00

Depending on your workload, the planning time can represent a significant part of the overall query processing time. This is especially important in OLTP workloads, but OLAP queries with numerous tables being joined and an aggressive configuration on the JOIN order search can also lead to high planning time.

Planning counters in pg_stat_statements

Previously, pg_stat_statements was only keeping track of the execution part of a query processing: the number of executions, the cumulated time, but also the minimum, maximum, mean and standard deviation. With PostgreSQL 13, you’ll also have those metrics for the planning part!

commit 17e03282241c6ac58a714eb0c3b6a8018cf6167a
Author: Fujii Masao <[email protected]>
Date:   Thu Apr 2 11:20:19 2020 +0900

    Allow pg_stat_statements to track planning statistics.

    This commit makes pg_stat_statements support new GUC
    pg_stat_statements.track_planning. If this option is enabled,
    pg_stat_statements tracks the planning statistics of the statements,
    e.g., the number of times the statement was planned, the total time
    spent planning the statement, etc. This feature is useful to check
    the statements that it takes a long time to plan. Previously since
    pg_stat_statements tracked only the execution statistics, we could
    not use that for the purpose.

    The planning and execution statistics are stored at the end of
    each phase separately. So there are not always one-to-one relationship
    between them. For example, if the statement is successfully planned
    but fails in the execution phase, only its planning statistics are stored.
    This may cause the users to be able to see different pg_stat_statements
    results from the previous version. To avoid this,
    pg_stat_statements.track_planning needs to be disabled.

    This commit bumps the version of pg_stat_statements to 1.8
    since it changes the definition of pg_stat_statements function.

    Author: Julien Rouhaud, Pascal Legrand, Thomas Munro, Fujii Masao
    Reviewed-by: Sergei Kornilov, Tomas Vondra, Yoshikazu Imai, Haribabu Kommi, Tom Lane
    Discussion: https://postgr.es/m/CAHGQGwFx_=DO-Gu-MfPW3VQ4qC7TfVdH2zHmvZfrGv6fQ3D-Tw@mail.gmail.com
    Discussion: https://postgr.es/m/CAEepm=0e59Y_6Q_YXYCTHZkqOc6H2pJ54C_Xe=VFu50Aqqp_sA@mail.gmail.com
    Discussion: https://postgr.es/m/DB6PR0301MB21352F6210E3B11934B0DCC790B00@DB6PR0301MB2135.eurprd03.prod.outlook.com

Keep in mind that even a simple query can have a surprisingly high planning time. One of the frequent causes was the get_actual_variable_range() function, which is called when the planner wants to know what the minimum and maximum values of a specific field are. This function detects if a suitable index exists, and if there’s one it gets the wanted values. However, when there were a lot of uncommitted values at the end of the index range, it could take a significant amount of time to get a visible value. While this problem has been fixed long ago (see this commit and this other commit for more details), there are still some cases where the planning time is higher than what you’d expect, so having an easy way to monitor the planning metrics is worthwhile.

This feature can also be interesting to know how much you’re using the generic plan feature for instance, and how much of a difference it makes.

Let’s see a simple example, to see the effect of generic plans with prepared statements:

=# PREPARE s1 AS SELECT count(*) FROM pg_class;
PREPARE
=# EXECUTE s1;
 count
-------
   387
(1 row)

[... 5 more times ...]

=# SELECT query, plans, total_plan_time, total_plan_time / plans AS avg_plan,
   calls, total_exec_time, total_exec_time / calls AS avg_exec
   FROM pg_stat_statements
   WHERE query ILIKE '%SELECT count(*) FROM pg_class%';
-[ RECORD 1 ]---+--------------------------------------------
query           | PREPARE s1 AS SELECT count(*) FROM pg_class
plans           | 1
total_plan_time | 2.119496
avg_plan        | 2.119496
calls           | 6
total_exec_time | 3.4918280000000004
avg_exec        | 0.5819713333333334

While the query was executed 6 times, it was actually planned only once (since there’s no parameter, a generic plan is always used). While the execution time is on average slightly more than half a millisecond, a single planning was almost 4 times more expensive. By saving 5 plannings, postgres saved roughly 10ms.
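
For the record, here’s the arithmetic behind that estimate as a small Python sketch. It simply assumes that every skipped planning would have cost the same as the one that was recorded, which is a simplification, and the function name is mine.

```python
def planning_savings(plans, total_plan_time_ms, calls, total_exec_time_ms):
    """Estimate the planning time saved by reusing a plan, in milliseconds."""
    avg_plan = total_plan_time_ms / plans
    avg_exec = total_exec_time_ms / calls
    skipped_plans = calls - plans
    return {
        "skipped_plans": skipped_plans,
        "plan_vs_exec_ratio": avg_plan / avg_exec,
        # Assumes each skipped planning would have cost the average planning time.
        "saved_ms": avg_plan * skipped_plans,
    }


# Values taken from the pg_stat_statements output above.
result = planning_savings(plans=1, total_plan_time=None or 2.119496, calls=6,
                          total_exec_time_ms=3.4918280000000004) if False else \
    planning_savings(1, 2.119496, 6, 3.4918280000000004)
```

With those numbers, 5 plannings were skipped, a planning costs about 3.6 times an execution, and the saved time is about 10.6ms.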

Planning buffers in EXPLAIN

commit ce77abe63cfc85fb0bc236deb2cc34ae35cb5324
Author: Fujii Masao <[email protected]>
Date:   Sat Apr 4 03:13:17 2020 +0900

    Include information on buffer usage during planning phase, in EXPLAIN output, take two.

    When BUFFERS option is enabled, EXPLAIN command includes the information
    on buffer usage during each plan node, in its output. In addition to that,
    this commit makes EXPLAIN command include also the information on
    buffer usage during planning phase, in its output. This feature makes it
    easier to discern the cases where lots of buffer access happen during
    planning.

    This commit revives the original commit ed7a509571 that was reverted by
    commit 19db23bcbd. The original commit had to be reverted because
    it caused the regression test failure on the buildfarm members prion and
    dory. But since commit c0885c4c30 got rid of the caues of the test failure,
    the original commit can be safely introduced again.

    Author: Julien Rouhaud, slightly revised by Fujii Masao
    Reviewed-by: Justin Pryzby
    Discussion: https://postgr.es/m/[email protected]

Following the same idea, EXPLAIN will now also display the buffer usage of the planning phase if the BUFFERS option is used. If you try that on a fresh connection, before any catalog cache is populated, you could be surprised by how many buffers are accessed even for a simple query:

=# EXPLAIN (BUFFERS, ANALYZE, COSTS OFF) SELECT * FROM pg_class;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on pg_class (actual time=0.028..0.410 rows=388 loops=1)
   Buffers: shared hit=13
 Planning Time: 5.157 ms
   Buffers: shared hit=118
 Execution Time: 1.257 ms
(5 rows)

=# EXPLAIN (BUFFERS, ANALYZE, COSTS OFF) SELECT * FROM pg_class;
                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on pg_class (actual time=0.035..0.413 rows=388 loops=1)
   Buffers: shared hit=13
 Planning Time: 0.393 ms
 Execution Time: 0.670 ms

We can see here that populating the cache (relation, columns, datatypes…) accesses 118 blocks, and that’s probably a significant part of the almost 5 extra ms of planning time we saw in the first EXPLAIN output.

New in pg13: Monitoring the query planner was originally published by Julien Rouhaud at rjuju's home on April 04, 2020.

New in pg13: leader_pid column in pg_stat_activity

2020-03-08T05:33:26+00:00

New leader_pid column in the pg_stat_activity view

Surprisingly, since parallel query was introduced in PostgreSQL 9.6, it was impossible to know which backend a parallel worker was related to. So, as Guillaume pointed out, it is quite difficult to build simple tools that can sample the wait events related to all the processes involved in a query. A simple solution to that problem is to export the lock group leader information available in the backend at the SQL level:

commit b025f32e0b5d7668daec9bfa957edf3599f4baa8
Author: Michael Paquier <[email protected]>
Date:   Thu Feb 6 09:18:06 2020 +0900

Add leader_pid to pg_stat_activity

This new field tracks the PID of the group leader used with parallel
query.  For parallel workers and the leader, the value is set to the
PID of the group leader.  So, for the group leader, the value is the
same as its own PID.  Note that this reflects what PGPROC stores in
shared memory, so as leader_pid is NULL if a backend has never been
involved in parallel query.  If the backend is using parallel query or
has used it at least once, the value is set until the backend exits.

Author: Julien Rouhaud
Reviewed-by: Sergei Kornilov, Guillaume Lelarge, Michael Paquier, Tomas
Vondra
Discussion: https://postgr.es/m/CAOBaU_Yy5bt0vTPZ2_LUM6cUcGeqmYNoJ8-Rgto+c2+w3defYA@mail.gmail.com

With this change, it is now very easy to find all the processes involved in a parallel query. For instance:

=# SELECT query, leader_pid,
  array_agg(pid) filter(WHERE leader_pid != pid) AS members
FROM pg_stat_activity
WHERE leader_pid IS NOT NULL
GROUP BY query, leader_pid;
       query       | leader_pid |    members
-------------------+------------+---------------
 select * from t1; |      31630 | {32269,32268}
(1 row)

Be careful though: as mentioned in the commit message, if the leader_pid column has the same value as the pid column, it doesn’t necessarily mean that the backend is currently running a parallel query, as once the field is set it is never reset. Also, to avoid any overhead, no additional lock is held while displaying this data. It means that each row is processed independently. So, while quite unlikely, you can get inconsistent data in some circumstances, such as a parallel worker pointing to a pid that has already disconnected.

New in pg13: leader_pid column in pg_stat_activity was originally published by Julien Rouhaud at rjuju's home on March 08, 2020.

Planner selectivity estimation error statistics with pg_qualstats 2

2020-02-28T12:37:04+00:00

Selectivity estimation errors are one of the main causes of bad query plans. It’s quite straightforward to compute those estimation errors using EXPLAIN (ANALYZE), either manually or with the help of explain.depesz.com (or other similar tools), but until now there was no tool available to get this information automatically and globally. Version 2 of pg_qualstats fixes that, thanks a lot to Oleg Bartunov for the original idea!

Note: If you don’t know the pg_qualstats extension, you may want to see my last article about it.

The problem

There can be many causes for that issue: outdated statistics, complex predicates, non-uniform data… But whatever the reason is, if the optimizer doesn’t have an accurate idea of how much data each predicate will filter, the result is the same: a bad query plan, which can lead to longer query execution.

To illustrate the problem, I’ll use a simple test case, deliberately built to fool the optimizer.

rjuju=# CREATE TABLE pgqs AS
             SELECT  i%2 val1 , (i+1)%2 val2
             FROM generate_series(1, 50000) i;
SELECT 50000

rjuju=# VACUUM ANALYZE pgqs;
VACUUM

rjuju=# EXPLAIN (ANALYZE) SELECT * FROM pgqs WHERE val1 = 1 AND val2 = 1;
                             QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on pgqs  ([...] rows=12500 width=8) ([...] rows=0 loops=1)
   Filter: ((val1 = 1) AND (val2 = 1))
   Rows Removed by Filter: 50000
 Planning Time: 0.553 ms
 Execution Time: 38.062 ms
(5 rows)

Here postgres thinks that the query will emit 12500 tuples, while in reality none will be emitted. If you’re wondering how postgres came up with that number, the explanation is simple. When multiple independent clauses (overlapping range predicates can be merged) are AND-ed and no extended statistics are available (see below for more about it), postgres will simply multiply the selectivity of each clause. This is done in clauselist_selectivity_simple, in src/backend/optimizer/path/clausesel.c:

Selectivity
clauselist_selectivity_simple(PlannerInfo *root,
                List *clauses,
                int varRelid,
                JoinType jointype,
                SpecialJoinInfo *sjinfo,
                Bitmapset *estimatedclauses)
{
  Selectivity s1 = 1.0;
  [...]
  /*
   * Anything that doesn't look like a potential rangequery clause gets
   * multiplied into s1 and forgotten. Anything that does gets inserted into
   * an rqlist entry.
   */
  listidx = -1;
  foreach(l, clauses)
  {
    [...]
    /* Always compute the selectivity using clause_selectivity */
    s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo);
    [...]
        /*
         * If it's not a "<"/"<="/">"/">=" operator, just merge the
         * selectivity in generically.  But if it's the right oprrest,
         * add the clause to rqlist for later processing.
         */
        switch (get_oprrest(expr->opno))
        {
          [...]
          default:
            /* Just merge the selectivity in generically */
            s1 = s1 * s2;
            break;
          [...]

In this case, each predicate will independently filter approximately 50% of the table, as we can see in the pg_stats view:

rjuju=# SELECT tablename, attname, most_common_vals, most_common_freqs
        FROM pg_stats WHERE tablename = 'pgqs';
 tablename | attname | most_common_vals |    most_common_freqs
-----------+---------+------------------+-------------------------
 pgqs      | val1    | {0,1}            | {0.50116664,0.49883333}
 pgqs      | val2    | {1,0}            | {0.50116664,0.49883333}
(2 rows)
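As a quick check of the arithmetic, here is a small Python sketch (not planner code) that multiplies the two frequencies reported by pg_stats for the val1 = 1 AND val2 = 1 query shown earlier. The variable names are only illustrative; the figures are copied from the outputs above.

```python
# Figures copied from the pg_stats and EXPLAIN outputs above.
table_rows = 50000
freq_val1_eq_1 = 0.49883333  # most_common_freqs entry for val1 = 1
freq_val2_eq_1 = 0.50116664  # most_common_freqs entry for val2 = 1

# Without extended statistics the clauses are treated as independent,
# so their selectivities are simply multiplied together.
combined_selectivity = freq_val1_eq_1 * freq_val2_eq_1
estimated_rows = round(table_rows * combined_selectivity)

print(f"combined selectivity: {combined_selectivity:.4f}")
print(f"estimated rows: {estimated_rows}")
```

The rounded result matches the rows=12500 estimate displayed in the plan.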

So when using both clauses, the estimate is 25% of the table, since postgres doesn’t know by default that both values are mutually exclusive. Continuing with this artificial test case, let’s see what happens if we add a join on top of it, for instance joining the table to itself on the val1 column only. For clarity, I’ll use t1 for the table on which I’m applying the mutually exclusive predicates, and t2 for the joined table:

rjuju=# EXPLAIN ANALYZE SELECT *
        FROM pgqs t1
        JOIN pgqs t2 ON t1.val1 = t2.val1
        WHERE t1.val1 = 0 AND t1.val2 = 0;
                                     QUERY PLAN
-----------------------------------------------------------------------------------
 Nested Loop  ([...] rows=313475000 width=16) ([...] rows=0 loops=1)
   ->  Seq Scan on pgqs t2  ([...] rows=25078 width=8) ([...] rows=25000 loops=1)
         Filter: (val1 = 0)
         Rows Removed by Filter: 25000
   ->  Materialize  ([...] rows=12500 width=8) ([...] rows=0 loops=25000)
         ->  Seq Scan on pgqs t1  ([...] rows=12500 width=8) ([...] rows=0 loops=1)
               Filter: ((val1 = 0) AND (val2 = 0))
               Rows Removed by Filter: 50000
 Planning Time: 0.943 ms
 Execution Time: 86.757 ms
(14 rows)

Postgres thinks that this join will emit 313 million rows, while obviously no rows will be emitted. This is a good example of how bad assumptions can lead to an inefficient plan.
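In this particular plan, the 313 million figure is simply the product of the two input estimates. The Python lines below only check that observation against the numbers displayed above; they are not a description of how the planner estimates joins in general.

```python
# Row estimates copied from the plan above.
outer_estimate = 25078  # Seq Scan on pgqs t2, Filter: (val1 = 0)
inner_estimate = 12500  # Seq Scan on pgqs t1, Filter: (val1 = 0) AND (val2 = 0)

# Actual row counts reported by EXPLAIN ANALYZE for the same nodes.
outer_actual = 25000
inner_actual = 0

estimated_join_rows = outer_estimate * inner_estimate
actual_join_rows = outer_actual * inner_actual

print(f"estimated join rows: {estimated_join_rows}")
print(f"actual join rows: {actual_join_rows}")
```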

Here Postgres can deduce that the val1 = 0 predicate can also be applied to t2. So how to join two relations, one that should emit 25000 tuples and the other that should emit 12500 tuples, with no index available? A nested loop is not a bad choice, as both relations aren’t really big. As no index is available, postgres also chooses to materialize the inner relation, meaning storing it in memory, to make it more efficient. As it tries to limit memory consumption as much as possible, the smallest relation is materialized, and that’s the mistake here.

Indeed, postgres will read the whole table twice: once to get all the rows matching the val1 = 0 predicate for the outer relation, and once to find all the rows to be materialized. If the opposite had been done, as would probably have been the case with more realistic estimates, the table would have been read only once.

In this case, as the dataset is small and quite artificial, a better plan wouldn’t drastically change the execution time. But keep in mind that in real production environments, it could mean choosing a nested loop assuming that there will be only a couple of rows to loop on while in reality the backend will spend minutes or even hours looping over millions of rows, when another plan would have been orders of magnitude quicker.

Detecting the problem

pg_qualstats 2 now computes the selectivity estimation error, both as a ratio and as a raw number, and keeps track of the minimum, maximum and mean values for each predicate, along with the standard deviation. It is now quite simple to detect problematic quals!

After executing the last query, here’s what the pg_qualstats view will return:

rjuju=# SELECT relname, attname, opno::regoper, qualid, qualnodeid,
    mean_err_estimate_ratio mean_ratio, mean_err_estimate_num mean_num, constvalue
    FROM pg_qualstats pgqs
    JOIN pg_class c ON pgqs.lrelid = c.oid
    JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = pgqs.lattnum;
 relname | attname | opno |   qualid   | qualnodeid | mean_ratio | mean_num | constvalue
---------+---------+------+------------+------------+------------+----------+------------
 pgqs    | val1    | =    |     <NULL> | 3161070364 | 1.00393542 |       98 | 0::integer
 pgqs    | val1    | =    | 3864967567 | 3161070364 |      12500 |    12500 | 0::integer
 pgqs    | val2    | =    | 3864967567 | 3065200358 |      12500 |    12500 | 0::integer
(3 rows)

NOTE: qualid is an identifier shared by quals that are AND-ed together, NULL otherwise, and qualnodeid is a per-qual identifier.

We see here that when used alone, the qual pgqs.val1 = ? doesn’t show any selectivity estimation problem, as the ratio (mean_ratio) is very close to 1 and the raw number (mean_num) is quite low. On the other hand, when combined with AND pgqs.val2 = ?, pg_qualstats reports a significant estimation error. That’s a very strong sign that those columns are functionally dependent.

If for example a qual alone shows issues, it could be a sign of outdated statistics, or that the sample size isn’t big enough.
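To make the two metrics more concrete, here is a small Python sketch of how a ratio and a raw number could be derived from one execution of a qual. The formulas are an assumption chosen to be consistent with the output above (in particular the clamping of row counts to at least 1), not something taken from the pg_qualstats source, and estimate_error is a hypothetical helper name.

```python
def estimate_error(estimated_rows, actual_rows):
    """Return (ratio, num) for a single execution of a qual.

    Assumed definitions (not taken from the pg_qualstats source):
    - ratio: the larger row count divided by the smaller one, with both
      counts clamped to at least 1 to avoid dividing by zero;
    - num: the absolute difference between the two row counts.
    """
    est = max(estimated_rows, 1)
    act = max(actual_rows, 1)
    ratio = max(est, act) / min(est, act)
    num = abs(estimated_rows - actual_rows)
    return ratio, num


# AND-ed quals on t1: 12500 rows estimated, 0 rows actually returned.
print(estimate_error(12500, 0))

# val1 = 0 alone on t2: 25078 rows estimated, 25000 actually returned.
print(estimate_error(25078, 25000))
```

The view reports values aggregated over all the tracked executions, so figures computed for a single execution do not have to match the row for val1 alone exactly.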

Also, if you have the pg_stat_statements extension installed, pg_qualstats will give you the query identifier for each predicate. With that and a bit of SQL, you can for instance find the queries with a long average execution time that contain quals for which the selectivity estimation is off by 10 or more.

Interlude: Extended statistics

If you’re wondering how to solve the issue I just explained, the solution is very easy since extended statistics were introduced in PostgreSQL 10, assuming that you know that’s the root issue. Create extended statistics on the related columns, perform an ANALYZE and you’re done!

rjuju=# CREATE STATISTICS pgqs_stats ON val1, val2 FROM pgqs;
CREATE STATISTICS

rjuju=# ANALYZE pgqs;
ANALYZE

rjuju=# EXPLAIN ANALYZE SELECT *
        FROM pgqs t1
        JOIN pgqs t2 ON t1.val1 = t2.val1
        WHERE t1.val1 = 0 AND t1.val2 = 0 order by t1.val2;
                             QUERY PLAN
-------------------------------------------------------------------------
 Nested Loop  ([...] rows=25002 width=16) ([...] rows=0 loops=1)
   ->  Seq Scan on pgqs t1  ([...] rows=1 width=8) ([...] rows=0 loops=1)
         Filter: ((val1 = 0) AND (val2 = 0))
         Rows Removed by Filter: 50000
   ->  Seq Scan on pgqs t2  ([...] rows=25002 width=8) (never executed)
         Filter: (val1 = 0)
 Planning Time: 0.559 ms
 Execution Time: 39.471 ms
(8 rows)

If you want more details on extended statistics, I recommend looking at the slides from Tomas Vondra’s excellent talk on this subject.

Going further

Tracking the quals of every single query executed is of course quite expensive, and would significantly impact the performance of any non-datawarehouse workload. That’s why pg_qualstats has an option, pg_qualstats.sample_rate, to sample the queries that will be processed. This setting defaults to 1 / max_connections, which makes the overhead quite negligible, but don’t be surprised if you don’t see any qual reported after running a few queries!
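To get a feeling for how rarely a query is picked with the default setting, here is a small Python sketch. It assumes that each query is sampled independently with a probability equal to the sample rate (a simplification), and uses 100 for max_connections, which is PostgreSQL’s default value; chance_of_at_least_one_sample is a hypothetical helper name.

```python
max_connections = 100  # PostgreSQL's default; adjust to your own setting
sample_rate = 1 / max_connections


def chance_of_at_least_one_sample(queries, rate=sample_rate):
    """Probability that at least one of `queries` executions is sampled,
    assuming each one is picked independently with probability `rate`."""
    return 1 - (1 - rate) ** queries


for n in (1, 10, 100, 500):
    print(f"{n:>4} queries: {chance_of_at_least_one_sample(n):.1%}")
```

With those assumptions, a handful of queries will most of the time leave the view empty, which is why a few quick manual tests usually require raising the sample rate first.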

But if you’re instead only interested in the quals that have a bad selectivity estimation, for instance to detect this class of problem rather than missing indexes, there are two new options available for that:

pg_qualstats.min_err_estimate_ratio
pg_qualstats.min_err_estimate_num

Those options are cumulative and can be changed at any time, and will limit the quals that pg_qualstats stores to the ones that have a selectivity estimation ratio and/or raw number higher than what you ask for. Although those options help to reduce the performance overhead, they can of course be combined with pg_qualstats.sample_rate if needed.

Conclusion

After introducing the new global index advisor, this article presented a class of problems that DBAs frequently see, and how to detect and solve them.

I believe that those two new features in pg_qualstats will greatly help PostgreSQL database administration. Also, external tools that aim to solve related issues, such as pg_plan_advsr or AQO, could benefit from pg_qualstats, as they could directly get the exact data they need to perform analysis and optimize the queries!

Planner selectivity estimation error statistics with pg_qualstats 2 was originally published by Julien Rouhaud at rjuju's home on February 28, 2020.

New in pg13: New leader_pid column in pg_stat_activity

2020-02-06T12:59:53+00:00

New leader_pid column in pg_stat_activity view

Surprisingly, since parallel query was introduced in PostgreSQL 9.6, it was impossible to know which backend a parallel worker was related to. So, as Guillaume pointed out, it is quite difficult to build simple tools that can sample the wait events related to all the processes involved in a query. A simple solution to that problem is to export the lock group leader information available in the backend at the SQL level:

commit b025f32e0b5d7668daec9bfa957edf3599f4baa8
Author: Michael Paquier <[email protected]>
Date:   Thu Feb 6 09:18:06 2020 +0900

Add leader_pid to pg_stat_activity

This new field tracks the PID of the group leader used with parallel
query.  For parallel workers and the leader, the value is set to the
PID of the group leader.  So, for the group leader, the value is the
same as its own PID.  Note that this reflects what PGPROC stores in
shared memory, so as leader_pid is NULL if a backend has never been
involved in parallel query.  If the backend is using parallel query or
has used it at least once, the value is set until the backend exits.

Author: Julien Rouhaud
Reviewed-by: Sergei Kornilov, Guillaume Lelarge, Michael Paquier, Tomas
Vondra
Discussion: https://postgr.es/m/CAOBaU_Yy5bt0vTPZ2_LUM6cUcGeqmYNoJ8-Rgto+c2+w3defYA@mail.gmail.com

With this change, you can now easily find all processes involved in a parallel query. For instance:

=# SELECT query, leader_pid,
  array_agg(pid) filter(WHERE leader_pid != pid) AS members
FROM pg_stat_activity
WHERE leader_pid IS NOT NULL
GROUP BY query, leader_pid;
       query       | leader_pid |    members
-------------------+------------+---------------
 select * from t1; |      31630 | {32269,32268}
(1 row)

Be careful: as mentioned in the commit message, if the leader_pid is the same as pid, it doesn’t necessarily mean that the backend is currently performing a parallel query, as once set this field is never reset. Also, to avoid extra overhead, no additional lock is held while outputting the data. It means that each row is processed independently. So, while quite unlikely, you can get inconsistent data in some circumstances, such as a parallel worker pointing to a pid that has already disconnected.

New in pg13: New leader_pid column in pg_stat_activity was originally published by Julien Rouhaud at rjuju's home on February 06, 2020.

pg qualstats 2: Global index advisor

2020-01-06T12:23:29+00:00

Coming up with a good index suggestion can be a complex task. It requires knowledge of both the application queries and the database specificities. Over the years many projects have tried to solve this problem, one of them being PoWA version 3, with the help of the pg_qualstats extension. This tool gives pretty good index suggestions, but it requires installing and configuring PoWA, while some users would like to have only the global index advisor. To answer this need for simplicity, the algorithm used in PoWA is now available in pg_qualstats version 2, without requiring any additional component.

EDIT: The pg_qualstats_index_advisor() function has been changed to return json rather than jsonb, so that compatibility with PostgreSQL 9.3 is maintained. The example queries are therefore also modified to use json_array_elements() rather than jsonb_array_elements().

What is pg_qualstats

A simple way to explain what pg_qualstats is would be to say that it’s an extension similar to pg_stat_statements but working at the predicate level.

This extension saves useful statistics for WHERE and JOIN clauses: which table and column a predicate refers to, the number of times a predicate has been used, the number of executions of the underlying operator, whether the predicate comes from an index scan or not, the selectivity, the constant values and much more.

You can deduce many things from such information. For instance, if you examine the predicates that contain references to different tables, you can find which tables are joined together, and how selective the join conditions are.

Global suggestion?

As I mentioned, the global index advisor added in pg_qualstats 2 uses the same approach as the one in PoWA, so this article can be used to describe how both tools work. The only difference is that you’ll likely get a better suggestion with PoWA, as more predicates will be available, and you can also choose the time interval for which you want to detect missing indexes.

The important thing to remember here is that the suggestion is performed globally, that is to say considering all the interesting predicates at the same time. This approach is different from all the other ones I know of, which only consider a single query at a time. I believe that a global approach is better, as it makes it possible to reduce the total number of indexes, by maximizing the usefulness of multi-column indexes.

How the global suggestion works

The first step is to gather all the predicates that could benefit from new indexes. This is particularly easy to get with pg_qualstats. By filtering the predicates coming from sequential scans, executed many times and that filter many rows (both in number and in percentage), you get a perfect list of predicates that very likely need an index (or alternatively, in some cases, a list of poorly written queries). Let’s look for instance at an application that would use these 4 predicates:

Next, we need to build the full set of paths of all the predicates joined by a logical AND that contain other predicates, which can themselves also be joined by logical ANDs. Using the same 4 predicates as above, we get these paths:

Once all the paths are built, we just need to get the best path to find the best index to suggest. The ranking of those paths is for now done by giving each node of each path a weight corresponding to the number of simple predicates it contains, and summing the weights for each path. This is a very simple approach, which favors a minimal number of indexes that optimize as many queries as possible. With our example, we get:

Of course, other ranking approaches could be used to take other parameters into account, and possibly get a better suggestion. For instance, by also taking into account the number of executions or the selectivity of the predicates. If the read/write ratio of each table is known (which is available with the powa-archivist extension), it would also be possible to adapt the ranking to limit the index suggestions for tables that are almost exclusively accessed for writes. With this algorithm, those adjustments would be relatively simple to make.

Once the best path is found, we can generate the index creation statement! As the column order can be important, the statement is generated by getting the columns of each node in ascending weight order. With our example, the following index is generated:

CREATE INDEX ON t1 (id, ts, val);

Once the index is found, we simply remove the contained predicates from the global list of predicates and start again from scratch until there are no predicates left.

A few more details and caveats

Of course, this is a simplified version of the suggestion algorithm, as some other information is required. For instance, the list of predicates is actually adjusted with the operator classes and access methods depending on the column type and its operator, to make sure that the suggested indexes are valid. If multiple index access methods are found for the same best path, btree will be chosen in priority.

This brings us to another detail: this approach is mostly designed for btree indexes, for which the column order is critical. Some other access methods don’t require a specific column order, and for those access methods a better suggestion might be possible if the column order wasn’t taken into account.

Another important point is that the operator classes and access methods are not hardcoded but retrieved at execution time using the local catalogs. Therefore, you can get different (and possibly better) results if you make sure that all the additional operator classes are available when you use the global index advisor. This could be the btree_gist and btree_gin extensions, but also other index access methods. It’s also possible that some types / operators don’t have any associated access method in the catalogs. In that case, those predicates are returned separately in a list of predicates that can’t be optimized automatically, and for which a manual analysis is required.

Finally, as pg_qualstats doesn’t handle predicates made of expressions, the tool can’t suggest indexes on expressions, for instance if you’re using full text search.

Usage example

A single function is provided, with optional parameters, which returns a json value:

CREATE OR REPLACE FUNCTION pg_qualstats_index_advisor (
    min_filter integer DEFAULT 1000,
    min_selectivity integer DEFAULT 30,
    forbidden_am text[] DEFAULT '{}')
    RETURNS json

The parameter names are self explanatory:

min_filter: how many rows the predicate must filter on average to be considered by the global advisor, 1000 by default;
min_selectivity: what the average selectivity of a predicate must be for it to be considered by the global advisor, 30% by default;
forbidden_am: list of index access methods to ignore. None by default, although for versions 9.6 and below hash indexes are ignored internally, as those are only safe since version 10.

Here is a simple example, taken from the pg_qualstats regression tests:

CREATE TABLE pgqs AS SELECT id, 'a' val FROM generate_series(1, 100) id;
CREATE TABLE adv (id1 integer, id2 integer, id3 integer, val text);
INSERT INTO adv SELECT i, i, i, 'line ' || i from generate_series(1, 1000) i;
SELECT pg_qualstats_reset();
SELECT * FROM adv WHERE id1 < 0;
SELECT count(*) FROM adv WHERE id1 < 500;
SELECT * FROM adv WHERE val = 'meh';
SELECT * FROM adv WHERE id1 = 0 and val = 'meh';
SELECT * FROM adv WHERE id1 = 1 and val = 'meh';
SELECT * FROM adv WHERE id1 = 1 and id2 = 2 AND val = 'meh';
SELECT * FROM adv WHERE id1 = 6 and id2 = 6 AND id3 = 6 AND val = 'meh';
SELECT * FROM adv WHERE val ILIKE 'moh';
SELECT COUNT(*) FROM pgqs WHERE id = 1;

And here is what the function returns:

SELECT v
  FROM json_array_elements(
    pg_qualstats_index_advisor(min_filter => 50)->'indexes') v
  ORDER BY v::text COLLATE "C";
                               v
---------------------------------------------------------------
 "CREATE INDEX ON public.adv USING btree (id1)"
 "CREATE INDEX ON public.adv USING btree (val, id1, id2, id3)"
 "CREATE INDEX ON public.pgqs USING btree (id)"
(3 rows)

SELECT v
  FROM json_array_elements(
    pg_qualstats_index_advisor(min_filter => 50)->'unoptimised') v
  ORDER BY v::text COLLATE "C";
        v
-----------------
 "adv.val ~~* ?"
(1 row)

Version 2 of pg_qualstats isn’t available as a stable release yet, but feel free to test it and report any issue you may find!

pg qualstats 2: Global index advisor was originally published by Julien Rouhaud at rjuju's home on January 06, 2020.

pg qualstats 2: Global index advisor

2020-01-06T12:23:29+00:00

Coming up with a good index suggestion can be a complex task. It requires knowledge of both the application queries and the database specificities. Over the years multiple projects have tried to solve this problem, one of them being PoWA version 3, with the help of the pg_qualstats extension. It can give pretty good index suggestions, but it requires installing and configuring PoWA, while some users only wanted the global index advisor. For such cases, and for simplicity, the algorithm used in PoWA is now available in pg_qualstats version 2 without requiring any additional component.

EDIT: The pg_qualstats_index_advisor() function has been changed to return json rather than jsonb, so that the compatibility with PostgreSQL 9.3 is maintained. The query examples are therefore also modified to use json_array_elements() rather than jsonb_array_elements().

What is pg_qualstats

A simple way to explain what pg_qualstats is would be to say that it’s like pg_stat_statements working at the predicate level.

The extension saves useful statistics for WHERE and JOIN clauses: which table and column a predicate refers to, the number of times the predicate has been used, the number of executions of the underlying operator, whether it’s a predicate from an index scan or not, the selectivity, the constant values used and much more.

You can deduce many things from such information. For instance, if you examine the predicates that contain references to different tables, you can find which tables are joined together, and how selective those join conditions are.

Global suggestion?

As I mentioned, the global index advisor added in pg_qualstats 2 uses the same approach as the one in PoWA, so the explanation here describes both tools. The only difference is that with PoWA you’ll likely get a better suggestion, as more predicates will be available, and you can also choose the time interval for which you want to detect missing indexes.

The important thing here is that the suggestion is performed globally, considering all interesting predicates at the same time. This approach is different from all the other approaches I have seen, which only consider a single query at a time. I believe that a global approach is better, as it makes it possible to reduce the total number of indexes by maximizing the usefulness of multi-column indexes.

How global suggestion is done

The first step is to gather all predicates that could benefit from a new index. This is easy to get with pg_qualstats, by filtering the predicates coming from sequential scans, executed many time, that filter many rows (both in number of rows and in percentage) you get a perfect list of predicates that likely miss an index (or alternatively the list of poorly written queries in certain cases). For instance, let’s consider an application which uses those 4 predicates:

Next, we build the full set of paths with each AND-ed predicates that contains other, also possibly AND-ed, predicates. Using the same 4 predicates, we would get those paths:

Once all the paths are built, we just need to pick the best path to find out the best index to suggest. For now, the scoring is done by giving each node of each path a weight equal to the number of simple predicates it contains, and summing the weights for each path. This is very simple, and it favors a smaller number of indexes that optimize as many queries as possible. With our simple example, we get:

Of course, other scoring approaches could be used to take other parameters into account and possibly give better suggestions. For instance, the score could combine the number of executions or the predicate selectivity. If the read/write ratio for each table is known (this is available using powa-archivist), it would also be possible to adapt the scoring method to limit index suggestions for write-mostly tables. With this algorithm, all of that could be added quite easily.

Once the best path is found, we can generate an index DDL! As the order of the columns can be important, this is done by getting the columns for each node in ascending weight order. In our example, we would generate this index:

CREATE INDEX ON t1 (id, ts, val);

Once an index is found, we simply remove the contained predicates from the global list of predicates and start again from scratch until there are no predicates left.
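To make the loop above more concrete, here is a minimal Python sketch of it. It is not the pg_qualstats code: predicates are modeled as plain sets of column names on a single table (operators, operator classes and access methods are ignored), the 4 predicates used are hypothetical and not necessarily the ones from the example above, and the alphabetical tie-break between new columns of a node is an arbitrary choice of this sketch.

```python
def build_paths(quals):
    """Return every chain of predicates where each one strictly contains the previous."""
    paths = []

    def extend(path):
        paths.append(path)
        for q in quals:
            if path[-1] < q:  # strict subset: q contains the last node
                extend(path + [q])

    for q in quals:
        extend([q])
    return paths


def score(path):
    """Weight of a node = number of simple predicates; path score = sum of weights."""
    return sum(len(q) for q in path)


def columns_for(path):
    """Collect columns node by node, in ascending weight order."""
    cols = []
    for q in sorted(path, key=len):
        for col in sorted(q):  # assumption: alphabetical order inside a node
            if col not in cols:
                cols.append(col)
    return cols


def suggest_indexes(table, quals):
    """Greedy loop: best path -> one index, drop its predicates, start again."""
    remaining = set(quals)
    ddl = []
    while remaining:
        ordered = sorted(remaining, key=sorted)  # deterministic iteration order
        best = max(build_paths(ordered), key=score)
        ddl.append(f"CREATE INDEX ON {table} ({', '.join(columns_for(best))});")
        remaining -= set(best)
    return ddl


# Hypothetical predicates: id = ?, id = ? AND ts = ?, id = ? AND ts = ? AND val = ?, val = ?
example_quals = [
    frozenset({"id"}),
    frozenset({"id", "ts"}),
    frozenset({"id", "ts", "val"}),
    frozenset({"val"}),
]

for statement in suggest_indexes("t1", example_quals):
    print(statement)
```

With those hypothetical predicates, the best path weighs 1 + 2 + 3 = 6, which gives a single three-column index for the first three predicates, and the remaining predicate on val gets its own index in the next iteration.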

Additional details and caveats

Of course, this is a simplified version of the suggestion algorithm. Some additional information is required. For instance, the list of predicates is actually expanded with operator classes and access methods depending on the column types and operators, to make sure that the suggested indexes are valid. If multiple index methods are found for a best path, btree will be chosen in priority.

This brings another consideration: this approach is mostly designed for btree indexes, for which the column order is critical. Some other access methods don’t require a specific column order, and for those it could be possible to get better index suggestions if the column order wasn’t taken into account.

Another important point is that the operator classes and access methods are not hardcoded but retrieved at execution time using the local catalogs. Therefore, you can get different (and possibly better) results if you make sure that optional operator classes are present when using the index advisor. This could be the btree_gist or btree_gin extensions, but also other access methods. It’s also possible that some type / operator combinations don’t have any associated access method recorded in the catalogs. In this case, those predicates are returned separately as a list of unoptimizable predicates, which should be analyzed manually.

Finally, as pg_qualstats doesn’t consider expression predicates, this advisor can’t suggest indexes on expressions, for instance if you’re using full-text search.

Usage example

A simple function is provided, with optional parameters, that returns a json value:

CREATE OR REPLACE FUNCTION pg_qualstats_index_advisor (
    min_filter integer DEFAULT 1000,
    min_selectivity integer DEFAULT 30,
    forbidden_am text[] DEFAULT '{}')
    RETURNS json

The parameter names are self-explanatory:

min_filter: the minimum number of tuples a predicate must filter on average to be considered for the global optimization, 1000 by default.
min_selectivity: the minimum percentage of tuples a predicate must filter on average to be considered for the global optimization, 30% by default.
forbidden_am: list of access methods to ignore. None by default, although for PostgreSQL 9.6 and prior hash indexes will internally be discarded, as those are only safe since version 10.
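As an illustration of how the first two thresholds narrow down the candidate predicates, here is a small Python sketch. The QualStat record and its field names are hypothetical, and the use of inclusive comparisons is an assumption; the real function works on the pg_qualstats data and may differ in those details.

```python
from dataclasses import dataclass


@dataclass
class QualStat:
    """Hypothetical per-predicate summary, not the pg_qualstats data model."""
    qual: str            # normalized predicate text
    from_seqscan: bool   # True if the predicate was evaluated in a sequential scan
    avg_filtered: float  # average number of tuples filtered per execution
    filter_ratio: float  # average percentage of tuples filtered


def candidates(stats, min_filter=1000, min_selectivity=30):
    """Keep the predicates that filter enough tuples, in number and in percentage."""
    return [
        s.qual
        for s in stats
        if s.from_seqscan
        and s.avg_filtered >= min_filter        # assumption: inclusive threshold
        and s.filter_ratio >= min_selectivity   # assumption: inclusive threshold
    ]


sample = [
    QualStat("adv.id1 = ?", True, 999.0, 99.9),
    QualStat("adv.val = ?", True, 1000.0, 100.0),
    QualStat("pgqs.id = ?", False, 5000.0, 90.0),
]
print(candidates(sample))
print(candidates(sample, min_filter=50))
```

Lowering min_filter, as done with min_filter => 50 in the example that follows, simply lets predicates on small tables through.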

Using pg_qualstats regression tests, let’s see a simple example:

CREATE TABLE pgqs AS SELECT id, 'a' val FROM generate_series(1, 100) id;
CREATE TABLE adv (id1 integer, id2 integer, id3 integer, val text);
INSERT INTO adv SELECT i, i, i, 'line ' || i from generate_series(1, 1000) i;
SELECT pg_qualstats_reset();
SELECT * FROM adv WHERE id1 < 0;
SELECT count(*) FROM adv WHERE id1 < 500;
SELECT * FROM adv WHERE val = 'meh';
SELECT * FROM adv WHERE id1 = 0 and val = 'meh';
SELECT * FROM adv WHERE id1 = 1 and val = 'meh';
SELECT * FROM adv WHERE id1 = 1 and id2 = 2 AND val = 'meh';
SELECT * FROM adv WHERE id1 = 6 and id2 = 6 AND id3 = 6 AND val = 'meh';
SELECT * FROM adv WHERE val ILIKE 'moh';
SELECT COUNT(*) FROM pgqs WHERE id = 1;

And here’s what the function returns:

SELECT v
  FROM json_array_elements(
    pg_qualstats_index_advisor(min_filter => 50)->'indexes') v
  ORDER BY v::text COLLATE "C";
                               v
---------------------------------------------------------------
 "CREATE INDEX ON public.adv USING btree (id1)"
 "CREATE INDEX ON public.adv USING btree (val, id1, id2, id3)"
 "CREATE INDEX ON public.pgqs USING btree (id)"
(3 rows)

SELECT v
  FROM json_array_elements(
    pg_qualstats_index_advisor(min_filter => 50)->'unoptimised') v
  ORDER BY v::text COLLATE "C";
        v
-----------------
 "adv.val ~~* ?"
(1 row)
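Since the function returns a single json value, a client can split it into the two lists shown above. Here is a minimal Python sketch, assuming the json document has been fetched as text and only contains the two keys used in the queries above (indexes and unoptimised); the sample document is built from the output shown above.

```python
import json


def split_advice(raw):
    """Split the advisor's json document into (index DDL list, unoptimised predicates)."""
    advice = json.loads(raw)
    return advice.get("indexes", []), advice.get("unoptimised", [])


# Sample document reproducing the values shown in the psql output above.
raw = json.dumps({
    "indexes": [
        "CREATE INDEX ON public.adv USING btree (id1)",
        "CREATE INDEX ON public.adv USING btree (val, id1, id2, id3)",
        "CREATE INDEX ON public.pgqs USING btree (id)",
    ],
    "unoptimised": ["adv.val ~~* ?"],
})

indexes, unoptimised = split_advice(raw)
for ddl in indexes:
    print(ddl + ";")
for qual in unoptimised:
    print("-- needs manual analysis:", qual)
```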

Version 2 of pg_qualstats is not released yet, but feel free to test it and report any issue you may find!

pg qualstats 2: Global index advisor was originally published by Julien Rouhaud at rjuju's home on January 06, 2020.

PoWA 4: New powa-collector daemon

2019-12-10T18:54:17+00:00

This article is part of a series of articles on the PoWA 4 beta, and describes the new powa-collector daemon.

New powa-collector daemon

This daemon replaces the previous background worker when the new remote mode is used. It is a simple daemon written in Python, which takes care of all the steps required to perform remote snapshots. It is available on PyPI.

As I explained in my previous article introducing PoWA 4, this daemon is required to set up remote mode, keeping this architecture in mind:

Its configuration is very simple. All you need to do is rename the provided powa-collector.conf.sample file and adapt the connection URI to describe how to connect to your dedicated repository server, and you’re done.

A typical configuration should look like:

{
    "repository": {
        "dsn": "postgresql://powa_user@server_dns:5432/powa"
    },
    "debug": true
}

The list of remote servers, their configuration and everything else needed for proper operation will be automatically retrieved from the repository server you have already configured. Once started, it will spawn one dedicated thread per declared remote server, and maintain a persistent connection to that remote server. Each thread will perform a remote snapshot, exporting the data to the repository server using the new source functions. Each thread will open and close a connection to the repository server while performing the remote snapshot.

Of course, this daemon needs to be able to connect to all the declared remote servers as well as to the repository server. The powa_servers table, which stores the list of remote servers, has fields to store the username and password used to connect to the remote servers. Storing a plain-text password in this table is a heresy as far as security is concerned. So, as indicated in the PoWA security section, you can store a NULL password and use instead any of the other authentication methods supported by libpq (.pgpass file, certificate…). This is very strongly recommended for any serious installation.

The persistent connection to the repository server is used to monitor the daemon:

to check that the daemon is up and running
to communicate through the UI using a simple protocol in order to perform various actions (reload the configuration, check the status of the thread dedicated to a remote server…)

Note that you can also ask the daemon to reload its configuration by sending a SIGHUP to the daemon process. A reload is required after any change to the list of remote servers (adding or removing a remote server, or updating an existing one).

Please also note that, by design, powa-collector will not perform local snapshots. If you want to use PoWA for the repository server, you will need to enable the original background worker.

New configuration page

The configuration page has been updated to give all the necessary information about the status of the background worker, the powa-collector daemon (including all of its dedicated threads) and the list of declared remote servers. Here is an example of this new root configuration page:

If the powa-collector daemon is used, the status of each remote server will be retrieved using the communication protocol. If the collector encounters errors (when connecting to a remote server or during a snapshot, for example), they will also be displayed here. Note that these errors will also be displayed at the top of every page of the UI, to make sure you don’t miss them.

In addition, the configuration section now has a hierarchy, and you will be able to see the list of extensions as well as the current PostgreSQL configuration for the local or remote server by clicking on the server of your choice!

There is also a new Reload collector button on the header panel which, as you would expect, will ask the collector to reload its configuration. This can be useful if you have declared new servers but don’t have access to the server on which the collector is running.

Conclusion

This article is the last of the series about the new version of PoWA. It is still in beta, so feel free to test it, report any bug you encounter or give any other feedback!

PoWA 4: New powa-collector daemon was originally published by Julien Rouhaud at rjuju's home on December 10, 2019.

PoWA 4: New powa-collector daemon

2019-12-10T18:54:17+00:00

This article is part of the PoWA 4 beta series, and describes the new powa-collector daemon.

New powa-collector daemon

This daemon replaces the previous background worker when using the new remote mode. It’s a simple daemon written in Python, which takes care of all the steps required to perform remote snapshots. It’s available on PyPI.

As I explained in my previous article introducing PoWA 4, this daemon is required for a remote mode setup, with this architecture in mind:

Its configuration is very simple. All you need to do is copy and rename the provided powa-collector.conf.sample file, and adapt the connection URI to describe how to connect to your dedicated repository server, and you’re done.

A typical configuration will look like:

{
    "repository": {
        "dsn": "postgresql://powa_user@server_dns:5432/powa"
    },
    "debug": true
}
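As a quick sanity check before starting the daemon, a configuration of this shape can be validated with a few lines of Python. This is only a sketch, assuming the file is parsed as strict JSON (in which case a trailing comma is a syntax error); the helper name load_collector_config is hypothetical and not part of powa-collector.

```python
import json


def load_collector_config(text):
    """Parse a powa-collector style configuration and check the repository DSN."""
    conf = json.loads(text)  # strict JSON: comments and trailing commas are rejected
    dsn = conf.get("repository", {}).get("dsn")
    if not dsn:
        raise ValueError("missing repository.dsn")
    conf.setdefault("debug", False)  # assumption of this sketch: debug is optional
    return conf


sample = """
{
    "repository": {
        "dsn": "postgresql://powa_user@server_dns:5432/powa"
    },
    "debug": true
}
"""

conf = load_collector_config(sample)
print(conf["repository"]["dsn"], conf["debug"])
```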

The list of remote servers, their configuration and everything else it needs will be automatically retrieved from the repository server you just configured. When started, it’ll spawn one dedicated thread per declared remote server, and maintain a persistent connection to the configured powa database on this remote server. Each thread will perform a remote snapshot, exporting the data to the repository server using the new source functions. Each thread will open and close a connection to the repository server when performing the remote snapshot.

This daemon obviously needs to be able to connect to all the declared remote servers and the repository server. The powa_servers table, which stores the list of remote servers, has fields to store the username and password used to connect to the remote server. Storing a password in plain text in this table is a heresy as far as security is concerned. So, as mentioned in the PoWA security documentation, you can store a NULL password and instead use any of the authentication methods that libpq supports (.pgpass file, certificate…). That’s strongly recommended for any non-toy setup.

The persistent connection to the repository server is used to monitor the daemon:

to check that the daemon is up and running
to communicate through the UI using a simple protocol to perform various actions (reload the configuration, check for a remote server thread status…)

Note that you can also ask the daemon to reload its configuration by sending a SIGHUP to the daemon process. A reload is required after any modification to the list of remote servers (if you added or removed a remote server, or updated a setting for an existing one).

Also note that by choice, powa-collector will not perform local snapshots. If you want to use PoWA for the repository server, you need to enable the original background worker.

New configuration page

The configuration page is now updated to give all needed information about the background worker status and the powa-collector daemon status (including all of its dedicated threads) and the list of registered remote servers. Here’s an example of the new root configuration page:

If the powa-collector daemon is used, each remote server status will be retrieved using the communication protocol. If the collector encountered any error (connecting to a remote server, during a snapshot or anything else), it will also be displayed here. Such errors are also displayed at the top of every page of the UI, so that you can’t miss them.

Also, the configuration section now has a hierarchy, and you’ll be able to see the list of extensions and the current PostgreSQL configuration for the local or remote servers by clicking on the server of your choice!

There’s also a new Reload collector button on the header panel, which as expected will ask the collector to reload its configuration. That can be useful if you registered new servers and you don’t have access to the server where the collector is running.

Conclusion

This is the last article introducing the new version of PoWA. It’s still in beta, so feel free to test it, report any issue you may find or give any other feedback!

PoWA 4: New powa-collector daemon was originally published by Julien Rouhaud at rjuju's home on December 10, 2019.

PoWA 4: changes in powa-archivist!

2019-06-05T14:26:17+00:00

This article is part of a series of articles on the PoWA 4 beta, and describes the changes made in powa-archivist.

For more information about this version 4, you can read the general introduction article.

Quick overview

First of all, you should know that no upgrade is possible from v3 to v4, so a DROP EXTENSION powa is required if you were already using PoWA on your servers. This is because v4 brings a great many changes to the SQL part of the extension, making it the most significant change in the PoWA suite for this new version. At the time I’m writing this article, the amount of changes made in this extension is:

 CHANGELOG.md       |   14 +
 powa--4.0.0dev.sql | 2075 +++++++++++++++++++++-------
 powa.c             |   44 +-
 3 files changed, 1629 insertions(+), 504 deletions(-)

The lack of an upgrade path shouldn’t be a problem in practice. PoWA is a performance analysis tool: it is meant to keep data with high precision but with a very limited history. If you are looking for a general-purpose monitoring solution to keep months of data, PoWA is definitely not the tool you need.

Configuring the list of remote servers

As for the changes themselves, the first small change is that the background worker is no longer required for powa-archivist to work, as it is not used in remote mode. This means that a PostgreSQL restart is no longer needed to install PoWA. Of course, a restart is still required if you want to use local mode, using the background worker, or if you want to install additional extensions that themselves require a restart.

Next, as PoWA requires a bit of configuration (snapshot frequency, data retention and so on), some new tables are added to configure all of that. The new powa_servers table stores the configuration of all the remote instances whose data should be stored on this instance. This local PoWA instance is called a repository server (which should typically be dedicated to storing PoWA data), as opposed to the remote instances, which are the instances you want to monitor. The content of this table is as simple as it gets:

\d powa_servers
                              Table "public.powa_servers"
    Column     |   Type   | Collation | Nullable |                 Default
---------------+----------+-----------+----------+------------------------------------------
 id            | integer  |           | not null | nextval('powa_servers_id_seq'::regclass)
 hostname      | text     |           | not null |
 alias         | text     |           |          |
 port          | integer  |           | not null |
 username      | text     |           | not null |
 password      | text     |           |          |
 dbname        | text     |           | not null |
 frequency     | integer  |           | not null | 300
 powa_coalesce | integer  |           | not null | 100
 retention     | interval |           | not null | '1 day'::interval

If you have already used PoWA, you should recognize most of the configuration options, which are now stored here. The new options are used to describe how to connect to the remote instances, and can provide an alias to display in the UI.

You have also probably noticed a password column. Storing a plain-text password in this table is a heresy for anyone who wants a minimum of security. So, as mentioned in the security section of the PoWA documentation, you can store NULL in the password field and use instead any of the other authentication methods supported by libpq (.pgpass file, certificate…). A more secure authentication method is strongly recommended for any serious installation.

Another table, powa_snapshot_metas, is also added to store some metadata about the snapshot information for each remote server.

                                   Table "public.powa_snapshot_metas"
    Column    |           Type           | Collation | Nullable |                Default
--------------+--------------------------+-----------+----------+---------------------------------------
 srvid        | integer                  |           | not null |
 coalesce_seq | bigint                   |           | not null | 1
 snapts       | timestamp with time zone |           | not null | '-infinity'::timestamp with time zone
 aggts        | timestamp with time zone |           | not null | '-infinity'::timestamp with time zone
 purgets      | timestamp with time zone |           | not null | '-infinity'::timestamp with time zone
 errors       | text[]                   |           |          |

It simply contains a counter for the number of snapshots performed, a timestamp for each kind of event (snapshot, aggregation and purge) and a text array to store any error that occurs during the snapshot, so that the UI can display it.

SQL API to configure the remote servers

Although these tables are very simple, a basic SQL API is available to declare new servers and configure them. 6 basic functions are available:

powa_register_server(), to declare a new remote server, as well as the list of extensions available on it
powa_configure_server() to update one of the settings for the specified remote server (using a JSON parameter, where the key is the name of the parameter to change and the value is the new value to use)
powa_deactivate_server() to disable snapshots for the specified remote server (which in practice sets the frequency parameter to -1)
powa_delete_and_purge_server() to remove the specified remote server from the list of servers and remove all associated snapshot data
powa_activate_extension(), to declare that a new extension is available on the specified remote server
powa_deactivate_extension(), to specify that an extension is no longer available on the specified remote server

Any action more complicated than that has to be performed using SQL queries. Fortunately, there shouldn’t be many other needs, and the tables are really simple, so this shouldn’t be a problem. Feel free to ask for new functions if you have other needs though. Please also note that the UI doesn’t allow you to call these functions, as it is for now entirely read-only.

Performing remote snapshots

As metrics are now stored on a different PostgreSQL instance, we extensively changed the way snapshots (retrieving the data provided by a stat extension and storing it in the PoWA catalog in a space-efficient way) are performed.

The list of all stat extensions, or data sources, that are available on a server (either remote or local) and for which a snapshot should be performed is stored in a table called powa_functions:

               Table "public.powa_functions"
     Column     |  Type   | Collation | Nullable | Default
----------------+---------+-----------+----------+---------
 srvid          | integer |           | not null |
 module         | text    |           | not null |
 operation      | text    |           | not null |
 function_name  | text    |           | not null |
 query_source   | text    |           |          |
 added_manually | boolean |           | not null | true
 enabled        | boolean |           | not null | true
 priority       | numeric |           | not null | 10

A new query_source field has been added. It provides the name of the source function, which is required for a stat extension to be compatible with remote snapshots. This function is used to export the counters provided by this extension to a different server, in a dedicated transient table. The snapshot function will then perform the snapshot automatically using this exported data rather than the data provided by the local stat extension when remote mode is used. Note that the export of these counters and the remote snapshot are performed automatically by the new powa-collector daemon, which I’ll cover in another article.

Here is an example showing how PoWA performs a remote snapshot of the list of databases. As you’ll see, it is very simple, which means that it’s also very easy to add the same compatibility for a new stat extension.

The transient table:

   Unlogged table "public.powa_databases_src_tmp"
 Column  |  Type   | Collation | Nullable | Default
---------+---------+-----------+----------+---------
 srvid   | integer |           | not null |
 oid     | oid     |           | not null |
 datname | name    |           | not null |

For better performance, all the transient tables are unlogged, as their content is only needed during a snapshot and is removed right afterwards. In this example, the transient table only stores the identifier of the remote server the data belongs to, and the oid and name of each database present on the remote server.

And the source function:

CREATE OR REPLACE FUNCTION public.powa_databases_src(_srvid integer,
    OUT oid oid, OUT datname name)
 RETURNS SETOF record
 LANGUAGE plpgsql
AS $function$
BEGIN
    IF (_srvid = 0) THEN
        RETURN QUERY SELECT d.oid, d.datname
        FROM pg_database d;
    ELSE
        RETURN QUERY SELECT d.oid, d.datname
        FROM powa_databases_src_tmp d
        WHERE srvid = _srvid;
    END IF;
END;
$function$

This function simply returns the content of pg_database if local data is requested (server identifier 0 is always the local server), or the content of the transient table for the specified remote server otherwise.

The snapshot function can then easily do any processing with this data for the wanted remote server. In the case of the powa_databases_snapshot() function, it simply synchronizes the list of databases, and stores the drop timestamp if a database that previously existed is no longer listed.
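The synchronization step can be pictured with the following Python sketch. It is only an illustration of the idea, not the powa_databases_snapshot() code (which is an SQL function working on PoWA tables): the dictionary layout is hypothetical, and updating the name of an already known database is an assumption of this sketch.

```python
from datetime import datetime, timezone


def sync_databases(known, current, now=None):
    """Synchronize `known` (oid -> {"datname", "dropped"}) with `current` ((oid, datname) pairs)."""
    now = now or datetime.now(timezone.utc)
    seen = set()
    for oid, datname in current:
        seen.add(oid)
        entry = known.setdefault(oid, {"datname": datname, "dropped": None})
        entry["datname"] = datname  # assumption: pick up renamed databases
    for oid, entry in known.items():
        # A previously known database that is no longer listed gets a drop timestamp.
        if oid not in seen and entry["dropped"] is None:
            entry["dropped"] = now
    return known


# Hypothetical state: database 16384 disappeared, database 16500 is new.
known = {
    13000: {"datname": "postgres", "dropped": None},
    16384: {"datname": "old_db", "dropped": None},
}
current = [(13000, "postgres"), (16500, "new_db")]
for oid, entry in sorted(sync_databases(known, current).items()):
    print(oid, entry["datname"], entry["dropped"])
```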

For more details, you can refer to the documentation about adding a data source to PoWA, which has been updated for the specifics of version 4.

PoWA 4: changes in powa-archivist! was originally published by Julien Rouhaud at rjuju's home on June 05, 2019.

PoWA 4: changes in powa-archivist!

2019-06-05T14:26:17+00:00

This article is part of the PoWA 4 beta series, and describes the changes done in powa-archivist.

For more information about this v4, you can consult the general introduction article.

Quick overview

First of all, you have to know that there is no upgrade possible from v3 to v4, so a DROP EXTENSION powa is required if you were already using PoWA on any of your servers. This is because this v4 involved a lot of changes in the SQL part of the extension, making it the most significant change in the PoWA suite for this new version. Looking at the amount of changes at the time I’m writing this article, I get:

 CHANGELOG.md       |   14 +
 powa--4.0.0dev.sql | 2075 +++++++++++++++++++++-------
 powa.c             |   44 +-
 3 files changed, 1629 insertions(+), 504 deletions(-)

The lack of upgrade shouldn’t be a problem in practice though. PoWA is a performance tool, so it’s intended to have data with high precision but with a very limited history. If you’re looking for a general monitoring solution keeping months of counters, PoWA is definitely not the tool you need.

Configuring the list of remote servers

Concerning the features themselves, the first small change is that powa-archivist does not require the background worker to be active anymore, as it won’t be used for remote setup. That means that a PostgreSQL restart is not needed anymore to install PoWA. Obviously, a restart is still required if you want to use the local setup, using the background worker, or if you want to install additional extensions that themselves require a restart.

Then, as PoWA needs some configuration (frequency of snapshot, data retention and so on), some new tables are added to be able to configure all of that. The new powa_servers table stores the configuration for all the remote instances whose data should be stored on this instance. This local PoWA instance is called a repository server (that typically should be dedicated to storing PoWA data), as opposed to remote instances, which are the instances you want to monitor. The content of this table is pretty straightforward:

\d powa_servers
                              Table "public.powa_servers"
    Column     |   Type   | Collation | Nullable |                 Default
---------------+----------+-----------+----------+------------------------------------------
 id            | integer  |           | not null | nextval('powa_servers_id_seq'::regclass)
 hostname      | text     |           | not null |
 alias         | text     |           |          |
 port          | integer  |           | not null |
 username      | text     |           | not null |
 password      | text     |           |          |
 dbname        | text     |           | not null |
 frequency     | integer  |           | not null | 300
 powa_coalesce | integer  |           | not null | 100
 retention     | interval |           | not null | '1 day'::interval

If you already used PoWA, you should recognize most of the configuration options, that are now stored here. The new options are used to describe how to connect to the remote servers, and can provide an alias to be displayed in the UI.

You also probably noticed a password column here. Storing a password in plain text in this table is a heresy as far as security is concerned. So, as mentioned in the PoWA security section of the documentation, you can store a NULL password and use instead any of the authentication methods that libpq supports (.pgpass file, certificate…). That’s strongly recommended for any non-toy setup.

Another table, the powa_snapshot_metas table, is also added to store some metadata regarding each remote server snapshot information:

                                   Table "public.powa_snapshot_metas"
    Column    |           Type           | Collation | Nullable |                Default
--------------+--------------------------+-----------+----------+---------------------------------------
 srvid        | integer                  |           | not null |
 coalesce_seq | bigint                   |           | not null | 1
 snapts       | timestamp with time zone |           | not null | '-infinity'::timestamp with time zone
 aggts        | timestamp with time zone |           | not null | '-infinity'::timestamp with time zone
 purgets      | timestamp with time zone |           | not null | '-infinity'::timestamp with time zone
 errors       | text[]                   |           |          |

That’s basically a counter to track the number of snapshots done, the timestamp for each kind of event that happened (snapshot, aggregate and purge), and a text array to store any error happening during the snapshot, that the UI can display.

SQL API to configure the remote servers

While those tables are simple, a basic SQL API is available to register new servers and configure them. Basically, 6 functions are available:

powa_register_server(), to declare a new remote server, and the list of extensions available on it
powa_configure_server() to update any setting for the specified remote server (using a JSON where the key is the name of the parameter to change, and the value is the new value to use)
powa_deactivate_server() to disable snapshots on the specified remote server (which actually is setting up the frequency to -1)
powa_delete_and_purge_server() to remove the specified remote server from the list of servers and remove all associated snapshot data
powa_activate_extension(), to declare that a new extension is available on the specified remote server
powa_deactivate_extension(), to specify that an extension is not available anymore on the specified remote server

Any action more complicated than this should be performed using plain SQL queries. Hopefully, there shouldn’t be many other needs, and the tables are straightforward so this shouldn’t be a problem. Feel free to ask for more functions if you feel the need though. Please also note that the UI doesn’t allow you to call those functions, as the UI is for now entirely read-only.

Performing remote snapshots

As metrics are now stored on a different PostgreSQL instance, we had to extensively change the way snapshots (retrieving the data from a stat extension and storing it in the PoWA catalog in a space-efficient way) are performed.

The list of all stat extensions, or data sources, that are available on a server (either remote or local) and for which we should perform a snapshot is configured in a table called powa_functions:

               Table "public.powa_functions"
     Column     |  Type   | Collation | Nullable | Default
----------------+---------+-----------+----------+---------
 srvid          | integer |           | not null |
 module         | text    |           | not null |
 operation      | text    |           | not null |
 function_name  | text    |           | not null |
 query_source   | text    |           |          |
 added_manually | boolean |           | not null | true
 enabled        | boolean |           | not null | true
 priority       | numeric |           | not null | 10

A new query_source field is added, which provides the name of a source function, required to support remote snapshots of any stat extension. This function is used to export the counters provided by this extension on a different server, in a dedicated transient table. When the remote mode is used, the snapshot function will then perform the snapshot using those exported data instead of the ones provided locally by the stat extension. Note that the counter export and the remote snapshot are done automatically by the new powa-collector daemon, which I’ll cover in another article.

Here’s an example of how PoWA performs a remote snapshot of the list of databases. As you’ll see, this is very simple, meaning that it’s very easy to add support for a new stat extension.

The transient table:

   Unlogged table "public.powa_databases_src_tmp"
 Column  |  Type   | Collation | Nullable | Default
---------+---------+-----------+----------+---------
 srvid   | integer |           | not null |
 oid     | oid     |           | not null |
 datname | name    |           | not null |

For better performance, all the transient tables are unlogged, as their content is only needed during a snapshot and is discarded afterwards. In this example the transient table only stores the identifier of the server the data belongs to, and the oid and name of each database present on the remote server.

And the source function:

CREATE OR REPLACE FUNCTION public.powa_databases_src(_srvid integer,
    OUT oid oid, OUT datname name)
 RETURNS SETOF record
 LANGUAGE plpgsql
AS $function$
BEGIN
    IF (_srvid = 0) THEN
        RETURN QUERY SELECT d.oid, d.datname
        FROM pg_database d;
    ELSE
        RETURN QUERY SELECT d.oid, d.datname
        FROM powa_databases_src_tmp d
        WHERE srvid = _srvid;
    END IF;
END;
$function$

This function simply returns the content of pg_database if local data are asked (server id 0 is always the local server), or the content of the transient table for the given remote server otherwise.
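
To make the dispatch logic easier to see outside of a database, here is a small Python model of the same function. It is only a sketch: the two lists are made-up stand-ins for pg_database and powa_databases_src_tmp, and the oids and names they contain are invented for the example.

```python
# Made-up stand-in for the local pg_database catalog: (oid, datname).
LOCAL_PG_DATABASE = [(1, "template1"), (13000, "postgres")]

# Made-up stand-in for powa_databases_src_tmp: (srvid, oid, datname).
POWA_DATABASES_SRC_TMP = [
    (1, 16384, "app"),
    (1, 16385, "powa"),
    (2, 16384, "billing"),
]

def powa_databases_src(srvid):
    """Mimic the plpgsql function: server id 0 is always the local server."""
    if srvid == 0:
        return list(LOCAL_PG_DATABASE)
    # Otherwise only return the exported rows of the requested remote server.
    return [(oid, datname)
            for (row_srvid, oid, datname) in POWA_DATABASES_SRC_TMP
            if row_srvid == srvid]

print(powa_databases_src(0))
print(powa_databases_src(1))
```

The snapshot code never needs to know where the rows come from, which is what keeps adding a new data source simple.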

The snapshot function can then easily do any required work with the data for the wanted remote server. In the case of the powa_databases_snapshot() function, this just means synchronizing the list of databases, and storing the timestamp of removal if a previously existing database is not found anymore.
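
The synchronization described here can be sketched in a few lines of Python. This is not the actual powa_databases_snapshot() code, only an illustration of the idea under assumptions: sync_databases(), the "dropped" key and the choice to clear the removal timestamp when a database is seen again are all hypothetical.

```python
from datetime import datetime, timezone

def sync_databases(known, source_rows, now):
    """Synchronize the stored database list with what the source returned.

    known:       dict mapping oid -> {"datname": str, "dropped": datetime or None}
    source_rows: iterable of (oid, datname), as returned by the source function
    now:         timestamp of the current snapshot
    """
    seen = set()
    for oid, datname in source_rows:
        seen.add(oid)
        # New or still existing database: store its current name and make
        # sure it is not flagged as removed (assumption of this sketch).
        known[oid] = {"datname": datname, "dropped": None}
    for oid, row in known.items():
        # Previously known database that is no longer reported: remember
        # when we first noticed that it was gone.
        if oid not in seen and row["dropped"] is None:
            row["dropped"] = now
    return known

snapshot_ts = datetime(2019, 6, 5, tzinfo=timezone.utc)
stored = {
    16384: {"datname": "app", "dropped": None},
    16385: {"datname": "old_db", "dropped": None},
}
stored = sync_databases(stored, [(16384, "app"), (16386, "new_db")], snapshot_ts)
print(stored)
```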

For more details, you can consult the PoWA datasource integration documentation, which was updated for the version 4 specificities.

PoWA 4: changes in powa-archivist! was originally published by Julien Rouhaud at rjuju's home on June 05, 2019.

PoWA 4 brings a remote mode, available in beta!

2019-05-17T11:04:17+00:00

PoWA 4 is available in beta.

New remote mode!

The new remote mode is the biggest feature introduced in PoWA 4, though there have been other improvements.

I’ll describe here what this new mode implies and what changed in the UI.

If you’re interested in more details about the rest of the changes in PoWA 4, I’ll soon publish other articles for that.

For those in a hurry, feel free to go directly to the v4 demo of PoWA, kindly hosted by Adrien Nayrat. No credentials needed, just click on “Login”.

Why is a remote mode important

This feature has probably been the most frequently requested since PoWA was first released, back in 2014. And it was requested for good reasons, as a local mode has some drawbacks.

First, let’s see what the architecture looked like up to PoWA 3. Assume an instance with 2 databases (db1 and db2), plus one database dedicated to PoWA. This dedicated database contains both the stat extensions required to get the live performance data and the storage for them.

A custom background worker is started by PoWA, which is responsible for regularly taking snapshots and storing them in the dedicated powa database. Then, using powa-web, you can see the activity of any of the local databases by querying the stored data on the dedicated database, and possibly by connecting to one of the other local databases when complete data are needed, for instance when using the index suggestion tool.

With version 4, the architecture with a remote setup changes quite a lot:

You can see that a dedicated powa database is still required, but only for the stat extensions. Data are now stored on a different instance. Then, the background worker is replaced by a new collector daemon, which reads the performance data from the remote servers and stores them on the dedicated repository server. Powa-web will then be able to display the activity by connecting to the repository server, and also to the remote server when complete data are needed.

In short, with the new remote mode introduced in this version 4:

a PostgreSQL restart is not required anymore to install the powa-archivist extension, as the background worker is not mandatory anymore
there is no overhead due to storing and querying data on the same PostgreSQL server as your production server (there are still some parts of the UI that require querying the original server, for instance when showing EXPLAIN plans, but that’s a negligible overhead)
it’s now possible to use PoWA on a hot-standby server

The UI will therefore now welcome you with an initial page to let you choose which of the servers stored on the configured database you want to work on:

The main reason it took so much time to bring a remote mode is that it adds quite some complexity, requiring a major rewrite of the whole PoWA stack. We also wanted to add more features first, such as the global index suggestion, with validation using hypopg, introduced with PoWA 3.

Changes in powa-web

The user interface is the component which probably has the most visible changes in this version 4. Here are the most important ones.

Remote mode compatibility

The biggest change is obviously the support for the new remote mode. As a consequence, the first page shown is now a server selector page, displaying all registered remote servers. After choosing the wanted remote server (or the local server if you don’t use the remote mode), all other pages will be similar to the ones that were available until PoWA 3, but displaying data for a specific remote server only, retrieving the data from the repository powa database of course, and with some new information that I’ll describe just after.

Note that as the data is now stored on a dedicated repository server when using the remote mode, most of the UI is usable without connecting to the currently selected remote server. However, powa-web still needs to connect to the remote server when the original data are needed (for instance, for index suggestion or when showing EXPLAIN plans). The same authentication considerations and possibilities as for the new powa-collector daemon (which will be described in a following article) apply here.

pg_track_settings support

When this extension is properly configured, a new timeline widget will appear, placed between each graph and its overview, displaying any kind of recorded change if any was detected in the currently selected time interval. On the per-database and per-query pages, this list will be filtered by the selected database.

The same timeline will be displayed on every graph of each page, to easily check if this change had any visible impact using the various graphs.

Note that details of the changes will be displayed on mouseover. You can also click on any event on the timeline to make the event stay displayed, and draw a vertical line on the underlying graph.

Here’s an example of such detected configuration change in action:

Please also note that you need at least version 2.0.0 of pg_track_settings, and that the extension has to be installed both on the remote servers and the repository server.

New graphs available

When pg_stat_kcache is set up, its information was previously only displayed on the per-query page. It’s now displayed on the per-server and per-database pages too, in two graphs:

in the Block Access graph, where the OS cache and disk read metrics will replace the read metric
in a new System Resources graph (which is also added in the per-query page), showing the metrics added in pg_stat_kcache 2.1

Here is an example of this new System Resources graph:

There was also a Wait Events graph (available when the pg_wait_sampling extension is set up) that was only available on the per-query page. This graph is now available on the per-server and per-database pages too.

Metrics documentation and documentation link

Some metrics displayed in the user interface were quite self-explanatory, while others could be a little bit obscure. Unfortunately, until now there wasn’t any documentation for any of the metrics. That’s now fixed, and all graphs have an information icon that displays a description of the metrics used in the graph on mouseover. Some graphs also include a link to the underlying stat extension in the PoWA documentation, for users who want to learn more about them.

Here’s an example:

And general bugfixes

Some longstanding issues were also reported:

the graph hover box showing metric values had a wrong vertical position
the time selection using the graph preview didn’t show a correct preview after applying the selection
errors on hypothetical index creation, or in certain cases their display, weren’t correctly handled in multiple pages
grid filters weren’t reapplied when time selection was changed

If you have ever been annoyed by any of these, you’ll be glad to know that they’re now all fixed!

Conclusion

This 4th version of PoWA represents a lot of time spent on development, documentation improvements and testing. We’re now quite satisfied with it, but we may have missed some bugs. If you’re interested in this project, I hope that you’ll consider testing the beta, and if needed don’t hesitate to report a bug!

PoWA 4 brings a remote mode, available in beta! was originally published by Julien Rouhaud at rjuju's home on May 17, 2019.

PoWA 4 apporte un mode remote, disponible en beta !

2019-05-17T11:04:17+00:00

PoWA 4 is available in beta.

New remote mode!

The new remote mode is the biggest feature added in PoWA 4, though there have been other improvements.

I’ll describe here what this new mode implies and what changed in the UI.

If you’re interested in more details about the rest of the changes in PoWA 4, I’ll soon publish other articles on the subject.

For those in a hurry, feel free to go directly to the v4 demo of PoWA, kindly hosted by Adrien Nayrat. No authentication is required, just click on “Login”.

Why is a remote mode important

This feature has probably been the most frequently requested since PoWA was first released, in 2014. And that’s for good reasons, as a local mode has some drawbacks.

First, let’s see what the architecture looked like with version 3 and earlier. Imagine an instance with 2 databases (db1 and db2), plus one database dedicated to PoWA. This dedicated database contains both the stat extensions required to get the current performance counters and the storage for them.

A background worker is started by PoWA, which is responsible for taking snapshots and storing them in the dedicated powa database at regular intervals. Then, using powa-web, you can see the activity of any of the local databases by querying the data stored in the dedicated database, and possibly by connecting to one of the other local databases when complete data are needed, for instance when the index suggestion tool is used.

With version 4, the architecture with a remote setup changes significantly:

You can see that a dedicated powa database is still required, but only for the stat extensions. Data are now stored on a different instance. Then, the background worker is replaced by a new collector daemon, which reads the performance metrics from the remote servers and stores them on the dedicated repository server. Powa-web will be able to display the data by connecting to the repository server, and also to the remote servers when complete data are needed.

In short, with the new remote mode added in this version 4:

a PostgreSQL restart is not required anymore to install powa-archivist
there is no overhead due to storing and querying data on the same PostgreSQL server as your production servers (there are still some parts of the UI that require querying the original server, for instance to show EXPLAIN plans, but that overhead is negligible)
it’s now possible to use PoWA on a hot-standby server

The UI will therefore now welcome you with an initial page to let you choose which of the servers stored on the target database you want to work on:

The main reason it took so much time to bring this remote mode is that it adds a lot of complexity, requiring a major rewrite of PoWA. We also wanted to add other features first, such as the global index suggestion, with validation using hypopg, introduced with PoWA 3.

Changes in powa-web

The user interface is the component with the most visible changes in this version 4. Here are the most important ones.

Remote mode compatibility

The biggest change is obviously the support for the new remote mode. As a consequence, the first page shown is now a server selection page, displaying all registered remote servers. After choosing the wanted remote server (or the local server if you don’t use the remote mode), all other pages will be similar to the ones available until version 3, but will display data for a specific remote server only, retrieving the data from the repository database of course, and with some new information described below.

Please note that since the data is now stored on a dedicated repository server when the remote mode is used, most of the UI is usable without connecting to the selected remote server. However, powa-web still needs to be able to connect to the remote server when the original data are needed (for instance, for index suggestion or to show EXPLAIN plans). The same authentication considerations and possibilities as for the new powa-collector daemon (which will be described in a following article) apply here.

pg_track_settings support

When this extension is properly configured, a new timeline widget will appear, placed between each graph and its overview, displaying the various kinds of recorded changes if any were detected in the selected time interval. On the per-database and per-query pages, the list will also be filtered by the selected database.

The same timeline will be displayed on every graph of each page, to easily check whether these changes had any visible impact using the various graphs.

Please note that details of the changes are displayed on mouseover. You can also click on any event on the timeline to keep it displayed, and draw a vertical line on the associated graph.

Here’s an example of such a configuration change in action:

Please also note that you need at least version 2.0.0 of pg_track_settings, and that the extension has to be installed both on the remote servers and on the repository server.

New graphs available

When pg_stat_kcache is set up, its information was previously only displayed on the per-query page. It’s now also displayed on the per-server and per-database pages, in two graphs:

in the Block Access graph, where the OS cache and disk read metrics will replace the read metric
in a new System Resources graph (which is also added in the per-query page), showing the metrics added in pg_stat_kcache 2.1

Here is an example of this new System Resources graph:

There was also a Wait Events graph (available when the pg_wait_sampling extension is set up) that was only available on the per-query page. This graph is now available on the per-server and per-database pages too.

Metrics documentation and documentation links

Some metrics displayed in the user interface are quite self-explanatory, but others can be a little bit obscure. Unfortunately, until now there wasn’t any documentation for the metrics. That’s now fixed, and all graphs have an information icon that displays a description of the metrics used in the graph on mouseover. Some graphs also include a link to the PoWA documentation of the underlying stat extension, for users who want to learn more about them.

Here’s an example:

And general bugfixes

Some longstanding issues were also reported:

the graph hover box showing metric values had a wrong vertical position
the time selection using the graph preview didn’t show a correct preview after applying the selection
errors on hypothetical index creation, or in certain cases their display, weren’t correctly handled in multiple pages
grid filters weren’t reapplied when the selected time interval was changed

If you have ever been annoyed by any of these, you’ll be glad to know that they’re now all fixed!

Conclusion

This 4th version of PoWA represents a lot of development time, many documentation improvements and a lot of testing. We’re now quite satisfied with it, but we may have missed some bugs. If you’re interested in this project, I hope that you’ll try the beta, and if needed don’t hesitate to report a bug!

PoWA 4 apporte un mode remote, disponible en beta ! was originally published by Julien Rouhaud at rjuju's home on May 17, 2019.

Nouveauté pg12: Statistiques sur les erreurs de checkums

2019-04-18T11:02:26+00:00

Data checksums

Added in PostgreSQL 9.3, data checksums can help to detect data corruption happening on your storage.

Checksums are enabled if the instance was initialized using initdb --data-checksums (which isn’t the default behavior), or if they were activated afterwards with the new pg_checksums tool, also added in PostgreSQL 12.

When checksums are enabled, they are written each time a data block is written to disk, and verified each time a block is read from disk (or from the operating system cache). If the verification fails, an error is reported in the logs. If the block was read by a backend, the associated query will obviously fail, but if the block was read by a BASE_BACKUP operation (such as pg_basebackup), the command will continue to run. While data checksums will only catch a subset of possible problems, they still have some value, especially if you don’t trust your storage.

Up to PostgreSQL 11, checksum validation errors could only be found by looking into the logs, which clearly isn’t convenient if you want to monitor such errors.

New counters available in pg_stat_database

To make checksum errors easier to monitor, and to help users react as soon as such a problem occurs, PostgreSQL 12 adds new counters in the pg_stat_database view:

commit 6b9e875f7286d8535bff7955e5aa3602e188e436
Author: Magnus Hagander <[email protected]>
Date:   Sat Mar 9 10:45:17 2019 -0800

    Track block level checksum failures in pg_stat_database

    This adds a column that counts how many checksum failures have occurred
    on files belonging to a specific database. Both checksum failures
    during normal backend processing and those created when a base backup
    detects a checksum failure are counted.

    Author: Magnus Hagander
    Reviewed by: Julien Rouhaud

commit 77bd49adba4711b4497e7e39a5ec3a9812cbd52a
Author: Magnus Hagander <[email protected]>
Date:   Fri Apr 12 14:04:50 2019 +0200

    Show shared object statistics in pg_stat_database

    This adds a row to the pg_stat_database view with datoid 0 and datname
    NULL for those objects that are not in a database. This was added
    particularly for checksums, but we were already tracking more satistics
    for these objects, just not returning it.

    Also add a checksum_last_failure column that holds the timestamptz of
    the last checksum failure that occurred in a database (or in a
    non-dataabase file), if any.

    Author: Julien Rouhaud <[email protected]>

commit 252b707bc41cc9bf6c55c18d8cb302a6176b7e48
Author: Magnus Hagander <[email protected]>
Date:   Wed Apr 17 13:51:48 2019 +0200

    Return NULL for checksum failures if checksums are not enabled

    Returning 0 could falsely indicate that there is no problem. NULL
    correctly indicates that there is no information about potential
    problems.

    Also return 0 as numbackends instead of NULL for shared objects (as no
    connection can be made to a shared object only).

    Author: Julien Rouhaud <[email protected]>
    Reviewed-by: Robert Treat <[email protected]>

Those counters will reflect checksum validation errors for both backend activity and BASE_BACKUP activity, per database.

rjuju=# \d pg_stat_database
                        View "pg_catalog.pg_stat_database"
        Column         |           Type           | Collation | Nullable | Default
-----------------------+--------------------------+-----------+----------+---------
 datid                 | oid                      |           |          |
 datname               | name                     |           |          |
 [...]
 checksum_failures     | bigint                   |           |          |
 checksum_last_failure | timestamp with time zone |           |          |
 [...]
 stats_reset           | timestamp with time zone |           |          |

The checksum_failures column will show a cumulative number of errors, and the checksum_last_failure column will show the timestamp of the last checksum validation failure on the database (NULL if no error ever happened).

To avoid any confusion (thanks to Robert Treat for pointing it out), those two columns will always return NULL if data checksums aren’t enabled, so that nobody can mistakenly believe that checksums are always successfully verified.

As a side effect, pg_stat_database will also now show the available statistics for shared objects (such as the pg_database table for instance), in a new row where datid is 0 and datname is NULL.

~~A dedicated check is also already planned in check_pgactivity!~~ A dedicated check is also already available in check_pgactivity!

Nouveauté pg12: Statistiques sur les erreurs de checkums was originally published by Julien Rouhaud at rjuju's home on April 18, 2019.

New in pg12: Statistics on checkums errors

2019-04-18T11:02:26+00:00

Data checksums

Added in PostgreSQL 9.3, data checksums can help to detect data corruption happening on the storage side.

Checksums are only enabled if the instance was set up using initdb --data-checksums (which isn’t the default behavior), or if activated afterwards with the new pg_checksums tool, also added in PostgreSQL 12.

When enabled, checksums are written each time a block is written to disk, and verified each time a block is read from disk (or from the operating system cache). If the checksum verification fails, an error is reported in the logs. If the block was read by a backend, the query will obviously fail, but if the block was read by a BASE_BACKUP operation (such as pg_basebackup), the command will continue its processing. While data checksums will only catch a subset of possible problems, they still have some value, especially if you don’t trust your storage reliability.

Up to PostgreSQL 11, any checksum validation error could only be found by looking into the logs, which clearly isn’t convenient if you want to monitor such errors.

New counters available in pg_stat_database

To make checksum errors easier to monitor, and help users to react as soon as such a problem occurs, PostgreSQL 12 adds new counters in the pg_stat_database view:

commit 6b9e875f7286d8535bff7955e5aa3602e188e436
Author: Magnus Hagander <[email protected]>
Date:   Sat Mar 9 10:45:17 2019 -0800

    Track block level checksum failures in pg_stat_database

    This adds a column that counts how many checksum failures have occurred
    on files belonging to a specific database. Both checksum failures
    during normal backend processing and those created when a base backup
    detects a checksum failure are counted.

    Author: Magnus Hagander
    Reviewed by: Julien Rouhaud

commit 77bd49adba4711b4497e7e39a5ec3a9812cbd52a
Author: Magnus Hagander <[email protected]>
Date:   Fri Apr 12 14:04:50 2019 +0200

    Show shared object statistics in pg_stat_database

    This adds a row to the pg_stat_database view with datoid 0 and datname
    NULL for those objects that are not in a database. This was added
    particularly for checksums, but we were already tracking more satistics
    for these objects, just not returning it.

    Also add a checksum_last_failure column that holds the timestamptz of
    the last checksum failure that occurred in a database (or in a
    non-dataabase file), if any.

    Author: Julien Rouhaud <[email protected]>

commit 252b707bc41cc9bf6c55c18d8cb302a6176b7e48
Author: Magnus Hagander <[email protected]>
Date:   Wed Apr 17 13:51:48 2019 +0200

    Return NULL for checksum failures if checksums are not enabled

    Returning 0 could falsely indicate that there is no problem. NULL
    correctly indicates that there is no information about potential
    problems.

    Also return 0 as numbackends instead of NULL for shared objects (as no
    connection can be made to a shared object only).

    Author: Julien Rouhaud <[email protected]>
    Reviewed-by: Robert Treat <[email protected]>

Those counters will reflect checksum validation errors for both backend activity and BASE_BACKUP activity, per database.

rjuju=# \d pg_stat_database
                        View "pg_catalog.pg_stat_database"
        Column         |           Type           | Collation | Nullable | Default
-----------------------+--------------------------+-----------+----------+---------
 datid                 | oid                      |           |          |
 datname               | name                     |           |          |
 [...]
 checksum_failures     | bigint                   |           |          |
 checksum_last_failure | timestamp with time zone |           |          |
 [...]
 stats_reset           | timestamp with time zone |           |          |

The checksum_failures column will show a cumulative number of errors, and the checksum_last_failure column will show the timestamp of the last checksum failure on the database (NULL if no error ever happened).

To avoid any confusion (thanks to Robert Treat for pointing it out), those two columns will always return NULL if data checksums aren’t enabled, so people won’t mistakenly think that data checksums are always successfully verified.

As a side effect, pg_stat_database will also now show available statistics for shared objects (such as the pg_database table for instance), in a new row with datid valued to 0, and a NULL datname. Those were always accumulated, but weren’t displayed in any system view until now.
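
For monitoring purposes, the three possible states of these columns have to be told apart. Here is a minimal Python sketch of how a check could interpret rows fetched from pg_stat_database; the checksum_status() helper and the sample rows are made up for the example, and only the column semantics come from the description above.

```python
def checksum_status(row):
    """Interpret the checksum_failures column of one pg_stat_database row."""
    failures = row["checksum_failures"]
    if failures is None:
        # NULL: data checksums aren't enabled, so nothing is known.
        return "checksums disabled"
    if failures > 0:
        return "corruption detected"
    return "ok"

# Made-up sample rows; the datid 0 / NULL datname row is the one for shared objects.
sample_rows = [
    {"datid": 0, "datname": None, "checksum_failures": 0},
    {"datid": 16384, "datname": "rjuju", "checksum_failures": 2},
]

for row in sample_rows:
    name = row["datname"] or "<shared objects>"
    print(name, "->", checksum_status(row))
```

Treating NULL as a separate state rather than as 0 is the whole point of the last commit quoted above.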

~~A dedicated check is also already planned in check_pgactivity!~~ A dedicated check is also already available in check_pgactivity!

New in pg12: Statistics on checkums errors was originally published by Julien Rouhaud at rjuju's home on April 18, 2019.

Minimiser le surcoût de stockage par ligne

2019-04-06T07:51:28+00:00

I regularly hear complaints about the amount of disk space wasted by PostgreSQL for each row it stores. I’ll try to show here some tricks to minimize this effect, in order to get more efficient storage.

What overhead?

If you don’t have tables with more than a few hundred million rows, it’s likely that this isn’t an issue for you.

For each row stored, postgres will keep some additional data for its own needs. This is documented here. The documentation says:

Field	Type	Length	Description
t_xmin	TransactionId	4 bytes	insert XID stamp
t_xmax	TransactionId	4 bytes	delete XID stamp
t_cid	CommandId	4 bytes	insert and/or delete CID stamp (overlays with t_xvac)
t_xvac	TransactionId	4 bytes	XID for VACUUM operation moving a row version
t_ctid	ItemPointerData	6 bytes	current TID of this or newer row version
t_infomask2	uint16	2 bytes	number of attributes, plus various flag bits
t_infomask	uint16	2 bytes	various flag bits
t_hoff	uint8	1 byte	offset to user data

That’s 23 bytes on most architectures (you have either t_cid or t_xvac).

You can see some of these fields through the hidden columns present in any table, by adding them to the SELECT part of a query, or by looking for negative attribute numbers in the pg_attribute catalog:

# \d test
     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |

# SELECT xmin, xmax, id FROM test LIMIT 1;
 xmin | xmax | id
------+------+----
 1361 |    0 |  1

# SELECT attname, attnum, atttypid::regtype, attlen
FROM pg_class c
JOIN pg_attribute a ON a.attrelid = c.oid
WHERE relname = 'test'
ORDER BY attnum;
 attname  | attnum | atttypid | attlen
----------+--------+----------+--------
 tableoid |     -7 | oid      |      4
 cmax     |     -6 | cid      |      4
 xmax     |     -5 | xid      |      4
 cmin     |     -4 | cid      |      4
 xmin     |     -3 | xid      |      4
 ctid     |     -1 | tid      |      6
 id       |      1 | integer  |      4

If you compare these fields with the previous table, you can see that not all of these columns are stored on disk. Obviously, PostgreSQL doesn’t store the table’s oid in each row. It’s added afterwards, while constructing a tuple.

If you want more technical details, you can look at htup_details.h, starting with the HeapTupleHeaderData struct.

How much does it cost?

Since this overhead is fixed, the larger the rows, the more negligible it becomes. If you only store a single integer column (4 bytes), each row will need:

23B + 4B = 27B

that is 85% overhead, which is pretty horrible.

On the other hand, if you store 5 integer, 3 bigint and 2 text columns (let's say around 80 bytes each on average), you get:

23B + 5*4B + 3*8B + 2*80B = 227B

That's “only” 10% overhead.
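The two percentages above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `overhead_ratio` is made up for the illustration, and the 23-byte header is the figure quoted in the text.

```python
HEADER_BYTES = 23  # fixed tuple header size quoted in the text


def overhead_ratio(user_bytes: int, header_bytes: int = HEADER_BYTES) -> float:
    """Share of the stored row that is taken by the tuple header."""
    return header_bytes / (header_bytes + user_bytes)


# A single 4-byte integer column.
narrow = overhead_ratio(4)
# 5 integer, 3 bigint and 2 text columns of about 80 bytes each.
wide = overhead_ratio(5 * 4 + 3 * 8 + 2 * 80)

print(f"{narrow:.0%} {wide:.0%}")  # prints "85% 10%"
```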

So, how to minimize this overhead?

The idea is to store the same data, but in fewer rows. How? By aggregating data in arrays. The more records you put in a single array, the lower the overhead. And if you aggregate enough data, you get fully transparent compression thanks to the TOAST mechanism.

Let's try with a single-column table containing 10 million rows:

# CREATE TABLE raw_1 (id integer);

# INSERT INTO raw_1 SELECT generate_series(1,10000000);

# CREATE INDEX ON raw_1 (id);

The user data should only need 10M * 4 bytes, that is about 38 MB, while this table weighs 348 MB. Inserting the data takes about 23 seconds.

NOTE: If you do the math, you'll find that the overhead is slightly more than 32 bytes per row, not 23 bytes. That's because each data block also has some overhead, and because of NULL handling and alignment constraints. If you want more information on this, I recommend looking at this presentation.

Now let's compare this with the aggregated version of the same data:
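As a rough check of that figure, here is a small sketch of the per-row arithmetic. It assumes the default 8 kB block size, a 24-byte page header, a 4-byte line pointer per row and 8-byte alignment (typical 64-bit builds), and it ignores the NULL bitmap, the free space map and the visibility map. The helper names are made up for the illustration.

```python
PAGE_SIZE = 8192    # default block size (assumption)
PAGE_HEADER = 24    # page header size (assumption)
LINE_POINTER = 4    # one item pointer per row in the page
TUPLE_HEADER = 23   # tuple header size quoted in the text
MAXALIGN = 8        # alignment on typical 64-bit builds (assumption)


def maxalign(size: int) -> int:
    """Round size up to the next multiple of MAXALIGN."""
    return (size + MAXALIGN - 1) // MAXALIGN * MAXALIGN


def rows_per_page(user_bytes: int) -> int:
    """Number of rows without NULLs that fit in one page."""
    tuple_size = maxalign(maxalign(TUPLE_HEADER) + user_bytes)
    return (PAGE_SIZE - PAGE_HEADER) // (tuple_size + LINE_POINTER)


def table_bytes(n_rows: int, user_bytes: int) -> int:
    """Size of the heap, counting whole pages only."""
    pages = -(-n_rows // rows_per_page(user_bytes))  # ceiling division
    return pages * PAGE_SIZE


size = table_bytes(10_000_000, 4)
overhead_per_row = size / 10_000_000 - 4
print(rows_per_page(4), round(size / 1024 / 1024), round(overhead_per_row, 2))
```

Under these assumptions a page holds 226 single-integer rows, which gives a heap of roughly 346 MB and a little over 32 bytes of overhead per row, in line with the size measured above.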

# CREATE TABLE agg_1 (id integer[]);

# INSERT INTO agg_1 SELECT array_agg(i)
FROM generate_series(1,10000000) i
GROUP BY i % 2000000;

# CREATE INDEX ON agg_1 (id);

This query inserts 5 elements per row. I ran the same test with 20, 100, 200 and 1000 elements per row. The results are as follows:

NOTE: The size for 1000 elements per row is slightly higher than for the previous value. That's because it's the only case where rows are big enough to be TOAST-ed, but not big enough to be compressed. So we see a bit of TOAST overhead here.

So far so good: we see quite good improvements in both size and INSERT time, even for the smallest arrays. Let's now see the impact on retrieving rows. I'll test retrieving all the rows, and a single row through an index scan (I used EXPLAIN ANALYZE for the tests, to minimize the time psql spends displaying the data):
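To see why grouping by i % 2000000 gives 5 elements per row, here is a scaled-down sketch of the same grouping in Python: 100 values grouped by i % 20. The `aggregate` helper is made up for the illustration and only mimics what array_agg() with GROUP BY does.

```python
from collections import defaultdict


def aggregate(values, n_groups):
    """Group values by (value % n_groups), like GROUP BY i % n_groups."""
    groups = defaultdict(list)
    for value in values:
        groups[value % n_groups].append(value)
    return groups


# 100 values spread over 20 groups: 100 / 20 = 5 elements per group.
groups = aggregate(range(1, 101), 20)
print(len(groups), groups[1])  # prints "20 [1, 21, 41, 61, 81]"
```

The number of rows is the modulus and the number of elements per row is the total divided by the modulus, so with 10 million values a modulus of 500000 gives 20 elements per row.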

# SELECT id FROM raw_1;

# CREATE INDEX ON raw_1 (id);

# SELECT * FROM raw_1 WHERE id = 500;

To properly index the array, we need a GIN index. To get the values of all the aggregated data, we need to call unnest() on the array, and to get a single record we need to be a little more creative:

# SELECT unnest(id) AS id FROM agg_1;

# CREATE INDEX ON agg_1 USING gin (id);

# WITH s(id) AS (
    SELECT unnest(id)
    FROM agg_1
    WHERE id && array[500]
)
SELECT id FROM s WHERE id = 500;

Here's the table comparing index creation time and index size, for each array size:

The GIN index is a little more than twice as large as the btree index, and if we add the table size to the index size, the total size is almost the same with or without aggregation. That's not a big issue since this example is very naive, and we'll see right after how to avoid relying on a GIN index to keep the total size low. Also, the index is much slower to create, which means that INSERT will also be slower.

Here's the table comparing the time to retrieve all the rows and a single row:

Retrieving all the rows is probably not an interesting example, but it's worth noting that as soon as the array contains enough elements it becomes faster than doing the same thing on the original table. We also see that retrieving a single element is much faster than with the btree index, thanks to the efficiency of GIN. It's not tested here, but since only btree indexes are natively sorted, if you need to get a lot of sorted records, using a GIN index will require an extra sort, which will be much slower than a simple btree index scan.

A more realistic example

Now that we've seen the basics, let's see how to go further: aggregating more than one column, and avoiding the extra disk space (and write slowdown) caused by a GIN index. For that, I'll present how PoWA stores its data.

For each data source collected, two tables are used: one for the historical, aggregated data, and one for the current data. These tables store data in a custom data type rather than in plain columns. Let's look at the tables related to the pg_stat_statements extension:

The custom type, basically all the counters present in pg_stat_statements plus the timestamp associated with the record:

powa=# \d powa_statements_history_record
   Composite type "public.powa_statements_history_record"
       Column        |           Type           | Modifiers
---------------------+--------------------------+-----------
 ts                  | timestamp with time zone |
 calls               | bigint                   |
 total_time          | double precision         |
 rows                | bigint                   |
 shared_blks_hit     | bigint                   |
 shared_blks_read    | bigint                   |
 shared_blks_dirtied | bigint                   |
 shared_blks_written | bigint                   |
 local_blks_hit      | bigint                   |
 local_blks_read     | bigint                   |
 local_blks_dirtied  | bigint                   |
 local_blks_written  | bigint                   |
 temp_blks_read      | bigint                   |
 temp_blks_written   | bigint                   |
 blk_read_time       | double precision         |
 blk_write_time      | double precision         |

The table for current data stores the pg_stat_statements unique identifier (queryid, dbid, userid), and a record of counters:

powa=# \d powa_statements_history_current
    Table "public.powa_statements_history_current"
 Column  |              Type              | Modifiers
---------+--------------------------------+-----------
 queryid | bigint                         | not null
 dbid    | oid                            | not null
 userid  | oid                            | not null
 record  | powa_statements_history_record | not null

The table for aggregated data contains the same unique identifier, an array of records and a few special fields:

powa=# \d powa_statements_history
            Table "public.powa_statements_history"
     Column     |               Type               | Modifiers
----------------+----------------------------------+-----------
 queryid        | bigint                           | not null
 dbid           | oid                              | not null
 userid         | oid                              | not null
 coalesce_range | tstzrange                        | not null
 records        | powa_statements_history_record[] | not null
 mins_in_range  | powa_statements_history_record   | not null
 maxs_in_range  | powa_statements_history_record   | not null
Indexes:
    "powa_statements_history_query_ts" gist (queryid, coalesce_range)

We also store the timestamp range (coalesce_range) containing all the counters aggregated in the row, and the minimum and maximum values of each counter in two dedicated records (mins_in_range and maxs_in_range). These extra fields don't consume too much space, and allow very efficient indexing and processing, based on the data access patterns of the related application.

This table is used to know how many resources a query consumed over a given time range. The GiST index won't be too big since it only indexes two small values for X aggregated counters, and it will find the rows matching a given query and time range very efficiently.

Then, computing the resources consumed can be done very efficiently, since the pg_stat_statements counters are monotonic (they never decrease). The algorithm could be:

if the row's time range is entirely contained in the requested time range, we only need to compute the delta of the record summary: maxs_in_range.counter - mins_in_range.counter
otherwise (that is, for only two rows per queryid) we unnest the array, filter out the records that aren't within the requested time range, keep the first and last values, and compute for each counter the maximum minus the minimum.
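The two cases above can be sketched in Python as follows. This is not PoWA's code: `HistoryRow` is a made-up stand-in for one row of the aggregated table, reduced to a single counter, with plain numbers as timestamps.

```python
from dataclasses import dataclass


@dataclass
class HistoryRow:
    """Hypothetical stand-in for one aggregated row, with a single counter."""
    range_start: float  # lower bound of coalesce_range
    range_end: float    # upper bound of coalesce_range
    records: list       # [(ts, counter), ...] sorted by ts
    min_in_range: int   # plays the role of mins_in_range.counter
    max_in_range: int   # plays the role of maxs_in_range.counter


def consumed(rows, lo, hi):
    """Sum the per-row counter deltas for the requested range [lo, hi]."""
    total = 0
    for row in rows:
        if row.range_end < lo or row.range_start > hi:
            continue  # no overlap with the requested range
        if lo <= row.range_start and row.range_end <= hi:
            # Row fully contained: the stored summary is enough.
            total += row.max_in_range - row.min_in_range
        else:
            # Boundary row: unnest, filter, keep first and last values.
            kept = [counter for ts, counter in row.records if lo <= ts <= hi]
            if kept:
                total += kept[-1] - kept[0]
    return total


rows = [
    HistoryRow(0, 20, [(0, 0), (10, 5), (20, 12)], 0, 12),
    HistoryRow(30, 50, [(30, 20), (40, 26), (50, 30)], 20, 30),
    HistoryRow(60, 80, [(60, 33), (70, 40), (80, 41)], 33, 41),
]
print(consumed(rows, 10, 70))  # prints 24 (7 + 10 + 7)
```

Only the first and last rows need to be unnested here; the middle row is answered from its summary alone. Note that the sketch sums per-row deltas exactly as the two steps describe, so it does not count the increase between the last record of one row and the first record of the next.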

NOTE: Actually, the PoWA interface always unnests all the records contained in the requested time range, since the interface is designed to show the evolution of these counters over a relatively short time range, but with great precision. Fortunately, unnesting the arrays is not that expensive, especially compared to the disk space saved.

And here's the size needed for the aggregated and non-aggregated values. For this, I let PoWA generate 12,331,366 records (configuring a snapshot every 5 seconds for a few hours, with the default aggregation of 100 records per row), and created a btree index on (queryid, ((record).ts)) to simulate the index present on the aggregated tables:

Pretty efficient, right?

Limitations

There are some limitations with aggregating records. If you do this, you can no longer enforce constraints such as foreign keys or unique constraints. It's therefore meant for non-relational data, such as counters or metadata.

Bonus

Using custom types lets you do some nice things, like defining custom operators. For instance, PoWA version 3.1.0 provides two operators for each of the custom types it defines:

the - operator, to get the difference between two records
the / operator, to get the difference per second

So you can very easily write queries like:

# SELECT (record - lag(record) over()).*
FROM powa_statements_history_current
WHERE queryid = 3589441560 AND dbid = 16384;
      intvl      | calls  |    total_time    |  rows  | ...
-----------------+--------+------------------+--------+ ...
 <NULL>          | <NULL> |           <NULL> | <NULL> | ...
 00:00:05.004611 |   5753 | 20.5570000000005 |   5753 | ...
 00:00:05.004569 |   1879 | 6.40500000000047 |   1879 | ...
 00:00:05.00477  |  14369 | 48.9060000000006 |  14369 | ...
 00:00:05.00418  |      0 |                0 |      0 | ...

# SELECT (record / lag(record) over()).*
FROM powa_statements_history_current
WHERE queryid = 3589441560 AND dbid = 16384;

  sec   | calls_per_sec | runtime_per_sec  | rows_per_sec | ...
--------+---------------+------------------+--------------+ ...
 <NULL> |        <NULL> |           <NULL> |       <NULL> | ...
      5 |        1150.6 |  4.1114000000001 |       1150.6 | ...
      5 |         375.8 | 1.28100000000009 |        375.8 | ...
      5 |        2873.8 | 9.78120000000011 |       2873.8 | ...

If you're interested in how to implement such operators, you can look at the PoWA implementation.
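To give an idea of what the two operators compute, here is a sketch in Python using operator overloading. This is not PoWA's code: the `Record` class is made up, it keeps only two counters, and dividing by the interval rounded to whole seconds is an assumption that happens to match the sample output above (5753 calls over about 5 seconds gives 1150.6 calls per second).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Record:
    """Hypothetical, reduced version of a history record."""
    ts: datetime
    calls: int
    total_time: float

    def __sub__(self, other):
        """Difference between two records: (interval, calls, total_time)."""
        return (self.ts - other.ts,
                self.calls - other.calls,
                self.total_time - other.total_time)

    def __truediv__(self, other):
        """Per-second difference: (sec, calls_per_sec, runtime_per_sec)."""
        intvl, calls, total_time = self - other
        sec = round(intvl.total_seconds())  # assumption: whole seconds
        if sec == 0:
            return (0, None, None)
        return (sec, calls / sec, total_time / sec)


start = datetime(2019, 4, 6, 12, 0, 0)
previous = Record(start, 1000, 10.0)
current = Record(start + timedelta(seconds=5, microseconds=4611), 6753, 30.557)

print((current - previous)[1])  # prints 5753
print((current / previous)[:2])  # prints (5, 1150.6)
```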

Conclusion

You now know the basics of working around the per-tuple storage overhead. Depending on your needs and the specifics of your data, you should be able to find a way to aggregate your data, possibly adding a few extra columns, to keep good performance and save disk space.

Minimiser le surcoût de stockage par ligne was originally published by Julien Rouhaud at rjuju's home on April 06, 2019.

Wait Events support for PoWA

2019-04-02T17:08:24+00:00

You can visualize Wait Events in PoWA 3.2.0 thanks to the pg_wait_sampling extension.

Wait Events & pg_wait_sampling

Wait events are a well-known and very useful feature in many relational database engines. They were added in PostgreSQL 9.6, a few versions ago now. Unlike most of the other statistics exposed by PostgreSQL, they are only a point-in-time view of the events that processes are waiting on, not cumulated counters. You can get this information from the pg_stat_activity view, for instance:

=# SELECT datid, pid, wait_event_type, wait_event, query FROM pg_stat_activity;
 datid  |  pid  | wait_event_type |     wait_event      |                                  query
--------+-------+-----------------+---------------------+-------------------------------------------------------------------------
 <NULL> | 13782 | Activity        | AutoVacuumMain      |
  16384 | 16615 | Lock            | relation            | SELECT * FROM t1;
  16384 | 16621 | Client          | ClientRead          | LOCK TABLE t1;
 847842 | 16763 | LWLock          | WALWriteLock        | END;
 847842 | 16764 | Lock            | transactionid       | UPDATE pgbench_branches SET bbalance = bbalance + 1229 WHERE bid = 1;
 847842 | 16766 | LWLock          | WALWriteLock        | END;
 847842 | 16767 | Lock            | transactionid       | UPDATE pgbench_tellers SET tbalance = tbalance + 3383 WHERE tid = 86;
 847842 | 16769 | Lock            | transactionid       | UPDATE pgbench_branches SET bbalance = bbalance + -3786 WHERE bid = 10;
[...]

In this example, we can see that the wait event for pid 16615 is a Lock on a relation. In other words, the query is blocked waiting for a heavyweight lock, while pid 16621, which clearly holds the lock, is idle, waiting for commands from the client. This information was already available in older versions, though in a different way. More interestingly, we can also see that the wait event for pid 16766 is an LWLock, that is a Lightweight Lock. Lightweight locks are internal, transient locks that previously couldn't be seen at the SQL level. In this example, the query is waiting for a WALWriteLock, a lightweight lock mainly used to control writes to the WAL buffers. A complete list of the available wait events can be found in the official documentation.

This information was sorely missing and is very useful for diagnosing the causes of slowdowns. However, only having a view of these wait events at the current moment is clearly not enough to get a good idea of what's happening on the server. Since most wait events are by nature very short-lived, what you need is to sample them at a high frequency. Trying to do this sampling with an external tool, even at a one-second interval, is usually not enough.

That's where the pg_wait_sampling extension brings a really brilliant solution. It's an extension written by Alexander Korotkov and Ildus Kurbangaliev. Once enabled (it has to be configured in shared_preload_libraries, so a restart of the instance is required), it samples the wait events in shared memory every 10 ms (by default), and also aggregates the counters per wait event type (wait_event_type), wait event and queryid (if pg_stat_statements is also enabled). For more details on how to configure and use this extension, you can refer to the extension's README.

Since all the work is done in memory by an extension written in C, it's very efficient. Moreover, the implementation uses very little locking, so the overhead of this extension should be almost negligible. I ran some benchmarks on my laptop (unfortunately I don't have a better machine to test on) with a read-only pgbench where all the data fit in PostgreSQL's cache (shared_buffers), with 8 and then 90 clients, to try to get the maximum possible overhead. The average over 3 runs was around 1% overhead, with fluctuations between runs of around 0.8%.
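The kind of profile such sampling builds up can be pictured as a counter keyed by (wait event type, wait event, queryid). The sketch below is only an illustration in Python with made-up samples; the extension itself does this in C, in shared memory, and `sample_once` is a hypothetical name.

```python
from collections import Counter


def sample_once(profile, backends):
    """Add one sample to the profile.

    backends is an iterable of (wait_event_type, wait_event, queryid)
    tuples; a backend that is not waiting has None as its type.
    """
    for event_type, event, queryid in backends:
        if event_type is not None:
            profile[(event_type, event, queryid)] += 1


profile = Counter()
# Three consecutive samples of two backends (made-up data).
samples = [
    [("Lock", "transactionid", 42), ("LWLock", "WALWriteLock", 7)],
    [("Lock", "transactionid", 42), (None, None, 7)],
    [("Lock", "transactionid", 42), ("LWLock", "WALWriteLock", 7)],
]
for backends in samples:
    sample_once(profile, backends)

print(profile[("Lock", "transactionid", 42)])  # prints 3
print(profile[("LWLock", "WALWriteLock", 7)])  # prints 2
```

With one sample every 10 ms, a count of 3 would mean that the query was seen waiting on that event for roughly 30 ms in total.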

What about PoWA?

So, thanks to this extension, we have at our disposal a cumulated and extremely precise view of the wait events. That's great, but as with all the other cumulated statistics in PostgreSQL, you need to sample these counters regularly if you want to be able to know what happened at a given point in the past, as the extension's README points out:

[…] Waits profile. It’s implemented as in-memory hash table where count of samples are accumulated per each process and each wait event (and each query with pg_stat_statements). This hash table can be reset by user request. Assuming there is a client who periodically dumps profile and resets it, user can have statistics of intensivity of wait events among time.

That's exactly PoWA's goal: saving statistic counters efficiently, and displaying them in a graphical interface.

PoWA 3.2 automatically detects whether the pg_wait_sampling extension is already present or whether you install it later, and starts collecting its data, giving you a really precise view of the wait events over time on your databases!

The data is stored in regular PoWA tables: powa_wait_sampling_history_current for the last 100 snapshots (the default value of powa.coalesce), while older values are aggregated in the powa_wait_sampling_history table, with a history going back up to the period defined by powa.retention. For instance, here's a simple query displaying the first 20 changes that occurred within the first 100 snapshots:

WITH s AS (
SELECT (record).ts, queryid, event_type, event,
(record).count - lag((record).count)
    OVER (PARTITION BY queryid, event_type, event ORDER BY (record).ts)
    AS events
FROM powa_wait_sampling_history_current w
JOIN pg_database d ON d.oid = w.dbid
WHERE d.datname = 'bench'
)
SELECT *
FROM s
WHERE events != 0
ORDER BY ts ASC, event DESC
LIMIT 20;
              ts               |       queryid        | event_type |     event      | events
-------------------------------+----------------------+------------+----------------+--------
 2018-07-09 10:44:08.037191+02 | -6531859117817823569 | LWLock     | pg_qualstats   |   1233
 2018-07-09 10:44:28.035212+02 |  8851222058009799098 | Lock       | tuple          |      4
 2018-07-09 10:44:28.035212+02 | -6860707137622661878 | Lock       | tuple          |    149
 2018-07-09 10:44:28.035212+02 |  8851222058009799098 | Lock       | transactionid  |    193
 2018-07-09 10:44:28.035212+02 | -6860707137622661878 | Lock       | transactionid  |   1143
 2018-07-09 10:44:28.035212+02 | -6531859117817823569 | LWLock     | pg_qualstats   |      1
 2018-07-09 10:44:28.035212+02 |  8851222058009799098 | LWLock     | lock_manager   |      2
 2018-07-09 10:44:28.035212+02 | -6860707137622661878 | LWLock     | lock_manager   |      3
 2018-07-09 10:44:28.035212+02 | -6860707137622661878 | LWLock     | buffer_content |      2
 2018-07-09 10:44:48.037205+02 |  8851222058009799098 | Lock       | tuple          |     14
 2018-07-09 10:44:48.037205+02 | -6860707137622661878 | Lock       | tuple          |    335
 2018-07-09 10:44:48.037205+02 | -6860707137622661878 | Lock       | transactionid  |   2604
 2018-07-09 10:44:48.037205+02 |  8851222058009799098 | Lock       | transactionid  |    384
 2018-07-09 10:44:48.037205+02 | -6860707137622661878 | LWLock     | lock_manager   |     13
 2018-07-09 10:44:48.037205+02 |  8851222058009799098 | LWLock     | lock_manager   |      4
 2018-07-09 10:44:48.037205+02 |  8221555873158496753 | IO         | DataFileExtend |      1
 2018-07-09 10:44:48.037205+02 | -6860707137622661878 | LWLock     | buffer_content |      4
 2018-07-09 10:45:08.032938+02 |  8851222058009799098 | Lock       | tuple          |      5
 2018-07-09 10:45:08.032938+02 | -6860707137622661878 | Lock       | tuple          |    312
 2018-07-09 10:45:08.032938+02 | -6860707137622661878 | Lock       | transactionid  |   2586
(20 rows)
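The lag() window function in the query above turns cumulated counters into per-snapshot changes. The same logic can be sketched in Python as follows, with made-up snapshot data and a hypothetical `deltas` helper; the key plays the role of (queryid, event_type, event).

```python
def deltas(snapshots):
    """Return (ts, key, change) for every non-zero change of a counter.

    snapshots is a list of (ts, key, cumulated_count) tuples. The first
    snapshot of each key has no previous value and is skipped, just like
    the NULL produced by lag() is filtered out by "events != 0".
    """
    previous = {}
    changes = []
    for ts, key, count in sorted(snapshots, key=lambda s: s[0]):
        if key in previous:
            change = count - previous[key]
            if change != 0:
                changes.append((ts, key, change))
        previous[key] = count
    return changes


snapshots = [
    (1, "a", 10), (1, "b", 3),
    (2, "a", 10), (2, "b", 7),
    (3, "a", 15), (3, "b", 7),
]
print(deltas(snapshots))  # prints [(2, 'b', 4), (3, 'a', 5)]
```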

NOTE: There is also a per-database version of these values, for more efficient processing at the database level, in the powa_wait_sampling_history_current_db and powa_wait_sampling_history_db tables.

And this data is visible in the powa-web interface. Here are some examples of the wait events as displayed by PoWA with a simple pgbench:

Wait events for the whole instance

Wait events for a database

Wait events for a single query

This feature is available since version 3.2 of PoWA. I hope to be able to display more views of this data in the future, including other graphs, since all the data is already available in the database. Also, if you're a python or javascript developer, contributions are always welcome!

Wait Events support for PoWA was originally published by Julien Rouhaud at rjuju's home on April 02, 2019.