<p>Julien Rouhaud, 2023-12-20 (<a href="https://rjuju.github.io/postgresql/2023/12/20/extract-sql-from-wal-part2">https://rjuju.github.io/postgresql/2023/12/20/extract-sql-from-wal-part2</a>)</p>
<p>In the <a href="/postgresql/2023/12/06/extract-sql-from-wal.html">previous article</a> of this series, we saw how to extract
WAL records related to the exact SQL commands we want, INSERTs on heap tables,
and what the structure of those records was. In this article we will focus on
the heap-specific information contained in those records and how to extract SQL
queries from them.</p>
<h3 id="insert-data">INSERT data</h3>
<p>At the end of the <a href="/postgresql/2023/12/06/extract-sql-from-wal.html">previous article</a>, we could locate the various
<code class="language-plaintext highlighter-rouge">xl_heap_insert</code> records from the WAL stream. From there, we extracted some
metadata about the file’s physical location (tablespace oid, database oid and
relation filenode, among other things) and the inserted data itself.</p>
<p>As a reminder, here’s an extract of the code responsible for generating the
WAL records for an INSERT, in the <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/heap/heapam.c">heap_insert()
function</a>,
focusing on the interesting data:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">heap_insert</span><span class="p">(</span><span class="n">Relation</span> <span class="n">relation</span><span class="p">,</span> <span class="n">HeapTuple</span> <span class="n">tup</span><span class="p">,</span> <span class="n">CommandId</span> <span class="n">cid</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">options</span><span class="p">,</span> <span class="n">BulkInsertState</span> <span class="n">bistate</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">[...]</span>
<span class="n">xl_heap_header</span> <span class="n">xlhdr</span><span class="p">;</span>
<span class="p">[...]</span>
<span class="n">xlhdr</span><span class="p">.</span><span class="n">t_infomask2</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-></span><span class="n">t_data</span><span class="o">-></span><span class="n">t_infomask2</span><span class="p">;</span>
<span class="n">xlhdr</span><span class="p">.</span><span class="n">t_infomask</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-></span><span class="n">t_data</span><span class="o">-></span><span class="n">t_infomask</span><span class="p">;</span>
<span class="n">xlhdr</span><span class="p">.</span><span class="n">t_hoff</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-></span><span class="n">t_data</span><span class="o">-></span><span class="n">t_hoff</span><span class="p">;</span>
<span class="p">[...]</span>
<span class="n">XLogRegisterBuffer</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">buffer</span><span class="p">,</span> <span class="n">REGBUF_STANDARD</span> <span class="o">|</span> <span class="n">bufflags</span><span class="p">);</span>
<span class="n">XLogRegisterBufData</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="o">&</span><span class="n">xlhdr</span><span class="p">,</span> <span class="n">SizeOfHeapHeader</span><span class="p">);</span>
<span class="cm">/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */</span>
<span class="n">XLogRegisterBufData</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span>
<span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="n">heaptup</span><span class="o">-></span><span class="n">t_data</span> <span class="o">+</span> <span class="n">SizeofHeapTupleHeader</span><span class="p">,</span>
<span class="n">heaptup</span><span class="o">-></span><span class="n">t_len</span> <span class="o">-</span> <span class="n">SizeofHeapTupleHeader</span><span class="p">);</span>
<span class="p">[...]</span>
</code></pre></div></div>
<p>Two pieces of data are registered for block 0: an <code class="language-plaintext highlighter-rouge">xl_heap_header</code>, which contains some metadata about
the tuple extracted from the <em>tuple header</em>, and the data part of the
<code class="language-plaintext highlighter-rouge">HeapTuple</code>. Let’s look at those in detail.</p>
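<p>To make this layout more concrete, here is a minimal standalone sketch (plain C, not using the postgres headers) of how the data registered for block 0 could be split back into those two parts. The function name <code class="language-plaintext highlighter-rouge">split_insert_block_data()</code> is hypothetical, and the code assumes the WAL was generated on a machine with the same byte order and struct layout as the one running it. Note that only the first 5 bytes of <code class="language-plaintext highlighter-rouge">xl_heap_header</code> are written, as <code class="language-plaintext highlighter-rouge">SizeOfHeapHeader</code> excludes the trailing padding:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same fields as postgres' xl_heap_header (src/include/access/heapam_xlog.h). */
typedef struct xl_heap_header
{
    uint16_t t_infomask2;
    uint16_t t_infomask;
    uint8_t  t_hoff;
} xl_heap_header;

/* Only these 5 bytes are written to the WAL, not sizeof(xl_heap_header). */
#define SizeOfHeapHeader (offsetof(xl_heap_header, t_hoff) + sizeof(uint8_t))

/*
 * Split the data registered for block 0 of an INSERT record into the
 * xl_heap_header and the remaining tuple bytes (null bitmap, padding,
 * oid and attribute data).  Returns 0 on success, -1 if data is too short.
 */
static int
split_insert_block_data(const char *data, size_t datalen,
                        xl_heap_header *hdr,
                        const char **tupdata, size_t *tuplen)
{
    assert(data != NULL && hdr != NULL);
    if (datalen < SizeOfHeapHeader)
        return -1;

    /* WAL data isn't aligned: copy the header rather than casting. */
    memset(hdr, 0, sizeof(*hdr));
    memcpy(hdr, data, SizeOfHeapHeader);

    *tupdata = data + SizeOfHeapHeader;
    *tuplen = datalen - SizeOfHeapHeader;
    return 0;
}
```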
<h3 id="page-layout">Page layout</h3>
<p>First of all, let’s quickly see how postgres stores tables and indexes on disk.
I will only cover the basics that are helpful for the rest of the
article. If you want to dig more into this topic, there are tons of resources
available. You can refer to <a href="https://github.com/postgres/postgres/blob/master/src/include/storage/bufpage.h">this entry point in the
code</a>,
and I otherwise recommend looking at <a href="https://www.interdb.jp/pg/pgsql01.html#_1.3.">the section about it in the “The Internals of
PostgreSQL” website</a>.</p>
<p>A good general introduction is <a href="https://www.postgresql.org/docs/current/storage-page-layout.html">the
documentation</a>,
which comes with a diagram of the layout that I include here:</p>
<p><a href="/images/page_layout.png"><img src="/images/page_layout.png" alt="Physical page layout, from the official postgres
documentation" /></a></p>
<p>Every heap tuple and index entry that postgres stores on disk lives in a
<code class="language-plaintext highlighter-rouge">Page</code>, 8kB by default. Each page starts with a header that
contains some metadata about the page and ends with an optional “special area”,
which can contain additional information specific to the component of postgres
that will use this page.</p>
<p>In between is the actual data. The beginning of the data part is an array of
<code class="language-plaintext highlighter-rouge">ItemId</code> (also called line pointers), in ascending order, and the end of the data part holds the items
themselves (the tuples, in the case of heap table pages), which are added from
the end of the page towards its beginning. Unless the page is totally full, there
will be an empty space between the last <code class="language-plaintext highlighter-rouge">ItemId</code> and the first item, delimited by the
<code class="language-plaintext highlighter-rouge">pd_lower</code> and <code class="language-plaintext highlighter-rouge">pd_upper</code> offsets stored in the page header.</p>
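<p>As an illustration of the <code class="language-plaintext highlighter-rouge">pd_lower</code> / <code class="language-plaintext highlighter-rouge">pd_upper</code> arithmetic, here is a small sketch that reads both offsets from a raw page and derives the number of <code class="language-plaintext highlighter-rouge">ItemId</code> and the size of the hole. It assumes the default 8kB page size, native byte order and the current <code class="language-plaintext highlighter-rouge">PageHeaderData</code> layout, where the 24-byte header starts with an 8-byte <code class="language-plaintext highlighter-rouge">pd_lsn</code> followed by <code class="language-plaintext highlighter-rouge">pd_checksum</code> and <code class="language-plaintext highlighter-rouge">pd_flags</code>, which puts <code class="language-plaintext highlighter-rouge">pd_lower</code> and <code class="language-plaintext highlighter-rouge">pd_upper</code> at byte offsets 12 and 14. The function names are hypothetical:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ               8192 /* default page size */
#define SizeOfPageHeaderData 24   /* size of PageHeaderData up to pd_linp */
#define SizeOfItemId         4    /* sizeof(ItemIdData) */
#define PD_LOWER_OFFSET      12   /* after pd_lsn (8), pd_checksum (2), pd_flags (2) */
#define PD_UPPER_OFFSET      14

/* Read a native-endian uint16 at the given byte offset of the page. */
static uint16_t
page_read_u16(const char *page, size_t off)
{
    uint16_t v;

    memcpy(&v, page + off, sizeof(v));
    return v;
}

/* Number of ItemId entries, computed like postgres' PageGetMaxOffsetNumber(). */
static int
page_item_count(const char *page)
{
    uint16_t pd_lower = page_read_u16(page, PD_LOWER_OFFSET);

    if (pd_lower <= SizeOfPageHeaderData)
        return 0;
    return (pd_lower - SizeOfPageHeaderData) / SizeOfItemId;
}

/* Size of the hole between the last ItemId (pd_lower) and the first item (pd_upper). */
static int
page_hole_size(const char *page)
{
    uint16_t pd_lower = page_read_u16(page, PD_LOWER_OFFSET);
    uint16_t pd_upper = page_read_u16(page, PD_UPPER_OFFSET);

    assert(pd_upper <= BLCKSZ);
    return pd_upper > pd_lower ? pd_upper - pd_lower : 0;
}
```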
<p>Here’s the <code class="language-plaintext highlighter-rouge">ItemId</code> definition:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">ItemIdData</span>
<span class="p">{</span>
<span class="kt">unsigned</span> <span class="n">lp_off</span><span class="o">:</span><span class="mi">15</span><span class="p">,</span> <span class="cm">/* offset to tuple (from start of page) */</span>
<span class="nl">lp_flags:</span><span class="mi">2</span><span class="p">,</span> <span class="cm">/* state of line pointer, see below */</span>
<span class="nl">lp_len:</span><span class="mi">15</span><span class="p">;</span> <span class="cm">/* byte length of tuple */</span>
<span class="p">}</span> <span class="n">ItemIdData</span><span class="p">;</span>
</code></pre></div></div>
<p>As you can see, it holds the offset of the item in the page, its state
(<code class="language-plaintext highlighter-rouge">lp_flags</code>) and the length of the item.</p>
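<p>Building on that, here is a sketch of how the n-th item of a raw page could be located using the same struct declaration (the function name is hypothetical). Bit-field layout is implementation-defined in C, so this assumes the code is compiled with the same compiler and ABI as the postgres server that wrote the page, and that line pointer numbers are 1-based as in postgres:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same declaration as postgres' ItemIdData (src/include/storage/itemid.h). */
typedef struct ItemIdData
{
    unsigned lp_off:15,   /* offset to tuple (from start of page) */
             lp_flags:2,  /* state of line pointer */
             lp_len:15;   /* byte length of tuple */
} ItemIdData;

#define LP_UNUSED            0  /* unused (should always have lp_len == 0) */
#define LP_NORMAL            1  /* used (should always have lp_len > 0) */
#define SizeOfPageHeaderData 24 /* the ItemId array starts right after it */

/*
 * Return the item referenced by the 1-based line pointer number "offnum"
 * and store its length in *len, or return NULL if that line pointer isn't
 * LP_NORMAL (unused, redirected or dead).
 */
static const char *
page_get_item(const char *page, int offnum, unsigned *len)
{
    ItemIdData itemid;

    assert(offnum >= 1);
    memcpy(&itemid,
           page + SizeOfPageHeaderData + (size_t) (offnum - 1) * sizeof(ItemIdData),
           sizeof(itemid));
    if (itemid.lp_flags != LP_NORMAL)
        return NULL;
    *len = itemid.lp_len;
    return page + itemid.lp_off;
}
```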
<h3 id="heaptuple">HeapTuple</h3>
<p>The largest part stored in the record is the tuple itself. As the historic and
default access method used to store tuples is called <code class="language-plaintext highlighter-rouge">heap</code>, the struct that holds
the tuple is called <code class="language-plaintext highlighter-rouge">HeapTuple</code>. Any custom <strong>Table Access Method</strong> can use a
different struct to store what it needs for its specific implementation, but it
would then also have to generate its own WAL records, typically with a custom resource manager.</p>
<p>Here’s the <a href="https://github.com/postgres/postgres/blob/master/src/include/access/htup.h">definition of a
HeapTuple</a>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">HeapTupleData</span>
<span class="p">{</span>
<span class="n">uint32</span> <span class="n">t_len</span><span class="p">;</span> <span class="cm">/* length of *t_data */</span>
<span class="n">ItemPointerData</span> <span class="n">t_self</span><span class="p">;</span> <span class="cm">/* SelfItemPointer */</span>
<span class="n">Oid</span> <span class="n">t_tableOid</span><span class="p">;</span> <span class="cm">/* table the tuple came from */</span>
<span class="cp">#define FIELDNO_HEAPTUPLEDATA_DATA 3
</span> <span class="n">HeapTupleHeader</span> <span class="n">t_data</span><span class="p">;</span> <span class="cm">/* -> tuple header and data */</span>
<span class="p">}</span> <span class="n">HeapTupleData</span><span class="p">;</span>
</code></pre></div></div>
<p>It starts with some metadata, which isn’t stored on disk but generated or
retrieved from somewhere else when the tuple is read from disk. Indeed, there
wouldn’t be much value in storing the relation’s oid in every tuple on disk. The
tuple length is a necessary piece of information, but it isn’t stored in the
tuple either: it comes from the <code class="language-plaintext highlighter-rouge">lp_len</code> field of the associated <code class="language-plaintext highlighter-rouge">ItemId</code> that we saw just
before.</p>
<p>Then comes <code class="language-plaintext highlighter-rouge">t_data</code>, a pointer to the “real” data, which is what is stored in the <strong>item</strong>
part of the <code class="language-plaintext highlighter-rouge">Page</code>. It’s again split in 2 parts: the tuple header, which I
will cover a bit later, and the tuple data.</p>
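<p>Since the tuple header ends with the null bitmap and some padding, the tuple data starts at offset <code class="language-plaintext highlighter-rouge">t_hoff</code> of the item. Here is a small sketch of that split. It assumes the current <code class="language-plaintext highlighter-rouge">HeapTupleHeaderData</code> layout, where the 23-byte fixed part is made of a 12-byte <code class="language-plaintext highlighter-rouge">t_choice</code> union, a 6-byte <code class="language-plaintext highlighter-rouge">t_ctid</code>, the two 2-byte infomasks and the 1-byte <code class="language-plaintext highlighter-rouge">t_hoff</code> (hence at byte offset 22); the struct and function names are hypothetical:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SizeofHeapTupleHeader 23  /* fixed part, up to t_bits */
#define T_HOFF_OFFSET         22  /* t_choice (12) + t_ctid (6) + 2 infomasks (4) */

typedef struct HeapItemParts
{
    const char *bits;     /* null bitmap, if any, right after the fixed part */
    const char *data;     /* first attribute, at offset t_hoff */
    size_t      datalen;  /* bytes of attribute data */
} HeapItemParts;

/*
 * Split an on-disk heap item (lp_len bytes starting at lp_off) into its
 * header and its data area, using t_hoff.  Returns -1 on a bogus t_hoff.
 */
static int
split_heap_item(const char *item, size_t len, HeapItemParts *out)
{
    uint8_t hoff;

    assert(item != NULL && out != NULL);
    if (len < SizeofHeapTupleHeader)
        return -1;
    hoff = (uint8_t) item[T_HOFF_OFFSET];
    if (hoff < SizeofHeapTupleHeader || hoff > len)
        return -1;
    out->bits = item + SizeofHeapTupleHeader;
    out->data = item + hoff;
    out->datalen = len - hoff;
    return 0;
}
```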
<p>The tuple data is the physical on-disk representation of the tuple. It was
designed to be as space efficient as possible, so accessing individual fields
is a bit complex and CPU intensive. Let’s look at the most important parts of this
design. First, here is the end of the tuple header, <a href="https://github.com/postgres/postgres/blob/master/src/include/access/htup_details.h">as defined in
htup_details.h</a>, right before the tuple data:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">HeapTupleHeaderData</span>
<span class="p">{</span>
<span class="p">[...]</span>
<span class="cm">/* ^ - 23 bytes - ^ */</span>
<span class="cp">#define FIELDNO_HEAPTUPLEHEADERDATA_BITS 5
</span> <span class="n">bits8</span> <span class="n">t_bits</span><span class="p">[</span><span class="n">FLEXIBLE_ARRAY_MEMBER</span><span class="p">];</span> <span class="cm">/* bitmap of NULLs */</span>
<span class="cm">/* MORE DATA FOLLOWS AT END OF STRUCT */</span>
<span class="p">};</span>
</code></pre></div></div>
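<p>You may have heard that in postgres, NULL attributes don’t take any space in
the tuple data. Indeed, if an attribute is NULL there won’t be anything for it in the “data
section”; instead, the bit for its attribute number in the <code class="language-plaintext highlighter-rouge">t_bits</code> bitmap will be
cleared (a set bit means that the attribute is not NULL). The bitmap itself is only
present if the tuple contains at least one NULL attribute.</p>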
<p>Then, a lot of data types have a variable size (which is internally referred to as
<code class="language-plaintext highlighter-rouge">varlena</code>). So, to save space, postgres doesn’t store the offset of each
attribute in the <code class="language-plaintext highlighter-rouge">HeapTuple</code> and just stores them next to each other
(according to the data type alignment rules) in a big chunk of memory.</p>
<p>This is indeed efficient, but unless your tuple only contains non-null
fixed-size attributes, the only way to access a specific attribute is to read
all the previous ones, skipping the NULL attributes and computing the position of the
next one from the length of each variable-length value. This process is called
<strong>tuple deforming</strong>: it takes a tuple as input and outputs two arrays, one with
the datums and one with the null flags, both indexed by the attribute
number (0 based). The opposite operation (building a tuple from an array of datums and an
array of null flags) is unsurprisingly called <strong>tuple forming</strong>. If you
want to read a bit more about those operations, the underlying functions are
called <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/common/heaptuple.c">heap_deform_tuple() and
heap_form_tuple()</a>.</p>
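<p>To illustrate the principle, here is a heavily simplified deforming loop. It is only a sketch under strong assumptions: all the attributes are fixed-width integers of 1, 2, 4 or 8 bytes (no varlena, no TOAST, no by-reference types), values are read in native byte order, and attributes missing from an older, shorter tuple are simply reported as NULL, whereas postgres would return the column’s stored “missing” default if there is one. The <code class="language-plaintext highlighter-rouge">AttDesc</code> struct and the function name are hypothetical:</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, minimal attribute descriptor: fixed-width types only. */
typedef struct AttDesc
{
    int attlen;    /* 1, 2, 4 or 8 bytes */
    int attalign;  /* alignment requirement in bytes */
} AttDesc;

#define ALIGN_UP(off, a) (((off) + ((a) - 1)) & ~((size_t) (a) - 1))

/*
 * Simplified heap_deform_tuple(): walk the data area attribute by
 * attribute, skip NULLs (cleared bit), align each value, then read it.
 * "tuple_natts" is the number of attributes stored in the tuple (from
 * t_infomask2), "natts" the number of columns in the descriptor.
 */
static void
deform_fixed(const char *data, const uint8_t *bits, int tuple_natts,
             const AttDesc *atts, int natts,
             int64_t *values, bool *isnull)
{
    size_t off = 0;

    for (int i = 0; i < natts; i++)
    {
        if (i >= tuple_natts ||
            (bits != NULL && !(bits[i >> 3] & (1 << (i & 0x07)))))
        {
            values[i] = 0;
            isnull[i] = true;
            continue;
        }
        off = ALIGN_UP(off, (size_t) atts[i].attalign);
        switch (atts[i].attlen)
        {
            case 1: { int8_t v;  memcpy(&v, data + off, 1); values[i] = v; break; }
            case 2: { int16_t v; memcpy(&v, data + off, 2); values[i] = v; break; }
            case 4: { int32_t v; memcpy(&v, data + off, 4); values[i] = v; break; }
            default: { int64_t v; memcpy(&v, data + off, 8); values[i] = v; break; }
        }
        isnull[i] = false;
        off += (size_t) atts[i].attlen;
    }
}
```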
<p>Note that tuple deforming is one of the operations that can be
<a href="https://www.postgresql.org/docs/current/jit.html">JIT-compiled</a>, and there are some
other optimisations for it. Postgres supports “partial”
deforming and will avoid deforming the full tuple when possible, stopping at
the last attribute that the query references, and will cache the offset of
the latest attribute that has been deformed. But that can only help to some
extent, so it’s always a good idea to mark columns as NOT NULL when possible,
put all the columns with fixed-length data types at the beginning of the table
(with the NOT NULL ones first), ideally grouped by alignment size to avoid wasting a
few bytes of padding, and put the most frequently accessed columns of variable-length
data types next. All of that will help speed up tuple deforming as much as
possible.</p>
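<p>The impact of column ordering on alignment padding can be estimated with the same “align, then append” rule. This sketch only considers non-NULL fixed-width columns and ignores both the tuple header and the final padding of the whole tuple; the <code class="language-plaintext highlighter-rouge">Col</code> struct and the function name are hypothetical:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical column description: length and alignment, in bytes. */
typedef struct Col
{
    size_t len;
    size_t align;
} Col;

/*
 * Size of the data area for a row where every column is non-NULL and
 * fixed-width: align the current offset, then append the value.
 */
static size_t
data_area_size(const Col *cols, int ncols)
{
    size_t off = 0;

    for (int i = 0; i < ncols; i++)
    {
        off = (off + cols[i].align - 1) & ~(cols[i].align - 1);
        off += cols[i].len;
    }
    return off;
}
```

<p>With those assumptions, a (smallint, bigint, smallint, bigint) row needs 32 bytes of data, while the same columns ordered as (bigint, bigint, smallint, smallint) only need 20.</p>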
<h4 id="tuple-header">Tuple header</h4>
<p>The first part of the stored data is an <code class="language-plaintext highlighter-rouge">xl_heap_header</code> struct. It’s just a
shorter version of the real tuple header that only contains some of its fields, the
rest of the header being either available elsewhere in the WAL record or simply not
needed. Doing it this way saves a few bytes in the WAL for each insert, which is
always a good thing. Its definition is:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">xl_heap_header</span>
<span class="p">{</span>
<span class="n">uint16</span> <span class="n">t_infomask2</span><span class="p">;</span>
<span class="n">uint16</span> <span class="n">t_infomask</span><span class="p">;</span>
<span class="n">uint8</span> <span class="n">t_hoff</span><span class="p">;</span>
<span class="p">}</span> <span class="n">xl_heap_header</span><span class="p">;</span>
</code></pre></div></div>
<p><em>t_infomask</em> and <em>t_infomask2</em> are two bitmaps that contain information about
the tuple. You may have heard about <a href="https://wiki.postgresql.org/wiki/Hint_Bits">hint
bits</a>: those two fields contain
the tuple-level hint bits, along with other flags.</p>
<p>Let’s look at their details in
<a href="https://github.com/postgres/postgres/blob/master/src/include/access/htup_details.h">htup_details.h</a>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">HeapTupleHeaderData</span>
<span class="p">{</span>
<span class="p">[...]</span>
<span class="cm">/* Fields below here must match MinimalTupleData! */</span>
<span class="cp">#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK2 2
</span> <span class="n">uint16</span> <span class="n">t_infomask2</span><span class="p">;</span> <span class="cm">/* number of attributes + various flags */</span>
<span class="cp">#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK 3
</span> <span class="n">uint16</span> <span class="n">t_infomask</span><span class="p">;</span> <span class="cm">/* various flag bits, see below */</span>
<span class="cp">#define FIELDNO_HEAPTUPLEHEADERDATA_HOFF 4
</span> <span class="n">uint8</span> <span class="n">t_hoff</span><span class="p">;</span> <span class="cm">/* sizeof header incl. bitmap, padding */</span>
<span class="cm">/* ^ - 23 bytes - ^ */</span>
<span class="p">[...]</span>
<span class="p">};</span>
<span class="cm">/*
 * information stored in t_infomask2:
 */</span>
<span class="cp">#define HEAP_NATTS_MASK 0x07FF </span><span class="cm">/* 11 bits for number of attributes */</span><span class="cp">
</span><span class="cm">/* bits 0x1800 are available */</span>
<span class="cp">#define HEAP_KEYS_UPDATED 0x2000 </span><span class="cm">/* tuple was updated and key cols
* modified, or tuple deleted */</span><span class="cp">
#define HEAP_HOT_UPDATED 0x4000 </span><span class="cm">/* tuple was HOT-updated */</span><span class="cp">
#define HEAP_ONLY_TUPLE 0x8000 </span><span class="cm">/* this is heap-only tuple */</span><span class="cp">
</span>
<span class="cp">#define HEAP2_XACT_MASK 0xE000 </span><span class="cm">/* visibility-related bits */</span><span class="cp">
</span><span class="p">[...]</span>
<span class="cm">/*
 * information stored in t_infomask:
 */</span>
<span class="cp">#define HEAP_HASNULL 0x0001 </span><span class="cm">/* has null attribute(s) */</span><span class="cp">
#define HEAP_HASVARWIDTH 0x0002 </span><span class="cm">/* has variable-width attribute(s) */</span><span class="cp">
</span><span class="p">[...]</span>
<span class="cp">#define HEAP_XMIN_COMMITTED 0x0100 </span><span class="cm">/* t_xmin committed */</span><span class="cp">
#define HEAP_XMIN_INVALID 0x0200 </span><span class="cm">/* t_xmin invalid/aborted */</span><span class="cp">
#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED 0x0400 </span><span class="cm">/* t_xmax committed */</span><span class="cp">
#define HEAP_XMAX_INVALID 0x0800 </span><span class="cm">/* t_xmax invalid/aborted */</span><span class="cp">
</span><span class="p">[...]</span>
</code></pre></div></div>
<p>We can see a few bits useful for <strong>tuple deforming</strong>. For instance, we
see that 11 bits of <em>t_infomask2</em> are used to store the actual number of
attributes stored in this tuple. Adding a new column to a table doesn’t always
require a full table rewrite, and in that case those bits are critical to know
when to stop looking for additional attributes when accessing tuples stored
before the column was added. There’s also information on whether the tuple
contains any NULL or variable-length attribute. The remaining bits
make clever use of the available space to handle various SQL operations,
MVCC rules, HOT updates and other low level optimisations.</p>
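<p>Reading those flags is a matter of simple masking. Here is a sketch using the constant values shown above; the attribute count extraction mirrors postgres’ <code class="language-plaintext highlighter-rouge">HeapTupleHeaderGetNatts()</code>, and the function names are hypothetical:</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from postgres' htup_details.h. */
#define HEAP_NATTS_MASK     0x07FF
#define HEAP_HASNULL        0x0001
#define HEAP_HASVARWIDTH    0x0002
#define HEAP_XMIN_COMMITTED 0x0100
#define HEAP_XMIN_INVALID   0x0200
#define HEAP_XMIN_FROZEN    (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)

/* Number of attributes physically stored in the tuple. */
static int
tuple_natts(uint16_t infomask2)
{
    return infomask2 & HEAP_NATTS_MASK;
}

/* Is there a null bitmap (t_bits) to look at? */
static bool
tuple_has_nulls(uint16_t infomask)
{
    return (infomask & HEAP_HASNULL) != 0;
}

/* Does the tuple contain any variable-width attribute? */
static bool
tuple_has_varwidth(uint16_t infomask)
{
    return (infomask & HEAP_HASVARWIDTH) != 0;
}

/* Both xmin bits set together mean "frozen", not "committed and invalid". */
static bool
tuple_xmin_frozen(uint16_t infomask)
{
    return (infomask & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN;
}
```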
<h3 id="tuple-descriptors">Tuple descriptors</h3>
<p>Now that we covered some internals of the <code class="language-plaintext highlighter-rouge">HeapTuple</code>, it seems much easier to
reach our goal: transform the INSERT WAL records into plain SQL statements. We
know that we just have to <em>deform</em> the tuples to retrieve the values and the
NULL attributes, and generating the SQL statements around them isn’t hard. But here
comes the second reason why we need a proper data directory to do so, and why
the lack of DDL is important.</p>
<p>As you probably guessed by now, one critical piece of information needed for
the <em>tuple deforming</em> operation is the table structure declaration. Indeed,
the <code class="language-plaintext highlighter-rouge">HeapTuple</code> is just a big chunk of memory, and without the list of columns,
their data types and the details of those types, it’s impossible to interpret it. If your
model doesn’t change too much, it’s probably possible to do without it and instead
generate some kind of mapping manually based on what you know about the history
of the instance. Be careful if you go this way: any discrepancy between the
original and generated data types can lead to bogus output in the best case, or
to a crash of your whole instance in the worst case. But in my case I had the guarantee that no DDL
had happened since the incident, and I had the other data directory available, so I could
just rely on it.</p>
<p>Postgres handles the table structure declaration using another struct, called
<code class="language-plaintext highlighter-rouge">TupleDesc</code>, for <em>tuple descriptor</em>. Its definition is:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">TupleDescData</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">natts</span><span class="p">;</span> <span class="cm">/* number of attributes in the tuple */</span>
<span class="n">Oid</span> <span class="n">tdtypeid</span><span class="p">;</span> <span class="cm">/* composite type ID for tuple type */</span>
<span class="n">int32</span> <span class="n">tdtypmod</span><span class="p">;</span> <span class="cm">/* typmod for tuple type */</span>
<span class="kt">int</span> <span class="n">tdrefcount</span><span class="p">;</span><span class="cm">/* reference count, or -1 if not counting */</span>
<span class="n">TupleConstr</span> <span class="o">*</span><span class="n">constr</span><span class="p">;</span> <span class="cm">/* constraints, or NULL if none */</span>
<span class="cm">/* attrs[N] is the description of Attribute Number N+1 */</span>
<span class="n">FormData_pg_attribute</span> <span class="n">attrs</span><span class="p">[</span><span class="n">FLEXIBLE_ARRAY_MEMBER</span><span class="p">];</span>
<span class="p">}</span> <span class="n">TupleDescData</span><span class="p">;</span>
</code></pre></div></div>
<p>In our case the most interesting members are the number of attributes (<code class="language-plaintext highlighter-rouge">natts</code>)
and the array of <code class="language-plaintext highlighter-rouge">pg_attribute</code> records (<code class="language-plaintext highlighter-rouge">attrs</code>). Those are also useful for
the SQL generation part, as we can retrieve the columns from it. Note also
that postgres will generate a <code class="language-plaintext highlighter-rouge">TupleDesc</code> automatically when you internally
open a relation.</p>
<p>Let’s recapitulate. We have the record data, and its filename contains the
physical file location information that we can use to retrieve the actual
relation. We know how to get the tuple descriptor for this relation, and we can
use it to deform the tuple and get the values from it. We have <em>almost</em>
everything we need to generate the SQL queries.</p>
<p>The only remaining detail is that the values we get from the tuple deforming
operation are in their physical representation, and we need to emit their
textual representation. Again, that’s not a problem as each data type has a
dedicated function for that, called <strong>type output function</strong>, available in
<code class="language-plaintext highlighter-rouge">pg_type.typoutput</code>.</p>
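<p>Conceptually, emitting a value boils down to looking up the output function of the column’s type, calling it, and quoting the result for use in an SQL string. The sketch below mimics this with a hard-coded lookup for two built-in types (<code class="language-plaintext highlighter-rouge">bool</code>, oid 16, and <code class="language-plaintext highlighter-rouge">int4</code>, oid 23) instead of reading <code class="language-plaintext highlighter-rouge">pg_type.typoutput</code>, and with values passed as plain pointers instead of <code class="language-plaintext highlighter-rouge">Datum</code>; those simplifications and the function names are assumptions. The quoting follows the rules of postgres’ <code class="language-plaintext highlighter-rouge">quote_literal_cstr()</code>:</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Built-in type oids, as found in pg_type. */
#define BOOLOID 16
#define INT4OID 23

/* Hypothetical simplified output function: takes a pointer to the binary value. */
typedef char *(*OutputFn) (const void *value);

static char *
copy_cstr(const char *s)
{
    size_t len = strlen(s) + 1;
    char *res = malloc(len);

    assert(res != NULL);
    memcpy(res, s, len);
    return res;
}

/* Like int4out(): the decimal representation. */
static char *
int4_out(const void *value)
{
    char buf[12];   /* enough for "-2147483648" */
    int32_t v;

    memcpy(&v, value, sizeof(v));
    snprintf(buf, sizeof(buf), "%ld", (long) v);
    return copy_cstr(buf);
}

/* Like boolout(): "t" or "f". */
static char *
bool_out(const void *value)
{
    return copy_cstr(*(const bool *) value ? "t" : "f");
}

/* Stand-in for reading pg_type.typoutput for a given type oid. */
static OutputFn
lookup_typoutput(unsigned typid)
{
    switch (typid)
    {
        case BOOLOID: return bool_out;
        case INT4OID: return int4_out;
        default:      return NULL;
    }
}

/*
 * Same rules as quote_literal_cstr(): double single quotes and
 * backslashes, and use the E'' syntax if there is any backslash.
 */
static char *
quote_literal(const char *s)
{
    char *res = malloc(2 * strlen(s) + 4);  /* E, 2 quotes, NUL */
    char *dst = res;

    assert(res != NULL);
    if (strchr(s, '\\') != NULL)
        *dst++ = 'E';
    *dst++ = '\'';
    for (; *s != '\0'; s++)
    {
        if (*s == '\'' || *s == '\\')
            *dst++ = *s;
        *dst++ = *s;
    }
    *dst++ = '\'';
    *dst = '\0';
    return res;
}
```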
<h3 id="extracting-sql-from-the-insert-records">Extracting SQL from the INSERT records</h3>
<p>Now it’s time for the fun part, where we just need to put everything together to
finish the project!</p>
<p>I chose to write it as an extension to be able to add and remove it easily from
a production server. I also chose to minimize the amount of C code and rely on
plpgsql functions when possible. It’s faster to write and plpgsql is also way
safer.</p>
<p>I only wrote a single <code class="language-plaintext highlighter-rouge">pg_decode_record()</code> C function, that takes as input a
record as a bytea, the tablespace oid and the relation filenode and emits the
underlying SQL query. I wrote an extra <code class="language-plaintext highlighter-rouge">pg_decode_all_records()</code> function in
plpgsql that uses the existing <code class="language-plaintext highlighter-rouge">pg_ls_dir()</code> and <code class="language-plaintext highlighter-rouge">pg_read_binary_file()</code> functions to
list the files and read the records, and <code class="language-plaintext highlighter-rouge">split_part()</code> to extract the metadata from
the filename.</p>
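<p>For reference, here is a C sketch of the <code class="language-plaintext highlighter-rouge">split_part()</code> semantics relied upon for the filename parsing: it returns the n-th field (1-based) of a string split on a delimiter, or an empty string if there are fewer fields. Negative field numbers, accepted by recent postgres versions, are not handled, and the function is an illustration rather than the extension’s code:</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * C equivalent of SQL's split_part(str, delim, n) for n >= 1.  Returns a
 * newly allocated string; with an empty delimiter the whole string is
 * field 1, as in postgres.
 */
static char *
split_part(const char *str, const char *delim, int n)
{
    size_t dlen = strlen(delim);
    const char *start = str;
    const char *end;
    size_t len;
    char *res;

    assert(n >= 1);
    /* Skip the first n - 1 fields. */
    for (int i = 1; i < n; i++)
    {
        const char *next = dlen ? strstr(start, delim) : NULL;

        if (next == NULL)
            return calloc(1, 1);    /* not enough fields: empty string */
        start = next + dlen;
    }
    end = dlen ? strstr(start, delim) : NULL;
    len = end ? (size_t) (end - start) : strlen(start);
    res = malloc(len + 1);
    assert(res != NULL);
    memcpy(res, start, len);
    res[len] = '\0';
    return res;
}
```

<p>For instance, <code class="language-plaintext highlighter-rouge">split_part("a_b_c", "_", 2)</code> returns <code class="language-plaintext highlighter-rouge">"b"</code>.</p>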
<p>I’m <a href="/assets/patch/pg_decode_record.tgz">attaching the resulting extension to this
article</a> so you can see the whole
implementation and adapt it if needed, and will just quickly describe the main
parts here as we already covered the underlying elements. I’m also only
showing here a simplified version to avoid too many implementation details.</p>
<p>First, I look for a matching relation oid in the pg_class catalog for the given
tablespace and relfilenode, open the found relation with the weakest lock
possible, make a copy of the tuple descriptor and start generating the SQL
query with the qualified relation name. As in any normal application, you need to
make sure that the identifiers are properly quoted to generate working queries:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PGDLLEXPORT</span> <span class="n">Datum</span>
<span class="nf">pg_decode_record</span><span class="p">(</span><span class="n">PG_FUNCTION_ARGS</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">bytea</span> <span class="o">*</span><span class="n">record</span> <span class="o">=</span> <span class="n">PG_GETARG_BYTEA_PP</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="n">Oid</span> <span class="n">spc</span> <span class="o">=</span> <span class="n">PG_GETARG_OID</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="n">Oid</span> <span class="n">relfilenode</span> <span class="o">=</span> <span class="n">PG_GETARG_OID</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span>
<span class="cm">/* Get the relation oid from the tablespace oid and relfilenode */</span>
<span class="n">relid</span> <span class="o">=</span> <span class="n">get_spc_relnumber_relid</span><span class="p">(</span><span class="n">spc</span><span class="p">,</span> <span class="n">relfilenode</span><span class="p">);</span>
<span class="n">relation</span> <span class="o">=</span> <span class="n">table_open</span><span class="p">(</span><span class="n">relid</span><span class="p">,</span> <span class="n">AccessShareLock</span><span class="p">);</span>
<span class="n">tupdesc</span> <span class="o">=</span> <span class="n">CreateTupleDescCopy</span><span class="p">(</span><span class="n">RelationGetDescr</span><span class="p">(</span><span class="n">relation</span><span class="p">));</span>
<span class="cm">/* Start generating the SQL query */</span>
<span class="n">buf</span> <span class="o">=</span> <span class="n">makeStringInfo</span><span class="p">();</span>
<span class="n">appendStringInfo</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"INSERT INTO %s.%s"</span><span class="p">,</span>
<span class="n">quote_identifier</span><span class="p">(</span><span class="n">get_namespace_name</span><span class="p">(</span><span class="n">RelationGetNamespace</span><span class="p">(</span><span class="n">relation</span><span class="p">))),</span>
<span class="n">quote_identifier</span><span class="p">(</span><span class="n">RelationGetRelationName</span><span class="p">(</span><span class="n">relation</span><span class="p">)));</span>
</code></pre></div></div>
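<p>The quoting rules of <code class="language-plaintext highlighter-rouge">quote_identifier()</code> can be sketched as follows: an identifier made only of lowercase letters, digits and underscores, and not starting with a digit, is left as-is, and anything else is double-quoted with embedded double quotes doubled. This hypothetical version skips the SQL keyword check that the real function also performs, so a name like <code class="language-plaintext highlighter-rouge">select</code> would wrongly be left unquoted:</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Simplified quote_identifier(), without the keyword check. */
static char *
quote_identifier_sketch(const char *ident)
{
    bool safe = (ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_';
    size_t len = strlen(ident);
    char *res;
    char *dst;

    for (const char *p = ident; *p && safe; p++)
        safe = (*p >= 'a' && *p <= 'z') || (*p >= '0' && *p <= '9') || *p == '_';

    res = malloc(2 * len + 3);  /* worst case: every char doubled, 2 quotes, NUL */
    assert(res != NULL);
    if (safe)
    {
        memcpy(res, ident, len + 1);
        return res;
    }
    dst = res;
    *dst++ = '"';
    for (const char *p = ident; *p; p++)
    {
        if (*p == '"')
            *dst++ = '"';
        *dst++ = *p;
    }
    *dst++ = '"';
    *dst = '\0';
    return res;
}
```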
<p>The next part extracts the data from the record and generates a <code class="language-plaintext highlighter-rouge">HeapTuple</code> with
just enough information to be correctly deformed:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/* mimic heap_xlog_insert */</span>
<span class="cm">/* PG_GETARG_BYTEA_PP() may return a short (1-byte) varlena header */</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">VARDATA_ANY</span><span class="p">(</span><span class="n">record</span><span class="p">);</span>
<span class="n">datalen</span> <span class="o">=</span> <span class="n">VARSIZE_ANY_EXHDR</span><span class="p">(</span><span class="n">record</span><span class="p">);</span>
<span class="p">[...]</span>
<span class="n">htup</span> <span class="o">=</span> <span class="o">&</span><span class="n">tbuf</span><span class="p">.</span><span class="n">hdr</span><span class="p">;</span>
<span class="p">[...]</span>
<span class="n">htup</span><span class="o">-></span><span class="n">t_hoff</span> <span class="o">=</span> <span class="n">xlhdr</span><span class="p">.</span><span class="n">t_hoff</span><span class="p">;</span>
<span class="cm">/* build a fake tuple with the bare minimum to deform it */</span>
<span class="n">tuple</span> <span class="o">=</span> <span class="p">(</span><span class="n">HeapTuple</span><span class="p">)</span> <span class="n">palloc0</span><span class="p">(</span><span class="n">HEAPTUPLESIZE</span><span class="p">);</span>
<span class="n">tuple</span><span class="o">-></span><span class="n">t_data</span> <span class="o">=</span> <span class="n">htup</span><span class="p">;</span>
<span class="cm">/* full tuple header followed by everything after the xl_heap_header */</span>
<span class="n">tuple</span><span class="o">-></span><span class="n">t_len</span> <span class="o">=</span> <span class="n">SizeofHeapTupleHeader</span> <span class="o">+</span> <span class="n">VARSIZE_ANY_EXHDR</span><span class="p">(</span><span class="n">record</span><span class="p">)</span> <span class="o">-</span> <span class="n">SizeOfHeapHeader</span><span class="p">;</span>
<span class="n">ItemPointerSetInvalid</span><span class="p">(</span><span class="o">&</span><span class="p">(</span><span class="n">tuple</span><span class="o">-></span><span class="n">t_self</span><span class="p">));</span>
<span class="n">tuple</span><span class="o">-></span><span class="n">t_tableOid</span> <span class="o">=</span> <span class="n">relid</span><span class="p">;</span>
</code></pre></div></div>
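<p>Outside of postgres, the same reconstruction can be sketched on raw bytes: the 5-byte <code class="language-plaintext highlighter-rouge">xl_heap_header</code> gives back <code class="language-plaintext highlighter-rouge">t_infomask2</code>, <code class="language-plaintext highlighter-rouge">t_infomask</code> and <code class="language-plaintext highlighter-rouge">t_hoff</code>, which are written at their offsets in a zeroed 23-byte fixed header, and the rest of the record data is appended after it. This assumes native byte order and the current header layout; unlike <code class="language-plaintext highlighter-rouge">heap_xlog_insert()</code>, the xmin, cmin and <code class="language-plaintext highlighter-rouge">t_ctid</code> fields are left zeroed as deforming doesn’t need them, and the function name is hypothetical:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SizeOfHeapHeader      5   /* t_infomask2 + t_infomask + t_hoff */
#define SizeofHeapTupleHeader 23  /* fixed part of HeapTupleHeaderData */

/*
 * Rebuild an on-disk-like heap tuple from the block data of an INSERT
 * record: zero the fixed header, restore t_infomask2 / t_infomask / t_hoff
 * at offsets 18, 20 and 22, then copy the bitmap, padding and attribute
 * data right after it.  Returns a malloc'd buffer (length in *tuplen),
 * or NULL on error.
 */
static char *
rebuild_heap_tuple(const char *data, size_t datalen, size_t *tuplen)
{
    size_t newlen;
    char *tup;

    assert(data != NULL && tuplen != NULL);
    if (datalen < SizeOfHeapHeader)
        return NULL;
    newlen = datalen - SizeOfHeapHeader;
    tup = calloc(1, SizeofHeapTupleHeader + newlen);
    if (tup == NULL)
        return NULL;
    /* xl_heap_header: uint16 t_infomask2, uint16 t_infomask, uint8 t_hoff */
    memcpy(tup + 18, data, 2);
    memcpy(tup + 20, data + 2, 2);
    tup[22] = data[4];
    memcpy(tup + SizeofHeapTupleHeader, data + SizeOfHeapHeader, newlen);
    *tuplen = SizeofHeapTupleHeader + newlen;
    return tup;
}
```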
<p>For the next step, we just need to allocate the 2 arrays needed for the
deforming and call <code class="language-plaintext highlighter-rouge">heap_deform_tuple()</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">values</span> <span class="o">=</span> <span class="n">palloc0</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">Datum</span><span class="p">)</span> <span class="o">*</span> <span class="n">tupdesc</span><span class="o">-></span><span class="n">natts</span><span class="p">);</span>
<span class="n">isnull</span> <span class="o">=</span> <span class="n">palloc0</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">bool</span><span class="p">)</span> <span class="o">*</span> <span class="n">tupdesc</span><span class="o">-></span><span class="n">natts</span><span class="p">);</span>
<span class="n">heap_deform_tuple</span><span class="p">(</span><span class="n">tuple</span><span class="p">,</span> <span class="n">tupdesc</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">isnull</span><span class="p">);</span>
</code></pre></div></div>
<p>Now that we have all the elements, we just need to iterate over the list of
columns in the tuple descriptor, output a NULL if needed, otherwise find the
type output function, call it for our value, and output it in the query after
escaping it:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/* append the values */</span>
<span class="n">appendStringInfoString</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">" VALUES ("</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">tupdesc</span><span class="o">-></span><span class="n">natts</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">value</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">quoted</span><span class="p">;</span>
<span class="n">Oid</span> <span class="n">typoutput</span><span class="p">;</span>
<span class="n">bool</span> <span class="n">typisvarlena</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>
<span class="n">appendStringInfoString</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">", "</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">isnull</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="p">{</span>
<span class="n">appendStringInfoString</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"NULL"</span><span class="p">);</span>
<span class="k">continue</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">getTypeOutputInfo</span><span class="p">(</span><span class="n">TupleDescAttr</span><span class="p">(</span><span class="n">tupdesc</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span><span class="o">-></span><span class="n">atttypid</span><span class="p">,</span>
<span class="o">&</span><span class="n">typoutput</span><span class="p">,</span> <span class="o">&</span><span class="n">typisvarlena</span><span class="p">);</span>
<span class="n">value</span> <span class="o">=</span> <span class="n">OidOutputFunctionCall</span><span class="p">(</span><span class="n">typoutput</span><span class="p">,</span> <span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="cm">/* quote_literal_cstr() returns a new palloc'd string: free both copies */</span>
<span class="n">quoted</span> <span class="o">=</span> <span class="n">quote_literal_cstr</span><span class="p">(</span><span class="n">value</span><span class="p">);</span>
<span class="n">appendStringInfoString</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">quoted</span><span class="p">);</span>
<span class="n">pfree</span><span class="p">(</span><span class="n">quoted</span><span class="p">);</span>
<span class="n">pfree</span><span class="p">(</span><span class="n">value</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">appendStringInfoString</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">");"</span><span class="p">);</span>
</code></pre></div></div>
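<p>The escaping step relies on <code class="language-plaintext highlighter-rouge">quote_literal_cstr()</code>. As far as I can tell from <code class="language-plaintext highlighter-rouge">src/backend/utils/adt/quote.c</code>, it wraps the value in single quotes and doubles any single quote or backslash. It also adds an <code class="language-plaintext highlighter-rouge">E</code> prefix when a backslash is present, which explains why bytea values show up as <code class="language-plaintext highlighter-rouge">E'\\x...'</code> in the decoded TOAST INSERTs. Here is a standalone re-implementation of that rule, for illustration only. The name <code class="language-plaintext highlighter-rouge">quote_literal_sketch()</code> is hypothetical and not a PostgreSQL API, so inside the extension you should keep calling the real function.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative copy of the quoting rule of quote_literal_cstr() (hypothetical
 * name): wrap in single quotes, double every single quote and backslash, and
 * prefix with E if the value contains a backslash.
 * Returns a malloc'd string that the caller must free().
 */
static char *
quote_literal_sketch(const char *src)
{
	size_t		len = strlen(src);
	bool		has_backslash = strchr(src, '\\') != NULL;
	/* worst case: E + quote + every char doubled + quote + NUL */
	char	   *dst = malloc(len * 2 + 4);
	char	   *p = dst;

	assert(dst != NULL);
	if (has_backslash)
		*p++ = 'E';
	*p++ = '\'';
	for (const char *s = src; *s; s++)
	{
		if (*s == '\'' || *s == '\\')
			*p++ = *s;			/* double the special character */
		*p++ = *s;
	}
	*p++ = '\'';
	*p = '\0';
	return dst;
}
```

For instance, <code class="language-plaintext highlighter-rouge">it's</code> becomes <code class="language-plaintext highlighter-rouge">'it''s'</code>, while the bytea text output <code class="language-plaintext highlighter-rouge">\x7761</code> becomes <code class="language-plaintext highlighter-rouge">E'\\x7761'</code>.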
<p>Once done, we just need to properly close the relation and return the generated
query to the caller:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">table_close</span><span class="p">(</span><span class="n">relation</span><span class="p">,</span> <span class="n">NoLock</span><span class="p">);</span>
<span class="n">PG_RETURN_TEXT_P</span><span class="p">(</span><span class="n">cstring_to_text</span><span class="p">(</span><span class="n">buf</span><span class="o">-></span><span class="n">data</span><span class="p">));</span>
<span class="err">}</span>
</code></pre></div></div>
<p>And that’s all you need for the basic scenario! The real implementation has a
bit more code for various other cases, like <strong>very basic</strong> TOAST table
support, but is still unlikely to correctly handle any weird corner cases that
can happen in the wild.</p>
<h3 id="basic-usage">Basic usage</h3>
<p>We can finally see the result of all the hard work in this article and the
previous one! I will be using a simple scenario, first saving the current
WAL position to only keep the records generated afterwards, then removing all
the data from the table (without changing its relfilenode) to make sure that we
don’t read anything from the table itself.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Get the current WAL location</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">pg_current_wal_lsn</span><span class="p">();</span>
<span class="n">pg_current_wal_lsn</span>
<span class="c1">--------------------</span>
<span class="n">F</span><span class="o">/</span><span class="mi">46349</span><span class="n">E80</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">CREATE</span> <span class="n">EXTENSION</span> <span class="n">pg_decode_record</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="n">EXTENSION</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">decode_record</span><span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">val</span> <span class="nb">text</span> <span class="k">storage</span> <span class="k">external</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="k">TABLE</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">decode_record</span>
<span class="k">SELECT</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">'simple test'</span><span class="p">;</span>
<span class="k">INSERT</span> <span class="mi">0</span> <span class="mi">1</span>
<span class="c1">-- Force a full-page write</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">CHECKPOINT</span><span class="p">;</span>
<span class="k">CHECKPOINT</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">decode_record</span>
<span class="k">SELECT</span> <span class="mi">2</span><span class="p">,</span> <span class="s1">'full-page write'</span><span class="p">;</span>
<span class="k">INSERT</span> <span class="mi">0</span> <span class="mi">1</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">decode_record</span>
<span class="k">SELECT</span> <span class="mi">3</span><span class="p">,</span> <span class="s1">'a bit big '</span><span class="o">||</span><span class="n">string_agg</span><span class="p">(</span><span class="n">random</span><span class="p">()::</span><span class="nb">text</span><span class="p">,</span> <span class="s1">' '</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="mi">0</span> <span class="mi">1</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">decode_record</span>
<span class="k">SELECT</span> <span class="mi">4</span><span class="p">,</span> <span class="s1">'way bigger '</span><span class="o">||</span><span class="n">string_agg</span><span class="p">(</span><span class="n">random</span><span class="p">()::</span><span class="nb">text</span><span class="p">,</span> <span class="s1">' '</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">120</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="mi">0</span> <span class="mi">1</span>
<span class="c1">-- Check the heap table size and underlying TOAST table size</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">oid</span><span class="p">::</span><span class="n">regclass</span><span class="p">::</span><span class="nb">text</span><span class="p">,</span> <span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">oid</span><span class="p">)),</span>
<span class="n">reltoastrelid</span><span class="p">::</span><span class="n">regclass</span><span class="p">::</span><span class="nb">text</span><span class="p">,</span> <span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">reltoastrelid</span><span class="p">))</span>
<span class="k">FROM</span> <span class="n">pg_class</span>
<span class="k">WHERE</span> <span class="n">relname</span> <span class="o">=</span> <span class="s1">'decode_record'</span><span class="p">;</span>
<span class="n">oid</span> <span class="o">|</span> <span class="n">pg_size_pretty</span> <span class="o">|</span> <span class="n">reltoastrelid</span> <span class="o">|</span> <span class="n">pg_size_pretty</span>
<span class="c1">---------------+----------------+-------------------------+----------------</span>
<span class="n">decode_record</span> <span class="o">|</span> <span class="mi">8192</span> <span class="n">bytes</span> <span class="o">|</span> <span class="n">pg_toast</span><span class="p">.</span><span class="n">pg_toast_66731</span> <span class="o">|</span> <span class="mi">8192</span> <span class="n">bytes</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">decode_record</span><span class="p">;</span>
<span class="k">DELETE</span> <span class="mi">4</span>
<span class="c1">-- Make sure we remove all records and physically empty the tables</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">VACUUM</span> <span class="n">decode_record</span><span class="p">;</span>
<span class="k">VACUUM</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">oid</span><span class="p">::</span><span class="n">regclass</span><span class="p">::</span><span class="nb">text</span><span class="p">,</span> <span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">oid</span><span class="p">)),</span>
<span class="n">reltoastrelid</span><span class="p">::</span><span class="n">regclass</span><span class="p">::</span><span class="nb">text</span><span class="p">,</span> <span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">reltoastrelid</span><span class="p">))</span>
<span class="k">FROM</span> <span class="n">pg_class</span>
<span class="k">WHERE</span> <span class="n">relname</span> <span class="o">=</span> <span class="s1">'decode_record'</span><span class="p">;</span>
<span class="n">oid</span> <span class="o">|</span> <span class="n">pg_size_pretty</span> <span class="o">|</span> <span class="n">reltoastrelid</span> <span class="o">|</span> <span class="n">pg_size_pretty</span>
<span class="c1">---------------+----------------+-------------------------+----------------</span>
<span class="n">decode_record</span> <span class="o">|</span> <span class="mi">0</span> <span class="n">bytes</span> <span class="o">|</span> <span class="n">pg_toast</span><span class="p">.</span><span class="n">pg_toast_66737</span> <span class="o">|</span> <span class="mi">0</span> <span class="n">bytes</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
</code></pre></div></div>
<p>OK, the WAL should now contain a few records for data that is definitely gone
from the table. Let’s extract the INSERT records using the
custom <em>pg_waldump</em> we created in the previous article:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir -p /tmp/pg_decode_record
$ pg_waldump --start "F/46349E80" --save-records /tmp/pg_decode_record
[...]
$ ls -l /tmp/pg_decode_record
0000000F-46367520.1663.16384.66743.0_main
0000000F-46367660.1663.16384.66743.0_main
0000000F-46367738.1663.16384.66743.0_main
0000000F-46367868.1663.16384.66746.0_main
0000000F-46368130.1663.16384.66746.0_main
0000000F-46368300.1663.16384.66743.0_main
</code></pre></div></div>
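<p>The file names look modelled on the ones produced by the stock <code class="language-plaintext highlighter-rouge">--save-fullpage</code> option of pg_waldump (PostgreSQL 16 and later), without the timeline. Each name appears to contain the record LSN as two hex numbers, then the tablespace OID (1663 is <code class="language-plaintext highlighter-rouge">pg_default</code>), the database OID, the relfilenode, the block number and the fork. That layout is inferred from the listing above, and <code class="language-plaintext highlighter-rouge">parse_record_filename()</code> is a hypothetical helper. Splitting a name this way makes it easy to see that the six files only touch two relfilenodes, 66743 and 66746.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parts of a saved-record file name such as
 * "0000000F-46367520.1663.16384.66743.0_main" (layout inferred, hypothetical). */
typedef struct SavedRecordName
{
	uint64_t	lsn;			/* record LSN, F/46367520 in the example */
	unsigned	spc_oid;		/* tablespace OID, 1663 being pg_default */
	unsigned	db_oid;			/* database OID */
	unsigned	relfilenode;	/* relation file number */
	unsigned	blkno;			/* block number inside the relation */
	char		fork[8];		/* "main", "fsm", "vm" or "init" */
} SavedRecordName;

/* Returns 1 on success, 0 if the name doesn't follow the expected layout. */
static int
parse_record_filename(const char *name, SavedRecordName *out)
{
	static const char *const forks[] = {"main", "fsm", "vm", "init"};
	unsigned	hi, lo;

	assert(name != NULL && out != NULL);
	if (sscanf(name, "%8X-%8X.%u.%u.%u.%u_%7s", &hi, &lo, &out->spc_oid,
			   &out->db_oid, &out->relfilenode, &out->blkno, out->fork) != 7)
		return 0;
	out->lsn = ((uint64_t) hi << 32) | lo;

	for (size_t i = 0; i < sizeof(forks) / sizeof(forks[0]); i++)
		if (strcmp(out->fork, forks[i]) == 0)
			return 1;
	return 0;					/* unknown fork name */
}
```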
<p>You might wonder why 6 records were extracted while we only inserted 4
rows. That’s because the last row was big enough to be TOASTed using 2
chunks, and as far as the WAL is concerned that’s 3 separate INSERTs in 2
different tables. Let’s see that in detail using the extension to decode the
records (truncating the output as some rows are quite big):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">substr</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">95</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">pg_decode_all_records</span><span class="p">(</span><span class="s1">'/tmp/pg_decode_record'</span><span class="p">)</span> <span class="n">f</span><span class="p">(</span><span class="n">v</span><span class="p">);</span>
<span class="n">substr</span>
<span class="c1">-------------------------------------------------------------------------------------------</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">public</span><span class="p">.</span><span class="n">decode_record</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'1'</span><span class="p">,</span> <span class="s1">'simple test'</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">public</span><span class="p">.</span><span class="n">decode_record</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'2'</span><span class="p">,</span> <span class="s1">'full-page write'</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">public</span><span class="p">.</span><span class="n">decode_record</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'3'</span><span class="p">,</span> <span class="s1">'a bit big 0.5356172842583808 0.3...'</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">pg_toast</span><span class="p">.</span><span class="n">pg_toast_66810</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'66815'</span><span class="p">,</span> <span class="s1">'0'</span><span class="p">,</span> <span class="n">E</span><span class="s1">'</span><span class="se">\\</span><span class="s1">x7761792062696767657220302e...'</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">pg_toast</span><span class="p">.</span><span class="n">pg_toast_66810</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'66815'</span><span class="p">,</span> <span class="s1">'1'</span><span class="p">,</span> <span class="n">E</span><span class="s1">'</span><span class="se">\\</span><span class="s1">x3337383137353120302e303439...'</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">public</span><span class="p">.</span><span class="n">decode_record</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'4'</span><span class="p">,</span> <span class="cm">/* toast pointer 66815 */</span><span class="p">);</span>
<span class="p">(</span><span class="mi">6</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>
<p>(Note: I slightly edited the output to make it shorter and to get correct
syntax highlighting; the real extension emits the real table name in a comment
for INSERTs into a TOAST table.)</p>
<p>The first three rows are properly decoded, whether or not their record holds a
full-page image. The last row is indeed split into 3 different
INSERTs: 2 in the TOAST table and 1 in the heap table.</p>
<p>As I mentioned earlier, I only added <strong>very minimal</strong> support for TOAST tables,
as I didn’t have any information about the customer tables, whether they
would hit that case, or how often. The last INSERT isn’t a valid
statement as the 2nd value is missing, but we can manually extract the value
from the INSERT statements in the TOAST table and use it to fix the heap table
INSERT. For instance, decoding the first few bytes that we can see in the first
chunk:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">encode</span><span class="p">(</span><span class="n">E</span><span class="s1">'</span><span class="se">\\</span><span class="s1">x7761792062696767657220302e'</span><span class="p">,</span> <span class="s1">'escape'</span><span class="p">);</span>
<span class="o">-</span><span class="p">[</span> <span class="n">RECORD</span> <span class="mi">1</span> <span class="p">]</span><span class="c1">---------</span>
<span class="n">encode</span> <span class="o">|</span> <span class="n">way</span> <span class="n">bigger</span> <span class="mi">0</span><span class="p">.</span>
</code></pre></div></div>
<p>The data is there, it just needs a bit of manual processing to get it.</p>
<p>To be totally fair, I also cheated a bit in that example by making sure that
the data would be TOASTed but not compressed, so it’s very easy to manually
retrieve the raw value from the extra INSERTs in the TOAST table. It wouldn’t
be very hard to have all of that working transparently, but I simply didn’t
have the need. If you’re interested in that, I’d recommend looking at the
<code class="language-plaintext highlighter-rouge">detoast_attr()</code> function in
<a href="https://github.com/postgres/postgres/blob/master/src/backend/access/common/detoast.c">src/backend/access/common/detoast.c</a>
and all the underlying code to see how you can manually decompress data. You would
then only need to store locally the detoasted (and potentially decompressed) value
referenced by the TOAST chunk_id, and emit it in the query instead of
the currently emitted comment.</p>
<h3 id="conclusion">Conclusion</h3>
<p>I hope you enjoyed those two articles and learned a bit about the WAL
infrastructure and the way pages and tuples work internally.</p>
<p>If you missed it in the article, <a href="/assets/patch/pg_decode_record.tgz">here is the link for the full
extension</a>.</p>
<p>I want to emphasize again that all the code I showed here is only a quick proof
of concept designed for one narrow use case, and it should be used
with care. My goal here wasn’t to show state-of-the-art code but rather to show
one possible way to quickly come up with a plan to salvage data in case of a
production incident. If you’re unfortunately confronted with a
similar problem, or some other major accident, I hope you will find here some
valuable resources and a starting point to come up with your own dedicated
solution!</p>
<p><a href="https://rjuju.github.io/postgresql/2023/12/20/extract-sql-from-wal-part2.html">Extracting SQL from WAL? (part 2)</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on December 20, 2023.</p>
https://rjuju.github.io/postgresql/2023/12/06/extract-sql-from-wal2023-12-06T03:04:10+00:002023-12-06T03:04:10+00:00Julien Rouhaudhttps://rjuju.github.io
<p>Is it actually possible to extract SQL commands from WAL generated with the “replica”
<code class="language-plaintext highlighter-rouge">wal_level</code>?</p>
<p>The answer is usually no: the “logical” <code class="language-plaintext highlighter-rouge">wal_level</code> exists for a reason after
all, and you shouldn’t expect some kind of miracle here.</p>
<p>But in this series of articles, you will see that some information can still
be extracted if some conditions are met, and how to do it. This first
article focuses on the WAL records and how to extract the ones you want, while
the next one will show how to try to extract the information contained in those
records.</p>
<h3 id="some-context">Some context</h3>
<p>This article is based on some work I did a few months ago to help a customer
recover some data after an incident. It’s not a perfect solution, mostly a
set of quick hacks I wrote to come up with something able to retrieve data in
only a few hours of work, but I hope sharing details about it and some
methodology can be helpful if you ever get in a similar situation. You will
probably need to adapt it to your needs, with yet other hacks, but it should
give you a good start. It may otherwise be of interest if you want to
know a bit more about the WAL record internals and some associated
infrastructure.</p>
<h3 id="the-incident">The incident</h3>
<p>Due to a series of unfortunate events, one of their HA clusters ended up in a
split-brain situation for some time before being reinitialised, which
entirely removed one of the data directories. After that, only the WALs that
were generated on that instance were available, those being in “replica”
<code class="language-plaintext highlighter-rouge">wal_level</code>, and nothing else.</p>
<p>One way to try to recover the data would be to restore a physical backup,
if any, replay archived WALs up to the last transaction before the removed node
was promoted (assuming those are still available) and then replay the WALs
generated on that newly promoted node. Once there, you still need to look at
each row of each table of each database and compare it to yet another instance
restored from the same backup to approximately the same point in time as this one.
That’s clearly not ideal as it will likely require days or even weeks of
tedious hard work, and will consume a lot of resources along the way.
Is there a way to do better?</p>
<p>After a quick discussion, it turned out that there were a few elements that
made some recovery from the WALs themselves possible (more on why later):</p>
<ol>
<li>One of the data directories was still available</li>
<li>The customer guaranteed that no DDL happened since the incident</li>
<li>Only INSERTs happened during the split-brain</li>
</ol>
<h3 id="wals--physical-replication">WALs & Physical replication</h3>
<p>As you probably know, postgres physical replication works by sending an exact
copy of the modified binary raw data to the various standby servers, in a
continuous stream of WAL records. As a consequence, those records don’t really
know much about the database objects they reference, and nothing about the SQL
queries that generated them. So what do they really contain? Let’s see what’s
inside the WAL records generated for an INSERT into a normal heap relation.</p>
<h4 id="wal-records">WAL records</h4>
<p>First of all, you have to know that the WAL records are split into <strong>Resource
Managers</strong> (declared in
<a href="https://github.com/postgres/postgres/blob/master/src/include/access/rmgrlist.h">src/include/access/rmgrlist.h</a>),
each being responsible for a specific part of postgres (heap tables, indexes,
vacuum…). They’re identified by a numeric identifier and often referred to as
a <code class="language-plaintext highlighter-rouge">rmid</code>, for <em>resource manager identifier</em>.</p>
<p>Each of those resource managers can handle various operations, which are
internally called <strong>opcodes</strong>. Here we’re interested in the WAL records
generated while operating on standard heap tables, and especially during
INSERTs. This resource manager is a bit particular as it’s split into 2
different <code class="language-plaintext highlighter-rouge">rmid</code>s: <code class="language-plaintext highlighter-rouge">RM_HEAP_ID</code> and <code class="language-plaintext highlighter-rouge">RM_HEAP2_ID</code>. This is only an
implementation detail, as each resource manager can only handle a limited
number of opcodes; everything else works the same way.</p>
<p>If you’re curious, here’s the definition of the main WAL record structure in the <a href="https://github.com/postgres/postgres/blob/master/src/include/access/xlogrecord.h">source
code</a>,
along with some details on its exact layout in the files:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/*
* The overall layout of an XLOG record is:
* Fixed-size header (XLogRecord struct)
* XLogRecordBlockHeader struct
* XLogRecordBlockHeader struct
* ...
* XLogRecordDataHeader[Short|Long] struct
* block data
* block data
* ...
* main data
* [...]
*/</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="n">XLogRecord</span>
<span class="p">{</span>
<span class="n">uint32</span> <span class="n">xl_tot_len</span><span class="p">;</span> <span class="cm">/* total len of entire record */</span>
<span class="n">TransactionId</span> <span class="n">xl_xid</span><span class="p">;</span> <span class="cm">/* xact id */</span>
<span class="n">XLogRecPtr</span> <span class="n">xl_prev</span><span class="p">;</span> <span class="cm">/* ptr to previous record in log */</span>
<span class="n">uint8</span> <span class="n">xl_info</span><span class="p">;</span> <span class="cm">/* flag bits, see below */</span>
<span class="n">RmgrId</span> <span class="n">xl_rmid</span><span class="p">;</span> <span class="cm">/* resource manager for this record */</span>
<span class="cm">/* 2 bytes of padding here, initialize to zero */</span>
<span class="n">pg_crc32c</span> <span class="n">xl_crc</span><span class="p">;</span> <span class="cm">/* CRC for this record */</span>
<span class="cm">/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */</span>
<span class="p">}</span> <span class="n">XLogRecord</span><span class="p">;</span>
</code></pre></div></div>
<p>and a block data header:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="cm">/*
* Header info for block data appended to an XLOG record.
*
* 'data_length' is the length of the rmgr-specific payload data associated
* with this block. It does not include the possible full page image, nor
* XLogRecordBlockHeader struct itself.
*
* Note that we don't attempt to align the XLogRecordBlockHeader struct!
* So, the struct must be copied to aligned local storage before use.
*/</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="n">XLogRecordBlockHeader</span>
<span class="p">{</span>
<span class="n">uint8</span> <span class="n">id</span><span class="p">;</span> <span class="cm">/* block reference ID */</span>
<span class="n">uint8</span> <span class="n">fork_flags</span><span class="p">;</span> <span class="cm">/* fork within the relation, and flags */</span>
<span class="n">uint16</span> <span class="n">data_length</span><span class="p">;</span> <span class="cm">/* number of payload bytes (not including page
* image) */</span>
<span class="cm">/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */</span>
<span class="cm">/* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */</span>
<span class="cm">/* BlockNumber follows */</span>
<span class="p">}</span> <span class="n">XLogRecordBlockHeader</span><span class="p">;</span>
</code></pre></div></div>
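<p>To make the fixed-size <code class="language-plaintext highlighter-rouge">XLogRecord</code> header a bit more concrete, here’s a minimal sketch that decodes it from a raw buffer. It assumes the record is contiguous in memory. Real records can span WAL pages, which the xlogreader infrastructure handles for you. It also assumes the WAL was written on a machine with the same endianness, since postgres writes these fields in native byte order. The <code class="language-plaintext highlighter-rouge">RecordHeader</code> type and <code class="language-plaintext highlighter-rouge">read_record_header()</code> are hypothetical names. The offsets follow the struct above: 4 + 4 + 8 + 1 + 1 bytes, 2 bytes of padding, then the 4-byte CRC, for 24 bytes in total.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size XLogRecord header, 24 bytes on disk (mirrors the struct above). */
typedef struct RecordHeader
{
	uint32_t	tot_len;		/* xl_tot_len: total length of the record */
	uint32_t	xid;			/* xl_xid */
	uint64_t	prev;			/* xl_prev: LSN of the previous record */
	uint8_t		info;			/* xl_info: flag bits, including the opcode */
	uint8_t		rmid;			/* xl_rmid: resource manager ID */
	uint32_t	crc;			/* xl_crc, after 2 bytes of padding */
} RecordHeader;

#define RECORD_HEADER_SIZE 24

/*
 * Decode the header at the start of buf.  Returns 0 if buf can't hold the
 * header or the whole record.  Fields are copied with memcpy() because WAL
 * data isn't guaranteed to be suitably aligned.
 */
static int
read_record_header(const unsigned char *buf, size_t buflen, RecordHeader *hdr)
{
	assert(buf != NULL && hdr != NULL);
	if (buflen < RECORD_HEADER_SIZE)
		return 0;
	memcpy(&hdr->tot_len, buf + 0, 4);
	memcpy(&hdr->xid, buf + 4, 4);
	memcpy(&hdr->prev, buf + 8, 8);
	hdr->info = buf[16];
	hdr->rmid = buf[17];
	/* bytes 18 and 19 are padding */
	memcpy(&hdr->crc, buf + 20, 4);
	return hdr->tot_len >= RECORD_HEADER_SIZE && hdr->tot_len <= buflen;
}
```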
<p>Everything here is very generic as it’s used by all the resource managers. One
important bit though is the mention of a <strong>RelFileLocator</strong> after the block
header, unless the block belongs to the same relation as the previous block
reference in the same record (which is what BKPBLOCK_SAME_REL indicates).
This is of course important information for us.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">RelFileLocator</span>
<span class="p">{</span>
<span class="n">Oid</span> <span class="n">spcOid</span><span class="p">;</span> <span class="cm">/* tablespace */</span>
<span class="n">Oid</span> <span class="n">dbOid</span><span class="p">;</span> <span class="cm">/* database */</span>
<span class="n">RelFileNumber</span> <span class="n">relNumber</span><span class="p">;</span> <span class="cm">/* relation */</span>
<span class="p">}</span> <span class="n">RelFileLocator</span><span class="p">;</span>
</code></pre></div></div>
<p>But here’s a first reason why you need a proper data directory to do anything
with the WALs: this doesn’t contain the schema name and table name, or even the
table oid, but the <strong>tablespace oid, database oid and relfilenode</strong>, which is
what the WAL needs to identify a physical relation file (which is
itself split into multiple files, the exact
<a href="https://github.com/postgres/postgres/blob/master/src/backend/storage/smgr/README">fork</a>
and segment being deduced from other information). So if any table rewrite
happened since the WAL records were generated (e.g. a VACUUM FULL), you
won’t be able to identify which relation a record is about, unless of course
you find a way to map the current relfilenode to the one before the table
rewrite.</p>
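<p>For the file itself, the path can be derived from those three OIDs plus the fork and block number, following the “Database File Layout” chapter of the PostgreSQL documentation. The default tablespace uses <code class="language-plaintext highlighter-rouge">base/dboid/relfilenode</code>. The other forks add an <code class="language-plaintext highlighter-rouge">_fsm</code>, <code class="language-plaintext highlighter-rouge">_vm</code> or <code class="language-plaintext highlighter-rouge">_init</code> suffix, and every additional 1GB segment adds a <code class="language-plaintext highlighter-rouge">.N</code> suffix. The sketch below assumes the default 8kB block size and 1GB segment size. <code class="language-plaintext highlighter-rouge">relation_segment_path()</code> is a hypothetical name. Its <code class="language-plaintext highlighter-rouge">tblspc_version_dir</code> parameter (<code class="language-plaintext highlighter-rouge">PG_</code>, the major version, <code class="language-plaintext highlighter-rouge">_</code> and the catalog version) is an input you would look up in <code class="language-plaintext highlighter-rouge">pg_tblspc</code>.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEFAULTTABLESPACE_OID	1663	/* pg_default */
#define GLOBALTABLESPACE_OID	1664	/* pg_global, shared catalogs */
#define RELSEG_BLOCKS			131072	/* 1GB segments of 8kB blocks (defaults) */

/* File name suffix for each fork number: main, fsm, vm, init */
static const char *const fork_suffix[] = {"", "_fsm", "_vm", "_init"};

/*
 * Write into out the path, relative to the data directory, of the segment
 * file holding block blkno.  tblspc_version_dir is only used for
 * non-default tablespaces.  Returns the snprintf() result, so a value
 * >= outlen means the path was truncated.
 */
static int
relation_segment_path(char *out, size_t outlen, unsigned spc, unsigned db,
					  unsigned rel, int forknum, unsigned blkno,
					  const char *tblspc_version_dir)
{
	unsigned	segno = blkno / RELSEG_BLOCKS;
	char		seg[16] = "";

	assert(forknum >= 0 && forknum <= 3);
	if (segno > 0)
		snprintf(seg, sizeof(seg), ".%u", segno);	/* segment 0 has no suffix */

	if (spc == GLOBALTABLESPACE_OID)
		return snprintf(out, outlen, "global/%u%s%s",
						rel, fork_suffix[forknum], seg);
	if (spc == DEFAULTTABLESPACE_OID)
		return snprintf(out, outlen, "base/%u/%u%s%s",
						db, rel, fork_suffix[forknum], seg);

	assert(tblspc_version_dir != NULL && strlen(tblspc_version_dir) > 0);
	return snprintf(out, outlen, "pg_tblspc/%u/%s/%u/%u%s%s",
					spc, tblspc_version_dir, db, rel, fork_suffix[forknum], seg);
}
```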
<h4 id="heap-insert-wal-records">Heap INSERT WAL records</h4>
<p>Now that we saw a bit of the general WAL structures, let’s focus on the data
specific to an INSERT. If you’re not really familiar with the internals, one
easy way to locate the code related to a specific command is to look at the
functions associated to a resource manager. Let’s look at the <strong>RM_HEAP_ID</strong>
information in
<a href="https://github.com/postgres/postgres/blob/master/src/include/access/rmgrlist.h">src/include/access/rmgrlist.h</a>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */</span>
<span class="n">PG_RMGR</span><span class="p">(</span><span class="n">RM_HEAP_ID</span><span class="p">,</span> <span class="s">"Heap"</span><span class="p">,</span> <span class="n">heap_redo</span><span class="p">,</span> <span class="n">heap_desc</span><span class="p">,</span> <span class="n">heap_identify</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">heap_mask</span><span class="p">,</span> <span class="n">heap_decode</span><span class="p">)</span>
</code></pre></div></div>
<p>Here we have the names of the actual functions responsible for many operations
(the exact list will vary depending on the postgres major version; I’m
using the list from postgres 17 here).</p>
<p>The <strong>redo</strong> function is the one that applies an RM_HEAP_ID
record, the <strong>desc</strong> function is the one that emits the info you see in
pg_waldump, the <strong>identify</strong> function returns a string describing the opcode
and so on. Let’s look at <code class="language-plaintext highlighter-rouge">heap_identify()</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">heap_identify</span><span class="p">(</span><span class="n">uint8</span> <span class="n">info</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">id</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">info</span> <span class="o">&</span> <span class="o">~</span><span class="n">XLR_INFO_MASK</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">case</span> <span class="n">XLOG_HEAP_INSERT</span><span class="p">:</span>
<span class="n">id</span> <span class="o">=</span> <span class="s">"INSERT"</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">[...]</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">id</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
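<p>The masking done in <code>heap_identify()</code> is easy to reproduce outside of postgres, which helps when deciding which records to keep. Below is a minimal standalone sketch: the constant values are copied from postgres’ <code>xlogrecord.h</code> and <code>heapam_xlog.h</code>, only a few opcodes are handled, and the <code>demo_</code> function names are hypothetical. Note that an INSERT that also initializes a new page carries the <code>XLOG_HEAP_INIT_PAGE</code> flag and shows up as <code>INSERT+INIT</code> in pg_waldump.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as defined in postgres' xlogrecord.h and heapam_xlog.h */
#define XLR_INFO_MASK        0x0F  /* low 4 bits, reserved for the generic WAL machinery */
#define XLOG_HEAP_INSERT     0x00
#define XLOG_HEAP_DELETE     0x10
#define XLOG_HEAP_OPMASK     0x70  /* bits holding the heap opcode */
#define XLOG_HEAP_INIT_PAGE  0x80  /* flag: the record also initializes the page */

/* Hypothetical, reduced version of heap_identify() */
static const char *
demo_heap_identify(uint8_t info)
{
    switch (info & ~XLR_INFO_MASK)
    {
        case XLOG_HEAP_INSERT:
            return "INSERT";
        case XLOG_HEAP_INSERT | XLOG_HEAP_INIT_PAGE:
            return "INSERT+INIT";
        case XLOG_HEAP_DELETE:
            return "DELETE";
        default:
            return NULL;        /* other opcodes are omitted in this sketch */
    }
}

/* Hypothetical helper: is this RM_HEAP_ID info byte an INSERT, with or without INIT? */
static bool
demo_is_heap_insert(uint8_t info)
{
    /* keep only the opcode bits, dropping INIT_PAGE and the low 4 bits */
    return (info & XLOG_HEAP_OPMASK) == XLOG_HEAP_INSERT;
}
```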
<p>We now know that the opcode we’re interested in is <strong>XLOG_HEAP_INSERT</strong>. A
quick <code class="language-plaintext highlighter-rouge">git grep</code> in the tree will lead you to
<a href="https://github.com/postgres/postgres/blob/master/src/backend/access/heap/heapam.c">src/backend/access/heap/heapam.c</a>,
more precisely to the <strong>heap_insert</strong> function. The interesting part is located in
the “XLOG stuff” block. Here is an extract focusing on the bits we
will need:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">heap_insert</span><span class="p">(</span><span class="n">Relation</span> <span class="n">relation</span><span class="p">,</span> <span class="n">HeapTuple</span> <span class="n">tup</span><span class="p">,</span> <span class="n">CommandId</span> <span class="n">cid</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">options</span><span class="p">,</span> <span class="n">BulkInsertState</span> <span class="n">bistate</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">[...]</span>
<span class="cm">/* XLOG stuff */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">RelationNeedsWAL</span><span class="p">(</span><span class="n">relation</span><span class="p">))</span>
<span class="p">{</span>
<span class="n">xl_heap_insert</span> <span class="n">xlrec</span><span class="p">;</span>
<span class="n">xl_heap_header</span> <span class="n">xlhdr</span><span class="p">;</span>
<span class="n">XLogRecPtr</span> <span class="n">recptr</span><span class="p">;</span>
<span class="n">Page</span> <span class="n">page</span> <span class="o">=</span> <span class="n">BufferGetPage</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
<span class="n">uint8</span> <span class="n">info</span> <span class="o">=</span> <span class="n">XLOG_HEAP_INSERT</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">bufflags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">[...]</span>
<span class="n">xlrec</span><span class="p">.</span><span class="n">offnum</span> <span class="o">=</span> <span class="n">ItemPointerGetOffsetNumber</span><span class="p">(</span><span class="o">&</span><span class="n">heaptup</span><span class="o">-></span><span class="n">t_self</span><span class="p">);</span>
<span class="n">xlrec</span><span class="p">.</span><span class="n">flags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">[...]</span>
<span class="n">XLogBeginInsert</span><span class="p">();</span>
<span class="n">XLogRegisterData</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="o">&</span><span class="n">xlrec</span><span class="p">,</span> <span class="n">SizeOfHeapInsert</span><span class="p">);</span>
<span class="n">xlhdr</span><span class="p">.</span><span class="n">t_infomask2</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-></span><span class="n">t_data</span><span class="o">-></span><span class="n">t_infomask2</span><span class="p">;</span>
<span class="n">xlhdr</span><span class="p">.</span><span class="n">t_infomask</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-></span><span class="n">t_data</span><span class="o">-></span><span class="n">t_infomask</span><span class="p">;</span>
<span class="n">xlhdr</span><span class="p">.</span><span class="n">t_hoff</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-></span><span class="n">t_data</span><span class="o">-></span><span class="n">t_hoff</span><span class="p">;</span>
<span class="cm">/*
* note we mark xlhdr as belonging to buffer; if XLogInsert decides to
* write the whole page to the xlog, we don't need to store
* xl_heap_header in the xlog.
*/</span>
<span class="n">XLogRegisterBuffer</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">buffer</span><span class="p">,</span> <span class="n">REGBUF_STANDARD</span> <span class="o">|</span> <span class="n">bufflags</span><span class="p">);</span>
<span class="n">XLogRegisterBufData</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="o">&</span><span class="n">xlhdr</span><span class="p">,</span> <span class="n">SizeOfHeapHeader</span><span class="p">);</span>
<span class="cm">/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */</span>
<span class="n">XLogRegisterBufData</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span>
<span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="n">heaptup</span><span class="o">-></span><span class="n">t_data</span> <span class="o">+</span> <span class="n">SizeofHeapTupleHeader</span><span class="p">,</span>
<span class="n">heaptup</span><span class="o">-></span><span class="n">t_len</span> <span class="o">-</span> <span class="n">SizeofHeapTupleHeader</span><span class="p">);</span>
<span class="p">[...]</span>
<span class="n">recptr</span> <span class="o">=</span> <span class="n">XLogInsert</span><span class="p">(</span><span class="n">RM_HEAP_ID</span><span class="p">,</span> <span class="n">info</span><span class="p">);</span>
<span class="n">PageSetLSN</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="n">recptr</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As expected, this function inserts an <code class="language-plaintext highlighter-rouge">RM_HEAP_ID</code> record
with an <code class="language-plaintext highlighter-rouge">XLOG_HEAP_INSERT</code> opcode. The record’s main data is the
<code class="language-plaintext highlighter-rouge">xl_heap_insert</code> struct, and 2 data parts are attached to block 0:
the header of the tuple that’s being inserted and the tuple itself.</p>
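<p>To make this layout concrete, here is a standalone sketch of how a consumer could split the block 0 data of an <code>XLOG_HEAP_INSERT</code> record, assuming it was obtained with <code>XLogRecGetBlockData()</code>. It mirrors postgres’ <code>xl_heap_header</code> and <code>SizeOfHeapHeader</code> (5 bytes, no trailing padding), and assumes <code>SizeofHeapTupleHeader</code> is 23 bytes, the offset of <code>t_bits</code> in <code>HeapTupleHeaderData</code>. The <code>demo_</code> names are hypothetical.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as postgres' xl_heap_header (heapam_xlog.h) */
typedef struct demo_xl_heap_header
{
    uint16_t t_infomask2;
    uint16_t t_infomask;
    uint8_t  t_hoff;
} demo_xl_heap_header;

/* SizeOfHeapHeader: the WAL copy stops right after t_hoff (no padding) */
#define DEMO_SIZE_OF_HEAP_HEADER (offsetof(demo_xl_heap_header, t_hoff) + sizeof(uint8_t))
/* Assumed SizeofHeapTupleHeader: offsetof(HeapTupleHeaderData, t_bits) */
#define DEMO_SIZEOF_HEAP_TUPLE_HEADER 23

/*
 * Split block 0 data into the tuple header fields and the attribute data.
 * After the 5-byte header comes: null bitmap [+ padding] [+ oid] + data,
 * so the first attribute starts (t_hoff - 23) bytes into that payload.
 * Returns 0 on success, -1 if the buffer is too short or inconsistent.
 */
static int
demo_parse_insert_data(const char *data, size_t len,
                       demo_xl_heap_header *hdr,
                       const char **attrs, size_t *attrs_len)
{
    size_t skip;

    if (len < DEMO_SIZE_OF_HEAP_HEADER)
        return -1;

    /* copy field by field so we don't depend on the buffer's alignment */
    memcpy(&hdr->t_infomask2, data, sizeof(uint16_t));
    memcpy(&hdr->t_infomask, data + sizeof(uint16_t), sizeof(uint16_t));
    hdr->t_hoff = (uint8_t) data[2 * sizeof(uint16_t)];

    if (hdr->t_hoff < DEMO_SIZEOF_HEAP_TUPLE_HEADER)
        return -1;

    /* bytes of null bitmap, padding and (pre-v12) oid before the data */
    skip = (size_t) (hdr->t_hoff - DEMO_SIZEOF_HEAP_TUPLE_HEADER);
    if (len - DEMO_SIZE_OF_HEAP_HEADER < skip)
        return -1;

    *attrs = data + DEMO_SIZE_OF_HEAP_HEADER + skip;
    *attrs_len = len - DEMO_SIZE_OF_HEAP_HEADER - skip;
    return 0;
}
```

<p>Decoding the attribute bytes themselves requires the table’s column types, which is why the consumer side will need the relation metadata.</p>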
<p>That’s great! At this point we know how to identify what relation an INSERT is
about and the content of that INSERT. Let’s see how to filter those records
from the WALs.</p>
<h3 id="extracting-and-filtering-wal-records">Extracting and filtering WAL records</h3>
<p>Parsing the postgres WALs isn’t that complicated, but it still requires knowing
quite a bit more than what I showed here. Writing such code is possible, but
wait, don’t we already have a tool shipped with postgres that is designed
to do exactly that? Yes, there is:
<a href="https://github.com/postgres/postgres/tree/master/src/bin/pg_waldump">pg_waldump</a>.</p>
<p>Rather than writing something similar, couldn’t we simply teach pg_waldump to
filter the records we’re interested in and save them somewhere, so that we can
later process them and generate SQL queries? This way we also benefit
from all the options in pg_waldump, like specifying the starting and/or ending LSN
or filtering a specific resource manager, without having to worry about most
of the WAL implementation details, focusing only on the few functions
provided by postgres that we need. Let’s see how to implement that.</p>
<p>The main source file is
<a href="https://github.com/postgres/postgres/blob/master/src/bin/pg_waldump/pg_waldump.c">src/bin/pg_waldump/pg_waldump.c</a>.
Skipping most of the unrelated code, we can see that there’s a main loop that
reads the records one by one, optionally filters them and then
does something with them depending on how the tool was executed. I will again
show an extract to focus on the most relevant part only:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">for</span> <span class="p">(;;)</span>
<span class="p">{</span>
<span class="p">[...]</span>
<span class="cm">/* try to read the next record */</span>
<span class="n">record</span> <span class="o">=</span> <span class="n">XLogReadRecord</span><span class="p">(</span><span class="n">xlogreader_state</span><span class="p">,</span> <span class="o">&</span><span class="n">errormsg</span><span class="p">);</span>
<span class="p">[...]</span>
<span class="cm">/* apply all specified filters */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">config</span><span class="p">.</span><span class="n">filter_by_rmgr_enabled</span> <span class="o">&&</span>
<span class="o">!</span><span class="n">config</span><span class="p">.</span><span class="n">filter_by_rmgr</span><span class="p">[</span><span class="n">record</span><span class="o">-></span><span class="n">xl_rmid</span><span class="p">])</span>
<span class="k">continue</span><span class="p">;</span>
<span class="p">[...]</span>
<span class="cm">/* perform any per-record work */</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">config</span><span class="p">.</span><span class="n">quiet</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">config</span><span class="p">.</span><span class="n">stats</span> <span class="o">==</span> <span class="nb">true</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">XLogRecStoreStats</span><span class="p">(</span><span class="o">&</span><span class="n">stats</span><span class="p">,</span> <span class="n">xlogreader_state</span><span class="p">);</span>
<span class="n">stats</span><span class="p">.</span><span class="n">endptr</span> <span class="o">=</span> <span class="n">xlogreader_state</span><span class="o">-></span><span class="n">EndRecPtr</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="n">XLogDumpDisplayRecord</span><span class="p">(</span><span class="o">&</span><span class="n">config</span><span class="p">,</span> <span class="n">xlogreader_state</span><span class="p">);</span>
<span class="p">}</span>
<span class="cm">/* save full pages if requested */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">config</span><span class="p">.</span><span class="n">save_fullpage_path</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span>
<span class="n">XLogRecordSaveFPWs</span><span class="p">(</span><span class="n">xlogreader_state</span><span class="p">,</span> <span class="n">config</span><span class="p">.</span><span class="n">save_fullpage_path</span><span class="p">);</span>
<span class="cm">/* check whether we printed enough */</span>
<span class="n">config</span><span class="p">.</span><span class="n">already_displayed_records</span><span class="o">++</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">config</span><span class="p">.</span><span class="n">stop_after_records</span> <span class="o">></span> <span class="mi">0</span> <span class="o">&&</span>
<span class="n">config</span><span class="p">.</span><span class="n">already_displayed_records</span> <span class="o">>=</span> <span class="n">config</span><span class="p">.</span><span class="n">stop_after_records</span><span class="p">)</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>That’s quite simple: pg_waldump reads the records one by one until it needs to
stop, ignores the records that the user asked to discard and then takes action
on the remaining ones. We can see that there’s already an option to save full
page images, so it definitely looks like we could add something similar
there, but for all records.</p>
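<p>As an illustration, the extra check we would add to that loop boils down to keeping only <code>RM_HEAP_ID</code> records whose opcode is <code>XLOG_HEAP_INSERT</code>. The standalone sketch below models this with a hypothetical <code>demo_record</code> struct instead of postgres’ <code>XLogReaderState</code> (in real code the two fields come from the <code>XLogRecGetRmid()</code> and <code>XLogRecGetInfo()</code> macros). The value 10 for <code>RM_HEAP_ID</code> is an assumption based on its position in <code>rmgrlist.h</code>; real code should simply use the symbol.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_RM_HEAP_ID       10    /* assumption: position of RM_HEAP_ID in rmgrlist.h */
#define DEMO_XLOG_HEAP_INSERT 0x00
#define DEMO_XLOG_HEAP_OPMASK 0x70

/* Hypothetical stand-in for the record fields pg_waldump looks at */
typedef struct demo_record
{
    uint8_t rmid;   /* what XLogRecGetRmid() would return */
    uint8_t info;   /* what XLogRecGetInfo() would return */
} demo_record;

/* Would a "save records" option keep this record? */
static int
demo_should_save(const demo_record *rec)
{
    return rec->rmid == DEMO_RM_HEAP_ID &&
        (rec->info & DEMO_XLOG_HEAP_OPMASK) == DEMO_XLOG_HEAP_INSERT;
}

/* Mimic the main loop: skip unwanted records, count the ones we would save */
static size_t
demo_count_saved(const demo_record *records, size_t nrecords)
{
    size_t saved = 0;

    for (size_t i = 0; i < nrecords; i++)
    {
        if (!demo_should_save(&records[i]))
            continue;           /* same pattern as pg_waldump's own filters */
        saved++;                /* real code would write the record to disk here */
    }
    return saved;
}
```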
<p>First, we will need to provide a way to identify the relation the INSERT is
about. That’s the <code class="language-plaintext highlighter-rouge">RelFileLocator</code>, and we already know that it can be found
just after the XLogRecordBlockHeader. Postgres provides a function to retrieve
this information, and a bit more, named
<a href="https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogreader.c"><code class="language-plaintext highlighter-rouge">XLogRecGetBlockTagExtended()</code></a>.
Here is its description:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/*
* Returns information about the block that a block reference refers to,
* optionally including the buffer that the block may already be in.
*
* If the WAL record contains a block reference with the given ID, *rlocator,
* *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
* returns true. Otherwise returns false.
*/</span>
<span class="n">bool</span>
<span class="n">XLogRecGetBlockTagExtended</span><span class="p">(</span><span class="n">XLogReaderState</span> <span class="o">*</span><span class="n">record</span><span class="p">,</span> <span class="n">uint8</span> <span class="n">block_id</span><span class="p">,</span>
<span class="n">RelFileLocator</span> <span class="o">*</span><span class="n">rlocator</span><span class="p">,</span>
<span class="n">ForkNumber</span> <span class="o">*</span><span class="n">forknum</span><span class="p">,</span>
<span class="n">BlockNumber</span> <span class="o">*</span><span class="n">blknum</span><span class="p">,</span>
<span class="n">Buffer</span> <span class="o">*</span><span class="n">prefetch_buffer</span><span class="p">)</span>
</code></pre></div></div>
<p>We need to provide the record (pg_waldump already retrieves it for us) and
the <code class="language-plaintext highlighter-rouge">block_id</code>. The <code class="language-plaintext highlighter-rouge">block_id</code>, or block reference, is simply an index in the
array of block references that the WAL record contains. If you look a bit above in this
article, you will see that <code class="language-plaintext highlighter-rouge">heap_insert()</code> only uses a
hardcoded <strong>0</strong> block_id: this is the first argument of the
<code class="language-plaintext highlighter-rouge">XLogRegisterBuffer()</code> and <code class="language-plaintext highlighter-rouge">XLogRegisterBufData()</code> calls.</p>
<p>Next we need to retrieve the actual WAL record data: the tuple header and the
tuple itself. This one is a bit trickier, as this data can be found either in
the record’s block data or in a full-page image. We need to check the block
data first. The associated function is
<a href="https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogreader.c"><code class="language-plaintext highlighter-rouge">XLogRecGetBlockData()</code></a>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/*
* Returns the data associated with a block reference, or NULL if there is
* no data (e.g. because a full-page image was taken instead). The returned
* pointer points to a MAXALIGNed buffer.
*/</span>
<span class="kt">char</span> <span class="o">*</span>
<span class="n">XLogRecGetBlockData</span><span class="p">(</span><span class="n">XLogReaderState</span> <span class="o">*</span><span class="n">record</span><span class="p">,</span> <span class="n">uint8</span> <span class="n">block_id</span><span class="p">,</span> <span class="n">Size</span> <span class="o">*</span><span class="n">len</span><span class="p">)</span>
</code></pre></div></div>
<p>As noted in the comment, if the function returns NULL (and sets len to <strong>0</strong>)
then the data may be in a full-page image instead (or the data could be missing
entirely). If that’s the case we need to retrieve the full-page image, and
then locate the tuple the INSERT was about and extract it in the same format as
a simple WAL record.</p>
<p>Postgres provides a function to extract the full-page image:
<a href="https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogreader.c"><code class="language-plaintext highlighter-rouge">RestoreBlockImage()</code></a>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/*
* Restore a full-page image from a backup block attached to an XLOG record.
*
* Returns true if a full-page image is restored, and false on failure with
* an error to be consumed by the caller.
*/</span>
<span class="n">bool</span>
<span class="n">RestoreBlockImage</span><span class="p">(</span><span class="n">XLogReaderState</span> <span class="o">*</span><span class="n">record</span><span class="p">,</span> <span class="n">uint8</span> <span class="n">block_id</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">page</span><span class="p">)</span>
</code></pre></div></div>
<p>which is straightforward to use: just provide the record, the block
identifier and a buffer for the page, and you get the full-page image if found. However, there’s no
function available to extract a tuple from a full-page image. Indeed, during replay postgres
can simply overwrite the whole block with the full-page image, as it contains
the latest version of the block at the time it was generated, but in our case
we definitely don’t want to emit an INSERT statement for every already existing
tuple in the block!</p>
<p>Fortunately, even when we get a full-page image, our record still contains a
<em>main data area</em>. If you look back at the <code class="language-plaintext highlighter-rouge">heap_insert()</code> function, that’s
the call to <code class="language-plaintext highlighter-rouge">XLogRegisterData()</code>, and as you can see it contains an
<code class="language-plaintext highlighter-rouge">xl_heap_insert</code> struct. The first member of this struct, <strong>offnum</strong>, is
the position (offset number) of the tuple in the page, which is exactly what we need!</p>
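<p>Reading <strong>offnum</strong> back from the main data (what the <code>XLogRecGetData()</code> and <code>XLogRecGetDataLen()</code> macros return) only requires the <code>xl_heap_insert</code> layout: a 2-byte <code>OffsetNumber</code> followed by a 1-byte <code>flags</code> field, <code>SizeOfHeapInsert</code> being 3 bytes. Here is a standalone sketch mirroring that layout; the <code>demo_</code> names are hypothetical.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t demo_OffsetNumber;     /* OffsetNumber is a uint16 in postgres */

/* Same fields as postgres' xl_heap_insert (heapam_xlog.h) */
typedef struct demo_xl_heap_insert
{
    demo_OffsetNumber offnum;   /* 1-based position of the tuple in the page */
    uint8_t           flags;
} demo_xl_heap_insert;

/* SizeOfHeapInsert: no trailing padding is stored in the WAL */
#define DEMO_SIZE_OF_HEAP_INSERT (offsetof(demo_xl_heap_insert, flags) + sizeof(uint8_t))

/* Decode the main data area; returns 0 on success, -1 if it is too short */
static int
demo_read_heap_insert(const char *main_data, size_t main_len,
                      demo_xl_heap_insert *xlrec)
{
    if (main_len < DEMO_SIZE_OF_HEAP_INSERT)
        return -1;
    memcpy(&xlrec->offnum, main_data, sizeof(xlrec->offnum));
    xlrec->flags = (uint8_t) main_data[sizeof(xlrec->offnum)];
    return 0;
}
```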
<p>With all of that, it’s just a matter of accessing the tuple header and tuple at
the correct place among all the tuples present in the page, and saving them the
same way they would appear in a simple WAL record. If you’re wondering how
exactly it should be done, you can always look at how postgres itself does it
when it needs to return a specific tuple and adapt that code to your needs. The
functions responsible for that are <code class="language-plaintext highlighter-rouge">heapgetpage()</code> and <code class="language-plaintext highlighter-rouge">heapgettup()</code>, located
in the
<a href="https://github.com/postgres/postgres/blob/master/src/backend/access/heap/heapam.c">src/backend/access/heap/heapam.c</a>
file we already mentioned.</p>
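<p>For illustration, here is a standalone sketch of that step, not the code from the patch. It takes a page restored by <code>RestoreBlockImage()</code>, follows line pointer <strong>offnum</strong>, and rebuilds the same bytes a plain record would carry in block 0. It hardcodes, as assumptions valid for default builds, an 8kB block size, a 24-byte page header with <code>pd_lower</code> at offset 12, postgres’ <code>ItemIdData</code> bitfield layout, and the <code>t_infomask2</code>/<code>t_infomask</code>/<code>t_hoff</code> offsets (18, 20, 22) within the 23-byte tuple header. The <code>demo_</code> names are hypothetical.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ                   8192  /* default postgres block size */
#define DEMO_SIZE_OF_PAGE_HEADER      24    /* offsetof(PageHeaderData, pd_linp) */
#define DEMO_PD_LOWER_OFFSET          12    /* offset of pd_lower in PageHeaderData */
#define DEMO_SIZEOF_HEAP_TUPLE_HEADER 23    /* offsetof(HeapTupleHeaderData, t_bits) */
#define DEMO_SIZE_OF_HEAP_HEADER      5     /* SizeOfHeapHeader */
#define DEMO_INFOMASK2_OFFSET         18    /* t_infomask2, then t_infomask, then t_hoff */
#define DEMO_LP_NORMAL                1     /* line pointer in use, with storage */

/* Same bitfield layout as postgres' ItemIdData (itemid.h) */
typedef struct demo_ItemIdData
{
    unsigned lp_off:15,
             lp_flags:2,
             lp_len:15;
} demo_ItemIdData;

/*
 * Rebuild, from a full-page image, the block 0 data of a plain
 * XLOG_HEAP_INSERT record: t_infomask2, t_infomask, t_hoff, then everything
 * after the fixed 23-byte tuple header. Returns the number of bytes written
 * to out, or -1 on error.
 */
static long
demo_rebuild_from_fpi(const char *page, uint16_t offnum,
                      char *out, size_t out_size)
{
    demo_ItemIdData lp;
    uint16_t    pd_lower;
    size_t      max_off;
    size_t      payload;
    const char *tup;

    /* pd_lower marks the end of the line pointer array */
    memcpy(&pd_lower, page + DEMO_PD_LOWER_OFFSET, sizeof(pd_lower));
    if (pd_lower < DEMO_SIZE_OF_PAGE_HEADER)
        return -1;
    max_off = (size_t) (pd_lower - DEMO_SIZE_OF_PAGE_HEADER) / sizeof(demo_ItemIdData);

    /* offset numbers are 1-based: offnum N is entry N - 1 of the array */
    if (offnum < 1 || offnum > max_off)
        return -1;
    memcpy(&lp, page + DEMO_SIZE_OF_PAGE_HEADER + (size_t) (offnum - 1) * sizeof(lp),
           sizeof(lp));
    if (lp.lp_flags != DEMO_LP_NORMAL ||
        lp.lp_len < DEMO_SIZEOF_HEAP_TUPLE_HEADER ||
        lp.lp_off + lp.lp_len > DEMO_BLCKSZ)
        return -1;

    tup = page + lp.lp_off;
    payload = (size_t) (lp.lp_len - DEMO_SIZEOF_HEAP_TUPLE_HEADER);
    if (out_size < DEMO_SIZE_OF_HEAP_HEADER + payload)
        return -1;

    /* the 5 header bytes are contiguous in the tuple, in xl_heap_header order */
    memcpy(out, tup + DEMO_INFOMASK2_OFFSET, DEMO_SIZE_OF_HEAP_HEADER);
    /* null bitmap [+ padding] [+ oid] + data */
    memcpy(out + DEMO_SIZE_OF_HEAP_HEADER, tup + DEMO_SIZEOF_HEAP_TUPLE_HEADER, payload);
    return (long) (DEMO_SIZE_OF_HEAP_HEADER + payload);
}
```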
<p>We now have the information about the physical file location and the record
itself that we will need to transmit to another program to decode it. The best
way to do that is to simply save the record as-is in a binary file, and use the
file name to transmit the metadata. I chose the following pattern to name the
produced files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LSN.TABLESPACE_OID.DATABASE_OID.RELFILENODE.FORKNAME
</code></pre></div></div>
<p>It will be trivial for the consumer to parse it and extract the required
metadata. One thing to note is that I don’t put the <code class="language-plaintext highlighter-rouge">rmid</code> or the <code class="language-plaintext highlighter-rouge">opcode</code>
here, as I only emit the one kind of record I’m interested in and discard everything
else. If that’s not your case, you should definitely remember to add those to
the filename pattern.</p>
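<p>Here is a minimal sketch of both sides of that naming convention. The exact LSN formatting used in the patch isn’t described here, so this sketch assumes the two 32-bit halves are printed as 8-digit hex values separated by <code>-</code>, since postgres’ usual <code>X/X</code> notation contains a slash that can’t appear in a file name. The fork names follow postgres’ <code>ForkNumber</code> order; the <code>demo_</code> functions are hypothetical.</p>

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fork names in postgres' ForkNumber order: MAIN, FSM, VISIBILITYMAP, INIT */
static const char *const demo_fork_names[] = {"main", "fsm", "vm", "init"};

/*
 * Build "LSN.TABLESPACE_OID.DATABASE_OID.RELFILENODE.FORKNAME".
 * Assumption: LSN printed as HHHHHHHH-LLLLLLLL (no '/').
 * Returns snprintf's result, or -1 for an unknown fork number.
 */
static int
demo_record_filename(char *buf, size_t size, uint64_t lsn,
                     uint32_t spc_oid, uint32_t db_oid, uint32_t relfilenode,
                     int forknum)
{
    if (forknum < 0 || forknum > 3)
        return -1;
    return snprintf(buf, size,
                    "%08" PRIX32 "-%08" PRIX32 ".%" PRIu32 ".%" PRIu32 ".%" PRIu32 ".%s",
                    (uint32_t) (lsn >> 32), (uint32_t) lsn,
                    spc_oid, db_oid, relfilenode, demo_fork_names[forknum]);
}

/* Consumer side: parse the name back; returns 0 on success, -1 otherwise */
static int
demo_parse_filename(const char *name, uint64_t *lsn, uint32_t *spc_oid,
                    uint32_t *db_oid, uint32_t *relfilenode, char fork[8])
{
    uint32_t hi, lo;

    if (sscanf(name, "%8" SCNx32 "-%8" SCNx32 ".%" SCNu32 ".%" SCNu32 ".%" SCNu32 ".%7s",
               &hi, &lo, spc_oid, db_oid, relfilenode, fork) != 6)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}
```

<p>For instance, an INSERT at LSN <code>1/4A3B2C10</code> on relfilenode 16384 of database 5 in the default tablespace (1663) would be saved as <code>00000001-4A3B2C10.1663.5.16384.main</code> under these assumptions.</p>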
<p>Since this requires a bit of code to implement, I won’t detail it here, but you
can find the full result in the patch for pg_waldump that I’m attaching to
this article, which implements this as a new <strong>--save-records</strong> option.</p>
<p>To conclude, let me also remind you that a compiled version of pg_waldump will
only work for a single major postgres version. In my case, I had to work with
postgres 11, so you can <a href="/assets/patch/0001-Add-a-save-records-PATH-option-to-pg_waldump_pg11.patch">find the patch for this version
here</a>,
but if needed I also rebased it against the current commit on the master branch,
which <a href="/assets/patch/0001-Add-a-save-records-PATH-option-to-pg_waldump_pg17.patch">can be found
here</a>.</p>
<h3 id="whats-next">What’s next?</h3>
<p>This is the end of this first article. We saw some details on the postgres WAL
infrastructure, with a full example for the case of a plain INSERT on a heap
table. We also learned where to look to find where other WAL records are
generated and to see more details about the implementation.</p>
<p>We also checked how pg_waldump works and how to adapt it to our needs,
with a provided complete patch for both <a href="/assets/patch/0001-Add-a-save-records-PATH-option-to-pg_waldump_pg11.patch">postgres
11</a>
and <a href="/assets/patch/0001-Add-a-save-records-PATH-option-to-pg_waldump_pg17.patch">the current dev version (postgres
17)</a>.
Again, I’d like to remind you that all this work is only at a proof-of-concept
stage: it’s definitely not polished and I’m sure that there are many problems that
would need to be fixed. One obvious example of such a problem is that we’re
saving every INSERT we find in the WAL but we don’t check whether the transaction
it belongs to eventually committed. It would be possible to fix that, but it would
require additional code, so as is it’s up to the users to double check that as
needed. Overall, it was enough to recover the needed data, so I didn’t pursue
any more work on it.</p>
<p>In the next article we will see some usage of this new <strong>--save-records</strong>
option, and also how to read those records and decode them to generate plain
INSERT queries. Stay tuned!</p>
<p><a href="https://rjuju.github.io/postgresql/2023/12/06/extract-sql-from-wal.html">Extracting SQL from WAL? (part 1)</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on December 06, 2023.</p>
https://rjuju.github.io/postgresql/2020/11/17/queryid-reporting-in-plpgsql_check2020-11-17T02:42:33+00:002020-11-17T02:42:33+00:00Julien Rouhaudhttps://rjuju.github.io
<p>plpgsql_check version 1.14.0 was just released and brings some improvements for
performance diagnostics.</p>
<p>Thanks <strong>a lot</strong> to <a href="http://okbob.blogspot.com/">Pavel Stěhule</a> for the awesome
plpgsql_check extension and the help for implementing the queryid reporting in
v1.14!</p>
<h3 id="plpgsql_check-static-code-analysis-and-more">plpgsql_check: static code analysis and more</h3>
<p>PostgreSQL supports procedural code for many languages, the most popular one
probably being plpgsql.</p>
<p>Even if that language allows you to write raw SQL statements, any function
written in that language is still a black box as far as PostgreSQL is
concerned, which means that PostgreSQL won’t perform many checks for
code quality, typos or any other problem related to code development. That’s
where the <a href="https://github.com/okbob/plpgsql_check">plpgsql_check extension</a> comes
into play.</p>
<p>If you write any plpgsql code, this extension will be your best friend, as it
brings so many cool features. The major feature is static code analysis, which
can detect many bugs, security issues such as SQL injection, and even possible performance
issues by detecting implicit casts that could prevent PostgreSQL from using
indexes, and much more.</p>
<p>It also brings a simple yet very useful <strong>code profiler</strong>.</p>
<h3 id="how-to-track-down-performance-issue-in-plpgsql-code">How to track down performance issues in plpgsql code?</h3>
<p>As I mentioned above, plpgsql code is a black box as far as PostgreSQL is
concerned. The direct consequence is that the performance diagnostic
possibilities are quite limited.</p>
<p>Using core PostgreSQL, the only option is using <code class="language-plaintext highlighter-rouge">pg_stat_user_functions</code> (which
requires <code class="language-plaintext highlighter-rouge">track_functions</code> to be set to <strong>pl</strong> or <strong>all</strong>). It’ll show the
number of times each function has been called, and how long the execution took
including and excluding nested functions. Unfortunately, this view can only
help you track down <strong>which</strong> function is slow, but not <strong>why</strong>, as you
don’t get any per-instruction metric.</p>
<p>You can somewhat work around that limitation using the contrib extension
<a href="https://www.postgresql.org/docs/current/pgstatstatements.html">pg_stat_statements</a>.
This extension is one of the most popular ones as far as performance diagnostic
is concerned, and gives you a lot of data on query performance (including
<a href="/postgresql/2020/04/04/new-in-pg13-monitoring-query-planner.html">planning counters</a> and <a href="/postgresql/2020/04/07/new-in-pg13-WAL-monitoring.html">WAL counters</a> since PostgreSQL 13).</p>
<p>The only problem is that it can be quite tricky to match pg_stat_statements
entries with your plpgsql code, as there’s no way to directly identify which
queries are run inside your plpgsql code.</p>
<h3 id="plpgsql_check-code-profiler">plpgsql_check code profiler</h3>
<p>Another alternative is to use a plpgsql code profiler. There are multiple
extensions that bring this feature, and I personally chose
<a href="https://github.com/okbob/plpgsql_check">plpgsql_check</a>, as it perfectly suited
my needs: simple to set up and use, all the performance information I needed, and
the possibility to use it either on a per-connection basis or globally when
configuring the extension in <strong>shared_preload_libraries</strong>. Thanks to this
profiler, you can finally get performance metrics at the statement level
<strong>inside plpgsql code</strong>:</p>
<ul>
<li>total execution time, that is the cumulated execution time for all the
statements in the source code line</li>
<li>average execution time, that is the total execution time divided by the
number of statements in the source code line</li>
<li>maximum execution time, per statement</li>
<li>number of rows processed, per statement</li>
</ul>
<p>With this information, it becomes quite easy to track down the slow parts of
your functions. Here’s a simplistic example:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">lineno</span><span class="p">,</span> <span class="n">cmds_on_row</span><span class="p">,</span> <span class="n">total_time</span><span class="p">,</span> <span class="n">avg_time</span><span class="p">,</span> <span class="n">max_time</span><span class="p">,</span> <span class="k">source</span>
<span class="k">FROM</span> <span class="n">plpgsql_profiler_function_tb</span><span class="p">(</span><span class="s1">'pltest()'</span><span class="p">);</span>
<span class="n">lineno</span> <span class="o">|</span> <span class="n">cmds_on_row</span> <span class="o">|</span> <span class="n">total_time</span> <span class="o">|</span> <span class="n">avg_time</span> <span class="o">|</span> <span class="n">max_time</span> <span class="o">|</span> <span class="k">source</span>
<span class="c1">--------+-------------+------------+----------+------------------+-------------------------------------------------------</span>
<span class="mi">1</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span>
<span class="mi">2</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="k">DECLARE</span>
<span class="mi">3</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="n">num</span> <span class="nb">bigint</span><span class="p">;</span>
<span class="mi">4</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="n">_tbl</span> <span class="nb">text</span> <span class="o">=</span> <span class="s1">'pg_class'</span><span class="p">;</span>
<span class="mi">5</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">085</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">085</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">085</span><span class="p">}</span> <span class="o">|</span> <span class="k">BEGIN</span>
<span class="mi">6</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">504</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">504</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">504</span><span class="p">}</span> <span class="o">|</span> <span class="k">drop</span> <span class="k">table</span> <span class="n">if</span> <span class="k">exists</span> <span class="n">meh</span><span class="p">;</span>
<span class="mi">7</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">81</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">81</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">81</span><span class="p">}</span> <span class="o">|</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">meh</span><span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">);</span>
<span class="mi">8</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">362</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">362</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">362</span><span class="p">}</span> <span class="o">|</span> <span class="k">EXECUTE</span> <span class="s1">'SELECT COUNT(*) FROM '</span> <span class="o">||</span> <span class="n">_tbl</span> <span class="k">INTO</span> <span class="n">num</span><span class="p">;</span>
<span class="mi">9</span> <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span> <span class="mi">1000</span><span class="p">.</span><span class="mi">84</span> <span class="o">|</span> <span class="mi">500</span><span class="p">.</span><span class="mi">42</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">349</span><span class="p">,</span><span class="mi">1000</span><span class="p">.</span><span class="mi">491</span><span class="p">}</span> <span class="o">|</span> <span class="k">delete</span> <span class="k">from</span> <span class="n">meh</span><span class="p">;</span> <span class="n">PERFORM</span> <span class="n">pg_sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="mi">10</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="o">|</span> <span class="k">RETURN</span> <span class="n">num</span><span class="p">;</span>
<span class="mi">11</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="k">END</span><span class="p">;</span>
<span class="p">(</span><span class="mi">11</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure>
<p>In this example, we can immediately see that the slowdown comes from source
code line 9, which has a total execution time of about 1s. Using the <strong>max_time</strong>
field, we see that it’s because of the second statement. As we also have the
source code available in the view, we can immediately see the problematic
query, which here is a simple call to <code class="language-plaintext highlighter-rouge">pg_sleep(1)</code>.</p>
<p>So far so good. But with a less naive example, the cause of a slow execution might
be less obvious, and it could be handy to rely on all the available extensions
to get more information:
<a href="https://www.postgresql.org/docs/current/pgstatstatements.html">pg_stat_statements</a>
for general counters,
<a href="https://github.com/powa-team/pg_stat_kcache">pg_stat_kcache</a> for CPU and disk
usage counters,
<a href="https://github.com/postgrespro/pg_wait_sampling">pg_wait_sampling</a> for wait
events and so on.</p>
<p>But how can you match a plpgsql statement with entries in those extensions?</p>
<h3 id="exposing-queryid-in-plpgql_check-profiler">Exposing queryid in plpgsql_check profiler</h3>
<p>Indeed, those extensions identify queries using a <strong>query identifier</strong>,
computed by <strong>pg_stat_statements</strong>. You could try to manually find the related
entry using the query text stored by <strong>pg_stat_statements</strong>, but it may not
always be possible. What if the query is dynamic SQL or uses unqualified
names?</p>
<p>The solution here is quite simple: since the plpgsql_check profiler already shows
per-statement information, it can also report the statement’s underlying queryid.</p>
<p>This is now available with version 1.14.0. Using the previous naive example,
here’s what we now see:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">lineno</span><span class="p">,</span> <span class="n">max_time</span><span class="p">,</span> <span class="n">queryids</span><span class="p">,</span> <span class="k">source</span>
<span class="k">FROM</span> <span class="n">plpgsql_profiler_function_tb</span><span class="p">(</span><span class="s1">'pltest()'</span><span class="p">);</span>
<span class="n">lineno</span> <span class="o">|</span> <span class="n">max_time</span> <span class="o">|</span> <span class="n">queryids</span> <span class="o">|</span> <span class="k">source</span>
<span class="c1">--------+------------------+-------------------------------------------+-------------------------------------------------------</span>
<span class="mi">1</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span>
<span class="mi">2</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="k">DECLARE</span>
<span class="mi">3</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="n">num</span> <span class="nb">bigint</span><span class="p">;</span>
<span class="mi">4</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="n">_tbl</span> <span class="nb">text</span> <span class="o">=</span> <span class="s1">'pg_class'</span><span class="p">;</span>
<span class="mi">5</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">085</span><span class="p">}</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="k">BEGIN</span>
<span class="mi">6</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">504</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="k">NULL</span><span class="p">}</span> <span class="o">|</span> <span class="k">drop</span> <span class="k">table</span> <span class="n">if</span> <span class="k">exists</span> <span class="n">meh</span><span class="p">;</span>
<span class="mi">7</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">81</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="k">NULL</span><span class="p">}</span> <span class="o">|</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">meh</span><span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">);</span>
<span class="mi">8</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">362</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="o">-</span><span class="mi">7484655548452190292</span><span class="p">}</span> <span class="o">|</span> <span class="k">EXECUTE</span> <span class="s1">'SELECT COUNT(*) FROM '</span> <span class="o">||</span> <span class="n">_tbl</span> <span class="k">INTO</span> <span class="n">num</span><span class="p">;</span>
<span class="mi">9</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">349</span><span class="p">,</span><span class="mi">1000</span><span class="p">.</span><span class="mi">491</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="mi">8162364748417812595</span><span class="p">,</span><span class="mi">6729783856403017864</span><span class="p">}</span> <span class="o">|</span> <span class="k">delete</span> <span class="k">from</span> <span class="n">meh</span><span class="p">;</span> <span class="n">PERFORM</span> <span class="n">pg_sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="mi">10</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="k">RETURN</span> <span class="n">num</span><span class="p">;</span>
<span class="mi">11</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="k">END</span><span class="p">;</span>
<span class="p">(</span><span class="mi">11</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure>
<p>You’re now only a JOIN away from matching your plpgsql profile data with the
data from your favorite extensions!</p>
<h3 id="limitations">Limitations</h3>
<p>There are unfortunately some limitations.</p>
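<p>Due to the pg_stat_statements implementation, the queryid of DDL queries is not
exposed outside the extension, so plpgsql_check can’t retrieve it. This is why
lines 6 and 7 report <code class="language-plaintext highlighter-rouge">{NULL}</code> above.</p>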
<p>When using dynamic SQL, there might be <strong>many</strong> queries involved:</p>
<ul>
<li>the query text itself will be generated using SQL statement(s)</li>
<li>the parameters, if any, will also be resolved by running SQL statement(s)</li>
<li>if the query text depends on some parameters, you can end up with multiple
different top-level queries</li>
</ul>
<p>plpgsql_check will only report the top-level query identifier, and if multiple
different queries are generated, only the query identifier of the first one will
be reported.</p>
<p>Even with those limitations I still hope that this new feature will be helpful.</p>
<h3 id="whats-next">What’s next?</h3>
<p>Due to the current plpgsql implementation, when a dynamic SQL statement is executed
the query identifier is not visible outside plpgsql itself. This means that
retrieving the query identifier in that case is a bit costly, as plpgsql_check
has to redo some of the work that plpgsql is doing:</p>
<ul>
<li>generate the final query string</li>
<li>parse the query string</li>
<li>call the parse analysis step (this is where the query identifier is
generated)</li>
</ul>
<p>Of course the query itself won’t be executed or even planned, but those extra
steps might add non-negligible overhead, especially when the dynamic SQL is
executing very short OLTP-style queries.</p>
<p>So plpgsql should be modified to be able to report the query identifier of all
statements, whether static or dynamic, so external modules can access the
information easily and without any additional overhead. Ideally, this could
also be available in plpgsql code using a <strong>GET [ CURRENT ] DIAGNOSTICS</strong>
command, so users can also use it as they need.</p>
<p><a href="https://rjuju.github.io/postgresql/2020/11/17/queryid-reporting-in-plpgsql_check.html">Queryid reporting in plpgsql_check</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on November 17, 2020.</p>
https://rjuju.github.io/postgresql/2020/04/07/new-in-pg13-WAL-monitoring2020-04-07T15:46:15+00:002020-04-07T15:46:15+00:00Julien Rouhaudhttps://rjuju.github.io
<p>The Write-Ahead Log (WAL) is a critical part of PostgreSQL that ensures data
durability. While there are multiple <a href="https://www.postgresql.org/docs/current/runtime-config-wal.html">configuration parameters</a>
related to it, there was no easy way to monitor WAL activity, or to know what is
generating it.</p>
<h3 id="new-infrastructure-to-track-wal-activity">New infrastructure to track WAL activity</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit df3b181499b40523bd6244a4e5eb554acb9020ce
Author: Amit Kapila <[email protected]>
Date: Sat Apr 4 10:02:08 2020 +0530
Add infrastructure to track WAL usage.
This allows gathering the WAL generation statistics for each statement
execution. The three statistics that we collect are the number of WAL
records, the number of full page writes and the amount of WAL bytes
generated.
This helps the users who have write-intensive workload to see the impact
of I/O due to WAL. This further enables us to see approximately what
percentage of overall WAL is due to full page writes.
In the future, we can extend this functionality to allow us to compute the
the exact amount of WAL data due to full page writes.
This patch in itself is just an infrastructure to compute WAL usage data.
The upcoming patches will expose this data via explain, auto_explain,
pg_stat_statements and verbose (auto)vacuum output.
Author: Kirill Bychik, Julien Rouhaud
Reviewed-by: Dilip Kumar, Fujii Masao and Amit Kapila
Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
</code></pre></div></div>
<p>With this new infrastructure, each backend will track various information about
WAL generation: the number of WAL records, the size of WAL generated and the
number of full page images generated. It also makes sure that parallel
queries, both DML and utility statements (for now only CREATE INDEX and VACUUM),
are correctly handled.</p>
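<p>To make per-statement tracking more concrete, here’s a minimal C sketch. It
assumes each backend keeps cumulative counters, and that a statement’s WAL usage
is the difference between a snapshot taken before it runs and the counters
afterwards. The <code class="language-plaintext highlighter-rouge">WalCounters</code> struct and the
<code class="language-plaintext highlighter-rouge">wal_count_record()</code> / <code class="language-plaintext highlighter-rouge">wal_accum_diff()</code>
functions are hypothetical, only loosely modeled on PostgreSQL’s
<code class="language-plaintext highlighter-rouge">WalUsage</code>; this isn’t the actual implementation.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c">#include <assert.h>
#include <stdint.h>

/* Hypothetical per-backend WAL counters (loosely modeled on WalUsage). */
typedef struct WalCounters
{
    int64_t  records;   /* number of WAL records */
    int64_t  fpi;       /* number of full page images */
    uint64_t bytes;     /* amount of WAL generated, in bytes */
} WalCounters;

/* Cumulative counters for the whole backend, never reset. */
static WalCounters backend_wal;

/* Account for one WAL record inserted by this backend. */
static void wal_count_record(uint64_t len, int64_t num_fpi)
{
    assert(num_fpi >= 0);
    backend_wal.records++;
    backend_wal.fpi += num_fpi;
    backend_wal.bytes += len;
}

/* dst += after - before: attribute the activity between two snapshots. */
static void wal_accum_diff(WalCounters *dst, const WalCounters *after,
                           const WalCounters *before)
{
    dst->records += after->records - before->records;
    dst->fpi += after->fpi - before->fpi;
    dst->bytes += after->bytes - before->bytes;
}</code></pre></figure>
<p>A caller would copy <code class="language-plaintext highlighter-rouge">backend_wal</code> before running a
statement and call <code class="language-plaintext highlighter-rouge">wal_accum_diff()</code> once it’s done.
With parallel operations, the counters accumulated by each worker would similarly
have to be added back to the leader’s totals.</p>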
<h3 id="per-query-wal-activity-with-pg_stat_statements">Per-query WAL activity with pg_stat_statements</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 6b466bf5f2bea0c89fab54eef696bcfc7ecdafd7
Author: Amit Kapila <[email protected]>
Date: Sun Apr 5 07:34:04 2020 +0530
Allow pg_stat_statements to track WAL usage statistics.
This commit adds three new columns in pg_stat_statements output to
display WAL usage statistics added by commit df3b181499.
This commit doesn't bump the version of pg_stat_statements as the
same is done for this release in commit 17e0328224.
Author: Kirill Bychik and Julien Rouhaud
Reviewed-by: Julien Rouhaud, Fujii Masao, Dilip Kumar and Amit Kapila
Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
</code></pre></div></div>
<p>This basically exposes the newly tracked WAL activity information in
pg_stat_statements, that is, per (user, database, normalized query). Here is an
example:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">t1</span> <span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">);</span>
<span class="k">CREATE</span>
<span class="o">=#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">t1</span> <span class="k">SELECT</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">INSERT</span> <span class="mi">0</span> <span class="mi">1</span>
<span class="o">=#</span> <span class="k">UPDATE</span> <span class="n">t1</span> <span class="k">SET</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">2</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">UPDATE</span> <span class="mi">1</span>
<span class="o">=#</span> <span class="k">CHECKPOINT</span><span class="p">;</span>
<span class="k">CHECKPOINT</span>
<span class="o">=#</span> <span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="k">DELETE</span> <span class="mi">1</span>
<span class="o">=#</span> <span class="k">SELECT</span> <span class="n">query</span><span class="p">,</span> <span class="n">wal_records</span><span class="p">,</span> <span class="n">wal_bytes</span><span class="p">,</span> <span class="n">wal_num_fpw</span>
<span class="k">FROM</span> <span class="n">pg_stat_statements</span>
<span class="k">WHERE</span> <span class="n">query</span> <span class="k">LIKE</span> <span class="s1">'UPDATE%'</span> <span class="k">OR</span> <span class="n">query</span> <span class="k">LIKE</span> <span class="s1">'DELETE%'</span><span class="p">;</span>
<span class="n">query</span> <span class="o">|</span> <span class="n">wal_records</span> <span class="o">|</span> <span class="n">wal_bytes</span> <span class="o">|</span> <span class="n">wal_num_fpw</span>
<span class="c1">-------------------------------------+-------------+-----------+-------------</span>
<span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="err">$</span><span class="mi">1</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">155</span> <span class="o">|</span> <span class="mi">1</span>
<span class="k">UPDATE</span> <span class="n">t1</span> <span class="k">SET</span> <span class="n">id</span> <span class="o">=</span> <span class="err">$</span><span class="mi">1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="err">$</span><span class="mi">2</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">69</span> <span class="o">|</span> <span class="mi">0</span>
<span class="p">(</span><span class="mi">2</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure>
<p>I simply inserted a row, updated it and deleted it. Now, looking specifically
at the UPDATE and the DELETE, the numbers can be surprising.</p>
<p>When updating a row, we indeed expect a single WAL record and some WAL bytes
for the new row version, with some overhead due to the internal implementation.</p>
<p>Now, if you’re familiar with the PostgreSQL MVCC implementation, you should know
that a DELETE should only write a transaction id in the <code class="language-plaintext highlighter-rouge">xmax</code> field
(<a href="https://www.postgresql.org/docs/current/storage-page-layout.html">this documentation
page</a> is a
good introduction to that subject). So why does writing a 4-byte field (the size of the
recorded <code class="language-plaintext highlighter-rouge">xmax</code> field), even with some overhead, generate more than twice the
amount of WAL that was required to update a full row? That’s because the
DELETE caused a <a href="https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-FULL-PAGE-WRITES">full page
write</a>.
This is a side effect of performing a <strong>CHECKPOINT</strong> before the DELETE. To
guarantee data consistency (unless the <code class="language-plaintext highlighter-rouge">full_page_writes</code> parameter is
disabled), any block modified for the first time after a <strong>CHECKPOINT</strong>
will be fully logged, rather than logging only the delta.</p>
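<p>The rule above can be sketched in C. This is a simplified illustration: it
treats WAL positions (LSNs) as plain 64-bit integers and ignores special cases
such as online backups, which force full page writes even when
<code class="language-plaintext highlighter-rouge">full_page_writes</code> is off. The
<code class="language-plaintext highlighter-rouge">needs_full_page_image()</code> function is hypothetical; in
PostgreSQL the comparable check lives in <code class="language-plaintext highlighter-rouge">XLogRecordAssemble()</code>,
where a page needs a backup when its LSN isn’t newer than the current redo pointer.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c">#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* a WAL position (LSN), simplified */

/*
 * Hypothetical helper: does a modification of this page need a full image?
 * page_lsn: LSN of the last WAL record that touched the page.
 * redo_ptr: redo pointer of the last checkpoint.
 * A page whose last change isn't newer than the redo pointer is being
 * modified for the first time since that checkpoint.
 */
static bool needs_full_page_image(XLogRecPtr page_lsn, XLogRecPtr redo_ptr,
                                  bool full_page_writes)
{
    if (!full_page_writes)
        return false;
    return page_lsn <= redo_ptr;
}</code></pre></figure>
<p>In the example above, the UPDATE happened before the <strong>CHECKPOINT</strong>, so
the page’s LSN was older than the new redo pointer when the DELETE ran, hence the
full page image.</p>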
<p>You’ll also note that the full page image didn’t generate 8kB of data as you might
expect. This isn’t because of <code class="language-plaintext highlighter-rouge">wal_compression</code>, as I didn’t enable it, but
because the page is almost empty. Indeed, as an optimization, the “hole” in
a page (the unused space between <code class="language-plaintext highlighter-rouge">pd_lower</code> and
<code class="language-plaintext highlighter-rouge">pd_upper</code>) can be safely skipped in the WAL, as long as it’s a standard page. If
you’re curious, this is done in the <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xloginsert.c">XLogRecordAssemble() function</a>.
Here’s the relevant extract:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">static</span> <span class="n">XLogRecData</span> <span class="o">*</span>
<span class="n">XLogRecordAssemble</span><span class="p">(</span><span class="n">RmgrId</span> <span class="n">rmid</span><span class="p">,</span> <span class="n">uint8</span> <span class="n">info</span><span class="p">,</span>
<span class="n">XLogRecPtr</span> <span class="n">RedoRecPtr</span><span class="p">,</span> <span class="nb">bool</span> <span class="n">doPageWrites</span><span class="p">,</span>
<span class="n">XLogRecPtr</span> <span class="o">*</span><span class="n">fpw_lsn</span><span class="p">,</span> <span class="nb">int</span> <span class="o">*</span><span class="n">num_fpw</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">[...]</span>
<span class="cm">/*
* If needs_backup is true or WAL checking is enabled for current
* resource manager, log a full-page write for the current block.
*/</span>
<span class="n">include_image</span> <span class="o">=</span> <span class="n">needs_backup</span> <span class="o">||</span> <span class="p">(</span><span class="n">info</span> <span class="o">&</span> <span class="n">XLR_CHECK_CONSISTENCY</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">if</span> <span class="p">(</span><span class="n">include_image</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Page</span> <span class="n">page</span> <span class="o">=</span> <span class="n">regbuf</span><span class="o">-></span><span class="n">page</span><span class="p">;</span>
<span class="n">uint16</span> <span class="n">compressed_len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="cm">/*
* The page needs to be backed up, so calculate its hole length
* and offset.
*/</span>
<span class="n">if</span> <span class="p">(</span><span class="n">regbuf</span><span class="o">-></span><span class="n">flags</span> <span class="o">&</span> <span class="n">REGBUF_STANDARD</span><span class="p">)</span>
<span class="p">{</span>
<span class="cm">/* Assume we can omit data between pd_lower and pd_upper */</span>
<span class="n">uint16</span> <span class="k">lower</span> <span class="o">=</span> <span class="p">((</span><span class="n">PageHeader</span><span class="p">)</span> <span class="n">page</span><span class="p">)</span><span class="o">-></span><span class="n">pd_lower</span><span class="p">;</span>
<span class="n">uint16</span> <span class="k">upper</span> <span class="o">=</span> <span class="p">((</span><span class="n">PageHeader</span><span class="p">)</span> <span class="n">page</span><span class="p">)</span><span class="o">-></span><span class="n">pd_upper</span><span class="p">;</span>
<span class="n">if</span> <span class="p">(</span><span class="k">lower</span> <span class="o">>=</span> <span class="n">SizeOfPageHeaderData</span> <span class="o">&&</span>
<span class="k">upper</span> <span class="o">></span> <span class="k">lower</span> <span class="o">&&</span>
<span class="k">upper</span> <span class="o"><=</span> <span class="n">BLCKSZ</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">bimg</span><span class="p">.</span><span class="n">hole_offset</span> <span class="o">=</span> <span class="k">lower</span><span class="p">;</span>
<span class="n">cbimg</span><span class="p">.</span><span class="n">hole_length</span> <span class="o">=</span> <span class="k">upper</span> <span class="o">-</span> <span class="k">lower</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
<span class="cm">/* No "hole" to remove */</span>
<span class="n">bimg</span><span class="p">.</span><span class="n">hole_offset</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">cbimg</span><span class="p">.</span><span class="n">hole_length</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">[...]</span></code></pre></figure>
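<p>The effect of this hole-skipping logic can be estimated with a small
standalone C function applying the same checks. It assumes the default 8kB
<code class="language-plaintext highlighter-rouge">BLCKSZ</code>, a 24-byte page header (the usual value of
<code class="language-plaintext highlighter-rouge">SizeOfPageHeaderData</code>), a standard page and no
<code class="language-plaintext highlighter-rouge">wal_compression</code>; the <code class="language-plaintext highlighter-rouge">fpi_image_len()</code>
function is hypothetical.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c">#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192            /* assumed default block size */
#define PAGE_HEADER_SIZE 24    /* assumed SizeOfPageHeaderData */

/*
 * Hypothetical helper: bytes of page data kept in a full page image once
 * the hole between pd_lower and pd_upper is skipped, using the same
 * sanity checks as the extract above (no hole if they fail).
 */
static uint16_t fpi_image_len(uint16_t pd_lower, uint16_t pd_upper)
{
    uint16_t hole_length = 0;

    if (pd_lower >= PAGE_HEADER_SIZE &&
        pd_upper > pd_lower &&
        pd_upper <= BLCKSZ)
        hole_length = (uint16_t) (pd_upper - pd_lower);

    assert(hole_length < BLCKSZ);
    return (uint16_t) (BLCKSZ - hole_length);
}</code></pre></figure>
<p>For instance, a hypothetical page with a single line pointer
(<code class="language-plaintext highlighter-rouge">pd_lower</code> = 28) and a single 32-byte tuple
(<code class="language-plaintext highlighter-rouge">pd_upper</code> = 8160) would only need 60 bytes of page
data in its image, plus the record and block headers.</p>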
<h3 id="wal-activity-in-explain-and-auto_explain">WAL activity in EXPLAIN (and auto_explain)</h3>
<p>A new <code class="language-plaintext highlighter-rouge">WAL</code> option is available in the <strong>EXPLAIN</strong> command (it requires
the <strong>ANALYZE</strong> option), and similarly an
<code class="language-plaintext highlighter-rouge">auto_explain.log_wal</code> parameter for <strong>auto_explain</strong>, to display the same counters. In
TEXT mode, only the non-zero counters are shown, as for the other counters.
For instance:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span> <span class="n">WAL</span><span class="p">,</span> <span class="n">COSTS</span> <span class="k">OFF</span><span class="p">)</span> <span class="k">UPDATE</span> <span class="n">t1</span> <span class="k">SET</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">----------------------------------------------------------------</span>
<span class="k">Update</span> <span class="k">on</span> <span class="n">t1</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">181</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">181</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">WAL</span><span class="p">:</span> <span class="n">records</span><span class="o">=</span><span class="mi">1</span> <span class="n">bytes</span><span class="o">=</span><span class="mi">68</span>
<span class="o">-></span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">t1</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">074</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">080</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">Filter</span><span class="p">:</span> <span class="p">(</span><span class="n">id</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">274</span> <span class="n">ms</span>
<span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">381</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">6</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure>
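<p>The TEXT-mode rule of only showing non-zero counters can be illustrated with
the following C sketch. <code class="language-plaintext highlighter-rouge">format_wal_usage()</code> is a
hypothetical helper rather than PostgreSQL’s EXPLAIN code, and the
<code class="language-plaintext highlighter-rouge">fpi=</code> label used for full page images is an assumption
made for this example.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c">#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format WAL counters in a TEXT-mode style, printing
 * only the non-zero ones, and nothing at all if every counter is zero.
 */
static void format_wal_usage(char *buf, size_t size, long long records,
                             long long fpi, unsigned long long bytes)
{
    size_t len = 0;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';

    if (records == 0 && fpi == 0 && bytes == 0)
        return;

    len += (size_t) snprintf(buf, size, "WAL:");
    /* Each append is skipped if a previous one already filled the buffer. */
    if (records > 0 && len < size)
        len += (size_t) snprintf(buf + len, size - len, " records=%lld", records);
    if (fpi > 0 && len < size)
        len += (size_t) snprintf(buf + len, size - len, " fpi=%lld", fpi);
    if (bytes > 0 && len < size)
        snprintf(buf + len, size - len, " bytes=%llu", bytes);
}</code></pre></figure>
<p>With 1 record, no full page image and 68 bytes, it produces
<code class="language-plaintext highlighter-rouge">WAL: records=1 bytes=68</code>, the same line as in the plan above.</p>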
<h3 id="wal-activity-in-autovacuum-logs">WAL activity in autovacuum logs</h3>
<p>And finally, if autovacuum logs its activity (when a run exceeds the
<code class="language-plaintext highlighter-rouge">log_autovacuum_min_duration</code> threshold), the same information is logged.
For instance, after inserting 100k rows in the same table, deleting half of
them and running a <strong>CHECKPOINT</strong>, here’s the output I get:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">LOG</span><span class="p">:</span> <span class="n">automatic</span> <span class="k">vacuum</span> <span class="k">of</span> <span class="k">table</span> <span class="nv">"rjuju.public.t1"</span><span class="p">:</span> <span class="k">index</span> <span class="n">scans</span><span class="p">:</span> <span class="mi">0</span>
<span class="n">pages</span><span class="p">:</span> <span class="mi">0</span> <span class="n">removed</span><span class="p">,</span> <span class="mi">443</span> <span class="n">remain</span><span class="p">,</span> <span class="mi">0</span> <span class="n">skipped</span> <span class="n">due</span> <span class="k">to</span> <span class="n">pins</span><span class="p">,</span> <span class="mi">0</span> <span class="n">skipped</span> <span class="n">frozen</span>
<span class="n">tuples</span><span class="p">:</span> <span class="mi">50000</span> <span class="n">removed</span><span class="p">,</span> <span class="mi">50001</span> <span class="n">remain</span><span class="p">,</span> <span class="mi">0</span> <span class="k">are</span> <span class="n">dead</span> <span class="n">but</span> <span class="k">not</span> <span class="n">yet</span> <span class="n">removable</span><span class="p">,</span> <span class="n">oldest</span> <span class="n">xmin</span><span class="p">:</span> <span class="mi">496</span>
<span class="n">buffer</span> <span class="k">usage</span><span class="p">:</span> <span class="mi">912</span> <span class="n">hits</span><span class="p">,</span> <span class="mi">3</span> <span class="n">misses</span><span class="p">,</span> <span class="mi">448</span> <span class="n">dirtied</span>
<span class="k">avg</span> <span class="k">read</span> <span class="n">rate</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">084</span> <span class="n">MB</span><span class="o">/</span><span class="n">s</span><span class="p">,</span> <span class="k">avg</span> <span class="k">write</span> <span class="n">rate</span><span class="p">:</span> <span class="mi">12</span><span class="p">.</span><span class="mi">485</span> <span class="n">MB</span><span class="o">/</span><span class="n">s</span>
<span class="k">system</span> <span class="k">usage</span><span class="p">:</span> <span class="n">CPU</span><span class="p">:</span> <span class="k">user</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">17</span> <span class="n">s</span><span class="p">,</span> <span class="k">system</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">00</span> <span class="n">s</span><span class="p">,</span> <span class="n">elapsed</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">28</span> <span class="n">s</span>
<span class="n">WAL</span> <span class="k">usage</span><span class="p">:</span> <span class="mi">1330</span> <span class="n">records</span><span class="p">,</span> <span class="mi">445</span> <span class="k">full</span> <span class="n">page</span> <span class="n">writes</span><span class="p">,</span> <span class="mi">2197104</span> <span class="n">bytes</span></code></pre></figure>
<p>This new log output is in my opinion especially important when it
comes to <a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND">anti-wraparound / FREEZE
vacuum</a>.
Indeed, by nature an anti-wraparound VACUUM is more likely to touch blocks that
haven’t been modified for a long time, as it’s triggered by tuples older than
200M transactions (the default <code class="language-plaintext highlighter-rouge">autovacuum_freeze_max_age</code>). Even though it’s only setting a flag
bit to mark each tuple as frozen, if that block wasn’t modified since the last
<strong>CHECKPOINT</strong>, this bit will be amplified into a <strong>full page image</strong>, which is
way more data.</p>
<p>With this new feature, it’s now possible to properly monitor WAL
generation, which will help you better tune your instances!</p>
<p><a href="https://rjuju.github.io/postgresql/2020/04/07/new-in-pg13-WAL-monitoring.html">New in pg13: WAL monitoring</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 07, 2020.</p>
https://rjuju.github.io/postgresql/2020/04/04/new-in-pg13-monitoring-query-planner2020-04-04T12:06:15+00:002020-04-04T12:06:15+00:00Julien Rouhaudhttps://rjuju.github.io
<p>Depending on your workload, the planning time can represent a significant part
of the overall query processing time. This is especially important for OLTP
workloads, but OLAP queries joining numerous tables with an aggressive
configuration of the JOIN order search can also lead to high planning time.</p>
<h3 id="planning-counters-in-pg_stat_statements">Planning counters in pg_stat_statements</h3>
<p>Previously, pg_stat_statements only kept track of the execution part
of query processing: the number of executions and the cumulated time, but also
the minimum, maximum and mean time, as well as the standard deviation. With
PostgreSQL 13, you’ll also get those metrics for the planning part, as long as
<code class="language-plaintext highlighter-rouge">pg_stat_statements.track_planning</code> is enabled!</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 17e03282241c6ac58a714eb0c3b6a8018cf6167a
Author: Fujii Masao <[email protected]>
Date: Thu Apr 2 11:20:19 2020 +0900
Allow pg_stat_statements to track planning statistics.
This commit makes pg_stat_statements support new GUC
pg_stat_statements.track_planning. If this option is enabled,
pg_stat_statements tracks the planning statistics of the statements,
e.g., the number of times the statement was planned, the total time
spent planning the statement, etc. This feature is useful to check
the statements that it takes a long time to plan. Previously since
pg_stat_statements tracked only the execution statistics, we could
not use that for the purpose.
The planning and execution statistics are stored at the end of
each phase separately. So there are not always one-to-one relationship
between them. For example, if the statement is successfully planned
but fails in the execution phase, only its planning statistics are stored.
This may cause the users to be able to see different pg_stat_statements
results from the previous version. To avoid this,
pg_stat_statements.track_planning needs to be disabled.
This commit bumps the version of pg_stat_statements to 1.8
since it changes the definition of pg_stat_statements function.
Author: Julien Rouhaud, Pascal Legrand, Thomas Munro, Fujii Masao
Reviewed-by: Sergei Kornilov, Tomas Vondra, Yoshikazu Imai, Haribabu Kommi, Tom Lane
Discussion: https://postgr.es/m/CAHGQGwFx_=DO-Gu-MfPW3VQ4qC7TfVdH2zHmvZfrGv6fQ3D-Tw@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=0e59Y_6Q_YXYCTHZkqOc6H2pJ54C_Xe=VFu50Aqqp_sA@mail.gmail.com
Discussion: https://postgr.es/m/DB6PR0301MB21352F6210E3B11934B0DCC790B00@DB6PR0301MB2135.eurprd03.prod.outlook.com
</code></pre></div></div>
<p>Keep in mind that even a simple query can have a surprisingly high planning
time. One of the frequent causes used to be the <code class="language-plaintext highlighter-rouge">get_actual_variable_range()</code>
function, which is called when the planner wants to know the minimum
and maximum values of a specific column. This function checks whether a suitable
index exists and, if there’s one, uses it to get the wanted values. However, when
there were a lot of uncommitted values at the end of the index range, it could
take a significant amount of time to find a visible value. While this problem
was fixed long ago (see <a href="https://github.com/postgres/postgres/commit/fccebe421d0c410e6378fb281419442c84759213">this
commit</a>
and <a href="https://github.com/postgres/postgres/commit/3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd">this other
commit</a>
for more details), there are still some cases where the planning time is higher
than you’d expect, so having an easy way to monitor the planning
metrics is worthwhile.</p>
<p>This feature can also help you find out how much you’re relying on the <a href="https://www.postgresql.org/docs/current/sql-prepare.html">generic
plan feature</a>, and
how much of a difference it makes.</p>
<p>Let’s look at a simple example showing the effect of generic plans with prepared
statements:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">PREPARE</span> <span class="n">s1</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">pg_class</span><span class="p">;</span>
<span class="k">PREPARE</span>
<span class="o">=#</span> <span class="k">EXECUTE</span> <span class="n">s1</span><span class="p">;</span>
<span class="k">count</span>
<span class="c1">-------</span>
<span class="mi">387</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
<span class="p">[...</span> <span class="mi">5</span> <span class="k">more</span> <span class="n">times</span> <span class="p">...]</span>
<span class="o">=#</span> <span class="k">SELECT</span> <span class="n">query</span><span class="p">,</span> <span class="n">plans</span><span class="p">,</span> <span class="n">total_plan_time</span><span class="p">,</span> <span class="n">total_plan_time</span> <span class="o">/</span> <span class="n">plans</span> <span class="k">AS</span> <span class="n">avg_plan</span><span class="p">,</span>
<span class="n">calls</span><span class="p">,</span> <span class="n">total_exec_time</span><span class="p">,</span> <span class="n">total_exec_time</span> <span class="o">/</span> <span class="n">calls</span> <span class="k">AS</span> <span class="n">avg_exec</span>
<span class="k">FROM</span> <span class="n">pg_stat_statements</span>
<span class="k">WHERE</span> <span class="n">query</span> <span class="k">ILIKE</span> <span class="s1">'%SELECT count(*) FROM pg_class%'</span><span class="p">;</span>
<span class="o">-</span><span class="p">[</span> <span class="n">RECORD</span> <span class="mi">1</span> <span class="p">]</span><span class="c1">---+--------------------------------------------</span>
<span class="n">query</span> <span class="o">|</span> <span class="k">PREPARE</span> <span class="n">s1</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">pg_class</span>
<span class="n">plans</span> <span class="o">|</span> <span class="mi">1</span>
<span class="n">total_plan_time</span> <span class="o">|</span> <span class="mi">2</span><span class="p">.</span><span class="mi">119496</span>
<span class="n">avg_plan</span> <span class="o">|</span> <span class="mi">2</span><span class="p">.</span><span class="mi">119496</span>
<span class="n">calls</span> <span class="o">|</span> <span class="mi">6</span>
<span class="n">total_exec_time</span> <span class="o">|</span> <span class="mi">3</span><span class="p">.</span><span class="mi">4918280000000004</span>
<span class="n">avg_exec</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5819713333333334</span></code></pre></figure>
<p>While the query was executed 6 times, it was actually planned only once (since
there’s no parameter, a generic plan is always used). While the execution time
is on average slightly more than half a millisecond, the single planning was
about <strong>3.6 times</strong> more expensive. By avoiding 5 plannings, postgres saved
roughly <strong>10ms</strong>.</p>
<h3 id="planning-buffers-in-explain">Planning buffers in EXPLAIN</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit ce77abe63cfc85fb0bc236deb2cc34ae35cb5324
Author: Fujii Masao <[email protected]>
Date: Sat Apr 4 03:13:17 2020 +0900
Include information on buffer usage during planning phase, in EXPLAIN output, take two.
When BUFFERS option is enabled, EXPLAIN command includes the information
on buffer usage during each plan node, in its output. In addition to that,
this commit makes EXPLAIN command include also the information on
buffer usage during planning phase, in its output. This feature makes it
easier to discern the cases where lots of buffer access happen during
planning.
This commit revives the original commit ed7a509571 that was reverted by
commit 19db23bcbd. The original commit had to be reverted because
it caused the regression test failure on the buildfarm members prion and
dory. But since commit c0885c4c30 got rid of the caues of the test failure,
the original commit can be safely introduced again.
Author: Julien Rouhaud, slightly revised by Fujii Masao
Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/[email protected]
</code></pre></div></div>
<p>Following the same idea, EXPLAIN will now also display the buffer usage of the
planning phase if the <code class="language-plaintext highlighter-rouge">BUFFERS</code> option is used. If you try that on a fresh connection, before
any catalog cache is populated, you may be surprised by how many buffers
are accessed even for a simple query:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="n">BUFFERS</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">,</span> <span class="n">COSTS</span> <span class="k">OFF</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pg_class</span><span class="p">;</span>
<span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">---------------------------------------------------------------------------------------------------------</span>
<span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pg_class</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">028</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">410</span> <span class="k">rows</span><span class="o">=</span><span class="mi">388</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">Buffers</span><span class="p">:</span> <span class="n">shared</span> <span class="n">hit</span><span class="o">=</span><span class="mi">13</span>
<span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">5</span><span class="p">.</span><span class="mi">157</span> <span class="n">ms</span>
<span class="n">Buffers</span><span class="p">:</span> <span class="n">shared</span> <span class="n">hit</span><span class="o">=</span><span class="mi">118</span>
<span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">257</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">5</span> <span class="k">rows</span><span class="p">)</span>
<span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="n">BUFFERS</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">,</span> <span class="n">COSTS</span> <span class="k">OFF</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pg_class</span><span class="p">;</span>
<span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">------------------------------------------------------------------</span>
<span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pg_class</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">035</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">413</span> <span class="k">rows</span><span class="o">=</span><span class="mi">388</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">Buffers</span><span class="p">:</span> <span class="n">shared</span> <span class="n">hit</span><span class="o">=</span><span class="mi">13</span>
<span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">393</span> <span class="n">ms</span>
<span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">670</span> <span class="n">ms</span></code></pre></figure>
<p>We can see here that populating the caches (relation, columns, datatypes…)
accesses 118 blocks, which probably accounts for a significant part of the almost 5
extra ms of planning time seen in the first EXPLAIN output.</p>
<p><a href="https://rjuju.github.io/postgresql/2020/04/04/new-in-pg13-monitoring-query-planner.html">New in pg13: Monitoring the query planner</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 04, 2020.</p>
https://rjuju.github.io/postgresqlfr/2020/03/08/nouveau-dans-pg13-leader_pid2020-03-08T05:33:26+00:002020-03-08T05:33:26+00:00Julien Rouhaudhttps://rjuju.github.io
<h3 id="nouvelle-colonne-leader_pid-dans-la-vue-pg_stat_activity">New leader_pid column in the pg_stat_activity view</h3>
<p>Surprisingly, ever since parallel query was added in PostgreSQL 9.6, it has
been impossible to know which client backend a parallel worker is linked to.
As <a href="https://twitter.com/g_lelarge/status/1209486212190343168">Guillaume pointed
out</a>, this makes it
quite hard to build simple tools that sample the wait events of all the
processes involved in a query. A simple solution to this problem is to expose
the <code class="language-plaintext highlighter-rouge">lock group leader</code> information available in the client backend at the
SQL level:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit b025f32e0b5d7668daec9bfa957edf3599f4baa8
Author: Michael Paquier <[email protected]>
Date: Thu Feb 6 09:18:06 2020 +0900
Add leader_pid to pg_stat_activity
This new field tracks the PID of the group leader used with parallel
query. For parallel workers and the leader, the value is set to the
PID of the group leader. So, for the group leader, the value is the
same as its own PID. Note that this reflects what PGPROC stores in
shared memory, so as leader_pid is NULL if a backend has never been
involved in parallel query. If the backend is using parallel query or
has used it at least once, the value is set until the backend exits.
Author: Julien Rouhaud
Reviewed-by: Sergei Kornilov, Guillaume Lelarge, Michael Paquier, Tomas
Vondra
Discussion: https://postgr.es/m/CAOBaU_Yy5bt0vTPZ2_LUM6cUcGeqmYNoJ8-Rgto+c2+w3defYA@mail.gmail.com
</code></pre></div></div>
<p>With this change, it’s now very easy to find all the processes involved in a
parallel query. For instance:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">query</span><span class="p">,</span> <span class="n">leader_pid</span><span class="p">,</span>
<span class="n">array_agg</span><span class="p">(</span><span class="n">pid</span><span class="p">)</span> <span class="n">filter</span><span class="p">(</span><span class="k">WHERE</span> <span class="n">leader_pid</span> <span class="o">!=</span> <span class="n">pid</span><span class="p">)</span> <span class="k">AS</span> <span class="n">members</span>
<span class="k">FROM</span> <span class="n">pg_stat_activity</span>
<span class="k">WHERE</span> <span class="n">leader_pid</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">query</span><span class="p">,</span> <span class="n">leader_pid</span><span class="p">;</span>
<span class="n">query</span> <span class="o">|</span> <span class="n">leader_pid</span> <span class="o">|</span> <span class="n">members</span>
<span class="c1">-------------------+------------+---------------</span>
<span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">t1</span><span class="p">;</span> <span class="o">|</span> <span class="mi">31630</span> <span class="o">|</span> <span class="p">{</span><span class="mi">32269</span><span class="p">,</span><span class="mi">32268</span><span class="p">}</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span></code></pre></figure>
<p>Be careful though: as stated in the commit message, if the
<code class="language-plaintext highlighter-rouge">leader_pid</code> column has the same value as the <code class="language-plaintext highlighter-rouge">pid</code> column, it doesn’t necessarily
mean that the client backend is currently running a parallel query, since once
the field is set it’s never reset. Also, to avoid any overhead, no additional
lock is held while displaying this data. This means that each row is processed
independently. So, although it’s very unlikely, you can get inconsistent data
in some circumstances, for instance a parallel worker pointing to a pid that
has already disconnected.</p>
<p><a href="https://rjuju.github.io/postgresqlfr/2020/03/08/nouveau-dans-pg13-leader_pid.html">New in pg13: leader_pid column in pg_stat_activity</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on March 08, 2020.</p>
https://rjuju.github.io/postgresql/2020/02/28/pg_qualstats-2-selectivity-error2020-02-28T12:37:04+00:002020-02-28T12:37:04+00:00Julien Rouhaudhttps://rjuju.github.io
<p>Selectivity estimation error is one of the main causes of bad query plans. It’s
quite straightforward to compute those estimation errors using <code class="language-plaintext highlighter-rouge">EXPLAIN
(ANALYZE)</code>, either manually or with the help of
<a href="https://explain.depesz.com/">explain.depesz.com</a> (or other similar tools),
but until now there was no tool available to get this information
automatically and globally. Version 2 of pg_qualstats fixes that, thanks a
lot to <a href="https://twitter.com/obartunov">Oleg Bartunov</a> for the original idea!</p>
<p>Note: If you don’t know the pg_qualstats extension, you may want to read <a href="/postgresql/2020/01/06/pg_qualstats-2-global-index-advisor.html">my last
article about it</a>.</p>
<h3 id="the-problem">The problem</h3>
<p>There can be many causes for this issue: outdated statistics, complex
predicates, non-uniform data… But whatever the reason, if the optimizer
doesn’t have an accurate idea of how much data each predicate will filter, the
result is the same: a bad query plan, which can lead to longer query execution.</p>
<p>To illustrate the problem, I’ll use a simple test case, deliberately built
to fool the optimizer.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">pgqs</span> <span class="k">AS</span>
<span class="k">SELECT</span> <span class="n">i</span><span class="o">%</span><span class="mi">2</span> <span class="n">val1</span> <span class="p">,</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">%</span><span class="mi">2</span> <span class="n">val2</span>
<span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">50000</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="mi">50000</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">VACUUM</span> <span class="k">ANALYZE</span> <span class="n">pgqs</span><span class="p">;</span>
<span class="k">VACUUM</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">ANALYZE</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pgqs</span> <span class="k">WHERE</span> <span class="n">val1</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">AND</span> <span class="n">val2</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">--------------------------------------------------------------------</span>
<span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pgqs</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">12500</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">Filter</span><span class="p">:</span> <span class="p">((</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span> <span class="k">AND</span> <span class="p">(</span><span class="n">val2</span> <span class="o">=</span> <span class="mi">1</span><span class="p">))</span>
<span class="k">Rows</span> <span class="n">Removed</span> <span class="k">by</span> <span class="n">Filter</span><span class="p">:</span> <span class="mi">50000</span>
<span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">553</span> <span class="n">ms</span>
<span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">38</span><span class="p">.</span><span class="mi">062</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">5</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure>
<p>Here postgres thinks that the query will emit 12500 rows, while in reality
none is emitted. If you’re wondering how postgres came up with that
number, the explanation is simple. When multiple clauses are AND-ed and no
extended statistics are available (see below for more about it), postgres
assumes they are independent and simply multiplies each clause’s selectivity
(only range predicates on the same column get merged instead). This is done in
<code class="language-plaintext highlighter-rouge">clauselist_selectivity_simple</code>, in
<a href="https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/clausesel.c">src/backend/optimizer/path/clausesel.c</a>:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="n">Selectivity</span>
<span class="nf">clauselist_selectivity_simple</span><span class="p">(</span><span class="n">PlannerInfo</span> <span class="o">*</span><span class="n">root</span><span class="p">,</span>
<span class="n">List</span> <span class="o">*</span><span class="n">clauses</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">varRelid</span><span class="p">,</span>
<span class="n">JoinType</span> <span class="n">jointype</span><span class="p">,</span>
<span class="n">SpecialJoinInfo</span> <span class="o">*</span><span class="n">sjinfo</span><span class="p">,</span>
<span class="n">Bitmapset</span> <span class="o">*</span><span class="n">estimatedclauses</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Selectivity</span> <span class="n">s1</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">[...]</span>
<span class="cm">/*
* Anything that doesn't look like a potential rangequery clause gets
* multiplied into s1 and forgotten. Anything that does gets inserted into
* an rqlist entry.
*/</span>
<span class="n">listidx</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="n">foreach</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">clauses</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">[...]</span>
<span class="cm">/* Always compute the selectivity using clause_selectivity */</span>
<span class="n">s2</span> <span class="o">=</span> <span class="n">clause_selectivity</span><span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">clause</span><span class="p">,</span> <span class="n">varRelid</span><span class="p">,</span> <span class="n">jointype</span><span class="p">,</span> <span class="n">sjinfo</span><span class="p">);</span>
<span class="p">[...]</span>
<span class="cm">/*
* If it's not a "<"/"<="/">"/">=" operator, just merge the
* selectivity in generically. But if it's the right oprrest,
* add the clause to rqlist for later processing.
*/</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">get_oprrest</span><span class="p">(</span><span class="n">expr</span><span class="o">-></span><span class="n">opno</span><span class="p">))</span>
<span class="p">{</span>
<span class="p">[...]</span>
<span class="nl">default:</span>
<span class="cm">/* Just merge the selectivity in generically */</span>
<span class="n">s1</span> <span class="o">=</span> <span class="n">s1</span> <span class="o">*</span> <span class="n">s2</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">[...]</span></code></pre></figure>
<p>In this case, each predicate independently filters approximately 50% of the
table, as we can see in the <strong>pg_stats view</strong>:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">tablename</span><span class="p">,</span> <span class="n">attname</span><span class="p">,</span> <span class="n">most_common_vals</span><span class="p">,</span> <span class="n">most_common_freqs</span>
<span class="k">FROM</span> <span class="n">pg_stats</span> <span class="k">WHERE</span> <span class="n">tablename</span> <span class="o">=</span> <span class="s1">'pgqs'</span><span class="p">;</span>
<span class="n">tablename</span> <span class="o">|</span> <span class="n">attname</span> <span class="o">|</span> <span class="n">most_common_vals</span> <span class="o">|</span> <span class="n">most_common_freqs</span>
<span class="c1">-----------+---------+------------------+-------------------------</span>
<span class="n">pgqs</span> <span class="o">|</span> <span class="n">val1</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">50116664</span><span class="p">,</span><span class="mi">0</span><span class="p">.</span><span class="mi">49883333</span><span class="p">}</span>
<span class="n">pgqs</span> <span class="o">|</span> <span class="n">val2</span> <span class="o">|</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">50116664</span><span class="p">,</span><span class="mi">0</span><span class="p">.</span><span class="mi">49883333</span><span class="p">}</span>
<span class="p">(</span><span class="mi">2</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure>
<p>So when using both clauses, the estimate is 25% of the table, since postgres
doesn’t know <strong>by default</strong> that both predicates are mutually exclusive.
Continuing with this artificial test case, let’s see what happens if we add a
<em>join</em> on top of it, for instance by joining the table to itself on the <code class="language-plaintext highlighter-rouge">val1</code>
column only. For clarity, I’ll use <strong>t1</strong> for the table on which I’m applying
the mutually exclusive predicates, and <strong>t2</strong> for the joined table:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="k">ANALYZE</span> <span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">pgqs</span> <span class="n">t1</span>
<span class="k">JOIN</span> <span class="n">pgqs</span> <span class="n">t2</span> <span class="k">ON</span> <span class="n">t1</span><span class="p">.</span><span class="n">val1</span> <span class="o">=</span> <span class="n">t2</span><span class="p">.</span><span class="n">val1</span>
<span class="k">WHERE</span> <span class="n">t1</span><span class="p">.</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">AND</span> <span class="n">t1</span><span class="p">.</span><span class="n">val2</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">-----------------------------------------------------------------------------------</span>
<span class="n">Nested</span> <span class="n">Loop</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">313475000</span> <span class="n">width</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="o">-></span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pgqs</span> <span class="n">t2</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">25078</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">25000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">Filter</span><span class="p">:</span> <span class="p">(</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">Rows</span> <span class="n">Removed</span> <span class="k">by</span> <span class="n">Filter</span><span class="p">:</span> <span class="mi">25000</span>
<span class="o">-></span> <span class="n">Materialize</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">12500</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">25000</span><span class="p">)</span>
<span class="o">-></span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pgqs</span> <span class="n">t1</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">12500</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">Filter</span><span class="p">:</span> <span class="p">((</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">AND</span> <span class="p">(</span><span class="n">val2</span> <span class="o">=</span> <span class="mi">0</span><span class="p">))</span>
<span class="k">Rows</span> <span class="n">Removed</span> <span class="k">by</span> <span class="n">Filter</span><span class="p">:</span> <span class="mi">50000</span>
<span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">943</span> <span class="n">ms</span>
<span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">86</span><span class="p">.</span><span class="mi">757</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">14</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure>
<p>Postgres thinks that this join will emit <strong>313 million rows</strong>, while actually no rows are emitted. This is a good example of how bad assumptions can lead to an inefficient plan.</p>
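<p>The 313 million figure is simply the product of the two input estimates shown in the plan. The Nested Loop node has no join filter, which is consistent with both sides already being restricted to <code class="language-plaintext highlighter-rouge">val1 = 0</code>: every pair of rows is assumed to match, so the join behaves as a cross product of the filtered inputs:</p>

```latex
\underbrace{25078}_{\text{Seq Scan on t2}} \times \underbrace{12500}_{\text{Materialize (t1)}}
  = 313\,475\,000 \ \text{estimated rows},
\qquad
25000 \times 0 = 0 \ \text{actual rows}
```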
<p>Here Postgres can deduce that the <code class="language-plaintext highlighter-rouge">val1 = 0</code> predicate can also be applied to
<strong>t2</strong>. So how should it join two relations, one expected to emit 25000 tuples and
the other 12500 tuples, with no index available? A nested
loop is not a bad choice, as neither relation is really big. As no index is
available, Postgres also chooses to <strong>materialize</strong> the inner relation, meaning
storing it in memory, to make it more efficient. As it tries to limit memory
consumption as much as possible, the smaller relation is materialized, and
that’s the mistake here.</p>
<p>Indeed, Postgres will read the whole table twice: once to get every row
matching the <code class="language-plaintext highlighter-rouge">val1 = 0</code> predicate for the outer relation, and once to
find all the rows to materialize. If the opposite had been done, as would
probably have happened with more realistic estimates, the table would only
have been read once.</p>
<p>In this case, as the dataset is small and quite artificial, a better plan
wouldn’t drastically change the execution time. But keep in mind that in
real production environments, it could mean choosing a nested loop assuming
that there’ll be only a couple of rows to loop on, while in reality the backend
spends minutes or even hours looping over millions of rows, when another
plan would have been orders of magnitude quicker.</p>
<h3 id="detecting-the-problem">Detecting the problem</h3>
<p>pg_qualstats 2 now computes the selectivity estimation error, both as a
ratio and as a raw number, and keeps track, for each predicate, of the minimum,
maximum and mean values, along with the standard deviation. This makes it quite
simple to detect problematic quals!</p>
<p>After executing the last query, here’s what the <code class="language-plaintext highlighter-rouge">pg_qualstats</code> view will
return:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">relname</span><span class="p">,</span> <span class="n">attname</span><span class="p">,</span> <span class="n">opno</span><span class="p">::</span><span class="n">regoper</span><span class="p">,</span> <span class="n">qualid</span><span class="p">,</span> <span class="n">qualnodeid</span><span class="p">,</span>
<span class="n">mean_err_estimate_ratio</span> <span class="n">mean_ratio</span><span class="p">,</span> <span class="n">mean_err_estimate_num</span> <span class="n">mean_num</span><span class="p">,</span> <span class="n">constvalue</span>
<span class="k">FROM</span> <span class="n">pg_qualstats</span> <span class="n">pgqs</span>
<span class="k">JOIN</span> <span class="n">pg_class</span> <span class="k">c</span> <span class="k">ON</span> <span class="n">pgqs</span><span class="p">.</span><span class="n">lrelid</span> <span class="o">=</span> <span class="k">c</span><span class="p">.</span><span class="n">oid</span>
<span class="k">JOIN</span> <span class="n">pg_attribute</span> <span class="n">a</span> <span class="k">ON</span> <span class="n">a</span><span class="p">.</span><span class="n">attrelid</span> <span class="o">=</span> <span class="k">c</span><span class="p">.</span><span class="n">oid</span> <span class="k">AND</span> <span class="n">a</span><span class="p">.</span><span class="n">attnum</span> <span class="o">=</span> <span class="n">pgqs</span><span class="p">.</span><span class="n">lattnum</span><span class="p">;</span>
<span class="n">relname</span> <span class="o">|</span> <span class="n">attname</span> <span class="o">|</span> <span class="n">opno</span> <span class="o">|</span> <span class="n">qualid</span> <span class="o">|</span> <span class="n">qualnodeid</span> <span class="o">|</span> <span class="n">mean_ratio</span> <span class="o">|</span> <span class="n">mean_num</span> <span class="o">|</span> <span class="n">constvalue</span>
<span class="c1">---------+---------+------+------------+------------+------------+----------+------------</span>
<span class="n">pgqs</span> <span class="o">|</span> <span class="n">val1</span> <span class="o">|</span> <span class="o">=</span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="mi">3161070364</span> <span class="o">|</span> <span class="mi">1</span><span class="p">.</span><span class="mi">00393542</span> <span class="o">|</span> <span class="mi">98</span> <span class="o">|</span> <span class="mi">0</span><span class="p">::</span><span class="nb">integer</span>
<span class="n">pgqs</span> <span class="o">|</span> <span class="n">val1</span> <span class="o">|</span> <span class="o">=</span> <span class="o">|</span> <span class="mi">3864967567</span> <span class="o">|</span> <span class="mi">3161070364</span> <span class="o">|</span> <span class="mi">12500</span> <span class="o">|</span> <span class="mi">12500</span> <span class="o">|</span> <span class="mi">0</span><span class="p">::</span><span class="nb">integer</span>
<span class="n">pgqs</span> <span class="o">|</span> <span class="n">val2</span> <span class="o">|</span> <span class="o">=</span> <span class="o">|</span> <span class="mi">3864967567</span> <span class="o">|</span> <span class="mi">3065200358</span> <span class="o">|</span> <span class="mi">12500</span> <span class="o">|</span> <span class="mi">12500</span> <span class="o">|</span> <span class="mi">0</span><span class="p">::</span><span class="nb">integer</span>
<span class="p">(</span><span class="mi">3</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure>
<p class="notice"><strong>NOTE:</strong> <code class="language-plaintext highlighter-rouge">qualid</code> identifies a group of AND-ed quals and is NULL
for a qual used alone, while <code class="language-plaintext highlighter-rouge">qualnodeid</code> identifies each individual qual.</p>
<p>We see here that when used alone, the qual <code class="language-plaintext highlighter-rouge">pgqs.val1 = ?</code> doesn’t show any
selectivity estimation problem, as the ratio (<em>mean_ratio</em>) is very close to
<strong>1</strong> and the raw number (<em>mean_num</em>) is quite low. On the other hand, when
combined with <code class="language-plaintext highlighter-rouge">AND pgqs.val2 = ?</code>, pg_qualstats reports a significant estimation
error. That’s a very strong sign that those columns are functionally
dependent.</p>
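<p>To make these columns more concrete, here is a minimal C sketch, assuming an error ratio defined as the larger of the estimated and actual row counts divided by the smaller one (both clamped to at least 1 row), and a raw number defined as their absolute difference. Those definitions are my assumption: they match the 12500 / 12500 line above (12500 rows estimated, none returned), but pg_qualstats’ actual code may compute them differently. The running minimum, maximum, mean and variance use Welford’s online algorithm, the technique pg_stat_statements uses for its timing statistics. The <code class="language-plaintext highlighter-rouge">err_stats</code> type and all function names are hypothetical.</p>

```c
#include <stdint.h>

/* Hypothetical running statistics kept for one qual. */
typedef struct err_stats
{
    int64_t count;  /* number of recorded executions */
    double  min;    /* smallest error seen */
    double  max;    /* largest error seen */
    double  mean;   /* running mean */
    double  m2;     /* running sum of squared deviations (Welford) */
} err_stats;

/* Ratio between estimated and actual rows, >= 1; counts below 1 are clamped to 1. */
double err_ratio(double estimated, double actual)
{
    double hi = estimated > actual ? estimated : actual;
    double lo = estimated > actual ? actual : estimated;

    if (hi < 1.0)
        hi = 1.0;
    if (lo < 1.0)
        lo = 1.0;
    return hi / lo;
}

/* Raw error: absolute difference between estimated and actual rows. */
double err_num(double estimated, double actual)
{
    return estimated > actual ? estimated - actual : actual - estimated;
}

/* Record one observation, updating min/max/mean/m2 incrementally. */
void err_stats_add(err_stats *s, double x)
{
    double delta;

    if (s->count == 0 || x < s->min)
        s->min = x;
    if (s->count == 0 || x > s->max)
        s->max = x;

    s->count++;
    delta = x - s->mean;
    s->mean += delta / s->count;
    s->m2 += delta * (x - s->mean);
}

/* Population variance; the standard deviation is its square root. */
double err_stats_variance(const err_stats *s)
{
    return s->count > 0 ? s->m2 / s->count : 0.0;
}
```

<p>Feeding each execution’s ratio (or raw number) to <code class="language-plaintext highlighter-rouge">err_stats_add()</code> keeps a constant amount of memory per qual, whatever the number of executions.</p>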
<p>Conversely, if a qual used alone shows issues, it could be a sign of outdated
statistics, or of a statistics sample that isn’t big enough.</p>
<p>Also, if you have the <code class="language-plaintext highlighter-rouge">pg_stat_statements</code> extension installed, <code class="language-plaintext highlighter-rouge">pg_qualstats</code> will
give you the <em>query identifier</em> for each predicate. With that and a bit of
SQL, you can for instance find the queries with a long average execution time
that contain quals whose selectivity estimation is off by a factor of 10 or more.</p>
<h3 id="interlude-extended-statistics">Interlude: Extended statistics</h3>
<p>If you’re wondering how to solve the issue I just explained, the solution is
very easy since <strong>extended statistics</strong> were introduced in PostgreSQL 10,
assuming that you know that’s the root issue. <a href="https://www.postgresql.org/docs/current/sql-createstatistics.html">Create an extended
statistics object</a>
on the related columns, perform an ANALYZE and you’re done!</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="k">CREATE</span> <span class="k">STATISTICS</span> <span class="n">pgqs_stats</span> <span class="k">ON</span> <span class="n">val1</span><span class="p">,</span> <span class="n">val2</span> <span class="k">FROM</span> <span class="n">pgqs</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">STATISTICS</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">ANALYZE</span> <span class="n">pgqs</span><span class="p">;</span>
<span class="k">ANALYZE</span>
<span class="n">rjuju</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="k">ANALYZE</span> <span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">pgqs</span> <span class="n">t1</span>
<span class="k">JOIN</span> <span class="n">pgqs</span> <span class="n">t2</span> <span class="k">ON</span> <span class="n">t1</span><span class="p">.</span><span class="n">val1</span> <span class="o">=</span> <span class="n">t2</span><span class="p">.</span><span class="n">val1</span>
<span class="k">WHERE</span> <span class="n">t1</span><span class="p">.</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">AND</span> <span class="n">t1</span><span class="p">.</span><span class="n">val2</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">order</span> <span class="k">by</span> <span class="n">t1</span><span class="p">.</span><span class="n">val2</span><span class="p">;</span>
<span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">-------------------------------------------------------------------------</span>
<span class="n">Nested</span> <span class="n">Loop</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">25002</span> <span class="n">width</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="o">-></span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pgqs</span> <span class="n">t1</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">Filter</span><span class="p">:</span> <span class="p">((</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">AND</span> <span class="p">(</span><span class="n">val2</span> <span class="o">=</span> <span class="mi">0</span><span class="p">))</span>
<span class="k">Rows</span> <span class="n">Removed</span> <span class="k">by</span> <span class="n">Filter</span><span class="p">:</span> <span class="mi">50000</span>
<span class="o">-></span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pgqs</span> <span class="n">t2</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">25002</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">(</span><span class="n">never</span> <span class="n">executed</span><span class="p">)</span>
<span class="n">Filter</span><span class="p">:</span> <span class="p">(</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">559</span> <span class="n">ms</span>
<span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">39</span><span class="p">.</span><span class="mi">471</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">8</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure>
<p>If you want more details on extended statistics, I recommend looking at the
slides from <a href="https://blog.pgaddict.com/">Tomas Vondra</a>’s <a href="https://www.postgresql.eu/events/pgconfeu2018/sessions/session/2083/slides/130/create-statistics-what-is-it.pdf">excellent talk on
this
subject</a>.</p>
<h3 id="going-further">Going further</h3>
<p>Tracking the quals of every single executed query is of course quite expensive,
and would significantly impact the performance of any non-datawarehouse
workload. That’s why <code class="language-plaintext highlighter-rouge">pg_qualstats</code> has an option,
<strong>pg_qualstats.sample_rate</strong>, to sample the queries that will be processed.
This setting defaults to <strong>1 / max_connections</strong>, which makes the
overhead quite negligible, but don’t be surprised if you don’t see any qual
reported after running only a few queries!</p>
<p>But if you’re instead only interested in the quals that have bad selectivity
estimates, for instance to detect this class of problem rather than missing
indexes, there are two new options available for that:</p>
<ul>
<li><strong>pg_qualstats.min_err_estimate_ratio</strong></li>
<li><strong>pg_qualstats.min_err_estimate_num</strong></li>
</ul>
<p>Those options are cumulative and can be changed at any time; they limit the
quals that pg_qualstats stores to the ones whose selectivity estimation error
ratio and/or raw number is higher than the value you ask for. These options
already help reduce the performance overhead, and they can of course be
combined with <strong>pg_qualstats.sample_rate</strong> if needed.</p>
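<p>Here is how I read that filtering rule, as a minimal C sketch. It assumes that a threshold of 0 disables the corresponding check and that “cumulative” means a qual must exceed every threshold that is set; the function name and the strict comparison are my assumptions, not pg_qualstats’ code:</p>

```c
#include <stdbool.h>

/*
 * Hypothetical filter: a threshold <= 0 is treated as disabled, and every
 * enabled threshold must be strictly exceeded for the qual to be stored.
 */
bool keep_qual(double ratio, double num, double min_ratio, double min_num)
{
    if (min_ratio > 0 && ratio <= min_ratio)
        return false;
    if (min_num > 0 && num <= min_num)
        return false;
    return true;
}
```

<p>For instance, with <strong>pg_qualstats.min_err_estimate_ratio</strong> set to 10, the <code class="language-plaintext highlighter-rouge">val1 = 0</code> qual used alone (ratio of about 1.004) would be discarded, while the AND-ed qual (ratio 12500) would be kept.</p>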
<h3 id="conclusion">Conclusion</h3>
<p>After <a href="/postgresql/2020/01/06/pg_qualstats-2-global-index-advisor.html">introducing the new global index advisor</a>, this article presented
a class of problems that DBAs frequently face, and how to detect and
solve them.</p>
<p>I believe that those two new features in pg_qualstats will greatly help
PostgreSQL database administration. External tools that aim to solve
related issues, such as
<a href="https://github.com/ossc-db/pg_plan_advsr">pg_plan_advsr</a> or
<a href="https://github.com/postgrespro/aqo">AQO</a>, could also benefit from
pg_qualstats, as they could directly get the exact data they need to
perform their analysis and optimize the queries!</p>
<p><a href="https://rjuju.github.io/postgresql/2020/02/28/pg_qualstats-2-selectivity-error.html">Planner selectivity estimation error statistics with pg_qualstats 2</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on February 28, 2020.</p>
https://rjuju.github.io/postgresql/2020/02/06/new-in-pg13-leader_pid2020-02-06T12:59:53+00:002020-02-06T12:59:53+00:00Julien Rouhaudhttps://rjuju.github.io
<h3 id="new-leader_pid-column-in-pg_stat_activity-view">New leader_pid column in pg_stat_activity view</h3>
<p>Surprisingly, since parallel query was introduced in PostgreSQL 9.6, it was
impossible to know which backend a parallel worker was related to. So, as
<a href="https://twitter.com/g_lelarge/status/1209486212190343168">Guillaume pointed
out</a>, it was
quite difficult to build simple tools that can sample the wait events of
all the processes involved in a query. A simple solution to that problem is to
expose the <code class="language-plaintext highlighter-rouge">lock group leader</code> information available in the backend at the SQL
level:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit b025f32e0b5d7668daec9bfa957edf3599f4baa8
Author: Michael Paquier
Date: Thu Feb 6 09:18:06 2020 +0900
Add leader_pid to pg_stat_activity
This new field tracks the PID of the group leader used with parallel
query. For parallel workers and the leader, the value is set to the
PID of the group leader. So, for the group leader, the value is the
same as its own PID. Note that this reflects what PGPROC stores in
shared memory, so as leader_pid is NULL if a backend has never been
involved in parallel query. If the backend is using parallel query or
has used it at least once, the value is set until the backend exits.
Author: Julien Rouhaud
Reviewed-by: Sergei Kornilov, Guillaume Lelarge, Michael Paquier, Tomas
Vondra
Discussion: https://postgr.es/m/CAOBaU_Yy5bt0vTPZ2_LUM6cUcGeqmYNoJ8-Rgto+c2+w3defYA@mail.gmail.com
</code></pre></div></div>
<p>With this change, you can now easily find all processes involved in a parallel
query. For instance:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">query</span><span class="p">,</span> <span class="n">leader_pid</span><span class="p">,</span>
<span class="n">array_agg</span><span class="p">(</span><span class="n">pid</span><span class="p">)</span> <span class="n">filter</span><span class="p">(</span><span class="k">WHERE</span> <span class="n">leader_pid</span> <span class="o">!=</span> <span class="n">pid</span><span class="p">)</span> <span class="k">AS</span> <span class="n">members</span>
<span class="k">FROM</span> <span class="n">pg_stat_activity</span>
<span class="k">WHERE</span> <span class="n">leader_pid</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">query</span><span class="p">,</span> <span class="n">leader_pid</span><span class="p">;</span>
<span class="n">query</span> <span class="o">|</span> <span class="n">leader_pid</span> <span class="o">|</span> <span class="n">members</span>
<span class="c1">-------------------+------------+---------------</span>
<span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">t1</span><span class="p">;</span> <span class="o">|</span> <span class="mi">31630</span> <span class="o">|</span> <span class="p">{</span><span class="mi">32269</span><span class="p">,</span><span class="mi">32268</span><span class="p">}</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span></code></pre></figure>
<p>Be careful: as mentioned in the commit message, a <code class="language-plaintext highlighter-rouge">leader_pid</code> equal to
<code class="language-plaintext highlighter-rouge">pid</code> doesn’t necessarily mean that the backend is currently
performing a parallel query, as once set, this field is never reset. Also, to
avoid extra overhead, no additional lock is held while outputting the data,
meaning that each row is processed independently. So, while quite unlikely,
you can in some circumstances get inconsistent data, such as a parallel worker
pointing to a pid that has already disconnected.</p>
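<p>The cases described in the commit message can be summarized by a small, hypothetical C helper (not a PostgreSQL API) classifying a <code class="language-plaintext highlighter-rouge">pg_stat_activity</code> row from its <code class="language-plaintext highlighter-rouge">pid</code> and <code class="language-plaintext highlighter-rouge">leader_pid</code>. Note that the middle case cannot tell a current leader from a backend that merely ran a parallel query at some point:</p>

```c
#include <stdbool.h>

/* Hypothetical roles derived from one pg_stat_activity row. */
typedef enum
{
    NEVER_PARALLEL,     /* leader_pid IS NULL: never involved in parallel query */
    LEADER_OR_FORMER,   /* leader_pid = pid: leader now, or at some point in the past */
    PARALLEL_WORKER     /* leader_pid <> pid: worker of another backend */
} parallel_role;

parallel_role classify_backend(int pid, bool leader_pid_is_null, int leader_pid)
{
    if (leader_pid_is_null)
        return NEVER_PARALLEL;
    if (leader_pid == pid)
        return LEADER_OR_FORMER;
    return PARALLEL_WORKER;
}
```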
<p><a href="https://rjuju.github.io/postgresql/2020/02/06/new-in-pg13-leader_pid.html">New in pg13: New leader_pid column in pg_stat_activity</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on February 06, 2020.</p>
https://rjuju.github.io/postgresqlfr/2020/01/06/pg_qualstats-2-suggestion-index-globale2020-01-06T12:23:29+00:002020-01-06T12:23:29+00:00Julien Rouhaudhttps://rjuju.github.io
<p>Coming up with good-quality index suggestions can be a complex task.
It requires knowledge of both the application queries and the database
specifics. Over time, many projects have tried to solve this problem, one of
them being <a href="https://powa.readthedocs.io/">PoWA version
3</a>, with the help of the <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_qualstats.html">pg_qualstats
extension</a>.
This tool gives fairly good index suggestions, but it requires installing and
configuring PoWA, while some users would only like to have the global index
suggestion. To meet this need for simplicity, the algorithm used in PoWA is
now available in pg_qualstats version 2, without requiring any additional
components.</p>
<p>EDIT: The <code class="language-plaintext highlighter-rouge">pg_qualstats_index_advisor()</code> function was changed to return
<strong>json</strong> rather than <strong>jsonb</strong>, to preserve compatibility with PostgreSQL
9.3. The example queries have therefore also been modified to use
<code class="language-plaintext highlighter-rouge">json_array_elements()</code> rather than <code class="language-plaintext highlighter-rouge">jsonb_array_elements()</code>.</p>
<h3 id="quest-ce-que-pg_qualstats">What is pg_qualstats?</h3>
<p>A simple way to explain what pg_qualstats is would be to say that it is an
extension similar to
<a href="https://www.postgresql.org/docs/current/pgstatstatements.html">pg_stat_statements</a>
but working at the predicate level.</p>
<p>This extension saves useful statistics for <strong>WHERE</strong>
and <strong>JOIN</strong> clauses: which table and which column a predicate refers to, the
number of times a predicate has been used, the number of executions of the
underlying operator, whether the predicate comes from an index scan or not,
the selectivity, the constant values and much more.</p>
<p>You can deduce a lot from this information. For instance, if you look at the
predicates that contain references to different tables, you can find which
tables are joined together, and how selective the join conditions are.</p>
<h3 id="suggestion-globale-">Global suggestion?</h3>
<p>As I mentioned, the global index suggestion added in
pg_qualstats 2 uses the same approach as PoWA, so this article can serve to
describe how both tools work. The only difference is that you will probably
get a better-quality suggestion with PoWA, since more predicates will be
available, and you will also be able to choose the time interval over which
you want to compute missing index suggestions.</p>
<p>The important thing to remember here is that this suggestion is performed
<strong>globally</strong>, that is, taking into account all the interesting predicates
at the same time. This approach differs from all the others I know of, which
only consider a single query at a time. In my opinion, a global approach is
better, as it makes it possible to reduce the total number of indexes by
maximizing the efficiency of multi-column indexes.</p>
<h3 id="comment-marche-la-suggestion-globale">How the global suggestion works</h3>
<p>The first step is to retrieve all the predicates that could benefit from
new indexes. This is particularly easy to get with pg_qualstats. By filtering
the predicates coming from a sequential scan, executed many times and
filtering many rows (both in number and in percentage), you get a perfect list
of predicates that would most likely need an index (or, in some cases, a list
of poorly written queries). Let’s look, for instance, at the case of an
application that uses these 4 predicates:</p>
<p><a href="/images/global_advisor_1_quals.png"><img src="/images/global_advisor_1_quals.png" alt="List of all the predicates
found" /></a></p>
<p>Next, we need to build the full set of paths from all the AND-ed predicates
that contain other predicates, which may themselves be AND-ed as well. Using
the same 4 predicates as before, we get these paths:</p>
<p><a href="/images/global_advisor_2_graphs.png"><img src="/images/global_advisor_2_graphs.png" alt="Building all the possible predicate
paths" /></a></p>
<p>Once all the paths are built, we only need to find the best path to get the
best index to suggest. For now, the paths are ranked by giving each node of
each path a weight corresponding to the number of simple predicates it
contains, and summing the weights of each path. This is a very simple
approach, which favors a minimal number of indexes that optimize as many
queries as possible. With our example, we get:</p>
<p><a href="/images/global_advisor_3_weighted.png"><img src="/images/global_advisor_3_weighted.png" alt="Adding a weight to every path and choosing the highest
score" /></a></p>
<p>Of course, other ranking approaches could be used to take other parameters
into account, and potentially get a better suggestion, for instance by also
considering the number of executions or the selectivity of the predicates. If
the read/write ratio of each table is known (which is available with the
<a href="https://github.com/powa-team/powa-archivist">powa-archivist</a> extension), it would
also be possible to adapt the ranking to limit index suggestions for tables
that are almost exclusively written to. With this algorithm, such adjustments
would be relatively simple to make.</p>
<p>Once the best path is found, we can generate the index creation statement!
As the column order can matter, the order is generated by retrieving the
columns of each node by increasing weight. With our example, the following
index is generated:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">t1</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">ts</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span></code></pre></figure>
<p>Once the index is found, we simply remove the predicates it contains from the
global list of predicates and start over until there are no predicates
left.</p>
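<p>The steps above can be sketched in C under simplifying assumptions: a single table, each candidate AND-ed qual represented as a bitmask with one bit per column (so at most one simple predicate per column), a path being a chain of quals where each node strictly contains the previous one, and the quals of the chosen path being removed after each suggestion. The quals and column names used here are hypothetical, not the ones from the figures, and this is not PoWA’s or pg_qualstats’ actual code, which also has to deal with operator classes and access methods:</p>

```c
#include <stdio.h>
#include <string.h>

#define MAXQUALS 16   /* assumed upper bound on candidate quals */
#define NCOLS    3

/* Hypothetical columns of table t1: bit i of a qual mask means columns[i]. */
static const char *columns[NCOLS] = {"id", "ts", "val"};

/* Node weight: number of simple predicates in an AND-ed qual. */
static int weight(unsigned mask)
{
    int n = 0;
    for (; mask; mask >>= 1)
        n += mask & 1u;
    return n;
}

/* True if qual a is strictly contained in qual b. */
static int strictly_contains(unsigned b, unsigned a)
{
    return a != b && (a & b) == a;
}

/* Best path score ending at qual i; prev[i] gets the next smaller node or -1. */
static int best_path(const unsigned *quals, const int *alive, int n, int i, int *prev)
{
    int best = 0;
    prev[i] = -1;
    for (int j = 0; j < n; j++)
    {
        if (!alive[j] || !strictly_contains(quals[i], quals[j]))
            continue;
        int score = best_path(quals, alive, n, j, prev);
        if (score > best)
        {
            best = score;
            prev[i] = j;
        }
    }
    return weight(quals[i]) + best;
}

/*
 * Greedy loop: emit one index for the best-scoring path (columns taken node by
 * node, by increasing weight), drop that path's quals, start over.
 * Returns the number of suggestions written; assumes n <= MAXQUALS.
 */
int advise(const unsigned *quals, int n, char out[][128], int max_out)
{
    int alive[MAXQUALS] = {0}, prev[MAXQUALS] = {0}, best_prev[MAXQUALS];
    int nout = 0;

    for (int i = 0; i < n; i++)
        alive[i] = 1;

    while (nout < max_out)
    {
        int best_i = -1, best_score = 0, chain[MAXQUALS], len = 0;
        unsigned seen = 0;
        char *p = out[nout];

        for (int i = 0; i < n; i++)
        {
            if (!alive[i])
                continue;
            int score = best_path(quals, alive, n, i, prev);
            if (score > best_score)
            {
                best_score = score;
                best_i = i;
                memcpy(best_prev, prev, sizeof(prev));
            }
        }
        if (best_i < 0)
            break;                      /* no candidate qual left */

        for (int k = best_i; k >= 0; k = best_prev[k])
            chain[len++] = k;           /* largest node first */

        p += sprintf(p, "CREATE INDEX ON t1 (");
        for (int c = len - 1; c >= 0; c--)   /* smallest node first */
        {
            unsigned added = quals[chain[c]] & ~seen;
            for (int col = 0; col < NCOLS; col++)
                if (added & (1u << col))
                {
                    p += sprintf(p, "%s%s", seen ? ", " : "", columns[col]);
                    seen |= 1u << col;
                }
            alive[chain[c]] = 0;        /* this qual is now covered */
        }
        sprintf(p, ");");
        nout++;
    }
    return nout;
}
```

<p>With the hypothetical quals {id}, {id, ts}, {id, ts, val} and {val} (masks 0x1, 0x3, 0x7 and 0x4), the path {id}, {id, ts}, {id, ts, val} scores 1 + 2 + 3 = 6 and yields <code class="language-plaintext highlighter-rouge">CREATE INDEX ON t1 (id, ts, val);</code>, and the remaining {val} qual then gets its own single-column index. The recursive path search is exponential in the worst case, which is acceptable for a sketch handling a handful of quals per table.</p>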
<h3 id="un-peu-plus-de-détails-et-mise-en-garde">A few more details and caveats</h3>
<p>Of course, this is a simplified version of the suggestion algorithm, as
other information is needed. For instance, the list of predicates is actually
adjusted using the <a href="https://www.postgresql.org/docs/current/indexes-opclass.html">operator classes and access
methods</a>, depending
on the column type and its operator, to make sure that only valid indexes are
suggested. If several index access methods are found for the same best path,
<code class="language-plaintext highlighter-rouge">btree</code> is chosen in priority.</p>
<p>This brings us to another detail: this approach is mainly designed for
<strong>btree</strong> indexes, for which the column order is critical. Other access
methods don’t require a specific column order, and for those access methods a
more optimal suggestion might be possible if the column order were not taken
into account.</p>
<p>Another important point is that operator classes and access methods are not
hardcoded but retrieved at execution time from the local catalogs.
Consequently, you can get different (and potentially better) results if you
make sure that all the additional operator classes are available when you use
the global index suggestion. This could mean the <strong>btree_gist</strong>
and <strong>btree_gin</strong> extensions, but also other index access methods. It’s
also possible that some types / operators have no associated access method in
the catalogs. In that case, these predicates are returned separately, in a
list of predicates that can’t be optimized automatically and require manual
analysis.</p>
<p>Finally, as pg_qualstats doesn’t handle predicates made of expressions, the
tool can’t suggest indexes on expressions, for instance when using full-text
search.</p>
<h3 id="exemple-dutilisation">Usage example</h3>
<p>A simple function is provided, with optional parameters, that returns a json
value:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="n">pg_qualstats_index_advisor</span> <span class="p">(</span>
<span class="n">min_filter</span> <span class="nb">integer</span> <span class="k">DEFAULT</span> <span class="mi">1000</span><span class="p">,</span>
<span class="n">min_selectivity</span> <span class="nb">integer</span> <span class="k">DEFAULT</span> <span class="mi">30</span><span class="p">,</span>
<span class="n">forbidden_am</span> <span class="nb">text</span><span class="p">[]</span> <span class="k">DEFAULT</span> <span class="s1">'{}'</span><span class="p">)</span>
<span class="k">RETURNS</span> <span class="n">json</span></code></pre></figure>
<p>The parameter names are self-explanatory:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">min_filter</code>: how many rows the predicate must filter on average to
be considered by the global suggestion, <strong>1000</strong> by default;</li>
<li><code class="language-plaintext highlighter-rouge">min_selectivity</code>: what the average selectivity of a predicate must be
for it to be considered by the global suggestion, <strong>30%</strong> by
default;</li>
<li><code class="language-plaintext highlighter-rouge">forbidden_am</code>: list of index access methods to ignore. None by
default, although for versions 9.6 and below <strong>hash indexes are
ignored internally</strong>, since they are only crash-safe as of version 10.</li>
</ul>
<p>Here is a simple example, taken from the pg_qualstats regression tests:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">pgqs</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="s1">'a'</span> <span class="n">val</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="n">id</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">adv</span> <span class="p">(</span><span class="n">id1</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">id2</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">id3</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">val</span> <span class="nb">text</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">adv</span> <span class="k">SELECT</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="s1">'line '</span> <span class="o">||</span> <span class="n">i</span> <span class="k">from</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="n">pg_qualstats_reset</span><span class="p">();</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o"><</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o"><</span> <span class="mi">500</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">and</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">and</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">and</span> <span class="n">id2</span> <span class="o">=</span> <span class="mi">2</span> <span class="k">AND</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">and</span> <span class="n">id2</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">AND</span> <span class="n">id3</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">AND</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">val</span> <span class="k">ILIKE</span> <span class="s1">'moh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">pgqs</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span></code></pre></figure>
<p>And here is what the function returns:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="n">v</span>
<span class="k">FROM</span> <span class="n">json_array_elements</span><span class="p">(</span>
<span class="n">pg_qualstats_index_advisor</span><span class="p">(</span><span class="n">min_filter</span> <span class="o">=></span> <span class="mi">50</span><span class="p">)</span><span class="o">-></span><span class="s1">'indexes'</span><span class="p">)</span> <span class="n">v</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">v</span><span class="p">::</span><span class="nb">text</span> <span class="k">COLLATE</span> <span class="nv">"C"</span><span class="p">;</span>
<span class="n">v</span>
<span class="c1">---------------------------------------------------------------</span>
<span class="nv">"CREATE INDEX ON public.adv USING btree (id1)"</span>
<span class="nv">"CREATE INDEX ON public.adv USING btree (val, id1, id2, id3)"</span>
<span class="nv">"CREATE INDEX ON public.pgqs USING btree (id)"</span>
<span class="p">(</span><span class="mi">3</span> <span class="k">rows</span><span class="p">)</span>
<span class="k">SELECT</span> <span class="n">v</span>
<span class="k">FROM</span> <span class="n">json_array_elements</span><span class="p">(</span>
<span class="n">pg_qualstats_index_advisor</span><span class="p">(</span><span class="n">min_filter</span> <span class="o">=></span> <span class="mi">50</span><span class="p">)</span><span class="o">-></span><span class="s1">'unoptimised'</span><span class="p">)</span> <span class="n">v</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">v</span><span class="p">::</span><span class="nb">text</span> <span class="k">COLLATE</span> <span class="nv">"C"</span><span class="p">;</span>
<span class="n">v</span>
<span class="c1">-----------------</span>
<span class="nv">"adv.val ~~* ?"</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span></code></pre></figure>
<p>The <a href="https://github.com/powa-team/pg_qualstats/">version 2 of pg_qualstats</a>
is not available as a stable release yet, but feel free to test it
and <a href="https://github.com/powa-team/pg_qualstats/issues">report any problem you may
encounter</a>!</p>
<p><a href="https://rjuju.github.io/postgresqlfr/2020/01/06/pg_qualstats-2-suggestion-index-globale.html">pg qualstats 2: Suggestion d'index globale</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on January 06, 2020.</p>
https://rjuju.github.io/postgresql/2020/01/06/pg_qualstats-2-global-index-advisor2020-01-06T12:23:29+00:002020-01-06T12:23:29+00:00Julien Rouhaudhttps://rjuju.github.io
<p>Coming up with good index suggestions can be a complex task. It requires
knowledge of both the application queries and the database specificities. Over the
years, multiple projects have tried to solve this problem, one of which is <a href="https://powa.readthedocs.io/">PoWA
version 3</a>, with the help of the
<a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_qualstats.html">pg_qualstats
extension</a>.
It can give pretty good index suggestions, but it requires installing and
configuring PoWA, while some users only wanted the global index advisor.
In such cases, and for simplicity, the algorithm used in PoWA is now available in
pg_qualstats version 2 without requiring any additional component.</p>
<p>EDIT: The <code class="language-plaintext highlighter-rouge">pg_qualstats_index_advisor()</code> function has been changed to return
<strong>json</strong> rather than <strong>jsonb</strong>, so that the compatibility with PostgreSQL 9.3
is maintained. The query examples are therefore also modified to use
<code class="language-plaintext highlighter-rouge">json_array_elements()</code> rather than <code class="language-plaintext highlighter-rouge">jsonb_array_elements()</code>.</p>
<h3 id="what-is-pg_qualstats">What is pg_qualstats</h3>
<p>A simple way to explain what pg_qualstats is would be to say that it’s like
<a href="https://www.postgresql.org/docs/current/pgstatstatements.html">pg_stat_statements</a>
working at the predicate level.</p>
<p>The extension saves useful statistics for <strong>WHERE</strong> and <strong>JOIN</strong> clauses:
which table and column a predicate refers to, the number of times the predicate has
been used, the number of executions of the underlying operator, whether the
predicate comes from an index scan or not, its selectivity, the constant values
used and much more.</p>
<p>You can deduce many things from such information. For instance, if you examine
the predicates that contain references to different tables, you can find which
tables are joined together, and how selective those join conditions are.</p>
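<p>As an illustration of that kind of deduction, here is a minimal Python sketch.
It assumes each recorded predicate has been reduced to a hypothetical
<code class="language-plaintext highlighter-rouge">Predicate</code> record (left column, right column or <code class="language-plaintext highlighter-rouge">None</code> for a
constant, selectivity); this record layout and the sample tables are invented for
illustration and are not pg_qualstats' actual view columns.</p>

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Predicate(NamedTuple):
    """Hypothetical, simplified view of one recorded predicate."""
    left: str             # "table.column" on the left-hand side
    right: Optional[str]  # "table.column" on the right-hand side, None for a constant
    selectivity: float    # fraction of rows kept, between 0 and 1


def join_pairs(predicates):
    """Group predicates referencing two different tables by table pair.

    Returns a dict mapping a sorted (table_a, table_b) tuple to the list of
    selectivities observed for the join conditions between those tables.
    """
    joins = defaultdict(list)
    for pred in predicates:
        if pred.right is None:
            continue  # comparison with a constant: not a join condition
        left_table = pred.left.split(".", 1)[0]
        right_table = pred.right.split(".", 1)[0]
        if left_table != right_table:
            joins[tuple(sorted((left_table, right_table)))].append(pred.selectivity)
    return dict(joins)


if __name__ == "__main__":
    sample = [
        Predicate("orders.customer_id", "customers.id", 0.001),
        Predicate("orders.status", None, 0.2),
        Predicate("customers.id", "orders.customer_id", 0.002),
    ]
    print(join_pairs(sample))
```

<p>Sorting the table names makes <code class="language-plaintext highlighter-rouge">a.x = b.y</code> and <code class="language-plaintext highlighter-rouge">b.y = a.x</code> land in the
same group, so both directions of a join condition are counted together.</p>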
<h3 id="global-suggestion">Global suggestion?</h3>
<p>As I mentioned, the global index advisor added in pg_qualstats 2 uses the same
approach as the one in PoWA, so the explanation here applies to both tools.
The only differences are that PoWA will likely give better suggestions, as
more predicates are available, and that it lets you choose the time
interval over which to detect missing indexes.</p>
<p>The important thing here is that the suggestion is performed <strong>globally</strong>,
considering all interesting predicates at the same time. This differs from all
the other approaches I have seen, which only consider a single query at a
time. I believe that a global approach is better, as it can reduce the total
number of indexes by maximizing the usefulness of multi-column indexes.</p>
<h3 id="how-global-suggestion-is-done">How global suggestion is done</h3>
<p>The first step is to gather all predicates that could benefit from a new index.
This is easy with pg_qualstats: by keeping the predicates that come from
sequential scans, are executed many times and filter many rows (both in number of
rows and in percentage), you get a perfect list of predicates that are likely
missing an index (or, in some cases, a list of poorly written queries). For
instance, let’s consider an application which uses these 4
predicates:</p>
<p><a href="/images/global_advisor_1_quals.png"><img src="/images/global_advisor_1_quals.png" alt="List of all predicates
found" /></a></p>
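<p>The selection step can be sketched in Python as follows. This is a minimal
sketch, not the extension's code: the <code class="language-plaintext highlighter-rouge">QualStats</code> record and its fields are
hypothetical stand-ins for the aggregated pg_qualstats counters, and the
<code class="language-plaintext highlighter-rouge">min_executions</code> threshold is an assumption. The <code class="language-plaintext highlighter-rouge">min_filter</code> and
<code class="language-plaintext highlighter-rouge">min_selectivity</code> defaults reuse those of
<code class="language-plaintext highlighter-rouge">pg_qualstats_index_advisor()</code> described in the usage example.</p>

```python
from dataclasses import dataclass


@dataclass
class QualStats:
    """Hypothetical aggregated statistics for one predicate."""
    qual: str                # normalized predicate text, e.g. "t1.id = ?"
    from_seqscan: bool       # was it evaluated by a sequential scan?
    executions: int          # how many times the predicate was executed
    avg_filtered: float      # average number of rows removed per execution
    avg_filter_ratio: float  # average percentage of rows removed, 0-100


def candidate_predicates(stats, min_executions=1, min_filter=1000, min_selectivity=30):
    """Keep predicates from sequential scans that filter many rows.

    Both an absolute threshold (rows) and a relative one (percentage) must
    be met, so that tiny tables and barely selective predicates are ignored.
    """
    return [
        s.qual
        for s in stats
        if s.from_seqscan
        and s.executions >= min_executions
        and s.avg_filtered >= min_filter
        and s.avg_filter_ratio >= min_selectivity
    ]


if __name__ == "__main__":
    sample = [
        QualStats("t1.id = ?", True, 500, 99999, 99.9),
        QualStats("t1.val = ?", True, 500, 50, 5.0),    # filters too few rows
        QualStats("t2.id = ?", False, 500, 99999, 99.9),  # not a seq scan
    ]
    print(candidate_predicates(sample))
```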
<p>Next, we build the full set of paths, where each predicate (possibly made of
several AND-ed simple predicates) leads to the other, possibly also AND-ed,
predicates it contains. Using the same 4 predicates, we would get these paths:</p>
<p><a href="/images/global_advisor_2_graphs.png"><img src="/images/global_advisor_2_graphs.png" alt="Build all possible paths of
predicates" /></a></p>
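<p>A minimal Python sketch of the path building could look like this, assuming
each predicate is represented as a <code class="language-plaintext highlighter-rouge">frozenset</code> of the columns of a single
table it references, one column per AND-ed simple predicate. The predicates on
<code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">ts</code> and <code class="language-plaintext highlighter-rouge">val</code> used below are hypothetical, since the
exact predicates of the figure aren't listed in the text.</p>

```python
def build_paths(predicates):
    """Return every path starting from each predicate.

    A path goes from a predicate to one it strictly contains, then to one
    that predicate contains, and so on, until reaching a predicate that
    contains no other one.  Non-direct subsets are also followed, which
    yields some shorter paths that the scoring step will simply discard.
    """
    pool = set(predicates)

    def walk(node):
        children = [p for p in pool if p < node]  # strict subsets only
        if not children:
            return [[node]]
        return [[node] + sub for child in children for sub in walk(child)]

    paths = []
    for pred in pool:
        paths.extend(walk(pred))
    return paths


if __name__ == "__main__":
    preds = [
        frozenset({"id"}),
        frozenset({"id", "ts"}),
        frozenset({"id", "ts", "val"}),
        frozenset({"val"}),
    ]
    for path in build_paths(preds):
        print([sorted(node) for node in path])
```

<p>This exhaustive walk is exponential in the worst case, which is acceptable for
an illustration but is something a real implementation would need to keep in mind.</p>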
<p>Once all the paths are built, we just need to find the best path to know
which index to suggest. For now, scoring is done by giving each node of each
path a weight equal to the number of simple predicates it contains, and
summing the weights of each path. This is very simple and favors a smaller
number of indexes that optimize as many queries as possible. With our
simple example, we get:</p>
<p><a href="/images/global_advisor_3_weighted.png"><img src="/images/global_advisor_3_weighted.png" alt="Weight all paths and choose the highest
score" /></a></p>
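<p>The scoring rule described above fits in a few lines of Python. This sketch
keeps the same hypothetical representation (a path is a list of nodes, a node is
a <code class="language-plaintext highlighter-rouge">frozenset</code> of columns), so a node's weight is simply its size.</p>

```python
def path_score(path):
    """Sum the node weights, a node's weight being its number of simple predicates."""
    return sum(len(node) for node in path)


def best_path(paths):
    """Return the highest scoring path; on ties, max() keeps the first one."""
    return max(paths, key=path_score)


if __name__ == "__main__":
    I = frozenset({"id"})
    IT = frozenset({"id", "ts"})
    ITV = frozenset({"id", "ts", "val"})
    V = frozenset({"val"})
    paths = [[ITV, I], [ITV, IT, I], [ITV, V], [IT, I], [I], [V]]
    best = best_path(paths)
    print([sorted(node) for node in best], path_score(best))
```

<p>With these hypothetical predicates, the path going through all three nested
predicates scores 3 + 2 + 1 = 6 and wins over every shorter path.</p>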
<p>Of course, other scoring approaches could be used to take other parameters
into account and possibly give better suggestions, for instance by including the
number of executions or the predicate selectivity. If the read/write ratio of
each table is known (this is available using
<a href="https://github.com/powa-team/powa-archivist">powa-archivist</a>), it would also
be possible to adapt the scoring method to limit index suggestions for
write-mostly tables. With this algorithm, all of that could be added quite
easily.</p>
<p>Once the best path is found, we can generate the index DDL! As the order of the
columns can be important, the columns are taken from each node in ascending
weight order. In our example, we would generate this index:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">t1</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">ts</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span></code></pre></figure>
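<p>Here is a hedged Python sketch of that DDL generation, using the same
hypothetical representation as before. When a single node introduces several new
columns, their relative order isn't specified by the text, so this sketch sorts
them alphabetically as an arbitrary, deterministic choice. The output format
mimics the one returned by <code class="language-plaintext highlighter-rouge">pg_qualstats_index_advisor()</code> in the usage example.</p>

```python
def index_ddl(table, path, access_method="btree"):
    """Build a CREATE INDEX statement from a path of predicates.

    Nodes are visited in ascending weight order (fewest simple predicates
    first), and each node contributes the columns not already added.
    """
    columns = []
    for node in sorted(path, key=len):
        # Columns new to this node; alphabetical order is this sketch's choice.
        columns.extend(sorted(node - set(columns)))
    return f"CREATE INDEX ON {table} USING {access_method} ({', '.join(columns)})"


if __name__ == "__main__":
    path = [
        frozenset({"id", "ts", "val"}),
        frozenset({"id", "ts"}),
        frozenset({"id"}),
    ]
    print(index_ddl("t1", path))
```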
<p>Once an index is found, we simply remove the predicates it contains from the
global list of predicates and start again from scratch until there are no
predicates left.</p>
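<p>Putting the previous steps together, the whole loop for a single table could
be sketched as below. It is a simplified illustration under the same assumptions
(predicates as <code class="language-plaintext highlighter-rouge">frozenset</code>s of columns, btree only, no operator class
handling), not the actual implementation. Only the predicates that are nodes of
the chosen path are removed, so a predicate such as <code class="language-plaintext highlighter-rouge">val</code> alone, which
can't use a btree index whose leading column is <code class="language-plaintext highlighter-rouge">id</code>, gets its own index.</p>

```python
def _paths(node, pool):
    """All paths from node down through predicates of pool it strictly contains."""
    children = [p for p in pool if p < node]
    if not children:
        return [[node]]
    return [[node] + sub for child in children for sub in _paths(child, pool)]


def suggest_indexes(table, predicates):
    """Pick the best path, emit its index, drop its predicates, and repeat."""
    remaining = set(predicates)
    suggestions = []
    while remaining:
        paths = [path for pred in remaining for path in _paths(pred, remaining)]
        # Score: sum of node weights, a weight being the number of simple predicates.
        best = max(paths, key=lambda path: sum(len(node) for node in path))
        columns = []
        for node in sorted(best, key=len):  # ascending weight order
            columns.extend(sorted(node - set(columns)))
        suggestions.append(f"CREATE INDEX ON {table} USING btree ({', '.join(columns)})")
        remaining -= set(best)  # always removes at least one predicate
    return suggestions


if __name__ == "__main__":
    preds = [
        frozenset({"id"}),
        frozenset({"id", "ts"}),
        frozenset({"id", "ts", "val"}),
        frozenset({"val"}),
    ]
    for ddl in suggest_indexes("t1", preds):
        print(ddl)
```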
<h3 id="additional-details-and-caveat">Additional details and caveat</h3>
<p>Of course, this is a simplified version of the suggestion algorithm, and some
other information is required. For instance, the list of predicates is
actually expanded with <a href="https://www.postgresql.org/docs/current/indexes-opclass.html">operator classes and access
methods</a> depending
on the column types and operators, to make sure that the suggested indexes are
valid. If multiple index methods are found for a best path, <code class="language-plaintext highlighter-rouge">btree</code> will be
chosen in priority.</p>
<p>This brings another consideration: this approach is mostly designed for
<strong>btree</strong> indexes, for which the column order is critical. Some other access
methods don’t require a specific column order, and for those it could be
possible to get better index suggestions if the column order weren't
taken into account.</p>
<p>Another important point is that the operator classes and access methods are not
hardcoded but retrieved at execution time from the local catalogs. Therefore,
you can get different (and possibly better) results if you make sure that
optional operator classes are present when using the index advisor. This could
be the <strong>btree_gist</strong> or <strong>btree_gin</strong> extensions, but also other access methods.
It’s also possible that some type / operator combination doesn’t have any
associated access method recorded in the catalogs. In this case, those
predicates are returned separately as a list of unoptimizable predicates, which
should be analyzed manually.</p>
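<p>The split between optimizable and unoptimizable predicates can be sketched in
Python as follows. This is a simplified, per-predicate illustration: the
hypothetical <code class="language-plaintext highlighter-rouge">OPERATOR_SUPPORT</code> dict stands in for the catalog lookup the
advisor performs, and its content depends on which extensions are installed.
When btree isn't available, picking the first method in alphabetical order is an
arbitrary tie-break of this sketch.</p>

```python
# Hypothetical stand-in for the catalog lookup: (column type, operator) mapped
# to the access methods that have an operator class supporting it.  The real
# advisor reads the local catalogs, so the result depends on the installation.
OPERATOR_SUPPORT = {
    ("integer", "="): {"btree", "hash", "brin"},
    ("integer", "<"): {"btree", "brin"},
    ("text", "="): {"btree", "hash", "brin"},
    ("text", "~~*"): set(),  # ILIKE: unsupported unless e.g. pg_trgm is installed
}


def classify(predicates, support=OPERATOR_SUPPORT):
    """Split (qual_text, column_type, operator) tuples into two lists.

    Returns (optimizable, unoptimizable): optimizable entries are paired with
    the chosen access method, preferring btree when it is available.
    """
    optimizable, unoptimizable = [], []
    for qual, col_type, operator in predicates:
        methods = support.get((col_type, operator), set())
        if not methods:
            unoptimizable.append(qual)  # needs a manual analysis
        elif "btree" in methods:
            optimizable.append((qual, "btree"))
        else:
            optimizable.append((qual, sorted(methods)[0]))
    return optimizable, unoptimizable


if __name__ == "__main__":
    print(classify([
        ("adv.id1 < ?", "integer", "<"),
        ("adv.val ~~* ?", "text", "~~*"),
    ]))
```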
<p>Finally, as pg_qualstats doesn't handle predicates on expressions, this advisor
can’t suggest indexes on expressions, for instance if you’re using full-text
search.</p>
<h3 id="usage-example">Usage example</h3>
<p>A simple function is provided, with optional parameters, that
returns a json value:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="n">pg_qualstats_index_advisor</span> <span class="p">(</span>
<span class="n">min_filter</span> <span class="nb">integer</span> <span class="k">DEFAULT</span> <span class="mi">1000</span><span class="p">,</span>
<span class="n">min_selectivity</span> <span class="nb">integer</span> <span class="k">DEFAULT</span> <span class="mi">30</span><span class="p">,</span>
<span class="n">forbidden_am</span> <span class="nb">text</span><span class="p">[]</span> <span class="k">DEFAULT</span> <span class="s1">'{}'</span><span class="p">)</span>
<span class="k">RETURNS</span> <span class="n">json</span></code></pre></figure>
<p>The parameter names are self-explanatory:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">min_filter</code>: how many tuples a predicate should filter on average to be
considered for the global optimization, by default <strong>1000</strong>.</li>
<li><code class="language-plaintext highlighter-rouge">min_selectivity</code>: how selective a predicate should be on average to be
considered for the global optimization, by default <strong>30%</strong>.</li>
<li><code class="language-plaintext highlighter-rouge">forbidden_am</code>: list of access methods to ignore. None by default,
although for PostgreSQL 9.6 and prior <strong>hash indexes will internally be
discarded</strong>, as they have only been crash-safe (WAL-logged) since version 10.</li>
</ul>
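<p>The <code class="language-plaintext highlighter-rouge">forbidden_am</code> rule, including the implicit hash exclusion before
version 10, can be sketched like this. The helper name is hypothetical; the
version check uses PostgreSQL's <code class="language-plaintext highlighter-rouge">server_version_num</code> format (for
instance 90600 for 9.6.0 and 100000 for 10.0).</p>

```python
def allowed_access_methods(methods, forbidden_am=(), server_version_num=100000):
    """Return the candidate access methods that may be suggested.

    methods: access methods found for a predicate.
    forbidden_am: access methods explicitly rejected by the caller.
    server_version_num: PostgreSQL version in server_version_num format.
    """
    forbidden = set(forbidden_am)
    if server_version_num < 100000:
        # Hash indexes are only WAL-logged, hence crash-safe, since version 10.
        forbidden.add("hash")
    return {am for am in methods if am not in forbidden}


if __name__ == "__main__":
    print(allowed_access_methods({"btree", "hash"}, server_version_num=90600))
    print(allowed_access_methods({"btree", "hash", "brin"}, forbidden_am=["brin"]))
```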
<p>Using pg_qualstats regression tests, let’s see a simple example:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">pgqs</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="s1">'a'</span> <span class="n">val</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="n">id</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">adv</span> <span class="p">(</span><span class="n">id1</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">id2</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">id3</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">val</span> <span class="nb">text</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">adv</span> <span class="k">SELECT</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="s1">'line '</span> <span class="o">||</span> <span class="n">i</span> <span class="k">from</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="n">pg_qualstats_reset</span><span class="p">();</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o"><</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o"><</span> <span class="mi">500</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">and</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">and</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">and</span> <span class="n">id2</span> <span class="o">=</span> <span class="mi">2</span> <span class="k">AND</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">and</span> <span class="n">id2</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">AND</span> <span class="n">id3</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">AND</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">val</span> <span class="k">ILIKE</span> <span class="s1">'moh'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">pgqs</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span></code></pre></figure>
<p>And here’s what the function returns:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="n">v</span>
<span class="k">FROM</span> <span class="n">json_array_elements</span><span class="p">(</span>
<span class="n">pg_qualstats_index_advisor</span><span class="p">(</span><span class="n">min_filter</span> <span class="o">=></span> <span class="mi">50</span><span class="p">)</span><span class="o">-></span><span class="s1">'indexes'</span><span class="p">)</span> <span class="n">v</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">v</span><span class="p">::</span><span class="nb">text</span> <span class="k">COLLATE</span> <span class="nv">"C"</span><span class="p">;</span>
<span class="n">v</span>
<span class="c1">---------------------------------------------------------------</span>
<span class="nv">"CREATE INDEX ON public.adv USING btree (id1)"</span>
<span class="nv">"CREATE INDEX ON public.adv USING btree (val, id1, id2, id3)"</span>
<span class="nv">"CREATE INDEX ON public.pgqs USING btree (id)"</span>
<span class="p">(</span><span class="mi">3</span> <span class="k">rows</span><span class="p">)</span>
<span class="k">SELECT</span> <span class="n">v</span>
<span class="k">FROM</span> <span class="n">json_array_elements</span><span class="p">(</span>
<span class="n">pg_qualstats_index_advisor</span><span class="p">(</span><span class="n">min_filter</span> <span class="o">=></span> <span class="mi">50</span><span class="p">)</span><span class="o">-></span><span class="s1">'unoptimised'</span><span class="p">)</span> <span class="n">v</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">v</span><span class="p">::</span><span class="nb">text</span> <span class="k">COLLATE</span> <span class="nv">"C"</span><span class="p">;</span>
<span class="n">v</span>
<span class="c1">-----------------</span>
<span class="nv">"adv.val ~~* ?"</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span></code></pre></figure>
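<p>If you want to post-process the advisor output outside of SQL, a small Python
sketch could group the suggested DDL by table. It assumes the json value has the
shape shown above: an <code class="language-plaintext highlighter-rouge">indexes</code> array and an <code class="language-plaintext highlighter-rouge">unoptimised</code> array, both
containing plain strings. Fetching that value from the database (for instance
with <code class="language-plaintext highlighter-rouge">SELECT pg_qualstats_index_advisor(min_filter =&gt; 50)</code>) is left out.</p>

```python
import json
from collections import defaultdict


def summarize_advice(raw):
    """Group suggested DDL by table and return the unoptimised predicates.

    raw is the text of the json value returned by the advisor, assumed to
    look like {"indexes": [...], "unoptimised": [...]} with string items.
    """
    advice = json.loads(raw)
    by_table = defaultdict(list)
    for ddl in advice.get("indexes", []):
        # "CREATE INDEX ON public.adv USING btree (id1)" -> "public.adv"
        table = ddl.split(" ON ", 1)[1].split(" ", 1)[0]
        by_table[table].append(ddl)
    return dict(by_table), advice.get("unoptimised", [])


if __name__ == "__main__":
    sample = json.dumps({
        "indexes": [
            "CREATE INDEX ON public.adv USING btree (id1)",
            "CREATE INDEX ON public.adv USING btree (val, id1, id2, id3)",
            "CREATE INDEX ON public.pgqs USING btree (id)",
        ],
        "unoptimised": ["adv.val ~~* ?"],
    })
    print(summarize_advice(sample))
```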
<p>The <a href="https://github.com/powa-team/pg_qualstats/">version 2 of pg_qualstats</a> is
not released yet, but feel free to test it and <a href="https://github.com/powa-team/pg_qualstats/issues">report any issue you may
find</a>!</p>
<p><a href="https://rjuju.github.io/postgresql/2020/01/06/pg_qualstats-2-global-index-advisor.html">pg qualstats 2: Global index advisor</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on January 06, 2020.</p>
https://rjuju.github.io/postgresqlfr/2019/12/10/powa-4-nouveau-powa-collector2019-12-10T18:54:17+00:002019-12-10T18:54:17+00:00Julien Rouhaudhttps://rjuju.github.io
<p>This article is part of a series of articles about <a href="http://powa.readthedocs.io/">the PoWA
4 beta</a>, and describes the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector
daemon</a>.</p>
<h3 id="nouveau-daemon-powa-collector">New <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a></h3>
<p>This daemon replaces the previous <em>background worker</em> when the new <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">remote
mode</a> is used.
It’s a simple daemon written in python, which takes care of all the
steps required to perform <em>remote snapshots</em>. It’s <a href="https://pypi.org/project/powa-collector/">available
on pypi</a>.</p>
<p>As I explained in my <a href="/postgresql/2019/05/17/powa-4-with-remote-mode-beta-is-available.html">previous article introducing PoWA 4</a>, this
daemon is required for a remote mode setup, with this
architecture in mind:</p>
<p><a href="/images/powa_4_remote.svg"><img src="/images/powa_4_remote.svg" alt="PoWA 4 remote mode architecture" /></a></p>
<p>Its configuration is very simple. All you need to do is rename
the provided <code class="language-plaintext highlighter-rouge">powa-collector.conf.sample</code> file, and adapt <a href="https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING">the connection
URI</a>
to describe how to connect to your dedicated <em>repository server</em>, and
you’re done.</p>
<p>A typical configuration should look like:</p>
<figure class="highlight"><pre><code class="language-conf" data-lang="conf">{
<span class="s2">"repository"</span>: {
<span class="s2">"dsn"</span>: <span class="s2">"postgresql://powa_user@server_dns:5432/powa"</span>
},
<span class="s2">"debug"</span>: <span class="n">true</span>
}</code></pre></figure>
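<p>To check such a file before starting the daemon, a minimal Python sketch could
parse it as strict JSON and validate the two keys shown here. This is not
powa-collector's own loading code: only <code class="language-plaintext highlighter-rouge">repository.dsn</code> (required) and
<code class="language-plaintext highlighter-rouge">debug</code> (defaulting to false here, an assumption) are handled. Note that a
strict JSON parser rejects a trailing comma after the last member of an object.</p>

```python
import json

SAMPLE = """
{
    "repository": {
        "dsn": "postgresql://powa_user@server_dns:5432/powa"
    },
    "debug": true
}
"""


def parse_collector_config(text):
    """Parse and validate a powa-collector style JSON configuration."""
    conf = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    repository = conf.get("repository") or {}
    dsn = repository.get("dsn")
    if not dsn:
        raise ValueError("missing repository.dsn in configuration")
    return {"dsn": dsn, "debug": bool(conf.get("debug", False))}


def load_collector_config(path):
    """Read and validate the configuration file at path."""
    with open(path, encoding="utf-8") as f:
        return parse_collector_config(f.read())


if __name__ == "__main__":
    print(parse_collector_config(SAMPLE))
```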
<p>The list of <em>remote servers</em>, their configuration and everything else
required for proper operation will automatically be retrieved from
the <em>repository server</em> you already configured. Once started, the daemon
will start a dedicated thread for each declared <em>remote server</em>, and maintain a
<strong>persistent connection</strong> to this <em>remote server</em>. Each thread will perform
a <em>remote snapshot</em>, exporting the data to the <em>repository server</em> using
the new <em>source functions</em>. Each thread will open and close
a connection to the <em>repository server</em> when performing the <em>remote
snapshot</em>.</p>
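<p>The one-thread-per-server model can be sketched with Python's
<code class="language-plaintext highlighter-rouge">threading</code> module. This is a hypothetical illustration, not powa-collector's
code: <code class="language-plaintext highlighter-rouge">snapshot_fn</code> stands in for the real work (reading the remote server's
statistics over its persistent connection and storing them on the repository
server through a short-lived connection), and the fixed interval is an assumption.</p>

```python
import threading
import time


class RemoteWorker(threading.Thread):
    """One dedicated thread per remote server, snapshotting at a fixed interval."""

    def __init__(self, server, snapshot_fn, interval, stop_event):
        super().__init__(name=f"powa-worker-{server}", daemon=True)
        self.server = server
        self.snapshot_fn = snapshot_fn  # hypothetical callable doing one snapshot
        self.interval = interval
        self.stop_event = stop_event

    def run(self):
        while True:
            self.snapshot_fn(self.server)
            # wait() returns True as soon as the stop event is set, so a
            # shutdown doesn't have to wait for the end of the interval.
            if self.stop_event.wait(self.interval):
                break


def start_workers(servers, snapshot_fn, interval):
    """Start one thread per declared remote server; return (stop_event, workers)."""
    stop_event = threading.Event()
    workers = [RemoteWorker(s, snapshot_fn, interval, stop_event) for s in servers]
    for worker in workers:
        worker.start()
    return stop_event, workers


if __name__ == "__main__":
    stop, workers = start_workers(
        ["srv1", "srv2"], lambda s: print("snapshot of", s), interval=0.1
    )
    time.sleep(0.35)  # let a few snapshots happen
    stop.set()
    for worker in workers:
        worker.join()
```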
<p>Obviously, this daemon needs to be able to connect to all the
declared <em>remote servers</em> as well as to the <em>repository server</em>. The
<code class="language-plaintext highlighter-rouge">powa_servers</code> table, which stores the list of <em>remote servers</em>, has fields
to store the username and password used to connect to the <em>remote
servers</em>. Storing a plain-text password in this table is a heresy
from a security point of view. Therefore, as mentioned in the
<a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">PoWA security
section</a>,
you can store a NULL password and <a href="https://www.postgresql.org/docs/current/auth-methods.html">instead use any
of the other authentication methods supported by
libpq</a> (.pgpass
file, certificate…). This is very strongly recommended for any serious
installation.</p>
<p>The persistent connection to the <em>repository server</em> is used to
monitor the daemon:</p>
<ul>
<li>to check that the daemon is running</li>
<li>to communicate through the UI using a <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/protocol.html">simple protocol</a>
in order to perform various actions (reload the configuration, check
the status of a thread dedicated to a <em>remote server</em>…)</li>
</ul>
<p>Note that you can also ask the daemon to reload its
configuration by sending a SIGHUP to the daemon process. A reload
is required for any change made to the list of remote servers
(adding or removing a <em>remote server</em>, or updating an
existing one).</p>
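<p>A common way to implement this kind of SIGHUP reload in a Python daemon is to
have the signal handler only set a flag, and let the main loop do the actual
reload. The sketch below follows that pattern; it is an assumption about the
general technique, not powa-collector's implementation, and
<code class="language-plaintext highlighter-rouge">load_servers</code> is a hypothetical callable returning the current list of remote
servers (read from the repository server in powa-collector's case).</p>

```python
import signal
import threading


class ReloadableDaemon:
    """Minimal sketch of a SIGHUP-triggered configuration reload."""

    def __init__(self, load_servers):
        self.load_servers = load_servers  # hypothetical: returns the server list
        self.servers = load_servers()
        self.reload_requested = threading.Event()

    def handle_sighup(self, signum, frame):
        # Keep the handler trivial: just record that a reload was asked for.
        self.reload_requested.set()

    def install(self):
        # Must be called from the main thread; SIGHUP doesn't exist on Windows.
        signal.signal(signal.SIGHUP, self.handle_sighup)

    def maybe_reload(self):
        """Called periodically by the main loop; returns True if it reloaded."""
        if not self.reload_requested.is_set():
            return False
        self.reload_requested.clear()
        self.servers = self.load_servers()
        return True
```

<p>In a real daemon, <code class="language-plaintext highlighter-rouge">maybe_reload()</code> would be followed by starting threads for
new servers and stopping those of removed ones.</p>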
<p>Please also note that, by design,
<a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector</a>
won't perform any <em>local snapshot</em>. If you want to use PoWA for the
<em>repository server</em>, you will need to enable the original <em>background worker</em>.</p>
<h5 id="nouvelle-page-de-configuration">New configuration page</h5>
<p>The configuration page has been modified to give all the required
information about the status of the background worker, the <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector
daemon</a>
(including all its dedicated threads) and the list of declared <em>remote servers</em>.
Here is an example of this new root configuration page:</p>
<p><a href="/images/powa_4_configuration_page.png"><img src="/images/powa_4_configuration_page.png" alt="New configuration
page" /></a></p>
<p>If the <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector
daemon</a>
is used, the status of each remote server will be retrieved using the
communication protocol. If the collector encounters any errors (when
connecting to a <em>remote server</em>, or during a <em>snapshot</em> for instance), they
will also be displayed here. Note that these errors will also be
displayed at the top of every page of the UI, so that
you can't miss them.</p>
<p>Also, the configuration section now has a hierarchy, and you can
see the list of extensions as well as the current PostgreSQL configuration
for the <strong>local</strong> or <strong>remote</strong> server by clicking on the server of your
choice!</p>
<p>There is also a new <strong>Reload collector</strong> button in the header
which, as one would expect, asks the collector to
reload its configuration. This can be useful if you declared
new servers but don't have access to the server on which the collector
is running.</p>
<h3 id="conclusion">Conclusion</h3>
<p>This article is the last one of the series about the new version of
PoWA. It's still in beta, so feel free to test it, <a href="https://powa.readthedocs.io/en/latest/support.html#support">report
any bug you find</a>
or give any other feedback!</p>
<p><a href="https://rjuju.github.io/postgresqlfr/2019/12/10/powa-4-nouveau-powa-collector.html">PoWA 4: Nouveau daemon powa-collector</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on December 10, 2019.</p>
https://rjuju.github.io/postgresql/2019/12/10/powa-4-new-powa-collector2019-12-10T18:54:17+00:002019-12-10T18:54:17+00:00Julien Rouhaudhttps://rjuju.github.io
<p>This article is part of the <a href="http://powa.readthedocs.io/">PoWA 4 beta</a> series,
and describes the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector
daemon</a>.</p>
<h3 id="new-powa-collector-daemon">New <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a></h3>
<p>This daemon replaces the previous <em>background worker</em> when using the <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">new
remote mode</a>. It's a
simple daemon written in Python, which performs all the steps required for
<em>remote snapshots</em>. It's <a href="https://pypi.org/project/powa-collector/">available on
pypi</a>.</p>
<p>As I explained in my <a href="/postgresql/2019/05/17/powa-4-with-remote-mode-beta-is-available.html">previous article introducing PoWA 4</a>, this daemon is
required for a remote mode setup, with this architecture in mind:</p>
<p><a href="/images/powa_4_remote.svg"><img src="/images/powa_4_remote.svg" alt="PoWA 4 remote architecture" /></a></p>
<p>Its configuration is very simple. All you need to do is copy and rename the
provided <code class="language-plaintext highlighter-rouge">powa-collector.conf.sample</code> file, and adapt the <a href="https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING">connection
URI</a>
to describe how to connect to your dedicated <em>repository server</em>, and you're
done.</p>
<p>A typical configuration will look like:</p>
<figure class="highlight"><pre><code class="language-conf" data-lang="conf">{
<span class="s2">"repository"</span>: {
<span class="s2">"dsn"</span>: <span class="s2">"postgresql://powa_user@server_dns:5432/powa"</span>
},
<span class="s2">"debug"</span>: <span class="n">true</span>
}</code></pre></figure>
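<p>For illustration only, here is a minimal Python sketch that reads a file shaped like the one above with the standard <code class="language-plaintext highlighter-rouge">json</code> module and extracts the repository DSN. The <code class="language-plaintext highlighter-rouge">load_collector_config()</code> helper is hypothetical, not the daemon's actual code, and it assumes the file is parsed as plain JSON, as its format suggests. In that case it must be strict JSON: a trailing comma after the last entry of an object is rejected.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import json


def load_collector_config(text):
    """Parse a powa-collector style JSON configuration (hypothetical helper)."""
    conf = json.loads(text)  # strict JSON: a trailing comma raises ValueError
    repo = conf.get("repository")
    if not isinstance(repo, dict) or "dsn" not in repo:
        raise ValueError("missing repository.dsn")
    # "debug" is treated as optional here and defaults to False
    return repo["dsn"], bool(conf.get("debug", False))


SAMPLE = """
{
    "repository": {
        "dsn": "postgresql://powa_user@server_dns:5432/powa"
    },
    "debug": true
}
"""

if __name__ == "__main__":
    dsn, debug = load_collector_config(SAMPLE)
    print(dsn, debug)</code></pre></figure>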
<p>The list of <em>remote servers</em>, their configuration and everything else it needs
will be automatically retrieved from the <em>repository server</em> you just
configured. When started, it spawns one dedicated thread per declared
<em>remote server</em>, and maintains a <strong>persistent connection</strong> to the configured
<strong>powa database</strong> on each <em>remote server</em>. Each thread performs <em>remote
snapshots</em>, exporting the data to the <em>repository server</em> using the new <em>source
functions</em>, and opens and closes a connection to the <em>repository
server</em> for each <em>remote snapshot</em>.</p>
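<p>The one-thread-per-server model can be sketched with Python's standard <code class="language-plaintext highlighter-rouge">threading</code> module. This is a hypothetical illustration, not powa-collector's implementation: <code class="language-plaintext highlighter-rouge">RemoteServerWorker</code>, <code class="language-plaintext highlighter-rouge">start_workers()</code> and the <code class="language-plaintext highlighter-rouge">snapshot_func</code> callback are made-up names, and the actual connection handling is reduced to comments.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import threading


class RemoteServerWorker(threading.Thread):
    """One worker thread per remote server (hypothetical sketch)."""

    def __init__(self, srvid, frequency, snapshot_func):
        super().__init__(daemon=True)
        self.srvid = srvid
        self.frequency = frequency          # seconds between two snapshots
        self.snapshot_func = snapshot_func  # performs one remote snapshot
        self.stop_event = threading.Event()

    def run(self):
        # A real worker would hold its persistent remote connection here and
        # open a short-lived repository connection for each snapshot.
        while not self.stop_event.is_set():
            self.snapshot_func(self.srvid)
            # wait() returns early as soon as stop() sets the event
            self.stop_event.wait(self.frequency)

    def stop(self):
        self.stop_event.set()


def start_workers(servers, snapshot_func):
    """Start one worker per (srvid, frequency) pair read from the repository."""
    workers = [RemoteServerWorker(srvid, freq, snapshot_func)
               for srvid, freq in servers]
    for worker in workers:
        worker.start()
    return workers</code></pre></figure>
<p>Using an <code class="language-plaintext highlighter-rouge">Event</code> for the sleep between snapshots lets a stop request interrupt the wait immediately instead of waiting for the full frequency.</p>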
<p>This daemon obviously needs to be able to connect to all the declared <em>remote
servers</em> and to the <em>repository server</em>. The <code class="language-plaintext highlighter-rouge">powa_servers</code> table, which stores
the list of <em>remote servers</em>, has fields to store the username and password used to
connect to each <em>remote server</em>. Storing a password in plain text in this table
is heresy as far as security is concerned. So, as mentioned in the
<a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">PoWA security
documentation</a>,
you can store a NULL password and <a href="https://www.postgresql.org/docs/current/auth-methods.html">instead use any of the authentication methods
that libpq supports</a>
(.pgpass file, certificate…). That's strongly recommended for any non-toy
setup.</p>
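<p>To make the NULL password case concrete, the following sketch builds a libpq key/value connection string from the columns of a <code class="language-plaintext highlighter-rouge">powa_servers</code> row and simply leaves out the <code class="language-plaintext highlighter-rouge">password</code> keyword when it is NULL, so that libpq falls back to its usual mechanisms such as the <code class="language-plaintext highlighter-rouge">.pgpass</code> file. The <code class="language-plaintext highlighter-rouge">build_conninfo()</code> helper is hypothetical; it only relies on libpq's documented quoting rules (values in single quotes, with quotes and backslashes escaped by a backslash).</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def quote_conninfo_value(value):
    """Quote a value for a libpq key/value connection string."""
    text = str(value)
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_conninfo(hostname, port, username, dbname, password=None):
    """Build a conninfo string for a powa_servers row (hypothetical helper).

    A None password (NULL in powa_servers) emits no password keyword, so
    libpq can use .pgpass, PGPASSWORD or certificate authentication instead.
    """
    params = [("host", hostname), ("port", port),
              ("user", username), ("dbname", dbname)]
    if password is not None:
        params.append(("password", password))
    return " ".join(k + "=" + quote_conninfo_value(v) for k, v in params)</code></pre></figure>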
<p>The persistent connection to the <em>repository server</em> is used to monitor the
daemon:</p>
<ul>
<li>to check that the daemon is up and running</li>
<li>to communicate with the UI using a <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/protocol.html">simple protocol</a>
to perform various actions (reload the configuration, check the status of a <em>remote
server</em> thread…)</li>
</ul>
<p>Note that you can also ask the daemon to reload its configuration by sending a
SIGHUP signal to the daemon process. A reload is required after any modification
of the list of remote servers (adding or removing a <em>remote server</em>, or
updating a setting of an existing one).</p>
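<p>A common way to implement this kind of reload in a Python daemon is to have the SIGHUP handler only set a flag, and let the main loop re-read the configuration later. The sketch below shows that pattern with the standard <code class="language-plaintext highlighter-rouge">signal</code> module; the <code class="language-plaintext highlighter-rouge">ReloadFlag</code> class is hypothetical, POSIX-only, and not taken from powa-collector.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import os
import signal
import threading


class ReloadFlag:
    """Record reload requests received through SIGHUP (hypothetical)."""

    def __init__(self):
        self.requested = threading.Event()

    def handle_sighup(self, signum, frame):
        # Keep the handler minimal: only flag the request, the main loop
        # re-reads the server list from the repository afterwards.
        self.requested.set()

    def install(self):
        # signal.signal() must be called from the main thread
        signal.signal(signal.SIGHUP, self.handle_sighup)

    def consume(self):
        """Return True once per pending reload request."""
        if self.requested.is_set():
            self.requested.clear()
            return True
        return False


if __name__ == "__main__":
    flag = ReloadFlag()
    flag.install()
    os.kill(os.getpid(), signal.SIGHUP)  # same effect as kill -HUP on the daemon
    flag.requested.wait(timeout=1)
    print("reload requested:", flag.consume())</code></pre></figure>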
<p>Also note that by choice,
<a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector</a>
will not perform <em>local snapshots</em>. If you want to use PoWA for the
<em>repository server</em>, you need to enable the original <em>background worker</em>.</p>
<h5 id="new-configuration-page">New configuration page</h5>
<p>The configuration page is now updated to give all needed information about the
background worker status and the <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector
daemon</a>
status (including all of its dedicated threads) and the list of registered
<em>remote servers</em>. Here’s an example of the new root configuration page:</p>
<p><a href="/images/powa_4_configuration_page.png"><img src="/images/powa_4_configuration_page.png" alt="New configuration
page" /></a></p>
<p>If the <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector
daemon</a>
is used, each remote server status will be retrieved using the communication
protocol. If the collector encounters any error (connecting to a <em>remote
server</em>, during a <em>snapshot</em> or anything else), it will be displayed here.
Such errors are also displayed at the top of every page of the UI,
so that you can't miss them.</p>
<p>Also, the configuration section now has a hierarchy, and you'll be able to see
the list of extensions and the current PostgreSQL configuration for the
<strong>local</strong> or <strong>remote servers</strong> by clicking on the server of your choice!</p>
<p>There's also a new <strong>Reload collector</strong> button on the header panel, which as
expected will ask the collector to reload its configuration. That can be
useful if you registered new servers and you don't have access to the server
where the collector is running.</p>
<h3 id="conclusion">Conclusion</h3>
<p>This is the last article introducing the new version of PoWA. It’s still in
beta, so feel free to test it, <a href="https://powa.readthedocs.io/en/latest/support.html#support">report any issue you may
find</a> or give any
other feedback!</p>
<p><a href="https://rjuju.github.io/postgresql/2019/12/10/powa-4-new-powa-collector.html">PoWA 4: New powa-collector daemon</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on December 10, 2019.</p>
https://rjuju.github.io/postgresqlfr/2019/06/05/powa-4-nouveaute-dans-powa-archivist2019-06-05T14:26:17+00:002019-06-05T14:26:17+00:00Julien Rouhaudhttps://rjuju.github.io
<p>This article is part of a series about <a href="http://powa.readthedocs.io/">the PoWA
4 beta</a>, and describes the changes in
<a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/index.html">powa-archivist</a>.</p>
<p>For more information about this version 4, you can read the <a href="/postgresqlfr/2019/05/17/powa-4-avec-mode-remote-disponible-en-beta.html">general
introduction article</a>.</p>
<h3 id="aperçu-rapide">Quick overview</h3>
<p>First of all, you should know that no upgrade is possible from v3
to v4, so you need to run a <code class="language-plaintext highlighter-rouge">DROP EXTENSION powa</code> if you
were already using PoWA on your servers. This is because v4 brings
<strong>a lot of</strong> changes to the SQL part of the extension, which makes it
the most significant change in the PoWA suite for this new version. At the
time I'm writing this article, the amount of changes in this extension is:</p>
<figure class="highlight"><pre><code class="language-diff" data-lang="diff"> CHANGELOG.md | 14 +
powa--4.0.0dev.sql | 2075 +++++++++++++++++++++-------
powa.c | 44 +-
3 files changed, 1629 insertions(+), 504 deletions(-)</code></pre></figure>
<p>The lack of an upgrade path shouldn't be a problem in practice. PoWA is a
performance analysis tool, designed to keep very precise data but with a very
limited history. If you're looking for a general-purpose monitoring solution
keeping months of data, PoWA is definitely not the tool you need.</p>
<h3 id="configurer-la-liste-des-serveurs-distants">Configuring the list of <em>remote servers</em></h3>
<p>As for the changes themselves, the first small change is that the <a href="https://www.postgresql.org/docs/current/bgworker.html">background
worker</a> is no longer
required for powa-archivist to work, as it isn't used in remote mode. This
means that a PostgreSQL restart is no longer needed to install PoWA. Obviously,
a restart is still required if you want to use the local mode, with the
background worker, or if you want to install additional extensions that
themselves require a restart.</p>
<p>Then, as PoWA needs some configuration (snapshot frequency, data retention
and so on), some new tables are added to configure all of that. The new
<code class="language-plaintext highlighter-rouge">powa_servers</code> table
stores the configuration of all the remote instances whose data should be
stored on this instance. This <em>local PoWA instance</em> is called a
<strong>repository server</strong> (which should typically be dedicated to storing
PoWA data), as opposed to the <strong>remote instances</strong>, which are the
instances you want to monitor. The content of this table is pretty
straightforward:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="err">\</span><span class="n">d</span> <span class="n">powa_servers</span>
<span class="k">Table</span> <span class="nv">"public.powa_servers"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span>
<span class="c1">-----------+----------+-----------+----------+------------------------------------------</span>
<span class="n">id</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'powa_servers_id_seq'</span><span class="p">::</span><span class="n">regclass</span><span class="p">)</span>
<span class="n">hostname</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="k">alias</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="n">port</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">username</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">password</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="n">dbname</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">frequency</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">300</span>
<span class="n">powa_coalesce</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">100</span>
<span class="n">retention</span> <span class="o">|</span> <span class="n">interval</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'1 day'</span><span class="p">::</span><span class="n">interval</span></code></pre></figure>
<p>If you've already used PoWA, you should recognize most of the configuration
options that are now stored here. The new options describe how to connect to
the <em>remote instances</em>, and can provide an alias to display in the UI.</p>
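<p>As a rough illustration of what a row holds, here is a Python dataclass mirroring the columns and defaults shown above. The class and its methods are hypothetical. The rule that a <code class="language-plaintext highlighter-rouge">frequency</code> of <strong>-1</strong> means deactivated snapshots comes from the <code class="language-plaintext highlighter-rouge">powa_deactivate_server()</code> function described in the next section, while the fallback label in <code class="language-plaintext highlighter-rouge">display_name()</code> is only an assumption.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class PowaServer:
    """Illustrative mirror of a powa_servers row (hypothetical class)."""
    hostname: str
    port: int
    username: str
    dbname: str
    alias: Optional[str] = None
    password: Optional[str] = None  # NULL: rely on .pgpass, certificates...
    frequency: int = 300            # seconds between snapshots, -1 = deactivated
    powa_coalesce: int = 100        # snapshots grouped per aggregation (assumed)
    retention: timedelta = timedelta(days=1)
    id: Optional[int] = None        # assigned by powa_servers_id_seq

    def is_active(self):
        return self.frequency != -1

    def display_name(self):
        """Assumed UI label: the alias if set, else host:port."""
        return self.alias or "%s:%d" % (self.hostname, self.port)</code></pre></figure>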
<p>You probably also noticed a <strong>password</strong> column. Storing a
password in plain text in this table is heresy for anyone who cares even
minimally about security. So, as mentioned in the <a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">security
section of the PoWA documentation
</a>,
you can store NULL in the password field and instead use
<a href="https://www.postgresql.org/docs/current/auth-methods.html">any of the other authentication methods supported by
libpq</a>
(.pgpass file, certificate…). A more secure authentication method is strongly
recommended for any serious setup.</p>
<p>Another table, <code class="language-plaintext highlighter-rouge">powa_snapshot_metas</code>, is also added to
store some metadata about the snapshots of each
<em>remote server</em>.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="k">Table</span> <span class="nv">"public.powa_snapshot_metas"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span>
<span class="c1">--------------+--------------------------+-----------+----------+---------------------------------------</span>
<span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">coalesce_seq</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">1</span>
<span class="n">snapts</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span>
<span class="n">aggts</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span>
<span class="n">purgets</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span>
<span class="n">errors</span> <span class="o">|</span> <span class="nb">text</span><span class="p">[]</span></code></pre></figure>
<p>It's simply a counter of the number of snapshots performed, a timestamp for
each type of event (snapshot, aggregation and purge) and an array of strings to
store any error encountered during the snapshot, so that the UI can display
it.</p>
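<p>The following Python sketch mirrors a <code class="language-plaintext highlighter-rouge">powa_snapshot_metas</code> row and how it could evolve after each snapshot. It is an assumption-based illustration, not PoWA's SQL code: the idea that <code class="language-plaintext highlighter-rouge">coalesce_seq</code> is incremented once per snapshot and that aggregation happens every <code class="language-plaintext highlighter-rouge">powa_coalesce</code> snapshots is inferred from the column names and defaults, and <code class="language-plaintext highlighter-rouge">datetime.min</code> merely stands in for PostgreSQL's <code class="language-plaintext highlighter-rouge">'-infinity'</code>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Stand-in for PostgreSQL's '-infinity' timestamp default.
NEG_INFINITY = datetime.min.replace(tzinfo=timezone.utc)


@dataclass
class SnapshotMeta:
    """Illustrative mirror of a powa_snapshot_metas row (hypothetical)."""
    srvid: int
    coalesce_seq: int = 1
    snapts: datetime = NEG_INFINITY
    aggts: datetime = NEG_INFINITY
    purgets: datetime = NEG_INFINITY
    errors: Optional[List[str]] = None

    def record_snapshot(self, now, errors=None):
        """Count one snapshot, stamp it, and keep its errors for the UI."""
        self.coalesce_seq += 1
        self.snapts = now
        self.errors = list(errors) if errors else None

    def needs_aggregation(self, powa_coalesce):
        """Assumed rule: aggregate once every powa_coalesce snapshots."""
        return self.coalesce_seq % powa_coalesce == 0

    def record_aggregation(self, now):
        self.aggts = now</code></pre></figure>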
<h3 id="api-sql-pour-configurer-les-serveurs-distants">SQL API to configure the <em>remote servers</em></h3>
<p>Although these tables are very simple, a <a href="https://powa.readthedocs.io/en/latest/remote_setup.html#configure-powa-and-stats-extensions-on-each-remote-server">basic SQL API is available
to declare new servers and
configure them</a>.
6 basic functions are available:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">powa_register_server()</code>, to declare a new <em>remote server</em>, as well
as the list of extensions available on it</li>
<li><code class="language-plaintext highlighter-rouge">powa_configure_server()</code>, to update any of the settings of the
specified <em>remote server</em> (using a JSON parameter, where the key is
the name of the setting to change and the value is the new value to use)</li>
<li><code class="language-plaintext highlighter-rouge">powa_deactivate_server()</code>, to disable snapshots for the specified <em>remote
server</em> (which actually sets its
<code class="language-plaintext highlighter-rouge">frequency</code> to <strong>-1</strong>)</li>
<li><code class="language-plaintext highlighter-rouge">powa_delete_and_purge_server()</code>, to remove the specified <em>remote server</em>
from the list of servers and delete all the associated
snapshot data</li>
<li><code class="language-plaintext highlighter-rouge">powa_activate_extension()</code>, to declare that a new extension is
available on the specified <em>remote server</em></li>
<li><code class="language-plaintext highlighter-rouge">powa_deactivate_extension()</code>, to declare that an extension is no longer
available on the specified <em>remote server</em></li>
</ul>
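<p>To illustrate the semantics of <code class="language-plaintext highlighter-rouge">powa_configure_server()</code> and <code class="language-plaintext highlighter-rouge">powa_deactivate_server()</code>, here is an in-memory Python sketch. It is not the SQL implementation: <code class="language-plaintext highlighter-rouge">SERVERS</code>, <code class="language-plaintext highlighter-rouge">configure_server()</code> and <code class="language-plaintext highlighter-rouge">deactivate_server()</code> are hypothetical stand-ins, and rejecting unknown setting names is an assumption of this sketch.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import json

# Hypothetical in-memory stand-in for the powa_servers table, keyed by id.
SERVERS = {
    1: {"hostname": "db1", "port": 5432, "frequency": 300, "retention": "1 day"},
}


def configure_server(srvid, data):
    """Mimic powa_configure_server(): apply a JSON object of setting names
    mapped to their new values."""
    settings = json.loads(data) if isinstance(data, str) else dict(data)
    row = SERVERS[srvid]
    unknown = set(settings) - set(row)
    if unknown:
        raise KeyError("unknown setting(s): " + ", ".join(sorted(unknown)))
    row.update(settings)
    return row


def deactivate_server(srvid):
    """Mimic powa_deactivate_server(): a frequency of -1 disables snapshots."""
    return configure_server(srvid, {"frequency": -1})</code></pre></figure>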
<p>Anything more complicated than that will have to be done with plain SQL
queries. Fortunately, there shouldn't be many other needs, and the tables are
really simple, so this shouldn't be a problem. <a href="https://github.com/powa-team/powa-archivist/issues">Don't hesitate to ask for new
functions</a> if you have
other needs though. Please also note that the UI doesn't let you
call these functions, as it is currently <strong>entirely
read-only</strong>.</p>
<h3 id="effectuer-des-snapshots-distants">Performing <em>remote snapshots</em></h3>
<p>Since the metrics are now stored on a different PostgreSQL instance, we
deeply changed the way <em>snapshots</em>
(retrieving the data provided by a <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">statistics
extension</a>
and storing it in the PoWA catalog <a href="/postgresqlfr/2019/04/06/minimiser-le-surcout-de-stockage-par-ligne.html">in a storage-efficient way</a>) are
performed.</p>
<p>The list of all the statistics extensions, or <em>data sources</em>, that
are available on a <strong>server</strong> (either <em>remote</em> or <em>local</em>) and for
which a <em>snapshot</em> should be performed is stored in a table
called <code class="language-plaintext highlighter-rouge">powa_functions</code>:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="k">Table</span> <span class="nv">"public.powa_functions"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span>
<span class="c1">----------------+---------+-----------+----------+---------</span>
<span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">module</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="k">operation</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">function_name</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">query_source</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="n">added_manually</span> <span class="o">|</span> <span class="nb">boolean</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">true</span>
<span class="n">enabled</span> <span class="o">|</span> <span class="nb">boolean</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">true</span>
<span class="n">priority</span> <span class="o">|</span> <span class="nb">numeric</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">10</span></code></pre></figure>
<p>A new <code class="language-plaintext highlighter-rouge">query_source</code> field has been added. It provides the name of the
<em>source function</em>, which a <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">statistics
extension</a>
needs in order to support remote snapshots. This function is used to export the
counters provided by this extension to a different server, in a dedicated <em>transient
table</em>. When remote mode is used, the <em>snapshot</em> function then automatically performs the <em>snapshot</em>
using this exported data instead of the data provided by the local
statistics extension. Note that exporting these counters and performing the
remote snapshot is done automatically by the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector
daemon</a>,
which I'll present in another article.</p>
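<p>The overall flow of a remote snapshot described above (export the remote counters into the transient tables, run the snapshot functions on the exported data, then discard it) can be sketched in Python as follows. All names here are hypothetical: in PoWA the data lives in unlogged tables on the <em>repository server</em> and the steps are driven by powa-collector, not by this code.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># Hypothetical stand-in for the unlogged transient tables on the repository
# server: maps a (table name, srvid) pair to the rows exported for it.
TRANSIENT = {}


def remote_snapshot(srvid, sources, snapshot_funcs):
    """Sketch of one remote snapshot (hypothetical, not powa-collector's code).

    sources maps a transient table name to a callable returning the rows that
    the remote source function would return; snapshot_funcs maps the same
    table name to a callable consuming (srvid, rows).
    """
    # 1. Export: copy the remote counters into the transient tables.
    for table, fetch in sources.items():
        TRANSIENT[(table, srvid)] = list(fetch())
    try:
        # 2. Snapshot: each snapshot function works on the exported rows.
        for table, consume in snapshot_funcs.items():
            consume(srvid, TRANSIENT.get((table, srvid), []))
    finally:
        # 3. Cleanup: transient data is only needed during the snapshot.
        for table in sources:
            TRANSIENT.pop((table, srvid), None)</code></pre></figure>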
<p>Here's an example showing how PoWA performs a <em>remote snapshot</em> of the
list of databases. As you'll see, it's very simple, which
means that it's also very simple to add the same support
to a new statistics extension.</p>
<p>The <em>transient table</em>:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="n">Unlogged</span> <span class="k">table</span> <span class="nv">"public.powa_databases_src_tmp"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span>
<span class="c1">---------+---------+-----------+----------+---------</span>
<span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">oid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">datname</span> <span class="o">|</span> <span class="n">name</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span></code></pre></figure>
<p>For better performance, all the <em>transient tables</em> are <strong>unlogged</strong>,
since their content is only needed during a
<em>snapshot</em> and is deleted right after. In this example, the <em>transient
table</em> only stores the identifier of the remote server these data belong to,
and the oid and name of each database present on
the <em>remote server</em>.</p>
<p>And the <em>source function</em>:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="k">public</span><span class="p">.</span><span class="n">powa_databases_src</span><span class="p">(</span><span class="n">_srvid</span> <span class="nb">integer</span><span class="p">,</span>
<span class="k">OUT</span> <span class="n">oid</span> <span class="n">oid</span><span class="p">,</span> <span class="k">OUT</span> <span class="n">datname</span> <span class="n">name</span><span class="p">)</span>
<span class="k">RETURNS</span> <span class="k">SETOF</span> <span class="n">record</span>
<span class="k">LANGUAGE</span> <span class="n">plpgsql</span>
<span class="k">AS</span> <span class="err">$</span><span class="k">function</span><span class="err">$</span>
<span class="k">BEGIN</span>
<span class="n">IF</span> <span class="p">(</span><span class="n">_srvid</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">THEN</span>
<span class="k">RETURN</span> <span class="n">QUERY</span> <span class="k">SELECT</span> <span class="n">d</span><span class="p">.</span><span class="n">oid</span><span class="p">,</span> <span class="n">d</span><span class="p">.</span><span class="n">datname</span>
<span class="k">FROM</span> <span class="n">pg_database</span> <span class="n">d</span><span class="p">;</span>
<span class="k">ELSE</span>
<span class="k">RETURN</span> <span class="n">QUERY</span> <span class="k">SELECT</span> <span class="n">d</span><span class="p">.</span><span class="n">oid</span><span class="p">,</span> <span class="n">d</span><span class="p">.</span><span class="n">datname</span>
<span class="k">FROM</span> <span class="n">powa_databases_src_tmp</span> <span class="n">d</span>
<span class="k">WHERE</span> <span class="n">srvid</span> <span class="o">=</span> <span class="n">_srvid</span><span class="p">;</span>
<span class="k">END</span> <span class="n">IF</span><span class="p">;</span>
<span class="k">END</span><span class="p">;</span>
<span class="err">$</span><span class="k">function</span><span class="err">$</span></code></pre></figure>
<p>This function simply returns the content of <code class="language-plaintext highlighter-rouge">pg_database</code> if the local data
are requested (server identifier <strong>0</strong> is always the local
server), or otherwise the content of the <em>transient table</em> for the specified remote
server.</p>
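<p>For readers more comfortable with a general-purpose language, here is the same dispatch logic in Python. <code class="language-plaintext highlighter-rouge">PG_DATABASE</code> and <code class="language-plaintext highlighter-rouge">POWA_DATABASES_SRC_TMP</code> are hypothetical in-memory stand-ins for <code class="language-plaintext highlighter-rouge">pg_database</code> and the transient table, with made-up sample rows.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">LOCAL_SRVID = 0  # server identifier 0 is always the local server

# Hypothetical stand-ins: pg_database on the local server, and the
# powa_databases_src_tmp transient table filled during remote snapshots.
PG_DATABASE = [(1, "template1"), (5, "postgres")]
POWA_DATABASES_SRC_TMP = [
    {"srvid": 1, "oid": 16384, "datname": "app"},
    {"srvid": 2, "oid": 16390, "datname": "billing"},
]


def powa_databases_src(srvid):
    """Python mirror of the PL/pgSQL source function shown above."""
    if srvid == LOCAL_SRVID:
        return list(PG_DATABASE)
    return [(r["oid"], r["datname"])
            for r in POWA_DATABASES_SRC_TMP if r["srvid"] == srvid]</code></pre></figure>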
<p>The <em>snapshot function</em> can then easily perform any processing
on this data for the requested <em>remote server</em>. In the case of the
<code class="language-plaintext highlighter-rouge">powa_databases_snapshot()</code> function, it simply synchronizes the
list of databases, and stores the drop timestamp if a
previously existing database is no longer listed.</p>
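<p>A minimal Python sketch of that synchronization could look like the following. It is an approximation under stated assumptions, not the actual <code class="language-plaintext highlighter-rouge">powa_databases_snapshot()</code> code: databases are identified by oid, a renamed database simply gets its new name, and a database that disappears from the source function's output is stamped once with the current time as its drop timestamp.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from datetime import datetime, timezone


def sync_databases(known, current, now):
    """Sketch of a databases snapshot synchronization (assumed logic).

    known: dict mapping an oid to {"datname": ..., "dropped": None or datetime}
    current: iterable of (oid, datname) pairs returned by the source function
    """
    seen = set()
    for oid, datname in current:
        seen.add(oid)
        entry = known.get(oid)
        if entry is None:
            known[oid] = {"datname": datname, "dropped": None}
        else:
            entry["datname"] = datname  # the database may have been renamed
    for oid, entry in known.items():
        if oid not in seen and entry["dropped"] is None:
            entry["dropped"] = now  # remember when it disappeared
    return known


if __name__ == "__main__":
    databases = {16384: {"datname": "app", "dropped": None}}
    sync_databases(databases, [(16390, "billing")], datetime.now(timezone.utc))
    print(databases)</code></pre></figure>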
<p>For more details, you can read the documentation about
<a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/development.html">adding a data source to
PoWA</a>,
which has been updated for the version 4 specifics.</p>
<p><a href="https://rjuju.github.io/postgresqlfr/2019/06/05/powa-4-nouveaute-dans-powa-archivist.html">PoWA 4: nouveautés dans powa-archivist !</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on June 05, 2019.</p>
https://rjuju.github.io/postgresql/2019/06/05/powa-4-new-in-powa-archivist2019-06-05T14:26:17+00:002019-06-05T14:26:17+00:00Julien Rouhaudhttps://rjuju.github.io
<p>This article is part of the <a href="http://powa.readthedocs.io/">PoWA 4 beta</a> series,
and describes the changes done in
<a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/index.html">powa-archivist</a>.</p>
<p>For more information about this v4, you can consult the <a href="/postgresql/2019/05/17/powa-4-with-remote-mode-beta-is-available.html">general introduction
article</a>.</p>
<h3 id="quick-overview">Quick overview</h3>
<p>First of all, you have to know that no upgrade is possible from v3 to
v4, so a <code class="language-plaintext highlighter-rouge">DROP EXTENSION powa</code> is required if you were already using PoWA on
any of your servers. This is because v4 involves <strong>a lot</strong> of changes in
the SQL part of the extension, making it the most significant change in the
PoWA suite for this new version. Looking at the amount of changes at the time I'm
writing this article, I get:</p>
<figure class="highlight"><pre><code class="language-diff" data-lang="diff"> CHANGELOG.md | 14 +
powa--4.0.0dev.sql | 2075 +++++++++++++++++++++-------
powa.c | 44 +-
3 files changed, 1629 insertions(+), 504 deletions(-)</code></pre></figure>
<p>The lack of upgrade shouldn’t be a problem in practice though. PoWA is a
performance tool, so it’s intended to have data with high precision but with a
very limited history. If you’re looking for a general monitoring solution
keeping months of counters, PoWA is definitely not the tool you need.</p>
<h3 id="configuring-the-list-of-remote-servers">Configuring the list of <em>remote servers</em></h3>
<p>Concerning the features themselves, the first small change is that
powa-archivist does not require the <a href="https://www.postgresql.org/docs/current/bgworker.html">background
worker</a> to be active
anymore, as it isn't used in remote mode. That means that a PostgreSQL
restart is no longer needed to install PoWA. Obviously, a restart is still
required if you want to use the local setup, using the background worker, or if
you want to install additional extensions that themselves require a restart.</p>
<p>Then, as PoWA needs some configuration (snapshot frequency, data retention
and so on), some new tables are added to configure all of that. The
new <code class="language-plaintext highlighter-rouge">powa_servers</code> table stores the configuration for all the remote instances
whose data should be stored on this instance. This <em>local PoWA instance</em> is
called a <strong>repository server</strong> (which should typically be dedicated to storing
PoWA data), as opposed to the <strong>remote instances</strong>, which are the instances you
want to monitor. The content of this table is pretty straightforward:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="err">\</span><span class="n">d</span> <span class="n">powa_servers</span>
<span class="k">Table</span> <span class="nv">"public.powa_servers"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span>
<span class="c1">-----------+----------+-----------+----------+------------------------------------------</span>
<span class="n">id</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'powa_servers_id_seq'</span><span class="p">::</span><span class="n">regclass</span><span class="p">)</span>
<span class="n">hostname</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="k">alias</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="n">port</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">username</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">password</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="n">dbname</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">frequency</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">300</span>
<span class="n">powa_coalesce</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">100</span>
<span class="n">retention</span> <span class="o">|</span> <span class="n">interval</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'1 day'</span><span class="p">::</span><span class="n">interval</span></code></pre></figure>
<p>If you already used PoWA, you should recognize most of the configuration
options, which are now stored here. The new options describe how to
connect to the <em>remote servers</em>, and can provide an alias to be displayed in
the UI.</p>
<p>You also probably noticed a <strong>password</strong> column here. Storing a password in
plain text in this table is a heresy as far as security is concerned. So, as
mentioned in the <a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">PoWA security section of the
documentation</a>,
you can store a NULL password and instead use <a href="https://www.postgresql.org/docs/current/auth-methods.html">any of the authentication methods
that libpq supports</a>
(.pgpass file, certificate…). That’s strongly recommended for any non-toy
setup.</p>
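<p>For instance, assuming a server registered with the hypothetical alias <code class="language-plaintext highlighter-rouge">myserver</code>, the stored password could be removed and replaced by a <code class="language-plaintext highlighter-rouge">.pgpass</code> entry in the home directory of the operating system user opening the connections. On Unix, libpq ignores this file if it is readable by group or others (use for instance mode 0600).</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Stop storing the password of this (hypothetical) server in plain text.
UPDATE powa_servers
SET password = NULL
WHERE alias = 'myserver';

-- libpq will then look for a matching line in ~/.pgpass, in the format:
--   hostname:port:database:username:password
-- e.g. myserver.domain.com:5432:powa:powa:mypassword</code></pre></figure>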
<p>Another table, the <code class="language-plaintext highlighter-rouge">powa_snapshot_metas</code> table, is also added to store some
metadata regarding each <em>remote server</em> snapshot information:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="k">Table</span> <span class="nv">"public.powa_snapshot_metas"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span>
<span class="c1">--------------+--------------------------+-----------+----------+---------------------------------------</span>
<span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">coalesce_seq</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">1</span>
<span class="n">snapts</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span>
<span class="n">aggts</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span>
<span class="n">purgets</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span>
<span class="n">errors</span> <span class="o">|</span> <span class="nb">text</span><span class="p">[]</span></code></pre></figure>
<p>That’s basically a counter to track the number of snapshots done, the timestamp
of each kind of event (snapshot, aggregate and purge), and a
text array to store any error raised during the snapshot, which the UI can
display.</p>
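<p>A query along these lines could give an overview of the snapshot activity of each server. It is a sketch assuming that <code class="language-plaintext highlighter-rouge">srvid</code> references <code class="language-plaintext highlighter-rouge">powa_servers.id</code> and that <code class="language-plaintext highlighter-rouge">coalesce_seq</code> is the snapshot counter described above.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- One line per server: snapshot counter, last events and any error.
SELECT s.id,
       coalesce(s.alias, s.hostname || ':' || s.port) AS server,
       m.coalesce_seq AS snapshots,
       m.snapts AS last_snapshot,
       m.aggts AS last_aggregate,
       m.purgets AS last_purge,
       m.errors
FROM powa_servers s
JOIN powa_snapshot_metas m ON m.srvid = s.id
ORDER BY s.id;</code></pre></figure>
<p>Given the column defaults, a <code class="language-plaintext highlighter-rouge">-infinity</code> timestamp presumably means that the corresponding event never happened yet for that server.</p>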
<h3 id="sql-api-to-configure-the-remote-servers">SQL API to configure the <em>remote servers</em></h3>
<p>While those tables are simple, a <a href="https://powa.readthedocs.io/en/latest/remote_setup.html#configure-powa-and-stats-extensions-on-each-remote-server">basic SQL API is available to register new
servers and configure
them</a>.
Six functions are available:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">powa_register_server()</code>, to declare a new <em>remote server</em>, and the list of
extensions available on it</li>
<li><code class="language-plaintext highlighter-rouge">powa_configure_server()</code> to update any setting for the specified <em>remote
server</em> (using a JSON where the key is the name of the parameter to change,
and the value is the new value to use)</li>
<li><code class="language-plaintext highlighter-rouge">powa_deactivate_server()</code> to disable snapshots on the specified <em>remote
server</em> (which actually is setting up the <code class="language-plaintext highlighter-rouge">frequency</code> to <strong>-1</strong>)</li>
<li><code class="language-plaintext highlighter-rouge">powa_delete_and_purge_server()</code> to remove the specified <em>remote server</em>
from the list of servers and remove all associated snapshot data</li>
<li><code class="language-plaintext highlighter-rouge">powa_activate_extension()</code>, to declare that a new extension is available
on the specified <em>remote server</em></li>
<li><code class="language-plaintext highlighter-rouge">powa_deactivate_extension()</code>, to specify that an extension is not available
anymore on the specified <em>remote server</em></li>
</ul>
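<p>The following sketch shows the intended workflow for a hypothetical server. The named parameters of <code class="language-plaintext highlighter-rouge">powa_register_server()</code> follow the PoWA remote setup documentation; the (server id, …) argument order of the other functions is an assumption, so check it with <code class="language-plaintext highlighter-rouge">\df powa_*</code> on your installation first.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Declare a new remote server and the stat extensions installed on it.
-- Unspecified settings keep the function defaults, and the password stays
-- NULL, as recommended in the security note above.
SELECT powa_register_server(hostname => 'myserver.domain.com',
    alias => 'myserver',
    extensions => '{pg_stat_kcache,pg_qualstats,pg_wait_sampling}');

-- Change some settings (assumed: JSON keys are powa_servers column names).
SELECT powa_configure_server(id, '{"frequency": 60, "retention": "3 days"}')
FROM powa_servers
WHERE alias = 'myserver';

-- Temporarily stop the snapshots (its frequency is set to -1).
SELECT powa_deactivate_server(id)
FROM powa_servers
WHERE alias = 'myserver';</code></pre></figure>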
<p>Any action more complicated than this should be performed using plain SQL
queries. Hopefully, there shouldn’t be many other needs, and the tables are
straightforward so this shouldn’t be a problem. <a href="https://github.com/powa-team/powa-archivist/issues">Feel free to ask for more
functions</a> if you feel the
need though. Please also note that the UI doesn’t allow you to call those
functions, as the UI is for now entirely <strong>read only</strong>.</p>
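<p>As an example of such plain SQL, here is how the connection information of the hypothetical <code class="language-plaintext highlighter-rouge">myserver</code> server could be updated after the monitored instance moved to another host (the values are made up).</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- No dedicated function is needed: update the configuration table directly.
UPDATE powa_servers
SET hostname = 'newhost.domain.com',
    port = 5433
WHERE alias = 'myserver';</code></pre></figure>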
<h3 id="performing-remote-snapshots">Performing <em>remote snapshots</em></h3>
<p>As metrics are now stored on a different PostgreSQL instance, we had to
extensively change the way <em>snapshots</em> (retrieving the data from a <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">stat
extension</a>
and storing them in the PoWA catalog <a href="/postgresql/2016/09/16/minimizing-tuple-overhead.html">in a space efficient way</a>) are performed.</p>
<p>The list of all stat extensions, or <em>data sources</em>, that are available on a
<strong>server</strong> (either <em>remote</em> or <em>local</em>) and for which we should perform a
<em>snapshot</em> is configured in a table called <code class="language-plaintext highlighter-rouge">powa_functions</code>:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="k">Table</span> <span class="nv">"public.powa_functions"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span>
<span class="c1">----------------+---------+-----------+----------+---------</span>
<span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">module</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="k">operation</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">function_name</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">query_source</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="n">added_manually</span> <span class="o">|</span> <span class="nb">boolean</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">true</span>
<span class="n">enabled</span> <span class="o">|</span> <span class="nb">boolean</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">true</span>
<span class="n">priority</span> <span class="o">|</span> <span class="nb">numeric</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">10</span></code></pre></figure>
<p>A new <code class="language-plaintext highlighter-rouge">query_source</code> field is added, that provides the name of a <em>source</em>
function, required to support remote snapshot of any <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">stat
extensions</a>.
This function is used to export the counters provided by this extension on a
different server, in a dedicated <em>transient table</em>. The <em>snapshot</em> function
will then perform the <em>snapshot</em> using those exported data instead of the one
provided by stat extensions locally when the remote mode is used. Note that
the counters export and the remote snapshot is done automatically with the the
new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector
daemon</a>,
that I’ll cover in another article.</p>
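<p>To see which data sources are declared for a given server, and which of them provide a source function usable in remote mode, the table can be queried directly. This sketch uses the local server (id 0); the ordering is only meant to group related rows, not to reproduce the exact snapshot order.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Data sources declared for the local server (server id 0).
SELECT module, operation, function_name, query_source, enabled
FROM powa_functions
WHERE srvid = 0
ORDER BY module, priority;</code></pre></figure>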
<p>Here’s an example of how PoWA performs a <em>remote snapshot</em> of the list of
databases. As you’ll see, this is very simple, meaning that it’s very easy
to add support for a new stat extension.</p>
<p>The <em>transient table</em>:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="n">Unlogged</span> <span class="k">table</span> <span class="nv">"public.powa_databases_src_tmp"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span>
<span class="c1">---------+---------+-----------+----------+---------</span>
<span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">oid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span>
<span class="n">datname</span> <span class="o">|</span> <span class="n">name</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span></code></pre></figure>
<p>For better performance, all the <em>transient tables</em> are <strong>unlogged</strong>, as their
content is only needed during a <em>snapshot</em> and is discarded afterwards. In this
example the <em>transient table</em> only stores the identifier of the server the
data belongs to, and the oid and name of each database present on the <em>remote server</em>.</p>
<p>And the <em>source function</em>:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="k">public</span><span class="p">.</span><span class="n">powa_databases_src</span><span class="p">(</span><span class="n">_srvid</span> <span class="nb">integer</span><span class="p">,</span>
<span class="k">OUT</span> <span class="n">oid</span> <span class="n">oid</span><span class="p">,</span> <span class="k">OUT</span> <span class="n">datname</span> <span class="n">name</span><span class="p">)</span>
<span class="k">RETURNS</span> <span class="k">SETOF</span> <span class="n">record</span>
<span class="k">LANGUAGE</span> <span class="n">plpgsql</span>
<span class="k">AS</span> <span class="err">$</span><span class="k">function</span><span class="err">$</span>
<span class="k">BEGIN</span>
<span class="n">IF</span> <span class="p">(</span><span class="n">_srvid</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">THEN</span>
<span class="k">RETURN</span> <span class="n">QUERY</span> <span class="k">SELECT</span> <span class="n">d</span><span class="p">.</span><span class="n">oid</span><span class="p">,</span> <span class="n">d</span><span class="p">.</span><span class="n">datname</span>
<span class="k">FROM</span> <span class="n">pg_database</span> <span class="n">d</span><span class="p">;</span>
<span class="k">ELSE</span>
<span class="k">RETURN</span> <span class="n">QUERY</span> <span class="k">SELECT</span> <span class="n">d</span><span class="p">.</span><span class="n">oid</span><span class="p">,</span> <span class="n">d</span><span class="p">.</span><span class="n">datname</span>
<span class="k">FROM</span> <span class="n">powa_databases_src_tmp</span> <span class="n">d</span>
<span class="k">WHERE</span> <span class="n">srvid</span> <span class="o">=</span> <span class="n">_srvid</span><span class="p">;</span>
<span class="k">END</span> <span class="n">IF</span><span class="p">;</span>
<span class="k">END</span><span class="p">;</span>
<span class="err">$</span><span class="k">function</span><span class="err">$</span></code></pre></figure>
<p>This function simply returns the content of <code class="language-plaintext highlighter-rouge">pg_database</code> if local data are
asked (server id <strong>0</strong> is always the local server), or the content of the
<em>transient table</em> for the given remote server otherwise.</p>
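<p>To make the flow more concrete, here is a hand-written sketch of what happens on the repository server for a remote server whose id is assumed to be 1. In practice the collector takes care of this; the database oids and names below are made up.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Rows normally fetched on the remote server, with something like
--   SELECT oid, datname FROM powa_databases_src(0);
-- are copied into the transient table of the repository server.
DELETE FROM powa_databases_src_tmp WHERE srvid = 1;
INSERT INTO powa_databases_src_tmp (srvid, oid, datname)
VALUES (1, 16384, 'db1'),
       (1, 16385, 'db2');

-- The snapshot then reads the exported rows for server 1 through the
-- very same source function.
SELECT oid, datname FROM powa_databases_src(1);</code></pre></figure>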
<p>The <em>snapshot function</em> can then easily do any required work with the data
for the wanted <em>remote server</em>. In the case of the <code class="language-plaintext highlighter-rouge">powa_databases_snapshot()</code>
function, it simply synchronizes the list of databases, storing the
timestamp of removal if a previously existing database is not found anymore.</p>
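<p>The synchronization described above can be sketched as follows. This is not the actual <code class="language-plaintext highlighter-rouge">powa_databases_snapshot()</code> code: the <code class="language-plaintext highlighter-rouge">my_databases</code> table and <code class="language-plaintext highlighter-rouge">my_databases_snapshot()</code> function are hypothetical and only reuse the <code class="language-plaintext highlighter-rouge">powa_databases_src()</code> source function shown above. <code class="language-plaintext highlighter-rouge">ON CONFLICT</code> requires PostgreSQL 9.5 or later.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Hypothetical history table: one row per (server, database).
CREATE TABLE IF NOT EXISTS my_databases (
    srvid   integer NOT NULL,
    datid   oid     NOT NULL,
    datname name    NOT NULL,
    dropped timestamp with time zone,
    PRIMARY KEY (srvid, datid)
);

CREATE OR REPLACE FUNCTION my_databases_snapshot(_srvid integer)
RETURNS void
LANGUAGE plpgsql
AS $function$
BEGIN
    -- Add new databases, refresh known ones (and revive reused oids).
    INSERT INTO my_databases (srvid, datid, datname, dropped)
    SELECT _srvid, s.oid, s.datname, NULL
    FROM powa_databases_src(_srvid) s
    ON CONFLICT (srvid, datid)
    DO UPDATE SET datname = EXCLUDED.datname, dropped = NULL;

    -- Databases not reported anymore: keep them, with their removal time.
    UPDATE my_databases d
    SET dropped = now()
    WHERE d.srvid = _srvid
      AND d.dropped IS NULL
      AND NOT EXISTS (SELECT 1
                      FROM powa_databases_src(_srvid) s
                      WHERE s.oid = d.datid);
END;
$function$;</code></pre></figure>
<p>Calling <code class="language-plaintext highlighter-rouge">SELECT my_databases_snapshot(0);</code> would then synchronize the list for the local server.</p>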
<p>For more details, you can consult the <a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/development.html">PoWA datasource
integration</a>
documentation, which was updated for the version 4 specificities.</p>
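<p>As an illustration of how little is needed, here is a sketch of a transient table and source function for a hypothetical new data source exposing the number of committed transactions per database from <code class="language-plaintext highlighter-rouge">pg_stat_database</code>. The <code class="language-plaintext highlighter-rouge">my_dbstats_*</code> names are made up, and declaring the source in <code class="language-plaintext highlighter-rouge">powa_functions</code> along with its snapshot, aggregate and purge functions is left out; the documentation linked above describes the actual requirements.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Hypothetical transient table, unlogged like PoWA's own ones.
CREATE UNLOGGED TABLE IF NOT EXISTS my_dbstats_src_tmp (
    srvid       integer NOT NULL,
    datid       oid     NOT NULL,
    xact_commit bigint  NOT NULL
);

-- Hypothetical source function, following the powa_databases_src() pattern.
CREATE OR REPLACE FUNCTION my_dbstats_src(_srvid integer,
    OUT datid oid, OUT xact_commit bigint)
RETURNS SETOF record
LANGUAGE plpgsql
AS $function$
BEGIN
    IF (_srvid = 0) THEN
        -- Local server: read the live counters.
        RETURN QUERY SELECT s.datid, s.xact_commit
        FROM pg_stat_database s;
    ELSE
        -- Remote server: read the counters exported in the transient table.
        RETURN QUERY SELECT t.datid, t.xact_commit
        FROM my_dbstats_src_tmp t
        WHERE t.srvid = _srvid;
    END IF;
END;
$function$;</code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">SELECT * FROM my_dbstats_src(0);</code> returns the local counters, while any other server id returns whatever was exported for that server.</p>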
<p><a href="https://rjuju.github.io/postgresql/2019/06/05/powa-4-new-in-powa-archivist.html">PoWA 4: changes in powa-archivist!</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on June 05, 2019.</p>
https://rjuju.github.io/postgresql/2019/05/17/powa-4-with-remote-mode-beta-is-available2019-05-17T11:04:17+00:002019-05-17T11:04:17+00:00Julien Rouhaudhttps://rjuju.github.io
<p><a href="http://powa.readthedocs.io/">PoWA 4</a> is available in beta.</p>
<h3 id="new-remote-mode">New remote mode!</h3>
<p>The <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">new remote mode</a>
is the biggest feature introduced in PoWA 4, though there have been other
improvements.</p>
<p>I’ll describe here what this new mode implies and what changed in the
<a href="https://powa.readthedocs.io/en/latest/components/powa-web/index.html">UI</a>.</p>
<p>If you’re interested in more details about the rest of the changes in PoWA 4,
I’ll soon publish other articles for that.</p>
<p>For the most hurried people, feel free to go directly to the <a href="https://dev-powa.anayrat.info/">v4 demo of
PoWA</a>, kindly hosted by <a href="http://blog.anayrat.info/">Adrien
Nayrat</a>. No credentials are needed, just click on
“Login”.</p>
<h3 id="why-is-a-remote-mode-important">Why is a remote mode important</h3>
<p>This feature has probably been the most frequently requested one since PoWA was first
released, back in 2014. And it was requested for good reasons, as a local mode
has some drawbacks.</p>
<p>First, let’s look at the architecture up to PoWA 3. Assume an instance
with 2 databases (db1 and db2), plus <strong>one database dedicated to PoWA</strong>. This
dedicated database is used both to get the live performance data, through the
<em>stat extensions</em> installed in it, and to <strong>store them</strong>.</p>
<p><a href="/images/powa_4_local.svg"><img src="/images/powa_4_local.svg" alt="Local mode architecture" /></a></p>
<p>A custom <em><a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/configuration.html#background-worker-configuration">background
worker</a></em>
is started by PoWA, which is responsible for regularly taking snapshots and storing them
in the dedicated powa database. Then, using powa-web, you can see the
activity of any of the <strong>local</strong> databases by querying the stored data on the
dedicated database, and possibly by connecting to one of the other local databases
when complete data are needed, for instance when using the index suggestion
tool.</p>
<p>With version 4, the architecture of a remote setup changes quite a lot:</p>
<p><a href="/images/powa_4_remote.svg"><img src="/images/powa_4_remote.svg" alt="Remote mode architecture" /></a></p>
<p>You can see that a dedicated powa database is still required, but <strong>only for the
stat extensions</strong>. Data are now stored on a different instance. Then, the
<em><a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/configuration.html#background-worker-configuration">background
worker</a></em>
is replaced by a <strong><a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">new collector
daemon</a></strong>,
which reads the performance data from the <em>remote servers</em>, and stores them on the
dedicated <em>repository server</em>. Powa-web will then be able to display the
activity by connecting to the <em>repository server</em>, and also to the <strong>remote
server</strong> when complete data are needed.</p>
<p>In short, with the new remote mode introduced in this version 4:</p>
<ul>
<li>a PostgreSQL restart is not required anymore to install powa-archivist
extension, as the background worker is not mandatory anymore</li>
<li>there is no overhead due to storing and querying data on the same
PostgreSQL server as your production server (some parts of
the UI still require querying the original server, for instance when
showing EXPLAIN plans, but that’s a negligible overhead)</li>
<li>it’s now possible to use PoWA on a <strong>hot-standby server</strong></li>
</ul>
<p>The UI will therefore now welcome you with an initial page letting you choose
which of the servers stored on the configured database you want to work on:
<a href="/images/powa_4_all_servers.png"><img src="/images/powa_4_all_servers.png" alt="Servers choice" /></a></p>
<p>The main reason it took so much time to bring a remote mode is that it
adds quite some complexity, requiring a major rewrite of the whole PoWA stack.
We also wanted to add other features first, such as the <strong>global index
suggestion</strong>, with <strong>validation using <a href="http://hypopg.readthedocs.io/">hypopg</a></strong>,
introduced with <a href="https://powa.readthedocs.io/en/latest/releases/v3.0.0.html">PoWA
3</a>.</p>
<h3 id="changes-in-powa-web">Changes in <a href="https://powa.readthedocs.io/en/latest/components/powa-web/index.html">powa-web</a></h3>
<p>The <em>user interface</em> is the component which probably has the most visible
changes in this version 4. Here are the most important ones.</p>
<h5 id="remote-mode-compatibility">Remote mode compatibility</h5>
<p>The biggest change is obviously the support for the <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">new remote
mode</a>. As a
consequence, the first page shown is now a <strong>server selector</strong> page, displaying
all registered <em>remote servers</em>. After choosing the wanted <em>remote server</em> (or the
<em>local server</em> if you don’t use the remote mode), all other pages are
similar to those that were available until PoWA 3, but display data for the
selected <em>remote server</em> only, retrieved of course from the
<strong>repository powa database</strong>, along with some new information I’ll describe just
after.</p>
<p>Note that as the data is now stored on a dedicated <em>repository server</em> when
using the remote mode, most of the UI is usable without connecting to the
currently selected <em>remote server</em>. However, powa-web still needs to
connect to the <em>remote server</em> when the original data are needed (for instance,
for index suggestion or when showing <strong>EXPLAIN</strong> plans). The <a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">same
authentication considerations and
possibilities</a>
as for the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector
daemon</a>
(which will be described in a following article) apply here.</p>
<h5 id="pg_track_settings-support"><a href="https://github.com/rjuju/pg_track_settings/">pg_track_settings</a> support</h5>
<p>When this extension is properly configured, a new timeline widget will appear,
placed between each graph and its overview, displaying any kind of recorded
change if any was detected in the currently selected time interval. On the
per-database and per-query pages, this list will be filtered by the selected
database.</p>
<p>The same timeline will be displayed on every graph of each page, to easily
check if this change had any visible impact using the various graphs.</p>
<p>Note that details of the changes will be displayed on mouseover. You can also
click on any event on the timeline to make the event stay displayed, and draw a
vertical line on the underlying graph.</p>
<p>Here’s an example of such detected configuration change in action:</p>
<p><a href="/images/pg_track_settings_powa4.png"><img src="/images/pg_track_settings_powa4.png" alt="Configuration changes detected" /></a></p>
<p>Please also note that you need at least version 2.0.0 of
<a href="https://github.com/rjuju/pg_track_settings/">pg_track_settings</a>, and that the
extension has to be installed <strong>both on the <em>remote servers</em> and the
<em>repository server</em>.</strong></p>
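<p>A quick way to check this requirement is to run a query like the following on each <em>remote server</em> and on the <em>repository server</em>. It is a sketch assuming a purely numeric version string (such as <code class="language-plaintext highlighter-rouge">2.0.0</code>); otherwise the cast to an integer array fails.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Is pg_track_settings installed here, and is it at least version 2.0.0?
-- Integer arrays are compared element by element, e.g. {2,1,0} > {2,0,0}.
SELECT extname,
       extversion,
       string_to_array(extversion, '.')::int[] >= '{2,0,0}'::int[] AS recent_enough
FROM pg_extension
WHERE extname = 'pg_track_settings';</code></pre></figure>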
<h5 id="new-graphs-available">New graphs available</h5>
<p>When
<a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_stat_kcache.html">pg_stat_kcache</a>
is set up, its information was previously only displayed on the per-query page.
It’s now also displayed on the per-server and per-database pages, in two graphs:</p>
<ul>
<li>in the <strong>Block Access</strong> graph, where the <strong>OS cache</strong> and <strong>disk read</strong>
metrics will replace the <strong>read</strong> metric</li>
<li>in a new <strong>System Resources</strong> graph (which is also added in the <em>per-query</em>
page), showing the <a href="/postgresql/2018/07/17/pg_stat_kcache-2-1-is-out.html">metrics added in pg_stat_kcache 2.1</a></li>
</ul>
<p>Here is an example of this new <strong>System Resources</strong> graph:</p>
<p><a href="/images/pg_stat_kcache_system_resources_powa4.png"><img src="/images/pg_stat_kcache_system_resources_powa4.png" alt="System ressources" /></a></p>
<p>There was also a <strong>Wait Events</strong> graph (available when <a href="https://powa.readthedocs.io/en/v4/components/stats_extensions/pg_wait_sampling.html">pg_wait_sampling
extension</a>
is setup) only available on the per-query page. This graph is now available on
the per-server and per-database pages too.</p>
<h5 id="metrics-documentation-and-documentation-link">Metrics documentation and documentation link</h5>
<p>Some metrics displayed in the user interface were quite self-explanatory, while
others could be a little bit obscure. Unfortunately, until now there wasn’t any
documentation for any of the metrics. That’s now fixed: all graphs have an
<em>information icon</em> that displays a description of the metrics used in the
graph on mouseover. Some graphs also include a link to the underlying
<a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">stat extension in PoWA
documentation</a>
for users who want to learn more about them.</p>
<p>Here’s an example:</p>
<p><a href="/images/powa_4_metrics_doc.png"><img src="/images/powa_4_metrics_doc.png" alt="Metrics documentation" /></a></p>
<h5 id="and-general-bugfixes">And general bugfixes</h5>
<p>Some longstanding reported issues were also fixed:</p>
<ul>
<li>the graph hover box showing metric values had a wrong vertical position</li>
<li>the time selection using the graph preview didn’t show a correct preview
after applying the selection</li>
<li>errors during hypothetical index creation, and in some cases their display,
weren’t correctly handled on multiple pages</li>
<li>grid filters weren’t reapplied when time selection was changed</li>
</ul>
<p>If you have ever been annoyed by any of this, you’ll be glad to know that
they’re now all fixed!</p>
<h3 id="conclusion">Conclusion</h3>
<p>This 4th version of PoWA represents a lot of time spent on development, documentation
improvements and testing. We’re now quite satisfied with it, but we may have
missed some bugs. If you’re interested in this project, I hope that you’ll
consider testing the beta, and if needed don’t hesitate <a href="https://powa.readthedocs.io/en/latest/support.html#support">to report a
bug</a>!</p>
<p><a href="https://rjuju.github.io/postgresql/2019/05/17/powa-4-with-remote-mode-beta-is-available.html">PoWA 4 brings a remote mode, available in beta!</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on May 17, 2019.</p>
https://rjuju.github.io/postgresqlfr/2019/05/17/powa-4-avec-mode-remote-disponible-en-beta2019-05-17T11:04:17+00:002019-05-17T11:04:17+00:00Julien Rouhaudhttps://rjuju.github.io
<p><a href="http://powa.readthedocs.io/">PoWA 4</a> est disponible en beta.</p>
<h3 id="nouveau-mode-remote-">Nouveau mode remote !</h3>
<p>Le <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">nouveau mode remote</a>
est la plus grosse fonctionnalité ajoutée dans PoWA 4, bien qu’il y ait eu
d’autres améliorations.</p>
<p>Je vais décrire ici ce que ce nouveau mode implique ainsi que ce qui a changé
sur l’<a href="https://powa.readthedocs.io/en/latest/components/powa-web/index.html">UI</a>.</p>
<p>Si de plus amples détails sur le reste des changements apportés dans PoWA 4
vous intéresse, je publierai bientôt d’autres articles sur le sujet.</p>
<p>Pour les plus pressés, n’hésitez pas à aller directement sur la <a href="https://dev-powa.anayrat.info/">démo v4 de
PoWA</a>, très gentiment hébergée par <a href="http://blog.anayrat.info/">Adrien
Nayrat</a>. Aucun authentification n’est requise,
cliquez simplement sur “Login”.</p>
<h3 id="pourquoi-un-mode-remote-est-il-important">Pourquoi un mode remote est-il important</h3>
<p>Cette fonctionnalité a probablement été la plus fréquemment demandée depuis que
PoWA a été publié, en 2014. Et c’est pour de bonnes raisons, car un mode local
a quelques inconvénients.</p>
<p>First, let’s look at what the architecture was like with versions 3
and earlier. Imagine an instance containing 2 databases (db1 and
db2), as well as <strong>one database dedicated to PoWA</strong>. This dedicated
database is used both to retrieve the live performance counters, through the
<em>stat extensions</em> installed in it, and to <strong>store them</strong>.</p>
<p><a href="/images/powa_4_local.svg"><img src="/images/powa_4_local.svg" alt="Local mode architecture" /></a></p>
<p>A <em><a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/configuration.html#background-worker-configuration">background
worker</a></em>
is started by PoWA, which is responsible for taking <em>snapshots</em> and
storing them in the dedicated powa database at regular intervals. Then, using
powa-web, you can see the activity of any of the <strong>local</strong>
databases by querying the data stored in the dedicated database, and
possibly by connecting to one of the other local databases when complete
data are needed, for instance when the index suggestion tool is used.</p>
<p>With version 4, the architecture of a remote setup changes
significantly:</p>
<p><a href="/images/powa_4_remote.svg"><img src="/images/powa_4_remote.svg" alt="Remote mode architecture" /></a></p>
<p>You can see that a dedicated powa database is still required, but
<strong>only for the stats extensions</strong>. The data is now stored on a
different instance. Also, the <em><a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/configuration.html#background-worker-configuration">background
worker</a></em>
is replaced by a <strong><a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">new collector
daemon</a></strong>,
which reads the performance metrics from the <em>remote servers</em> and
stores them on the dedicated <em>repository server</em>. Powa-web presents the
data by connecting to the <em>repository server</em>, as well as to the
<strong>remote servers</strong> when complete data is needed.</p>
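<p>To make the new data flow more concrete, here is a toy C model of it. It is a
sketch under stated assumptions, not powa-collector’s actual code: the
<code class="language-plaintext highlighter-rouge">remote_server</code> and
<code class="language-plaintext highlighter-rouge">repository</code> types and the
<code class="language-plaintext highlighter-rouge">collect_once()</code> function are
hypothetical, and a single cumulative counter stands in for all the metrics
exposed by the stats extensions.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a remote server: it only exposes the current
 * value of its cumulative counters, as the stats extensions do. */
typedef struct {
    const char *name;
    long        calls;          /* one cumulative counter, for simplicity */
} remote_server;

/* Hypothetical model of the dedicated repository server: it keeps one
 * stored snapshot per (server, collector pass). */
#define MAX_SNAPSHOTS 64

typedef struct {
    const char *server[MAX_SNAPSHOTS];
    long        calls[MAX_SNAPSHOTS];
    size_t      count;
} repository;

/* One collector pass: read the metrics of every remote server and store
 * them on the repository.  Nothing is ever written on a remote server. */
static void collect_once(const remote_server *servers, size_t nservers,
                         repository *repo)
{
    for (size_t i = 0; i < nservers; i++) {
        assert(repo->count < MAX_SNAPSHOTS);
        repo->server[repo->count] = servers[i].name;
        repo->calls[repo->count] = servers[i].calls;
        repo->count++;
    }
}
```

<p>The remote servers are only ever passed as
<code class="language-plaintext highlighter-rouge">const</code>: the collector reads
from them and writes exclusively to the repository, which is the property that
makes it possible to monitor a read-only hot-standby server.</p>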
<p>To sum up, with the new remote mode added in version 4:</p>
<ul>
<li>a PostgreSQL restart is no longer required to install
powa-archivist</li>
<li>there is no more overhead caused by storing and querying the data on the
same PostgreSQL servers as your production ones (some parts of the UI still
need to run queries on the original server, for instance to show plans with
EXPLAIN, but that overhead is negligible)</li>
<li>it is now possible to use PoWA on a <strong>hot-standby
server</strong></li>
</ul>
<p>The UI will therefore now greet you with an initial page to choose which of
the servers stored in the target database you want to work on:
<a href="/images/powa_4_all_servers.png"><img src="/images/powa_4_all_servers.png" alt="Server selection" /></a></p>
<p>The main reason it took so long to bring this remote mode is that it adds a
lot of complexity, requiring a major rewrite of PoWA. We also wanted to first
add other features, such as the <strong>global index suggestion</strong> with
<strong>validation using <a href="http://hypopg.readthedocs.io/">hypopg</a></strong>, introduced in
<a href="https://powa.readthedocs.io/en/latest/releases/v3.0.0.html">PoWA 3</a>.</p>
<h3 id="changements-dans-powa-web">Changes in <a href="https://powa.readthedocs.io/en/latest/components/powa-web/index.html">powa-web</a></h3>
<p>The <em>UI</em> is the component with the most visible changes in this
version 4. Here are the most important ones.</p>
<h5 id="compatibilité-avec-le-mode-distant">Remote mode compatibility</h5>
<p>The most important change is of course the support for the <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">new
remote mode</a>. As a
consequence, the first page displayed is now a <strong>server
selection</strong> page, listing all the registered <em>remote servers</em>.
After choosing the desired <em>remote server</em> (or the <em>local server</em>
if you don’t use the remote mode), all the other pages are similar to the ones
available up to version 3, but only show the data for that specific <em>remote
server</em>, retrieved of course from the <strong>repository database</strong>,
along with the new information described below.</p>
<p>Please note that since the data is now stored on a dedicated <em>repository
server</em> when the remote mode is used, most of the UI is usable without
connecting to the selected <em>remote server</em>. However, powa-web still
needs to be able to connect to the <em>remote server</em> when the original
data is needed (for instance for index suggestion, or to show plans with
<strong>EXPLAIN</strong>). The <a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">same authentication
considerations and
possibilities</a>
as for the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector
daemon</a>
(which will be described in a future article) apply here.</p>
<h5 id="pg_track_settings-support"><a href="https://github.com/rjuju/pg_track_settings/">pg_track_settings</a> support</h5>
<p>When this extension is properly configured, a new timeline widget appears
between each graph and its overview, showing the various kinds of recorded
changes, if any were detected over the selected time interval. On the
per-database and per-query pages, the list is also filtered according to the
selected database.</p>
<p>The same timeline is displayed on each graph of each page, making it easy to
check whether those changes had a visible impact on any of the graphs.</p>
<p>Note that the details of the changes are shown on mouse hover. You can also
click on any event of the timeline to freeze the display and draw a vertical
line on the associated graph.</p>
<p>Here is an example of such a configuration change in action:</p>
<p><a href="/images/pg_track_settings_powa4.png"><img src="/images/pg_track_settings_powa4.png" alt="Detected configuration changes" /></a></p>
<p>Also note that at least version 2.0.0 of
<a href="https://github.com/rjuju/pg_track_settings/">pg_track_settings</a> is
required, and that the extension must be installed <strong>on both the
<em>remote servers</em> and the <em>repository server</em>.</strong></p>
<h5 id="nouveaux-graphs-disponibles">New graphs available</h5>
<p>When
<a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_stat_kcache.html">pg_stat_kcache</a>
is configured, its information used to be displayed only on the per-query
page. It is now also displayed on the per-server and per-database pages, in
two graphs:</p>
<ul>
<li>in the <strong>Block Access</strong> graph, where the <strong>OS cache</strong> and <strong>disk
read</strong> metrics replace the <strong>read</strong> metric</li>
<li>in a new <strong>System Resources</strong> graph (which is also added to
the <em>per-query</em> page), showing the <a href="/postgresql/2018/07/17/pg_stat_kcache-2-1-is-out.html">metrics added in pg_stat_kcache
2.1</a></li>
</ul>
<p>Here is an example of this new <strong>System Resources</strong> graph:</p>
<p><a href="/images/pg_stat_kcache_system_resources_powa4.png"><img src="/images/pg_stat_kcache_system_resources_powa4.png" alt="System resources" /></a></p>
<p>There was also a <strong>Wait Events</strong> graph (available when the <a href="https://powa.readthedocs.io/en/v4/components/stats_extensions/pg_wait_sampling.html">pg_wait_sampling
extension</a>
is configured) that was only available on the per-query page. This graph is
now available on the per-server and per-database pages as well.</p>
<h5 id="documentation-des-métriques-et-liens-vers-la-documentation">Metrics documentation and links to the documentation</h5>
<p>Some metrics displayed in the UI are quite self-explanatory, but others can
be a bit obscure. Until now, there was unfortunately no documentation for the
metrics. This is now fixed: every graph has an <em>information icon</em> that
displays a description of the metrics used in the graph on mouse hover. Some
graphs also include a link to the <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">PoWA documentation
of the stats
extensions</a>
for users who want to learn more about them.</p>
<p>Here is an example:</p>
<p><a href="/images/powa_4_metrics_doc.png"><img src="/images/powa_4_metrics_doc.png" alt="Metrics documentation" /></a></p>
<h5 id="et-des-correctifs-de-bugs-divers">And various bug fixes</h5>
<p>Some long-standing issues have also been fixed:</p>
<ul>
<li>the tooltip showing the metric values when hovering over a graph had an
incorrect vertical position</li>
<li>time selection using the graph overview didn’t show a correct overview
after the selection was applied</li>
<li>errors when creating hypothetical indexes, and in some cases their
display, were not handled correctly on several pages</li>
<li>table filters were not reapplied when the selected time interval was
changed</li>
</ul>
<p>If any of these issues ever bothered you, you’ll be happy to learn that they
are now all fixed!</p>
<h3 id="conclusion">Conclusion</h3>
<p>This 4th version of PoWA represents a lot of development time, many
documentation improvements and a lot of testing. We are now quite satisfied
with it, but we may have missed some bugs. If you are interested in this
project, I hope you will try this beta, and if needed don’t hesitate to <a href="https://powa.readthedocs.io/en/latest/support.html#support">report a
bug</a>!</p>
<p><a href="https://rjuju.github.io/postgresqlfr/2019/05/17/powa-4-avec-mode-remote-disponible-en-beta.html">PoWA 4 brings a remote mode, available in beta!</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on May 17, 2019.</p>
https://rjuju.github.io/postgresqlfr/2019/04/18/nouveau-dans-pg12-statistiques-erreurs-checksums2019-04-18T11:02:26+00:002019-04-18T11:02:26+00:00Julien Rouhaudhttps://rjuju.github.io
<h3 id="data-checksums">Data checksums</h3>
<p>Added in <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=96ef3b8ff1c">PostgreSQL
9.3</a>,
<a href="https://www.postgresql.org/docs/current/app-initdb.html#APP-INITDB-DATA-CHECKSUMS">data
checksums</a>
can help detect data corruption happening on your storage.</p>
<p>Checksums are enabled if the instance was initialized using <code class="language-plaintext highlighter-rouge">initdb
--data-checksums</code> (which isn’t the default behavior), or if they were
enabled afterwards using the new
<a href="https://www.postgresql.org/docs/devel/app-pgchecksums.html">pg_checksums</a>
utility, also <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ed308d783790">added in PostgreSQL
12</a>.</p>
<p>When checksums are enabled, they are written each time a data block is
written to disk, and verified each time a block is read from disk (or from the
operating system cache). If the verification fails, an error is reported in
the logs. If the block was read by a client backend, the associated query will
of course fail, but if the block was read by a
<a href="https://www.postgresql.org/docs/current/protocol-replication.html#id-1.10.5.9.7.1.8.1.12">BASE_BACKUP</a>
operation (such as pg_basebackup), the command will continue running. Even
though data checksums only detect a subset of the possible problems, they are
still useful, especially if you don’t trust your storage.</p>
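<p>The write-then-verify cycle described above can be illustrated with a
deliberately simplified C sketch. It is <strong>not</strong> PostgreSQL’s
algorithm (the real one lives in
<code class="language-plaintext highlighter-rouge">src/include/storage/checksum_impl.h</code>
and stores a 16-bit value in each page header): this toy uses plain 32-bit
FNV-1a and keeps the full 32-bit value, and the
<code class="language-plaintext highlighter-rouge">toy_block</code> type and
<code class="language-plaintext highlighter-rouge">toy_*</code> functions are
hypothetical. Like PostgreSQL, it mixes the block number into the checksum, so
a block found at the wrong location is also detected.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BLCKSZ 8192

/* Hypothetical block: the stored checksum plus the data it protects. */
typedef struct {
    uint32_t checksum;
    uint8_t  data[TOY_BLCKSZ];
} toy_block;

/* 32-bit FNV-1a over the data, with the block number mixed into the
 * initial state.  NOT PostgreSQL's real algorithm. */
static uint32_t toy_checksum(const toy_block *b, uint32_t blkno)
{
    uint32_t h = 2166136261u ^ blkno;       /* FNV offset basis */

    for (size_t i = 0; i < TOY_BLCKSZ; i++) {
        h ^= b->data[i];
        h *= 16777619u;                     /* FNV prime */
    }
    return h;
}

/* On write: compute and store the checksum just before the block goes to disk. */
static void toy_write(toy_block *b, uint32_t blkno)
{
    b->checksum = toy_checksum(b, blkno);
}

/* On read: recompute and compare; 0 means a checksum validation failure. */
static int toy_verify(const toy_block *b, uint32_t blkno)
{
    return b->checksum == toy_checksum(b, blkno);
}
```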
<p>Up to PostgreSQL 11, checksum validation errors could only be found by
searching the logs, which clearly isn’t convenient if you want to monitor such
errors.</p>
<h3 id="nouveaux-compteurs-disponibles-dans-pg_stat_database">New counters available in pg_stat_database</h3>
<p>To make checksum errors easier to monitor, and to help users react as soon as
such a problem occurs, PostgreSQL 12 adds new counters in the
<code class="language-plaintext highlighter-rouge">pg_stat_database</code> view:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 6b9e875f7286d8535bff7955e5aa3602e188e436
Author: Magnus Hagander <[email protected]>
Date: Sat Mar 9 10:45:17 2019 -0800
Track block level checksum failures in pg_stat_database
This adds a column that counts how many checksum failures have occurred
on files belonging to a specific database. Both checksum failures
during normal backend processing and those created when a base backup
detects a checksum failure are counted.
Author: Magnus Hagander
Reviewed by: Julien Rouhaud
</code></pre></div></div>
<p> </p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 77bd49adba4711b4497e7e39a5ec3a9812cbd52a
Author: Magnus Hagander <[email protected]>
Date: Fri Apr 12 14:04:50 2019 +0200
Show shared object statistics in pg_stat_database
This adds a row to the pg_stat_database view with datoid 0 and datname
NULL for those objects that are not in a database. This was added
particularly for checksums, but we were already tracking more satistics
for these objects, just not returning it.
Also add a checksum_last_failure column that holds the timestamptz of
the last checksum failure that occurred in a database (or in a
non-dataabase file), if any.
Author: Julien Rouhaud <[email protected]>
</code></pre></div></div>
<p> </p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 252b707bc41cc9bf6c55c18d8cb302a6176b7e48
Author: Magnus Hagander <[email protected]>
Date: Wed Apr 17 13:51:48 2019 +0200
Return NULL for checksum failures if checksums are not enabled
Returning 0 could falsely indicate that there is no problem. NULL
correctly indicates that there is no information about potential
problems.
Also return 0 as numbackends instead of NULL for shared objects (as no
connection can be made to a shared object only).
Author: Julien Rouhaud <[email protected]>
Reviewed-by: Robert Treat <[email protected]>
</code></pre></div></div>
<p>These counters reflect checksum validation errors, per database, for both
client backends and
<a href="https://www.postgresql.org/docs/current/protocol-replication.html#id-1.10.5.9.7.1.8.1.12">BASE_BACKUP</a>
activity.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="err">\</span><span class="n">d</span> <span class="n">pg_stat_database</span>
<span class="k">View</span> <span class="nv">"pg_catalog.pg_stat_database"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span>
<span class="c1">-----------------------+--------------------------+-----------+----------+---------</span>
<span class="n">datid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="n">datname</span> <span class="o">|</span> <span class="n">name</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="p">[...]</span>
<span class="n">checksum_failures</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="n">checksum_last_failure</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="p">[...]</span>
<span class="n">stats_reset</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">checksum_failures</code> column shows a cumulative number of errors, and the
<code class="language-plaintext highlighter-rouge">checksum_last_failure</code> column shows the timestamp of the last checksum
validation failure on the database (NULL if no error ever happened).</p>
<p>To avoid any confusion (thanks to Robert Treat for pointing it out), these
two columns always return NULL if data checksums aren’t enabled, so that
nobody mistakenly believes that checksums are always verified successfully.</p>
<p>As a side effect, <code class="language-plaintext highlighter-rouge">pg_stat_database</code> now also shows the
statistics available for shared objects (such as the
<code class="language-plaintext highlighter-rouge">pg_database</code> table for instance), in a new row where <code class="language-plaintext highlighter-rouge">datid</code> is
<strong>0</strong> and <code class="language-plaintext highlighter-rouge">datname</code> is <strong>NULL</strong>.</p>
<p><del>A dedicated probe is also <a href="https://github.com/OPMDG/check_pgactivity/issues/226">already
planned</a> in
<a href="https://opm.readthedocs.io/probes/check_pgactivity.html">check_pgactivity</a>!</del>
A dedicated probe is also <a href="https://github.com/OPMDG/check_pgactivity/commit/0e8b516e95e4364470d4e205aebc9fe68bbcfd23">already
available</a>
in <a href="https://opm.readthedocs.io/probes/check_pgactivity.html">check_pgactivity</a>!</p>
<p><a href="https://rjuju.github.io/postgresqlfr/2019/04/18/nouveau-dans-pg12-statistiques-erreurs-checksums.html">New in pg12: Statistics on checksum errors</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 18, 2019.</p>
https://rjuju.github.io/postgresql/2019/04/18/new-in-pg12-statistics-checksums-errors2019-04-18T11:02:26+00:002019-04-18T11:02:26+00:00Julien Rouhaudhttps://rjuju.github.io
<h3 id="data-checksums">Data checksums</h3>
<p>Added in <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=96ef3b8ff1c">PostgreSQL
9.3</a>,
<a href="https://www.postgresql.org/docs/current/app-initdb.html#APP-INITDB-DATA-CHECKSUMS">data
checksums</a>
can help to detect data corruption happening on the storage side.</p>
<p>Checksums are only enabled if the instance was set up using <code class="language-plaintext highlighter-rouge">initdb
--data-checksums</code> (which isn’t the default behavior), or if activated
afterwards with the new
<a href="https://www.postgresql.org/docs/devel/app-pgchecksums.html">pg_checksums</a>
tool also <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ed308d783790">added in PostgreSQL
12</a>.</p>
<p>When enabled, checksums are written each time a block is written to disk, and
verified each time a block is read from disk (or from the operating system
cache). If the checksum verification fails, an error is reported in the logs.
If the block was read by a backend, the query will obviously fail, but if the
block was read by a
<a href="https://www.postgresql.org/docs/current/protocol-replication.html#id-1.10.5.9.7.1.8.1.12">BASE_BACKUP</a>
operation (such as pg_basebackup), the command will continue its processing.
While data checksums will only catch a subset of possible problems, they still
have some value, especially if you don’t trust your storage reliability.</p>
<p>Up to PostgreSQL 11, any checksum validation error could only be found by
looking into the logs, which clearly isn’t convenient if you want to monitor
such errors.</p>
<h3 id="new-counters-available-in-pg_stat_database">New counters available in pg_stat_database</h3>
<p>To make checksum errors easier to monitor, and help users to react as soon as
such a problem occurs, PostgreSQL 12 adds new counters in the
<code class="language-plaintext highlighter-rouge">pg_stat_database</code> view:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 6b9e875f7286d8535bff7955e5aa3602e188e436
Author: Magnus Hagander <[email protected]>
Date: Sat Mar 9 10:45:17 2019 -0800
Track block level checksum failures in pg_stat_database
This adds a column that counts how many checksum failures have occurred
on files belonging to a specific database. Both checksum failures
during normal backend processing and those created when a base backup
detects a checksum failure are counted.
Author: Magnus Hagander
Reviewed by: Julien Rouhaud
</code></pre></div></div>
<p> </p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 77bd49adba4711b4497e7e39a5ec3a9812cbd52a
Author: Magnus Hagander <[email protected]>
Date: Fri Apr 12 14:04:50 2019 +0200
Show shared object statistics in pg_stat_database
This adds a row to the pg_stat_database view with datoid 0 and datname
NULL for those objects that are not in a database. This was added
particularly for checksums, but we were already tracking more satistics
for these objects, just not returning it.
Also add a checksum_last_failure column that holds the timestamptz of
the last checksum failure that occurred in a database (or in a
non-dataabase file), if any.
Author: Julien Rouhaud <[email protected]>
</code></pre></div></div>
<p> </p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 252b707bc41cc9bf6c55c18d8cb302a6176b7e48
Author: Magnus Hagander <[email protected]>
Date: Wed Apr 17 13:51:48 2019 +0200
Return NULL for checksum failures if checksums are not enabled
Returning 0 could falsely indicate that there is no problem. NULL
correctly indicates that there is no information about potential
problems.
Also return 0 as numbackends instead of NULL for shared objects (as no
connection can be made to a shared object only).
Author: Julien Rouhaud <[email protected]>
Reviewed-by: Robert Treat <[email protected]>
</code></pre></div></div>
<p>Those counters will reflect checksum validation errors for both backend
activity and
<a href="https://www.postgresql.org/docs/current/protocol-replication.html#id-1.10.5.9.7.1.8.1.12">BASE_BACKUP</a>
activity, per database.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="err">\</span><span class="n">d</span> <span class="n">pg_stat_database</span>
<span class="k">View</span> <span class="nv">"pg_catalog.pg_stat_database"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span>
<span class="c1">-----------------------+--------------------------+-----------+----------+---------</span>
<span class="n">datid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="n">datname</span> <span class="o">|</span> <span class="n">name</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="p">[...]</span>
<span class="n">checksum_failures</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="n">checksum_last_failure</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span>
<span class="p">[...]</span>
<span class="n">stats_reset</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">checksum_failures</code> column will show a cumulative number of errors, and the
<code class="language-plaintext highlighter-rouge">checksum_last_failure</code> column will show the timestamp of the last checksum
failure on the database (NULL if no error ever happened).</p>
<p>To avoid any confusion (thanks to Robert Treat for pointing it out), those two
columns will always return NULL if data checksums aren’t enabled, so people
won’t mistakenly think that data checksums are always successfully verified.</p>
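<p>Because NULL means “no information” rather than “no problem”, a monitoring
probe has to handle it explicitly. Here is a hypothetical C sketch of such a
check, comparing two successive samples of
<code class="language-plaintext highlighter-rouge">checksum_failures</code> for the
same database. It is not check_pgactivity’s implementation: the
<code class="language-plaintext highlighter-rouge">checksum_sample</code> type and
<code class="language-plaintext highlighter-rouge">check_checksums()</code> function
are made up for illustration.</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical client-side copy of checksum_failures for one database;
 * is_null mirrors the SQL NULL returned when data checksums are disabled. */
typedef struct {
    bool      is_null;
    long long failures;
} checksum_sample;

typedef enum { CHK_OK, CHK_NEW_FAILURES, CHK_UNKNOWN } checksum_status;

/* Compare two successive samples of the same database.  NULL is treated as
 * "no information", never as "no problem". */
static checksum_status check_checksums(checksum_sample prev, checksum_sample cur)
{
    if (cur.is_null)
        return CHK_UNKNOWN;             /* checksums are not enabled */
    assert(cur.failures >= 0);
    if (prev.is_null)                   /* no usable previous sample */
        return cur.failures > 0 ? CHK_NEW_FAILURES : CHK_OK;
    /* The counter is cumulative, so any increase means new failures. */
    return cur.failures > prev.failures ? CHK_NEW_FAILURES : CHK_OK;
}
```

<p>A statistics reset (visible through
<code class="language-plaintext highlighter-rouge">stats_reset</code>) makes the
counter go down; this simplified version just reports that case as OK.</p>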
<p>As a side effect, <code class="language-plaintext highlighter-rouge">pg_stat_database</code> will also now show available statistics
for shared objects (such as the <code class="language-plaintext highlighter-rouge">pg_database</code> table for instance), in a new row
with <code class="language-plaintext highlighter-rouge">datid</code> set to <strong>0</strong>, and a <strong>NULL</strong> <code class="language-plaintext highlighter-rouge">datname</code>. Those were always
accumulated, but weren’t displayed in any system view until now.</p>
<p><del>A dedicated check is also <a href="https://github.com/OPMDG/check_pgactivity/issues/226">already
planned</a> in
<a href="https://opm.readthedocs.io/probes/check_pgactivity.html">check_pgactivity</a>!</del>
A dedicated check is also <a href="https://github.com/OPMDG/check_pgactivity/commit/0e8b516e95e4364470d4e205aebc9fe68bbcfd23">already
available</a>
in <a href="https://opm.readthedocs.io/probes/check_pgactivity.html">check_pgactivity</a>!</p>
<p><a href="https://rjuju.github.io/postgresql/2019/04/18/new-in-pg12-statistics-checksums-errors.html">New in pg12: Statistics on checksums errors</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 18, 2019.</p>
https://rjuju.github.io/postgresqlfr/2019/04/06/minimiser-le-surcout-de-stockage-par-ligne2019-04-06T07:51:28+00:002019-04-06T07:51:28+00:00Julien Rouhaudhttps://rjuju.github.io
<p>I regularly hear complaints about the amount of disk space PostgreSQL wastes
for each row it stores. I will try to show here a few tricks to minimize this
effect, in order to get more efficient storage.</p>
<h3 id="quel-surcoût-">What overhead?</h3>
<p>If you don’t have tables with more than a few hundred million rows, this is
probably not a problem for you.</p>
<p>For each row it stores, postgres keeps some additional data for its own
needs. This is <a href="https://www.postgresql.fr/docs/current/storage-page-layout.html#heaptupleheaderdata-table">documented
here</a>.
The documentation says:</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>Type</th>
<th>Length</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>t_xmin</td>
<td>TransactionId</td>
<td>4 bytes</td>
<td>insert XID stamp</td>
</tr>
<tr>
<td>t_xmax</td>
<td>TransactionId</td>
<td>4 bytes</td>
<td>delete XID stamp</td>
</tr>
<tr>
<td>t_cid</td>
<td>CommandId</td>
<td>4 bytes</td>
<td>insert and/or delete CID stamp (overlays with t_xvac)</td>
</tr>
<tr>
<td>t_xvac</td>
<td>TransactionId</td>
<td>4 bytes</td>
<td>XID for VACUUM operation moving a row version</td>
</tr>
<tr>
<td>t_ctid</td>
<td>ItemPointerData</td>
<td>6 bytes</td>
<td>current TID of this or newer row version</td>
</tr>
<tr>
<td>t_infomask2</td>
<td>uint16</td>
<td>2 bytes</td>
<td>number of attributes, plus various flag bits</td>
</tr>
<tr>
<td>t_infomask</td>
<td>uint16</td>
<td>2 bytes</td>
<td>various flag bits</td>
</tr>
<tr>
<td>t_hoff</td>
<td>uint8</td>
<td>1 byte</td>
<td>offset to user data</td>
</tr>
</tbody>
</table>
<p>That is <strong>23 bytes</strong> on most architectures (<strong>t_cid</strong> and
<strong>t_xvac</strong> overlay each other, so only one of them is stored).</p>
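<p>The 23-byte figure can be checked by adding up the field sizes from the table
above. This small C sketch (the function name is made up) sums the sizes
instead of declaring a struct, since a compiler would pad such a struct and
<code class="language-plaintext highlighter-rouge">sizeof</code> would not return
23.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of the heap tuple header fields listed in the table above. */
static size_t heap_tuple_header_size(void)
{
    size_t size = 0;

    size += sizeof(uint32_t);       /* t_xmin: TransactionId */
    size += sizeof(uint32_t);       /* t_xmax: TransactionId */
    size += sizeof(uint32_t);       /* t_cid or t_xvac, sharing the same space */
    size += 3 * sizeof(uint16_t);   /* t_ctid: ItemPointerData (2 x uint16 block number + uint16 offset) */
    size += sizeof(uint16_t);       /* t_infomask2 */
    size += sizeof(uint16_t);       /* t_infomask */
    size += sizeof(uint8_t);        /* t_hoff */

    return size;
}
```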
<p>You can also see some of these fields through the hidden columns present in
every table, either by adding them to the SELECT list of a query, or by
looking for negative attribute numbers in the <strong>pg_attribute</strong>
catalog:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="err">\</span><span class="n">d</span> <span class="n">test</span>
<span class="k">Table</span> <span class="nv">"public.test"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="n">Modifiers</span>
<span class="c1">--------+---------+-----------</span>
<span class="n">id</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span>
<span class="o">#</span> <span class="k">SELECT</span> <span class="n">xmin</span><span class="p">,</span> <span class="n">xmax</span><span class="p">,</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">test</span> <span class="k">LIMIT</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">xmin</span> <span class="o">|</span> <span class="n">xmax</span> <span class="o">|</span> <span class="n">id</span>
<span class="c1">------+------+----</span>
<span class="mi">1361</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="mi">1</span>
<span class="o">#</span> <span class="k">SELECT</span> <span class="n">attname</span><span class="p">,</span> <span class="n">attnum</span><span class="p">,</span> <span class="n">atttypid</span><span class="p">::</span><span class="n">regtype</span><span class="p">,</span> <span class="n">attlen</span>
<span class="k">FROM</span> <span class="n">pg_class</span> <span class="k">c</span>
<span class="k">JOIN</span> <span class="n">pg_attribute</span> <span class="n">a</span> <span class="k">ON</span> <span class="n">a</span><span class="p">.</span><span class="n">attrelid</span> <span class="o">=</span> <span class="k">c</span><span class="p">.</span><span class="n">oid</span>
<span class="k">WHERE</span> <span class="n">relname</span> <span class="o">=</span> <span class="s1">'test'</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">attnum</span><span class="p">;</span>
<span class="n">attname</span> <span class="o">|</span> <span class="n">attnum</span> <span class="o">|</span> <span class="n">atttypid</span> <span class="o">|</span> <span class="n">attlen</span>
<span class="c1">----------+--------+----------+--------</span>
<span class="n">tableoid</span> <span class="o">|</span> <span class="o">-</span><span class="mi">7</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="mi">4</span>
<span class="n">cmax</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6</span> <span class="o">|</span> <span class="n">cid</span> <span class="o">|</span> <span class="mi">4</span>
<span class="n">xmax</span> <span class="o">|</span> <span class="o">-</span><span class="mi">5</span> <span class="o">|</span> <span class="n">xid</span> <span class="o">|</span> <span class="mi">4</span>
<span class="n">cmin</span> <span class="o">|</span> <span class="o">-</span><span class="mi">4</span> <span class="o">|</span> <span class="n">cid</span> <span class="o">|</span> <span class="mi">4</span>
<span class="n">xmin</span> <span class="o">|</span> <span class="o">-</span><span class="mi">3</span> <span class="o">|</span> <span class="n">xid</span> <span class="o">|</span> <span class="mi">4</span>
<span class="n">ctid</span> <span class="o">|</span> <span class="o">-</span><span class="mi">1</span> <span class="o">|</span> <span class="n">tid</span> <span class="o">|</span> <span class="mi">6</span>
<span class="n">id</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="mi">4</span></code></pre></figure>
<p>If you compare these fields with the previous table, you can see that not all
of these columns are actually stored on disk. Obviously, PostgreSQL does not
store the table’s oid in every row: it is added afterwards, when the row is
built.</p>
<p>If you want more technical details, you can look at
<a href="http://doxygen.postgresql.org/htup__details_8h.html">htup_details.h</a>,
starting with the
<a href="http://doxygen.postgresql.org/structHeapTupleHeaderData.html">HeapTupleHeaderData struct</a>.</p>
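<p>To make the 23-byte figure concrete, here is a simplified C sketch of the fixed part of the heap tuple header. It is <em>not</em> PostgreSQL’s actual definition: the type names <code class="language-plaintext highlighter-rouge">SketchHeapTupleHeader</code> and <code class="language-plaintext highlighter-rouge">SketchItemPointer</code> are hypothetical, and the unions found in <code class="language-plaintext highlighter-rouge">htup_details.h</code> are flattened into plain fields. It also shows why <code class="language-plaintext highlighter-rouge">cmin</code> and <code class="language-plaintext highlighter-rouge">cmax</code> appear as two columns although a single command id field is stored, and that the table oid has no field at all.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: the row's ctid. */
typedef struct
{
    uint16_t bi_hi;    /* block number, high 16 bits */
    uint16_t bi_lo;    /* block number, low 16 bits */
    uint16_t ip_posid; /* line pointer number inside the block */
} SketchItemPointer;   /* 6 bytes */

/* Hypothetical, flattened version of the fixed part of
 * HeapTupleHeaderData. There is no tableoid field: the table oid
 * is not stored in the row. */
typedef struct
{
    uint32_t t_xmin;          /* xmin: inserting transaction */
    uint32_t t_xmax;          /* xmax: deleting or locking transaction */
    uint32_t t_cid;           /* one command id backing both cmin and cmax */
    SketchItemPointer t_ctid; /* ctid */
    uint16_t t_infomask2;     /* number of attributes, plus flag bits */
    uint16_t t_infomask;      /* flag bits */
    uint8_t  t_hoff;          /* offset to the user data */
    uint8_t  t_bits[];        /* NULL bitmap, only used when needed */
} SketchHeapTupleHeader;

/* Fixed header size: everything before the NULL bitmap. */
static size_t sketch_header_size(void)
{
    return offsetof(SketchHeapTupleHeader, t_bits);
}

/* Compile-time checks of the layout described in the text. */
static_assert(sizeof(SketchItemPointer) == 6, "ctid takes 6 bytes");
static_assert(offsetof(SketchHeapTupleHeader, t_bits) == 23,
              "fixed header takes 23 bytes");
```

The fields are laid out so that natural alignment alone puts the NULL bitmap at offset 23, which is why no packing attribute is needed here.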
<h3 id="combien-est-ce-que-ça-coûte-">How much does it cost?</h3>
<p>Since this overhead is fixed, the bigger the rows get, the more negligible it
becomes. If you only store a single integer column (<strong>4
bytes</strong>), each row will need:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="mi">23</span><span class="n">B</span> <span class="o">+</span> <span class="mi">4</span><span class="n">B</span> <span class="o">=</span> <span class="mi">27</span><span class="n">B</span></code></pre></figure>
<p>that is, <strong>85% overhead</strong> (23 of the 27 bytes are header), which is pretty horrible.</p>
<p>On the other hand, if you store 5 integer, 3 bigint and 2 text columns (say
about 80 bytes each on average), you get:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="mi">23</span><span class="n">B</span> <span class="o">+</span> <span class="mi">5</span><span class="o">*</span><span class="mi">4</span><span class="n">B</span> <span class="o">+</span> <span class="mi">3</span><span class="o">*</span><span class="mi">8</span><span class="n">B</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="mi">80</span><span class="n">B</span> <span class="o">=</span> <span class="mi">227</span><span class="n">B</span></code></pre></figure>
<p>That’s “only” <strong>10% overhead</strong>.</p>
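<p>The percentages above are the header’s share of the total row size, that is, header bytes divided by header plus data bytes. Here is a small C sketch of that arithmetic; like the figures above, it deliberately ignores alignment, line pointers and page headers, and the function names are hypothetical:</p>

```c
#include <assert.h>

#define TUPLE_HEADER_BYTES 23 /* fixed heap tuple header, as above */

/* Percentage of a row taken by the header, for a given amount of
 * user data. Alignment and per-page costs are ignored. */
static double header_overhead_pct(int data_bytes)
{
    assert(data_bytes >= 0);
    return 100.0 * TUPLE_HEADER_BYTES / (TUPLE_HEADER_BYTES + data_bytes);
}

/* User data of the wider example: 5 integers, 3 bigints and two text
 * values of about 80 bytes each. */
static int wide_row_data_bytes(void)
{
    return 5 * 4 + 3 * 8 + 2 * 80; /* 204 bytes */
}
```

Here, <code class="language-plaintext highlighter-rouge">header_overhead_pct(4)</code> is about 85.2 and <code class="language-plaintext highlighter-rouge">header_overhead_pct(wide_row_data_bytes())</code> is about 10.1.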
<h3 id="et-donc-comment-minimiser-ce-surcoût">So, how do we minimize this overhead?</h3>
<p>The idea is to store the same data, but in fewer rows. How? By aggregating
the data into arrays. The more records you put in a single array, the lower the
overhead. And if you aggregate enough data, you can benefit from fully
transparent compression thanks to the <a href="https://www.postgresql.fr/docs/current/storage-toast.html">TOAST
mechanism</a>.</p>
<p>Let’s see what this gives with a single-column table holding 10 million
rows:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">raw_1</span> <span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">);</span>
<span class="o">#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">raw_1</span> <span class="k">SELECT</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10000000</span><span class="p">);</span>
<span class="o">#</span> <span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">raw_1</span> <span class="p">(</span><span class="n">id</span><span class="p">);</span></code></pre></figure>
<p>The user data alone should only need 10M * 4 bytes, i.e. about
<strong>38 MB</strong>, yet this table weighs <strong>348 MB</strong>. Inserting the data
takes about <strong>23 seconds</strong>.</p>
<p class="notice"><strong>NOTE:</strong> If you do the math, you’ll find that the overhead is a bit
more than <strong>32 bytes</strong> per row, not <strong>23 bytes</strong>. That’s because each
data block also has its own overhead, each row needs a 4-byte line pointer in
its block, and NULL handling and alignment constraints add some more. If you
want more details on this topic, I recommend looking at <a href="https://github.com/dhyannataraj/tuple-internals-presentation">this
presentation</a>.</p>
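<p>To see where the “bit more than 32 bytes” comes from, here is a rough C sketch of the on-disk size of a table without NULLs. It assumes the default 8 kB block size, a 24-byte page header, a 4-byte line pointer per row and 8-byte maximum alignment (typical of 64-bit platforms), and it ignores fillfactor, the free space map and visibility map forks, and TOAST. These constants and the function names are assumptions for illustration, not values read from a running server.</p>

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE        8192 /* default BLCKSZ */
#define PAGE_HEADER_SIZE    24 /* page header at the start of each block */
#define LINE_POINTER_SIZE    4 /* one line pointer per row */
#define TUPLE_HEADER_SIZE   23 /* fixed heap tuple header */
#define MAX_ALIGN            8 /* assumed 8-byte maximum alignment */

static uint64_t align_up(uint64_t len, uint64_t alignment)
{
    return (len + alignment - 1) / alignment * alignment;
}

/* Bytes used in a block by one row whose user data takes data_len
 * bytes, padding between columns included (no NULLs, so no bitmap). */
static uint64_t row_footprint(uint64_t data_len)
{
    uint64_t hoff = align_up(TUPLE_HEADER_SIZE, MAX_ALIGN); /* 23 -> 24 */

    return align_up(hoff + data_len, MAX_ALIGN) + LINE_POINTER_SIZE;
}

static uint64_t rows_per_block(uint64_t data_len)
{
    return (BLOCK_SIZE - PAGE_HEADER_SIZE) / row_footprint(data_len);
}

/* Estimated size of the table's main data file, in bytes. */
static uint64_t heap_size_bytes(uint64_t nrows, uint64_t data_len)
{
    uint64_t per_block = rows_per_block(data_len);

    assert(per_block > 0);
    return (nrows + per_block - 1) / per_block * BLOCK_SIZE;
}
```

For <code class="language-plaintext highlighter-rouge">raw_1</code>, each row takes 32 bytes (the header rounded up to 24 bytes, plus 4 bytes of data, rounded up to 8) plus its 4-byte line pointer, so 226 rows fit in a block and <code class="language-plaintext highlighter-rouge">heap_size_bytes(10000000, 4)</code> returns 362,479,616 bytes, roughly 346 MB, in the same ballpark as the size measured above.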
<p>Let’s now compare this with the aggregated version of the same data:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">agg_1</span> <span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">[]);</span>
<span class="o">#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">agg_1</span> <span class="k">SELECT</span> <span class="n">array_agg</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10000000</span><span class="p">)</span> <span class="n">i</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">2000000</span><span class="p">;</span>
<span class="o">#</span> <span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">agg_1</span> <span class="p">(</span><span class="n">id</span><span class="p">);</span></code></pre></figure>
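<p>This query inserts 5 elements per row (10 million values grouped by
<code class="language-plaintext highlighter-rouge">i % 2000000</code>, i.e. 2 million rows). I ran the same test with 20,
100, 200 and 1000 elements per row. Here are the results:</p>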
<p><a href="/images/tuple_overhead_1.svg"><img src="/images/tuple_overhead_1.svg" alt="Benchmark 1" /></a></p>
<p class="notice"><strong>NOTE:</strong> The size for 1000 elements per row is a bit larger than
for the previous value. That’s because it’s the only case where the rows are
large enough to be TOASTed, but not large enough to be compressed. This shows a
bit of TOAST overhead.</p>
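<p>The same back-of-the-envelope model can be applied to the aggregated tables. The sketch below assumes a one-dimensional <code class="language-plaintext highlighter-rouge">integer[]</code> without NULLs, whose datum is a 24-byte array header (varlena length, number of dimensions, data offset, element type, one dimension and one lower bound) followed by 4 bytes per element. It ignores the 1-byte short varlena header PostgreSQL can use for small values (with 8-byte row alignment this makes no difference for the 5 and 20 element cases), and it only holds for rows that are not TOASTed, so the 1000-element case is out of scope. The constants and function names are, again, assumptions for illustration.</p>

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE        8192 /* default BLCKSZ */
#define PAGE_HEADER_SIZE    24 /* page header at the start of each block */
#define LINE_POINTER_SIZE    4 /* one line pointer per row */
#define TUPLE_HEADER_SIZE   23 /* fixed heap tuple header */
#define MAX_ALIGN            8 /* assumed 8-byte maximum alignment */
#define ARRAY_1D_HEADER     24 /* header of a 1-D array without NULLs */

static uint64_t align_up(uint64_t len, uint64_t alignment)
{
    return (len + alignment - 1) / alignment * alignment;
}

/* Size of a 1-D integer[] datum holding nelems values, no NULLs. */
static uint64_t int4_array_bytes(uint64_t nelems)
{
    return ARRAY_1D_HEADER + 4 * nelems;
}

/* Estimated size of a table storing total_values integers, per_row of
 * them in each row's array. Only valid while rows are not TOASTed. */
static uint64_t agg_table_bytes(uint64_t total_values, uint64_t per_row)
{
    assert(per_row > 0 && total_values % per_row == 0);

    uint64_t nrows = total_values / per_row;
    uint64_t hoff = align_up(TUPLE_HEADER_SIZE, MAX_ALIGN);
    uint64_t tuple = align_up(hoff + int4_array_bytes(per_row), MAX_ALIGN);
    uint64_t per_block = (BLOCK_SIZE - PAGE_HEADER_SIZE)
                         / (tuple + LINE_POINTER_SIZE);

    assert(per_block > 0);
    return (nrows + per_block - 1) / per_block * BLOCK_SIZE;
}
```

With 10 million values, this estimate gives about 146 MB for 5 elements per row (2 million rows, 107 per block), about 64 MB for 20, and about 43 MB for both 100 and 200 elements per row, which can be compared with the chart above.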
<p>So far so good: we see pretty good improvements in both size and insertion
time, even with the smallest arrays. Let’s now look at the impact on reading
rows. I’ll test fetching all the rows, as well as a single row through an
index scan (for the tests, I used EXPLAIN ANALYZE to minimize the time psql
spends displaying the data):</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="k">SELECT</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">raw_1</span><span class="p">;</span>
<span class="o">#</span> <span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">raw_1</span> <span class="p">(</span><span class="n">id</span><span class="p">);</span>
<span class="o">#</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">raw_1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">500</span><span class="p">;</span></code></pre></figure>
<p>To index the array properly, we need a GIN index. To fetch all the aggregated
values, we need to call unnest() on the array, and fetching a single record
requires being a bit more creative:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="k">SELECT</span> <span class="k">unnest</span><span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="k">AS</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">agg_1</span><span class="p">;</span>
<span class="o">#</span> <span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">agg_1</span> <span class="k">USING</span> <span class="n">gin</span> <span class="p">(</span><span class="n">id</span><span class="p">);</span>
<span class="o">#</span> <span class="k">WITH</span> <span class="n">s</span><span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="k">unnest</span><span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">agg_1</span>
<span class="k">WHERE</span> <span class="n">id</span> <span class="o">&&</span> <span class="n">array</span><span class="p">[</span><span class="mi">500</span><span class="p">]</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">s</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">500</span><span class="p">;</span></code></pre></figure>
<p>Here is the chart comparing index creation time and index size, for each
array size:</p>
<p><a href="/images/tuple_overhead_2.svg"><img src="/images/tuple_overhead_2.svg" alt="Benchmark 2" /></a></p>
<p>The GIN index is a bit more than twice as large as the btree index, and if you
add up the table size and the index size, the total is almost the same with or
without aggregation. This isn’t a big problem, since this example is very
naive, and we’ll see right after how to avoid needing a GIN index so that the
total size stays low. Also, the GIN index is much slower to create, which means
that INSERTs will be slower too.</p>
<p>Here is the chart comparing the time needed to fetch all the rows, and a
single row:</p>
<p><a href="/images/tuple_overhead_3.svg"><img src="/images/tuple_overhead_3.svg" alt="Benchmark 3" /></a></p>
<p>Fetching all the rows is probably not a very interesting example, but it’s
worth noting that once the arrays contain enough elements, it becomes more
efficient than doing the same thing with the original table. We also see that
fetching a single element is much faster than with the btree index, thanks to
GIN’s efficiency. It’s not tested here, but since only btree indexes are
natively ordered, if you need to fetch a large number of records in sorted
order, using a GIN index will require an extra sort, which will be much slower
than a simple btree index scan.</p>
<h3 id="un-exemple-plus-réaliste">A more realistic example</h3>
<p>Now that we’ve seen the basics, let’s see how to go a bit further:
aggregating more than one column, and avoiding the disk space (and write
slowdowns) that a GIN index costs. To do that, I’ll present how
<a href="https://powa.readthedocs.io/">PoWA</a> stores its data.</p>
<p>For each collected data source, two tables are used: one for the
<strong>historical, aggregated data</strong>, and one for the <strong>current
data</strong>. These tables store the data in a custom data type rather than in
plain columns. Let’s look at the tables related to the
<strong>pg_stat_statements</strong> extension:</p>
<p>The data type, which holds roughly all the counters available in
pg_stat_statements, along with the timestamp of the record:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">powa</span><span class="o">=#</span> <span class="err">\</span><span class="n">d</span> <span class="n">powa_statements_history_record</span>
<span class="n">Composite</span> <span class="k">type</span> <span class="nv">"public.powa_statements_history_record"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="n">Modifiers</span>
<span class="c1">---------------------+--------------------------+-----------</span>
<span class="n">ts</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span>
<span class="n">calls</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">total_time</span> <span class="o">|</span> <span class="nb">double</span> <span class="nb">precision</span> <span class="o">|</span>
<span class="k">rows</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">shared_blks_hit</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">shared_blks_read</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">shared_blks_dirtied</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">shared_blks_written</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">local_blks_hit</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">local_blks_read</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">local_blks_dirtied</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">local_blks_written</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">temp_blks_read</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">temp_blks_written</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span>
<span class="n">blk_read_time</span> <span class="o">|</span> <span class="nb">double</span> <span class="nb">precision</span> <span class="o">|</span>
<span class="n">blk_write_time</span> <span class="o">|</span> <span class="nb">double</span> <span class="nb">precision</span> <span class="o">|</span></code></pre></figure>
<p>The table for current data stores the pg_stat_statements unique identifier
(queryid, dbid, userid), along with one record of counters:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">powa</span><span class="o">=#</span> <span class="err">\</span><span class="n">d</span> <span class="n">powa_statements_history_current</span>
<span class="k">Table</span> <span class="nv">"public.powa_statements_history_current"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="n">Modifiers</span>
<span class="c1">---------+--------------------------------+-----------</span>
<span class="n">queryid</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span>
<span class="n">dbid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span>
<span class="n">userid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span>
<span class="n">record</span> <span class="o">|</span> <span class="n">powa_statements_history_record</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span></code></pre></figure>
<p>The table for aggregated data contains the same unique identifier, an array
of records, and a few special fields:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">powa</span><span class="o">=#</span> <span class="err">\</span><span class="n">d</span> <span class="n">powa_statements_history</span>
<span class="k">Table</span> <span class="nv">"public.powa_statements_history"</span>
<span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="n">Modifiers</span>
<span class="c1">----------------+----------------------------------+-----------</span>
<span class="n">queryid</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span>
<span class="n">dbid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span>
<span class="n">userid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span>
<span class="n">coalesce_range</span> <span class="o">|</span> <span class="n">tstzrange</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span>
<span class="n">records</span> <span class="o">|</span> <span class="n">powa_statements_history_record</span><span class="p">[]</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span>
<span class="n">mins_in_range</span> <span class="o">|</span> <span class="n">powa_statements_history_record</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span>
<span class="n">maxs_in_range</span> <span class="o">|</span> <span class="n">powa_statements_history_record</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span>
<span class="n">Indexes</span><span class="p">:</span>
<span class="nv">"powa_statements_history_query_ts"</span> <span class="n">gist</span> <span class="p">(</span><span class="n">queryid</span><span class="p">,</span> <span class="n">coalesce_range</span><span class="p">)</span></code></pre></figure>
<p>We also store the timestamp range (<em>coalesce_range</em>) covering all the
counters aggregated in the row, as well as the minimum and maximum values of
each counter in two dedicated records. These extra fields don’t use much
space, and they enable very efficient indexing and processing, tailored to the
access patterns of the application that reads the data.</p>
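<p>As an illustration of how such a summary can be built when a batch of current records is coalesced into one row, here is a minimal C sketch. It is not PoWA’s actual implementation: the <code class="language-plaintext highlighter-rouge">StmtRecord</code> and <code class="language-plaintext highlighter-rouge">RowSummary</code> types are hypothetical and keep only a timestamp and two counters, and each counter’s minimum and maximum are tracked independently, as described above.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down powa_statements_history_record. */
typedef struct
{
    int64_t ts;    /* snapshot timestamp, e.g. in seconds */
    int64_t calls; /* cumulative counter */
    int64_t rows;  /* cumulative counter */
} StmtRecord;

/* Hypothetical summary stored next to the records array. */
typedef struct
{
    int64_t range_lo, range_hi; /* coalesce_range bounds */
    StmtRecord mins;            /* mins_in_range */
    StmtRecord maxs;            /* maxs_in_range */
} RowSummary;

static int64_t min64(int64_t a, int64_t b) { return a < b ? a : b; }
static int64_t max64(int64_t a, int64_t b) { return a > b ? a : b; }

/* Compute the summary of the n records aggregated into one row. */
static RowSummary summarize(const StmtRecord *recs, size_t n)
{
    assert(n > 0);
    RowSummary s = { recs[0].ts, recs[0].ts, recs[0], recs[0] };

    for (size_t i = 1; i < n; i++)
    {
        const StmtRecord *r = &recs[i];

        s.range_lo = min64(s.range_lo, r->ts);
        s.range_hi = max64(s.range_hi, r->ts);
        /* Each field is handled on its own, so mins and maxs may mix
         * values coming from different snapshots. */
        s.mins.ts = min64(s.mins.ts, r->ts);
        s.maxs.ts = max64(s.maxs.ts, r->ts);
        s.mins.calls = min64(s.mins.calls, r->calls);
        s.maxs.calls = max64(s.maxs.calls, r->calls);
        s.mins.rows = min64(s.mins.rows, r->rows);
        s.maxs.rows = max64(s.maxs.rows, r->rows);
    }
    return s;
}
```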
<p>This table is used to find out how many resources a query consumed over a
given time interval. The GiST index won’t be very big, since it only indexes
two small values (queryid and coalesce_range) per row, that is, per batch of
aggregated records, and it will find the rows matching a given query and time
interval very efficiently.</p>
<p>Computing the consumed resources can then be done very efficiently, since
pg_stat_statements counters are monotonic (they never decrease, barring a
reset). The algorithm could be:</p>
<ul>
<li>if the row’s time range is entirely contained in the requested time
range, we only need to compute the delta from the row’s summary records:
<strong>maxs_in_range.counter - mins_in_range.counter</strong></li>
<li>otherwise (which can only happen for at most two rows per queryid, the ones
at each end of the requested interval), we unnest the array, filter out the
records that fall outside the requested time range, keep the first and last
values, and compute, for each counter, the maximum minus the minimum.</li>
</ul>
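<p>The algorithm above can be sketched in C for a single counter. The <code class="language-plaintext highlighter-rouge">Rec</code>, <code class="language-plaintext highlighter-rouge">AggRow</code> and <code class="language-plaintext highlighter-rouge">Bounds</code> types and the function names are hypothetical, and the interval bounds are treated as inclusive. Rather than summing per-row deltas, which would miss the growth between the last record of one row and the first record of the next, this sketch keeps the overall minimum and maximum across all matching rows and subtracts them once, which relies on the counter never decreasing.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a timestamp and one cumulative counter. */
typedef struct
{
    int64_t ts;
    int64_t calls;
} Rec;

/* Hypothetical aggregated row, mirroring powa_statements_history. */
typedef struct
{
    int64_t range_lo, range_hi; /* coalesce_range, inclusive bounds */
    const Rec *records;         /* the records array */
    size_t nrecords;
    Rec mins;                   /* mins_in_range */
    Rec maxs;                   /* maxs_in_range */
} AggRow;

/* Running minimum and maximum of the counter values seen so far. */
typedef struct
{
    int found;
    int64_t min, max;
} Bounds;

static void widen(Bounds *b, int64_t v)
{
    if (!b->found || v < b->min)
        b->min = v;
    if (!b->found || v > b->max)
        b->max = v;
    b->found = 1;
}

/* Calls executed within [lo, hi], assuming the counter never decreases. */
static int64_t calls_in_interval(const AggRow *rows, size_t nrows,
                                 int64_t lo, int64_t hi)
{
    Bounds b = { 0, 0, 0 };

    assert(lo <= hi);
    for (size_t i = 0; i < nrows; i++)
    {
        const AggRow *row = &rows[i];

        /* No overlap: the GiST index would not return this row. */
        if (row->range_hi < lo || row->range_lo > hi)
            continue;

        /* Row fully inside the interval: its summary is enough. */
        if (row->range_lo >= lo && row->range_hi <= hi)
        {
            widen(&b, row->mins.calls);
            widen(&b, row->maxs.calls);
            continue;
        }

        /* Partial overlap: "unnest" and keep the records in range. */
        for (size_t j = 0; j < row->nrecords; j++)
        {
            const Rec *r = &row->records[j];

            if (r->ts >= lo && r->ts <= hi)
                widen(&b, r->calls);
        }
    }
    return b.found ? b.max - b.min : 0;
}
```

Only the rows at the edges of the interval have their records scanned; every other matching row contributes just its two summary values.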
<p class="notice"><strong>NOTE:</strong> In practice, the PoWA user interface always unnests all
the records contained in the requested time interval, since it is designed to
show how these counters evolve over a relatively short time interval, but with
high precision. Fortunately, unnesting arrays isn’t that expensive, especially
compared to the disk space saved.</p>
<p>And here is the space needed for the aggregated and non-aggregated values.
For this test, I let PoWA generate <strong>12,331,366 records</strong> (configuring
a snapshot every 5 seconds for a few hours, with the default aggregation of 100
records per row), and created a btree index on (queryid, ((record).ts)) to
simulate the index present on the aggregated tables:</p>
<p><a href="/images/tuple_overhead_4.svg"><img src="/images/tuple_overhead_4.svg" alt="Benchmark 4" /></a></p>
<p>Pretty efficient, don’t you think?</p>
<h3 id="limitations">Limitations</h3>
<p>There are some limitations to aggregating records. If you do this, you can no
longer enforce constraints such as foreign keys or unique constraints. This
approach is therefore meant for non-relational data, such as counters or
metadata.</p>
<h3 id="bonus">Bonus</h3>
<p>Using custom data types lets you do nice things, such as defining
<strong>custom operators</strong>. For instance, PoWA 3.1.0 provides two
operators for each of the custom data types it defines:</p>
<ul>
<li>the <strong>-</strong> operator, to get the difference between two records</li>
<li>the <strong>/</strong> operator, to get the difference <em>per second</em></li>
</ul>
<p>This makes queries like these very easy to write:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="k">SELECT</span> <span class="p">(</span><span class="n">record</span> <span class="o">-</span> <span class="n">lag</span><span class="p">(</span><span class="n">record</span><span class="p">)</span> <span class="n">over</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="p">(</span><span class="n">record</span><span class="p">).</span><span class="n">ts</span><span class="p">)).</span><span class="o">*</span>
<span class="k">FROM</span> <span class="n">powa_statements_history_current</span>
<span class="k">WHERE</span> <span class="n">queryid</span> <span class="o">=</span> <span class="mi">3589441560</span> <span class="k">AND</span> <span class="n">dbid</span> <span class="o">=</span> <span class="mi">16384</span><span class="p">;</span>
<span class="n">intvl</span> <span class="o">|</span> <span class="n">calls</span> <span class="o">|</span> <span class="n">total_time</span> <span class="o">|</span> <span class="k">rows</span> <span class="o">|</span> <span class="p">...</span>
<span class="c1">-----------------+--------+------------------+--------+ ...</span>
<span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="p">...</span>
<span class="mi">00</span><span class="p">:</span><span class="mi">00</span><span class="p">:</span><span class="mi">05</span><span class="p">.</span><span class="mi">004611</span> <span class="o">|</span> <span class="mi">5753</span> <span class="o">|</span> <span class="mi">20</span><span class="p">.</span><span class="mi">5570000000005</span> <span class="o">|</span> <span class="mi">5753</span> <span class="o">|</span> <span class="p">...</span>
<span class="mi">00</span><span class="p">:</span><span class="mi">00</span><span class="p">:</span><span class="mi">05</span><span class="p">.</span><span class="mi">004569</span> <span class="o">|</span> <span class="mi">1879</span> <span class="o">|</span> <span class="mi">6</span><span class="p">.</span><span class="mi">40500000000047</span> <span class="o">|</span> <span class="mi">1879</span> <span class="o">|</span> <span class="p">...</span>
<span class="mi">00</span><span class="p">:</span><span class="mi">00</span><span class="p">:</span><span class="mi">05</span><span class="p">.</span><span class="mi">00477</span> <span class="o">|</span> <span class="mi">14369</span> <span class="o">|</span> <span class="mi">48</span><span class="p">.</span><span class="mi">9060000000006</span> <span class="o">|</span> <span class="mi">14369</span> <span class="o">|</span> <span class="p">...</span>
<span class="mi">00</span><span class="p">:</span><span class="mi">00</span><span class="p">:</span><span class="mi">05</span><span class="p">.</span><span class="mi">00418</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="p">...</span>
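<span class="o">#</span> <span class="k">SELECT</span> <span class="p">(</span><span class="n">record</span> <span class="o">/</span> <span class="n">lag</span><span class="p">(</span><span class="n">record</span><span class="p">)</span> <span class="n">over</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="p">(</span><span class="n">record</span><span class="p">).</span><span class="n">ts</span><span class="p">)).</span><span class="o">*</span>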
<span class="k">FROM</span> <span class="n">powa_statements_history_current</span>
<span class="k">WHERE</span> <span class="n">queryid</span> <span class="o">=</span> <span class="mi">3589441560</span> <span class="k">AND</span> <span class="n">dbid</span> <span class="o">=</span> <span class="mi">16384</span><span class="p">;</span>
<span class="n">sec</span> <span class="o">|</span> <span class="n">calls_per_sec</span> <span class="o">|</span> <span class="n">runtime_per_sec</span> <span class="o">|</span> <span class="n">rows_per_sec</span> <span class="o">|</span> <span class="p">...</span>
<span class="c1">--------+---------------+------------------+--------------+ ...</span>
<span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="p">...</span>
<span class="mi">5</span> <span class="o">|</span> <span class="mi">1150</span><span class="p">.</span><span class="mi">6</span> <span class="o">|</span> <span class="mi">4</span><span class="p">.</span><span class="mi">1114000000001</span> <span class="o">|</span> <span class="mi">1150</span><span class="p">.</span><span class="mi">6</span> <span class="o">|</span> <span class="p">...</span>
<span class="mi">5</span> <span class="o">|</span> <span class="mi">375</span><span class="p">.</span><span class="mi">8</span> <span class="o">|</span> <span class="mi">1</span><span class="p">.</span><span class="mi">28100000000009</span> <span class="o">|</span> <span class="mi">375</span><span class="p">.</span><span class="mi">8</span> <span class="o">|</span> <span class="p">...</span>
<span class="mi">5</span> <span class="o">|</span> <span class="mi">2873</span><span class="p">.</span><span class="mi">8</span> <span class="o">|</span> <span class="mi">9</span><span class="p">.</span><span class="mi">78120000000011</span> <span class="o">|</span> <span class="mi">2873</span><span class="p">.</span><span class="mi">8</span> <span class="o">|</span> <span class="p">...</span></code></pre></figure>
<p>If you’re interested in how to implement such operators, you can look at
<a href="https://github.com/powa-team/powa-archivist/commit/203ed02a5205ad41ce0854bf0580779d7fb6193b#diff-efeed95efc180d43a149361145c2f082R1079">PoWA’s
implementation</a>.</p>
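<p>For a rough idea of what the two operators compute, here is a C sketch over a hypothetical, trimmed-down record holding a microsecond timestamp and two counters. It is not PoWA’s implementation, which is linked above. The sample output above shows the <strong>/</strong> operator dividing by a whole number of seconds (5753 calls over 5.004611 seconds give 1150.6 calls per second, i.e. 5753 / 5), so this sketch also truncates the elapsed time to whole seconds; that detail is inferred from the output, not from PoWA’s source.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down history record. */
typedef struct
{
    int64_t ts_us;     /* snapshot time, in microseconds */
    int64_t calls;     /* cumulative number of calls */
    double total_time; /* cumulative runtime, in milliseconds */
} HistRecord;

/* Result of "-": elapsed time plus counter deltas. */
typedef struct
{
    int64_t intvl_us;
    int64_t calls;
    double total_time;
} HistDiff;

/* Result of "/": the same deltas, per second. */
typedef struct
{
    int64_t sec;
    double calls_per_sec;
    double runtime_per_sec;
} HistRate;

static HistDiff record_minus(HistRecord cur, HistRecord prev)
{
    HistDiff d = { cur.ts_us - prev.ts_us,
                   cur.calls - prev.calls,
                   cur.total_time - prev.total_time };
    return d;
}

static HistRate record_div(HistRecord cur, HistRecord prev)
{
    HistDiff d = record_minus(cur, prev);
    int64_t sec = d.intvl_us / 1000000; /* truncated to whole seconds */

    assert(sec > 0);
    HistRate r = { sec, (double) d.calls / sec, d.total_time / sec };
    return r;
}
```

Defining the operators on the record type, rather than writing the arithmetic in each query, is what keeps the window-function queries above short.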
<h3 id="conclusion">Conclusion</h3>
<p>You now know the basics of how to avoid per-row storage overhead. Depending
on your needs and on the specifics of your data, you should be able to find a
way to aggregate it, possibly adding a few extra columns, in order to keep good
performance while saving disk space.</p>
<!--
Test 1, simple integer, 10M row
with s(id) AS (select unnest(id) from agg_1 where id && array[500])
select * from s where id = 500;
raw_1 (id integer)
insert: 23s
size: 346 MB
read data: 2.2s
create index: 5.2s
index size: 214 MB
find 1 row: 1.4ms
agg_1 (id integer[])
5 val per row
INSERT INTO agg_1 SELECT array_agg(i) FROM generate_series(1,10000000) i GROUP BY i % 2000000 ;
insert: 18s
size: 146 MB (no toast)
read raw data: 377 ms
unnnest: 4s
create (GIN) index: 73s
index size: 478 MB
find 1 val: 0.25ms
agg_1 (id integer[])
20 val per row
INSERT INTO agg_1 SELECT array_agg(i) FROM generate_series(1,10000000) i GROUP BY i % 500000 ;
insert: 13s
size: 64 MB (no toast)
read raw data: 100ms
read unnnest: 2.6 s
create (GIN) index: 70s
index size: 478MB
find 1 val: 0.3ms
agg_1 (id integer[])
100 val per row
INSERT INTO agg_1 SELECT array_agg(i) FROM generate_series(1,10000000) i GROUP BY i % 100000;
insert: 10s
size: 43MB (notoast)
read raw data: 31ms
read unnnest: 2s
create (GIN) index: 68s
index size: 478 MB
find 1 val: 0.45 ms
agg_1 (id integer[])
200 val per row
INSERT INTO agg_1 SELECT array_agg(i) FROM generate_series(1,10000000) i GROUP BY i % 50000;
insert: 9.7s
size: 43MB (notoast)
read raw data: 21ms
read unnnest: 2s
create (GIN) index: 69s
index size: 478MB
find 1 val: 0.7ms
agg_1 (id integer[])
1000 val per row
INSERT INTO agg_1 SELECT array_agg(i) FROM generate_series(1,10000000) i GROUP BY i % 10000;
insert: 10s
size: 53MB (toast)
read raw data: 7ms
read unnnest: 2s
create (GIN) index: 67s
index size: 478MB
find 1 val: 2,7ms
-->
<p><a href="https://rjuju.github.io/postgresqlfr/2019/04/06/minimiser-le-surcout-de-stockage-par-ligne.html">Minimiser le surcoût de stockage par ligne</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 06, 2019.</p>
https://rjuju.github.io/postgresqlfr/2019/04/02/support-des-wait-events-pour-powa2019-04-02T17:08:24+00:002019-04-02T17:08:24+00:00Julien Rouhaudhttps://rjuju.github.io
<p>You can now visualize <strong>Wait Events</strong> in <a href="https://powa.readthedocs.io/">PoWA
3.2.0</a> thanks to the
<a href="https://github.com/postgrespro/pg_wait_sampling/">pg_wait_sampling</a>
extension.</p>
<h3 id="wait-events--pg_wait_sampling">Wait Events & pg_wait_sampling</h3>
<p>Wait events are a well-known and very useful feature in many relational
database engines. They were added in
<a href="https://github.com/postgres/postgres/commit/53be0b1add7">PostgreSQL 9.6</a>, a
few versions ago now. Unlike most other statistics exposed by PostgreSQL, they
are only a snapshot of the events processes are waiting on at a given instant,
not cumulative counters. You can see this information using the
<code class="language-plaintext highlighter-rouge">pg_stat_activity</code> view, for instance:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">datid</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="n">wait_event_type</span><span class="p">,</span> <span class="n">wait_event</span><span class="p">,</span> <span class="n">query</span> <span class="k">FROM</span> <span class="n">pg_stat_activity</span><span class="p">;</span>
<span class="n">datid</span> <span class="o">|</span> <span class="n">pid</span> <span class="o">|</span> <span class="n">wait_event_type</span> <span class="o">|</span> <span class="n">wait_event</span> <span class="o">|</span> <span class="n">query</span>
<span class="c1">--------+-------+-----------------+---------------------+-------------------------------------------------------------------------</span>
<span class="o"><</span><span class="k">NULL</span><span class="o">></span> <span class="o">|</span> <span class="mi">13782</span> <span class="o">|</span> <span class="n">Activity</span> <span class="o">|</span> <span class="n">AutoVacuumMain</span> <span class="o">|</span>
<span class="mi">16384</span> <span class="o">|</span> <span class="mi">16615</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">relation</span> <span class="o">|</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t1</span><span class="p">;</span>
<span class="mi">16384</span> <span class="o">|</span> <span class="mi">16621</span> <span class="o">|</span> <span class="n">Client</span> <span class="o">|</span> <span class="n">ClientRead</span> <span class="o">|</span> <span class="k">LOCK</span> <span class="k">TABLE</span> <span class="n">t1</span><span class="p">;</span>
<span class="mi">847842</span> <span class="o">|</span> <span class="mi">16763</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">WALWriteLock</span> <span class="o">|</span> <span class="k">END</span><span class="p">;</span>
<span class="mi">847842</span> <span class="o">|</span> <span class="mi">16764</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="k">UPDATE</span> <span class="n">pgbench_branches</span> <span class="k">SET</span> <span class="n">bbalance</span> <span class="o">=</span> <span class="n">bbalance</span> <span class="o">+</span> <span class="mi">1229</span> <span class="k">WHERE</span> <span class="n">bid</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="mi">847842</span> <span class="o">|</span> <span class="mi">16766</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">WALWriteLock</span> <span class="o">|</span> <span class="k">END</span><span class="p">;</span>
<span class="mi">847842</span> <span class="o">|</span> <span class="mi">16767</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="k">UPDATE</span> <span class="n">pgbench_tellers</span> <span class="k">SET</span> <span class="n">tbalance</span> <span class="o">=</span> <span class="n">tbalance</span> <span class="o">+</span> <span class="mi">3383</span> <span class="k">WHERE</span> <span class="n">tid</span> <span class="o">=</span> <span class="mi">86</span><span class="p">;</span>
<span class="mi">847842</span> <span class="o">|</span> <span class="mi">16769</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="k">UPDATE</span> <span class="n">pgbench_branches</span> <span class="k">SET</span> <span class="n">bbalance</span> <span class="o">=</span> <span class="n">bbalance</span> <span class="o">+</span> <span class="o">-</span><span class="mi">3786</span> <span class="k">WHERE</span> <span class="n">bid</span> <span class="o">=</span> <span class="mi">10</span><span class="p">;</span>
<span class="p">[...]</span></code></pre></figure>
<p>In this example, we can see that the <em>wait event</em> for pid 16615 is a
<code class="language-plaintext highlighter-rouge">Lock</code> on a <code class="language-plaintext highlighter-rouge">relation</code>. In other words, the query is blocked
waiting for a heavyweight lock, while pid 16621, which evidently holds the lock
(its last statement was <code class="language-plaintext highlighter-rouge">LOCK TABLE t1</code>), is waiting for the client to
send its next command. This information could already be obtained with older
versions, although in a different way. More interestingly, we can also see
that the <em>wait event</em> for pid 16766 is an
<code class="language-plaintext highlighter-rouge">LWLock</code>, that is a <strong>Lightweight Lock</strong>. Lightweight
locks are internal, transient locks that were previously impossible to see at
the SQL level. In this example, the query is waiting on a
<strong>WALWriteLock</strong>, a lightweight lock mainly used to control writing
the WAL buffers out to disk. A complete list of the available <em>wait events</em>
is <a href="https://docs.postgresql.fr/current/monitoring-stats.html#wait-event-table">available in the official
documentation</a>.</p>
<p>This information was sorely missing, and it is very useful to diagnose the
causes of slowdowns. However, only seeing the <em>wait events</em> at the
current instant is clearly not enough to get a good idea of what is happening
on the server. Since most <em>wait events</em> are very short-lived by nature,
what you need is to sample them at a high frequency. Trying to do this sampling
with an external tool, even at a one-second interval, is usually not good
enough. That's where the <a href="https://github.com/postgrespro/pg_wait_sampling/">pg_wait_sampling
extension</a> provides
a really brilliant solution. It's an extension written by
<a href="http://akorotkov.github.io/">Alexander Korotkov</a> and Ildus Kurbangaliev.
Once enabled (it needs to be configured in
<code class="language-plaintext highlighter-rouge">shared_preload_libraries</code>, so an instance restart is required),
it samples the <em>wait events</em> into shared memory every <strong>10
ms</strong> (by default), and also aggregates the counters per <em>wait
event</em> type (wait_event_type), <em>wait event</em> and queryid (if
<code class="language-plaintext highlighter-rouge">pg_stat_statements</code> is also enabled). For more details about
configuring and using this extension, see the
<a href="https://github.com/postgrespro/pg_wait_sampling/blob/master/README.md">extension's
README</a>.
As all the work is done in memory by an extension written in C, it's very
efficient. Moreover, the implementation does very little locking, so the
overhead of this extension should be almost negligible.
I ran some benchmarks on my laptop (unfortunately I don't have a better
machine to test on) with a read-only
<a href="https://www.postgresql.org/docs/current/static/pgbench.html">pgbench</a>
workload where all the data fit in PostgreSQL's cache
(<code class="language-plaintext highlighter-rouge">shared_buffers</code>), with 8 and then 90 clients, to try to get the
highest possible overhead. The average over 3 runs was about 1% of overhead,
with run-to-run fluctuations of about 0.8%.</p>
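<p>To make the sampling idea more concrete, here is a minimal Python model of such a waits profile. It is not pg_wait_sampling's C code: the <code>BackendState</code> tuple, <code>sample_once()</code> and <code>run_profile()</code> are hypothetical names, and each element of <code>ticks</code> simply stands for one sampling tick (every 10 ms by default in the extension). On every tick, each backend that is currently waiting adds one to the counter of its (wait_event_type, wait_event, queryid) key.</p>

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class BackendState(NamedTuple):
    """One backend as seen at sampling time (hypothetical, like a pg_stat_activity row)."""
    pid: int
    wait_event_type: Optional[str]
    wait_event: Optional[str]
    queryid: Optional[int]


def sample_once(profile: Counter, snapshot: Iterable[BackendState]) -> None:
    """Record one sampling tick: +1 for every backend currently waiting."""
    for backend in snapshot:
        if backend.wait_event is None:
            continue  # this backend is not waiting on anything right now
        key = (backend.wait_event_type, backend.wait_event, backend.queryid)
        profile[key] += 1


def run_profile(ticks: Iterable[Iterable[BackendState]]) -> Counter:
    """Accumulate every tick into a single in-memory profile."""
    profile: Counter = Counter()
    for snapshot in ticks:
        sample_once(profile, snapshot)
    return profile
```

<p>With a 10 ms period, a key seen in N samples roughly corresponds to N × 10 ms of cumulated waiting, which is why a high sampling frequency matters for such short-lived events.</p>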
<h3 id="et-powa-">And PoWA?</h3>
<p>So, thanks to this extension, we have a cumulative and extremely precise view
of the <em>wait events</em>. That's great, but like all other cumulative
statistics in PostgreSQL, you need to sample these counters regularly if you
want to know what happened at a given moment in the past, as the extension's
README actually points out:</p>
<blockquote>
<p>[…]
Waits profile. It’s implemented as in-memory hash table where count
of samples are accumulated per each process and each wait event
(and each query with <code class="language-plaintext highlighter-rouge">pg_stat_statements</code>). This hash
table can be reset by user request. Assuming there is a client who
periodically dumps profile and resets it, user can have statistics of
intensivity of wait events among time.</p>
</blockquote>
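<p>The "dump and reset" pattern described in this README excerpt can be sketched in plain Python, again with hypothetical names (<code>WaitProfile</code>, <code>collect()</code>) rather than the extension's real interface: samples accumulate in memory, and every <code>samples_per_dump</code> ticks a client copies the counters and clears them, which yields one set of counts per interval.</p>

```python
from collections import Counter
from typing import Dict, List, Tuple

# (wait_event_type, wait_event)
Key = Tuple[str, str]


class WaitProfile:
    """Hypothetical in-memory waits profile that a client can dump and reset."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def add_sample(self, key: Key) -> None:
        self.counts[key] += 1

    def dump_and_reset(self) -> Dict[Key, int]:
        dumped = dict(self.counts)  # copy the counters before clearing them
        self.counts.clear()
        return dumped


def collect(profile: WaitProfile, ticks: List[List[Key]],
            samples_per_dump: int) -> List[Dict[Key, int]]:
    """Feed sampling ticks to the profile, dumping it every samples_per_dump ticks."""
    history: List[Dict[Key, int]] = []
    for i, tick in enumerate(ticks, start=1):
        for key in tick:
            profile.add_sample(key)
        if i % samples_per_dump == 0:
            history.append(profile.dump_and_reset())
    # Samples taken after the last complete interval stay in profile.counts.
    return history
```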
<p>That's exactly the purpose of <a href="http://powa.readthedocs.io/">PoWA</a>: saving the
statistics counters efficiently, and displaying them in a graphical
interface.</p>
<p>PoWA 3.2 automatically detects whether the
<a href="https://github.com/postgrespro/pg_wait_sampling/">pg_wait_sampling</a>
extension is already present, or whether you install it later, and will start
collecting its data, giving you a really precise view of the <em>wait
events</em> over time on your databases!</p>
<p>The data is stored in <a href="/postgresql/2016/09/16/minimizing-tuple-overhead.html">regular PoWA tables</a>:
<code class="language-plaintext highlighter-rouge">powa_wait_sampling_history_current</code> holds the last 100 snapshots (the default value
of <code class="language-plaintext highlighter-rouge">powa.coalesce</code>), and older values are aggregated
in the <code class="language-plaintext highlighter-rouge">powa_wait_sampling_history</code> table, with a history going back as far as
the period defined by <code class="language-plaintext highlighter-rouge">powa.retention</code>. For instance, here's a
simple query showing the first 20 changes found in the snapshots still stored in
<code class="language-plaintext highlighter-rouge">powa_wait_sampling_history_current</code> for the <code class="language-plaintext highlighter-rouge">bench</code> database:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">WITH</span> <span class="n">s</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="p">(</span><span class="n">record</span><span class="p">).</span><span class="n">ts</span><span class="p">,</span> <span class="n">queryid</span><span class="p">,</span> <span class="n">event_type</span><span class="p">,</span> <span class="n">event</span><span class="p">,</span>
<span class="p">(</span><span class="n">record</span><span class="p">).</span><span class="k">count</span> <span class="o">-</span> <span class="n">lag</span><span class="p">((</span><span class="n">record</span><span class="p">).</span><span class="k">count</span><span class="p">)</span>
<span class="n">OVER</span> <span class="p">(</span><span class="k">PARTITION</span> <span class="k">BY</span> <span class="n">queryid</span><span class="p">,</span> <span class="n">event_type</span><span class="p">,</span> <span class="n">event</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="p">(</span><span class="n">record</span><span class="p">).</span><span class="n">ts</span><span class="p">)</span>
<span class="k">AS</span> <span class="n">events</span>
<span class="k">FROM</span> <span class="n">powa_wait_sampling_history_current</span> <span class="n">w</span>
<span class="k">JOIN</span> <span class="n">pg_database</span> <span class="n">d</span> <span class="k">ON</span> <span class="n">d</span><span class="p">.</span><span class="n">oid</span> <span class="o">=</span> <span class="n">w</span><span class="p">.</span><span class="n">dbid</span>
<span class="k">WHERE</span> <span class="n">d</span><span class="p">.</span><span class="n">datname</span> <span class="o">=</span> <span class="s1">'bench'</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">s</span>
<span class="k">WHERE</span> <span class="n">events</span> <span class="o">!=</span> <span class="mi">0</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">ts</span> <span class="k">ASC</span><span class="p">,</span> <span class="n">event</span> <span class="k">DESC</span>
<span class="k">LIMIT</span> <span class="mi">20</span><span class="p">;</span>
<span class="n">ts</span> <span class="o">|</span> <span class="n">queryid</span> <span class="o">|</span> <span class="n">event_type</span> <span class="o">|</span> <span class="n">event</span> <span class="o">|</span> <span class="n">events</span>
<span class="c1">-------------------------------+----------------------+------------+----------------+--------</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">08</span><span class="p">.</span><span class="mi">037191</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6531859117817823569</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">pg_qualstats</span> <span class="o">|</span> <span class="mi">1233</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">4</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">149</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="mi">193</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="mi">1143</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6531859117817823569</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">pg_qualstats</span> <span class="o">|</span> <span class="mi">1</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">lock_manager</span> <span class="o">|</span> <span class="mi">2</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">lock_manager</span> <span class="o">|</span> <span class="mi">3</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">buffer_content</span> <span class="o">|</span> <span class="mi">2</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">14</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">335</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="mi">2604</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="mi">384</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">lock_manager</span> <span class="o">|</span> <span class="mi">13</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">lock_manager</span> <span class="o">|</span> <span class="mi">4</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8221555873158496753</span> <span class="o">|</span> <span class="n">IO</span> <span class="o">|</span> <span class="n">DataFileExtend</span> <span class="o">|</span> <span class="mi">1</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">buffer_content</span> <span class="o">|</span> <span class="mi">4</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">45</span><span class="p">:</span><span class="mi">08</span><span class="p">.</span><span class="mi">032938</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">5</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">45</span><span class="p">:</span><span class="mi">08</span><span class="p">.</span><span class="mi">032938</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">312</span>
<span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">45</span><span class="p">:</span><span class="mi">08</span><span class="p">.</span><span class="mi">032938</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="mi">2586</span>
<span class="p">(</span><span class="mi">20</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure>
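<p>The query above turns cumulative counters into per-snapshot changes with the <code>lag()</code> window function. The following Python sketch mirrors that logic clause by clause, under simplifying assumptions: rows are hypothetical <code>Snapshot</code> tuples with integer timestamps instead of the <code>(record).ts</code> and <code>(record).count</code> composite fields, and the join on <code>pg_database</code> is left out.</p>

```python
from itertools import groupby
from typing import List, NamedTuple, Optional


class Snapshot(NamedTuple):
    ts: int          # snapshot time, e.g. epoch seconds (simplified)
    queryid: int
    event_type: str
    event: str
    count: int       # cumulative number of samples at that time


class Change(NamedTuple):
    ts: int
    queryid: int
    event_type: str
    event: str
    events: int      # count - lag(count)


def changes(rows: List[Snapshot], limit: int = 20) -> List[Change]:
    def partition_key(r: Snapshot):
        return (r.queryid, r.event_type, r.event)

    out: List[Change] = []
    # PARTITION BY queryid, event_type, event (groupby needs sorted input)
    for _, part in groupby(sorted(rows, key=partition_key), key=partition_key):
        previous: Optional[int] = None
        # ORDER BY (record).ts inside each partition
        for r in sorted(part, key=lambda s: s.ts):
            # lag() is NULL on the first row, and "events != 0" filters NULL out
            if previous is not None and r.count - previous != 0:
                out.append(Change(r.ts, r.queryid, r.event_type, r.event,
                                  r.count - previous))
            previous = r.count
    # ORDER BY ts ASC, event DESC: stable sorts, least significant key first
    out.sort(key=lambda c: c.event, reverse=True)
    out.sort(key=lambda c: c.ts)
    return out[:limit]  # LIMIT
```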
<p class="notice"><strong>NOTE:</strong> There is also a per-database version of these values, for more
efficient processing at the database level, in the
<code class="language-plaintext highlighter-rouge">powa_wait_sampling_history_current_db</code> and <code class="language-plaintext highlighter-rouge">powa_wait_sampling_history_db</code> tables.</p>
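<p>As an illustration of what a per-database version can hold, here is a Python sketch that rolls per-query samples up to (timestamp, database, event type, event) by summing over queryid. This is an assumption made for the example, not PoWA's actual implementation; having such totals precomputed presumably avoids redoing this grouping over every query when only database-level figures are needed.</p>

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical row layout: (ts, dbid, queryid, event_type, event, count)
Row = Tuple[int, int, int, str, str, int]
# Per-database key: (ts, dbid, event_type, event)
DbKey = Tuple[int, int, str, str]


def rollup_per_db(rows: Iterable[Row]) -> Dict[DbKey, int]:
    """Sum sample counts over all queryids of the same database."""
    totals: Dict[DbKey, int] = defaultdict(int)
    for ts, dbid, _queryid, event_type, event, count in rows:
        totals[(ts, dbid, event_type, event)] += count
    return dict(totals)
```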
<p>This data is also visible in the
<a href="https://pypi.org/project/powa-web/">powa-web</a> interface. Here are some examples
of <em>wait events</em> as displayed by PoWA, with a simple
pgbench workload:</p>
<h5 id="wait-events-pour-linstance-entière">Wait events for the whole instance</h5>
<p><a href="/images/powa_waits_overview.png"><img src="/images/powa_waits_overview.png" alt="Wait events for the whole instance" /></a></p>
<h5 id="wait-events-pour-une-base-de-données">Wait events for a single database</h5>
<p><a href="/images/powa_waits_db.png"><img src="/images/powa_waits_db.png" alt="Wait events for a single database" /></a></p>
<h5 id="wait-events-pour-une-seule-requête">Wait events for a single query</h5>
<p><a href="/images/powa_waits_query.png"><img src="/images/powa_waits_query.png" alt="Wait events for a single query" /></a></p>
<p>This feature is available since PoWA 3.2. I hope to display more views of
this data in the future, including other graphs, since all the data is already
available in the database. Also, if you're a Python or JavaScript developer,
<a href="https://github.com/powa-team/powa-web">contributions
are always welcome</a>!</p>
<p><a href="https://rjuju.github.io/postgresqlfr/2019/04/02/support-des-wait-events-pour-powa.html">Support des Wait Events pour PoWA</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 02, 2019.</p>