rjuju's home Jekyll 2024-08-02T23:53:42+00:00 https://rjuju.github.io/ Julien Rouhaud https://rjuju.github.io/ <![CDATA[Extracting SQL from WAL? (part 2)]]> https://rjuju.github.io/postgresql/2023/12/20/extract-sql-from-wal-part2 2023-12-20T03:04:10+00:00 2023-12-20T03:04:10+00:00 Julien Rouhaud https://rjuju.github.io <p>In the <a href="/postgresql/2023/12/06/extract-sql-from-wal.html">previous article</a> of this series, we saw how to extract WAL records related to the exact SQL commands we want, INSERTs on heap tables, and what the structure of those records was. In this article we will focus on the heap specific information contained in those records and how to extract SQL queries from them.</p> <h3 id="insert-data">INSERT data</h3> <p>At the end of the <a href="/postgresql/2023/12/06/extract-sql-from-wal.html">previous article</a>, we could locate the various <code class="language-plaintext highlighter-rouge">xl_heap_insert</code> records from the WAL stream. From there, we extracted some metadata about the file’s physical location (tablespace oid, database oid and relation filenode among other things) and the data that was inserted itself.</p> <p>As a reminder, here’s an extract of the code responsible for generating the WAL records for an INSERT, in the <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/heap/heapam.c">heap_insert() function</a>, focusing on the interesting data:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">heap_insert</span><span class="p">(</span><span class="n">Relation</span> <span class="n">relation</span><span class="p">,</span> <span class="n">HeapTuple</span> <span class="n">tup</span><span class="p">,</span> <span class="n">CommandId</span> <span class="n">cid</span><span class="p">,</span> <span class="kt">int</span> <span class="n">options</span><span class="p">,</span> <span class="n">BulkInsertState</span> 
<span class="n">bistate</span><span class="p">)</span> <span class="p">{</span> <span class="p">[...]</span> <span class="n">xl_heap_header</span> <span class="n">xlhdr</span><span class="p">;</span> <span class="p">[...]</span> <span class="n">xlhdr</span><span class="p">.</span><span class="n">t_infomask2</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-&gt;</span><span class="n">t_data</span><span class="o">-&gt;</span><span class="n">t_infomask2</span><span class="p">;</span> <span class="n">xlhdr</span><span class="p">.</span><span class="n">t_infomask</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-&gt;</span><span class="n">t_data</span><span class="o">-&gt;</span><span class="n">t_infomask</span><span class="p">;</span> <span class="n">xlhdr</span><span class="p">.</span><span class="n">t_hoff</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-&gt;</span><span class="n">t_data</span><span class="o">-&gt;</span><span class="n">t_hoff</span><span class="p">;</span> <span class="p">[...]</span> <span class="n">XLogRegisterBuffer</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">buffer</span><span class="p">,</span> <span class="n">REGBUF_STANDARD</span> <span class="o">|</span> <span class="n">bufflags</span><span class="p">);</span> <span class="n">XLogRegisterBufData</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="o">&amp;</span><span class="n">xlhdr</span><span class="p">,</span> <span class="n">SizeOfHeapHeader</span><span class="p">);</span> <span class="cm">/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */</span> <span class="n">XLogRegisterBufData</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="p">(</span><span 
class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="n">heaptup</span><span class="o">-&gt;</span><span class="n">t_data</span> <span class="o">+</span> <span class="n">SizeofHeapTupleHeader</span><span class="p">,</span> <span class="n">heaptup</span><span class="o">-&gt;</span><span class="n">t_len</span> <span class="o">-</span> <span class="n">SizeofHeapTupleHeader</span><span class="p">);</span> <span class="p">[...]</span> </code></pre></div></div> <p>Two pieces of data are registered: an <code class="language-plaintext highlighter-rouge">xl_heap_header</code>, which contains some metadata about the tuple extracted from the <em>tuple header</em>, and the data part of a <code class="language-plaintext highlighter-rouge">HeapTuple</code>. Let’s look at those in detail.</p> <h3 id="page-layout">Page layout</h3> <p>First of all, let’s quickly see how postgres stores tables and indexes on disk. I will only cover the basics that will be helpful for the rest of the article. If you want to dig more into this topic, there are tons of resources available. You can refer to <a href="https://github.com/postgres/postgres/blob/master/src/include/storage/bufpage.h">this entry point in the code</a>, and I otherwise recommend looking at <a href="https://www.interdb.jp/pg/pgsql01.html#_1.3.">the section about it on the “The internals of postgres” website</a>.</p> <p>A good general introduction is <a href="https://www.postgresql.org/docs/current/storage-page-layout.html">the documentation</a>, which comes with a diagram of the layout that I include here:</p> <p><a href="/images/page_layout.png"><img src="/images/page_layout.png" alt="Physical page layout, from the official postgres documentation" /></a></p> <p>Every piece of table or index data that postgres stores on disk is stored in a <code class="language-plaintext highlighter-rouge">Page</code>, which is 8kB by default. 
Each page starts with a header that contains some metadata about the page and ends with an optional “special area”, which can contain additional information specific to the component of postgres that will use this page.</p> <p>In between is the actual data. The beginning of the data part is an array of <code class="language-plaintext highlighter-rouge">ItemId</code>, in ascending order, and the end of the data part holds the items themselves (the tuples, in the case of heap table pages), stored in the reverse order of the <code class="language-plaintext highlighter-rouge">ItemId</code> array. Unless the page is totally full, there is some free space between the last <code class="language-plaintext highlighter-rouge">ItemId</code> and the first item; its boundaries are the pd_lower and pd_upper offsets stored in the page header.</p> <p>Here’s the <code class="language-plaintext highlighter-rouge">ItemId</code> definition:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">ItemIdData</span> <span class="p">{</span> <span class="kt">unsigned</span> <span class="n">lp_off</span><span class="o">:</span><span class="mi">15</span><span class="p">,</span> <span class="cm">/* offset to tuple (from start of page) */</span> <span class="nl">lp_flags:</span><span class="mi">2</span><span class="p">,</span> <span class="cm">/* state of line pointer, see below */</span> <span class="nl">lp_len:</span><span class="mi">15</span><span class="p">;</span> <span class="cm">/* byte length of tuple */</span> <span class="p">}</span> <span class="n">ItemIdData</span><span class="p">;</span> </code></pre></div></div> <p>As you can see, it holds the location of the item in the page, a couple of state bits and the length of the item.</p> <h3 id="heaptuple">HeapTuple</h3> <p>The largest part stored in the record is the tuple itself. 
As the historic and default access method to store tuples is called <code class="language-plaintext highlighter-rouge">heap</code>, the struct that holds the tuple is called <code class="language-plaintext highlighter-rouge">HeapTuple</code>. Any custom <strong>Table Access Method</strong> can use a different struct to store what it needs for its specific implementation, but it will then also use a custom resource manager to generate specific WAL records.</p> <p>Here’s the <a href="https://github.com/postgres/postgres/blob/master/src/include/access/htup.h">definition of a HeapTuple</a>:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">HeapTupleData</span> <span class="p">{</span> <span class="n">uint32</span> <span class="n">t_len</span><span class="p">;</span> <span class="cm">/* length of *t_data */</span> <span class="n">ItemPointerData</span> <span class="n">t_self</span><span class="p">;</span> <span class="cm">/* SelfItemPointer */</span> <span class="n">Oid</span> <span class="n">t_tableOid</span><span class="p">;</span> <span class="cm">/* table the tuple came from */</span> <span class="cp">#define FIELDNO_HEAPTUPLEDATA_DATA 3 </span> <span class="n">HeapTupleHeader</span> <span class="n">t_data</span><span class="p">;</span> <span class="cm">/* -&gt; tuple header and data */</span> <span class="p">}</span> <span class="n">HeapTupleData</span><span class="p">;</span> </code></pre></div></div> <p>It starts with some metadata, which isn’t stored on disk but generated or retrieved from somewhere else when the struct is read from disk. Indeed, there wouldn’t be much value in storing the relation’s oid for each tuple on disk. 
The length of the tuple is stored on disk, as it’s a necessary piece of information, and is retrieved from the associated <code class="language-plaintext highlighter-rouge">ItemId</code> that we saw just before.</p> <p>After that follows the “real” data, which is what is stored in the <strong>item</strong> part of the <code class="language-plaintext highlighter-rouge">Page</code>. It’s again split in two parts: the tuple header, which I will cover a bit later, and the tuple data.</p> <p>The tuple data is the physical on-disk representation of the tuple. It was designed to be as space efficient as possible, so accessing individual fields is a bit complex, and CPU intensive. Let’s look at the most important parts of this design. First, the tuple data directly follows the tuple header, whose end is <a href="https://github.com/postgres/postgres/blob/master/src/include/access/htup_details.h">defined like this</a>:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">HeapTupleHeaderData</span> <span class="p">{</span> <span class="p">[...]</span> <span class="cm">/* ^ - 23 bytes - ^ */</span> <span class="cp">#define FIELDNO_HEAPTUPLEHEADERDATA_BITS 5 </span> <span class="n">bits8</span> <span class="n">t_bits</span><span class="p">[</span><span class="n">FLEXIBLE_ARRAY_MEMBER</span><span class="p">];</span> <span class="cm">/* bitmap of NULLs */</span> <span class="cm">/* MORE DATA FOLLOWS AT END OF STRUCT */</span> <span class="p">};</span> </code></pre></div></div> <p>You probably know or heard that in postgres, NULL attributes don’t use any storage. Indeed, if an attribute is NULL there won’t be anything in the “data section”, and the bit for its attribute number in the <code class="language-plaintext highlighter-rouge">t_bits</code> bitmap will be cleared (a set bit means that a value is present). The bitmap itself is only stored when the tuple contains at least one NULL attribute.</p> <p>Then, a lot of data types have a variable size (which is internally referred to as <code class="language-plaintext highlighter-rouge">varlena</code>). 
So, to save space postgres doesn’t store the offset of each attribute in the <code class="language-plaintext highlighter-rouge">HeapTuple</code> and just stores them next to each other (according to the datatype alignment rules) in a big chunk of memory.</p> <p>This is indeed efficient, but unless your tuple only contains non-null fixed-size attributes, the only way to access a specific attribute is to read all the previous ones, skipping the NULL attributes and computing the position of each following attribute by reading the length of the variable-length values. This process is called <strong>tuple deforming</strong>: it takes a tuple as input and outputs two arrays, one with the datums and one with the null flags, both indexed by the attribute number (0 based). The opposite operation (transforming an array of datums and an array of null flags into a tuple) is unsurprisingly called <strong>tuple forming</strong>. If you want to read a bit more about those operations, the underlying functions are called <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/common/heaptuple.c">heap_deform_tuple() and heap_form_tuple()</a>.</p> <p>Note that tuple deforming is one of the operations that can be <a href="https://www.postgresql.org/docs/current/jit.html">JITted</a>, and there are some optimisations on the tuple deforming operation. Postgres supports “partial” deforming and will avoid deforming the full tuple when possible, stopping at the last attribute that the query is referencing, and will cache the offset of the last attribute that has been deformed. But that can only help to some extent, so it’s always a good idea to mark columns as NOT NULL when possible, put all the columns with fixed-length attributes at the beginning of the tuples (with the NOT NULL ones first), ideally grouped by alignment size to avoid wasting a few bytes of padding, and put the most frequently accessed columns of variable length datatype next. 
All of that helps speed up tuple deforming as much as possible.</p> <h4 id="tuple-header">Tuple header</h4> <p>The first part of the stored data is an <code class="language-plaintext highlighter-rouge">xl_heap_header</code> struct. It’s just a shorter version of the real tuple header that only contains some parts of it, the rest of the header being available elsewhere in the WAL record or just not needed otherwise. Doing it this way can save a few bytes for each insert in the WAL, which is always a good thing. Its definition is:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">xl_heap_header</span> <span class="p">{</span> <span class="n">uint16</span> <span class="n">t_infomask2</span><span class="p">;</span> <span class="n">uint16</span> <span class="n">t_infomask</span><span class="p">;</span> <span class="n">uint8</span> <span class="n">t_hoff</span><span class="p">;</span> <span class="p">}</span> <span class="n">xl_heap_header</span><span class="p">;</span> </code></pre></div></div> <p><em>t_infomask2</em> and <em>t_infomask</em> are two bitmaps that contain information about the tuple. You may have heard about <a href="https://wiki.postgresql.org/wiki/Hint_Bits">hint bits</a>: those two fields contain the tuple-level hint bits.</p> <p>Let’s look at their details in <a href="https://github.com/postgres/postgres/blob/master/src/include/access/htup_details.h">htup_details.h</a>:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">HeapTupleHeaderData</span> <span class="p">{</span> <span class="p">[...]</span> <span class="cm">/* Fields below here must match MinimalTupleData! 
*/</span> <span class="cp">#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK2 2 </span> <span class="n">uint16</span> <span class="n">t_infomask2</span><span class="p">;</span> <span class="cm">/* number of attributes + various flags */</span> <span class="cp">#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK 3 </span> <span class="n">uint16</span> <span class="n">t_infomask</span><span class="p">;</span> <span class="cm">/* various flag bits, see below */</span> <span class="cp">#define FIELDNO_HEAPTUPLEHEADERDATA_HOFF 4 </span> <span class="n">uint8</span> <span class="n">t_hoff</span><span class="p">;</span> <span class="cm">/* sizeof header incl. bitmap, padding */</span> <span class="cm">/* ^ - 23 bytes - ^ */</span> <span class="p">[...]</span> <span class="p">};</span> <span class="cm">/* information stored in t_infomask2: */</span> <span class="cp">#define HEAP_NATTS_MASK 0x07FF </span><span class="cm">/* 11 bits for number of attributes */</span><span class="cp"> </span><span class="cm">/* bits 0x1800 are available */</span> <span class="cp">#define HEAP_KEYS_UPDATED 0x2000 </span><span class="cm">/* tuple was updated and key cols modified, or tuple deleted */</span><span class="cp"> #define HEAP_HOT_UPDATED 0x4000 </span><span class="cm">/* tuple was HOT-updated */</span><span class="cp"> #define HEAP_ONLY_TUPLE 0x8000 </span><span class="cm">/* this is heap-only tuple */</span><span class="cp"> </span> <span class="cp">#define HEAP2_XACT_MASK 0xE000 </span><span class="cm">/* visibility-related bits */</span><span class="cp"> </span><span class="p">[...]</span> <span class="cm">/* information stored in t_infomask: */</span> <span class="cp">#define HEAP_HASNULL 0x0001 </span><span 
class="cm">/* has null attribute(s) */</span><span class="cp"> #define HEAP_HASVARWIDTH 0x0002 </span><span class="cm">/* has variable-width attribute(s) */</span><span class="cp"> </span><span class="p">[...]</span> <span class="cp">#define HEAP_XMIN_COMMITTED 0x0100 </span><span class="cm">/* t_xmin committed */</span><span class="cp"> #define HEAP_XMIN_INVALID 0x0200 </span><span class="cm">/* t_xmin invalid/aborted */</span><span class="cp"> #define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID) #define HEAP_XMAX_COMMITTED 0x0400 </span><span class="cm">/* t_xmax committed */</span><span class="cp"> #define HEAP_XMAX_INVALID 0x0800 </span><span class="cm">/* t_xmax invalid/aborted */</span><span class="cp"> </span><span class="p">[...]</span> </code></pre></div></div> <p>We can see a few bits that are useful for <strong>tuple deforming</strong>. For instance, we see that 11 bits of <em>t_infomask2</em> are used to store the actual number of attributes stored in this tuple. Adding a new column to a table doesn’t always require a full table rewrite, and in that case those bits are critical to know when to stop looking for additional attributes when accessing tuples stored before the column was added. There’s also information on whether the tuple contains any NULL or variable-length datatype attribute. The rest of the bits are a clever use of the available space to handle various SQL operations, MVCC rules, HOT updates and other low level optimisations.</p> <h3 id="tuple-descriptors">Tuple descriptors</h3> <p>Now that we covered some internals of the <code class="language-plaintext highlighter-rouge">HeapTuple</code>, it seems much easier to reach our goal: transform the INSERT WAL records into plain SQL statements. We know that we just have to <em>deform</em> the tuples to retrieve the values and the NULL attributes, and generating the SQL statements around them isn’t hard. 
But here comes the second reason why we need a proper data directory to do so, and why the lack of DDL is important.</p> <p>As you probably guessed by now, one critical piece of information needed for the <em>tuple deforming</em> operation is the table structure declaration. Indeed, the <code class="language-plaintext highlighter-rouge">HeapTuple</code> is just a big chunk of memory, and without the list of columns, their data types and the type details, it’s impossible to interpret it. If your model doesn’t change too much, it’s probably possible to do without and instead generate some kind of mapping manually based on what you know about the history of the instance. Be careful if you go this way: any discrepancy between the original and generated data types can lead to bogus output in the best case, or crash your whole instance in the worst case. But in my case I had the guarantee that no DDL happened since the incident, and the other data directory was available, so I could just rely on it.</p> <p>Postgres handles the table structure declaration using another struct, called <code class="language-plaintext highlighter-rouge">TupleDesc</code>, for <em>tuple descriptor</em>. 
Its definition is:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">TupleDescData</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">natts</span><span class="p">;</span> <span class="cm">/* number of attributes in the tuple */</span> <span class="n">Oid</span> <span class="n">tdtypeid</span><span class="p">;</span> <span class="cm">/* composite type ID for tuple type */</span> <span class="n">int32</span> <span class="n">tdtypmod</span><span class="p">;</span> <span class="cm">/* typmod for tuple type */</span> <span class="kt">int</span> <span class="n">tdrefcount</span><span class="p">;</span><span class="cm">/* reference count, or -1 if not counting */</span> <span class="n">TupleConstr</span> <span class="o">*</span><span class="n">constr</span><span class="p">;</span> <span class="cm">/* constraints, or NULL if none */</span> <span class="cm">/* attrs[N] is the description of Attribute Number N+1 */</span> <span class="n">FormData_pg_attribute</span> <span class="n">attrs</span><span class="p">[</span><span class="n">FLEXIBLE_ARRAY_MEMBER</span><span class="p">];</span> <span class="p">}</span> <span class="n">TupleDescData</span><span class="p">;</span> </code></pre></div></div> <p>In our case the most interesting members are the number of attributes (<code class="language-plaintext highlighter-rouge">natts</code>) and the array of <code class="language-plaintext highlighter-rouge">pg_attribute</code> records (<code class="language-plaintext highlighter-rouge">attrs</code>). Those are also useful for the SQL generation part, as we can retrieve the columns from it. Note also that postgres will generate a <code class="language-plaintext highlighter-rouge">TupleDesc</code> automatically when you internally open a relation.</p> <p>Let’s recapitulate. 
We have the record data. The filename contains the physical file location information that we can use to retrieve the actual relation. We know how to get the tuple descriptor for this relation, and we can use it to deform the tuple and get the values from it. We have <em>almost</em> everything we need to generate the SQL queries.</p> <p>The only remaining detail is that the values we get from the tuple deforming operation are in their physical representation, and we need to emit their textual representation. Again, that’s not a problem as each data type has a dedicated function for that, called the <strong>type output function</strong>, available in <code class="language-plaintext highlighter-rouge">pg_type.typoutput</code>.</p> <h3 id="extracting-sql-from-the-insert-records">Extracting SQL from the INSERT records</h3> <p>Now it’s time for the fun part, where we just need to put everything together to finish the project!</p> <p>I chose to write it as an extension to be able to add and remove it easily from a production server. I also chose to minimize the amount of C code and rely on plpgsql functions when possible. It’s faster to write and plpgsql is also much safer.</p> <p>I only wrote a single <code class="language-plaintext highlighter-rouge">pg_decode_record()</code> C function, which takes as input a record as a bytea, the tablespace oid and the relation filenode, and emits the underlying SQL query. 
I wrote an extra <code class="language-plaintext highlighter-rouge">pg_decode_all_records()</code> function in plpgsql that uses existing <code class="language-plaintext highlighter-rouge">pg_ls_dir()</code> and <code class="language-plaintext highlighter-rouge">pg_read_binary_file()</code> to retrieve the files and record, and <code class="language-plaintext highlighter-rouge">split_part()</code> to extract the metadata from the filename.</p> <p>I’m <a href="/assets/patch/pg_decode_record.tgz">attaching the resulting extension to this article</a> so you can see the whole implementation and adapt it if needed, and will just quickly describe the main parts here as we already covered the underlying elements. I’m also only showing here a simplified version to avoid too many implementation details.</p> <p>First, I look for a matching relation oid in the pg_class catalog for the given tablespace and relfilenode, open the found relation with the weakest lock possible, make a copy of the tuple descriptor and start generating the SQL query with the qualified relation name. 
As with any normal application, you need to make sure that the identifiers are properly quoted to generate working queries:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PGDLLEXPORT</span> <span class="n">Datum</span> <span class="nf">pg_decode_record</span><span class="p">(</span><span class="n">PG_FUNCTION_ARGS</span><span class="p">)</span> <span class="p">{</span> <span class="n">bytea</span> <span class="o">*</span><span class="n">record</span> <span class="o">=</span> <span class="n">PG_GETARG_BYTEA_PP</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span> <span class="n">Oid</span> <span class="n">spc</span> <span class="o">=</span> <span class="n">PG_GETARG_OID</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span> <span class="n">Oid</span> <span class="n">relfilenode</span> <span class="o">=</span> <span class="n">PG_GETARG_OID</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span> <span class="cm">/* Get the relation oid from the tablespace oid and relfilenode */</span> <span class="n">relid</span> <span class="o">=</span> <span class="n">get_spc_relnumber_relid</span><span class="p">(</span><span class="n">spc</span><span class="p">,</span> <span class="n">relfilenode</span><span class="p">);</span> <span class="n">relation</span> <span class="o">=</span> <span class="n">table_open</span><span class="p">(</span><span class="n">relid</span><span class="p">,</span> <span class="n">AccessShareLock</span><span class="p">);</span> <span class="n">tupdesc</span> <span class="o">=</span> <span class="n">CreateTupleDescCopy</span><span class="p">(</span><span class="n">RelationGetDescr</span><span class="p">(</span><span class="n">relation</span><span class="p">));</span> <span class="cm">/* Start generating the SQL query */</span> <span class="n">initStringInfo</span><span class="p">(</span><span 
class="n">buf</span><span class="p">);</span> <span class="n">appendStringInfo</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"INSERT INTO %s.%s"</span><span class="p">,</span> <span class="n">quote_identifier</span><span class="p">(</span><span class="n">get_namespace_name</span><span class="p">(</span><span class="n">RelationGetNamespace</span><span class="p">(</span><span class="n">relation</span><span class="p">))),</span> <span class="n">quote_identifier</span><span class="p">(</span><span class="n">RelationGetRelationName</span><span class="p">(</span><span class="n">relation</span><span class="p">)));</span> </code></pre></div></div> <p>The next part extracts the data from the record and generates a <code class="language-plaintext highlighter-rouge">HeapTuple</code> with just enough information to be correctly deformed:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/* mimic heap_xlog_insert */</span> <span class="n">data</span> <span class="o">=</span> <span class="n">VARDATA_ANY</span><span class="p">(</span><span class="n">record</span><span class="p">);</span> <span class="n">datalen</span> <span class="o">=</span> <span class="n">VARSIZE_ANY_EXHDR</span><span class="p">(</span><span class="n">record</span><span class="p">);</span> <span class="p">[...]</span> <span class="n">htup</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">tbuf</span><span class="p">.</span><span class="n">hdr</span><span class="p">;</span> <span class="p">[...]</span> <span class="n">htup</span><span class="o">-&gt;</span><span class="n">t_hoff</span> <span class="o">=</span> <span class="n">xlhdr</span><span class="p">.</span><span class="n">t_hoff</span><span class="p">;</span> <span class="cm">/* build a fake tuple with the bare minimum to deform it */</span> <span class="n">tuple</span> <span class="o">=</span> <span class="p">(</span><span 
class="n">HeapTuple</span><span class="p">)</span> <span class="n">palloc0</span><span class="p">(</span><span class="n">HEAPTUPLESIZE</span> <span class="o">+</span> <span class="n">VARSIZE_ANY</span><span class="p">(</span><span class="n">record</span><span class="p">));</span> <span class="n">tuple</span><span class="o">-&gt;</span><span class="n">t_data</span> <span class="o">=</span> <span class="n">htup</span><span class="p">;</span> <span class="n">tuple</span><span class="o">-&gt;</span><span class="n">t_len</span> <span class="o">=</span> <span class="n">VARSIZE_ANY</span><span class="p">(</span><span class="n">record</span><span class="p">);</span> <span class="n">ItemPointerSetInvalid</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="n">tuple</span><span class="o">-&gt;</span><span class="n">t_self</span><span class="p">));</span> <span class="n">tuple</span><span class="o">-&gt;</span><span class="n">t_tableOid</span> <span class="o">=</span> <span class="n">relid</span><span class="p">;</span> </code></pre></div></div> <p>For the next step, we just need to allocate the 2 arrays needed for the deforming and call <code class="language-plaintext highlighter-rouge">heap_deform_tuple()</code>:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">values</span> <span class="o">=</span> <span class="n">palloc0</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">Datum</span><span class="p">)</span> <span class="o">*</span> <span class="n">tupdesc</span><span class="o">-&gt;</span><span class="n">natts</span><span class="p">);</span> <span class="n">isnull</span> <span class="o">=</span> <span class="n">palloc0</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">bool</span><span class="p">)</span> <span class="o">*</span> <span class="n">tupdesc</span><span 
class="o">-&gt;</span><span class="n">natts</span><span class="p">);</span> <span class="n">heap_deform_tuple</span><span class="p">(</span><span class="n">tuple</span><span class="p">,</span> <span class="n">tupdesc</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">isnull</span><span class="p">);</span> </code></pre></div></div> <p>Now that we have all the elements, we just need to iterate over the list of columns in the tuple descriptor, output a NULL if needed, otherwise find the type output function, call it for our value, and output it in the query after escaping it:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/* append the values */</span> <span class="n">appendStringInfoString</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">" VALUES ("</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">tupdesc</span><span class="o">-&gt;</span><span class="n">natts</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="kt">char</span> <span class="o">*</span><span class="n">value</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span> <span class="n">Oid</span> <span class="n">typoutput</span><span class="p">;</span> <span class="n">bool</span> <span class="n">typisvarlena</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="n">appendStringInfoString</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">", 
"</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">isnull</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span> <span class="n">appendStringInfoString</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"NULL"</span><span class="p">);</span> <span class="k">continue</span><span class="p">;</span> <span class="p">}</span> <span class="n">getTypeOutputInfo</span><span class="p">(</span><span class="n">TupleDescAttr</span><span class="p">(</span><span class="n">tupdesc</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">atttypid</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">typoutput</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">typisvarlena</span><span class="p">);</span> <span class="n">value</span> <span class="o">=</span> <span class="n">OidOutputFunctionCall</span><span class="p">(</span><span class="n">typoutput</span><span class="p">,</span> <span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span> <span class="n">value</span> <span class="o">=</span> <span class="n">quote_literal_cstr</span><span class="p">(</span><span class="n">value</span><span class="p">);</span> <span class="n">appendStringInfo</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"%s"</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span> <span class="n">pfree</span><span class="p">(</span><span class="n">value</span><span class="p">);</span> <span class="p">}</span> <span class="n">appendStringInfoString</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">");"</span><span class="p">);</span> </code></pre></div></div> <p>Once done, we just need to properly 
close the relation and return the generated query to the caller:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">table_close</span><span class="p">(</span><span class="n">relation</span><span class="p">,</span> <span class="n">NoLock</span><span class="p">);</span> <span class="n">PG_RETURN_TEXT_P</span><span class="p">(</span><span class="n">cstring_to_text</span><span class="p">(</span><span class="n">buf</span><span class="p">.</span><span class="n">data</span><span class="p">));</span> <span class="err">}</span> </code></pre></div></div> <p>And that’s all you need for the basic scenario! The real implementation has a bit more code for various other cases, like <strong>very basic</strong> TOAST table support, but is still unlikely to correctly handle any weird corner cases that can happen in the wild.</p> <h3 id="basic-usage">Basic usage</h3> <p>We can finally see the result of all the hard work in this article and the previous one! 
I will be using a simple scenario, first saving the current WAL position to only keep the records generated afterwards, then removing all the data from the table (without changing its relfilenode) to make sure that we don’t read anything from the table itself.</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Get the current WAL location</span> <span class="n">rjuju</span> <span class="o">=#</span> <span class="k">SELECT</span> <span class="n">pg_current_wal_lsn</span><span class="p">();</span> <span class="n">pg_current_wal_lsn</span> <span class="c1">--------------------</span> <span class="n">F</span><span class="o">/</span><span class="mi">46349</span><span class="n">E80</span> <span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">CREATE</span> <span class="n">EXTENSION</span> <span class="n">pg_decode_record</span><span class="p">;</span> <span class="k">CREATE</span> <span class="n">EXTENSION</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">decode_record</span><span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">val</span> <span class="nb">text</span> <span class="k">storage</span> <span class="k">external</span><span class="p">);</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">decode_record</span> <span class="k">SELECT</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">'simple test'</span><span class="p">;</span> <span class="k">INSERT</span> <span class="mi">0</span> <span class="mi">1</span> <span class="c1">-- Force a full-page write</span> <span 
class="n">rjuju</span><span class="o">=#</span> <span class="k">CHECKPOINT</span><span class="p">;</span> <span class="k">CHECKPOINT</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">decode_record</span> <span class="k">SELECT</span> <span class="mi">2</span><span class="p">,</span> <span class="s1">'full-page write'</span><span class="p">;</span> <span class="k">INSERT</span> <span class="mi">0</span> <span class="mi">1</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">decode_record</span> <span class="k">SELECT</span> <span class="mi">3</span><span class="p">,</span> <span class="s1">'a bit big '</span><span class="o">||</span><span class="n">string_agg</span><span class="p">(</span><span class="n">random</span><span class="p">()::</span><span class="nb">text</span><span class="p">,</span> <span class="s1">' '</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">);</span> <span class="k">INSERT</span> <span class="mi">0</span> <span class="mi">1</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">decode_record</span> <span class="k">SELECT</span> <span class="mi">4</span><span class="p">,</span> <span class="s1">'way bigger '</span><span class="o">||</span><span class="n">string_agg</span><span class="p">(</span><span class="n">random</span><span class="p">()::</span><span class="nb">text</span><span class="p">,</span> <span class="s1">' '</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">120</span><span 
class="p">);</span> <span class="k">INSERT</span> <span class="mi">0</span> <span class="mi">1</span> <span class="c1">-- Check the heap table size and underlying TOAST table size</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">oid</span><span class="p">::</span><span class="n">regclass</span><span class="p">::</span><span class="nb">text</span><span class="p">,</span> <span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">oid</span><span class="p">)),</span> <span class="n">reltoastrelid</span><span class="p">::</span><span class="n">regclass</span><span class="p">::</span><span class="nb">text</span><span class="p">,</span> <span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">reltoastrelid</span><span class="p">))</span> <span class="k">FROM</span> <span class="n">pg_class</span> <span class="k">WHERE</span> <span class="n">relname</span> <span class="o">=</span> <span class="s1">'decode_record'</span><span class="p">;</span> <span class="n">oid</span> <span class="o">|</span> <span class="n">pg_size_pretty</span> <span class="o">|</span> <span class="n">reltoastrelid</span> <span class="o">|</span> <span class="n">pg_size_pretty</span> <span class="c1">---------------+----------------+-------------------------+----------------</span> <span class="n">decode_record</span> <span class="o">|</span> <span class="mi">8192</span> <span class="n">bytes</span> <span class="o">|</span> <span class="n">pg_toast</span><span class="p">.</span><span class="n">pg_toast_66731</span> <span class="o">|</span> <span class="mi">8192</span> <span class="n">bytes</span> <span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">DELETE</span> <span 
class="k">FROM</span> <span class="n">decode_record</span><span class="p">;</span> <span class="k">DELETE</span> <span class="mi">4</span> <span class="c1">-- Make sure we remove all records and physically empty the tables</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">VACUUM</span> <span class="n">decode_record</span><span class="p">;</span> <span class="k">VACUUM</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">oid</span><span class="p">::</span><span class="n">regclass</span><span class="p">::</span><span class="nb">text</span><span class="p">,</span> <span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">oid</span><span class="p">)),</span> <span class="n">reltoastrelid</span><span class="p">::</span><span class="n">regclass</span><span class="p">::</span><span class="nb">text</span><span class="p">,</span> <span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">reltoastrelid</span><span class="p">))</span> <span class="k">FROM</span> <span class="n">pg_class</span> <span class="k">WHERE</span> <span class="n">relname</span> <span class="o">=</span> <span class="s1">'decode_record'</span><span class="p">;</span> <span class="n">oid</span> <span class="o">|</span> <span class="n">pg_size_pretty</span> <span class="o">|</span> <span class="n">reltoastrelid</span> <span class="o">|</span> <span class="n">pg_size_pretty</span> <span class="c1">---------------+----------------+-------------------------+----------------</span> <span class="n">decode_record</span> <span class="o">|</span> <span class="mi">0</span> <span class="n">bytes</span> <span class="o">|</span> <span class="n">pg_toast</span><span class="p">.</span><span class="n">pg_toast_66737</span> <span class="o">|</span> <span class="mi">0</span> <span 
class="n">bytes</span> <span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span> </code></pre></div></div> <p>Ok, we should have a few records generated in the WAL corresponding to data we definitely lost in the table. Let’s extract the INSERT records using the custom <em>pg_waldump</em> we created in the previous article:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir -p /tmp/pg_decode_record $ pg_waldump --start "F/46349E80" --save-records /tmp/pg_decode_record [...] $ ls -l /tmp/pg_decode_record 0000000F-46367520.1663.16384.66743.0_main 0000000F-46367660.1663.16384.66743.0_main 0000000F-46367738.1663.16384.66743.0_main 0000000F-46367868.1663.16384.66746.0_main 0000000F-46368130.1663.16384.66746.0_main 0000000F-46368300.1663.16384.66743.0_main </code></pre></div></div> <p>You might wonder why there are 6 records extracted while we only inserted 4 rows. That’s because the last row was big enough to be TOASTed using 2 chunks, and as far as the WAL is concerned that’s 3 separate INSERTs in 2 different tables. 
Let’s see that in detail using the extension to decode the records (truncating the output as some rows are quite big):</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">substr</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">95</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">pg_decode_all_records</span><span class="p">(</span><span class="s1">'/tmp/pg_decode_record'</span><span class="p">)</span> <span class="n">f</span><span class="p">(</span><span class="n">v</span><span class="p">);</span> <span class="n">substr</span> <span class="c1">-------------------------------------------------------------------------------------------</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">public</span><span class="p">.</span><span class="n">decode_record</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'1'</span><span class="p">,</span> <span class="s1">'simple test'</span><span class="p">);</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">public</span><span class="p">.</span><span class="n">decode_record</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'2'</span><span class="p">,</span> <span class="s1">'full-page write'</span><span class="p">);</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">public</span><span class="p">.</span><span class="n">decode_record</span> <span class="p">(</span><span 
class="n">id</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'3'</span><span class="p">,</span> <span class="s1">'a bit big 0.5356172842583808 0.3...'</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">pg_toast</span><span class="p">.</span><span class="n">pg_toast_66810</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'66815'</span><span class="p">,</span> <span class="s1">'0'</span><span class="p">,</span> <span class="n">E</span><span class="s1">'</span><span class="se">\\</span><span class="s1">x7761792062696767657220302e...'</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">pg_toast</span><span class="p">.</span><span class="n">pg_toast_66810</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'66815'</span><span class="p">,</span> <span class="s1">'1'</span><span class="p">,</span> <span class="n">E</span><span class="s1">'</span><span class="se">\\</span><span class="s1">x3337383137353120302e303439...'</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">public</span><span class="p">.</span><span class="n">decode_record</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'4'</span><span class="p">,</span> <span class="cm">/* toast pointer 66815 */</span><span class="p">);</span> <span class="p">(</span><span class="mi">6</span> <span class="k">rows</span><span class="p">)</span> </code></pre></div></div> <p>(note: I slightly edited the output to make it smaller and have correct syntax highlighting, the real extension will emit the real table name in a comment in case of INSERT in a TOAST table)</p> <p>We see the first normal records properly decoded, whether they’re 
in a full-page image or not. The last record is indeed split into 3 different INSERTs, 2 in the TOAST table and 1 in the heap table.</p> <p>As I mentioned earlier I only added <strong>very minimal</strong> support for TOAST tables, as I didn’t have any information about the customer tables and whether they would hit that case or not, or how often. The last insert isn’t a valid statement as the 2nd value is missing, but we can manually extract the value from the INSERT statements in the TOAST table and therefore fix the normal INSERT. For instance, using the first few bytes that we can see in the first chunk:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">encode</span><span class="p">(</span><span class="n">E</span><span class="s1">'</span><span class="se">\\</span><span class="s1">x7761792062696767657220302e'</span><span class="p">,</span> <span class="s1">'escape'</span><span class="p">);</span> <span class="o">-</span><span class="p">[</span> <span class="n">RECORD</span> <span class="mi">1</span> <span class="p">]</span><span class="c1">---------</span> <span class="n">encode</span> <span class="o">|</span> <span class="n">way</span> <span class="n">bigger</span> <span class="mi">0</span><span class="p">.</span> </code></pre></div></div> <p>The data is there, it just needs a bit of manual processing to get it.</p> <p>To be totally fair, I also cheated a bit in that example by making sure that the data will be TOASTed but not compressed, so it’s very easy to manually retrieve the raw value from the extra INSERTs in the TOAST tables. It wouldn’t be very hard to have all of that working transparently, but I simply didn’t have the need. 
If you’re interested in that, I’d recommend looking at the <code class="language-plaintext highlighter-rouge">detoast_attr()</code> function in <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/common/detoast.c">src/backend/access/common/detoast.c</a> and the underlying code to see how you can manually decompress data. You would then only need to store the detoasted (and potentially decompressed) value referenced by the toast’s chunk_id locally, and emit it in the query instead of the currently emitted comment.</p> <h3 id="conclusion">Conclusion</h3> <p>I hope you enjoyed those two articles and learned a bit about the WAL infrastructure and the way pages and tuples work internally.</p> <p>If you missed it in the article, <a href="/assets/patch/pg_decode_record.tgz">here is the link for the full extension</a>.</p> <p>I want to emphasize again that all the code I showed here is only a quick proof of concept designed for one narrow use case, and it should be used with care. My goal here wasn’t to show state-of-the-art code but rather to show one possible way to quickly come up with a plan to salvage data in case of a production incident. If you’re unfortunately confronted with a similar problem, or some other major accident, I hope you will find some valuable resources and a starting point to come up with your own dedicated solution!</p> <p><a href="https://rjuju.github.io/postgresql/2023/12/20/extract-sql-from-wal-part2.html">Extracting SQL from WAL? (part 2)</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on December 20, 2023.</p> <![CDATA[Extracting SQL from WAL? 
(part 1)]]> https://rjuju.github.io/postgresql/2023/12/06/extract-sql-from-wal 2023-12-06T03:04:10+00:00 2023-12-06T03:04:10+00:00 Julien Rouhaud https://rjuju.github.io <p>Is it actually possible to extract SQL commands from WAL generated in “replica” <code class="language-plaintext highlighter-rouge">wal_level</code>?</p> <p>The answer is usually no: the “logical” <code class="language-plaintext highlighter-rouge">wal_level</code> exists for a reason after all, and you shouldn’t expect some kind of miracle here.</p> <p>But in this series of articles you will see that if some conditions are met you can still manage to extract some information, and how to do it. This first article focuses on the WAL records and how to extract the ones you want, while the next one will show how to try to extract the information contained in those records.</p> <h3 id="some-context">Some context</h3> <p>This article is based on some work I did a few months ago to help a customer recover some data after an incident. It’s not a perfect solution, mostly a set of quick hacks I put together to come up with something able to retrieve data in only a few hours of work, but I hope sharing details about it and some methodology can be helpful if you ever get in a similar situation. You will probably need to adapt it to your needs, with yet more hacks, but it should give you a good start. It can otherwise be of some interest if you want to know a bit more about the WAL record internals and some associated infrastructure.</p> <h3 id="the-incident">The incident</h3> <p>Due to a series of unfortunate events, one of their HA clusters ended up in a split-brain situation for some time before being reinitialised, which entirely removed one of the data directories. 
After that, only the WALs that were generated on that instance were available, those being in “replica” <code class="language-plaintext highlighter-rouge">wal_level</code>, and nothing else.</p> <p>One possibility to try to recover the data would be to restore a physical backup, if any, replay archived WALs until the last transaction before the removed node was promoted (assuming those are still available) and then replay the WALs generated on that newly promoted node. Once there, you would still need to look at each row of each table of each database and compare it to yet another instance restored from the same backup to approximately the same time as this one. That’s clearly not ideal as it will likely require many days or even weeks of tedious hard work, and will consume a lot of resources along the way. Is there a way to do better?</p> <p>After a quick discussion, it turned out that there were a few elements that made some recovery from the WALs themselves possible (more on why later):</p> <ol> <li>One of the data directories was still available</li> <li>The customer guaranteed that no DDL had happened since the incident</li> <li>Only INSERTs happened during the split-brain</li> </ol> <h3 id="wals--physical-replication">WALs &amp; Physical replication</h3> <p>As you probably know, postgres physical replication works by sending an exact copy of the modified binary raw data to the various standby servers, in a continuous stream of WAL records. As a consequence, those records don’t really know much about the database objects they reference, and nothing about the SQL queries that generated them. So what do they really contain? 
Let’s see what’s inside the WAL records generated for an INSERT into a normal heap relation.</p> <h4 id="wal-records">WAL records</h4> <p>First of all, you have to know that the WAL records are split into <strong>Resource Managers</strong> (declared in <a href="https://github.com/postgres/postgres/blob/master/src/include/access/rmgrlist.h">src/include/access/rmgrlist.h</a>), each being responsible for a specific part of postgres (heap tables, indexes, vacuum…). Each is identified by a numeric identifier, often referred to as an <code class="language-plaintext highlighter-rouge">rmid</code>, for <em>resource manager identifier</em>.</p> <p>Each of those resource managers can handle various operations, which are internally called <strong>opcodes</strong>. Here we’re interested in the WAL records generated while operating on standard heap tables, and especially during INSERTs. This resource manager is a bit particular as it’s split into 2 different <code class="language-plaintext highlighter-rouge">rmid</code>s: <code class="language-plaintext highlighter-rouge">RM_HEAP_ID</code> and <code class="language-plaintext highlighter-rouge">RM_HEAP2_ID</code>. This is only an implementation detail: each resource manager can only handle a limited number of opcodes, and everything is the same otherwise.</p> <p>If you’re curious, here’s the definition of the main WAL record in the <a href="https://github.com/postgres/postgres/blob/master/src/include/access/xlogrecord.h">source code</a> and a few details on the exact layout in the files:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* * The overall layout of an XLOG record is: * Fixed-size header (XLogRecord struct) * XLogRecordBlockHeader struct * XLogRecordBlockHeader struct * ... * XLogRecordDataHeader[Short|Long] struct * block data * block data * ... * main data * [...] 
*/</span> <span class="k">typedef</span> <span class="k">struct</span> <span class="n">XLogRecord</span> <span class="p">{</span> <span class="n">uint32</span> <span class="n">xl_tot_len</span><span class="p">;</span> <span class="cm">/* total len of entire record */</span> <span class="n">TransactionId</span> <span class="n">xl_xid</span><span class="p">;</span> <span class="cm">/* xact id */</span> <span class="n">XLogRecPtr</span> <span class="n">xl_prev</span><span class="p">;</span> <span class="cm">/* ptr to previous record in log */</span> <span class="n">uint8</span> <span class="n">xl_info</span><span class="p">;</span> <span class="cm">/* flag bits, see below */</span> <span class="n">RmgrId</span> <span class="n">xl_rmid</span><span class="p">;</span> <span class="cm">/* resource manager for this record */</span> <span class="cm">/* 2 bytes of padding here, initialize to zero */</span> <span class="n">pg_crc32c</span> <span class="n">xl_crc</span><span class="p">;</span> <span class="cm">/* CRC for this record */</span> <span class="cm">/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */</span> <span class="p">}</span> <span class="n">XLogRecord</span><span class="p">;</span> </code></pre></div></div> <p>and a block data header:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/* * Header info for block data appended to an XLOG record. * * 'data_length' is the length of the rmgr-specific payload data associated * with this block. It does not include the possible full page image, nor * XLogRecordBlockHeader struct itself. * * Note that we don't attempt to align the XLogRecordBlockHeader struct! * So, the struct must be copied to aligned local storage before use. 
*/</span> <span class="k">typedef</span> <span class="k">struct</span> <span class="n">XLogRecordBlockHeader</span> <span class="p">{</span> <span class="n">uint8</span> <span class="n">id</span><span class="p">;</span> <span class="cm">/* block reference ID */</span> <span class="n">uint8</span> <span class="n">fork_flags</span><span class="p">;</span> <span class="cm">/* fork within the relation, and flags */</span> <span class="n">uint16</span> <span class="n">data_length</span><span class="p">;</span> <span class="cm">/* number of payload bytes (not including page * image) */</span> <span class="cm">/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */</span> <span class="cm">/* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */</span> <span class="cm">/* BlockNumber follows */</span> <span class="p">}</span> <span class="n">XLogRecordBlockHeader</span><span class="p">;</span> </code></pre></div></div> <p>Everything here is very generic as it’s used by all the resource managers. One important bit though is the mention of a <strong>RelFileLocator</strong> after the block header if the block refers to a different relation than the previous block, whatever it was (that is, when BKPBLOCK_SAME_REL is not set). 
This is of course important information for us.</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">RelFileLocator</span> <span class="p">{</span> <span class="n">Oid</span> <span class="n">spcOid</span><span class="p">;</span> <span class="cm">/* tablespace */</span> <span class="n">Oid</span> <span class="n">dbOid</span><span class="p">;</span> <span class="cm">/* database */</span> <span class="n">RelFileNumber</span> <span class="n">relNumber</span><span class="p">;</span> <span class="cm">/* relation */</span> <span class="p">}</span> <span class="n">RelFileLocator</span><span class="p">;</span> </code></pre></div></div> <p>But here’s a first reason why you need a proper data directory to do anything with the WALs: this doesn’t contain the schema name and table name, or even the table oid, but the <strong>tablespace oid, database oid and relfilenode</strong>, which is what the WAL actually needs to identify a physical relation file (which is itself split into multiple files; the exact <a href="https://github.com/postgres/postgres/blob/master/src/backend/storage/smgr/README">fork</a> and segment are deduced using other information). So if any table rewrite happened since the WAL records were generated (e.g. a VACUUM FULL), you won’t be able to identify which relation a record is about, unless of course you find a way to map the current relfilenode to the one before the table rewrite.</p> <h4 id="heap-insert-wal-records">Heap INSERT WAL records</h4> <p>Now that we have seen a bit of the general WAL structures, let’s focus on the data specific to an INSERT. If you’re not really familiar with the internals, one easy way to locate the code related to a specific command is to look at the functions associated with a resource manager. 
Let’s look at the <strong>RM_HEAP_ID</strong> information in <a href="https://github.com/postgres/postgres/blob/master/src/include/access/rmgrlist.h">src/include/access/rmgrlist.h</a>:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */</span> <span class="n">PG_RMGR</span><span class="p">(</span><span class="n">RM_HEAP_ID</span><span class="p">,</span> <span class="s">"Heap"</span><span class="p">,</span> <span class="n">heap_redo</span><span class="p">,</span> <span class="n">heap_desc</span><span class="p">,</span> <span class="n">heap_identify</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">heap_mask</span><span class="p">,</span> <span class="n">heap_decode</span><span class="p">)</span> </code></pre></div></div> <p>Here we have the names of the actual functions responsible for many operations (the exact list will vary depending on the postgres major version; here I’m using the list in postgres 17).</p> <p>The <strong>redo</strong> function is the one that applies an RM_HEAP_ID record, the <strong>desc</strong> function is the one that emits the info you see in pg_waldump, the <strong>identify</strong> function returns a string describing the opcode and so on. 
Let’s look at <code class="language-plaintext highlighter-rouge">heap_identify()</code>:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span> <span class="nf">heap_identify</span><span class="p">(</span><span class="n">uint8</span> <span class="n">info</span><span class="p">)</span> <span class="p">{</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">id</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span> <span class="k">switch</span> <span class="p">(</span><span class="n">info</span> <span class="o">&amp;</span> <span class="o">~</span><span class="n">XLR_INFO_MASK</span><span class="p">)</span> <span class="p">{</span> <span class="k">case</span> <span class="n">XLOG_HEAP_INSERT</span><span class="p">:</span> <span class="n">id</span> <span class="o">=</span> <span class="s">"INSERT"</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span> <span class="p">[...]</span> <span class="p">}</span> <span class="k">return</span> <span class="n">id</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p>We now know that the opcode we’re interested in is <strong>XLOG_HEAP_INSERT</strong>. A quick <code class="language-plaintext highlighter-rouge">git grep</code> in the tree will lead you to <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/heap/heapam.c">src/backend/access/heap/heapam.c</a>, more precisely the <strong>heap_insert</strong> function. The interesting bit is located in the “XLOG stuff” block. 
I will show here an extract focusing on the bit we will need:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">heap_insert</span><span class="p">(</span><span class="n">Relation</span> <span class="n">relation</span><span class="p">,</span> <span class="n">HeapTuple</span> <span class="n">tup</span><span class="p">,</span> <span class="n">CommandId</span> <span class="n">cid</span><span class="p">,</span> <span class="kt">int</span> <span class="n">options</span><span class="p">,</span> <span class="n">BulkInsertState</span> <span class="n">bistate</span><span class="p">)</span> <span class="p">{</span> <span class="p">[...]</span> <span class="cm">/* XLOG stuff */</span> <span class="k">if</span> <span class="p">(</span><span class="n">RelationNeedsWAL</span><span class="p">(</span><span class="n">relation</span><span class="p">))</span> <span class="p">{</span> <span class="n">xl_heap_insert</span> <span class="n">xlrec</span><span class="p">;</span> <span class="n">xl_heap_header</span> <span class="n">xlhdr</span><span class="p">;</span> <span class="n">XLogRecPtr</span> <span class="n">recptr</span><span class="p">;</span> <span class="n">Page</span> <span class="n">page</span> <span class="o">=</span> <span class="n">BufferGetPage</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span> <span class="n">uint8</span> <span class="n">info</span> <span class="o">=</span> <span class="n">XLOG_HEAP_INSERT</span><span class="p">;</span> <span class="kt">int</span> <span class="n">bufflags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="p">[...]</span> <span class="n">xlrec</span><span class="p">.</span><span class="n">offnum</span> <span class="o">=</span> <span class="n">ItemPointerGetOffsetNumber</span><span class="p">(</span><span class="o">&amp;</span><span class="n">heaptup</span><span 
class="o">-&gt;</span><span class="n">t_self</span><span class="p">);</span> <span class="n">xlrec</span><span class="p">.</span><span class="n">flags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="p">[...]</span> <span class="n">XLogBeginInsert</span><span class="p">();</span> <span class="n">XLogRegisterData</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="o">&amp;</span><span class="n">xlrec</span><span class="p">,</span> <span class="n">SizeOfHeapInsert</span><span class="p">);</span> <span class="n">xlhdr</span><span class="p">.</span><span class="n">t_infomask2</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-&gt;</span><span class="n">t_data</span><span class="o">-&gt;</span><span class="n">t_infomask2</span><span class="p">;</span> <span class="n">xlhdr</span><span class="p">.</span><span class="n">t_infomask</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-&gt;</span><span class="n">t_data</span><span class="o">-&gt;</span><span class="n">t_infomask</span><span class="p">;</span> <span class="n">xlhdr</span><span class="p">.</span><span class="n">t_hoff</span> <span class="o">=</span> <span class="n">heaptup</span><span class="o">-&gt;</span><span class="n">t_data</span><span class="o">-&gt;</span><span class="n">t_hoff</span><span class="p">;</span> <span class="cm">/* * note we mark xlhdr as belonging to buffer; if XLogInsert decides to * write the whole page to the xlog, we don't need to store * xl_heap_header in the xlog. 
*/</span> <span class="n">XLogRegisterBuffer</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">buffer</span><span class="p">,</span> <span class="n">REGBUF_STANDARD</span> <span class="o">|</span> <span class="n">bufflags</span><span class="p">);</span> <span class="n">XLogRegisterBufData</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="o">&amp;</span><span class="n">xlhdr</span><span class="p">,</span> <span class="n">SizeOfHeapHeader</span><span class="p">);</span> <span class="cm">/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */</span> <span class="n">XLogRegisterBufData</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="n">heaptup</span><span class="o">-&gt;</span><span class="n">t_data</span> <span class="o">+</span> <span class="n">SizeofHeapTupleHeader</span><span class="p">,</span> <span class="n">heaptup</span><span class="o">-&gt;</span><span class="n">t_len</span> <span class="o">-</span> <span class="n">SizeofHeapTupleHeader</span><span class="p">);</span> <span class="p">[...]</span> <span class="n">recptr</span> <span class="o">=</span> <span class="n">XLogInsert</span><span class="p">(</span><span class="n">RM_HEAP_ID</span><span class="p">,</span> <span class="n">info</span><span class="p">);</span> <span class="n">PageSetLSN</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="n">recptr</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <p>We see here that this function is as expected inserting an <code class="language-plaintext highlighter-rouge">RM_HEAP_ID</code> record, with an <code class="language-plaintext 
highlighter-rouge">XLOG_HEAP_INSERT</code> opcode. There are 2 data parts associated with this record: the header of the tuple that’s being inserted and the tuple itself.</p> <p>That’s great! At this point we know how to identify what relation an INSERT is about and the content of that INSERT. Let’s see how to filter those records from the WALs.</p> <h3 id="extracting-and-filtering-wal-records">Extracting and filtering WAL records</h3> <p>Parsing the postgres WALs isn’t that complicated but still requires knowing quite a bit more than what I showed here. Writing such code is possible but wait, don’t we already have a tool shipped with postgres which is designed to do exactly that? Yes, there sure is: it’s <a href="https://github.com/postgres/postgres/tree/master/src/bin/pg_waldump">pg_waldump</a>.</p> <p>Rather than writing something similar, couldn’t we simply teach pg_waldump to filter the records we’re interested in and save them somewhere so that we can later process them and generate SQL queries? This way we also benefit from all the options in pg_waldump, like specifying the starting and/or ending LSN or filtering on a specific resource manager, without needing to worry about most of the WAL implementation details, and can focus only on the few functions provided by postgres that are necessary for our needs. Let’s see how to implement that.</p> <p>The main source file is <a href="https://github.com/postgres/postgres/blob/master/src/bin/pg_waldump/pg_waldump.c">src/bin/pg_waldump/pg_waldump.c</a>. Skipping most of the unrelated code, we can see that there’s a main loop that takes care of reading the records one by one, optionally filtering them and then doing something with them depending on how the tool was executed. 
I will again show an extract to focus on the most relevant part only:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span> <span class="p">[...]</span> <span class="cm">/* try to read the next record */</span> <span class="n">record</span> <span class="o">=</span> <span class="n">XLogReadRecord</span><span class="p">(</span><span class="n">xlogreader_state</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">errormsg</span><span class="p">);</span> <span class="p">[...]</span> <span class="cm">/* apply all specified filters */</span> <span class="k">if</span> <span class="p">(</span><span class="n">config</span><span class="p">.</span><span class="n">filter_by_rmgr_enabled</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">config</span><span class="p">.</span><span class="n">filter_by_rmgr</span><span class="p">[</span><span class="n">record</span><span class="o">-&gt;</span><span class="n">xl_rmid</span><span class="p">])</span> <span class="k">continue</span><span class="p">;</span> <span class="p">[...]</span> <span class="cm">/* perform any per-record work */</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">config</span><span class="p">.</span><span class="n">quiet</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">config</span><span class="p">.</span><span class="n">stats</span> <span class="o">==</span> <span class="nb">true</span><span class="p">)</span> <span class="p">{</span> <span class="n">XLogRecStoreStats</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stats</span><span class="p">,</span> <span class="n">xlogreader_state</span><span class="p">);</span> <span class="n">stats</span><span class="p">.</span><span class="n">endptr</span> 
<span class="o">=</span> <span class="n">xlogreader_state</span><span class="o">-&gt;</span><span class="n">EndRecPtr</span><span class="p">;</span> <span class="p">}</span> <span class="k">else</span> <span class="n">XLogDumpDisplayRecord</span><span class="p">(</span><span class="o">&amp;</span><span class="n">config</span><span class="p">,</span> <span class="n">xlogreader_state</span><span class="p">);</span> <span class="p">}</span> <span class="cm">/* save full pages if requested */</span> <span class="k">if</span> <span class="p">(</span><span class="n">config</span><span class="p">.</span><span class="n">save_fullpage_path</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="n">XLogRecordSaveFPWs</span><span class="p">(</span><span class="n">xlogreader_state</span><span class="p">,</span> <span class="n">config</span><span class="p">.</span><span class="n">save_fullpage_path</span><span class="p">);</span> <span class="cm">/* check whether we printed enough */</span> <span class="n">config</span><span class="p">.</span><span class="n">already_displayed_records</span><span class="o">++</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">config</span><span class="p">.</span><span class="n">stop_after_records</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">config</span><span class="p">.</span><span class="n">already_displayed_records</span> <span class="o">&gt;=</span> <span class="n">config</span><span class="p">.</span><span class="n">stop_after_records</span><span class="p">)</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p>That’s quite simple: pg_waldump reads the records one by one until it needs to stop, ignores the records that the user asked to discard and then takes action on the remaining ones. 
We can see that there’s already an option to save full-page images; it definitely looks like we could just add something similar there, but for all records.</p> <p>First, we will need to provide a way to identify the relation the INSERT is about. That’s the <code class="language-plaintext highlighter-rouge">RelFileLocator</code>, and we already know that it can be found just after the XLogRecordBlockHeader. Postgres provides a function to retrieve this information, and a bit more, named <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogreader.c"><code class="language-plaintext highlighter-rouge">XLogRecGetBlockTagExtended()</code></a>. Here is its description:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* * Returns information about the block that a block reference refers to, * optionally including the buffer that the block may already be in. * * If the WAL record contains a block reference with the given ID, *rlocator, * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and * returns true. Otherwise returns false. 
*/</span> <span class="n">bool</span> <span class="n">XLogRecGetBlockTagExtended</span><span class="p">(</span><span class="n">XLogReaderState</span> <span class="o">*</span><span class="n">record</span><span class="p">,</span> <span class="n">uint8</span> <span class="n">block_id</span><span class="p">,</span> <span class="n">RelFileLocator</span> <span class="o">*</span><span class="n">rlocator</span><span class="p">,</span> <span class="n">ForkNumber</span> <span class="o">*</span><span class="n">forknum</span><span class="p">,</span> <span class="n">BlockNumber</span> <span class="o">*</span><span class="n">blknum</span><span class="p">,</span> <span class="n">Buffer</span> <span class="o">*</span><span class="n">prefetch_buffer</span><span class="p">)</span> </code></pre></div></div> <p>We need to provide the record - pg_waldump already retrieves it for us - and the <code class="language-plaintext highlighter-rouge">block_id</code>. The <code class="language-plaintext highlighter-rouge">block_id</code>, or block reference, is simply an offset in the array of data that the WAL record contains. If you look a bit above in this article, you will see that we already know that <code class="language-plaintext highlighter-rouge">heap_insert()</code> only uses a hardcoded <strong>0</strong> block_id: this is the first argument in the <code class="language-plaintext highlighter-rouge">XLogRegisterBuffer()</code> and <code class="language-plaintext highlighter-rouge">XLogRegisterBufData()</code> function calls.</p> <p>Next we need to retrieve the actual WAL record data, the tuple header and the tuple itself. This one is a bit trickier, as the data can be found either in a simple WAL record or in a full-page image. We need to check for a simple WAL record first. 
The associated function is <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogreader.c"><code class="language-plaintext highlighter-rouge">XLogRecGetBlockData()</code></a>:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* * Returns the data associated with a block reference, or NULL if there is * no data (e.g. because a full-page image was taken instead). The returned * pointer points to a MAXALIGNed buffer. */</span> <span class="kt">char</span> <span class="o">*</span> <span class="n">XLogRecGetBlockData</span><span class="p">(</span><span class="n">XLogReaderState</span> <span class="o">*</span><span class="n">record</span><span class="p">,</span> <span class="n">uint8</span> <span class="n">block_id</span><span class="p">,</span> <span class="n">Size</span> <span class="o">*</span><span class="n">len</span><span class="p">)</span> </code></pre></div></div> <p>As noted in the comment, if the function returns NULL (and sets len to <strong>0</strong>) then the data may be in a full-page image instead (or the data could be missing entirely). If that’s the case we need to retrieve the full-page image, and then locate the tuple the INSERT was about and extract it in the same format as a simple WAL record.</p> <p>Postgres provides a function to extract the full-page image: <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogreader.c"><code class="language-plaintext highlighter-rouge">RestoreBlockImage()</code></a>:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* * Restore a full-page image from a backup block attached to an XLOG record. * * Returns true if a full-page image is restored, and false on failure with * an error to be consumed by the caller. 
*/</span> <span class="n">bool</span> <span class="n">RestoreBlockImage</span><span class="p">(</span><span class="n">XLogReaderState</span> <span class="o">*</span><span class="n">record</span><span class="p">,</span> <span class="n">uint8</span> <span class="n">block_id</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">page</span><span class="p">)</span> </code></pre></div></div> <p>It is straightforward to use: just provide the record and the block identifier and you get the full-page image if found. However, there’s no function available to extract a tuple from a full-page image. Indeed, postgres itself can simply overwrite the whole block with the full-page image, as it contains the latest version of the block at the time it was generated, but in our case we definitely don’t want to emit an INSERT statement for every already existing tuple in the block!</p> <p>Fortunately, even when we get a full-page image, our record still contains a <em>main data area</em>. If you look back at the <code class="language-plaintext highlighter-rouge">heap_insert()</code> function, that’s the call to <code class="language-plaintext highlighter-rouge">XLogRegisterData()</code>, and as you can see there it contains an <code class="language-plaintext highlighter-rouge">xl_heap_insert</code> struct. And the first member of this struct, <strong>offnum</strong>, is actually the position of the tuple in the page, which is exactly what we need!</p> <p>With all of that, it’s just a matter of accessing the tuple header and tuple at the correct place among all the tuples present in the page, and saving them in the same format they would have in a simple WAL record. If you’re wondering how exactly it should be done, you can always look at how postgres itself does it when it needs to return a specific tuple and adapt that code to your needs. 
The functions responsible for that are <code class="language-plaintext highlighter-rouge">heapgetpage()</code> and <code class="language-plaintext highlighter-rouge">heapgettup()</code>, located in the <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/heap/heapam.c">src/backend/access/heap/heapam.c</a> file we already mentioned.</p> <p>We now have the information about the physical file location and the record itself, both of which we will need to transmit to another program to decode it. The simplest way to do that is to save the record as-is in a binary file, and use the file name to transmit the metadata. I chose the following pattern to name the produced files:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LSN.TABLESPACE_OID.DATABASE_OID.RELFILENODE.FORKNAME </code></pre></div></div> <p>It will be trivial for the consumer to parse it and extract the required metadata. One thing to note is that I don’t put the <code class="language-plaintext highlighter-rouge">rmid</code> or the <code class="language-plaintext highlighter-rouge">opcode</code> here, as I’m only emitting the one kind of record I’m interested in and discarding everything else. If that’s not your case you should definitely remember to add those in the filename pattern.</p> <p>Since this requires a bit of code to implement, I won’t detail it here but you can find the full result in the patch for pg_waldump that I’m attaching to this article, which implements this as a new <strong>--save-records</strong> option.</p> <p>To conclude, let me also remind you that a compiled version of pg_waldump will only work for a single major postgres version. 
In my case, I had to work with postgres 11, so you can <a href="/assets/patch/0001-Add-a-save-records-PATH-option-to-pg_waldump_pg11.patch">find the patch for this version here</a>, but if needed I also rebased it against the current commit on the master branch, which <a href="/assets/patch/0001-Add-a-save-records-PATH-option-to-pg_waldump_pg17.patch">can be found here</a>.</p> <h3 id="whats-next">What’s next?</h3> <p>This is the end of this first article. We saw some details on the postgres WAL infrastructure, with a full example for the case of a plain INSERT on a heap table. We also learned where to look to find where other WAL records are generated and to see more details about the implementation.</p> <p>We also checked how pg_waldump works and how to adapt it to our needs, with a complete patch provided for both <a href="/assets/patch/0001-Add-a-save-records-PATH-option-to-pg_waldump_pg11.patch">postgres 11</a> and <a href="/assets/patch/0001-Add-a-save-records-PATH-option-to-pg_waldump_pg17.patch">the current dev version (postgres 17)</a>. Again, I’d like to remind you that all this work is only at a proof-of-concept stage: it’s definitely not polished and I’m sure that there are many problems that would need to be fixed. One obvious example of such a problem is that we’re saving all the INSERTs we find in the logs but we don’t check if the transaction they’re in eventually committed. It would be possible to fix that but it would require additional code, so as is it’s up to the users to double check that as needed. Overall it was enough to recover the needed data so I didn’t pursue any more work on it.</p> <p>In the next article we will see some usage of this new <strong>--save-records</strong> option, and also how to read those records and decode them to generate plain INSERT queries. Stay tuned!</p> <p><a href="https://rjuju.github.io/postgresql/2023/12/06/extract-sql-from-wal.html">Extracting SQL from WAL? 
(part 1)</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on December 06, 2023.</p> <![CDATA[Queryid reporting in plpgsql_check]]> https://rjuju.github.io/postgresql/2020/11/17/queryid-reporting-in-plpgsql_check 2020-11-17T02:42:33+00:00 2020-11-17T02:42:33+00:00 Julien Rouhaud https://rjuju.github.io <p>plpgsql_check version 1.14.0 was just released and brings some improvements for performance diagnostics.</p> <p>Thanks <strong>a lot</strong> to <a href="http://okbob.blogspot.com/">Pavel Stěhule</a> for the awesome plpgsql_check extension and the help with implementing the queryid reporting in v1.14!</p> <h3 id="plpgsql_check-static-code-analysis-and-more">plpgsql_check: static code analysis and more</h3> <p>PostgreSQL supports procedural code for many languages, the most popular one probably being plpgsql.</p> <p>Even if that language allows you to write raw SQL statements, any function written in that language is still a black box as far as PostgreSQL is concerned, which means that PostgreSQL won’t perform a lot of checks to verify code quality, typos or any other problem related to code development. That’s where the <a href="https://github.com/okbob/plpgsql_check">plpgsql_check extension</a> comes into play.</p> <p>If you write any plpgsql code, this extension will be your best friend, as it brings so many cool features. The major feature is static code analysis, which can detect many bugs, security / SQL injection issues and even possible performance issues by detecting implicit casts that could prevent PostgreSQL from using indexes, and much more.</p> <p>It also brings a simple yet very useful <strong>code profiler</strong>.</p> <h3 id="how-to-track-down-performance-issue-in-plpgsql-code">How to track down performance issues in plpgsql code?</h3> <p>As I mentioned above, plpgsql code is a black box as far as PostgreSQL is concerned. 
The direct consequence is that the performance diagnostic possibilities are quite limited.</p> <p>Using core PostgreSQL, the only option is using <code class="language-plaintext highlighter-rouge">pg_stat_user_functions</code> (which requires <code class="language-plaintext highlighter-rouge">track_functions</code> to be set to <strong>pl</strong> or <strong>all</strong>). It’ll show the number of times each function has been called, and how long the execution took including and excluding nested functions. Unfortunately, this view can only help you track down <strong>which</strong> function is slow, but not <strong>why</strong>, as you don’t get any per-instruction metric.</p> <p>You can somewhat work around that limitation using the contrib extension <a href="https://www.postgresql.org/docs/current/pgstatstatements.html">pg_stat_statements</a>. This extension is one of the most popular ones as far as performance diagnostics are concerned, and gives you a lot of data on query performance (including <a href="/postgresql/2020/04/04/new-in-pg13-monitoring-query-planner.html">planning counters</a> and <a href="/postgresql/2020/04/07/new-in-pg13-WAL-monitoring.html">WAL counters</a> since PostgreSQL 13).</p> <p>The only problem is that it can be quite tricky to match pg_stat_statements entries with your plpgsql code, as there’s no way to directly identify which queries are run inside your plpgsql code.</p> <h3 id="plpgsql_check-code-profiler">plpgsql_check code profiler</h3> <p>Another alternative is to use a plpgsql code profiler. There are multiple extensions that bring this feature, and I personally chose <a href="https://github.com/okbob/plpgsql_check">plpgsql_check</a>, as it perfectly suited my needs: simple to set up and use, all the performance information I needed, and the possibility to use it either on a per-connection basis or globally when configuring the extension in <strong>shared_preload_libraries</strong>. 
Thanks to this profiler, you can finally get performance metrics at the statement level <strong>inside plpgsql code</strong>:</p> <ul> <li>total execution time, that is the cumulative execution time for all the statements in the source code line</li> <li>average execution time, that is the total execution time divided by the number of statements in the source code line</li> <li>maximum execution time, per statement</li> <li>number of rows processed, per statement</li> </ul> <p>With this information, it becomes quite easy to track down the slow part of your functions. Here’s a simplistic example:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">lineno</span><span class="p">,</span> <span class="n">cmds_on_row</span><span class="p">,</span> <span class="n">total_time</span><span class="p">,</span> <span class="n">avg_time</span><span class="p">,</span> <span class="n">max_time</span><span class="p">,</span> <span class="k">source</span> <span class="k">FROM</span> <span class="n">plpgsql_profiler_function_tb</span><span class="p">(</span><span class="s1">'pltest()'</span><span class="p">);</span> <span class="n">lineno</span> <span class="o">|</span> <span class="n">cmds_on_row</span> <span class="o">|</span> <span class="n">total_time</span> <span class="o">|</span> <span class="n">avg_time</span> <span class="o">|</span> <span class="n">max_time</span> <span class="o">|</span> <span class="k">source</span> <span class="c1">--------+-------------+------------+----------+------------------+-------------------------------------------------------</span> <span class="mi">1</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span 
class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="k">DECLARE</span> <span class="mi">3</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="n">num</span> <span class="nb">bigint</span><span class="p">;</span> <span class="mi">4</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="n">_tbl</span> <span class="nb">text</span> <span class="o">=</span> <span class="s1">'pg_class'</span><span class="p">;</span> <span class="mi">5</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span 
class="mi">0</span><span class="p">.</span><span class="mi">085</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">085</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">085</span><span class="p">}</span> <span class="o">|</span> <span class="k">BEGIN</span> <span class="mi">6</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">504</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">504</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">504</span><span class="p">}</span> <span class="o">|</span> <span class="k">drop</span> <span class="k">table</span> <span class="n">if</span> <span class="k">exists</span> <span class="n">meh</span><span class="p">;</span> <span class="mi">7</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">81</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">81</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">81</span><span class="p">}</span> <span class="o">|</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">meh</span><span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">);</span> <span class="mi">8</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">362</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">362</span> <span class="o">|</span> <span class="p">{</span><span 
class="mi">0</span><span class="p">.</span><span class="mi">362</span><span class="p">}</span> <span class="o">|</span> <span class="k">EXECUTE</span> <span class="s1">'SELECT COUNT(*) FROM '</span> <span class="o">||</span> <span class="n">_tbl</span> <span class="k">INTO</span> <span class="n">num</span><span class="p">;</span> <span class="mi">9</span> <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span> <span class="mi">1000</span><span class="p">.</span><span class="mi">84</span> <span class="o">|</span> <span class="mi">500</span><span class="p">.</span><span class="mi">42</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">349</span><span class="p">,</span><span class="mi">1000</span><span class="p">.</span><span class="mi">491</span><span class="p">}</span> <span class="o">|</span> <span class="k">delete</span> <span class="k">from</span> <span class="n">meh</span><span class="p">;</span> <span class="n">PERFORM</span> <span class="n">pg_sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span> <span class="mi">10</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="o">|</span> <span class="k">RETURN</span> <span class="n">num</span><span class="p">;</span> <span class="mi">11</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span 
class="o">|</span> <span class="k">END</span><span class="p">;</span> <span class="p">(</span><span class="mi">11</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure> <p>In this example, we can see immediately that the slowdown comes from source code line n°9, which has a total execution time of 1s. Using the <strong>max_time</strong> field, we see that it’s because of the 2nd statement. As we also have the source code available in the view, we can immediately see the problematic query, which here is a simple call to <code class="language-plaintext highlighter-rouge">pg_sleep(1)</code>.</p> <p>So far so good. But with a less naive example, the cause of slow execution might be less obvious, and it could be handy to rely on all the available extensions to get more information: <a href="https://www.postgresql.org/docs/current/pgstatstatements.html">pg_stat_statements</a> for general counters, <a href="https://github.com/powa-team/pg_stat_kcache">pg_stat_kcache</a> for CPU and disk usage counters, <a href="https://github.com/postgrespro/pg_wait_sampling">pg_wait_sampling</a> for wait events and so on.</p> <p>But how do you match the plpgsql statements with entries in those extensions?</p> <h3 id="exposing-queryid-in-plpgql_check-profiler">Exposing queryid in plpgsql_check profiler</h3> <p>Indeed, those extensions identify queries using a <strong>query identifier</strong>, computed by <strong>pg_stat_statements</strong>. You could try to manually find the related entry using the query text stored by <strong>pg_stat_statements</strong>, but it may not always be possible. What if the query is dynamic SQL or uses unqualified names?</p> <p>The solution here is quite simple: since the plpgsql_check profiler already shows per-statement information, it can also report the statement’s underlying queryid.</p> <p>This is now available with version 1.14.0. 
Using the previous naive example, here’s what we now see:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">lineno</span><span class="p">,</span> <span class="n">max_time</span><span class="p">,</span> <span class="n">queryids</span><span class="p">,</span> <span class="k">source</span> <span class="k">FROM</span> <span class="n">plpgsql_profiler_function_tb</span><span class="p">(</span><span class="s1">'pltest()'</span><span class="p">);</span> <span class="n">lineno</span> <span class="o">|</span> <span class="n">max_time</span> <span class="o">|</span> <span class="n">queryids</span> <span class="o">|</span> <span class="k">source</span> <span class="c1">--------+------------------+-------------------------------------------+-------------------------------------------------------</span> <span class="mi">1</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="k">DECLARE</span> <span class="mi">3</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="n">num</span> <span class="nb">bigint</span><span class="p">;</span> <span class="mi">4</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span 
class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="n">_tbl</span> <span class="nb">text</span> <span class="o">=</span> <span class="s1">'pg_class'</span><span class="p">;</span> <span class="mi">5</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">085</span><span class="p">}</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="k">BEGIN</span> <span class="mi">6</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">504</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="k">NULL</span><span class="p">}</span> <span class="o">|</span> <span class="k">drop</span> <span class="k">table</span> <span class="n">if</span> <span class="k">exists</span> <span class="n">meh</span><span class="p">;</span> <span class="mi">7</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">81</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="k">NULL</span><span class="p">}</span> <span class="o">|</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">meh</span><span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">);</span> <span class="mi">8</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">362</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="o">-</span><span class="mi">7484655548452190292</span><span class="p">}</span> <span class="o">|</span> <span class="k">EXECUTE</span> <span class="s1">'SELECT COUNT(*) FROM '</span> <span class="o">||</span> <span 
class="n">_tbl</span> <span class="k">INTO</span> <span class="n">num</span><span class="p">;</span> <span class="mi">9</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">349</span><span class="p">,</span><span class="mi">1000</span><span class="p">.</span><span class="mi">491</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="mi">8162364748417812595</span><span class="p">,</span><span class="mi">6729783856403017864</span><span class="p">}</span> <span class="o">|</span> <span class="k">delete</span> <span class="k">from</span> <span class="n">meh</span><span class="p">;</span> <span class="n">PERFORM</span> <span class="n">pg_sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span> <span class="mi">10</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="k">RETURN</span> <span class="n">num</span><span class="p">;</span> <span class="mi">11</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="k">END</span><span class="p">;</span> <span class="p">(</span><span class="mi">11</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure> <p>You’re now only a JOIN away from matching your plpgsql profile data with the data from your favorite extensions!</p> <h3 id="limitations">Limitations</h3> <p>There are unfortunately some limitations.</p> <p>Due to the pg_stat_statements implementation, the queryid for DDL queries is not exposed outside the extension, so plpgsql_check can’t retrieve it.</p> <p>When using dynamic SQL, there might be 
<strong>many</strong> queries involved:</p> <ul> <li>the query text itself will be generated using SQL statement(s)</li> <li>the parameters, if any, will also be resolved by running SQL statement(s)</li> <li>if the query text depends on some parameters, you can end up with multiple different top-level queries</li> </ul> <p>plpgsql_check will only report the top-level query identifier, and if multiple different queries are generated, only the query identifier of the first one will be reported.</p> <p>Even with those limitations I still hope that this new feature will be helpful.</p> <h3 id="whats-next">What’s next?</h3> <p>Due to the current plpgsql implementation, when a dynamic SQL statement is executed the query identifier is not visible outside plpgsql itself. It means that retrieving the query identifier in that case is a bit costly, as plpgsql_check has to do some of the work that plpgsql is doing:</p> <ul> <li>generate the final query string</li> <li>parse the query string</li> <li>call the parse analysis step (this is where the query identifier is generated)</li> </ul> <p>Of course the query itself won’t be executed or even planned, but those extra steps might add non-negligible overhead, especially when the dynamic SQL is executing very short OLTP-style queries.</p> <p>So plpgsql should be modified to be able to report the query identifier of all statements, whether static or dynamic, so external modules can access the information easily and without any additional overhead. 
Ideally, this could also be available in plpgsql code using a <strong>GET [ CURRENT ] DIAGNOSTICS</strong> command, so users can also use it as they need.</p> <p><a href="https://rjuju.github.io/postgresql/2020/11/17/queryid-reporting-in-plpgsql_check.html">Queryid reporting in plpgsql_check</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on November 17, 2020.</p> <![CDATA[New in pg13: WAL monitoring]]> https://rjuju.github.io/postgresql/2020/04/07/new-in-pg13-WAL-monitoring 2020-04-07T15:46:15+00:00 2020-04-07T15:46:15+00:00 Julien Rouhaud https://rjuju.github.io <p>Write-Ahead Logging (WAL) is a critical part of PostgreSQL that ensures data durability. While there are multiple <a href="https://www.postgresql.org/docs/current/runtime-config-wal.html">configuration parameters</a>, there was no easy way to monitor WAL activity, or what is generating it.</p> <h3 id="new-infrastructure-to-track-wal-activity">New infrastructure to track WAL activity</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit df3b181499b40523bd6244a4e5eb554acb9020ce Author: Amit Kapila &lt;[email protected]&gt; Date: Sat Apr 4 10:02:08 2020 +0530 Add infrastructure to track WAL usage. This allows gathering the WAL generation statistics for each statement execution. The three statistics that we collect are the number of WAL records, the number of full page writes and the amount of WAL bytes generated. This helps the users who have write-intensive workload to see the impact of I/O due to WAL. This further enables us to see approximately what percentage of overall WAL is due to full page writes. In the future, we can extend this functionality to allow us to compute the the exact amount of WAL data due to full page writes. This patch in itself is just an infrastructure to compute WAL usage data. 
The upcoming patches will expose this data via explain, auto_explain, pg_stat_statements and verbose (auto)vacuum output. Author: Kirill Bychik, Julien Rouhaud Reviewed-by: Dilip Kumar, Fujii Masao and Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com </code></pre></div></div> <p>With this new infrastructure, each backend will track various information about WAL generation: the number of WAL records, the size of WAL generated and the number of full page images generated. It also makes sure that parallel queries, both DML and utility statements (for now only CREATE INDEX and VACUUM), are correctly handled.</p> <h3 id="per-query-wal-activity-with-pg_stat_statements">Per-query WAL activity with pg_stat_statements</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 6b466bf5f2bea0c89fab54eef696bcfc7ecdafd7 Author: Amit Kapila &lt;[email protected]&gt; Date: Sun Apr 5 07:34:04 2020 +0530 Allow pg_stat_statements to track WAL usage statistics. This commit adds three new columns in pg_stat_statements output to display WAL usage statistics added by commit df3b181499. This commit doesn't bump the version of pg_stat_statements as the same is done for this release in commit 17e0328224. Author: Kirill Bychik and Julien Rouhaud Reviewed-by: Julien Rouhaud, Fujii Masao, Dilip Kumar and Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com </code></pre></div></div> <p>This basically exposes the new WAL activity information mentioned above in pg_stat_statements, so per (user, database, normalized query). 
Here is an example:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">t1</span> <span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">);</span> <span class="k">CREATE</span> <span class="o">=#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">t1</span> <span class="k">SELECT</span> <span class="mi">1</span><span class="p">;</span> <span class="k">INSERT</span> <span class="mi">0</span> <span class="mi">1</span> <span class="o">=#</span> <span class="k">UPDATE</span> <span class="n">t1</span> <span class="k">SET</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">2</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="k">UPDATE</span> <span class="mi">1</span> <span class="o">=#</span> <span class="k">CHECKPOINT</span><span class="p">;</span> <span class="k">CHECKPOINT</span> <span class="o">=#</span> <span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="k">DELETE</span> <span class="mi">1</span> <span class="o">=#</span> <span class="k">SELECT</span> <span class="n">query</span><span class="p">,</span> <span class="n">wal_records</span><span class="p">,</span> <span class="n">wal_bytes</span><span class="p">,</span> <span class="n">wal_num_fpw</span> <span class="k">FROM</span> <span class="n">pg_stat_statements</span> <span class="k">WHERE</span> <span class="n">query</span> <span class="k">LIKE</span> <span class="s1">'UPDATE%'</span> <span class="k">OR</span> <span class="n">query</span> <span class="k">LIKE</span> <span class="s1">'DELETE%'</span><span class="p">;</span> 
<span class="n">query</span> <span class="o">|</span> <span class="n">wal_records</span> <span class="o">|</span> <span class="n">wal_bytes</span> <span class="o">|</span> <span class="n">wal_num_fpw</span> <span class="c1">-------------------------------------+-------------+-----------+-------------</span> <span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="err">$</span><span class="mi">1</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">155</span> <span class="o">|</span> <span class="mi">1</span> <span class="k">UPDATE</span> <span class="n">t1</span> <span class="k">SET</span> <span class="n">id</span> <span class="o">=</span> <span class="err">$</span><span class="mi">1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="err">$</span><span class="mi">2</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">69</span> <span class="o">|</span> <span class="mi">0</span> <span class="p">(</span><span class="mi">2</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure> <p>I simply inserted a row, updated it and deleted it. Now, looking specifically at the UPDATE and the DELETE, the numbers can be surprising.</p> <p>When inserting a row, we indeed expect a single WAL record and some WAL bytes for the new row, with some overhead due to internal implementation.</p> <p>Now, if you’re familiar with PostgreSQL MVCC implementation, you should know that doing a DELETE should only write a transaction id in the <code class="language-plaintext highlighter-rouge">xmax</code> field (<a href="https://www.postgresql.org/docs/current/storage-page-layout.html">this documentation page</a> is a good introduction on that subject). 
So why does writing a 4B field (the size of the recorded <code class="language-plaintext highlighter-rouge">xmax</code> field), even with some overhead, generate more than twice the amount of WAL that was required to update a full row? That’s because the DELETE caused a <a href="https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-FULL-PAGE-WRITES">full page write</a>. This is a side effect of performing a <strong>CHECKPOINT</strong> before the DELETE. To guarantee data consistency (and unless the <code class="language-plaintext highlighter-rouge">full_page_writes</code> parameter is disabled), any block modified for the first time after a <strong>CHECKPOINT</strong> completes will be fully logged, rather than logging only the delta.</p> <p>You’ll also note that the full page didn’t generate 8kB of data as you might expect. This isn’t because of <code class="language-plaintext highlighter-rouge">wal_compression</code>, as I didn’t activate it, but because the page is almost empty. Indeed, as an optimization, any “hole” in a page, as long as it’s a standard page, can be safely skipped in the WAL. If you’re curious, this is done in the <a href="https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xloginsert.c">XLogRecordAssemble() function</a>. 
Here’s the relevant extract:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">static</span> <span class="n">XLogRecData</span> <span class="o">*</span> <span class="n">XLogRecordAssemble</span><span class="p">(</span><span class="n">RmgrId</span> <span class="n">rmid</span><span class="p">,</span> <span class="n">uint8</span> <span class="n">info</span><span class="p">,</span> <span class="n">XLogRecPtr</span> <span class="n">RedoRecPtr</span><span class="p">,</span> <span class="nb">bool</span> <span class="n">doPageWrites</span><span class="p">,</span> <span class="n">XLogRecPtr</span> <span class="o">*</span><span class="n">fpw_lsn</span><span class="p">,</span> <span class="nb">int</span> <span class="o">*</span><span class="n">num_fpw</span><span class="p">)</span> <span class="p">{</span> <span class="p">[...]</span> <span class="cm">/* * If needs_backup is true or WAL checking is enabled for current * resource manager, log a full-page write for the current block. */</span> <span class="n">include_image</span> <span class="o">=</span> <span class="n">needs_backup</span> <span class="o">||</span> <span class="p">(</span><span class="n">info</span> <span class="o">&amp;</span> <span class="n">XLR_CHECK_CONSISTENCY</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">if</span> <span class="p">(</span><span class="n">include_image</span><span class="p">)</span> <span class="p">{</span> <span class="n">Page</span> <span class="n">page</span> <span class="o">=</span> <span class="n">regbuf</span><span class="o">-&gt;</span><span class="n">page</span><span class="p">;</span> <span class="n">uint16</span> <span class="n">compressed_len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="cm">/* * The page needs to be backed up, so calculate its hole length * and offset. 
*/</span> <span class="n">if</span> <span class="p">(</span><span class="n">regbuf</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">REGBUF_STANDARD</span><span class="p">)</span> <span class="p">{</span> <span class="cm">/* Assume we can omit data between pd_lower and pd_upper */</span> <span class="n">uint16</span> <span class="k">lower</span> <span class="o">=</span> <span class="p">((</span><span class="n">PageHeader</span><span class="p">)</span> <span class="n">page</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">pd_lower</span><span class="p">;</span> <span class="n">uint16</span> <span class="k">upper</span> <span class="o">=</span> <span class="p">((</span><span class="n">PageHeader</span><span class="p">)</span> <span class="n">page</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">pd_upper</span><span class="p">;</span> <span class="n">if</span> <span class="p">(</span><span class="k">lower</span> <span class="o">&gt;=</span> <span class="n">SizeOfPageHeaderData</span> <span class="o">&amp;&amp;</span> <span class="k">upper</span> <span class="o">&gt;</span> <span class="k">lower</span> <span class="o">&amp;&amp;</span> <span class="k">upper</span> <span class="o">&lt;=</span> <span class="n">BLCKSZ</span><span class="p">)</span> <span class="p">{</span> <span class="n">bimg</span><span class="p">.</span><span class="n">hole_offset</span> <span class="o">=</span> <span class="k">lower</span><span class="p">;</span> <span class="n">cbimg</span><span class="p">.</span><span class="n">hole_length</span> <span class="o">=</span> <span class="k">upper</span> <span class="o">-</span> <span class="k">lower</span><span class="p">;</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="cm">/* No "hole" to remove */</span> <span class="n">bimg</span><span class="p">.</span><span class="n">hole_offset</span> <span 
class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">cbimg</span><span class="p">.</span><span class="n">hole_length</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="p">[...]</span></code></pre></figure> <h3 id="wal-activity-in-explain-and-auto_explain">WAL activity in EXPLAIN (and auto_explain)</h3> <p>A new <code class="language-plaintext highlighter-rouge">WAL</code> option is available in the <strong>EXPLAIN</strong> command, and similarly an <code class="language-plaintext highlighter-rouge">auto_explain.log_wal</code> option for <strong>auto_explain</strong>, to display the same counters. In TEXT mode, only the non-zero counters are shown, similarly to other counters. For instance:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span> <span class="n">WAL</span><span class="p">,</span> <span class="n">COSTS</span> <span class="k">OFF</span><span class="p">)</span> <span class="k">UPDATE</span> <span class="n">t1</span> <span class="k">SET</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">QUERY</span> <span class="n">PLAN</span> <span class="c1">----------------------------------------------------------------</span> <span class="k">Update</span> <span class="k">on</span> <span class="n">t1</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">181</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">181</span> <span class="k">rows</span><span 
class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">WAL</span><span class="p">:</span> <span class="n">records</span><span class="o">=</span><span class="mi">1</span> <span class="n">bytes</span><span class="o">=</span><span class="mi">68</span> <span class="o">-&gt;</span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">t1</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">074</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">080</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">Filter</span><span class="p">:</span> <span class="p">(</span><span class="n">id</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span> <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">274</span> <span class="n">ms</span> <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">381</span> <span class="n">ms</span> <span class="p">(</span><span class="mi">6</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure> <h3 id="wal-activity-in-autovacuum-logs">WAL activity in autovacuum logs</h3> <p>And finally, if an autovacuum is logging its activity (when reaching the <code class="language-plaintext highlighter-rouge">log_autovacuum_min_duration</code> threshold), the same information will be logged. 
For instance, after inserting 100k records in the same table, deleting half of them and running a <strong>CHECKPOINT</strong>, here’s the output I get:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">LOG</span><span class="p">:</span> <span class="n">automatic</span> <span class="k">vacuum</span> <span class="k">of</span> <span class="k">table</span> <span class="nv">"rjuju.public.t1"</span><span class="p">:</span> <span class="k">index</span> <span class="n">scans</span><span class="p">:</span> <span class="mi">0</span> <span class="n">pages</span><span class="p">:</span> <span class="mi">0</span> <span class="n">removed</span><span class="p">,</span> <span class="mi">443</span> <span class="n">remain</span><span class="p">,</span> <span class="mi">0</span> <span class="n">skipped</span> <span class="n">due</span> <span class="k">to</span> <span class="n">pins</span><span class="p">,</span> <span class="mi">0</span> <span class="n">skipped</span> <span class="n">frozen</span> <span class="n">tuples</span><span class="p">:</span> <span class="mi">50000</span> <span class="n">removed</span><span class="p">,</span> <span class="mi">50001</span> <span class="n">remain</span><span class="p">,</span> <span class="mi">0</span> <span class="k">are</span> <span class="n">dead</span> <span class="n">but</span> <span class="k">not</span> <span class="n">yet</span> <span class="n">removable</span><span class="p">,</span> <span class="n">oldest</span> <span class="n">xmin</span><span class="p">:</span> <span class="mi">496</span> <span class="n">buffer</span> <span class="k">usage</span><span class="p">:</span> <span class="mi">912</span> <span class="n">hits</span><span class="p">,</span> <span class="mi">3</span> <span class="n">misses</span><span class="p">,</span> <span class="mi">448</span> <span class="n">dirtied</span> <span class="k">avg</span> <span class="k">read</span> <span class="n">rate</span><span class="p">:</span> 
<span class="mi">0</span><span class="p">.</span><span class="mi">084</span> <span class="n">MB</span><span class="o">/</span><span class="n">s</span><span class="p">,</span> <span class="k">avg</span> <span class="k">write</span> <span class="n">rate</span><span class="p">:</span> <span class="mi">12</span><span class="p">.</span><span class="mi">485</span> <span class="n">MB</span><span class="o">/</span><span class="n">s</span> <span class="k">system</span> <span class="k">usage</span><span class="p">:</span> <span class="n">CPU</span><span class="p">:</span> <span class="k">user</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">17</span> <span class="n">s</span><span class="p">,</span> <span class="k">system</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">00</span> <span class="n">s</span><span class="p">,</span> <span class="n">elapsed</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">28</span> <span class="n">s</span> <span class="n">WAL</span> <span class="k">usage</span><span class="p">:</span> <span class="mi">1330</span> <span class="n">records</span><span class="p">,</span> <span class="mi">445</span> <span class="k">full</span> <span class="n">page</span> <span class="n">writes</span><span class="p">,</span> <span class="mi">2197104</span> <span class="n">bytes</span></code></pre></figure> <p>This new log output is, in my opinion, particularly important when it comes to <a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND">anti-wraparound / FREEZE vacuum</a>. Indeed, by nature an anti-wraparound VACUUM is more likely to touch blocks that weren’t modified for a long time, as it targets tuples that have been visible for more than 200M transactions (by default). 
Even though it’s only setting a flag bit to mark the tuple as frozen, if that block wasn’t modified since the last <strong>CHECKPOINT</strong>, this single-bit change will be amplified into a <strong>full page image</strong>, which is far more data.</p> <p>With this new feature, it’s now possible to really monitor the WAL generation, which will help to better tune your instances!</p> <p><a href="https://rjuju.github.io/postgresql/2020/04/07/new-in-pg13-WAL-monitoring.html">New in pg13: WAL monitoring</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 07, 2020.</p> <![CDATA[New in pg13: Monitoring the query planner]]> https://rjuju.github.io/postgresql/2020/04/04/new-in-pg13-monitoring-query-planner 2020-04-04T12:06:15+00:00 2020-04-04T12:06:15+00:00 Julien Rouhaud https://rjuju.github.io <p>Depending on your workload, the planning time can represent a significant part of the overall query processing time. This is especially important in OLTP workloads, but OLAP queries with numerous tables being joined and an aggressive configuration of the JOIN order search can also lead to high planning time.</p> <h3 id="planning-counters-in-pg_stat_statements">Planning counters in pg_stat_statements</h3> <p>Previously, pg_stat_statements was only keeping track of the execution part of query processing: the number of executions, the cumulated time, but also the minimum, maximum, mean and standard deviation. With PostgreSQL 13, you’ll also have those metrics for the planning part!</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 17e03282241c6ac58a714eb0c3b6a8018cf6167a Author: Fujii Masao &lt;[email protected]&gt; Date: Thu Apr 2 11:20:19 2020 +0900 Allow pg_stat_statements to track planning statistics. This commit makes pg_stat_statements support new GUC pg_stat_statements.track_planning. 
If this option is enabled, pg_stat_statements tracks the planning statistics of the statements, e.g., the number of times the statement was planned, the total time spent planning the statement, etc. This feature is useful to check the statements that it takes a long time to plan. Previously since pg_stat_statements tracked only the execution statistics, we could not use that for the purpose. The planning and execution statistics are stored at the end of each phase separately. So there are not always one-to-one relationship between them. For example, if the statement is successfully planned but fails in the execution phase, only its planning statistics are stored. This may cause the users to be able to see different pg_stat_statements results from the previous version. To avoid this, pg_stat_statements.track_planning needs to be disabled. This commit bumps the version of pg_stat_statements to 1.8 since it changes the definition of pg_stat_statements function. Author: Julien Rouhaud, Pascal Legrand, Thomas Munro, Fujii Masao Reviewed-by: Sergei Kornilov, Tomas Vondra, Yoshikazu Imai, Haribabu Kommi, Tom Lane Discussion: https://postgr.es/m/CAHGQGwFx_=DO-Gu-MfPW3VQ4qC7TfVdH2zHmvZfrGv6fQ3D-Tw@mail.gmail.com Discussion: https://postgr.es/m/CAEepm=0e59Y_6Q_YXYCTHZkqOc6H2pJ54C_Xe=VFu50Aqqp_sA@mail.gmail.com Discussion: https://postgr.es/m/DB6PR0301MB21352F6210E3B11934B0DCC790B00@DB6PR0301MB2135.eurprd03.prod.outlook.com </code></pre></div></div> <p>Keep in mind that even a simple query can have a surprisingly high planning time. One frequent cause was the <code class="language-plaintext highlighter-rouge">get_actual_variable_range()</code> function, which is called when the planner wants to know the minimum and maximum values of a specific field. This function detects if a suitable index exists, and if there’s one it uses it to get the wanted values. 
However, when there were a lot of uncommitted values at the end of the index range, it could take a significant amount of time to get a visible value. While this problem was fixed long ago (see <a href="https://github.com/postgres/postgres/commit/fccebe421d0c410e6378fb281419442c84759213">this commit</a> and <a href="https://github.com/postgres/postgres/commit/3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd">this other commit</a> for more details), there are still some cases where the planning time is higher than what you’d expect, so having an easy way to monitor the planning metrics is worthwhile.</p> <p>This feature can also be useful to know how much you’re benefiting from the <a href="https://www.postgresql.org/docs/current/sql-prepare.html">generic plan feature</a>, for instance, and how much of a difference it makes.</p> <p>Let’s see a simple example showing the effect of generic plans with prepared statements:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">PREPARE</span> <span class="n">s1</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">pg_class</span><span class="p">;</span> <span class="k">PREPARE</span> <span class="o">=#</span> <span class="k">EXECUTE</span> <span class="n">s1</span><span class="p">;</span> <span class="k">count</span> <span class="c1">-------</span> <span class="mi">387</span> <span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span> <span class="p">[...</span> <span class="mi">5</span> <span class="k">more</span> <span class="n">times</span> <span class="p">...]</span> <span class="o">=#</span> <span class="k">SELECT</span> <span class="n">query</span><span class="p">,</span> <span class="n">plans</span><span class="p">,</span> <span 
class="n">total_plan_time</span><span class="p">,</span> <span class="n">total_plan_time</span> <span class="o">/</span> <span class="n">plans</span> <span class="k">AS</span> <span class="n">avg_plan</span><span class="p">,</span> <span class="n">calls</span><span class="p">,</span> <span class="n">total_exec_time</span><span class="p">,</span> <span class="n">total_exec_time</span> <span class="o">/</span> <span class="n">calls</span> <span class="k">AS</span> <span class="n">avg_exec</span> <span class="k">FROM</span> <span class="n">pg_stat_statements</span> <span class="k">WHERE</span> <span class="n">query</span> <span class="k">ILIKE</span> <span class="s1">'%SELECT count(*) FROM pg_class%'</span><span class="p">;</span> <span class="o">-</span><span class="p">[</span> <span class="n">RECORD</span> <span class="mi">1</span> <span class="p">]</span><span class="c1">---+--------------------------------------------</span> <span class="n">query</span> <span class="o">|</span> <span class="k">PREPARE</span> <span class="n">s1</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">pg_class</span> <span class="n">plans</span> <span class="o">|</span> <span class="mi">1</span> <span class="n">total_plan_time</span> <span class="o">|</span> <span class="mi">2</span><span class="p">.</span><span class="mi">119496</span> <span class="n">avg_plan</span> <span class="o">|</span> <span class="mi">2</span><span class="p">.</span><span class="mi">119496</span> <span class="n">calls</span> <span class="o">|</span> <span class="mi">6</span> <span class="n">total_exec_time</span> <span class="o">|</span> <span class="mi">3</span><span class="p">.</span><span class="mi">4918280000000004</span> <span class="n">avg_exec</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span 
class="mi">5819713333333334</span></code></pre></figure> <p>While the query was executed 6 times, it was actually planned only once (since there’s no parameter, a generic plan is always used). While the execution time is on average slightly more than half a millisecond, a single planning was almost <strong>4 times</strong> more expensive. By saving 5 plannings, postgres saved roughly <strong>10ms</strong>.</p> <h3 id="planning-buffers-in-explain">Planning buffers in EXPLAIN</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit ce77abe63cfc85fb0bc236deb2cc34ae35cb5324 Author: Fujii Masao &lt;[email protected]&gt; Date: Sat Apr 4 03:13:17 2020 +0900 Include information on buffer usage during planning phase, in EXPLAIN output, take two. When BUFFERS option is enabled, EXPLAIN command includes the information on buffer usage during each plan node, in its output. In addition to that, this commit makes EXPLAIN command include also the information on buffer usage during planning phase, in its output. This feature makes it easier to discern the cases where lots of buffer access happen during planning. This commit revives the original commit ed7a509571 that was reverted by commit 19db23bcbd. The original commit had to be reverted because it caused the regression test failure on the buildfarm members prion and dory. But since commit c0885c4c30 got rid of the caues of the test failure, the original commit can be safely introduced again. Author: Julien Rouhaud, slightly revised by Fujii Masao Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/[email protected] </code></pre></div></div> <p>Following the same idea, EXPLAIN will now display the buffer usage of the planning phase if the <code class="language-plaintext highlighter-rouge">BUFFERS</code> option is used. 
If you try that on a fresh new connection, before any catalog cache is populated, you may be surprised by how many buffers are accessed even for a simple query:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="n">BUFFERS</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">,</span> <span class="n">COSTS</span> <span class="k">OFF</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pg_class</span><span class="p">;</span> <span class="n">QUERY</span> <span class="n">PLAN</span> <span class="c1">---------------------------------------------------------------------------------------------------------</span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pg_class</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">028</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">410</span> <span class="k">rows</span><span class="o">=</span><span class="mi">388</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">Buffers</span><span class="p">:</span> <span class="n">shared</span> <span class="n">hit</span><span class="o">=</span><span class="mi">13</span> <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">5</span><span class="p">.</span><span class="mi">157</span> <span class="n">ms</span> <span class="n">Buffers</span><span class="p">:</span> <span class="n">shared</span> <span class="n">hit</span><span class="o">=</span><span class="mi">118</span> <span class="n">Execution</span> <span 
class="nb">Time</span><span class="p">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">257</span> <span class="n">ms</span> <span class="p">(</span><span class="mi">5</span> <span class="k">rows</span><span class="p">)</span> <span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="n">BUFFERS</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">,</span> <span class="n">COSTS</span> <span class="k">OFF</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pg_class</span><span class="p">;</span> <span class="n">QUERY</span> <span class="n">PLAN</span> <span class="c1">------------------------------------------------------------------</span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pg_class</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">035</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">413</span> <span class="k">rows</span><span class="o">=</span><span class="mi">388</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">Buffers</span><span class="p">:</span> <span class="n">shared</span> <span class="n">hit</span><span class="o">=</span><span class="mi">13</span> <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">393</span> <span class="n">ms</span> <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">670</span> <span class="n">ms</span></code></pre></figure> <p>We can see here that populating the cache 
(relation, columns, datatypes…) accesses 118 blocks, and that’s probably a significant part of the roughly 5 extra ms of planning time we saw in the first EXPLAIN output.</p> <p><a href="https://rjuju.github.io/postgresql/2020/04/04/new-in-pg13-monitoring-query-planner.html">New in pg13: Monitoring the query planner</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 04, 2020.</p> <![CDATA[New in pg13: leader_pid column in pg_stat_activity]]> https://rjuju.github.io/postgresqlfr/2020/03/08/nouveau-dans-pg13-leader_pid 2020-03-08T05:33:26+00:00 2020-03-08T05:33:26+00:00 Julien Rouhaud https://rjuju.github.io <h3 id="nouvelle-colonne-leader_pid-dans-la-vue-pg_stat_activity">New leader_pid column in the pg_stat_activity view</h3> <p>Surprisingly, ever since parallel query was added in PostgreSQL 9.6, it was impossible to know which client backend a parallel worker was tied to. As <a href="https://twitter.com/g_lelarge/status/1209486212190343168">Guillaume pointed out</a>, this makes it quite hard to build simple tools that sample the wait events of all the processes involved in a query. A simple solution to that problem is to expose the <code class="language-plaintext highlighter-rouge">lock group leader</code> information, available in the client backend, at the SQL level:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit b025f32e0b5d7668daec9bfa957edf3599f4baa8 Author: Michael Paquier &lt;[email protected]&gt; Date: Thu Feb 6 09:18:06 2020 +0900 Add leader_pid to pg_stat_activity This new field tracks the PID of the group leader used with parallel query. For parallel workers and the leader, the value is set to the PID of the group leader. So, for the group leader, the value is the same as its own PID. 
Note that this reflects what PGPROC stores in shared memory, so as leader_pid is NULL if a backend has never been involved in parallel query. If the backend is using parallel query or has used it at least once, the value is set until the backend exits. Author: Julien Rouhaud Reviewed-by: Sergei Kornilov, Guillaume Lelarge, Michael Paquier, Tomas Vondra Discussion: https://postgr.es/m/CAOBaU_Yy5bt0vTPZ2_LUM6cUcGeqmYNoJ8-Rgto+c2+w3defYA@mail.gmail.com </code></pre></div></div> <p>With this change, it’s now very easy to find all the processes involved in a parallel query. For example:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">query</span><span class="p">,</span> <span class="n">leader_pid</span><span class="p">,</span> <span class="n">array_agg</span><span class="p">(</span><span class="n">pid</span><span class="p">)</span> <span class="n">filter</span><span class="p">(</span><span class="k">WHERE</span> <span class="n">leader_pid</span> <span class="o">!=</span> <span class="n">pid</span><span class="p">)</span> <span class="k">AS</span> <span class="n">members</span> <span class="k">FROM</span> <span class="n">pg_stat_activity</span> <span class="k">WHERE</span> <span class="n">leader_pid</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">query</span><span class="p">,</span> <span class="n">leader_pid</span><span class="p">;</span> <span class="n">query</span> <span class="o">|</span> <span class="n">leader_pid</span> <span class="o">|</span> <span class="n">members</span> <span class="c1">-------------------+------------+---------------</span> <span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">t1</span><span class="p">;</span> <span class="o">|</span> <span 
class="mi">31630</span> <span class="o">|</span> <span class="p">{</span><span class="mi">32269</span><span class="p">,</span><span class="mi">32268</span><span class="p">}</span> <span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span></code></pre></figure> <p>Be careful though: as noted in the commit message, if the <code class="language-plaintext highlighter-rouge">leader_pid</code> column has the same value as the <code class="language-plaintext highlighter-rouge">pid</code> column, it doesn’t necessarily mean that the client backend is currently running a parallel query, because once the field is set it is never reset. Also, to avoid any overhead, no additional lock is held while displaying this data. This means that each row is processed independently. So, although it’s very unlikely, you can get inconsistent data in some circumstances, such as a parallel worker pointing to a pid that has already disconnected.</p> <p><a href="https://rjuju.github.io/postgresqlfr/2020/03/08/nouveau-dans-pg13-leader_pid.html">New in pg13: leader_pid column in pg_stat_activity</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on March 08, 2020.</p> <![CDATA[Planner selectivity estimation error statistics with pg_qualstats 2]]> https://rjuju.github.io/postgresql/2020/02/28/pg_qualstats-2-selectivity-error 2020-02-28T12:37:04+00:00 2020-02-28T12:37:04+00:00 Julien Rouhaud https://rjuju.github.io <p>Selectivity estimation error is one of the main causes of bad query plans. 
It’s quite straightforward to compute those estimation errors using <code class="language-plaintext highlighter-rouge">EXPLAIN (ANALYZE)</code>, either manually or with the help of <a href="https://explain.depesz.com/">explain.depesz.com</a> (or other similar tools), but until now there was no tool available to get this information automatically and globally. Version 2 of pg_qualstats fixes that, thanks a lot to <a href="https://twitter.com/obartunov">Oleg Bartunov</a> for the original idea!</p> <p>Note: If you don’t know the pg_qualstats extension, you may want to see <a href="/postgresql/2020/01/06/pg_qualstats-2-global-index-advisor.html">my last article about it</a>.</p> <h3 id="the-problem">The problem</h3> <p>There can be many causes for that issue: outdated statistics, complex predicates, non-uniform data… But whatever the reason is, if the optimizer doesn’t have an accurate idea of how much data each predicate will filter, the result is the same: a bad query plan, which can lead to longer query execution.</p> <p>To illustrate the problem, I’ll use here a simple test case, deliberately built to fool the optimizer.</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">pgqs</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="n">i</span><span class="o">%</span><span class="mi">2</span> <span class="n">val1</span> <span class="p">,</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">%</span><span class="mi">2</span> <span class="n">val2</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">50000</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span> <span 
class="k">SELECT</span> <span class="mi">50000</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">VACUUM</span> <span class="k">ANALYZE</span> <span class="n">pgqs</span><span class="p">;</span> <span class="k">VACUUM</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">ANALYZE</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pgqs</span> <span class="k">WHERE</span> <span class="n">val1</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">AND</span> <span class="n">val2</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">QUERY</span> <span class="n">PLAN</span> <span class="c1">--------------------------------------------------------------------</span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pgqs</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">12500</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">Filter</span><span class="p">:</span> <span class="p">((</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span> <span class="k">AND</span> <span class="p">(</span><span class="n">val2</span> <span class="o">=</span> <span class="mi">1</span><span class="p">))</span> <span class="k">Rows</span> <span class="n">Removed</span> <span class="k">by</span> <span class="n">Filter</span><span class="p">:</span> <span class="mi">50000</span> <span class="n">Planning</span> <span 
class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">553</span> <span class="n">ms</span> <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">38</span><span class="p">.</span><span class="mi">062</span> <span class="n">ms</span> <span class="p">(</span><span class="mi">5</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure> <p>Here postgres thinks that the query will emit 12500 tuples, while in reality none will be emitted. If you’re wondering how postgres came up with that number, the explanation is simple. When multiple independent clauses are AND-ed (overlapping range predicates are the exception, as they can be merged) and no extended statistics are available (see below for more about it), postgres will simply multiply each clause selectivity. This is done in <code class="language-plaintext highlighter-rouge">clauselist_selectivity_simple</code>, in <a href="https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/clausesel.c">src/backend/optimizer/path/clausesel.c</a>:</p> <figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="n">Selectivity</span> <span class="nf">clauselist_selectivity_simple</span><span class="p">(</span><span class="n">PlannerInfo</span> <span class="o">*</span><span class="n">root</span><span class="p">,</span> <span class="n">List</span> <span class="o">*</span><span class="n">clauses</span><span class="p">,</span> <span class="kt">int</span> <span class="n">varRelid</span><span class="p">,</span> <span class="n">JoinType</span> <span class="n">jointype</span><span class="p">,</span> <span class="n">SpecialJoinInfo</span> <span class="o">*</span><span class="n">sjinfo</span><span class="p">,</span> <span class="n">Bitmapset</span> <span class="o">*</span><span class="n">estimatedclauses</span><span class="p">)</span> <span class="p">{</span> <span class="n">Selectivity</span> <span 
class="n">s1</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span> <span class="p">[...]</span> <span class="cm">/* * Anything that doesn't look like a potential rangequery clause gets * multiplied into s1 and forgotten. Anything that does gets inserted into * an rqlist entry. */</span> <span class="n">listidx</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">foreach</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">clauses</span><span class="p">)</span> <span class="p">{</span> <span class="p">[...]</span> <span class="cm">/* Always compute the selectivity using clause_selectivity */</span> <span class="n">s2</span> <span class="o">=</span> <span class="n">clause_selectivity</span><span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">clause</span><span class="p">,</span> <span class="n">varRelid</span><span class="p">,</span> <span class="n">jointype</span><span class="p">,</span> <span class="n">sjinfo</span><span class="p">);</span> <span class="p">[...]</span> <span class="cm">/* * If it's not a "&lt;"/"&lt;="/"&gt;"/"&gt;=" operator, just merge the * selectivity in generically. But if it's the right oprrest, * add the clause to rqlist for later processing. 
*/</span> <span class="k">switch</span> <span class="p">(</span><span class="n">get_oprrest</span><span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">opno</span><span class="p">))</span> <span class="p">{</span> <span class="p">[...]</span> <span class="nl">default:</span> <span class="cm">/* Just merge the selectivity in generically */</span> <span class="n">s1</span> <span class="o">=</span> <span class="n">s1</span> <span class="o">*</span> <span class="n">s2</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span> <span class="p">[...]</span></code></pre></figure> <p>In this case, each predicate will independently filter approximately 50% of the table, as we can see in the <strong>pg_stats view</strong>:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">tablename</span><span class="p">,</span> <span class="n">attname</span><span class="p">,</span> <span class="n">most_common_vals</span><span class="p">,</span> <span class="n">most_common_freqs</span> <span class="k">FROM</span> <span class="n">pg_stats</span> <span class="k">WHERE</span> <span class="n">tablename</span> <span class="o">=</span> <span class="s1">'pgqs'</span><span class="p">;</span> <span class="n">tablename</span> <span class="o">|</span> <span class="n">attname</span> <span class="o">|</span> <span class="n">most_common_vals</span> <span class="o">|</span> <span class="n">most_common_freqs</span> <span class="c1">-----------+---------+------------------+-------------------------</span> <span class="n">pgqs</span> <span class="o">|</span> <span class="n">val1</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span 
class="p">.</span><span class="mi">50116664</span><span class="p">,</span><span class="mi">0</span><span class="p">.</span><span class="mi">49883333</span><span class="p">}</span> <span class="n">pgqs</span> <span class="o">|</span> <span class="n">val2</span> <span class="o">|</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="o">|</span> <span class="p">{</span><span class="mi">0</span><span class="p">.</span><span class="mi">50116664</span><span class="p">,</span><span class="mi">0</span><span class="p">.</span><span class="mi">49883333</span><span class="p">}</span> <span class="p">(</span><span class="mi">2</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure> <p>So when using both clauses, the estimate is 25% of the table, since postgres doesn’t know <strong>by default</strong> that the two predicates are mutually exclusive. Continuing with this artificial test case, let’s see what happens if we add a <em>join</em> on top of it. For instance, joining the table to itself on the <code class="language-plaintext highlighter-rouge">val1</code> column only. 
For clarity, I’ll use <strong>t1</strong> for the table on which I’m applying the mutually exclusive predicates, and <strong>t2</strong> the table joined:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="k">ANALYZE</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pgqs</span> <span class="n">t1</span> <span class="k">JOIN</span> <span class="n">pgqs</span> <span class="n">t2</span> <span class="k">ON</span> <span class="n">t1</span><span class="p">.</span><span class="n">val1</span> <span class="o">=</span> <span class="n">t2</span><span class="p">.</span><span class="n">val1</span> <span class="k">WHERE</span> <span class="n">t1</span><span class="p">.</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">AND</span> <span class="n">t1</span><span class="p">.</span><span class="n">val2</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">QUERY</span> <span class="n">PLAN</span> <span class="c1">-----------------------------------------------------------------------------------</span> <span class="n">Nested</span> <span class="n">Loop</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">313475000</span> <span class="n">width</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pgqs</span> <span class="n">t2</span> <span class="p">([...]</span> <span class="k">rows</span><span 
class="o">=</span><span class="mi">25078</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">25000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">Filter</span><span class="p">:</span> <span class="p">(</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">Rows</span> <span class="n">Removed</span> <span class="k">by</span> <span class="n">Filter</span><span class="p">:</span> <span class="mi">25000</span> <span class="o">-&gt;</span> <span class="n">Materialize</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">12500</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">25000</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pgqs</span> <span class="n">t1</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">12500</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">Filter</span><span class="p">:</span> <span class="p">((</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">AND</span> 
<span class="p">(</span><span class="n">val2</span> <span class="o">=</span> <span class="mi">0</span><span class="p">))</span> <span class="k">Rows</span> <span class="n">Removed</span> <span class="k">by</span> <span class="n">Filter</span><span class="p">:</span> <span class="mi">50000</span> <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">943</span> <span class="n">ms</span> <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">86</span><span class="p">.</span><span class="mi">757</span> <span class="n">ms</span> <span class="p">(</span><span class="mi">14</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure> <p>Postgres thinks that this join will emit <strong>313 million rows</strong>, while obviously no rows will be emitted. And this is a good example of how bad assumptions can lead to an inefficient plan.</p> <p>Here Postgres can deduce that the <code class="language-plaintext highlighter-rouge">val1 = 0</code> predicate can be applied to <strong>t2</strong>. So how to join two relations, one that should emit 25000 tuples and the other that should emit 12500 tuples, with no index available? A nested loop is not a bad choice, as neither relation is really big. As no index is available, postgres also chooses to <strong>materialize</strong> the inner relation, meaning storing it in memory, to make it more efficient. As it tries to limit memory consumption as much as possible, the relation estimated to be the smallest is materialized, and that’s the mistake here.</p> <p>Indeed, postgres will read the whole table twice: once to get every row corresponding to the <code class="language-plaintext highlighter-rouge">val1 = 0</code> predicate for the outer relation, and once to find all rows to be materialized. 
If the opposite had been done, as it probably would have been if the estimates had been more realistic, the table would only have been read once.</p> <p>In this case, as the dataset is small and quite artificial, a better plan wouldn’t drastically change the execution time. But keep in mind that with real production environments, it could mean choosing a nested loop assuming that there’ll be only a couple of rows to loop on while in reality the backend will spend minutes or even hours looping over millions of rows, and another plan would have been orders of magnitude quicker.</p> <h3 id="detecting-the-problem">Detecting the problem</h3> <p>pg_qualstats 2 will now compute the selectivity estimation error, both as a ratio and as a raw number, and will keep track, for each predicate, of the minimum, maximum and mean values, along with the standard deviation. It is now quite simple to detect problematic quals!</p> <p>After executing the last query, here’s what the <code class="language-plaintext highlighter-rouge">pg_qualstats</code> view will return:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">relname</span><span class="p">,</span> <span class="n">attname</span><span class="p">,</span> <span class="n">opno</span><span class="p">::</span><span class="n">regoper</span><span class="p">,</span> <span class="n">qualid</span><span class="p">,</span> <span class="n">qualnodeid</span><span class="p">,</span> <span class="n">mean_err_estimate_ratio</span> <span class="n">mean_ratio</span><span class="p">,</span> <span class="n">mean_err_estimate_num</span> <span class="n">mean_num</span><span class="p">,</span> <span class="n">constvalue</span> <span class="k">FROM</span> <span class="n">pg_qualstats</span> <span class="n">pgqs</span> <span class="k">JOIN</span> <span class="n">pg_class</span> <span class="k">c</span> <span class="k">ON</span> <span 
class="n">pgqs</span><span class="p">.</span><span class="n">lrelid</span> <span class="o">=</span> <span class="k">c</span><span class="p">.</span><span class="n">oid</span> <span class="k">JOIN</span> <span class="n">pg_attribute</span> <span class="n">a</span> <span class="k">ON</span> <span class="n">a</span><span class="p">.</span><span class="n">attrelid</span> <span class="o">=</span> <span class="k">c</span><span class="p">.</span><span class="n">oid</span> <span class="k">AND</span> <span class="n">a</span><span class="p">.</span><span class="n">attnum</span> <span class="o">=</span> <span class="n">pgqs</span><span class="p">.</span><span class="n">lattnum</span><span class="p">;</span> <span class="n">relname</span> <span class="o">|</span> <span class="n">attname</span> <span class="o">|</span> <span class="n">opno</span> <span class="o">|</span> <span class="n">qualid</span> <span class="o">|</span> <span class="n">qualnodeid</span> <span class="o">|</span> <span class="n">mean_ratio</span> <span class="o">|</span> <span class="n">mean_num</span> <span class="o">|</span> <span class="n">constvalue</span> <span class="c1">---------+---------+------+------------+------------+------------+----------+------------</span> <span class="n">pgqs</span> <span class="o">|</span> <span class="n">val1</span> <span class="o">|</span> <span class="o">=</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="mi">3161070364</span> <span class="o">|</span> <span class="mi">1</span><span class="p">.</span><span class="mi">00393542</span> <span class="o">|</span> <span class="mi">98</span> <span class="o">|</span> <span class="mi">0</span><span class="p">::</span><span class="nb">integer</span> <span class="n">pgqs</span> <span class="o">|</span> <span class="n">val1</span> <span class="o">|</span> <span class="o">=</span> <span class="o">|</span> <span 
class="mi">3864967567</span> <span class="o">|</span> <span class="mi">3161070364</span> <span class="o">|</span> <span class="mi">12500</span> <span class="o">|</span> <span class="mi">12500</span> <span class="o">|</span> <span class="mi">0</span><span class="p">::</span><span class="nb">integer</span> <span class="n">pgqs</span> <span class="o">|</span> <span class="n">val2</span> <span class="o">|</span> <span class="o">=</span> <span class="o">|</span> <span class="mi">3864967567</span> <span class="o">|</span> <span class="mi">3065200358</span> <span class="o">|</span> <span class="mi">12500</span> <span class="o">|</span> <span class="mi">12500</span> <span class="o">|</span> <span class="mi">0</span><span class="p">::</span><span class="nb">integer</span> <span class="p">(</span><span class="mi">3</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure> <p class="notice"><strong>NOTE:</strong> <code class="language-plaintext highlighter-rouge">qualid</code> is an identifier set when multiple quals are AND-ed, NULL otherwise, and <code class="language-plaintext highlighter-rouge">qualnodeid</code> is a per-qual identifier.</p> <p>We see here that when used alone, the qual <code class="language-plaintext highlighter-rouge">pgqs.val1 = ?</code> doesn’t show any selectivity estimation problem, as the ratio (<em>mean_ratio</em>) is very close to <strong>1</strong> and the raw number (<em>mean_num</em>) is quite low. On the other hand, when combined with <code class="language-plaintext highlighter-rouge">AND pgqs.val2 = ?</code>, pg_qualstats reports a significant estimation error. 
That’s a very strong sign that those columns are functionally dependent.</p> <p>If, on the other hand, a qual shows issues on its own, it could be a sign of outdated statistics, or of a sample size that isn’t big enough.</p> <p>Also, if you have the <code class="language-plaintext highlighter-rouge">pg_stat_statements</code> extension installed, <code class="language-plaintext highlighter-rouge">pg_qualstats</code> will give you the <em>query identifier</em> for each predicate. With that and a bit of SQL, you can for instance find the queries with a long average execution time that contain quals for which the selectivity estimation is off by a factor of 10 or more.</p> <h3 id="interlude-extended-statistics">Interlude: Extended statistics</h3> <p>If you’re wondering how to solve the issue I just explained, the solution has been very easy since <strong>extended statistics</strong> were introduced in PostgreSQL 10, assuming you know that this is the root issue. <a href="https://www.postgresql.org/docs/current/sql-createstatistics.html">Create extended statistics</a> on the related columns, run an ANALYZE and you’re done!</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="k">CREATE</span> <span class="k">STATISTICS</span> <span class="n">pgqs_stats</span> <span class="k">ON</span> <span class="n">val1</span><span class="p">,</span> <span class="n">val2</span> <span class="k">FROM</span> <span class="n">pgqs</span><span class="p">;</span> <span class="k">CREATE</span> <span class="k">STATISTICS</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">ANALYZE</span> <span class="n">pgqs</span><span class="p">;</span> <span class="k">ANALYZE</span> <span class="n">rjuju</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="k">ANALYZE</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span 
class="n">pgqs</span> <span class="n">t1</span> <span class="k">JOIN</span> <span class="n">pgqs</span> <span class="n">t2</span> <span class="k">ON</span> <span class="n">t1</span><span class="p">.</span><span class="n">val1</span> <span class="o">=</span> <span class="n">t2</span><span class="p">.</span><span class="n">val1</span> <span class="k">WHERE</span> <span class="n">t1</span><span class="p">.</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">AND</span> <span class="n">t1</span><span class="p">.</span><span class="n">val2</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">order</span> <span class="k">by</span> <span class="n">t1</span><span class="p">.</span><span class="n">val2</span><span class="p">;</span> <span class="n">QUERY</span> <span class="n">PLAN</span> <span class="c1">-------------------------------------------------------------------------</span> <span class="n">Nested</span> <span class="n">Loop</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">25002</span> <span class="n">width</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pgqs</span> <span class="n">t1</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">0</span> <span class="n">loops</span><span class="o">=</span><span 
class="mi">1</span><span class="p">)</span> <span class="n">Filter</span><span class="p">:</span> <span class="p">((</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">AND</span> <span class="p">(</span><span class="n">val2</span> <span class="o">=</span> <span class="mi">0</span><span class="p">))</span> <span class="k">Rows</span> <span class="n">Removed</span> <span class="k">by</span> <span class="n">Filter</span><span class="p">:</span> <span class="mi">50000</span> <span class="o">-&gt;</span> <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="n">pgqs</span> <span class="n">t2</span> <span class="p">([...]</span> <span class="k">rows</span><span class="o">=</span><span class="mi">25002</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">(</span><span class="n">never</span> <span class="n">executed</span><span class="p">)</span> <span class="n">Filter</span><span class="p">:</span> <span class="p">(</span><span class="n">val1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">559</span> <span class="n">ms</span> <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">39</span><span class="p">.</span><span class="mi">471</span> <span class="n">ms</span> <span class="p">(</span><span class="mi">8</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure> <p>If you want more details on extended statistics, I recommend looking at the slides from <a href="https://blog.pgaddict.com/">Tomas Vondra</a>’s <a href="https://www.postgresql.eu/events/pgconfeu2018/sessions/session/2083/slides/130/create-statistics-what-is-it.pdf">excellent 
talk on this subject</a>.</p> <h3 id="going-further">Going further</h3> <p>Tracking the quals of every single query executed is of course quite expensive, and would significantly impact performance for any non-data-warehouse workload. That’s why <code class="language-plaintext highlighter-rouge">pg_qualstats</code> has an option, <strong>pg_qualstats.sample_rate</strong>, to sample the queries that will be processed. This setting defaults to <strong>1 / max_connections</strong>, which makes the overhead quite negligible, but don’t be surprised if you don’t see any qual reported after running a few queries!</p> <p>If you’re instead only interested in the quals that have a bad selectivity estimation, for instance to detect this class of problem rather than missing indexes, there are two new options available for that:</p> <ul> <li><strong>pg_qualstats.min_err_estimate_ratio</strong></li> <li><strong>pg_qualstats.min_err_estimate_num</strong></li> </ul> <p>Those options are cumulative and can be changed at any time. They limit the quals that pg_qualstats stores to the ones whose selectivity estimation error ratio and/or raw number is higher than what you ask for. Although those options help reduce the performance overhead, they can of course be combined with <strong>pg_qualstats.sample_rate</strong> if needed.</p> <h3 id="conclusion">Conclusion</h3> <p>After <a href="/postgresql/2020/01/06/pg_qualstats-2-global-index-advisor.html">introducing the new global index advisor</a>, this article presented a class of problems that DBAs frequently see, and how to detect and solve them.</p> <p>I believe that those two new features in pg_qualstats will greatly help PostgreSQL database administration. 
Also, external tools that aim to solve related issues, such as <a href="https://github.com/ossc-db/pg_plan_advsr">pg_plan_advsr</a> or <a href="https://github.com/postgrespro/aqo">AQO</a>, could also benefit from pg_qualstats, as they could directly get the exact data they need to perform their analysis and optimize the queries!</p> <p><a href="https://rjuju.github.io/postgresql/2020/02/28/pg_qualstats-2-selectivity-error.html">Planner selectivity estimation error statistics with pg_qualstats 2</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on February 28, 2020.</p> <![CDATA[New in pg13: New leader_pid column in pg_stat_activity]]> https://rjuju.github.io/postgresql/2020/02/06/new-in-pg13-leader_pid 2020-02-06T12:59:53+00:00 2020-02-06T12:59:53+00:00 Julien Rouhaud https://rjuju.github.io <h3 id="new-leader_pid-column-in-pg_stat_activity-view">New leader_pid column in pg_stat_activity view</h3> <p>Surprisingly, ever since parallel query was introduced in PostgreSQL 9.6, it has been impossible to know which backend a parallel worker was related to. So, as <a href="https://twitter.com/g_lelarge/status/1209486212190343168">Guillaume pointed out</a>, it is quite difficult to build simple tools that can sample the wait events related to all the processes involved in a query. A simple solution to that problem is to export the <code class="language-plaintext highlighter-rouge">lock group leader</code> information available in the backend at the SQL level:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit b025f32e0b5d7668daec9bfa957edf3599f4baa8 Author: Michael Paquier &lt;[email protected]&gt; Date: Thu Feb 6 09:18:06 2020 +0900 Add leader_pid to pg_stat_activity This new field tracks the PID of the group leader used with parallel query. For parallel workers and the leader, the value is set to the PID of the group leader. 
So, for the group leader, the value is the same as its own PID. Note that this reflects what PGPROC stores in shared memory, so as leader_pid is NULL if a backend has never been involved in parallel query. If the backend is using parallel query or has used it at least once, the value is set until the backend exits. Author: Julien Rouhaud Reviewed-by: Sergei Kornilov, Guillaume Lelarge, Michael Paquier, Tomas Vondra Discussion: https://postgr.es/m/CAOBaU_Yy5bt0vTPZ2_LUM6cUcGeqmYNoJ8-Rgto+c2+w3defYA@mail.gmail.com </code></pre></div></div> <p>With this change, you can now easily find all processes involved in a parallel query. For instance:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">query</span><span class="p">,</span> <span class="n">leader_pid</span><span class="p">,</span> <span class="n">array_agg</span><span class="p">(</span><span class="n">pid</span><span class="p">)</span> <span class="n">filter</span><span class="p">(</span><span class="k">WHERE</span> <span class="n">leader_pid</span> <span class="o">!=</span> <span class="n">pid</span><span class="p">)</span> <span class="k">AS</span> <span class="n">members</span> <span class="k">FROM</span> <span class="n">pg_stat_activity</span> <span class="k">WHERE</span> <span class="n">leader_pid</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">query</span><span class="p">,</span> <span class="n">leader_pid</span><span class="p">;</span> <span class="n">query</span> <span class="o">|</span> <span class="n">leader_pid</span> <span class="o">|</span> <span class="n">members</span> <span class="c1">-------------------+------------+---------------</span> <span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">t1</span><span class="p">;</span> <span 
class="o">|</span> <span class="mi">31630</span> <span class="o">|</span> <span class="p">{</span><span class="mi">32269</span><span class="p">,</span><span class="mi">32268</span><span class="p">}</span> <span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span></code></pre></figure> <p>Be careful: as mentioned in the commit message, if <code class="language-plaintext highlighter-rouge">leader_pid</code> is the same as <code class="language-plaintext highlighter-rouge">pid</code>, it doesn’t necessarily mean that the backend is currently performing a parallel query, as once set this field is never reset. Also, to avoid extra overhead, no additional lock is held while outputting the data, which means that each row is processed independently. So, while quite unlikely, in some circumstances you can get inconsistent data, such as a parallel worker pointing to a pid that has already disconnected.</p> <p><a href="https://rjuju.github.io/postgresql/2020/02/06/new-in-pg13-leader_pid.html">New in pg13: New leader_pid column in pg_stat_activity</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on February 06, 2020.</p> <![CDATA[pg qualstats 2: Suggestion d'index globale]]> https://rjuju.github.io/postgresqlfr/2020/01/06/pg_qualstats-2-suggestion-index-globale 2020-01-06T12:23:29+00:00 2020-01-06T12:23:29+00:00 Julien Rouhaud https://rjuju.github.io <p>Coming up with a good index suggestion can be a complex task. It requires knowledge of both the application queries and the database specificities. Over the years, many projects have tried to solve this problem, one of them being <a href="https://powa.readthedocs.io/">PoWA version 3</a>, with the help of the <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_qualstats.html">pg_qualstats extension</a>. 
This tool gives rather good index suggestions, but it requires installing and configuring PoWA, while some users would like to have only the global index suggestion. To address this need for simplicity, the algorithm used in PoWA is now available in pg_qualstats version 2, without requiring any additional component.</p> <p>EDIT: The <code class="language-plaintext highlighter-rouge">pg_qualstats_index_advisor()</code> function has been changed to return <strong>json</strong> rather than <strong>jsonb</strong>, in order to keep compatibility with PostgreSQL 9.3. The example queries have therefore also been modified to use <code class="language-plaintext highlighter-rouge">json_array_elements()</code> rather than <code class="language-plaintext highlighter-rouge">jsonb_array_elements()</code>.</p> <h3 id="quest-ce-que-pg_qualstats">What is pg_qualstats</h3> <p>A simple way to explain what pg_qualstats is would be to say that it is an extension similar to <a href="https://www.postgresql.org/docs/current/pgstatstatements.html">pg_stat_statements</a> but working at the predicate level.</p> <p>This extension saves useful statistics for <strong>WHERE</strong> and <strong>JOIN</strong> clauses: which table and column a predicate refers to, the number of times a predicate has been used, the number of executions of the underlying operator, whether the predicate comes from an index scan or not, the selectivity, the constant values and much more.</p> <p>Many things can be deduced from this information. 
For example, if you examine the predicates that reference different tables, you can find which tables are joined together, and how selective the join conditions are.</p> <h3 id="suggestion-globale-">Global suggestion?</h3> <p>As I mentioned, the global index suggestion added in pg_qualstats 2 uses the same approach as the one in PoWA, so this article can be used to describe how both tools work. The only difference is that you will probably get a better suggestion with PoWA, since more predicates will be available, and you will also be able to choose the time interval over which you want a suggestion of missing indexes.</p> <p>The important thing to remember here is that the suggestion is performed <strong>globally</strong>, that is, taking all the interesting predicates into account at the same time. This approach differs from all the other ones I am aware of, which only consider a single query at a time. In my opinion, a global approach is better, as it makes it possible to reduce the total number of indexes by maximizing the usefulness of multi-column indexes.</p> <h3 id="comment-marche-la-suggestion-globale">How the global suggestion works</h3> <p>The first step is to retrieve all the predicates that could benefit from new indexes. This is particularly easy to get with pg_qualstats. By keeping only the predicates that come from a sequential scan, are executed many times and filter many rows (both in number and in percentage), you get a perfect list of predicates that would very likely need an index (or, in some cases, a list of badly written queries). 
For instance, let’s consider the case of an application that would use these 4 predicates:</p> <p><a href="/images/global_advisor_1_quals.png"><img src="/images/global_advisor_1_quals.png" alt="List of all predicates found" /></a></p> <p>Next, we need to build the full set of paths of all the predicates joined by a logical AND that contain other predicates, which can themselves also be joined by logical ANDs. Using the same 4 predicates as before, we get these paths:</p> <p><a href="/images/global_advisor_2_graphs.png"><img src="/images/global_advisor_2_graphs.png" alt="Building all possible predicate paths" /></a></p> <p>Once all the paths are built, we only need to get the best path to find the best index to suggest. The ranking of those paths is for now done by giving each node of each path a weight corresponding to the number of simple predicates it contains, and summing the weights for each path. This is a very simple approach, which favors a minimal number of indexes that optimize as many queries as possible. With our example, we get:</p> <p><a href="/images/global_advisor_3_weighted.png"><img src="/images/global_advisor_3_weighted.png" alt="Adding a weight to all paths and choosing the highest score" /></a></p> <p>Of course, other ranking approaches could be used to take other parameters into account, and potentially get a better suggestion. For example, by also taking into account the number of executions or the selectivity of the predicates. 
If the read/write ratio for each table is known (which is available with the <a href="https://github.com/powa-team/powa-archivist">powa-archivist</a> extension), it would also be possible to adapt the ranking to limit index suggestions for tables that are almost exclusively accessed for writes. With this algorithm, these adjustments would be relatively simple to make.</p> <p>Once the best path is found, we can generate the index creation statement! As the column order can be important, the statement is generated by getting the columns of each node in ascending weight order. With our example, the following index is generated:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">t1</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">ts</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span></code></pre></figure> <p>Once the index is found, we simply remove the predicates it contains from the global list of predicates and start again from scratch until there are no predicates left.</p> <h3 id="un-peu-plus-de-détails-et-mise-en-garde">A bit more detail and caveats</h3> <p>Of course, this is a simplified version of the suggestion algorithm, as other information is required. For example, the list of predicates is actually adjusted with the <a href="https://www.postgresql.org/docs/current/indexes-opclass.html">operator classes and access methods</a> depending on the column type and its operator, to make sure that the suggested indexes are valid. 
If multiple index access methods are found for the same best path, <code class="language-plaintext highlighter-rouge">btree</code> will be chosen in priority.</p> <p>This brings us to another detail: this approach is mostly designed for <strong>btree</strong> indexes, for which the column order is critical. Other access methods do not require a specific column order, and for those access methods a more optimal suggestion might be possible if the column order was not taken into account.</p> <p>Another important point is that the operator classes and access methods are not hardcoded but retrieved at execution time from the local catalogs. Therefore, you can get different (and potentially better) results if you make sure that all the additional operator classes are available when you use the global index suggestion. This could be the <strong>btree_gist</strong> and <strong>btree_gin</strong> extensions, but also other index access methods. It is also possible that some types / operators do not have any associated access method in the catalogs. 
In that case, those predicates are returned separately in a list of predicates that cannot be optimized automatically, and for which a manual analysis is required.</p> <p>Finally, as pg_qualstats does not handle predicates made of expressions, the tool cannot suggest indexes on expressions, for example when using full-text search.</p> <h3 id="exemple-dutilisation">Usage example</h3> <p>A simple function is provided, with optional parameters, which returns a json value:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="n">pg_qualstats_index_advisor</span> <span class="p">(</span> <span class="n">min_filter</span> <span class="nb">integer</span> <span class="k">DEFAULT</span> <span class="mi">1000</span><span class="p">,</span> <span class="n">min_selectivity</span> <span class="nb">integer</span> <span class="k">DEFAULT</span> <span class="mi">30</span><span class="p">,</span> <span class="n">forbidden_am</span> <span class="nb">text</span><span class="p">[]</span> <span class="k">DEFAULT</span> <span class="s1">'{}'</span><span class="p">)</span> <span class="k">RETURNS</span> <span class="n">json</span></code></pre></figure> <p>The parameter names are self-explanatory:</p> <ul> <li><code class="language-plaintext highlighter-rouge">min_filter</code>: how many rows the predicate must filter on average to be taken into account by the global suggestion, <strong>1000</strong> by default;</li> <li><code class="language-plaintext highlighter-rouge">min_selectivity</code>: what the average selectivity of a predicate must be for it to be taken into account by the global suggestion, <strong>30%</strong> by default;</li> <li><code class="language-plaintext highlighter-rouge">forbidden_am</code>: list of index access methods to ignore. 
None by default, although for versions 9.6 and below <strong>hash indexes are ignored internally</strong>, since they are only safe starting with version 10.</li> </ul> <p>Here is a simple example, taken from the pg_qualstats regression tests:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">pgqs</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="s1">'a'</span> <span class="n">val</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="n">id</span><span class="p">;</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">adv</span> <span class="p">(</span><span class="n">id1</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">id2</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">id3</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">val</span> <span class="nb">text</span><span class="p">);</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">adv</span> <span class="k">SELECT</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="s1">'line '</span> <span class="o">||</span> <span class="n">i</span> <span class="k">from</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span> <span class="k">SELECT</span> <span class="n">pg_qualstats_reset</span><span class="p">();</span> <span class="k">SELECT</span> <span 
class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="k">SELECT</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">&lt;</span> <span class="mi">500</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">and</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">and</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">and</span> <span class="n">id2</span> <span class="o">=</span> <span class="mi">2</span> <span class="k">AND</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span 
class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">and</span> <span class="n">id2</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">AND</span> <span class="n">id3</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">AND</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">val</span> <span class="k">ILIKE</span> <span class="s1">'moh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">pgqs</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span></code></pre></figure> <p>And here is what the function returns:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="n">v</span> <span class="k">FROM</span> <span class="n">json_array_elements</span><span class="p">(</span> <span class="n">pg_qualstats_index_advisor</span><span class="p">(</span><span class="n">min_filter</span> <span class="o">=&gt;</span> <span class="mi">50</span><span class="p">)</span><span class="o">-&gt;</span><span class="s1">'indexes'</span><span class="p">)</span> <span class="n">v</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">v</span><span class="p">::</span><span class="nb">text</span> <span class="k">COLLATE</span> <span class="nv">"C"</span><span class="p">;</span> <span class="n">v</span> <span class="c1">---------------------------------------------------------------</span> <span 
class="nv">"CREATE INDEX ON public.adv USING btree (id1)"</span> <span class="nv">"CREATE INDEX ON public.adv USING btree (val, id1, id2, id3)"</span> <span class="nv">"CREATE INDEX ON public.pgqs USING btree (id)"</span> <span class="p">(</span><span class="mi">3</span> <span class="k">rows</span><span class="p">)</span> <span class="k">SELECT</span> <span class="n">v</span> <span class="k">FROM</span> <span class="n">json_array_elements</span><span class="p">(</span> <span class="n">pg_qualstats_index_advisor</span><span class="p">(</span><span class="n">min_filter</span> <span class="o">=&gt;</span> <span class="mi">50</span><span class="p">)</span><span class="o">-&gt;</span><span class="s1">'unoptimised'</span><span class="p">)</span> <span class="n">v</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">v</span><span class="p">::</span><span class="nb">text</span> <span class="k">COLLATE</span> <span class="nv">"C"</span><span class="p">;</span> <span class="n">v</span> <span class="c1">-----------------</span> <span class="nv">"adv.val ~~* ?"</span> <span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span></code></pre></figure> <p>La <a href="https://github.com/powa-team/pg_qualstats/">version 2 de pg_qualstats</a> n’est pas encore disponible en version stable, mais n’hésitez pas à la tester et <a href="https://github.com/powa-team/pg_qualstats/issues">rapporter tout problème que vous pourriez rencontrer</a> !</p> <p><a href="https://rjuju.github.io/postgresqlfr/2020/01/06/pg_qualstats-2-suggestion-index-globale.html">pg qualstats 2: Suggestion d'index globale</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on January 06, 2020.</p> <![CDATA[pg qualstats 2: Global index advisor]]> https://rjuju.github.io/postgresql/2020/01/06/pg_qualstats-2-global-index-advisor 2020-01-06T12:23:29+00:00 2020-01-06T12:23:29+00:00 Julien Rouhaud 
https://rjuju.github.io <p>Coming up with good index suggestions can be a complex task. It requires knowledge of both the application queries and the database specificities. Over the years multiple projects have tried to solve this problem, one of them being <a href="https://powa.readthedocs.io/">PoWA version 3</a>, with the help of the <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_qualstats.html">pg_qualstats extension</a>. It can give pretty good index suggestions, but it requires installing and configuring PoWA, while some users only want the global index advisor. For such cases, and for simplicity, the algorithm used in PoWA is now available in pg_qualstats version 2 without requiring any additional component.</p> <p>EDIT: The <code class="language-plaintext highlighter-rouge">pg_qualstats_index_advisor()</code> function has been changed to return <strong>json</strong> rather than <strong>jsonb</strong>, so that compatibility with PostgreSQL 9.3 is maintained. The query examples have therefore also been modified to use <code class="language-plaintext highlighter-rouge">json_array_elements()</code> rather than <code class="language-plaintext highlighter-rouge">jsonb_array_elements()</code>.</p> <h3 id="what-is-pg_qualstats">What is pg_qualstats</h3> <p>A simple way to explain what pg_qualstats is would be to say that it’s like <a href="https://www.postgresql.org/docs/current/pgstatstatements.html">pg_stat_statements</a> working at the predicate level.</p> <p>The extension will save useful statistics for <strong>WHERE</strong> and <strong>JOIN</strong> clauses: which table and column a predicate refers to, the number of times the predicate has been used, the number of executions of the underlying operator, whether it’s a predicate from an index scan or not, selectivity, constant values used and much more.</p> <p>You can deduce many things from such information. 
For instance, if you examine the predicates that contain references to different tables, you can find which tables are joined together, and how selective those join conditions are.</p> <h3 id="global-suggestion">Global suggestion?</h3> <p>As I mentioned, the global index advisor added in pg_qualstats 2 uses the same approach as the one in PoWA, so the explanation here will describe both tools. The only difference is that with PoWA you’ll likely get a better suggestion, as more predicates will be available, and you can also choose for which time interval you want to detect missing indexes.</p> <p>The important thing here is that the suggestion is performed <strong>globally</strong>, considering all interesting predicates at the same time. This approach differs from all the other approaches I have seen, which only consider a single query at a time. I believe that a global approach is better, as it’s possible to reduce the total number of indexes, maximizing the usefulness of multi-column indexes.</p> <h3 id="how-global-suggestion-is-done">How global suggestion is done</h3> <p>The first step is to gather all predicates that could benefit from a new index. This is easy to get with pg_qualstats: by filtering the predicates coming from sequential scans, executed many times, that filter many rows (both in number of rows and in percentage), you get a perfect list of predicates that are likely missing an index (or alternatively the list of poorly written queries in certain cases). For instance, let’s consider an application which uses these 4 predicates:</p> <p><a href="/images/global_advisor_1_quals.png"><img src="/images/global_advisor_1_quals.png" alt="List of all predicates found" /></a></p> <p>Next, we build the full set of paths with each AND-ed predicate that contains other, also possibly AND-ed, predicates. 
Using the same 4 predicates, we would get these paths:</p> <p><a href="/images/global_advisor_2_graphs.png"><img src="/images/global_advisor_2_graphs.png" alt="Build all possible paths of predicates" /></a></p> <p>Once all the paths are built, we just need to get the best path to find out the best index to suggest. The scoring is for now done by giving a weight to each node of each path, corresponding to the number of simple predicates it contains, and summing the weights for each path. This is very simple and makes it possible to favor a smaller number of indexes that optimize as many queries as possible. With our simple example, we get:</p> <p><a href="/images/global_advisor_3_weighted.png"><img src="/images/global_advisor_3_weighted.png" alt="Weight all paths and choose the highest score" /></a></p> <p>Of course, other scoring approaches could be used to take into account other parameters and give possibly better suggestions, for instance by factoring in the number of executions or the predicate selectivity. If the read/write ratio for each table is known (this is available using <a href="https://github.com/powa-team/powa-archivist">powa-archivist</a>), it would also be possible to adapt the scoring method to limit index suggestions for write-mostly tables. With this algorithm, all of that could be added quite easily.</p> <p>Once the best path is found, we can generate an index DDL! As the order of the columns can be important, this is done by getting the columns for each node in ascending weight order. 
In our example, we would generate this index:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">t1</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">ts</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span></code></pre></figure> <p>Once an index is found, we simply remove the contained predicates from the global list of predicates and start again from scratch until there are no predicates left.</p> <h3 id="additional-details-and-caveat">Additional details and caveat</h3> <p>Of course, this is a simplified version of the suggestion algorithm. Some additional information is required. For instance, the list of predicates is actually expanded with <a href="https://www.postgresql.org/docs/current/indexes-opclass.html">operator classes and access methods</a> depending on the column types and operators, to make sure that the suggested indexes are valid. If multiple index methods are found for a best path, <code class="language-plaintext highlighter-rouge">btree</code> will be chosen in priority.</p> <p>This brings another consideration: this approach is mostly designed for <strong>btree</strong> indexes, for which the column order is critical. Some other access methods don’t require a specific column order, and for those it could be possible to get better index suggestions if the column order weren’t considered.</p> <p>Another important point is that the operator classes and access methods are not hardcoded but retrieved at execution time using the local catalogs. Therefore, you can get different (and possibly better) results if you make sure that optional operator classes are present when using the index advisor. This could be the <strong>btree_gist</strong> or <strong>btree_gin</strong> extensions, but also other access methods. 
It’s also possible that some type / operator combinations don’t have any associated access method recorded in the catalogs. In this case, those predicates are returned separately as a list of unoptimizable predicates, which should be manually analyzed.</p> <p>Finally, as pg_qualstats doesn’t consider expression predicates, this advisor can’t suggest indexes on expressions, for instance if you’re using fulltext search.</p> <h3 id="usage-example">Usage example</h3> <p>A simple function is provided, with optional parameters, that returns a json value:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="n">pg_qualstats_index_advisor</span> <span class="p">(</span> <span class="n">min_filter</span> <span class="nb">integer</span> <span class="k">DEFAULT</span> <span class="mi">1000</span><span class="p">,</span> <span class="n">min_selectivity</span> <span class="nb">integer</span> <span class="k">DEFAULT</span> <span class="mi">30</span><span class="p">,</span> <span class="n">forbidden_am</span> <span class="nb">text</span><span class="p">[]</span> <span class="k">DEFAULT</span> <span class="s1">'{}'</span><span class="p">)</span> <span class="k">RETURNS</span> <span class="n">json</span></code></pre></figure> <p>The parameter names are self-explanatory:</p> <ul> <li><code class="language-plaintext highlighter-rouge">min_filter</code>: how many tuples a predicate should filter on average to be considered for the global optimization, by default <strong>1000</strong>.</li> <li><code class="language-plaintext highlighter-rouge">min_selectivity</code>: how selective a predicate should be on average to be considered for the global optimization, by default <strong>30%</strong>.</li> <li><code class="language-plaintext highlighter-rouge">forbidden_am</code>: list of access methods to ignore. 
None by default, although for PostgreSQL 9.6 and prior <strong>hash indexes will internally be discarded</strong>, as those are only safe since version 10.</li> </ul> <p>Using pg_qualstats regression tests, let’s see a simple example:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">pgqs</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="s1">'a'</span> <span class="n">val</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="n">id</span><span class="p">;</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">adv</span> <span class="p">(</span><span class="n">id1</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">id2</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">id3</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">val</span> <span class="nb">text</span><span class="p">);</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">adv</span> <span class="k">SELECT</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="s1">'line '</span> <span class="o">||</span> <span class="n">i</span> <span class="k">from</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span> <span class="k">SELECT</span> <span class="n">pg_qualstats_reset</span><span class="p">();</span> <span class="k">SELECT</span> <span class="o">*</span> <span 
class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="k">SELECT</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">&lt;</span> <span class="mi">500</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">and</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">and</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">and</span> <span class="n">id2</span> <span class="o">=</span> <span class="mi">2</span> <span class="k">AND</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span 
class="n">adv</span> <span class="k">WHERE</span> <span class="n">id1</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">and</span> <span class="n">id2</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">AND</span> <span class="n">id3</span> <span class="o">=</span> <span class="mi">6</span> <span class="k">AND</span> <span class="n">val</span> <span class="o">=</span> <span class="s1">'meh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">adv</span> <span class="k">WHERE</span> <span class="n">val</span> <span class="k">ILIKE</span> <span class="s1">'moh'</span><span class="p">;</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">pgqs</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span></code></pre></figure> <p>And here’s what the function returns:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="n">v</span> <span class="k">FROM</span> <span class="n">json_array_elements</span><span class="p">(</span> <span class="n">pg_qualstats_index_advisor</span><span class="p">(</span><span class="n">min_filter</span> <span class="o">=&gt;</span> <span class="mi">50</span><span class="p">)</span><span class="o">-&gt;</span><span class="s1">'indexes'</span><span class="p">)</span> <span class="n">v</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">v</span><span class="p">::</span><span class="nb">text</span> <span class="k">COLLATE</span> <span class="nv">"C"</span><span class="p">;</span> <span class="n">v</span> <span class="c1">---------------------------------------------------------------</span> <span class="nv">"CREATE INDEX ON public.adv 
USING btree (id1)"</span> <span class="nv">"CREATE INDEX ON public.adv USING btree (val, id1, id2, id3)"</span> <span class="nv">"CREATE INDEX ON public.pgqs USING btree (id)"</span> <span class="p">(</span><span class="mi">3</span> <span class="k">rows</span><span class="p">)</span> <span class="k">SELECT</span> <span class="n">v</span> <span class="k">FROM</span> <span class="n">json_array_elements</span><span class="p">(</span> <span class="n">pg_qualstats_index_advisor</span><span class="p">(</span><span class="n">min_filter</span> <span class="o">=&gt;</span> <span class="mi">50</span><span class="p">)</span><span class="o">-&gt;</span><span class="s1">'unoptimised'</span><span class="p">)</span> <span class="n">v</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">v</span><span class="p">::</span><span class="nb">text</span> <span class="k">COLLATE</span> <span class="nv">"C"</span><span class="p">;</span> <span class="n">v</span> <span class="c1">-----------------</span> <span class="nv">"adv.val ~~* ?"</span> <span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span></code></pre></figure> <p>The <a href="https://github.com/powa-team/pg_qualstats/">version 2 of pg_qualstats</a> is not released yet, but feel free to test it and <a href="https://github.com/powa-team/pg_qualstats/issues">report any issue you may find</a>!</p> <p><a href="https://rjuju.github.io/postgresql/2020/01/06/pg_qualstats-2-global-index-advisor.html">pg qualstats 2: Global index advisor</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on January 06, 2020.</p> <![CDATA[PoWA 4: Nouveau daemon powa-collector]]> https://rjuju.github.io/postgresqlfr/2019/12/10/powa-4-nouveau-powa-collector 2019-12-10T18:54:17+00:00 2019-12-10T18:54:17+00:00 Julien Rouhaud https://rjuju.github.io <p>This article is part of a series of articles on <a href="http://powa.readthedocs.io/">the 
PoWA 4 beta</a>, and describes the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a>.</p> <h3 id="nouveau-daemon-powa-collector">New <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a></h3> <p>This daemon replaces the previous <em>background worker</em> when the new <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">remote mode</a> is used. It is a simple daemon written in Python, which takes care of all the steps required to perform <em>remote snapshots</em>. It is <a href="https://pypi.org/project/powa-collector/">available on PyPI</a>.</p> <p>As I explained in my <a href="/postgresql/2019/05/17/powa-4-with-remote-mode-beta-is-available.html">previous article introducing PoWA 4</a>, this daemon is required for a remote mode setup, with this architecture in mind:</p> <p><a href="/images/powa_4_remote.svg"><img src="/images/powa_4_remote.svg" alt="PoWA 4 remote mode architecture" /></a></p> <p>Its configuration is very simple. 
All you need to do is rename the provided <code class="language-plaintext highlighter-rouge">powa-collector.conf.sample</code> file and adapt the <a href="https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING">connection URI</a> to describe how to connect to your dedicated <em>repository server</em>, and you’re done.</p> <p>A typical configuration should look like:</p> <figure class="highlight"><pre><code class="language-conf" data-lang="conf">{ <span class="s2">"repository"</span>: { <span class="s2">"dsn"</span>: <span class="s2">"postgresql://powa_user@server_dns:5432/powa"</span> }, <span class="s2">"debug"</span>: <span class="n">true</span> }</code></pre></figure> <p>The list of <em>remote servers</em>, their configuration and everything else needed for proper operation will be automatically retrieved from the <em>repository server</em> you already configured. Once started, it will spawn one dedicated thread per declared <em>remote server</em>, and maintain a <strong>persistent connection</strong> to this <em>remote server</em>. Each thread will perform a <em>remote snapshot</em>, exporting the data to the <em>repository server</em> using the new <em>source functions</em>. Each thread will open and close a connection to the <em>repository server</em> when performing the <em>remote snapshot</em>.</p> <p>Obviously, this daemon needs to be able to connect to all the declared <em>remote servers</em> as well as the <em>repository server</em>. The <code class="language-plaintext highlighter-rouge">powa_servers</code> table, which stores the list of <em>remote servers</em>, has fields to store the username and password used to connect to the <em>remote servers</em>. Storing a plain-text password in this table is a heresy as far as security is concerned. 
So, as mentioned in the <a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">PoWA security section</a>, you can store a NULL password and <a href="https://www.postgresql.org/docs/current/auth-methods.html">instead use any of the other authentication methods supported by libpq</a> (.pgpass file, certificate…). This is very strongly recommended for any serious installation.</p> <p>The persistent connection to the <em>repository server</em> is used to monitor the daemon:</p> <ul> <li>to check that the daemon is up and running</li> <li>to communicate through the UI using a <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/protocol.html">simple protocol</a> in order to perform various actions (reload the configuration, check the status of a thread dedicated to a <em>remote server</em>…)</li> </ul> <p>Note that you can also ask the daemon to reload its configuration by sending a SIGHUP to the daemon process. A reload is required after any modification to the list of remote servers (adding or removing a <em>remote server</em>, or updating an existing one).</p> <p>Please also note that, by choice, <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector</a> will not perform <em>local snapshots</em>. 
If you want to use PoWA for the <em>repository server</em>, you will need to enable the original <em>background worker</em>.</p> <h5 id="nouvelle-page-de-configuration">New configuration page</h5> <p>The configuration page has been updated to give all the necessary information about the status of the background worker and of the <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a> (including all of its dedicated threads), as well as the list of declared <em>remote servers</em>. Here is an example of this new root configuration page:</p> <p><a href="/images/powa_4_configuration_page.png"><img src="/images/powa_4_configuration_page.png" alt="New configuration page" /></a></p> <p>If the <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a> is used, the status of each remote server will be retrieved using the communication protocol. If the collector encounters errors (when connecting to a <em>remote server</em> or during a <em>snapshot</em>, for example), they will also be displayed here. Note that these errors will also be displayed at the top of every page of the UI, to make sure you don’t miss them.</p> <p>In addition, the configuration section now has a hierarchy, and you can see the list of extensions as well as the current PostgreSQL configuration for the <strong>local</strong> or <strong>remote</strong> server by clicking on the server of your choice!</p> <p>There is also a new <strong>Reload collector</strong> button in the header banner which, as you would expect, asks the collector to reload its configuration. 
This can be useful if you have declared new servers but do not have access to the server on which the collector is running.</p> <h3 id="conclusion">Conclusion</h3> <p>This article is the last in the series about the new version of PoWA. It is still in beta, so feel free to test it, <a href="https://powa.readthedocs.io/en/latest/support.html#support">report any bug you encounter</a> or give any other feedback!</p> <p><a href="https://rjuju.github.io/postgresqlfr/2019/12/10/powa-4-nouveau-powa-collector.html">PoWA 4: Nouveau daemon powa-collector</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on December 10, 2019.</p> <![CDATA[PoWA 4: New powa-collector daemon]]> https://rjuju.github.io/postgresql/2019/12/10/powa-4-new-powa-collector 2019-12-10T18:54:17+00:00 2019-12-10T18:54:17+00:00 Julien Rouhaud https://rjuju.github.io <p>This article is part of the <a href="http://powa.readthedocs.io/">PoWA 4 beta</a> series, and describes the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a>.</p> <h3 id="new-powa-collector-daemon">New <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a></h3> <p>This daemon replaces the previous <em>background worker</em> when using the <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">new remote mode</a>. It’s a simple daemon written in Python, which handles all the steps required to perform <em>remote snapshots</em>. 
It’s <a href="https://pypi.org/project/powa-collector/">available on PyPI</a>.</p> <p>As I explained in my <a href="/postgresql/2019/05/17/powa-4-with-remote-mode-beta-is-available.html">previous article introducing PoWA 4</a>, this daemon is required for a remote mode setup, with this architecture in mind:</p> <p><a href="/images/powa_4_remote.svg"><img src="/images/powa_4_remote.svg" alt="PoWA 4 remote architecture" /></a></p> <p>Its configuration is very simple. All you need to do is copy and rename the provided <code class="language-plaintext highlighter-rouge">powa-collector.conf.sample</code> file, and adapt the <a href="https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING">connection URI</a> to describe how to connect to your dedicated <em>repository server</em>, and you’re done.</p> <p>A typical configuration will look like:</p> <figure class="highlight"><pre><code class="language-conf" data-lang="conf">{ <span class="s2">"repository"</span>: { <span class="s2">"dsn"</span>: <span class="s2">"postgresql://powa_user@server_dns:5432/powa"</span> }, <span class="s2">"debug"</span>: <span class="n">true</span> }</code></pre></figure> <p>The list of <em>remote servers</em>, their configuration and everything else it needs will be automatically retrieved from the <em>repository server</em> you just configured. When started, it’ll spawn one dedicated thread per declared <em>remote server</em>, and maintain a <strong>persistent connection</strong> to the configured <strong>powa database</strong> on this <em>remote server</em>. Each thread will perform a <em>remote snapshot</em>, exporting the data to the <em>repository server</em> using the new <em>source functions</em>. 
Each thread will open and close a connection to the <em>repository server</em> when performing the <em>remote snapshot</em>.</p> <p>This daemon obviously needs to be able to connect to all the declared <em>remote servers</em> and the <em>repository server</em>. 
The <code class="language-plaintext highlighter-rouge">powa_servers</code> table, which stores the list of <em>remote servers</em>, has fields to store the username and password used to connect to the <em>remote server</em>. Storing a password in plain text in this table is a heresy as far as security is concerned. So, as mentioned in the <a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">PoWA security documentation</a>, you can store a NULL password and <a href="https://www.postgresql.org/docs/current/auth-methods.html">instead use any of the authentication methods that libpq supports</a> (.pgpass file, certificate…). That’s strongly recommended for any non-toy setup.</p> <p>The persistent connection to the <em>repository server</em> is used to monitor the daemon:</p> <ul> <li>to check that the daemon is up and running</li> <li>to communicate through the UI using a <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/protocol.html">simple protocol</a> to perform various actions (reload the configuration, check the status of a <em>remote server</em> thread…)</li> </ul> <p>Note that you can also ask the daemon to reload its configuration by sending a SIGHUP to the daemon process. A reload is required after any modification to the list of remote servers (if you added or removed a <em>remote server</em>, or updated a setting for an existing one).</p> <p>Also note that by choice, <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector</a> will not perform <em>local snapshots</em>. 
If you want to use PoWA for the <em>repository server</em>, you need to enable the original <em>background worker</em>.</p> <h5 id="new-configuration-page">New configuration page</h5> <p>The configuration page is now updated to give all the needed information about the background worker status and the <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a> status (including all of its dedicated threads) and the list of registered <em>remote servers</em>. Here’s an example of the new root configuration page:</p> <p><a href="/images/powa_4_configuration_page.png"><img src="/images/powa_4_configuration_page.png" alt="New configuration page" /></a></p> <p>If the <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a> is used, each remote server status will be retrieved using the communication protocol. If the collector encountered any errors (connecting to a <em>remote server</em>, during a <em>snapshot</em> or anything else), they’ll also be displayed here. Note that such errors will also be displayed at the top of every page of the UI, so that you can’t miss them.</p> <p>Also, the configuration section now has a hierarchy, and you’ll be able to see the list of extensions and the current PostgreSQL configuration for the <strong>local</strong> or <strong>remote servers</strong> by clicking on the server of your choice!</p> <p>There’s also a new <strong>Reload collector</strong> button on the header panel, which, as expected, will ask the collector to reload its configuration. That can be useful if you registered new servers and you don’t have access to the server where the collector is running.</p> <h3 id="conclusion">Conclusion</h3> <p>This is the last article introducing the new version of PoWA. 
It’s still in beta, so feel free to test it, <a href="https://powa.readthedocs.io/en/latest/support.html#support">report any issue you may find</a> or give any other feedback!</p> <p><a href="https://rjuju.github.io/postgresql/2019/12/10/powa-4-new-powa-collector.html">PoWA 4: New powa-collector daemon</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on December 10, 2019.</p> <![CDATA[PoWA 4: nouveautés dans powa-archivist !]]> https://rjuju.github.io/postgresqlfr/2019/06/05/powa-4-nouveaute-dans-powa-archivist 2019-06-05T14:26:17+00:00 2019-06-05T14:26:17+00:00 Julien Rouhaud https://rjuju.github.io <p>This article is part of a series about <a href="http://powa.readthedocs.io/">the PoWA 4 beta</a>, and describes the changes made in <a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/index.html">powa-archivist</a>.</p> <p>For more information about this version 4, you can read <a href="/postgresqlfr/2019/05/17/powa-4-avec-mode-remote-disponible-en-beta.html">the general introduction article</a>.</p> <h3 id="aperçu-rapide">Quick overview</h3> <p>First of all, you should know that there is no upgrade path from v3 to v4, so a <code class="language-plaintext highlighter-rouge">DROP EXTENSION powa</code> is required if you were already using PoWA on your servers. This is because v4 brings <strong>a great many</strong> changes to the SQL part of the extension, which makes it the most significant change in the PoWA suite for this new version. 
At the time of writing, the amount of changes made to this extension is:</p> <figure class="highlight"><pre><code class="language-diff" data-lang="diff"> CHANGELOG.md | 14 + powa--4.0.0dev.sql | 2075 +++++++++++++++++++++------- powa.c | 44 +- 3 files changed, 1629 insertions(+), 504 deletions(-)</code></pre></figure> <p>The lack of an upgrade path should not be a problem in practice. PoWA is a performance analysis tool, designed to hold high-precision data but with a very limited history. If you are looking for a general-purpose monitoring solution that keeps months of data, PoWA is definitely not the tool you need.</p> <h3 id="configurer-la-liste-des-serveurs-distants">Configuring the list of <em>remote servers</em></h3> <p>As for the changes themselves, the first small change is that the <a href="https://www.postgresql.org/docs/current/bgworker.html">background worker</a> is no longer required for powa-archivist to work, because it is not used in remote mode. This means that a PostgreSQL restart is no longer needed to install PoWA. Of course, a restart is still required if you want to use local mode, which relies on the background worker, or if you want to install additional extensions that themselves require a restart.</p> <p>Then, since PoWA requires some configuration (snapshot frequency, data retention and so on), some new tables are added to make all of this configurable. The new <code class="language-plaintext highlighter-rouge">powa_servers</code> table stores the configuration of all the remote instances whose data should be stored on this instance. 
This <em>local PoWA instance</em> is called a <strong>repository server</strong> (which should typically be dedicated to storing PoWA data), as opposed to the <strong>remote instances</strong>, which are the instances you want to monitor. The content of this table is as simple as it gets:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="err">\</span><span class="n">d</span> <span class="n">powa_servers</span> <span class="k">Table</span> <span class="nv">"public.powa_servers"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span> <span class="c1">-----------+----------+-----------+----------+------------------------------------------</span> <span class="n">id</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'powa_servers_id_seq'</span><span class="p">::</span><span class="n">regclass</span><span class="p">)</span> <span class="n">hostname</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">alias</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="n">port</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">username</span> <span class="o">|</span> <span class="nb">text</span> 
<span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">password</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="n">dbname</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">frequency</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">300</span> <span class="n">powa_coalesce</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">100</span> <span class="n">retention</span> <span class="o">|</span> <span class="n">interval</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'1 day'</span><span class="p">::</span><span class="n">interval</span></code></pre></figure> <p>If you have already used PoWA, you should recognize most of the configuration options, which are now stored here. The new options describe how to connect to the <em>remote instances</em>, and can provide an alias to display in the UI.</p> <p>You have probably also noticed a <strong>password</strong> column. Storing a password in plain text in this table is a heresy for anyone who cares about even minimal security. 
So, as mentioned in the <a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">security section of the PoWA documentation</a>, you can store NULL in the password field and instead use <a href="https://www.postgresql.org/docs/current/auth-methods.html">any of the other authentication methods supported by libpq</a> (.pgpass file, certificate…). A more secure authentication method is strongly recommended for any serious setup.</p> <p>Another table, <code class="language-plaintext highlighter-rouge">powa_snapshot_metas</code>, is also added to store some metadata about the snapshot information for each <em>remote server</em>.</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="k">Table</span> <span class="nv">"public.powa_snapshot_metas"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span> <span class="c1">--------------+--------------------------+-----------+----------+---------------------------------------</span> <span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">coalesce_seq</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">1</span> <span class="n">snapts</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span 
class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="n">aggts</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="n">purgets</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="n">errors</span> <span class="o">|</span> <span class="nb">text</span><span class="p">[]</span></code></pre></figure> <p>It is simply a counter of the number of snapshots performed, a timestamp for each kind of event that occurred (snapshot, aggregation and purge) and a text array storing any error that occurred during the snapshot, so that the UI can display it.</p> <h3 id="api-sql-pour-configurer-les-serveurs-distants">SQL API to configure the <em>remote servers</em></h3> <p>Although these tables are very simple, a <a href="https://powa.readthedocs.io/en/latest/remote_setup.html#configure-powa-and-stats-extensions-on-each-remote-server">basic SQL API is available to 
register new servers and configure them</a>. Six basic functions are available:</p> <ul> <li><code class="language-plaintext highlighter-rouge">powa_register_server()</code>, to declare a new <em>remote server</em>, along with the list of extensions available on it</li> <li><code class="language-plaintext highlighter-rouge">powa_configure_server()</code> to update one of the settings of the specified <em>remote server</em> (using a JSON parameter, where the key is the name of the setting to change and the value is the new value to use)</li> <li><code class="language-plaintext highlighter-rouge">powa_deactivate_server()</code> to disable snapshots for the specified <em>remote server</em> (which in practice sets the <code class="language-plaintext highlighter-rouge">frequency</code> setting to <strong>-1</strong>)</li> <li><code class="language-plaintext highlighter-rouge">powa_delete_and_purge_server()</code> to remove the specified <em>remote server</em> from the list of servers and delete all associated snapshot data</li> <li><code class="language-plaintext highlighter-rouge">powa_activate_extension()</code>, to declare that a new extension is available on the specified <em>remote server</em></li> <li><code class="language-plaintext highlighter-rouge">powa_deactivate_extension()</code>, to specify that an extension is no longer available on the specified <em>remote server</em></li> </ul> <p>Any action more complicated than that has to be performed with SQL queries. Fortunately, there should not be many other needs, and the tables are really simple, so this should not be a problem. <a href="https://github.com/powa-team/powa-archivist/issues">Feel free to ask for new functions</a> if you have other needs, though. 
Please also note that the UI does not let you call these functions, since it is for now <strong>entirely read-only</strong>.</p> <h3 id="effectuer-des-snapshots-distants">Performing <em>remote snapshots</em></h3> <p>Since metrics are now stored on a different PostgreSQL instance, we extensively changed the way <em>snapshots</em> (retrieving the data provided by a <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">stat extension</a> and storing it in the PoWA catalog <a href="/postgresqlfr/2019/04/06/minimiser-le-surcout-de-stockage-par-ligne.html">in a space-efficient way</a>) are performed.</p> <p>The list of all stat extensions, or <em>data sources</em>, that are available on a <strong>server</strong> (either <em>remote</em> or <em>local</em>) and for which a <em>snapshot</em> should be performed is stored in a table called <code class="language-plaintext highlighter-rouge">powa_functions</code>:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="k">Table</span> <span class="nv">"public.powa_functions"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span> <span class="c1">----------------+---------+-----------+----------+---------</span> <span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">module</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> 
<span class="o">|</span> <span class="k">operation</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">function_name</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">query_source</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="n">added_manually</span> <span class="o">|</span> <span class="nb">boolean</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">true</span> <span class="n">enabled</span> <span class="o">|</span> <span class="nb">boolean</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">true</span> <span class="n">priority</span> <span class="o">|</span> <span class="nb">numeric</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">10</span></code></pre></figure> <p>A new <code class="language-plaintext highlighter-rouge">query_source</code> field has been added. It provides the name of the <em>source function</em>, which a <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">stat extension</a> needs in order to be compatible with remote snapshots. This function is used to export the counters provided by the extension to a different server, into a dedicated <em>transient table</em>. 
When remote mode is used, the <em>snapshot</em> function will then automatically perform the <em>snapshot</em> using this exported data rather than the data provided by the local stat extension. Note that both the export of these counters and the remote snapshot are performed automatically by the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a>, which I will cover in another article.</p> <p>Here is an example showing how PoWA performs a <em>remote snapshot</em> of the list of databases. As you will see, it is very simple, which means that it is just as simple to add the same compatibility to a new stat extension.</p> <p>The <em>transient table</em>:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="n">Unlogged</span> <span class="k">table</span> <span class="nv">"public.powa_databases_src_tmp"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span> <span class="c1">---------+---------+-----------+----------+---------</span> <span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">datname</span> <span class="o">|</span> <span class="n">name</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span></code></pre></figure> 
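<p>As a rough sketch only, a table with the structure shown above could be declared with the statement below. The DDL is reconstructed from the <code class="language-plaintext highlighter-rouge">\d</code> output and is not copied from the extension script, and the inserted row uses made-up values (server id <strong>1</strong>, an arbitrary oid and database name) purely for illustration:</p>

<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Sketch only: DDL reconstructed from the \d output above,
-- not taken from the powa-archivist extension script.
CREATE UNLOGGED TABLE public.powa_databases_src_tmp (
    srvid   integer NOT NULL,
    oid     oid     NOT NULL,
    datname name    NOT NULL
);

-- Hypothetical row, as it could look once exported from remote server 1.
INSERT INTO public.powa_databases_src_tmp (srvid, oid, datname)
VALUES (1, 16384, 'app_db');</code></pre></figure>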
<p>For better performance, all <em>transient tables</em> are <strong>unlogged</strong>, since their content is only needed during a <em>snapshot</em> and is discarded right afterwards. In this example, the <em>transient table</em> only stores the identifier of the remote server the data belongs to, plus the oid and the name of each database present on the <em>remote server</em>.</p> <p>And the <em>source function</em>:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="k">public</span><span class="p">.</span><span class="n">powa_databases_src</span><span class="p">(</span><span class="n">_srvid</span> <span class="nb">integer</span><span class="p">,</span> <span class="k">OUT</span> <span class="n">oid</span> <span class="n">oid</span><span class="p">,</span> <span class="k">OUT</span> <span class="n">datname</span> <span class="n">name</span><span class="p">)</span> <span class="k">RETURNS</span> <span class="k">SETOF</span> <span class="n">record</span> <span class="k">LANGUAGE</span> <span class="n">plpgsql</span> <span class="k">AS</span> <span class="err">$</span><span class="k">function</span><span class="err">$</span> <span class="k">BEGIN</span> <span class="n">IF</span> <span class="p">(</span><span class="n">_srvid</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">THEN</span> <span class="k">RETURN</span> <span class="n">QUERY</span> <span class="k">SELECT</span> <span class="n">d</span><span class="p">.</span><span class="n">oid</span><span class="p">,</span> <span class="n">d</span><span class="p">.</span><span class="n">datname</span> <span class="k">FROM</span> <span class="n">pg_database</span> <span class="n">d</span><span class="p">;</span> <span class="k">ELSE</span> <span 
class="k">RETURN</span> <span class="n">QUERY</span> <span class="k">SELECT</span> <span class="n">d</span><span class="p">.</span><span class="n">oid</span><span class="p">,</span> <span class="n">d</span><span class="p">.</span><span class="n">datname</span> <span class="k">FROM</span> <span class="n">powa_databases_src_tmp</span> <span class="n">d</span> <span class="k">WHERE</span> <span class="n">srvid</span> <span class="o">=</span> <span class="n">_srvid</span><span class="p">;</span> <span class="k">END</span> <span class="n">IF</span><span class="p">;</span> <span class="k">END</span><span class="p">;</span> <span class="err">$</span><span class="k">function</span><span class="err">$</span></code></pre></figure> <p>This function simply returns the content of <code class="language-plaintext highlighter-rouge">pg_database</code> if local data is requested (server identifier <strong>0</strong> is always the local server), or the content of the <em>transient table</em> for the specified remote server.</p> <p>The <em>snapshot function</em> can then easily do any processing with this data for the desired <em>remote server</em>. 
In the case of the <code class="language-plaintext highlighter-rouge">powa_databases_snapshot()</code> function, it simply synchronizes the list of databases, and stores the removal timestamp if a previously existing database is no longer listed.</p> <p>For more details, you can read the documentation on <a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/development.html">adding a data source in PoWA</a>, which has been updated for the specifics of version 4.</p> <p><a href="https://rjuju.github.io/postgresqlfr/2019/06/05/powa-4-nouveaute-dans-powa-archivist.html">PoWA 4: nouveautés dans powa-archivist !</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on June 05, 2019.</p> <![CDATA[PoWA 4: changes in powa-archivist!]]> https://rjuju.github.io/postgresql/2019/06/05/powa-4-new-in-powa-archivist 2019-06-05T14:26:17+00:00 2019-06-05T14:26:17+00:00 Julien Rouhaud https://rjuju.github.io <p>This article is part of the <a href="http://powa.readthedocs.io/">PoWA 4 beta</a> series, and describes the changes made in <a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/index.html">powa-archivist</a>.</p> <p>For more information about this v4, you can consult the <a href="/postgresql/2019/05/17/powa-4-with-remote-mode-beta-is-available.html">general introduction article</a>.</p> <h3 id="quick-overview">Quick overview</h3> <p>First of all, you should know that there is no upgrade possible from v3 to v4, so a <code class="language-plaintext highlighter-rouge">DROP EXTENSION powa</code> is required if you were already using PoWA on any of your servers. This is because this v4 involved <strong>a lot</strong> of changes in the SQL part of the extension, making it the most significant change in the PoWA suite for this new version. 
Looking at the amount of changes at the time I’m writing this article, I get:</p> <figure class="highlight"><pre><code class="language-diff" data-lang="diff"> CHANGELOG.md | 14 + powa--4.0.0dev.sql | 2075 +++++++++++++++++++++------- powa.c | 44 +- 3 files changed, 1629 insertions(+), 504 deletions(-)</code></pre></figure> <p>The lack of upgrade shouldn’t be a problem in practice though. PoWA is a performance tool, so it’s intended to have data with high precision but with a very limited history. If you’re looking for a general monitoring solution keeping months of counters, PoWA is definitely not the tool you need.</p> <h3 id="configuring-the-list-of-remote-servers">Configuring the list of <em>remote servers</em></h3> <p>Concerning the features themselves, the first small change is that powa-archivist does not require the <a href="https://www.postgresql.org/docs/current/bgworker.html">background worker</a> to be active anymore, as it won’t be used for remote setup. That means that a PostgreSQL restart is not needed anymore to install PoWA. Obviously, a restart is still required if you want to use the local setup, using the background worker, or if you want to install additional extensions that themselves require a restart.</p> <p>Then, as PoWA needs some configuration (frequency of snapshot, data retention and so on), some new tables are added to be able to configure all of that. The new <code class="language-plaintext highlighter-rouge">powa_servers</code> table stores the configuration for all the remote instances whose data should be stored on this instance. This <em>local PoWA instance</em> is called a <strong>repository server</strong> (that typically should be dedicated to storing PoWA data), as opposed to <strong>remote instances</strong> which are the instances you want to monitor. 
The content of this table is pretty straightforward:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="err">\</span><span class="n">d</span> <span class="n">powa_servers</span> <span class="k">Table</span> <span class="nv">"public.powa_servers"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span> <span class="c1">-----------+----------+-----------+----------+------------------------------------------</span> <span class="n">id</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'powa_servers_id_seq'</span><span class="p">::</span><span class="n">regclass</span><span class="p">)</span> <span class="n">hostname</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">alias</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="n">port</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">username</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">password</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span 
class="o">|</span> <span class="n">dbname</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">frequency</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">300</span> <span class="n">powa_coalesce</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">100</span> <span class="n">retention</span> <span class="o">|</span> <span class="n">interval</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'1 day'</span><span class="p">::</span><span class="n">interval</span></code></pre></figure> <p>If you already used PoWA, you should recognize most of the configuration options, which are now stored here. The new options are used to describe how to connect to the <em>remote servers</em>, and can provide an alias to be displayed in the UI.</p> <p>You also probably noticed a <strong>password</strong> column here. Storing a password in plain text in this table is a heresy as far as security is concerned. So, as mentioned in the <a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">PoWA security section of the documentation</a>, you can store a NULL password and instead use <a href="https://www.postgresql.org/docs/current/auth-methods.html">any of the authentication methods that libpq supports</a> (.pgpass file, certificate…). 
That’s strongly recommended for any non-toy setup.</p> <p>Another table, the <code class="language-plaintext highlighter-rouge">powa_snapshot_metas</code> table, is also added to store some metadata about the snapshot information of each <em>remote server</em>:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="k">Table</span> <span class="nv">"public.powa_snapshot_metas"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span> <span class="c1">--------------+--------------------------+-----------+----------+---------------------------------------</span> <span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">coalesce_seq</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">1</span> <span class="n">snapts</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="n">aggts</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> 
<span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="n">purgets</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="s1">'-infinity'</span><span class="p">::</span><span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="n">errors</span> <span class="o">|</span> <span class="nb">text</span><span class="p">[]</span></code></pre></figure> <p>That’s basically a counter to track the number of snapshots done, the timestamp for each kind of event that happened (snapshot, aggregate and purge), and a text array to store any error happening during the snapshot, that the UI can display.</p> <h3 id="sql-api-to-configure-the-remote-servers">SQL API to configure the <em>remote servers</em></h3> <p>While those tables are simple, a <a href="https://powa.readthedocs.io/en/latest/remote_setup.html#configure-powa-and-stats-extensions-on-each-remote-server">basic SQL API is available to register new servers and configure them</a>. 
Basically, 6 functions are available:</p> <ul> <li><code class="language-plaintext highlighter-rouge">powa_register_server()</code>, to declare a new <em>remote server</em>, and the list of extensions available on it</li> <li><code class="language-plaintext highlighter-rouge">powa_configure_server()</code>, to update any setting for the specified <em>remote server</em> (using a JSON document where the key is the name of the parameter to change, and the value is the new value to use)</li> <li><code class="language-plaintext highlighter-rouge">powa_deactivate_server()</code>, to disable snapshots on the specified <em>remote server</em> (which actually sets the <code class="language-plaintext highlighter-rouge">frequency</code> to <strong>-1</strong>)</li> <li><code class="language-plaintext highlighter-rouge">powa_delete_and_purge_server()</code>, to remove the specified <em>remote server</em> from the list of servers and remove all associated snapshot data</li> <li><code class="language-plaintext highlighter-rouge">powa_activate_extension()</code>, to declare that a new extension is available on the specified <em>remote server</em></li> <li><code class="language-plaintext highlighter-rouge">powa_deactivate_extension()</code>, to specify that an extension is not available anymore on the specified <em>remote server</em></li> </ul> <p>Any action more complicated than this should be performed using plain SQL queries. Hopefully, there shouldn’t be many other needs, and the tables are straightforward so this shouldn’t be a problem. <a href="https://github.com/powa-team/powa-archivist/issues">Feel free to ask for more functions</a> if you feel the need though. 
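</p> <p>As an illustration, here is a minimal sketch of how these functions might be chained. The named parameters (<code class="language-plaintext highlighter-rouge">hostname</code>, <code class="language-plaintext highlighter-rouge">alias</code>, <code class="language-plaintext highlighter-rouge">password</code>, <code class="language-plaintext highlighter-rouge">extensions</code>), the <code class="language-plaintext highlighter-rouge">powa_servers</code> column names and every value used are assumptions made for the example, so check the exact signatures in the documentation linked above for your version:</p>

<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Sketch only: parameter names are assumed from the PoWA documentation,
-- and the host name, alias, password and extension list are made-up values.

-- Declare a new remote server and the stat extensions available on it
SELECT powa_register_server(
    hostname   =&gt; 'pg1.example.com',
    alias      =&gt; 'pg1',
    password   =&gt; 'change-me',
    extensions =&gt; '{pg_stat_kcache,pg_qualstats}'
);

-- Look up the identifier that was assigned to this server
-- (assuming powa_servers exposes id, hostname and alias columns)
SELECT id, hostname, alias FROM powa_servers;

-- Change a setting with a JSON document, assuming the server got id 1
SELECT powa_configure_server(1, '{"frequency": 60}');

-- Stop taking snapshots for this server (sets its frequency to -1)
SELECT powa_deactivate_server(1);</code></pre></figure>

<p>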
Please also note that the UI doesn’t allow you to call those functions, as the UI is for now entirely <strong>read-only</strong>.</p> <h3 id="performing-remote-snapshots">Performing <em>remote snapshots</em></h3> <p>As metrics are now stored on a different PostgreSQL instance, we had to extensively change the way <em>snapshots</em> (retrieving the data from a <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">stat extension</a> and storing it in the PoWA catalog <a href="/postgresql/2016/09/16/minimizing-tuple-overhead.html">in a space efficient way</a>) are performed.</p> <p>The list of all stat extensions, or <em>data sources</em>, that are available on a <strong>server</strong> (either <em>remote</em> or <em>local</em>) and for which we should perform a <em>snapshot</em> is configured in a table called <code class="language-plaintext highlighter-rouge">powa_functions</code>:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="k">Table</span> <span class="nv">"public.powa_functions"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span> <span class="c1">----------------+---------+-----------+----------+---------</span> <span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">module</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">operation</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span 
class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">function_name</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">query_source</span> <span class="o">|</span> <span class="nb">text</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="n">added_manually</span> <span class="o">|</span> <span class="nb">boolean</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">true</span> <span class="n">enabled</span> <span class="o">|</span> <span class="nb">boolean</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="k">true</span> <span class="n">priority</span> <span class="o">|</span> <span class="nb">numeric</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="mi">10</span></code></pre></figure> <p>A new <code class="language-plaintext highlighter-rouge">query_source</code> field is added, that provides the name of a <em>source</em> function, required to support remote snapshot of any <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">stat extensions</a>. This function is used to export the counters provided by this extension on a different server, in a dedicated <em>transient table</em>. The <em>snapshot</em> function will then perform the <em>snapshot</em> using those exported data instead of the one provided by stat extensions locally when the remote mode is used. 
Note that the counters export and the remote snapshot are done automatically by the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a>, which I’ll cover in another article.</p> <p>Here’s an example of how PoWA performs a <em>remote snapshot</em> of the list of databases. As you’ll see, this is very simple, meaning that it’s very easy to add support for a new stat extension.</p> <p>The <em>transient table</em>:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"> <span class="n">Unlogged</span> <span class="k">table</span> <span class="nv">"public.powa_databases_src_tmp"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span> <span class="c1">---------+---------+-----------+----------+---------</span> <span class="n">srvid</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span> <span class="n">datname</span> <span class="o">|</span> <span class="n">name</span> <span class="o">|</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="o">|</span></code></pre></figure> <p>For better performance, all the <em>transient tables</em> are <strong>unlogged</strong>, as their content is only needed during a <em>snapshot</em> and is discarded afterwards. 
In this example, the <em>transient table</em> only stores the identifier of the server the data belongs to, and the oid and name of each database present on the <em>remote server</em>.</p> <p>And the <em>source function</em>:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="k">public</span><span class="p">.</span><span class="n">powa_databases_src</span><span class="p">(</span><span class="n">_srvid</span> <span class="nb">integer</span><span class="p">,</span> <span class="k">OUT</span> <span class="n">oid</span> <span class="n">oid</span><span class="p">,</span> <span class="k">OUT</span> <span class="n">datname</span> <span class="n">name</span><span class="p">)</span> <span class="k">RETURNS</span> <span class="k">SETOF</span> <span class="n">record</span> <span class="k">LANGUAGE</span> <span class="n">plpgsql</span> <span class="k">AS</span> <span class="err">$</span><span class="k">function</span><span class="err">$</span> <span class="k">BEGIN</span> <span class="n">IF</span> <span class="p">(</span><span class="n">_srvid</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">THEN</span> <span class="k">RETURN</span> <span class="n">QUERY</span> <span class="k">SELECT</span> <span class="n">d</span><span class="p">.</span><span class="n">oid</span><span class="p">,</span> <span class="n">d</span><span class="p">.</span><span class="n">datname</span> <span class="k">FROM</span> <span class="n">pg_database</span> <span class="n">d</span><span class="p">;</span> <span class="k">ELSE</span> <span class="k">RETURN</span> <span class="n">QUERY</span> <span class="k">SELECT</span> <span class="n">d</span><span class="p">.</span><span class="n">oid</span><span class="p">,</span> <span class="n">d</span><span class="p">.</span><span class="n">datname</span> <span 
class="k">FROM</span> <span class="n">powa_databases_src_tmp</span> <span class="n">d</span> <span class="k">WHERE</span> <span class="n">srvid</span> <span class="o">=</span> <span class="n">_srvid</span><span class="p">;</span> <span class="k">END</span> <span class="n">IF</span><span class="p">;</span> <span class="k">END</span><span class="p">;</span> <span class="err">$</span><span class="k">function</span><span class="err">$</span></code></pre></figure> <p>This function simply returns the content of <code class="language-plaintext highlighter-rouge">pg_database</code> if local data is requested (server id <strong>0</strong> is always the local server), or the content of the <em>transient table</em> for the given remote server otherwise.</p> <p>The <em>snapshot function</em> can then easily do any required work with the data for the wanted <em>remote server</em>. In the case of the <code class="language-plaintext highlighter-rouge">powa_databases_snapshot()</code> function, this just means synchronizing the list of databases, and storing the removal timestamp if a previously existing database is not found anymore.</p> <p>For more details, you can consult the <a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/development.html">PoWA datasource integration</a> documentation, which was updated for version 4.</p> <p><a href="https://rjuju.github.io/postgresql/2019/06/05/powa-4-new-in-powa-archivist.html">PoWA 4: changes in powa-archivist!</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on June 05, 2019.</p> <![CDATA[PoWA 4 brings a remote mode, available in beta!]]> https://rjuju.github.io/postgresql/2019/05/17/powa-4-with-remote-mode-beta-is-available 2019-05-17T11:04:17+00:00 2019-05-17T11:04:17+00:00 Julien Rouhaud https://rjuju.github.io <p><a href="http://powa.readthedocs.io/">PoWA 4</a> is available in beta.</p> <h3 id="new-remote-mode">New remote mode!</h3> <p>The <a 
href="https://powa.readthedocs.io/en/latest/remote_setup.html">new remote mode</a> is the biggest feature introduced in PoWA 4, though there have been other improvements.</p> <p>I’ll describe here what this new mode implies and what changed in the <a href="https://powa.readthedocs.io/en/latest/components/powa-web/index.html">UI</a>.</p> <p>If you’re interested in more details about the rest of the changes in PoWA 4, I’ll soon publish other articles about them.</p> <p>If you’re in a hurry, feel free to go directly to the <a href="https://dev-powa.anayrat.info/">v4 demo of PoWA</a>, kindly hosted by <a href="http://blog.anayrat.info/">Adrien Nayrat</a>. No credentials needed, just click on “Login”.</p> <h3 id="why-is-a-remote-mode-important">Why is a remote mode important</h3> <p>This feature has probably been the most frequently requested since PoWA was first released, back in 2014. And it was requested for good reasons, as a local mode has some drawbacks.</p> <p>First, let’s see what the architecture looked like up to PoWA 3. Assume an instance with 2 databases (db1 and db2), plus <strong>one database dedicated to PoWA</strong>. This dedicated database contains the <em>stat extensions</em> required to get the live performance data, and is also used to <strong>store them</strong>.</p> <p><a href="/images/powa_4_local.svg"><img src="/images/powa_4_local.svg" alt="Local mode architecture" /></a></p> <p>A custom <em><a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/configuration.html#background-worker-configuration">background worker</a></em> is started by PoWA, which is responsible for regularly taking snapshots and storing them in the dedicated powa database. 
Then, using powa-web, you can see the activity of any of the <strong>local</strong> databases by querying the stored data on the dedicated database, and possibly by connecting to one of the other local databases when complete data is needed, for instance when using the index suggestion tool.</p> <p>With version 4, the architecture with a remote setup changes quite a lot:</p> <p><a href="/images/powa_4_remote.svg"><img src="/images/powa_4_remote.svg" alt="Remote mode architecture" /></a></p> <p>You can see that a dedicated powa database is still required, but <strong>only for the stat extensions</strong>. Data is now stored on a different instance. Then, the <em><a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/configuration.html#background-worker-configuration">background worker</a></em> is replaced by a <strong><a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">new collector daemon</a></strong>, which reads the performance data from the <em>remote servers</em> and stores it on the dedicated <em>repository server</em>. 
Powa-web will then be able to display the activity by connecting to the <em>repository server</em>, and also to the <strong>remote server</strong> when complete data is needed.</p> <p>In short, with the new remote mode introduced in this version 4:</p> <ul> <li>a PostgreSQL restart is not required anymore to install the powa-archivist extension, as the background worker is not mandatory anymore</li> <li>there is no overhead due to storing and querying data on the same PostgreSQL server as your production server (some parts of the UI still require querying the original server, for instance when showing EXPLAIN plans, but that’s a negligible overhead)</li> <li>it’s now possible to use PoWA on a <strong>hot-standby server</strong></li> </ul> <p>The UI will therefore now welcome you with an initial page to let you choose which of the servers stored in the configured database you want to work on: <a href="/images/powa_4_all_servers.png"><img src="/images/powa_4_all_servers.png" alt="Servers choice" /></a></p> <p>The main reason it took so much time to bring a remote mode is that it adds quite some complexity, requiring a major rewrite of the whole PoWA stack. We also wanted to add more features first, such as the <strong>global index suggestion</strong>, with <strong>validation using <a href="http://hypopg.readthedocs.io/">hypopg</a></strong> introduced with <a href="https://powa.readthedocs.io/en/latest/releases/v3.0.0.html">PoWA 3</a>.</p> <h3 id="changes-in-powa-web">Changes in <a href="https://powa.readthedocs.io/en/latest/components/powa-web/index.html">powa-web</a></h3> <p>The <em>user interface</em> is the component which probably has the most visible changes in this version 4. Here are the most important ones.</p> <h5 id="remote-mode-compatibility">Remote mode compatibility</h5> <p>The biggest change is obviously the support for the <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">new remote mode</a>. 
As a consequence, the first page shown is now a <strong>server selector</strong> page, displaying all registered <em>remote servers</em>. After choosing the wanted <em>remote server</em> (or <em>local server</em> if you don’t use the remote mode), all other pages will be similar to the ones that were available until PoWA 3, but displaying data for a specific <em>remote server</em> only, and of course retrieving the data from the <strong>repository powa database</strong>, and with some new information I’ll describe just after.</p> <p>Note that as the data is now stored on a dedicated <em>repository server</em> when using the remote mode, most of the UI is usable without connecting to the currently selected <em>remote server</em>. However, powa-web still needs to connect to the <em>remote server</em> when the original data is needed (for instance, for index suggestion or when showing <strong>EXPLAIN</strong> plans). The <a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">same authentication considerations and possibilities</a> as for the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a> (which will be described in an upcoming article) apply here.</p> <h5 id="pg_track_settings-support"><a href="https://github.com/rjuju/pg_track_settings/">pg_track_settings</a> support</h5> <p>When this extension is properly configured, a new timeline widget will appear, placed between each graph and its overview, displaying any recorded change detected in the currently selected time interval. On the per-database and per-query pages, this list will be filtered by the selected database.</p> <p>The same timeline will be displayed on every graph of each page, to easily check whether a change had any visible impact using the various graphs.</p> <p>Note that details of the changes will be displayed on mouseover. 
You can also click on any event on the timeline to make the event stay displayed, and draw a vertical line on the underlying graph.</p> <p>Here’s an example of such a detected configuration change in action:</p> <p><a href="/images/pg_track_settings_powa4.png"><img src="/images/pg_track_settings_powa4.png" alt="Configuration changes detected" /></a></p> <p>Please also note that you need at least version 2.0.0 of <a href="https://github.com/rjuju/pg_track_settings/">pg_track_settings</a>, and that the extension has to be installed <strong>both on the <em>remote servers</em> and the <em>repository server</em>.</strong></p> <h5 id="new-graphs-available">New graphs available</h5> <p>When <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_stat_kcache.html">pg_stat_kcache</a> is set up, its information was previously only displayed on the per-query page. It’s now displayed on the per-server and per-database pages too, in two graphs:</p> <ul> <li>in the <strong>Block Access</strong> graph, where the <strong>OS cache</strong> and <strong>disk read</strong> metrics will replace the <strong>read</strong> metric</li> <li>in a new <strong>System Resources</strong> graph (which is also added in the <em>per-query</em> page), showing the <a href="/postgresql/2018/07/17/pg_stat_kcache-2-1-is-out.html">metrics added in pg_stat_kcache 2.1</a></li> </ul> <p>Here is an example of this new <strong>System Resources</strong> graph:</p> <p><a href="/images/pg_stat_kcache_system_resources_powa4.png"><img src="/images/pg_stat_kcache_system_resources_powa4.png" alt="System resources" /></a></p> <p>There was also a <strong>Wait Events</strong> graph (available when the <a href="https://powa.readthedocs.io/en/v4/components/stats_extensions/pg_wait_sampling.html">pg_wait_sampling extension</a> is set up) only available on the per-query page. 
This graph is now available on the per-server and per-database pages too.</p> <h5 id="metrics-documentation-and-documentation-link">Metrics documentation and documentation link</h5> <p>Some metrics displayed in the user interface were quite self-explanatory, while others could be a little bit obscure. Unfortunately, until now there wasn’t any documentation for any of the metrics. That’s now fixed, and all graphs have an <em>information icon</em>, which displays a description of the metrics used in the graph on mouseover. Some graphs will also include a link to the underlying <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">stat extension in PoWA documentation</a> for users who want to learn more about them.</p> <p>Here’s an example:</p> <p><a href="/images/powa_4_metrics_doc.png"><img src="/images/powa_4_metrics_doc.png" alt="Metrics documentation" /></a></p> <h5 id="and-general-bugfixes">And general bugfixes</h5> <p>Some longstanding issues were also reported:</p> <ul> <li>the graph hover box showing metric values had a wrong vertical position</li> <li>the time selection using the graph preview didn’t show a correct preview after applying the selection</li> <li>errors on hypothetical index creation, or in certain cases their display, weren’t correctly handled on multiple pages</li> <li>grid filters weren’t reapplied when the time selection was changed</li> </ul> <p>If you have ever been annoyed by any of these, you’ll be glad to know that they’re now all fixed!</p> <h3 id="conclusion">Conclusion</h3> <p>This 4th version of PoWA represents a lot of time spent on development, documentation improvements and testing. We’re now quite satisfied with it, but we may have missed some bugs. 
If you’re interested in this project, I hope that you’ll consider testing the beta, and if needed don’t hesitate <a href="https://powa.readthedocs.io/en/latest/support.html#support">to report a bug</a>!</p> <p><a href="https://rjuju.github.io/postgresql/2019/05/17/powa-4-with-remote-mode-beta-is-available.html">PoWA 4 brings a remote mode, available in beta!</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on May 17, 2019.</p> <![CDATA[PoWA 4 brings a remote mode, available in beta!]]> https://rjuju.github.io/postgresqlfr/2019/05/17/powa-4-avec-mode-remote-disponible-en-beta 2019-05-17T11:04:17+00:00 2019-05-17T11:04:17+00:00 Julien Rouhaud https://rjuju.github.io <p><a href="http://powa.readthedocs.io/">PoWA 4</a> is available in beta.</p> <h3 id="nouveau-mode-remote-">New remote mode!</h3> <p>The <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">new remote mode</a> is the biggest feature added in PoWA 4, although there have been other improvements.</p> <p>I’ll describe here what this new mode implies, as well as what changed in the <a href="https://powa.readthedocs.io/en/latest/components/powa-web/index.html">UI</a>.</p> <p>If you’re interested in more details about the rest of the changes in PoWA 4, I’ll soon publish other articles on the subject.</p> <p>If you’re in a hurry, feel free to go directly to the <a href="https://dev-powa.anayrat.info/">v4 demo of PoWA</a>, very kindly hosted by <a href="http://blog.anayrat.info/">Adrien Nayrat</a>. No authentication is required, simply click on “Login”.</p> <h3 id="pourquoi-un-mode-remote-est-il-important">Why is a remote mode important</h3> <p>This feature has probably been the most frequently requested since PoWA was released, in 2014. 
And for good reasons, as a local mode has some drawbacks.</p> <p>First, let’s see what the architecture looked like with version 3 and earlier. Let’s imagine an instance containing 2 databases (db1 and db2), as well as <strong>one database dedicated to PoWA</strong>. This dedicated database contains the <em>stat extensions</em> required to retrieve the current performance counters, and is also used to <strong>store them</strong>.</p> <p><a href="/images/powa_4_local.svg"><img src="/images/powa_4_local.svg" alt="Local mode architecture" /></a></p> <p>A <em><a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/configuration.html#background-worker-configuration">background worker</a></em> is started by PoWA, which is responsible for taking <em>snapshots</em> and storing them in the dedicated powa database at regular intervals. Then, using powa-web, you can view the activity of any of the <strong>local</strong> databases by querying the data stored in the dedicated database, and possibly by connecting to one of the other local databases when complete data is needed, for instance when the index suggestion tool is used.</p> <p>With version 4, the architecture with a remote setup changes significantly:</p> <p><a href="/images/powa_4_remote.svg"><img src="/images/powa_4_remote.svg" alt="Remote mode architecture" /></a></p> <p>You can see that a dedicated powa database is still required, but <strong>only for the stat extensions</strong>. The data is now stored on a different instance. 
Then, the <em><a href="https://powa.readthedocs.io/en/latest/components/powa-archivist/configuration.html#background-worker-configuration">background worker</a></em> is replaced by a <strong><a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">new collector daemon</a></strong>, which reads the performance metrics from the <em>remote servers</em> and stores them on the dedicated <em>repository server</em>. Powa-web will be able to display the data by connecting to the <em>repository server</em>, as well as to the <strong>remote servers</strong> when complete data is needed.</p> <p>In short, with the new remote mode added in this version 4:</p> <ul> <li>a PostgreSQL restart is no longer required to install powa-archivist</li> <li>there is no more overhead due to storing and querying the data on the same PostgreSQL server as your production servers (some parts of the UI still need to query the original server, for instance to show plans with EXPLAIN, but the overhead is negligible)</li> <li>it’s now possible to use PoWA on a <strong>hot-standby server</strong></li> </ul> <p>The UI will therefore now welcome you with an initial page to choose which of the servers stored in the target database you want to work on: <a href="/images/powa_4_all_servers.png"><img src="/images/powa_4_all_servers.png" alt="Servers choice" /></a></p> <p>The main reason it took so much time to bring this remote mode is that it adds a lot of complexity, requiring a major rewrite of PoWA. 
We also wanted to add other features first, such as the <strong>global index suggestion</strong>, with <strong>validation using <a href="http://hypopg.readthedocs.io/">hypopg</a></strong> introduced with <a href="https://powa.readthedocs.io/en/latest/releases/v3.0.0.html">PoWA 3</a>.</p> <h3 id="changements-dans-powa-web">Changes in <a href="https://powa.readthedocs.io/en/latest/components/powa-web/index.html">powa-web</a></h3> <p>The <em>graphical interface</em> is the component with the most visible changes in this version 4. Here are the most important ones.</p> <h5 id="compatibilité-avec-le-mode-distant">Remote mode compatibility</h5> <p>The most important change is obviously the support for the <a href="https://powa.readthedocs.io/en/latest/remote_setup.html">new remote mode</a>. As a consequence, the first page displayed is now a <strong>server selection</strong> page, showing all registered <em>remote servers</em>. After choosing the wanted <em>remote server</em> (or the <em>local server</em> if you don’t use the remote mode), all other pages will be similar to the ones available up to version 3, but will display data for a specific <em>remote server</em> only, of course retrieving the data from the <strong>repository database</strong>, along with new information described below.</p> <p>Please note that since the data is now stored on a dedicated <em>repository server</em> when the remote mode is used, most of the UI is usable without connecting to the selected <em>remote server</em>. However, powa-web still needs to be able to connect to the <em>remote server</em> when the original data is needed (for instance, for index suggestion or to show plans with <strong>EXPLAIN</strong>). 
The <a href="https://powa.readthedocs.io/en/latest/security.html#connection-on-remote-servers">same authentication considerations and possibilities</a> as for the new <a href="https://powa.readthedocs.io/en/latest/components/powa-collector/index.html">powa-collector daemon</a> (which will be described in an upcoming article) apply here.</p> <h5 id="pg_track_settings-support"><a href="https://github.com/rjuju/pg_track_settings/">pg_track_settings</a> support</h5> <p>When this extension is properly configured, a new timeline widget will appear, placed between each graph and its overview, displaying the various kinds of recorded changes if any were detected in the selected time interval. On the per-database and per-query pages, the list will also be filtered by the selected database.</p> <p>The same timeline will be displayed on every graph of every page, to easily check whether those changes had a visible impact using the various graphs.</p> <p>Please note that details of the changes are displayed on mouseover. 
Vous pouvez également cliquer sur n’importe lequel des événements de la timeline pour figer l’affichage, et tracer une ligne verticale sur le graph associé.</p> <p>Voici un exemple d’un tel changement de configuration en action :</p> <p><a href="/images/pg_track_settings_powa4.png"><img src="/images/pg_track_settings_powa4.png" alt="Changements de configuration détectés" /></a></p> <p>Veuillez également noter qu’il est nécessaire d’avoir au minimum la version 2.0.0 de <a href="https://github.com/rjuju/pg_track_settings/">pg_track_settings</a>, et que l’extension doit être installée <strong>à la fois sur les <em>serveurs distants</em> ainsi que sur le <em>serveur repository</em>.</strong></p> <h5 id="nouveaux-graphs-disponibles">Nouveaux graphs disponibles</h5> <p>Quand <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_stat_kcache.html">pg_stat_kcache</a> est configuré, ses informations n’étaient auparavant affichées que sur la page par requête. Les informations sont maintenant également affichées sur les pages par serveur et par base, dans deux nouveaux graphs :</p> <ul> <li>dans le graph <strong>Block Access</strong>, où les métriques <strong>OS cache</strong> et <strong>disk read</strong> remplaceront la métrique <strong>read</strong></li> <li>dans un nouveau graph <strong>System Resources</strong> (qui est également ajouté dans la page <em>par requête</em>), montrant les <a href="/postgresql/2018/07/17/pg_stat_kcache-2-1-is-out.html">metrics ajoutées dans pg_stat_kcache 2.1</a></li> </ul> <p>Voici un example de ce nouveau graph <strong>System Resources</strong> :</p> <p><a href="/images/pg_stat_kcache_system_resources_powa4.png"><img src="/images/pg_stat_kcache_system_resources_powa4.png" alt="Ressources système" /></a></p> <p>Il y avait également un graph <strong>Wait Events</strong> (disponible quand <a href="https://powa.readthedocs.io/en/v4/components/stats_extensions/pg_wait_sampling.html">l’extension pg_wait_sampling</a> est 
configurée) disponible uniquement sur la page par requête. Ce graph est maintenant disponible sur les pages par serveur et par base également.</p> <h5 id="documentation-des-métriques-et-liens-vers-la-documentation">Documentation des métriques et liens vers la documentation</h5> <p>Certaines métriques affichées sur l’interface sont assez parlantes, mais d’autres peuvent être un peu obscures. Jusqu’à maintenant, il n’y avait malheureusement aucune documentation pour les métriques. Le problème est maintenant réglé, et tous les graphs ont une <em>icône d’information</em>, qui affiche une description des métriques utilisées dans le graph au survol de la souris. Certains graphs incluent également un lien vers la <a href="https://powa.readthedocs.io/en/latest/components/stats_extensions/index.html">documentation PoWA de l’extension de statistiques</a> pour les utilisateurs qui désirent en apprendre plus à leur sujet.</p> <p>Voici un exemple :</p> <p><a href="/images/powa_4_metrics_doc.png"><img src="/images/powa_4_metrics_doc.png" alt="Documentation des métriques" /></a></p> <h5 id="et-des-correctifs-de-bugs-divers">Et des correctifs de bugs divers</h5> <p>Certains problèmes de longue date ont également été rapportés :</p> <ul> <li>la boîte affichée au survol d’un graph montrant les valeurs des métriques avait une position verticale incorrecte</li> <li>la sélection temporelle en utilisant l’aperçu des graphs ne montrait pas un aperçu correct après avoir appliqué la sélection</li> <li>les erreurs lors de la création d’index hypothétiques, ou dans certains cas leur affichage, n’étaient pas correctement gérés sur plusieurs pages</li> <li>les filtres des tableaux n’étaient pas réappliqués quand l’intervalle de temps sélectionné était changé</li> </ul> <p>Si l’un de ces bugs vous a un jour posé problème, vous serez ravi d’apprendre qu’ils sont maintenant tous corrigés !</p> <h3 id="conclusion">Conclusion</h3> <p>Cette 4ème version de PoWA représente un temps de développement 
très important, de nombreuses améliorations sur la documentation et beaucoup de tests. Nous sommes maintenant assez satisfaits, mais il est possible que nous ayons raté certains bugs. Si vous vous intéressez à ce projet, j’espère que vous essaierez de tester cette beta, et si besoin n’hésitez pas à <a href="https://powa.readthedocs.io/en/latest/support.html#support">nous remonter un bug</a> !</p> <p><a href="https://rjuju.github.io/postgresqlfr/2019/05/17/powa-4-avec-mode-remote-disponible-en-beta.html">PoWA 4 apporte un mode remote, disponible en beta !</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on May 17, 2019.</p> <![CDATA[Nouveauté pg12: Statistiques sur les erreurs de checksums]]> https://rjuju.github.io/postgresqlfr/2019/04/18/nouveau-dans-pg12-statistiques-erreurs-checksums 2019-04-18T11:02:26+00:00 2019-04-18T11:02:26+00:00 Julien Rouhaud https://rjuju.github.io <h3 id="data-checksums">Data checksums</h3> <p>Ajoutés dans <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=96ef3b8ff1c">PostgreSQL 9.3</a>, les <a href="https://www.postgresql.org/docs/current/app-initdb.html#APP-INITDB-DATA-CHECKSUMS">data checksums</a> peuvent aider à détecter les corruptions de données survenant sur votre stockage.</p> <p>Les checksums sont activés si l’instance a été initialisée en utilisant <code class="language-plaintext highlighter-rouge">initdb --data-checksums</code> (ce qui n’est pas le comportement par défaut), ou s’ils ont été activés par la suite en utilisant le nouvel utilitaire <a href="https://www.postgresql.org/docs/devel/app-pgchecksums.html">pg_checksums</a>, également <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ed308d783790">ajouté dans PostgreSQL 12</a>.</p> <p>Quand les checksums sont activés, ceux-ci sont écrits à chaque fois qu’un bloc de données est écrit sur disque, et vérifiés à chaque fois qu’un bloc est lu depuis 
le disque (ou depuis le cache du système d’exploitation). Si la vérification échoue, une erreur est remontée dans les logs. Si le bloc était lu par un processus client, la requête associée échouera bien évidemment, mais si le bloc était lu par une opération <a href="https://www.postgresql.org/docs/current/protocol-replication.html#id-1.10.5.9.7.1.8.1.12">BASE_BACKUP</a> (telle que pg_basebackup), la commande continuera à s’exécuter. Bien que les data checksums ne détectent qu’un sous-ensemble des problèmes possibles, ils ont tout de même une certaine utilité, surtout si vous ne faites pas confiance à votre stockage.</p> <p>Jusqu’à PostgreSQL 11, les erreurs de validation de checksum ne pouvaient être trouvées qu’en cherchant dans les logs, ce qui n’est clairement pas pratique si vous voulez monitorer de telles erreurs.</p> <h3 id="nouveaux-compteurs-disponibles-dans-pg_stat_database">Nouveaux compteurs disponibles dans pg_stat_database</h3> <p>Pour rendre la supervision des erreurs de checksum plus simple, et pour aider les utilisateurs à réagir dès qu’un tel problème survient, PostgreSQL 12 ajoute de nouveaux compteurs dans la vue <code class="language-plaintext highlighter-rouge">pg_stat_database</code> :</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 6b9e875f7286d8535bff7955e5aa3602e188e436 Author: Magnus Hagander &lt;[email protected]&gt; Date: Sat Mar 9 10:45:17 2019 -0800 Track block level checksum failures in pg_stat_database This adds a column that counts how many checksum failures have occurred on files belonging to a specific database. Both checksum failures during normal backend processing and those created when a base backup detects a checksum failure are counted. 
Author: Magnus Hagander Reviewed by: Julien Rouhaud </code></pre></div></div> <p> </p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 77bd49adba4711b4497e7e39a5ec3a9812cbd52a Author: Magnus Hagander &lt;[email protected]&gt; Date: Fri Apr 12 14:04:50 2019 +0200 Show shared object statistics in pg_stat_database This adds a row to the pg_stat_database view with datoid 0 and datname NULL for those objects that are not in a database. This was added particularly for checksums, but we were already tracking more satistics for these objects, just not returning it. Also add a checksum_last_failure column that holds the timestamptz of the last checksum failure that occurred in a database (or in a non-dataabase file), if any. Author: Julien Rouhaud &lt;[email protected]&gt; </code></pre></div></div> <p> </p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 252b707bc41cc9bf6c55c18d8cb302a6176b7e48 Author: Magnus Hagander &lt;[email protected]&gt; Date: Wed Apr 17 13:51:48 2019 +0200 Return NULL for checksum failures if checksums are not enabled Returning 0 could falsely indicate that there is no problem. NULL correctly indicates that there is no information about potential problems. Also return 0 as numbackends instead of NULL for shared objects (as no connection can be made to a shared object only). 
Author: Julien Rouhaud &lt;[email protected]&gt; Reviewed-by: Robert Treat &lt;[email protected]&gt; </code></pre></div></div> <p>Ces compteurs reflèteront les erreurs de validation de checksum à la fois pour les processus clients et pour l’activité <a href="https://www.postgresql.org/docs/current/protocol-replication.html#id-1.10.5.9.7.1.8.1.12">BASE_BACKUP</a>, par base de données.</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="err">\</span><span class="n">d</span> <span class="n">pg_stat_database</span> <span class="k">View</span> <span class="nv">"pg_catalog.pg_stat_database"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span> <span class="c1">-----------------------+--------------------------+-----------+----------+---------</span> <span class="n">datid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="n">datname</span> <span class="o">|</span> <span class="n">name</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="p">[...]</span> <span class="n">checksum_failures</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="n">checksum_last_failure</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="p">[...]</span> <span class="n">stats_reset</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span 
class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span></code></pre></figure> <p>La colonne <code class="language-plaintext highlighter-rouge">checksum_failures</code> montrera un nombre cumulé d’erreurs, et la colonne <code class="language-plaintext highlighter-rouge">checksum_last_failure</code> montrera l’horodatage de la dernière erreur de validation sur la base de données (NULL si aucune erreur n’est jamais survenue).</p> <p>Pour éviter toute confusion (merci à Robert Treat pour l’avoir signalé), ces deux colonnes retourneront toujours NULL si les data checksums ne sont pas activés, afin qu’on ne puisse pas croire que les checksums sont toujours vérifiés avec succès.</p> <p>Comme effet de bord, <code class="language-plaintext highlighter-rouge">pg_stat_database</code> montrera maintenant également les statistiques disponibles pour les objets partagés (tels que la table <code class="language-plaintext highlighter-rouge">pg_database</code> par exemple), dans une nouvelle ligne pour laquelle <code class="language-plaintext highlighter-rouge">datid</code> vaut <strong>0</strong>, et <code class="language-plaintext highlighter-rouge">datname</code> vaut <strong>NULL</strong>.</p> <p><del>Une sonde dédiée est également <a href="https://github.com/OPMDG/check_pgactivity/issues/226">déjà planifiée</a> dans <a href="https://opm.readthedocs.io/probes/check_pgactivity.html">check_pgactivity</a> !</del> Une sonde dédiée est également <a href="https://github.com/OPMDG/check_pgactivity/commit/0e8b516e95e4364470d4e205aebc9fe68bbcfd23">déjà disponible</a> dans <a href="https://opm.readthedocs.io/probes/check_pgactivity.html">check_pgactivity</a> !</p> <p><a href="https://rjuju.github.io/postgresqlfr/2019/04/18/nouveau-dans-pg12-statistiques-erreurs-checksums.html">Nouveauté pg12: Statistiques sur les erreurs de checksums</a> was originally published by Julien Rouhaud at <a 
href="https://rjuju.github.io">rjuju's home</a> on April 18, 2019.</p> <![CDATA[New in pg12: Statistics on checksums errors]]> https://rjuju.github.io/postgresql/2019/04/18/new-in-pg12-statistics-checksums-errors 2019-04-18T11:02:26+00:00 2019-04-18T11:02:26+00:00 Julien Rouhaud https://rjuju.github.io <h3 id="data-checksums">Data checksums</h3> <p>Added in <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=96ef3b8ff1c">PostgreSQL 9.3</a>, <a href="https://www.postgresql.org/docs/current/app-initdb.html#APP-INITDB-DATA-CHECKSUMS">data checksums</a> can help to detect data corruption happening on the storage side.</p> <p>Checksums are only enabled if the instance was set up using <code class="language-plaintext highlighter-rouge">initdb --data-checksums</code> (which isn’t the default behavior), or if activated afterwards with the new <a href="https://www.postgresql.org/docs/devel/app-pgchecksums.html">pg_checksums</a> tool, also <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ed308d783790">added in PostgreSQL 12</a>.</p> <p>When enabled, checksums are written each time a block is written to disk, and verified each time a block is read from disk (or from the operating system cache). If the checksum verification fails, an error is reported in the logs. If the block was read by a backend, the query will obviously fail, but if the block was read by a <a href="https://www.postgresql.org/docs/current/protocol-replication.html#id-1.10.5.9.7.1.8.1.12">BASE_BACKUP</a> operation (such as pg_basebackup), the command will continue its processing. 
While data checksums will only catch a subset of possible problems, they still have some value, especially if you don’t trust your storage reliability.</p> <p>Up to PostgreSQL 11, any checksum validation error could only be found by looking into the logs, which clearly isn’t convenient if you want to monitor such errors.</p> <h3 id="new-counters-available-in-pg_stat_database">New counters available in pg_stat_database</h3> <p>To make checksum errors easier to monitor, and help users to react as soon as such a problem occurs, PostgreSQL 12 adds new counters in the <code class="language-plaintext highlighter-rouge">pg_stat_database</code> view:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 6b9e875f7286d8535bff7955e5aa3602e188e436 Author: Magnus Hagander &lt;[email protected]&gt; Date: Sat Mar 9 10:45:17 2019 -0800 Track block level checksum failures in pg_stat_database This adds a column that counts how many checksum failures have occurred on files belonging to a specific database. Both checksum failures during normal backend processing and those created when a base backup detects a checksum failure are counted. Author: Magnus Hagander Reviewed by: Julien Rouhaud </code></pre></div></div> <p> </p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 77bd49adba4711b4497e7e39a5ec3a9812cbd52a Author: Magnus Hagander &lt;[email protected]&gt; Date: Fri Apr 12 14:04:50 2019 +0200 Show shared object statistics in pg_stat_database This adds a row to the pg_stat_database view with datoid 0 and datname NULL for those objects that are not in a database. This was added particularly for checksums, but we were already tracking more satistics for these objects, just not returning it. Also add a checksum_last_failure column that holds the timestamptz of the last checksum failure that occurred in a database (or in a non-dataabase file), if any. 
Author: Julien Rouhaud &lt;[email protected]&gt; </code></pre></div></div> <p> </p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 252b707bc41cc9bf6c55c18d8cb302a6176b7e48 Author: Magnus Hagander &lt;[email protected]&gt; Date: Wed Apr 17 13:51:48 2019 +0200 Return NULL for checksum failures if checksums are not enabled Returning 0 could falsely indicate that there is no problem. NULL correctly indicates that there is no information about potential problems. Also return 0 as numbackends instead of NULL for shared objects (as no connection can be made to a shared object only). Author: Julien Rouhaud &lt;[email protected]&gt; Reviewed-by: Robert Treat &lt;[email protected]&gt; </code></pre></div></div> <p>Those counters will reflect checksum validation errors for both backend activity and <a href="https://www.postgresql.org/docs/current/protocol-replication.html#id-1.10.5.9.7.1.8.1.12">BASE_BACKUP</a> activity, per database.</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">rjuju</span><span class="o">=#</span> <span class="err">\</span><span class="n">d</span> <span class="n">pg_stat_database</span> <span class="k">View</span> <span class="nv">"pg_catalog.pg_stat_database"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="k">Collation</span> <span class="o">|</span> <span class="k">Nullable</span> <span class="o">|</span> <span class="k">Default</span> <span class="c1">-----------------------+--------------------------+-----------+----------+---------</span> <span class="n">datid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="n">datname</span> <span class="o">|</span> <span class="n">name</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span 
class="p">[...]</span> <span class="n">checksum_failures</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="n">checksum_last_failure</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span> <span class="p">[...]</span> <span class="n">stats_reset</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span></code></pre></figure> <p>The <code class="language-plaintext highlighter-rouge">checksum_failures</code> column will show a cumulative number of errors, and the <code class="language-plaintext highlighter-rouge">checksum_last_failure</code> column will show the timestamp of the last checksum failure on the database (NULL if no error ever happened).</p> <p>To avoid any confusion (thanks to Robert Treat for pointing it out), those two columns will always return NULL if data checksums aren’t enabled, so people won’t mistakenly think that data checksums are always successfully verified.</p> <p>As a side effect, <code class="language-plaintext highlighter-rouge">pg_stat_database</code> will also now show available statistics for shared objects (such as the <code class="language-plaintext highlighter-rouge">pg_database</code> table for instance), in a new row with <code class="language-plaintext highlighter-rouge">datid</code> set to <strong>0</strong>, and a <strong>NULL</strong> <code class="language-plaintext highlighter-rouge">datname</code>. 
Those were always accumulated, but weren’t displayed in any system view until now.</p> <p><del>A dedicated check is also <a href="https://github.com/OPMDG/check_pgactivity/issues/226">already planned</a> in <a href="https://opm.readthedocs.io/probes/check_pgactivity.html">check_pgactivity</a>!</del> A dedicated check is also <a href="https://github.com/OPMDG/check_pgactivity/commit/0e8b516e95e4364470d4e205aebc9fe68bbcfd23">already available</a> in <a href="https://opm.readthedocs.io/probes/check_pgactivity.html">check_pgactivity</a>!</p> <p><a href="https://rjuju.github.io/postgresql/2019/04/18/new-in-pg12-statistics-checksums-errors.html">New in pg12: Statistics on checksums errors</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 18, 2019.</p> <![CDATA[Minimiser le surcoût de stockage par ligne]]> https://rjuju.github.io/postgresqlfr/2019/04/06/minimiser-le-surcout-de-stockage-par-ligne 2019-04-06T07:51:28+00:00 2019-04-06T07:51:28+00:00 Julien Rouhaud https://rjuju.github.io <p>J’entends régulièrement des plaintes sur la quantité d’espace disque gâchée par PostgreSQL pour chacune des lignes qu’il stocke. Je vais essayer de montrer ici quelques astuces pour minimiser cet effet, afin d’avoir un stockage plus efficace.</p> <h3 id="quel-surcoût-">Quel surcoût ?</h3> <p>Si vous n’avez pas de table avec plus de quelques centaines de millions de lignes, il est probable que ce ne soit pas un problème pour vous.</p> <p>Pour chaque ligne stockée, postgres conservera quelques données additionnelles pour ses propres besoins. C’est <a href="https://www.postgresql.fr/docs/current/storage-page-layout.html#heaptupleheaderdata-table">documenté ici</a>. 
La documentation indique :</p> <table> <thead> <tr> <th>Field</th> <th>Type</th> <th>Length</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>t_xmin</td> <td>TransactionId</td> <td>4 bytes</td> <td>XID d’insertion</td> </tr> <tr> <td>t_xmax</td> <td>TransactionId</td> <td>4 bytes</td> <td>XID de suppression</td> </tr> <tr> <td>t_cid</td> <td>CommandId</td> <td>4 bytes</td> <td>CID d’insertion et de suppression (surcharge avec t_xvac)</td> </tr> <tr> <td>t_xvac</td> <td>TransactionId</td> <td>4 bytes</td> <td>XID pour l’opération VACUUM déplaçant une version de ligne</td> </tr> <tr> <td>t_ctid</td> <td>ItemPointerData</td> <td>6 bytes</td> <td>TID en cours pour cette version de ligne ou pour une version plus récente</td> </tr> <tr> <td>t_infomask2</td> <td>uint16</td> <td>2 bytes</td> <td>nombre d’attributs et quelques bits d’état</td> </tr> <tr> <td>t_infomask</td> <td>uint16</td> <td>2 bytes</td> <td>différents bits d’options (flag bits)</td> </tr> <tr> <td>t_hoff</td> <td>uint8</td> <td>1 byte</td> <td>décalage vers les données utilisateur</td> </tr> </tbody> </table> <p>Ce qui représente <strong>23 octets</strong> sur la plupart des architectures (il y a soit <strong>t_cid</strong> soit <strong>t_xvac</strong>).</p> <p>Vous pouvez d’ailleurs consulter une partie de ces champs grâce aux colonnes cachées présentes dans n’importe quelle table en les ajoutant dans la partie SELECT d’une requête, ou en cherchant les numéros d’attribut négatifs dans le catalogue <strong>pg_attribute</strong> :</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="err">\</span><span class="n">d</span> <span class="n">test</span> <span class="k">Table</span> <span class="nv">"public.test"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="n">Modifiers</span> <span class="c1">--------+---------+-----------</span> <span class="n">id</span> 
<span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="o">#</span> <span class="k">SELECT</span> <span class="n">xmin</span><span class="p">,</span> <span class="n">xmax</span><span class="p">,</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">test</span> <span class="k">LIMIT</span> <span class="mi">1</span><span class="p">;</span> <span class="n">xmin</span> <span class="o">|</span> <span class="n">xmax</span> <span class="o">|</span> <span class="n">id</span> <span class="c1">------+------+----</span> <span class="mi">1361</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">#</span> <span class="k">SELECT</span> <span class="n">attname</span><span class="p">,</span> <span class="n">attnum</span><span class="p">,</span> <span class="n">atttypid</span><span class="p">::</span><span class="n">regtype</span><span class="p">,</span> <span class="n">attlen</span> <span class="k">FROM</span> <span class="n">pg_class</span> <span class="k">c</span> <span class="k">JOIN</span> <span class="n">pg_attribute</span> <span class="n">a</span> <span class="k">ON</span> <span class="n">a</span><span class="p">.</span><span class="n">attrelid</span> <span class="o">=</span> <span class="k">c</span><span class="p">.</span><span class="n">oid</span> <span class="k">WHERE</span> <span class="n">relname</span> <span class="o">=</span> <span class="s1">'test'</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">attnum</span><span class="p">;</span> <span class="n">attname</span> <span class="o">|</span> <span class="n">attnum</span> <span class="o">|</span> <span class="n">atttypid</span> <span class="o">|</span> <span class="n">attlen</span> <span class="c1">----------+--------+----------+--------</span> <span class="n">tableoid</span> <span class="o">|</span> <span class="o">-</span><span class="mi">7</span> <span 
class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="mi">4</span> <span class="n">cmax</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6</span> <span class="o">|</span> <span class="n">cid</span> <span class="o">|</span> <span class="mi">4</span> <span class="n">xmax</span> <span class="o">|</span> <span class="o">-</span><span class="mi">5</span> <span class="o">|</span> <span class="n">xid</span> <span class="o">|</span> <span class="mi">4</span> <span class="n">cmin</span> <span class="o">|</span> <span class="o">-</span><span class="mi">4</span> <span class="o">|</span> <span class="n">cid</span> <span class="o">|</span> <span class="mi">4</span> <span class="n">xmin</span> <span class="o">|</span> <span class="o">-</span><span class="mi">3</span> <span class="o">|</span> <span class="n">xid</span> <span class="o">|</span> <span class="mi">4</span> <span class="n">ctid</span> <span class="o">|</span> <span class="o">-</span><span class="mi">1</span> <span class="o">|</span> <span class="n">tid</span> <span class="o">|</span> <span class="mi">6</span> <span class="n">id</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="nb">integer</span> <span class="o">|</span> <span class="mi">4</span></code></pre></figure> <p>Si vous comparez ces champs avec le tableau précédent, vous pouvez constater que toutes ces colonnes ne sont pas stockées sur disque. Bien évidemment, PostgreSQL ne stocke pas l’oid de la table pour chaque ligne. 
Celui-ci est ajouté après, lors de la construction d’une ligne.</p> <p>Si vous voulez plus de détails techniques, vous pouvez regarder <a href="http://doxygen.postgresql.org/htup__details_8h.html">htup_details.h</a>, en commençant par <a href="http://doxygen.postgresql.org/structHeapTupleHeaderData.html">la structure HeapTupleHeaderData</a>.</p> <h3 id="combien-est-ce-que-ça-coûte-">Combien est-ce que ça coûte ?</h3> <p>Puisque ce surcoût est fixe, plus la taille des lignes croît plus il devient négligeable. Si vous ne stockez qu’une simple colonne de type int (<strong>4 octets</strong>), chaque ligne nécessitera :</p> <figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="mi">23</span><span class="n">B</span> <span class="o">+</span> <span class="mi">4</span><span class="n">B</span> <span class="o">=</span> <span class="mi">27</span><span class="n">B</span></code></pre></figure> <p>soit <strong>85% de surcoût</strong>, ce qui est plutôt horrible.</p> <p>D’un autre côté, si vous stockez 5 integer, 3 bigint et 2 colonnes de type texte (disons environ 80 octets en moyenne), cela donnera :</p> <figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="mi">23</span><span class="n">B</span> <span class="o">+</span> <span class="mi">5</span><span class="o">*</span><span class="mi">4</span><span class="n">B</span> <span class="o">+</span> <span class="mi">3</span><span class="o">*</span><span class="mi">8</span><span class="n">B</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="mi">80</span><span class="n">B</span> <span class="o">=</span> <span class="mi">227</span><span class="n">B</span></code></pre></figure> <p>C’est “seulement” <strong>10% de surcoût</strong>.</p> <h3 id="et-donc-comment-minimiser-ce-surcoût">Et donc, comment minimiser ce surcoût</h3> <p>L’idée est de stocker les mêmes données, mais avec moins d’enregistrements. Comment faire ? 
En agrégeant les données dans des tableaux. Plus vous mettez d’enregistrements dans un seul tableau, plus vous minimiserez le surcoût. Et si vous agrégez suffisamment de données, vous pouvez bénéficier d’une compression entièrement transparente grâce au <a href="https://www.postgresql.fr/docs/current/storage-toast.html">mécanisme de TOAST</a>.</p> <p>Voyons ce que cela donne avec une table ne disposant que d’une seule colonne, avec 10 millions de lignes :</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">raw_1</span> <span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">);</span> <span class="o">#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">raw_1</span> <span class="k">SELECT</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10000000</span><span class="p">);</span> <span class="o">#</span> <span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">raw_1</span> <span class="p">(</span><span class="n">id</span><span class="p">);</span></code></pre></figure> <p>Les données utilisateur ne devraient nécessiter que 10M * 4 octets, soit environ <strong>38 Mo</strong>, alors que cette table pèse <strong>348 Mo</strong>. L’insertion des données prend environ <strong>23 secondes</strong>.</p> <p class="notice"><strong>NOTE :</strong> Si vous faites le calcul, vous trouverez que le surcoût est d’un peu plus de <strong>32 octets</strong> par ligne, pas <strong>23 octets</strong>. C’est parce que chaque bloc de données a également un surcoût, une gestion des colonnes NULL ainsi que des contraintes d’alignement. 
Si vous voulez plus d’informations à ce sujet, je vous recommande de regarder <a href="https://github.com/dhyannataraj/tuple-internals-presentation">cette présentation</a>.</p> <p>Comparons maintenant cela avec la version agrégée des mêmes données :</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">agg_1</span> <span class="p">(</span><span class="n">id</span> <span class="nb">integer</span><span class="p">[]);</span> <span class="o">#</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">agg_1</span> <span class="k">SELECT</span> <span class="n">array_agg</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10000000</span><span class="p">)</span> <span class="n">i</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">2000000</span><span class="p">;</span> <span class="o">#</span> <span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">agg_1</span> <span class="p">(</span><span class="n">id</span><span class="p">);</span></code></pre></figure> <p>Cette requête insèrera 5 éléments par ligne. J’ai fait le même test avec 20, 100, 200 et 1000 éléments par ligne. Les résultats sont les suivants :</p> <p><a href="/images/tuple_overhead_1.svg"><img src="/images/tuple_overhead_1.svg" alt="Benchmark 1" /></a></p> <p class="notice"><strong>NOTE :</strong> La taille pour 1000 éléments par ligne est un peu plus importante que pour la valeur précédente. C’est parce que c’est le seul cas qui implique une taille suffisamment importante pour être TOAST-ée, mais pas assez pour être compressée. 
On peut donc voir ici un peu de surcoût lié au TOAST.</p> <p>Jusqu’ici tout va bien, on peut voir d’assez bonnes améliorations à la fois sur la taille et sur le temps d’insertion, même pour les tableaux les plus petits. Voyons maintenant l’impact pour récupérer des lignes. Je testerai la récupération de toutes les lignes, ainsi que d’une seule ligne au moyen d’un parcours d’index (j’ai utilisé pour les tests EXPLAIN ANALYZE afin de minimiser le temps passé par psql à afficher les données) :</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="k">SELECT</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">raw_1</span><span class="p">;</span> <span class="o">#</span> <span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">raw_1</span> <span class="p">(</span><span class="n">id</span><span class="p">);</span> <span class="o">#</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">raw_1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">500</span><span class="p">;</span></code></pre></figure> <p>Pour correctement indexer le tableau, nous avons besoin d’un index GIN. 
To retrieve the values of all the aggregated data, we need to call unnest() on the array, and to retrieve a single record we have to be a bit more creative:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="k">SELECT</span> <span class="k">unnest</span><span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="k">AS</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">agg_1</span><span class="p">;</span> <span class="o">#</span> <span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">agg_1</span> <span class="k">USING</span> <span class="n">gin</span> <span class="p">(</span><span class="n">id</span><span class="p">);</span> <span class="o">#</span> <span class="k">WITH</span> <span class="n">s</span><span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span> <span class="k">SELECT</span> <span class="k">unnest</span><span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">agg_1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">&amp;&amp;</span> <span class="n">array</span><span class="p">[</span><span class="mi">500</span><span class="p">]</span> <span class="p">)</span> <span class="k">SELECT</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">s</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">500</span><span class="p">;</span></code></pre></figure> <p>Here is the chart comparing index creation time and index size, for each array size:</p> <p><a href="/images/tuple_overhead_2.svg"><img src="/images/tuple_overhead_2.svg" alt="Benchmark 2" /></a></p> <p>The GIN index is a bit more than twice as
large as the btree index, and if we add the table size to the index size, the total size is almost the same with or without aggregation. That’s not a big problem since this example is very naive, and we’ll see right after how to avoid relying on a GIN index in order to keep the total size low. In addition, the index is much slower to create, which means that INSERT will also be slower.</p> <p>Here is the chart comparing the time needed to retrieve all the rows and a single row:</p> <p><a href="/images/tuple_overhead_3.svg"><img src="/images/tuple_overhead_3.svg" alt="Benchmark 3" /></a></p> <p>Retrieving all the rows is probably not a very interesting example, but it is worth noting that as soon as the array contains enough elements, it becomes more efficient than doing the same thing on the original table. We also see that retrieving a single element is much faster than with the btree index, thanks to GIN’s efficiency. It’s not tested here, but since only btree indexes are natively sorted, if you need to retrieve a large number of sorted records, using a GIN index will require an extra sort, which will be much slower than a simple btree index scan.</p> <h3 id="un-exemple-plus-réaliste">A more realistic example</h3> <p>Now that we’ve seen the basics, let’s see how to go a bit further: aggregating more than one column, and avoiding the extra disk space (and write slowdown) caused by a GIN index. For that, I’ll show how <a href="https://powa.readthedocs.io/">PoWA</a> stores its data.</p> <p>For each data source collected, two tables are used: one for the <strong>historical, aggregated data</strong>, and one for the <strong>current data</strong>. These tables store the data in a custom data type rather than in plain columns.
Let’s look at the tables related to the <strong>pg_stat_statements</strong> extension:</p> <p>The data type, which contains roughly all the counters present in pg_stat_statements plus the timestamp associated with the record:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">powa</span><span class="o">=#</span> <span class="err">\</span><span class="n">d</span> <span class="n">powa_statements_history_record</span> <span class="n">Composite</span> <span class="k">type</span> <span class="nv">"public.powa_statements_history_record"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="n">Modifiers</span> <span class="c1">---------------------+--------------------------+-----------</span> <span class="n">ts</span> <span class="o">|</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="o">|</span> <span class="n">calls</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">total_time</span> <span class="o">|</span> <span class="nb">double</span> <span class="nb">precision</span> <span class="o">|</span> <span class="k">rows</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">shared_blks_hit</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">shared_blks_read</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">shared_blks_dirtied</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">shared_blks_written</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">local_blks_hit</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span
class="n">local_blks_read</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">local_blks_dirtied</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">local_blks_written</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">temp_blks_read</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">temp_blks_written</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="n">blk_read_time</span> <span class="o">|</span> <span class="nb">double</span> <span class="nb">precision</span> <span class="o">|</span> <span class="n">blk_write_time</span> <span class="o">|</span> <span class="nb">double</span> <span class="nb">precision</span> <span class="o">|</span></code></pre></figure> <p>The table for the current data stores the pg_stat_statements unique identifier (queryid, dbid, userid), along with a record of counters:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">powa</span><span class="o">=#</span> <span class="err">\</span><span class="n">d</span> <span class="n">powa_statements_history_current</span> <span class="k">Table</span> <span class="nv">"public.powa_statements_history_current"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="n">Modifiers</span> <span class="c1">---------+--------------------------------+-----------</span> <span class="n">queryid</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="n">dbid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="n">userid</span> <span
class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="n">record</span> <span class="o">|</span> <span class="n">powa_statements_history_record</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span></code></pre></figure> <p>The table for the aggregated data contains the same unique identifier, an array of records and a few special fields:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">powa</span><span class="o">=#</span> <span class="err">\</span><span class="n">d</span> <span class="n">powa_statements_history</span> <span class="k">Table</span> <span class="nv">"public.powa_statements_history"</span> <span class="k">Column</span> <span class="o">|</span> <span class="k">Type</span> <span class="o">|</span> <span class="n">Modifiers</span> <span class="c1">----------------+----------------------------------+-----------</span> <span class="n">queryid</span> <span class="o">|</span> <span class="nb">bigint</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="n">dbid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="n">userid</span> <span class="o">|</span> <span class="n">oid</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="n">coalesce_range</span> <span class="o">|</span> <span class="n">tstzrange</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="n">records</span> <span class="o">|</span> <span class="n">powa_statements_history_record</span><span class="p">[]</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="n">mins_in_range</span> <span class="o">|</span> <span
class="n">powa_statements_history_record</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="n">maxs_in_range</span> <span class="o">|</span> <span class="n">powa_statements_history_record</span> <span class="o">|</span> <span class="k">not</span> <span class="k">null</span> <span class="n">Indexes</span><span class="p">:</span> <span class="nv">"powa_statements_history_query_ts"</span> <span class="n">gist</span> <span class="p">(</span><span class="n">queryid</span><span class="p">,</span> <span class="n">coalesce_range</span><span class="p">)</span></code></pre></figure> <p>We also store the timestamp range (<em>coalesce_range</em>) covering all the counters aggregated in the row, as well as the minimum and maximum values of each counter in two dedicated records (<em>mins_in_range</em> and <em>maxs_in_range</em>). These extra fields don’t take up much space, and they allow very efficient indexing and processing, based on the data access patterns of the related application.</p> <p>This table is used to find out how many resources a query consumed over a given time range. The GiST index won’t be very big, since it only indexes two small values for every X aggregated counters, and it will very efficiently find the rows matching a given query and time range.</p> <p>Then, computing the resources consumed can be done very efficiently, since the pg_stat_statements counters are monotonically increasing.
The algorithm could be:</p> <ul> <li>if the row’s time range is entirely contained in the requested time range, we only need to compute the delta from the record summary: <strong>maxs_in_range.counter - mins_in_range.counter</strong></li> <li>otherwise (which only happens for two rows per queryid) we unnest the array, filter out the records that are not within the requested time range, keep the first and last values and compute, for each counter, the maximum minus the minimum.</li> </ul> <p class="notice"><strong>NOTE:</strong> In practice, the PoWA interface always unnests all the records contained in the requested time range, since the interface is designed to show how these counters evolve over a relatively short time range, but with great precision. Fortunately, unnesting the arrays is not that expensive, especially compared to the disk space saved.</p> <p>And here is the size needed for the aggregated and non-aggregated values. For this I let PoWA generate <strong>12,331,366 records</strong> (by configuring a snapshot every 5 seconds for a few hours, with the default aggregation of 100 records per row), and created a btree index on (queryid, ((record).ts)) to simulate the index present on the aggregated tables:</p> <p><a href="/images/tuple_overhead_4.svg"><img src="/images/tuple_overhead_4.svg" alt="Benchmark 4" /></a></p> <p>Pretty efficient, don’t you think?</p> <h3 id="limitations">Limitations</h3> <p>There are some limitations when aggregating records. If you do this, you can no longer enforce constraints such as foreign keys or unique constraints.
This is therefore meant for non-relational data, such as counters or metadata.</p> <h3 id="bonus">Bonus</h3> <p>Using custom data types lets you do some nice things, such as defining <strong>custom operators</strong>. For instance, PoWA 3.1.0 provides two operators for each of the custom data types it defines:</p> <ul> <li>the <strong>-</strong> operator, to get the difference between two records</li> <li>the <strong>/</strong> operator, to get the difference <em>per second</em></li> </ul> <p>You can therefore very easily write queries such as:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">#</span> <span class="k">SELECT</span> <span class="p">(</span><span class="n">record</span> <span class="o">-</span> <span class="n">lag</span><span class="p">(</span><span class="n">record</span><span class="p">)</span> <span class="n">over</span><span class="p">()).</span><span class="o">*</span> <span class="k">FROM</span> <span class="n">powa_statements_history_current</span> <span class="k">WHERE</span> <span class="n">queryid</span> <span class="o">=</span> <span class="mi">3589441560</span> <span class="k">AND</span> <span class="n">dbid</span> <span class="o">=</span> <span class="mi">16384</span><span class="p">;</span> <span class="n">intvl</span> <span class="o">|</span> <span class="n">calls</span> <span class="o">|</span> <span class="n">total_time</span> <span class="o">|</span> <span class="k">rows</span> <span class="o">|</span> <span class="p">...</span> <span class="c1">-----------------+--------+------------------+--------+ ...</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span
class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="p">...</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span><span class="p">:</span><span class="mi">05</span><span class="p">.</span><span class="mi">004611</span> <span class="o">|</span> <span class="mi">5753</span> <span class="o">|</span> <span class="mi">20</span><span class="p">.</span><span class="mi">5570000000005</span> <span class="o">|</span> <span class="mi">5753</span> <span class="o">|</span> <span class="p">...</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span><span class="p">:</span><span class="mi">05</span><span class="p">.</span><span class="mi">004569</span> <span class="o">|</span> <span class="mi">1879</span> <span class="o">|</span> <span class="mi">6</span><span class="p">.</span><span class="mi">40500000000047</span> <span class="o">|</span> <span class="mi">1879</span> <span class="o">|</span> <span class="p">...</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span><span class="p">:</span><span class="mi">05</span><span class="p">.</span><span class="mi">00477</span> <span class="o">|</span> <span class="mi">14369</span> <span class="o">|</span> <span class="mi">48</span><span class="p">.</span><span class="mi">9060000000006</span> <span class="o">|</span> <span class="mi">14369</span> <span class="o">|</span> <span class="p">...</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span><span class="p">:</span><span class="mi">05</span><span class="p">.</span><span class="mi">00418</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="p">...</span> <span 
class="o">#</span> <span class="k">SELECT</span> <span class="p">(</span><span class="n">record</span> <span class="o">/</span> <span class="n">lag</span><span class="p">(</span><span class="n">record</span><span class="p">)</span> <span class="n">over</span><span class="p">()).</span><span class="o">*</span> <span class="k">FROM</span> <span class="n">powa_statements_history_current</span> <span class="k">WHERE</span> <span class="n">queryid</span> <span class="o">=</span> <span class="mi">3589441560</span> <span class="k">AND</span> <span class="n">dbid</span> <span class="o">=</span> <span class="mi">16384</span><span class="p">;</span> <span class="n">sec</span> <span class="o">|</span> <span class="n">calls_per_sec</span> <span class="o">|</span> <span class="n">runtime_per_sec</span> <span class="o">|</span> <span class="n">rows_per_sec</span> <span class="o">|</span> <span class="p">...</span> <span class="c1">--------+---------------+------------------+--------------+ ...</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="p">...</span> <span class="mi">5</span> <span class="o">|</span> <span class="mi">1150</span><span class="p">.</span><span class="mi">6</span> <span class="o">|</span> <span class="mi">4</span><span class="p">.</span><span class="mi">1114000000001</span> <span class="o">|</span> <span class="mi">1150</span><span class="p">.</span><span class="mi">6</span> <span class="o">|</span> <span class="p">...</span> <span class="mi">5</span> <span class="o">|</span> <span class="mi">375</span><span class="p">.</span><span class="mi">8</span> <span 
class="o">|</span> <span class="mi">1</span><span class="p">.</span><span class="mi">28100000000009</span> <span class="o">|</span> <span class="mi">375</span><span class="p">.</span><span class="mi">8</span> <span class="o">|</span> <span class="p">...</span> <span class="mi">5</span> <span class="o">|</span> <span class="mi">2873</span><span class="p">.</span><span class="mi">8</span> <span class="o">|</span> <span class="mi">9</span><span class="p">.</span><span class="mi">78120000000011</span> <span class="o">|</span> <span class="mi">2873</span><span class="p">.</span><span class="mi">8</span> <span class="o">|</span> <span class="p">...</span></code></pre></figure> <p>If you’re interested in how to implement such operators, you can have a look at <a href="https://github.com/powa-team/powa-archivist/commit/203ed02a5205ad41ce0854bf0580779d7fb6193b#diff-efeed95efc180d43a149361145c2f082R1079">PoWA’s implementation</a>.</p> <h3 id="conclusion">Conclusion</h3> <p>You now know the basics of how to avoid the per-row storage overhead.
Depending on your needs and the specifics of your data, you should be able to find a way to aggregate your data, possibly adding a few extra columns, in order to keep good performance and save disk space.</p> <!-- Test 1, simple integer, 10M row with s(id) AS (select unnest(id) from agg_1 where id && array[500]) select * from s where id = 500; raw_1 (id integer) insert: 23s size: 346 MB read data: 2.2s create index: 5.2s index size: 214 MB find 1 row: 1.4ms agg_1 (id integer[]) 5 val per row INSERT INTO agg_1 SELECT array_agg(i) FROM generate_series(1,10000000) i GROUP BY i % 2000000 ; insert: 18s size: 146 MB (no toast) read raw data: 377 ms unnnest: 4s create (GIN) index: 73s index size: 478 MB find 1 val: 0.25ms agg_1 (id integer[]) 20 val per row INSERT INTO agg_1 SELECT array_agg(i) FROM generate_series(1,10000000) i GROUP BY i % 500000 ; insert: 13s size: 64 MB (no toast) read raw data: 100ms read unnnest: 2.6 s create (GIN) index: 70s index size: 478MB find 1 val: 0.3ms agg_1 (id integer[]) 100 val per row INSERT INTO agg_1 SELECT array_agg(i) FROM generate_series(1,10000000) i GROUP BY i % 100000; insert: 10s size: 43MB (notoast) read raw data: 31ms read unnnest: 2s create (GIN) index: 68s index size: 478 MB find 1 val: 0.45 ms agg_1 (id integer[]) 200 val per row INSERT INTO agg_1 SELECT array_agg(i) FROM generate_series(1,10000000) i GROUP BY i % 50000; insert: 9.7s size: 43MB (notoast) read raw data: 21ms read unnnest: 2s create (GIN) index: 69s index size: 478MB find 1 val: 0.7ms agg_1 (id integer[]) 1000 val per row INSERT INTO agg_1 SELECT array_agg(i) FROM generate_series(1,10000000) i GROUP BY i % 10000; insert: 10s size: 53MB (toast) read raw data: 7ms read unnnest: 2s create (GIN) index: 67s index size: 478MB find 1 val: 2,7ms --> <p><a href="https://rjuju.github.io/postgresqlfr/2019/04/06/minimiser-le-surcout-de-stockage-par-ligne.html">Minimiser le surcoût de stockage par ligne</a> was
originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 06, 2019.</p> <![CDATA[Support des Wait Events pour PoWA]]> https://rjuju.github.io/postgresqlfr/2019/04/02/support-des-wait-events-pour-powa 2019-04-02T17:08:24+00:00 2019-04-02T17:08:24+00:00 Julien Rouhaud https://rjuju.github.io <p>You can now view <strong>Wait Events</strong> in <a href="https://powa.readthedocs.io/">PoWA 3.2.0</a> thanks to the <a href="https://github.com/postgrespro/pg_wait_sampling/">pg_wait_sampling</a> extension.</p> <h3 id="wait-events--pg_wait_sampling">Wait Events &amp; pg_wait_sampling</h3> <p>Wait events are a well-known and very useful feature in many relational database engines. They were added in <a href="https://github.com/postgres/postgres/commit/53be0b1add7">PostgreSQL 9.6</a>, a few versions ago now. Unlike most other statistics exposed by PostgreSQL, they are only a point-in-time view of the events that processes are waiting on, rather than cumulated counters.
You can get this information from the <code class="language-plaintext highlighter-rouge">pg_stat_activity</code> view, for instance:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">datid</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="n">wait_event_type</span><span class="p">,</span> <span class="n">wait_event</span><span class="p">,</span> <span class="n">query</span> <span class="k">FROM</span> <span class="n">pg_stat_activity</span><span class="p">;</span> <span class="n">datid</span> <span class="o">|</span> <span class="n">pid</span> <span class="o">|</span> <span class="n">wait_event_type</span> <span class="o">|</span> <span class="n">wait_event</span> <span class="o">|</span> <span class="n">query</span> <span class="c1">--------+-------+-----------------+---------------------+-------------------------------------------------------------------------</span> <span class="o">&lt;</span><span class="k">NULL</span><span class="o">&gt;</span> <span class="o">|</span> <span class="mi">13782</span> <span class="o">|</span> <span class="n">Activity</span> <span class="o">|</span> <span class="n">AutoVacuumMain</span> <span class="o">|</span> <span class="mi">16384</span> <span class="o">|</span> <span class="mi">16615</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">relation</span> <span class="o">|</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t1</span><span class="p">;</span> <span class="mi">16384</span> <span class="o">|</span> <span class="mi">16621</span> <span class="o">|</span> <span class="n">Client</span> <span class="o">|</span> <span class="n">ClientRead</span> <span class="o">|</span> <span class="k">LOCK</span> <span class="k">TABLE</span> <span class="n">t1</span><span
class="p">;</span> <span class="mi">847842</span> <span class="o">|</span> <span class="mi">16763</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">WALWriteLock</span> <span class="o">|</span> <span class="k">END</span><span class="p">;</span> <span class="mi">847842</span> <span class="o">|</span> <span class="mi">16764</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="k">UPDATE</span> <span class="n">pgbench_branches</span> <span class="k">SET</span> <span class="n">bbalance</span> <span class="o">=</span> <span class="n">bbalance</span> <span class="o">+</span> <span class="mi">1229</span> <span class="k">WHERE</span> <span class="n">bid</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="mi">847842</span> <span class="o">|</span> <span class="mi">16766</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">WALWriteLock</span> <span class="o">|</span> <span class="k">END</span><span class="p">;</span> <span class="mi">847842</span> <span class="o">|</span> <span class="mi">16767</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="k">UPDATE</span> <span class="n">pgbench_tellers</span> <span class="k">SET</span> <span class="n">tbalance</span> <span class="o">=</span> <span class="n">tbalance</span> <span class="o">+</span> <span class="mi">3383</span> <span class="k">WHERE</span> <span class="n">tid</span> <span class="o">=</span> <span class="mi">86</span><span class="p">;</span> <span class="mi">847842</span> <span class="o">|</span> <span class="mi">16769</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> 
<span class="k">UPDATE</span> <span class="n">pgbench_branches</span> <span class="k">SET</span> <span class="n">bbalance</span> <span class="o">=</span> <span class="n">bbalance</span> <span class="o">+</span> <span class="o">-</span><span class="mi">3786</span> <span class="k">WHERE</span> <span class="n">bid</span> <span class="o">=</span> <span class="mi">10</span><span class="p">;</span> <span class="p">[...]</span></code></pre></figure> <p>In this example, we can see that the <em>wait event</em> for pid 16615 is a <code class="language-plaintext highlighter-rouge">Lock</code> on a <code class="language-plaintext highlighter-rouge">relation</code>. In other words, the query is blocked waiting for a heavyweight lock, while pid 16621, which clearly holds the lock, is idle waiting for commands from the client. This information was already available in older versions, although it had to be obtained differently. More interestingly, we can also see that the <em>wait event</em> for pid 16766 is an <code class="language-plaintext highlighter-rouge">LWLock</code>, that is a <strong>Lightweight Lock</strong>. Lightweight locks are internal, transient locks that previously could not be seen at the SQL level. In this example, the query is waiting for a <strong>WALWriteLock</strong>, a lightweight lock mainly used to control writes to the WAL buffers. A complete list of the available <em>wait events</em> can be found <a href="https://docs.postgresql.fr/current/monitoring-stats.html#wait-event-table">in the official documentation</a>.</p> <p>This information was sorely missing and is very useful for diagnosing the causes of slowdowns. However, only seeing those <em>wait events</em> at the current instant is clearly not enough to get a good idea of what is happening on the server.
Since most <em>wait events</em> are by nature very short-lived, what you need is to sample them at a high frequency. Trying to do this sampling with an external tool, even at a one-second interval, is usually not enough. This is where <a href="https://github.com/postgrespro/pg_wait_sampling/">the pg_wait_sampling extension</a> provides a really brilliant solution. It is an extension written by <a href="http://akorotkov.github.io/">Alexander Korotkov</a> and Ildus Kurbangaliev. Once enabled (it has to be added to <code class="language-plaintext highlighter-rouge">shared_preload_libraries</code>, so an instance restart is required), it samples the <em>wait events</em> in shared memory every <strong>10 ms</strong> (by default), and also aggregates the counters per <em>wait event</em> type (wait_event_type), <em>wait event</em> and queryid (if <code class="language-plaintext highlighter-rouge">pg_stat_statements</code> is also enabled). For more details on how to configure and use this extension, you can refer to the <a href="https://github.com/postgrespro/pg_wait_sampling/blob/master/README.md">extension’s README</a>. Since all the work is done in memory by an extension written in C, it is very efficient. In addition, the implementation uses very little locking, so the overhead of this extension should be almost negligible. I ran some benchmarks on my laptop (unfortunately I don’t have a better machine to test on) with a read-only <a href="https://www.postgresql.org/docs/current/static/pgbench.html">pgbench</a> where all the data fit in PostgreSQL’s cache (<code class="language-plaintext highlighter-rouge">shared_buffers</code>), with 8 and then 90 clients, to try to get the maximum possible overhead.
The average over 3 runs was around 1% overhead, with fluctuations between runs of around 0.8%.</p> <h3 id="et-powa-">And PoWA?</h3> <p>So, thanks to this extension, we have a cumulated and extremely precise view of <em>wait events</em> at our disposal. That’s great, but like all the other cumulated statistics in PostgreSQL, you need to sample these counters regularly if you want to be able to know what happened at a given point in the past, as the extension’s README itself points out:</p> <blockquote> <p>[…] Waits profile. It’s implemented as in-memory hash table where count of samples are accumulated per each process and each wait event (and each query with <code class="language-plaintext highlighter-rouge">pg_stat_statements</code>). This hash table can be reset by user request. Assuming there is a client who periodically dumps profile and resets it, user can have statistics of intensivity of wait events among time.</p> </blockquote> <p>This is exactly the goal of <a href="http://powa.readthedocs.io/">PoWA</a>: saving statistics counters efficiently, and displaying them in a graphical interface.</p> <p>PoWA 3.2 automatically detects whether the <a href="https://github.com/postgrespro/pg_wait_sampling/">pg_wait_sampling</a> extension is already present or whether you install it later, and starts collecting its data, giving you a really precise view of the <em>wait events</em> over time on your databases!</p> <p>The data is stored in <a href="/postgresql/2016/09/16/minimizing-tuple-overhead.html">regular PoWA tables</a>: <code class="language-plaintext highlighter-rouge">powa_wait_sampling_history_current</code> for the last 100 snapshots (the default value of <code class="language-plaintext highlighter-rouge">powa.coalesce</code>), while older values are aggregated in the
<code class="language-plaintext highlighter-rouge">powa_wait_sampling_history</code> table, with a history going back as far as the period defined by <code class="language-plaintext highlighter-rouge">powa.retention</code>. For instance, here is a simple query showing the first 20 changes that occurred within the 100 most recent snapshots:</p> <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">WITH</span> <span class="n">s</span> <span class="k">AS</span> <span class="p">(</span> <span class="k">SELECT</span> <span class="p">(</span><span class="n">record</span><span class="p">).</span><span class="n">ts</span><span class="p">,</span> <span class="n">queryid</span><span class="p">,</span> <span class="n">event_type</span><span class="p">,</span> <span class="n">event</span><span class="p">,</span> <span class="p">(</span><span class="n">record</span><span class="p">).</span><span class="k">count</span> <span class="o">-</span> <span class="n">lag</span><span class="p">((</span><span class="n">record</span><span class="p">).</span><span class="k">count</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">PARTITION</span> <span class="k">BY</span> <span class="n">queryid</span><span class="p">,</span> <span class="n">event_type</span><span class="p">,</span> <span class="n">event</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="p">(</span><span class="n">record</span><span class="p">).</span><span class="n">ts</span><span class="p">)</span> <span class="k">AS</span> <span class="n">events</span> <span class="k">FROM</span> <span class="n">powa_wait_sampling_history_current</span> <span class="n">w</span> <span class="k">JOIN</span> <span class="n">pg_database</span> <span class="n">d</span> <span class="k">ON</span> <span class="n">d</span><span class="p">.</span><span class="n">oid</span> <span class="o">=</span> <span class="n">w</span><span
class="p">.</span><span class="n">dbid</span> <span class="k">WHERE</span> <span class="n">d</span><span class="p">.</span><span class="n">datname</span> <span class="o">=</span> <span class="s1">'bench'</span> <span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">s</span> <span class="k">WHERE</span> <span class="n">events</span> <span class="o">!=</span> <span class="mi">0</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">ts</span> <span class="k">ASC</span><span class="p">,</span> <span class="n">event</span> <span class="k">DESC</span> <span class="k">LIMIT</span> <span class="mi">20</span><span class="p">;</span> <span class="n">ts</span> <span class="o">|</span> <span class="n">queryid</span> <span class="o">|</span> <span class="n">event_type</span> <span class="o">|</span> <span class="n">event</span> <span class="o">|</span> <span class="n">events</span> <span class="c1">-------------------------------+----------------------+------------+----------------+--------</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">08</span><span class="p">.</span><span class="mi">037191</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6531859117817823569</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">pg_qualstats</span> <span class="o">|</span> <span class="mi">1233</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span 
class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">4</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">149</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="mi">193</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span 
class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="mi">1143</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6531859117817823569</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">pg_qualstats</span> <span class="o">|</span> <span class="mi">1</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">lock_manager</span> <span class="o">|</span> <span class="mi">2</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">lock_manager</span> <span class="o">|</span> <span class="mi">3</span> <span 
class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">28</span><span class="p">.</span><span class="mi">035212</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">buffer_content</span> <span class="o">|</span> <span class="mi">2</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">14</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">335</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span 
class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="mi">2604</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="mi">384</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">lock_manager</span> <span class="o">|</span> <span class="mi">13</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span 
class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">lock_manager</span> <span class="o">|</span> <span class="mi">4</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8221555873158496753</span> <span class="o">|</span> <span class="n">IO</span> <span class="o">|</span> <span class="n">DataFileExtend</span> <span class="o">|</span> <span class="mi">1</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">44</span><span class="p">:</span><span class="mi">48</span><span class="p">.</span><span class="mi">037205</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="n">LWLock</span> <span class="o">|</span> <span class="n">buffer_content</span> <span class="o">|</span> <span class="mi">4</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">45</span><span class="p">:</span><span class="mi">08</span><span class="p">.</span><span class="mi">032938</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="mi">8851222058009799098</span> <span class="o">|</span> <span class="k">Lock</span> <span 
class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">5</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">45</span><span class="p">:</span><span class="mi">08</span><span class="p">.</span><span class="mi">032938</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">tuple</span> <span class="o">|</span> <span class="mi">312</span> <span class="mi">2018</span><span class="o">-</span><span class="mi">07</span><span class="o">-</span><span class="mi">09</span> <span class="mi">10</span><span class="p">:</span><span class="mi">45</span><span class="p">:</span><span class="mi">08</span><span class="p">.</span><span class="mi">032938</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span> <span class="o">-</span><span class="mi">6860707137622661878</span> <span class="o">|</span> <span class="k">Lock</span> <span class="o">|</span> <span class="n">transactionid</span> <span class="o">|</span> <span class="mi">2586</span> <span class="p">(</span><span class="mi">20</span> <span class="k">rows</span><span class="p">)</span></code></pre></figure> <p class="notice"><strong>NOTE:</strong> Il y a également une version par base de données de ces valeurs pour un traitement plus efficace au niveau des basesn dans les tables <code class="language-plaintext highlighter-rouge">powa_wait_sampling_history_current_db</code> et <code class="language-plaintext highlighter-rouge">powa_wait_sampling_history_db</code></p> <p>Et ces données sont visibles avec l’interface <a href="https://pypi.org/project/powa-web/">powa-web</a>. 
Voici quelques exemples d’affichage des //wait events// tels qu’affichés par PoWA avec un simple pgbench :</p> <h5 id="wait-events-pour-linstance-entière">Wait events pour l’instance entière</h5> <p><a href="/images/powa_waits_overview.png"><img src="/images/powa_waits_overview.png" alt="Wait events pour l'instance entière" /></a></p> <h5 id="wait-events-pour-une-base-de-données">Wait events pour une base de données</h5> <p><a href="/images/powa_waits_db.png"><img src="/images/powa_waits_db.png" alt="Wait events pour une base de données" /></a></p> <h5 id="wait-events-pour-une-seule-requête">Wait events pour une seule requête</h5> <p><a href="/images/powa_waits_query.png"><img src="/images/powa_waits_query.png" alt="Wait events pour une seule requête" /></a></p> <div class="gallery"> </div> <p>Cette fonctionnalité est disponible depuis la version 3.2 de PoWA. J’espère pouvoir afficher plus de vues de ces données dans le futur, en incluant d’autres graphes, puisque toutes les données sont déjà disponibles en bases. Également, si vous êtes un développeur python ou javascript, <a href="https://github.com/powa-team/powa-web">les contributions sont toujours bienvenues</a>!</p> <p><a href="https://rjuju.github.io/postgresqlfr/2019/04/02/support-des-wait-events-pour-powa.html">Support des Wait Events pour PoWA</a> was originally published by Julien Rouhaud at <a href="https://rjuju.github.io">rjuju's home</a> on April 02, 2019.</p>