Asynchronous MergeAppend

Started by Alexander Pyhalov, over 1 year ago · 27 messages · pgsql-hackers

Hello.

I'd like to make MergeAppend node Async-capable like Append node.
Nowadays when planner chooses MergeAppend plan, asynchronous execution
is not possible. With attached patches you can see plans like

EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt WHERE b % 100 = 0 ORDER BY b, a;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Merge Append
   Sort Key: async_pt.b, async_pt.a
   ->  Async Foreign Scan on public.async_p1 async_pt_1
         Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
         Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b %
100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
   ->  Async Foreign Scan on public.async_p2 async_pt_2
         Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
         Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (((b %
100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST

This can be quite profitable (in our test cases MergeAppend async
execution on remote servers gave up to a twofold speedup).

Code for asynchronous execution in MergeAppend was mostly borrowed from
the Append node.

What differs significantly: in ExecMergeAppendAsyncGetNext() you must
return a tuple from the specified slot. The subplan number determines
the tuple slot into which data should be retrieved. When a subplan is
ready to provide some data, the result is cached in ms_asyncresults.
When we get a tuple for the subplan specified in
ExecMergeAppendAsyncGetNext(), ExecMergeAppendAsyncRequest() returns
true and the loop in ExecMergeAppendAsyncGetNext() ends. We can fetch
data for subplans which either don't have a cached result ready or have
already returned it to the upper node; this flag is stored in
ms_has_asyncresults. As we can get data for some subplan either before
or after the loop in ExecMergeAppendAsyncRequest(), we check this flag
twice in that function.
Unlike ExecAppendAsyncEventWait(), it seems
ExecMergeAppendAsyncEventWait() doesn't need a timeout, as there's no
need to get a result from a synchronous subplan if a tuple from an
async one was explicitly requested.
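To make the flow above easier to follow, here is a minimal,
self-contained C sketch of the same bookkeeping. It is purely
illustrative: the struct and field names mimic the patch
(ms_asyncresults, ms_has_asyncresults), but the "tuples" are plain
integers and subplan readiness is simulated round-robin.

```c
#include <assert.h>
#include <stdbool.h>

#define NPLANS 3

/*
 * Toy model of the state described above: each async subplan has a
 * dedicated result slot; ms_asyncresults caches a ready tuple, and
 * ms_has_asyncresults marks slots whose cached result has not yet been
 * handed to the upper node.  Illustrative only, not the patch's code.
 */
typedef struct ToyMergeAppendState
{
	int		ms_asyncresults[NPLANS];	/* cached tuple per subplan */
	bool	ms_has_asyncresults[NPLANS];	/* cached, not yet consumed? */
} ToyMergeAppendState;

/* Stand-in for a subplan asynchronously producing its next tuple. */
static int
toy_produce(int plan)
{
	static int	seq[NPLANS] = {10, 20, 30};

	return seq[plan]++;
}

/*
 * Model of the request step: cache a result for whichever subplan became
 * ready, and report whether the *requested* subplan now has a result --
 * that is what terminates the loop in the get-next step.
 */
static bool
toy_async_request(ToyMergeAppendState *ms, int ready_plan, int requested_plan)
{
	ms->ms_asyncresults[ready_plan] = toy_produce(ready_plan);
	ms->ms_has_asyncresults[ready_plan] = true;
	return ms->ms_has_asyncresults[requested_plan];
}

/*
 * Model of the get-next step: unlike Append, the caller names a specific
 * subplan, so we loop until that subplan's slot holds a tuple, issuing
 * requests only for subplans without an unconsumed cached result.
 */
static int
toy_async_get_next(ToyMergeAppendState *ms, int requested_plan)
{
	int			next_ready = 0;

	while (!ms->ms_has_asyncresults[requested_plan])
	{
		/* pretend subplans become ready in round-robin order */
		if (!ms->ms_has_asyncresults[next_ready] &&
			toy_async_request(ms, next_ready, requested_plan))
			break;
		next_ready = (next_ready + 1) % NPLANS;
	}
	ms->ms_has_asyncresults[requested_plan] = false; /* handed upward */
	return ms->ms_asyncresults[requested_plan];
}
```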

Also we had to fix postgres_fdw to avoid directly looking at Append
fields. Perhaps the accessors to Append fields look strange, but they
allow us to avoid some code duplication. I suppose duplication could be
reduced even further if we reworked the async Append implementation,
but so far I haven't tried to do this, to avoid a big diff from master.

Also, mark_async_capable() believes that the path corresponds to the
plan. This may not be true when create_[merge_]append_plan() inserts a
Sort node. In that case mark_async_capable() can treat the Sort plan
node as some other node type and crash, so there's a small fix for this.
--
Best regards,
Alexander Pyhalov,
Postgres Professional

Attachments:

v1-0001-mark_async_capable-subpath-should-match-subplan.patch (text/x-diff, +8 −9)
v1-0002-MergeAppend-should-support-Async-Foreign-Scan-subpla.patch (text/x-diff, +676 −7)
In reply to: Alexander Pyhalov (#1)
Re: Asynchronous MergeAppend

Hi! Thank you for your work on this subject! I think this is a very
useful optimization.

While looking through your code, I noticed some points that I think
should be taken into account. Firstly, I noticed only two tests
verifying the functionality of this feature, and I don't think that's
enough.
Are you thinking about adding some tests with queries involving, for
example, joins between different tables and unusual operators?

In addition, I have a question about testing your feature on a
benchmark. Are you going to do this?

On 17.07.2024 16:24, Alexander Pyhalov wrote:

Hello.

I'd like to make MergeAppend node Async-capable like Append node.
Nowadays when planner chooses MergeAppend plan, asynchronous execution
is not possible. With attached patches you can see plans like

EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt WHERE b % 100 = 0 ORDER BY b, a;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------

 Merge Append
   Sort Key: async_pt.b, async_pt.a
   ->  Async Foreign Scan on public.async_p1 async_pt_1
         Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
         Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b %
100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
   ->  Async Foreign Scan on public.async_p2 async_pt_2
         Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
         Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (((b %
100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST

This can be quite profitable (in our test cases you can gain up to two
times better speed with MergeAppend async execution on remote servers).

Code for asynchronous execution in Merge Append was mostly borrowed
from Append node.

What differs significantly: in ExecMergeAppendAsyncGetNext() you must
return a tuple from the specified slot. The subplan number determines
the tuple slot into which data should be retrieved. When a subplan is
ready to provide some data, the result is cached in ms_asyncresults.
When we get a tuple for the subplan specified in
ExecMergeAppendAsyncGetNext(), ExecMergeAppendAsyncRequest() returns
true and the loop in ExecMergeAppendAsyncGetNext() ends. We can fetch
data for subplans which either don't have a cached result ready or have
already returned it to the upper node; this flag is stored in
ms_has_asyncresults. As we can get data for some subplan either before
or after the loop in ExecMergeAppendAsyncRequest(), we check this flag
twice in that function.
Unlike ExecAppendAsyncEventWait(), it seems
ExecMergeAppendAsyncEventWait() doesn't need a timeout, as there's no
need to get a result from a synchronous subplan if a tuple from an
async one was explicitly requested.

Also we had to fix postgres_fdw to avoid directly looking at Append
fields. Perhaps the accessors to Append fields look strange, but they
allow us to avoid some code duplication. I suppose duplication could be
reduced even further if we reworked the async Append implementation,
but so far I haven't tried to do this, to avoid a big diff from master.

Also, mark_async_capable() believes that the path corresponds to the
plan. This may not be true when create_[merge_]append_plan() inserts a
Sort node. In that case mark_async_capable() can treat the Sort plan
node as some other node type and crash, so there's a small fix for this.

I think you should add this explanation to the commit message because
without it it's hard to understand the full picture of how your code works.

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In reply to: Alena Rybakina (#2)
Re: Asynchronous MergeAppend

Hi.

Alena Rybakina wrote on 2024-08-10 23:24:

Hi! Thank you for your work on this subject! I think this is a very
useful optimization.

While looking through your code, I noticed some points that I think
should be taken into account. Firstly, I noticed only two tests
verifying the functionality of this feature, and I don't think that's
enough.
Are you thinking about adding some tests with queries involving, for
example, joins between different tables and unusual operators?

I've added some more tests - tests for joins and pruning.

In addition, I have a question about testing your feature on a
benchmark. Are you going to do this?

The main reason for this work is a dramatic performance degradation when
Append plans with async foreign scan nodes are switched to MergeAppend
plans with synchronous foreign scans.

I've performed some synthetic tests to demonstrate the benefits of
async Merge Append. So far, tests have been performed on a single
physical host.

For tests I've deployed 3 PostgreSQL instances on ports 5432-5434.

The first instance:
create server s2 foreign data wrapper postgres_fdw OPTIONS ( port
'5433', dbname 'postgres', async_capable 'on');
create server s3 foreign data wrapper postgres_fdw OPTIONS ( port
'5434', dbname 'postgres', async_capable 'on');

create foreign table players_p1 partition of players for values with
(modulus 4, remainder 0) server s2;
create foreign table players_p2 partition of players for values with
(modulus 4, remainder 1) server s2;
create foreign table players_p3 partition of players for values with
(modulus 4, remainder 2) server s3;
create foreign table players_p4 partition of players for values with
(modulus 4, remainder 3) server s3;

s2 instance:
create table players_p1 (id int, name text, score int);
create table players_p2 (id int, name text, score int);
create index on players_p1(score);
create index on players_p2(score);

s3 instance:
create table players_p3 (id int, name text, score int);
create table players_p4 (id int, name text, score int);
create index on players_p3(score);
create index on players_p4(score);

s1 instance:
insert into players select i, 'player_' || i, random() * 100 from
generate_series(1,100000) i;

pgbench script:
\set rnd_offset random(0,200)
\set rnd_limit random(10,20)

select * from players order by score desc offset :rnd_offset limit
:rnd_limit;

pgbench was run as:
pgbench -n -f 1.sql postgres -T 100 -c 16 -j 16

CPU idle was about 5-10%.

pgbench results:

Without patch, async_capable on:

pgbench (14.13, server 18devel)
transaction type: 1.sql
scaling factor: 1
query mode: simple
number of clients: 16
number of threads: 16
duration: 100 s
number of transactions actually processed: 130523
latency average = 12.257 ms
initial connection time = 29.824 ms
tps = 1305.363500 (without initial connection time)

Without patch, async_capable off:

pgbench (14.13, server 18devel)
transaction type: 1.sql
scaling factor: 1
query mode: simple
number of clients: 16
number of threads: 16
duration: 100 s
number of transactions actually processed: 130075
latency average = 12.299 ms
initial connection time = 26.931 ms
tps = 1300.877993 (without initial connection time)

as expected - we see no difference.

Patched, async_capable on:

pgbench (14.13, server 18devel)
transaction type: 1.sql
scaling factor: 1
query mode: simple
number of clients: 16
number of threads: 16
duration: 100 s
number of transactions actually processed: 135616
latency average = 11.796 ms
initial connection time = 28.619 ms
tps = 1356.341587 (without initial connection time)

Patched, async_capable off:

pgbench (14.13, server 18devel)
transaction type: 1.sql
scaling factor: 1
query mode: simple
number of clients: 16
number of threads: 16
duration: 100 s
number of transactions actually processed: 131300
latency average = 12.185 ms
initial connection time = 29.573 ms
tps = 1313.138405 (without initial connection time)

Here we can see that async MergeAppend behaves a bit better. You could
argue that the benefit is not that big and is perhaps related to random
factors. However, if we set the number of threads to 1, so that the CPU
has idle cores, we'll see a more evident improvement:

Patched, async_capable on:
pgbench (14.13, server 18devel)
transaction type: 1.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 20221
latency average = 4.945 ms
initial connection time = 7.035 ms
tps = 202.221816 (without initial connection time)

Patched, async_capable off
transaction type: 1.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 14941
latency average = 6.693 ms
initial connection time = 7.037 ms
tps = 149.415688 (without initial connection time)

--
Best regards,
Alexander Pyhalov,
Postgres Professional

Attachments:

v2-0002-MergeAppend-should-support-Async-Foreign-Scan-subpla.patch (text/x-diff, +886 −7)
v2-0001-mark_async_capable-subpath-should-match-subplan.patch (text/x-diff, +8 −9)
In reply to: Alexander Pyhalov (#3)
Re: Asynchronous MergeAppend

Hi.

I've updated the patches for asynchronous merge append. They have
allowed us to significantly improve performance in practice. Earlier, a
select from a partitioned (and distributed) table could switch from an
asynchronous Append plan to a synchronous MergeAppend plan. Given that
the table could have 20+ partitions, the latter was cheaper, but much
less efficient due to the remote parts executing synchronously.

In this version there are a couple of small fixes: earlier,
ExecMergeAppend() scanned all asyncplans, but it should do this only
for valid asyncplans. I've also incorporated the logic from

commit af717317a04f5217728ce296edf4a581eb7e6ea0
Author: Heikki Linnakangas <[email protected]>
Date: Wed Mar 12 20:53:09 2025 +0200

Handle interrupts while waiting on Append's async subplans

into ExecMergeAppendAsyncEventWait().

--
Best regards,
Alexander Pyhalov,
Postgres Professional

Attachments:

0002-MergeAppend-should-support-Async-Foreign-Scan-subpla.patch (text/x-diff, +910 −7)
0001-mark_async_capable-subpath-should-match-subplan.patch (text/x-diff, +8 −9)
In reply to: Alexander Pyhalov (#4)
Re: Asynchronous MergeAppend

I noticed that this patch has gone largely unreviewed, but it needs a
rebase due to the GUC changes, so here it is again.

Thanks

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

Attachments:

v4-0001-mark_async_capable-subpath-should-match-subplan.patch (text/x-diff, +8 −9)
v4-0002-MergeAppend-should-support-Async-Foreign-Scan-sub.patch (text/x-diff, +907 −7)
In reply to: Alexander Pyhalov (#3)
Re: Asynchronous MergeAppend

Hi, thanks for working on this!

On Tue Aug 20, 2024 at 6:14 AM -03, Alexander Pyhalov wrote:

In addition, I have a question about testing your feature on a
benchmark. Are you going to do this?

The main reason for this work is a dramatic performance degradation when
Append plans with async foreign scan nodes are switched to MergeAppend
plans with synchronous foreign scans.

I've performed some synthetic tests to prove the benefits of async Merge
Append. So far tests are performed on one physical host.

For tests I've deployed 3 PostgreSQL instances on ports 5432-5434.

The first instance:
create server s2 foreign data wrapper postgres_fdw OPTIONS ( port
'5433', dbname 'postgres', async_capable 'on');
create server s3 foreign data wrapper postgres_fdw OPTIONS ( port
'5434', dbname 'postgres', async_capable 'on');

create foreign table players_p1 partition of players for values with
(modulus 4, remainder 0) server s2;
create foreign table players_p2 partition of players for values with
(modulus 4, remainder 1) server s2;
create foreign table players_p3 partition of players for values with
(modulus 4, remainder 2) server s3;
create foreign table players_p4 partition of players for values with
(modulus 4, remainder 3) server s3;

s2 instance:
create table players_p1 (id int, name text, score int);
create table players_p2 (id int, name text, score int);
create index on players_p1(score);
create index on players_p2(score);

s3 instance:
create table players_p3 (id int, name text, score int);
create table players_p4 (id int, name text, score int);
create index on players_p3(score);
create index on players_p4(score);

s1 instance:
insert into players select i, 'player_' ||i, random()* 100 from
generate_series(1,100000) i;

pgbench script:
\set rnd_offset random(0,200)
\set rnd_limit random(10,20)

select * from players order by score desc offset :rnd_offset limit
:rnd_limit;

pgbench was run as:
pgbench -n -f 1.sql postgres -T 100 -c 16 -j 16

CPU idle was about 5-10%.

pgbench results:

[...]
However, if we set number of threads to 1, so that CPU has idle cores,
we'll see more evident improvements:

Patched, async_capable on:
pgbench (14.13, server 18devel)
transaction type: 1.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 20221
latency average = 4.945 ms
initial connection time = 7.035 ms
tps = 202.221816 (without initial connection time)

Patched, async_capable off
transaction type: 1.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 14941
latency average = 6.693 ms
initial connection time = 7.037 ms
tps = 149.415688 (without initial connection time)

I ran some benchmarks based on v4 attached by Alvaro in [1] using a
smaller number of threads so that some CPU cores would be idle, and I
also obtained better results:

Patched, async_capable on:
tps = 4301.567405

Master, async_capable on:
tps = 3847.084545

So I'm +1 for the idea. I know it's been a while since the last patch,
and unfortunately it hasn't received reviews since then. Do you still
plan to work on it? I still need to take a look at the code to see if I
can help with some comments.

During the tests I got compiler errors due to fce7c73fba4, so I'm
attaching a v5 with guc_parameters.dat correctly sorted.

The postgres_fdw regression tests were also failing due to some
whitespace problems; v5 also fixes this.

[1]: /messages/by-id/[email protected]

--
Matheus Alcantara

Attachments:

v5-0001-mark_async_capable-subpath-should-match-subplan.patch (text/plain, +8 −9)
v5-0002-MergeAppend-should-support-Async-Foreign-Scan-sub.patch (text/plain, +908 −7)
In reply to: Matheus Alcantara (#6)
Re: Asynchronous MergeAppend

Hi.

Matheus Alcantara wrote on 2025-11-03 16:00:

So I'm +1 for the idea. I know it's been a while since the last patch,
and unfortunately it hasn't received reviews since then. Do you still
plan to work on it? I still need to take a look at the code to see if I
can help with some comments.

During the tests I got compiler errors due to fce7c73fba4, so I'm
attaching a v5 with guc_parameters.dat correctly sorted.

The postgres_fdw regression tests were also failing due to some
whitespace problems; v5 also fixes this.

[1] /messages/by-id/[email protected]

I'm still interested in working on this patch, but it hasn't received
any review (besides an internal one). I suppose the Append and
MergeAppend nodes need some unification; for example,
ExecAppendAsyncEventWait and ExecMergeAppendAsyncEventWait look the
same, and both classify_matching_subplans() versions are suspiciously
similar. But honestly, the patch needs a thorough review.
--
Best regards,
Alexander Pyhalov,
Postgres Professional

In reply to: Alexander Pyhalov (#7)
Re: Asynchronous MergeAppend

On Wed Nov 5, 2025 at 3:30 AM -03, Alexander Pyhalov wrote:

So I'm +1 for the idea. I know it's been a while since the last patch,
and unfortunately it hasn't received reviews since then. Do you still
plan to work on it? I still need to take a look at the code to see if I
can help with some comments.

I'm still interested in working on this patch, but it hasn't received
any review (besides an internal one). I suppose the Append and
MergeAppend nodes need some unification; for example,
ExecAppendAsyncEventWait and ExecMergeAppendAsyncEventWait look the
same, and both classify_matching_subplans() versions are suspiciously
similar. But honestly, the patch needs a thorough review.

Here are some comments from my first look at the patches. I still don't
have much experience with the executor code, but I hope that I can help
with something.

v5-0001-mark_async_capable-subpath-should-match-subplan.patch

I don't have too many comments on this; perhaps we could have a commit
message explaining the reason behind the change.

----

v5-0002-MergeAppend-should-support-Async-Foreign-Scan-sub.patch

The AppendState struct has the "as_syncdone", this field is not needed
on MergeAppendState?

----
Regarding the duplicated code in classify_matching_subplans, I think
that we can have a more generic function that operates on function
parameters, something like this:

/*
* Classify valid subplans into sync and async groups.
*
* It calculates the intersection of *valid_subplans and *asyncplans,
* stores the result in *valid_asyncplans, and removes those members
* from *valid_subplans (leaving only sync plans).
*
* Returns true if valid async plans were found, false otherwise.
*/
static bool
classify_subplans_internal(Bitmapset **valid_subplans,
Bitmapset *asyncplans,
Bitmapset **valid_asyncplans);
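For illustration, the proposed signature could be backed by something
like the following self-contained sketch, with a plain uint32_t bitmask
standing in for PostgreSQL's Bitmapset (the real code would use the
bms_* operations); this is a hypothetical model, not the patch's code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy stand-in for the proposed classify_subplans_internal(): compute
 * the intersection of *valid_subplans and asyncplans into
 * *valid_asyncplans and remove it from *valid_subplans, leaving only
 * sync plans.  Bit i set means subplan i is a member of the set.
 */
static bool
classify_subplans_internal(uint32_t *valid_subplans,
						   uint32_t asyncplans,
						   uint32_t *valid_asyncplans)
{
	*valid_asyncplans = *valid_subplans & asyncplans;	/* valid AND async */
	*valid_subplans &= ~(*valid_asyncplans);	/* keep only sync plans */

	return *valid_asyncplans != 0;
}
```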

----
The GetValidAsyncplans() function is not being used.

----
We have some reduction of code coverage in nodeMergeAppend.c. The
significant blocks are in ExecMergeAppendAsyncBegin():
+	/* If we've yet to determine the valid subplans then do so now. */
+	if (!node->ms_valid_subplans_identified)
+	{
+		node->ms_valid_subplans =
+			ExecFindMatchingSubPlans(node->ms_prune_state, false, NULL);
+		node->ms_valid_subplans_identified = true;
+
+		classify_matching_subplans(node);
+	}

And there are some blocks in ExecReScanMergeAppend(). Is it worth
adding a test case for them? I'm not sure how hard it would be to write
a regression test that covers these blocks.

----
I agree that duplicated code is not good, but it seems to me that we
already have some code in nodeMergeAppend.c borrowed from nodeAppend.c
even without your patch, for example ExecInitMergeAppend(),
ExecReScanMergeAppend() and partially ExecMergeAppend().

Although nodeMergeAppend.c and nodeAppend.c have similar functions,
some differences exist, and I'm wondering if we should wait for the
rule of three [1] to refactor this duplicated code?

[1]: https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)

--
Matheus Alcantara
EDB: http://www.enterprisedb.com

In reply to: Matheus Alcantara (#8)
Re: Asynchronous MergeAppend

Matheus Alcantara wrote on 2025-11-12 00:00:

On Wed Nov 5, 2025 at 3:30 AM -03, Alexander Pyhalov wrote:

So I'm +1 for the idea. I know it's been a while since the last patch,
and unfortunately it hasn't received reviews since then. Do you still
plan to work on it? I still need to take a look at the code to see if I
can help with some comments.

I'm still interested in working on this patch, but it hasn't received
any review (besides an internal one). I suppose the Append and
MergeAppend nodes need some unification; for example,
ExecAppendAsyncEventWait and ExecMergeAppendAsyncEventWait look the
same, and both classify_matching_subplans() versions are suspiciously
similar. But honestly, the patch needs a thorough review.

Hi.
Thanks for review.

Here are some comments from my first look at the patches. I still don't
have much experience with the executor code, but I hope that I can help
with something.

v5-0001-mark_async_capable-subpath-should-match-subplan.patch

I don't have too many comments on this; perhaps we could have a commit
message explaining the reason behind the change.

I've expanded the commit message. The issue is that mark_async_capable()
relies on the fact that the plan node type corresponds to the path
type. This is not true when create_[merge_]append_plan() decides to
insert a Sort node.

----

v5-0002-MergeAppend-should-support-Async-Foreign-Scan-sub.patch

The AppendState struct has the "as_syncdone" field; is this field not
needed in MergeAppendState?
We don't need as_syncdone. Async Append fetches tuples from async
subplans and waits for them either when they have some data or when
there are no more sync subplans (as we can return any tuple we receive
from subplans). But ExecMergeAppend must decide which tuple to return
based on sort order, so there's no need to remember whether we are done
with sync subplans: subplan ordering matters, and we can't arbitrarily
switch between them.
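The ordering argument can be seen in a tiny self-contained merge sketch
(illustrative only, not the executor's heap-based code): the comparison
of the current head values, not subplan readiness, dictates which input
must be consumed next, so a "sync subplans done" flag would never
change which child gets asked.

```c
#include <assert.h>

/*
 * Minimal two-way merge of sorted inputs.  MergeAppend's real code uses
 * a binary heap over N subplans, but the principle is the same: the
 * next output always comes from whichever input currently owns the
 * smallest head value, regardless of which inputs happen to have
 * tuples ready.  Illustrative sketch only.
 */
static int
toy_merge(const int *a, int na, const int *b, int nb, int *out)
{
	int			i = 0,
				j = 0,
				n = 0;

	while (i < na && j < nb)
		out[n++] = (a[i] <= b[j]) ? a[i++] : b[j++];
	while (i < na)
		out[n++] = a[i++];
	while (j < nb)
		out[n++] = b[j++];

	return n;
}
```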

Regarding the duplicated code in classify_matching_subplans, I think
that we can have a more generic function that operates on function
parameters

I've tried to do so, but there are two issues. There's no suitable
common header between nodeAppend and nodeMergeAppend; I've put
classify_matching_subplans_common() into src/include/nodes/execnodes.h,
and I'm sure that's not the best choice. The second issue is with
as_syncdone: it exists only in AppendState, so we should check for
empty valid_subplans separately. In fact, there are three outcomes for
Append: 1) no sync plans and no async plans, 2) no async plans, 3)
async plans present; only the last two states have meaning for
MergeAppend.

The GetValidAsyncplans() is not being used

As well as GetValidAsyncRequest(), these parts are used by our FDW, so
they slipped into the patch. I've removed them.

----
We have some reduction of code coverage on nodeMergeAppend.c. The
significant blocks are on ExecMergeAppendAsyncBegin():
+	/* If we've yet to determine the valid subplans then do so now. */
+	if (!node->ms_valid_subplans_identified)
+	{
+		node->ms_valid_subplans =
+			ExecFindMatchingSubPlans(node->ms_prune_state, false, NULL);
+		node->ms_valid_subplans_identified = true;
+
+		classify_matching_subplans(node);
+	}

And there are some blocks in ExecReScanMergeAppend(). Is it worth
adding a test case for them? I'm not sure how hard it would be to write
a regression test that covers these blocks.

You are right. There's a difference between ExecAppend and
ExecMergeAppend: Append identifies valid subplans in
ExecAppendAsyncBegin(), while MergeAppend does this earlier, in
ExecMergeAppend(). So this is really dead code. And there was an issue
with it, which became evident when I added a test for rescan: when
we've identified new subplans in ExecMergeAppend(), we have to classify
them.
--
Best regards,
Alexander Pyhalov,
Postgres Professional

Attachments:

v6-0001-mark_async_capable-subpath-should-match-subplan.patch (text/x-diff, +8 −9)
v6-0002-MergeAppend-should-support-Async-Foreign-Scan-subpla.patch (text/x-diff, +938 −26)
In reply to: Alexander Pyhalov (#9)
Re: Asynchronous MergeAppend

On Sat Nov 15, 2025 at 7:57 AM -03, Alexander Pyhalov wrote:

Here are some comments from my first look at the patches. I still don't
have much experience with the executor code, but I hope that I can help
with something.

v5-0001-mark_async_capable-subpath-should-match-subplan.patch

I don't have too many comments on this; perhaps we could have a commit
message explaining the reason behind the change.

I've expanded the commit message. The issue is that mark_async_capable()
relies on the fact that the plan node type corresponds to the path
type. This is not true when create_[merge_]append_plan() decides to
insert a Sort node.

Your explanation of why this change is needed that you included in your
first email sounds clearer IMHO. I would suggest the following for a
commit message:

    mark_async_capable() believes that the path corresponds to the
    plan. This is not true when create_[merge_]append_plan() inserts a
    Sort node. In this case mark_async_capable() can treat the Sort
    plan node as some other node and crash. Fix this by handling the
    Sort node separately.

    This is needed to make the MergeAppend node async-capable, which
    will be implemented in a following commit.

What do you think?

I was reading the patch changes again and I have a minor point:

				SubqueryScan *scan_plan = (SubqueryScan *) plan;

				/*
-				 * If the generated plan node includes a gating Result node,
-				 * we can't execute it asynchronously.
+				 * If the generated plan node includes a gating Result node or
+				 * a Sort node, we can't execute it asynchronously.
				 */
-				if (IsA(plan, Result))
+				if (IsA(plan, Result) || IsA(plan, Sort))

Shouldn't we cast the plan to SubqueryScan * after the IsA(...) check?
I know this code existed before your changes, but type casting before
an IsA() check sounds a bit strange to me. Also, perhaps we could add
an Assert(IsA(plan, SubqueryScan)) after the IsA(...) check and before
the cast, just for sanity?

----

v5-0002-MergeAppend-should-support-Async-Foreign-Scan-sub.patch

The AppendState struct has the "as_syncdone" field; is this field not
needed in MergeAppendState?

We don't need as_syncdone. Async Append fetches tuples from async
subplans and waits for them either when they have some data or when
there are no more sync subplans (as we can return any tuple we receive
from subplans). But ExecMergeAppend must decide which tuple to return
based on sort order, so there's no need to remember whether we are done
with sync subplans: subplan ordering matters, and we can't arbitrarily
switch between them.

Ok, thanks for the explanation.

Regarding the duplicated code in classify_matching_subplans, I think
that we can have a more generic function that operates on function
parameters

I've tried to do so, but there are two issues. There's no suitable
common header between nodeAppend and nodeMergeAppend; I've put
classify_matching_subplans_common() into src/include/nodes/execnodes.h,
and I'm sure that's not the best choice. The second issue is with
as_syncdone: it exists only in AppendState, so we should check for
empty valid_subplans separately. In fact, there are three outcomes for
Append: 1) no sync plans and no async plans, 2) no async plans, 3)
async plans present; only the last two states have meaning for
MergeAppend.

I think it's OK to have these separated checks in nodeAppend.c and
nodeMergeAppend.c once the majority of the duplicated steps are
centralized into a single reusable function.

I also agree that execnodes.h may not be the best place to declare this
function, but I also don't have too many ideas of where else to put it.
Let's see if we get more comments on this.

----
We have some reduction of code coverage on nodeMergeAppend.c. The
significant blocks are on ExecMergeAppendAsyncBegin():
+    /* If we've yet to determine the valid subplans then do so now. */
+    if (!node->ms_valid_subplans_identified)
+    {
+            node->ms_valid_subplans =
+                    ExecFindMatchingSubPlans(node->ms_prune_state, false, NULL);
+            node->ms_valid_subplans_identified = true;
+
+            classify_matching_subplans(node);
+    }

And there are some blocks in ExecReScanMergeAppend(). Is it worth
adding a test case for them? I'm not sure how hard it would be to write
a regression test that covers these blocks.

You are right. There's a difference between ExecAppend and
ExecMergeAppend: Append identifies valid subplans in
ExecAppendAsyncBegin(), while MergeAppend does this earlier, in
ExecMergeAppend(). So this is really dead code. And there was an issue
with it, which became evident when I added a test for rescan: when
we've identified new subplans in ExecMergeAppend(), we have to classify
them.

Thanks, the code coverage looks better now.

I plan to do another round of review on 0002, in the meantime I'm
sharing these comments for now.

--
Matheus Alcantara
EDB: http://www.enterprisedb.com

In reply to: Matheus Alcantara (#10)
Re: Asynchronous MergeAppend

Hi.

Matheus Alcantara wrote on 2025-11-18 00:09:

On Sat Nov 15, 2025 at 7:57 AM -03, Alexander Pyhalov wrote:

Here are some comments from my first look at the patches. I still don't
have much experience with the executor code, but I hope that I can help
with something.

v5-0001-mark_async_capable-subpath-should-match-subplan.patch

I don't have too many comments on this; perhaps we could have a commit
message explaining the reason behind the change.

I've expanded the commit message. The issue is that mark_async_capable()
relies on the fact that the plan node type corresponds to the path
type. This is not true when create_[merge_]append_plan() decides to
insert a Sort node.

Your explanation of why this change is needed, which you included in
your first email, sounds clearer IMHO. I would suggest the following
for a commit message:

    mark_async_capable() believes that a path corresponds to its plan.
    This is not true when create_[merge_]append_plan() inserts a Sort
    node: in this case mark_async_capable() can treat the Sort plan node
    as some other node and crash. Fix this by handling the Sort node
    separately.

    This is needed to make the MergeAppend node async-capable, which
    will be implemented in a following commit.

What do you think?

Seems to be OK.

I was reading the patch changes again and I have a minor point:

				SubqueryScan *scan_plan = (SubqueryScan *) plan;

				/*
-				 * If the generated plan node includes a gating Result node,
-				 * we can't execute it asynchronously.
+				 * If the generated plan node includes a gating Result node or
+				 * a Sort node, we can't execute it asynchronously.
				 */
-				if (IsA(plan, Result))
+				if (IsA(plan, Result) || IsA(plan, Sort))

Shouldn't we cast the plan to SubqueryScan * after the IsA(...) check? I
know this code existed before your changes, but type casting before an
IsA() check sounds a bit strange to me. Also, perhaps we could add an
Assert(IsA(plan, SubqueryScan)) after the IsA(...) check and before the
type cast, just for sanity?

Yes, checking that a node is not of type A and then using it as B seems
strange. But casting to another type and checking whether the node is of
a particular type before using it seems rather common. It doesn't do any
harm as long as we don't actually refer to the SubqueryScan fields.
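A toy illustration of the idiom being debated (NOT PostgreSQL code; NodeTag, IsA() and the field names are minimal stand-ins invented here): casting a pointer before the tag check is harmless as long as it is not dereferenced until after the check succeeds.

```c
#include <assert.h>

/* Minimal stand-ins for PostgreSQL's node tagging machinery. */
typedef enum { T_Result, T_Sort, T_SubqueryScan } NodeTag;
typedef struct Node { NodeTag type; } Node;
typedef struct SubqueryScan { NodeTag type; int scanrelid; } SubqueryScan;

#define IsA(ptr, t) (((const Node *) (ptr))->type == T_##t)

static int
handle_plan(Node *plan)
{
    /* Cast before the tag check: only a pointer conversion, no dereference. */
    SubqueryScan *scan_plan = (SubqueryScan *) plan;

    if (IsA(plan, Result) || IsA(plan, Sort))
        return -1;                       /* cannot run asynchronously */

    /* The suggested sanity check, before touching SubqueryScan fields. */
    assert(IsA(plan, SubqueryScan));
    return scan_plan->scanrelid;         /* safe: tag verified above */
}
```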

Updated the first patch.

--
Best regards,
Alexander Pyhalov,
Postgres Professional

Attachments:

v7-0001-mark_async_capable-subpath-should-match-subplan.patch (+10/-9)
v7-0002-MergeAppend-should-support-Async-Foreign-Scan-subpla.patch (+938/-26)
In reply to: Alexander Pyhalov (#11)
Re: Asynchronous MergeAppend

On Tue Nov 18, 2025 at 4:14 AM -03, Alexander Pyhalov wrote:

Updated the first patch.

Thanks for the new version. Some new comments.

v7-0002-MergeAppend-should-support-Async-Foreign-Scan-subpla.patch:

1. Should it be "nasyncplans" instead of "nplans"? ExecInitAppend uses
"nasyncplans" to allocate the as_asyncresults array.

+		mergestate->ms_asyncresults = (TupleTableSlot **)
+			palloc0(nplans * sizeof(TupleTableSlot *));
+

2. I think that this comment should be updated. IIUC ms_valid_subplans
may not be all subplans because classify_matching_subplans() may move
async ones to ms_valid_asyncplans. Is that right?

/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be
* set to all subplans.
*/

3. This code comment should also mention the Assert(!bms_is_member(...));?

+	/* The result should be a TupleTableSlot or NULL. */
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+	Assert(!bms_is_member(areq->request_index, node->ms_has_asyncresults));

4. Is it worth including a bms_num_members(node->ms_needrequest) <= 0
check in ExecMergeAppendAsyncRequest() as an early return? IIUC it would
avoid the bms_is_member(), bms_next_member() and bms_is_empty(needrequest)
calls.

ExecMergeAppendAsyncRequest(MergeAppendState *node, int mplan)
Bitmapset *needrequest;
int i;

+       if (bms_num_members(node->ms_needrequest) <= 0)
+               return false;
+

I'm attaching a diff with some cosmetic changes of indentation and
comments. Feel free to include it in the patch or not.

--
Matheus Alcantara
EDB: http://www.enterprisedb.com

Attachments:

diff.txt (+16/-5)
In reply to: Matheus Alcantara (#12)
Re: Asynchronous MergeAppend

Matheus Alcantara wrote on 2025-11-20 00:51:

On Tue Nov 18, 2025 at 4:14 AM -03, Alexander Pyhalov wrote:

Updated the first patch.

Thanks for the new version. Some new comments.

v7-0002-MergeAppend-should-support-Async-Foreign-Scan-subpla.patch:

1. Should be "nasyncplans" instead of "nplans"? ExecInitAppend use
"nasyncplans" to allocate the as_asyncresults array.

+		mergestate->ms_asyncresults = (TupleTableSlot **)
+			palloc0(nplans * sizeof(TupleTableSlot *));
+

No. There's a difference between how Append and MergeAppend handle async
results.

When Append looks for the next result, it can return any of them; async
results are not ordered in Append. We have at most nasyncplans async
results and return the first available one when asked. For example, in
ExecAppendAsyncRequest():

	/*
	 * If there are any asynchronously-generated results that have not yet
	 * been returned, we have nothing to do; just return one of them.
	 */
	if (node->as_nasyncresults > 0)
	{
		--node->as_nasyncresults;
		*result = node->as_asyncresults[node->as_nasyncresults];
		return true;
	}

ExecAppendAsyncGetNext() looks (via ExecAppendAsyncRequest()) at any
result.

However, when MergeAppend is asked for a result, it should return the
result of a specific subplan. To achieve this we need to know which
subplan a given result corresponds to, so we enumerate async results in
the same way as requests (or ms_valid_asyncplans); look at
ExecMergeAppendAsyncGetNext()/ExecMergeAppendAsyncRequest().
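The two bookkeeping schemes can be contrasted with a toy model (plain C, NOT PostgreSQL code; the function and variable names are invented, and Bitmapset is replaced by an unsigned bitmask): Append keeps an unordered stack of whatever results have arrived, while MergeAppend addresses result slots by subplan number, because the merge needs the tuple of one particular subplan.

```c
#include <assert.h>
#include <stdbool.h>

#define NPLANS 4

/* Append-style: results form a stack, any one of them will do. */
static bool
append_get_result(int *results, int *nresults, int *out)
{
    if (*nresults > 0)
    {
        *out = results[--(*nresults)];
        return true;
    }
    return false;
}

/* MergeAppend-style: result slots are addressed by subplan number. */
static bool
merge_append_get_result(int *results, unsigned *has_result, int subplan,
                        int *out)
{
    if (*has_result & (1u << subplan))
    {
        *out = results[subplan];
        *has_result &= ~(1u << subplan);    /* mark as consumed */
        return true;
    }
    return false;
}
```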

2. I think that this comment should be updated. IIUC ms_valid_subplans
may not be all subplans because classify_matching_subplans() may move
async ones to ms_valid_asyncplans. Is that right?

/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be
* set to all subplans.
*/

Yes, you are correct; the similar comment in nodeAppend.c also lacks the
last sentence. Removed it.

3. This code comment should also mention the
Assert(!bms_is_member(...));?

+	/* The result should be a TupleTableSlot or NULL. */
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+	Assert(!bms_is_member(areq->request_index, 
node->ms_has_asyncresults));

4. It's worth include bms_num_members(node->ms_needrequest) <= 0 check
on ExecMergeAppendAsyncRequest() as an early return? IIUC it would
avoid
the bms_is_member(), bms_next_member() and bms_is_empty(needrequest)
calls.

We can't exclude the first bms_is_member(), as node->ms_needrequest can
be empty (we've already got the result), so we don't need to issue a
request to get it; we just return the previously fetched result.

Not sure about a check above the following lines:

	i = -1;
	while ((i = bms_next_member(node->ms_needrequest, i)) >= 0)
	{
		if (!bms_is_member(i, node->ms_has_asyncresults))
			needrequest = bms_add_member(needrequest, i);
	}

I don't think it would be much cheaper, as bms_next_member() only
executes a couple of instructions to find out that the number of words
in the bitmapset is zero; it does nothing expensive.

ExecMergeAppendAsyncRequest(MergeAppendState *node, int mplan)
Bitmapset *needrequest;
int i;

+       if (bms_num_members(node->ms_needrequest) <= 0)
+               return false;
+

No, as I've mentioned, we can't exclude the bms_is_member(mplan,
node->ms_has_asyncresults) check. We could have received a result while
waiting for data for another subplan.

Let's assume we have two async subplans (0 and 1) and, for example,
we've decided to get data from subplan 1. We've already sent requests to
both async subplans (in ExecMergeAppendAsyncBegin() or later). Now we do
ExecMergeAppendAsyncRequest(), but there are no subplans which need a
request, so we enter the waiting loop. Let's assume we get an event for
the other subplan (0). Via ExecAsyncNotify() and
ExecAsyncForeignScanNotify() we go to postgresForeignAsyncNotify(),
fetch data and mark the request as complete. Via
ExecAsyncMergeAppendResponse() we save the result for subplan 0 in
node->ms_asyncresults[0]. When we finally get the result for subplan 1,
we do the same, but now exit the loop. When MergeAppend later decides
that it needs results from subplan 0, we already have them, but
ms_needrequest is empty, so ExecMergeAppendAsyncRequest() just returns
this pre-fetched tuple.
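This scenario can be walked through with a toy model (NOT PostgreSQL code; ToyState, toy_response() and toy_request() are invented stand-ins for the ms_* fields, ExecAsyncMergeAppendResponse() and ExecMergeAppendAsyncRequest()): while waiting for subplan 1 we may also receive subplan 0's tuple, so a later request for subplan 0 must succeed from the cache even though nothing needs a request anymore. That is why the "do we already have it?" check must come before any look at the need-request set.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for ms_needrequest, ms_has_asyncresults, ms_asyncresults. */
typedef struct ToyState
{
    unsigned needrequest;   /* subplans that still need a request */
    unsigned has_result;    /* subplans with a cached, unreturned result */
    int      results[2];    /* one result slot per subplan */
} ToyState;

/* An arrived tuple is cached, as in ExecAsyncMergeAppendResponse(). */
static void
toy_response(ToyState *s, int subplan, int tuple)
{
    s->results[subplan] = tuple;
    s->has_result |= 1u << subplan;
    s->needrequest &= ~(1u << subplan);
}

/* The early "already have it?" check that cannot be skipped. */
static bool
toy_request(ToyState *s, int mplan, int *out)
{
    if (s->has_result & (1u << mplan))
    {
        s->has_result &= ~(1u << mplan);
        *out = s->results[mplan];
        return true;
    }
    /* otherwise we would issue requests / enter the waiting loop here */
    return false;
}
```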
--
Best regards,
Alexander Pyhalov,
Postgres Professional

Attachments:

v8-0001-mark_async_capable-subpath-should-match-subplan.patch (+10/-9)
v8-0002-MergeAppend-should-support-Async-Foreign-Scan-subpla.patch (+951/-31)
In reply to: Alexander Pyhalov (#13)
Re: Asynchronous MergeAppend

Thanks for the new version. I don't have other comments on the current
state of the patch. It seems to be working as expected, and the
performance improvements make it worthwhile.

I have just a small comment on 0002:

+	noccurred = WaitEventSetWait(node->ms_eventset, -1 /* no timeout */ , occurred_event,
+								 nevents, WAIT_EVENT_APPEND_READY);

Should we use the same WAIT_EVENT_APPEND_READY or create a new wait
event for merge append?

I've spent some time thinking about how we could remove some of the
duplicated code that you previously mentioned. I think we could create
something like we already have for relation scan operations: execScan.c,
which is used for example by nodeSeqScan.c and nodeIndexScan.c. The
attached patch 0003 is a draft in which I attempt to implement this
idea. 0001 and 0002 remain the same as in the previous version; 0003 was
built on top of them.

I've created Appender and AppenderState types that are used by
Append/MergeAppend and AppendState/MergeAppendState respectively (I
should have thought of a better name for these base types; ideas are
welcome). execAppend.c was created to hold the functions that can be
reused by Append and MergeAppend execution. I've tried to remove
duplicated code blocks that were almost the same and that didn't require
much refactoring.

I think there are still more opportunities to remove similar code
blocks, for example in ExecMergeAppendAsyncGetNext and
ExecAppendAsyncGetNext, but that requires refactoring.
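Roughly, the layering could look like the sketch below (the field names here are invented for illustration, not taken from the actual patch): the shared AppenderState is embedded as the first member of both node states, so common code in execAppend.c can operate on either node through a pointer to the shared part, just as ScanState works for the scan nodes.

```c
#include <assert.h>

/* Shared part reused by both Append and MergeAppend (toy version). */
typedef struct AppenderState
{
    int      nplans;
    unsigned valid_subplans;    /* stand-in for the Bitmapset fields */
} AppenderState;

typedef struct ToyAppendState
{
    AppenderState as;           /* shared part must come first */
    int           whichplan;    /* Append-specific state */
} ToyAppendState;

typedef struct ToyMergeAppendState
{
    AppenderState ms;           /* shared part must come first */
    int           heapsize;     /* placeholder for merge-specific state */
} ToyMergeAppendState;

/* Common code in execAppend.c would see only the embedded AppenderState. */
static int
appender_nplans(AppenderState *state)
{
    return state->nplans;
}
```

Because the shared struct is the first member, a pointer to either node state can also be cast directly to AppenderState *, which is the usual PostgreSQL "pseudo-inheritance" convention.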

Thoughts?

--
Matheus Alcantara
EDB: http://www.enterprisedb.com

Attachments:

v9-0001-mark_async_capable-subpath-should-match-subplan.patch (+10/-9)
v9-0002-MergeAppend-should-support-Async-Foreign-Scan-sub.patch (+951/-31)
v9-0003-Create-execAppend.c-to-avoid-duplicated-code-on-M.patch (+721/-914)
In reply to: Matheus Alcantara (#14)
Re: Asynchronous MergeAppend

Hi.

Matheus Alcantara wrote on 2025-12-17 23:01:

Thanks for the new version. I don't have other comments on the current
state of the patch. It seems to working as expected and we have
performance improvements that I think that it make it worthwhile.

I have just a small comment on 0002:

+	noccurred = WaitEventSetWait(node->ms_eventset, -1 /* no timeout */ , occurred_event,
+								 nevents, WAIT_EVENT_APPEND_READY);

Should we use the same WAIT_EVENT_APPEND_READY or create a new wait
event for merge append?

I'm not sure a new wait event is needed; for observability I don't think
it's critical to distinguish Append and MergeAppend when they wait for
foreign scans. But it probably doesn't do any harm to record a specific
wait event either.

I've spent some time thinking about how we could remove some parts of
the duplicated code that you've previously mention. I think that we
could try to create something like we already have for relation scan
operations, that we have execScan.c that is used for example by
nodeSeqScan.c and nodeIndexScan.c. The attached patch 0003 is a draft
that I attempt to implement this idea. The 0001 and 0002 remains the
same as the previous version. The 0003 was build on top of these.

I've created Appender and AppenderState types that are used by
Append/MergeAppend and AppendState/MergeAppendState respectively (I
should have think in a better name for these base type, ideas are
welcome). The execAppend.c was created to have the functions that can
be
reused by Append and MergeAppend execution. I've tried to remove
duplicated code blocks that was almost the same and that didn't require
much refactoring.

Overall I like the new Appender node. Splitting the code this way really
helps to avoid duplication. However, some similar code is still needed,
because the logic of getting new tuples is different.

Some minor issues I've noticed:
1) ExecReScanAppender() sets node->needrequest to NULL, but
ExecReScanAppend() calls bms_free(node->as.needrequest) immediately
after this. The same is true for ExecReScanMergeAppend(). We should move
the bms_free() call into ExecReScanAppender().

2) In src/backend/executor/execAppend.c:
planstates are named mergeplans in ExecEndAppender(); perhaps
appendplans or subplans would be better names.

ExecInitAppender() could use palloc_array() to allocate appendplanstates,
as ExecInitMergeAppend() does.

I think that there still more opportunities to remove similar code
blocks that for example are on ExecMergeAppendAsyncGetNext and
ExecAppendAsyncGetNext but it require refactoring.

Thoughts?

--
Matheus Alcantara
EDB: http://www.enterprisedb.com

--
Best regards,
Alexander Pyhalov,
Postgres Professional

In reply to: Alexander Pyhalov (#15)
Re: Asynchronous MergeAppend

On Thu Dec 18, 2025 at 6:56 AM -03, Alexander Pyhalov wrote:

+	noccurred = WaitEventSetWait(node->ms_eventset, -1 /* no timeout */ , occurred_event,
+								 nevents, WAIT_EVENT_APPEND_READY);

Should we use the same WAIT_EVENT_APPEND_READY or create a new wait
event for merge append?

I'm not sure that new wait event is needed - for observability I think
it's not critical
to distinguish Append and MergeAppend when they waited for foreign
scans. But also it's perhaps
doesn't do any harm to record specific wait event.

Ok, I think we can keep it this way for now and see if a new wait event
is really needed.

I've created Appender and AppenderState types that are used by
Append/MergeAppend and AppendState/MergeAppendState respectively (I
should have think in a better name for these base type, ideas are
welcome). The execAppend.c was created to have the functions that can
be
reused by Append and MergeAppend execution. I've tried to remove
duplicated code blocks that was almost the same and that didn't require
much refactoring.

Overall I like new Appender node. Splitting code in this way really
helps to avoid code duplication.
However, some similar code is still needed, because logic of getting new
tuples is different.

Indeed.

Some minor issues I've noticed.
1) ExecReScanAppender() sets node->needrequest to NULL.
ExecReScanAppend() calls bms_free(node->as.needrequest) immediately
after this. The same is true for ExecReScanMergeAppend(). We should move
it to ExecReScanAppender().

Fixed

2) In src/backend/executor/execAppend.c:
planstates are named as mergeplans in ExecEndAppender(), perhaps,
appendplans or subplans are better names.

Fixed

ExecInitAppender() could use palloc_array() to allocate appendplanstates
- as ExecInitMergeAppend().

Fixed

--
Matheus Alcantara
EDB: http://www.enterprisedb.com

Attachments:

v10-0001-mark_async_capable-subpath-should-match-subplan.patch (+10/-9)
v10-0002-MergeAppend-should-support-Async-Foreign-Scan-su.patch (+951/-31)
v10-0003-Create-execAppend.c-to-avoid-duplicated-code-on-.patch (+720/-914)
In reply to: Matheus Alcantara (#16)
Re: Asynchronous MergeAppend

Matheus Alcantara wrote on 2025-12-19 16:45:

On Thu Dec 18, 2025 at 6:56 AM -03, Alexander Pyhalov wrote:

+	noccurred = WaitEventSetWait(node->ms_eventset, -1 /* no timeout */ , occurred_event,
+								 nevents, WAIT_EVENT_APPEND_READY);

Should we use the same WAIT_EVENT_APPEND_READY or create a new wait
event for merge append?

I'm not sure that new wait event is needed - for observability I think
it's not critical
to distinguish Append and MergeAppend when they waited for foreign
scans. But also it's perhaps
doesn't do any harm to record specific wait event.

Ok, I think that we can keep this way for now and let's see if a new
wait event is really needed.

I've created Appender and AppenderState types that are used by
Append/MergeAppend and AppendState/MergeAppendState respectively (I
should have think in a better name for these base type, ideas are
welcome). The execAppend.c was created to have the functions that can
be
reused by Append and MergeAppend execution. I've tried to remove
duplicated code blocks that was almost the same and that didn't
require
much refactoring.

Overall I like new Appender node. Splitting code in this way really
helps to avoid code duplication.
However, some similar code is still needed, because logic of getting
new
tuples is different.

Hi.

I've looked through the updated patch and tested it (also with our fdw).
Overall it looks good.

In execAppend.c there's still a reference to as_valid_subplans. Also, we
could perhaps use palloc0_array() in some more places, for example for
state->asyncrequests and state->asyncresults.
--
Best regards,
Alexander Pyhalov,
Postgres Professional

In reply to: Alexander Pyhalov (#17)
Re: Asynchronous MergeAppend

On Tue Dec 23, 2025 at 5:50 AM -03, Alexander Pyhalov wrote:

I've looked through updated patch. Tested it (also with our fdw).
Overall looks good.

Thanks for testing.

In execAppend.c there's still reference to as_valid_subplans. Also we
could perhaps use palloc0_array() in some more places, for example, for
for state->asyncrequests and state->asyncresults.

Fixed on the new attached version.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Attachments:

v11-0001-mark_async_capable-subpath-should-match-subplan.patch (+10/-9)
v11-0002-MergeAppend-should-support-Async-Foreign-Scan-su.patch (+951/-31)
v11-0003-Create-execAppend.c-to-avoid-duplicated-code-on-.patch (+718/-914)
In reply to: Matheus Alcantara (#18)
Re: Asynchronous MergeAppend

Matheus Alcantara wrote on 2025-12-29 16:43:

On Tue Dec 23, 2025 at 5:50 AM -03, Alexander Pyhalov wrote:

I've looked through updated patch. Tested it (also with our fdw).
Overall looks good.

Thanks for testing.

In execAppend.c there's still reference to as_valid_subplans. Also we
could perhaps use palloc0_array() in some more places, for example,
for
for state->asyncrequests and state->asyncresults.

Fixed on the new attached version.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Hi.

Looks good. What do you think about classify_matching_subplans_common()?
Should it stay where it is, or should we move it to
src/include/executor/execAppend.h?
--
Best regards,
Alexander Pyhalov,
Postgres Professional

In reply to: Alexander Pyhalov (#19)
Re: Asynchronous MergeAppend

On Tue Dec 30, 2025 at 10:15 AM -03, Alexander Pyhalov wrote:

Looks good. What do you think about classify_matching_subplans_common()?
Should it stay where it is or should we hide it to
src/include/executor/execAppend.h ?

Yeah, it sounds better to me to move classify_matching_subplans_common
to execAppend.h, since it's very specific to the MergeAppend and Append
execution. I've declared the function as static inline in execAppend.h,
but I'm not sure if that's the best approach.

I've also written a proper commit message for this new version.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Attachments:

v12-0001-mark_async_capable-subpath-should-match-subplan.patch (+10/-9)
v12-0002-MergeAppend-should-support-Async-Foreign-Scan-su.patch (+951/-31)
v12-0003-Common-infrastructure-to-MergeAppend-and-Append-.patch (+746/-939)