SQL Plan Archives - SQLPerformance.com

Simple Parameterization and Trivial Plans — Part 6

Paul White — Thu, 04 Aug 2022 09:00:03 +0000

In part 5, I described how the early phases of query compilation affect simple parameterization. To recap:

Normalization and decoding promote cached plan reuse
The decoding step has fixed and limited capabilities
The Failed Auto-Params/sec counter is incremented when decoding fails and the statement is not parameterized
Shell plans optimize for very frequent execution of simple statements, bypassing the parsing, parameter replacement, normalization, and decoding stages
The first application of constant folding occurs after decoding, resulting in a separate parameter for each value in a foldable expression

Let’s now continue following the compilation process to see how SQL Server decides if simple parameterization is safe or unsafe.

Code examples use the Stack Overflow 2010 database on SQL Server 2019 CU 16 with the following additional nonclustered index:

CREATE INDEX [IX dbo.Users Reputation (DisplayName)] 
ON dbo.Users (Reputation) 
INCLUDE (DisplayName);

Simplification

The simplification compilation stage aims to remove logical redundancy and reduce expression complexity.

Some opportunities for logical tree simplification arise from prior compilation activity, others when the original statement contains unnecessary elements. The latter is common for SQL generated by applications and frameworks but also often occurs in dynamic queries. Opportunities may also arise as database CHECK constraints are incorporated in the logical tree during simplification.

Timing is a key point here. Since decoding succeeded earlier, the query processor is now dealing with a parameterized (prepared) statement. Many simplifications can't be applied to parameters, only constant values.

Let’s look at an example:

SELECT 
    U.DisplayName
FROM dbo.Users AS U 
WHERE 
    U.Reputation BETWEEN 1 AND 999
    AND U.Reputation BETWEEN 999 AND 1234;

This statement qualifies for simple parameterization. Four constants are identified, with inferred integer data types shrunk to smallint or tinyint due to the comparison operator parsing context (described in part 4):

This plan will continue to return correct results when reused with different parameter values. Let’s run the query again with an extra element to prevent simple parameterization:

SELECT 
    U.DisplayName
FROM dbo.Users AS U 
WHERE 
    U.Reputation BETWEEN 1 AND 999
    AND U.Reputation BETWEEN 999 AND 1234
OPTION (KEEP PLAN); -- New

The query hint has no effect on the statement besides disabling simple parameterization at the parsing stage. Without parameters, simplification removes logical redundancy from the predicates to leave a single equality test:

None of this is particularly surprising. The outcome is the same as when we choose to parameterize client-side. Reusable plans often permit fewer optimizations than ones built for specific values.

Nevertheless, there’s an important observation to make here. The query processor is not yet fully committed to producing a prepared plan. The simplification stage will not be repeated using constant values if the parameterization attempt fails later.

There are several compilation stages after simplification but they don’t interact much with simple parameterization. I’m going to skip those and move on to where the final decision to apply simple parameterization is made.

Trivial Plans

Like simple parameterization, the trivial plan compilation stage aims to reduce the cost of creating execution plans for simple statements. It’s the final step before possibly invoking the full cost-based optimizer (CBO) where alternatives are subject to detailed costing analysis.

The trivial plan stage provides a fast path through compilation for simple statements, whether parameterized or not. The CBO has significant start-up and runtime costs and may consume significant server resources. These costs are unlikely to be recovered through finding an incrementally better plan in simple cases.

Avoiding CBO can confer significant performance and resource usage advantages for workloads frequently executing simple queries.

Generating a Trivial Plan

Trivial plans are often referred to as having no cost-based choices. This is perhaps a useful shorthand, but it isn’t the whole story.

SQL Server generates trivial plans for ‘common’ statement types having an ‘obvious’ implementation and a ‘short’ expected run time. Whether a statement is ‘common’ and ‘obvious’ is largely determined by heuristics and implementation complexity. The ‘expected run time’ is approximated by estimated plan cost.

The trivial plan stage has entry conditions like CBO stages do. If the logical tree has features that can't be implemented in a trivial plan, the stage is skipped. Otherwise, a limited set of substitution and implementation rules are evaluated where they apply to the current tree.

Entering the trivial plan stage is not a guarantee of success. The limited rules available still might fail to generate a complete execution plan. In this case, the trivial plan stage fails and compilation moves on to the CBO.

Qualification

Microsoft deliberately don’t document the precise criteria used to decide if a statement qualifies for a trivial plan. The rules can and do change from time to time.

Even so, it’s useful to say a simple statement with an obvious ‘best’ index is most likely to get a trivial plan. The index need not be covering if it's unique and an equality index seek is possible. Trivial plans are possible for index scans as well as seeks. A table scan (heap or clustered) can also feature in a trivial plan.

In principle, pretty much any relatively simple statement pattern could be implemented in SQL Server at the trivial plan stage, assuming a known good plan shape exists. In practice, SQL Server is quite conservative.

This doesn’t mean SQL Server only generates a trivial plan when no cost-based choices exist. One could argue every decision is cost-based to some extent, including which index to use at the trivial plan stage.

Index analysis is a big part of the trivial plan decision. SQL Server uses heuristics to decide if a particular index is good enough to make CBO unlikely worth pursuing.

For example, a covering nonclustered index will usually be selected as trivial if it involves reading fewer pages than any other index (including the heap or clustered index). This determination takes into account the selectivity of any predicates and index fill factors.

The trivial plan selection heuristics also don’t account for facilities only available in CBO, like index intersection plans. SQL Server might therefore select a single-index trivial plan when full CBO analysis would’ve found a lower-cost index intersection plan.

Parameterization Safety

When a statement passes the earlier parser and decoder checks, it arrives at the trivial plan stage as a prepared (parameterized) statement. The query processor now needs to decide if the parameterization attempt is safe.

Parameterization is considered safe if the query processor would generate the same plan for all possible future parameter values. This might seem like a complex determination to make, but SQL Server takes a practical approach.

Safe Parameterization

Simply stated, a simple parameterization attempt is considered safe if the trivial plan stage produces a plan for the prepared statement.

When this occurs, the Safe Auto-Params/sec counter of the SQL Statistics object is incremented. Refer back to part 3 for a script to reliably detect simple parameterization and trivial plan outcomes.

If the trivial plan stage is disabled with trace flag 8757, simple parameterization is never applied because a trivial plan can’t be generated.

Unsafe Parameterization

If the trivial plan stage doesn’t produce a plan for the current prepared statement, the simple parameterization attempt is considered unsafe.

When this occurs, the Unsafe Auto-Params/sec counter is incremented.

After an unsafe parameterization attempt, SQL Server replaces parameters in the prepared plan with the original constant values before invoking the CBO. The statement remains technically prepared but contains no parameters.

Parameters are only replaced by constants inside query plan operators. Other elements of the prepared statement aren't cleaned up. This explains why an actual execution plan produced after an unsafe attempt at simple parameterization retains the parameterized text and a parameter list (see part 3).

Parallelism

While a simple parameterization attempt is considered safe if a trivial plan is found, it doesn’t mean the final execution plan will be TRIVIAL.

If the estimated cost of the trivial plan found exceeds the cost threshold for parallelism, SQL Server will invoke the CBO to consider more alternatives, possibly including parallel execution plans. The plan will be marked as having been through FULL optimization.

The outcome of CBO need not be a parallel plan, but it will remain parameterized in any case. I showed an example of simple parameterization producing a parallel query plan in part 3.

Feature Interaction

Simple parameterization and the trivial plan stage are closely related but remain independent activities.

A statement can qualify for simple parameterization without producing a TRIVIAL execution plan.

This happens when a trivial plan is found with a cost exceeding the parallelism threshold. The final plan will remain parameterized after CBO but might be serial or parallel.

A statement containing constants can produce a TRIVIAL plan without also qualifying for simple parameterization.

This happens when the statement contains syntax elements considered unsuitable for simple parameterization by parsing or decoding (covered in parts four and five).

Process Summary

I’ve covered a lot of ground in this series, so I thought it would be useful to finish up with a flowchart showing the main decision points during consideration of simple parameterization and trivial plans:

The post Simple Parameterization and Trivial Plans — Part 6 appeared first on SQLPerformance.com.

Simple Parameterization and Trivial Plans — Part 5

Paul White — Wed, 08 Jun 2022 09:00:58 +0000

In part 4, I described the role the parser plays turning a statement with constant values into a parameterized prepared statement. To recap briefly, the parser:

Identifies potential parameters
Assigns an initial data type
Accounts for both simple and forced parameterization rules
Sets a flag if it determines simple parameterization is impossible

When the parser allows simple parameterization to continue, SQL Server® increments the Auto-Param Attempts/sec counter of the SQL Statistics object.

Let’s now continue following the compilation process noting the effects on simple parameterization as we go.

As in previous parts, code examples use the Stack Overflow 2010 database on SQL Server 2019 CU 16 with the following additional nonclustered index:

CREATE INDEX [IX dbo.Users Reputation (DisplayName)] 
ON dbo.Users (Reputation) 
INCLUDE (DisplayName);

Normalization

The output of parsing is a logical representation of the statement called a parse tree. This tree does not contain SQL language elements. It's an abstraction of the logical elements of the original query specification.

It's frequently possible to write the same logical requirement in different ways. Using a SQL analogy, x IN (4, 8) is logically the same as writing x = 4 OR x = 8. This flexibility can be useful when writing queries, but it makes implementing an optimizer more difficult.

In general terms, normalization is an attempt to standardize. It recognises common variations expressing the same logic, and rewrites them in a standard way. For example, normalization is responsible for turning x BETWEEN y AND z into x >= y AND x <= z. The normalized, or standardized, form chosen is one the query processor finds convenient to work with internally.

I’m simplifying a bit here. Like parameterization, normalization happens in stages at different points during statement compilation and optimization. Logical operator trees and expressions are normalized at different times and in different ways. It's only important to understand the broad concept for what follows.

Maximizing Parameterized Plan Reuse

SQL Server could use the parameter locations determined during parsing to directly replace constants with parameter markers. This would function to allow plan reuse, but also require future statements to be written in exactly the same way (constant values aside) to benefit from plan reuse.

This might be useful enough in itself, but we can do better by normalizing. Expressing the parameterized statement in a standardized way allows future statements to match if they normalize to the same form, even where the original text is quite different. I’ll show you some examples shortly.

Decoding

SQL Server implements this important normalization aspect by decoding the logical operator tree back into a SQL representation. Decoding uses slightly different rules for simple and forced parameterization but in either case the normalized SQL will usually be quite different from the original.

For simple parameterization, decoding emits keywords in upper case, data types and intrinsic functions in lower case, delimits object names with square brackets, and removes unnecessary spaces and comments.

Forced parameterization emits keywords in lower case, does not use bracket delimiters, and places a space either side of the ‘dot’ between schema and object names.

Decoding also standardizes elements like comparison operators, the AS keyword for aliases, and the presence of an ASC or DESC specification in an ORDER BY clause. For example, decoding always expresses ‘not equal’ using <> even if != was used in the original statement.

Neither normalization scheme changes the casing of non-keyword elements like object names. This could be unsafe in case-sensitive databases.

Whichever normalization scheme is used, the key point is AdHoc statements can be written using different keyword casing, spacing, comments, and delimiters but still benefit from plan reuse through server-side parameterization. The text will normalize to the same parameterized form, allowing a cached prepared plan to be matched and reused.

Let’s look at an example.

Example 1

The following statements generate the same normalized text under simple parameterization:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
SELECT U.DisplayName FROM dbo.Users AS U 
WHERE U.Reputation IN (2) ORDER BY U.Id;
GO
SELECT U.DisplayName 
FROM dbo.Users AS U 
WHERE U.Reputation =3
ORDER BY U.Id;
GO
select 
    U.DisplayName 
FROM [dbo].[Users] as U 
WhErE 
    U.Reputation = (4)
Order By
    U.[Id];
GO
SELECT
    U.DisplayName 
    -- Note: No AS used for the alias
from [dbo] . Users U 
where
    U.Reputation = 5
order by 
    U.Id asc;
GO

Let’s look at the plan cache:

SELECT
    CP.usecounts,
    CP.objtype,
    ST.[text]
FROM sys.dm_exec_cached_plans AS CP
CROSS APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
WHERE 
    ST.[text] NOT LIKE '%dm_exec_cached_plans%'
    AND ST.[text] LIKE '%DisplayName%Users%'
ORDER BY 
    CP.objtype ASC;

The output shows four AdHoc shell plans pointing to a single Prepared plan:

Each AdHoc statement is different, but the logical trees all decode to the same form (line breaks added for clarity):

(@1 tinyint)
SELECT [U].[DisplayName]
FROM [dbo].[Users] [U]
WHERE [U].[Reputation]=@1
ORDER BY [U].[Id] ASC

The normalized and parameterized form is used in cache lookups so the prepared plan is successfully found and reused.

This is a neat feature of simple parameterization but it's still best to use a consistent SQL style. Admittedly, this can be difficult to enforce in practical work environments over time.

Normalization isn’t perfect. You can probably find any number of ways to subtly change the statements above to produce different ‘normalized’ parameterized forms. Normalization promotes reuse of parameterized plans—it doesn’t guarantee it.

Failed Parameterization

The decoding step for simple parameterization has deliberately limited capabilities. It's only able to decode certain logical tree elements into SQL. These limitations determine which clauses and intrinsic functions (among other things) are compatible with simple parameterization.

If the decoding step successfully produces a complete SQL representation, the compilation process proceeds from this point as a prepared (parameterized) statement.

When decoding fails, the Failed Auto-Params/sec counter of the SQL Statistics object is incremented and compilation continues as an unparameterized (not prepared) statement.

Statements that fail at the parsing or decoding steps don't show any parameter details in estimated or actual plans because a prepared version of the statement is never created. Refer back to part three for execution plan details.

Decoder Limitations

For a parameterization attempt to fail here, it must have passed the generic tests performed by the parser, but fail at the decoding stage. The syntax elements, global variables, and intrinsic functions supported by the simple parameterization decoder aren't documented.

The decoder limitations explain why (and when) examples shown earlier in this series failed with LOWER and CEILING built-in functions, but succeeded with FLOOR and ABS.

Constant to constant comparison is also unsupported by the simple parameterization decoder, which explains why the well-known WHERE 1 = 1 trick prevents simple parameterization. As further examples, the decoder supports the @@SPID and @@TRANCOUNT global variables, but not @@ROWCOUNT or @@IDENTITY.

The performance counter test rig in part three can be used to explore which statement features fail at the decoding stage. Such statements will increment both Auto-Param Attempts/sec and Failed Auto-Params/sec.

Remember a statement disqualified from simple parameterization by the parser doesn’t increment Auto-Param Attempts/sec.

Example 2 – Plan Reuse

Try to predict how many plans of which type (ad-hoc or prepared) will be cached if we remove the GO batch separators from example 1:

-- Single batch
SELECT U.DisplayName FROM dbo.Users AS U 
WHERE U.Reputation IN (2) ORDER BY U.Id;

SELECT U.DisplayName 
FROM dbo.Users AS U 
WHERE U.Reputation =3
ORDER BY U.Id;

select 
    U.DisplayName 
FROM [dbo].[Users] as U 
WhErE 
    U.Reputation = (4)
Order By
    U.[Id];

SELECT
    U.DisplayName 
    -- Note: No AS used for the alias
from [dbo] . Users U 
where
    U.Reputation = 5
order by 
    U.Id asc;

You may be surprised to find only two plans are cached, one Adhoc and one Prepared.

The prepared plan is used four times:

Explanation

Back in part one, I said it can be useful to think of a prepared statement as an (unnamed) stored procedure.

Imagine example 2 had contained four calls to a single stored procedure with different parameter values each time. That would cache the ad-hoc batch once and the stored procedure plan once. The single stored procedure plan would be executed four times with the different parameter values.

Something conceptually similar happened with simple parameterization as I’ll now explain.

SQL Server compiles a single plan for all statements in a batch, but it compiles each statement in the batch sequentially. In example 2, the first statement encountered qualifies for simple parameterization so a separate prepared statement is built and cached.

The second, third, and fourth statements in the batch also qualify for simple parameterization. After normalization and decoding they match the prepared plan cached by the first statement. The single prepared plan is therefore executed a total of four times.

Shell Plans Revisited

The Adhoc plan cached in example 2 contains four shell plans (covered in part 1). Each shell plan points to the same Prepared plan.

Caching the Adhoc shell ensures that future execution of exactly the same statement text (including constant values) will quickly find the parameterized plan.

Without the shells, the statement would need to be parsed, parameterized, normalized, and decoded back to SQL representation before it could be matched to the prepared plan.

This is less work than generating even a trivial plan, but it is still less efficient than using the shell to locate the prepared statement directly from the source text.

The goal of simple parameterization is to optimize performance for simple and frequently-executed ad-hoc SQL statements that parameterize to the same form. For an OLTP workload with a very rapid submission of such statements, every millisecond counts. Besides, all compilation activity consumes some server resources.

Algebrization and Constant Folding

The next step of the compilation process is algebrization. This complex stage performs a number of tasks, including binding object names found in the parse tree to physical database entities, constant folding, matching aggregates to grouping elements, and deriving final data types from metadata using type conversion rules as necessary. The output from this activity is a bound tree of logical operators.

The most relevant part of algebrization to simple parameterization is the initial round of constant folding it performs.

As the linked documentation states, constant folding is the early evaluation of expressions to improve runtime performance. For example, the expression DATEFROMPARTS(2022, 07, 11) can be evaluated early to the date value 11 July 2022.

The algebrization stage is the first time constant folding is applied during compilation, so earlier stages will always see unfolded expressions.

Let’s look at this with an example.

Example 3

SELECT 
    U.DisplayName
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 900 + 90 + 9;

This statement qualifies for simple parameterization. Each of the constants is identified as a separate parameter by the parser, and initially typed as integer. No integer type shrinking to smallint or tinyint is performed because the immediate parsing context is an arithmetic operator as described in part four.

The post-execution (actual) plan confirms parameterization occurred before constant folding had a chance to evaluate 900 + 90 + 9:

Notice the three integer parameters and the normalized representation of the parameterized statement text in the top bar.

This example is specific to simple parameterization. Forced parameterization doesn’t parameterize constant-foldable expressions that are arguments of the +, -, *, /, and % operators.

Constant folding may be repeated several times later in the compilation and optimization process as new folding opportunities arise. These later constant folding runs may apply to the whole tree, newly generated alternative subtrees, or an individual expression.

End of Part 5

In the final part of this series, I’ll continue analysis of the compilation process with the simplification and trivial plan stages, including how and when SQL Server decides if simple parameterization is safe or unsafe.

The post Simple Parameterization and Trivial Plans — Part 5 appeared first on SQLPerformance.com.

Simple Parameterization and Trivial Plans — Part 4

Paul White — Fri, 20 May 2022 13:35:37 +0000

The Compilation Process

The most important things to understand about server-side parameterization are it doesn’t happen all at once and a final decision to parameterize isn’t made until the end of the process.

Multiple compilation components are involved in successfully turning a SQL statement containing constants into a parameterized prepared statement with typed parameters. Each component is active at a different time during compilation and performs its own parameterization-related tasks using the information available at that time.

The scope and timing of each component’s activity explain the curious things seen in the previous parts of this series. Even when the parameterization attempt isn’t successful, there can still be effects from preparations for parameterization visible in the final execution plan.

There are a number of details to explore to gain a complete understanding. Let’s follow the main compilation stages noting the effects on parameterization as we go.

As in previous parts, most code examples use the Stack Overflow 2010 database on SQL Server 2019 CU 16 with the following additional nonclustered index:

CREATE INDEX [IX dbo.Users Reputation (DisplayName)] 
ON dbo.Users (Reputation) 
INCLUDE (DisplayName);

Database compatibility is set to 150, and the cost threshold for parallelism is set to 50 to avoid parallelism for the time being:

ALTER DATABASE StackOverflow2010
SET COMPATIBILITY_LEVEL = 150;
GO
EXECUTE sys.sp_configure
    @configname = 'show advanced options',
    @configvalue = 1;
RECONFIGURE;
GO
EXECUTE sys.sp_configure
    @configname = 'cost threshold for parallelism',
    @configvalue = 50;
RECONFIGURE;

Statement Parsing

The parser identifies constants in the statement and marks them as potential parameters if they are in an allowable context. For example, the constants in TOP (50) and CONVERT(varchar(11, x) are not parameterizable, but the literal values in WHERE x = 123 and WHERE y IN (3, 8) are.

Decisions about parameterization at this early stage are generic and conservative because the only information available is the statement itself. Names of tables and columns haven’t been resolved to database objects yet, and no type information is available.

Imagine someone handing you a query written for an unknown SQL Server database and asking you to identify parameterizable constants. That’s roughly the task facing the parser.

Parsing marks constants as potential parameters unless the current clause context forbids it, accommodating both simple and forced parameterization rules.

For example, AND x = 100 + 50 is not acceptable to forced parameterization, but simple parameterization allows it, so the parser marks both constants as potential parameters.

In contrast, the earlier example WHERE y IN (3, 8) is acceptable to forced parameterization, but not simple. Again, the parser marks both constants as potential parameters just in case.

Initial Data Types

Under simple parameterization, constants are assigned an initial data type based on the textual representation (see part two for details). The data type may then be refined depending on the context.

For example, in WHERE x = 5 the constant is initially parsed as an integer because the textual form is not surrounded by quotation marks and has no decimal point. The context is a comparison operator (equals), so the data type is shrunk to a tinyint. This is the smallest integer type able to contain the value 5.

As a second example, in the expression 123 + 456 both constants are initially typed as integer based on the textual representation. Neither is shrunk to a smaller integer subtype because the context is an arithmetical operation, not a comparison. This explains why the constant 7 was typed as an integer rather than tinyint in the arithmetic operators section of part two.

These rules might seem odd or arbitrary but they were created for SQL Server 7.0 where simple parameterization was a new feature, then called “automatic parameters” or “auto-parameterization”. The engine’s ability to match indexes and reason through implicit conversions has improved markedly since then but the parsing rules remain the same for compatibility.

CAST and CONVERT

In part two I described how these very specific inferred parameter data types could prevent plan reuse—a particular problem with numeric and decimal data types:

An obvious solution would be to provide an explicit type for each constant, but T-SQL doesn’t provide a way to do this for all constant values. The best we can do sometimes is add CAST or CONVERT, but this does not work well with simple parameterization for reasons I will now set out.

In principle, the parser could incorporate the CAST or CONVERT in its parameter data typing decision, but this would require either a call into the expression services component or an early round of constant folding.

Neither of these facilities are available to the parser. It’s simply too early in the process. For example, constant folding expects an operator tree that doesn’t exist yet. All things are possible with enough engineering effort of course, but as things stand the parser cannot use a wrapping CAST or CONVERT to determine the data type of a constant literal.

The end result today is a parameter with the original parser-derived type surrounded by an explicit CAST or CONVERT. This doesn’t solve the problem of plan reuse at all.

Example 1

This is a slightly simplified version of the decimal example from part two. An explicit CONVERT matching the column data type has been added around each constant in an attempt to promote plan reuse:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
DROP TABLE IF EXISTS dbo.Test;
GO
CREATE TABLE dbo.Test
(
    SomeValue decimal(19,8) NOT NULL
);
GO
SELECT T.SomeValue 
FROM dbo.Test AS T 
WHERE T.SomeValue = CONVERT(decimal(19,8), 1.23);
GO
SELECT T.SomeValue 
FROM dbo.Test AS T 
WHERE T.SomeValue = CONVERT(decimal(19,8), 12.345);
GO
SELECT T.SomeValue 
FROM dbo.Test AS T 
WHERE T.SomeValue = CONVERT(decimal(19,8), 123.4567);
GO

Let’s look at the plan cache:

SELECT
    CP.usecounts,
    CP.objtype,
    ST.[text]
FROM sys.dm_exec_cached_plans AS CP
CROSS APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
WHERE 
    ST.[text] NOT LIKE '%dm_exec_cached_plans%'
    AND ST.[text] LIKE '%SomeValue%Test%'
ORDER BY 
    CP.objtype ASC;

It shows a prepared statement for each query:

This is the same outcome as before we added the CONVERT. The parameter data types are still different, so separate plans are cached and no plan reuse occurs.

Example 2

This is the other example from part two with a CONVERT added to match the integer type of the Reputation column:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = CONVERT(integer, 252);
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = CONVERT(integer, 25221);
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = CONVERT(integer, 252552);
GO

As a reminder, without the CONVERT these statements resulted in three separate prepared cached plans due to the parser’s typing rules:

Let’s look at the plan cache after running the statements with the CONVERT:

SELECT
    CP.usecounts,
    CP.objtype,
    ST.[text]
FROM sys.dm_exec_cached_plans AS CP
CROSS APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
WHERE 
    ST.[text] NOT LIKE '%dm_exec_cached_plans%'
    AND ST.[text] LIKE '%DisplayName%Users%'
ORDER BY 
    CP.objtype ASC;

We see one prepared statement used three times:

This result should surprise you. Didn’t we just establish adding a CAST or CONVERT doesn’t help plan reuse?

This case is slightly different though. The constants in these statements were initially typed by the parser as integer then shrunk to the smallest possible integer subtype (smallint or tinyint) capable of holding the specific value. The shrinking caused different prepared statements without the CONVERT.

Remember from earlier this shrinking happens under simple parameterization only when the parse context is a comparison operator. Without the CONVERT the immediate context for the constant is the equality comparison operator, so shrinking is applied.

With the CONVERT the immediate context is the conversion, which is not a comparison operator so no shrinking occurs. All three constants remain typed as integer resulting in a single prepared statement used three times.

As an aside, notice the explicit CONVERT to integer remains in the prepared statement text even though the parameter is typed as integer.

A CAST or CONVERT isn’t the only operator capable of preventing integer type shrinking by the parser. Anything that gets between the constant and a comparison operator will do the job, as long as the extra item is acceptable for simple parameterization.

For example, we could use FLOOR or ABS around the constant value—but not CEILING. The list of intrinsic functions compatible with simple parameterization is quite limited and undocumented.

Parameterization Attempts

If the parser encounters a syntax element that always prevents simple parameterization it sets a flag so components involved in later stages can avoid wasted effort.

These syntax checks are not exhaustive. For example, the presence of a subquery, TOP clause, or query hint is sufficient to set the flag, but an IN clause, constant-to-constant comparison, or disallowed intrinsic function like LOWER or CEILING is not.

There is a partial list of the syntax elements that prevent simple parameterization in Appendix A of the Microsoft Technical Paper Plan Caching and Recompilation in SQL Server 2012. The list is not complete or maintained, and doesn’t detail why each item excludes parameterization, or at what stage of the compilation process the test is applied.

When the parser decides simple parameterization is impossible, none of the auto-parameterization performance counters mentioned in part 3 are incremented.

In particular, the Auto-Param Attempts/sec counter of the SQL Statistics object is not incremented. This is the primary way to detect a statement with constants was determined unsuitable for simple parameterization by the parser.

If Auto-Param Attempts/sec is incremented, it means the parser was satisfied simple parameterization might succeed. Later components will determine the eventual outcome of the parameterization attempt, either Failed, Safe, or Unsafe. I will cover these details later in this series.

In either case, the parser performs the lightweight work to identify potential parameters and assign an initial data type. Partly this is due to the streaming nature of the parser—it might encounter constants in the token stream before anything that disallows simple parameterization. The work might still prove useful if forced parameterization is active, either at the database level, via a plan guide, or due to undocumented trace flag 144.

End of Part 4

In the next part of this series, I’ll continue the compilation process at the algebrization and normalization stages, showing how these components explain some of the curious things we’ve seen with simple parameterization and trivial plans.

The post Simple Parameterization and Trivial Plans — Part 4 appeared first on SQLPerformance.com.

Simple Parameterization and Trivial Plans — Part 3

Paul White — Mon, 18 Apr 2022 09:00:28 +0000

Execution Plans

It’s more complicated than you might expect to tell from the information provided in execution plans if a SQL statement uses simple parameterization. It’s no surprise even highly experienced SQL Server users tend to get this wrong, given the contradictory information often supplied to us.

Let’s look at some examples using the Stack Overflow 2010 database on SQL Server 2019 CU 14, with database compatibility set to 150.

To begin, we’ll need a new nonclustered index:

CREATE INDEX [IX dbo.Users Reputation (DisplayName)] 
ON dbo.Users (Reputation) 
INCLUDE (DisplayName);

1. Simple Parameterization Applied

This first example query uses simple parameterization:

SELECT U.DisplayName 
FROM dbo.Users AS U 
WHERE U.Reputation = 999;

The estimated (pre-execution) plan has the following parameterization-related elements:

Estimated plan parameterization properties

Notice the @1 parameter is introduced everywhere except the query text shown across the top.

The actual (post-execution) plan has:

Actual plan parameterization properties

Notice the properties window has now lost the ParameterizedText element, while gaining information about the parameter runtime value. The parameterized query text is now shown across the top of the window with ‘@1’ instead of ‘999’.

2. Simple Parameterization Not Applied

This second example does not use simple parameterization:

-- Projecting an extra column
SELECT 
    U.DisplayName, 
    U.CreationDate -- NEW
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 999;

The estimated plan shows:

Estimated non-parameterized plan

This time, the parameter @1 is missing from the Index Seek tooltip, but the parameterized text and other parameter list elements are the same as before.

Let’s look at the actual execution plan:

Actual non-parameterized plan

The results are the same as the previous parameterized actual plan, except now the Index Seek tooltip displays the non-parameterized value ‘999’. The query text shown across the top uses the @1 parameter marker. The properties window also uses @1 and displays the runtime value of the parameter.

The query is not a parameterized statement despite all the evidence to the contrary.

3. Parameterization Failed

My third example is also not parameterized by the server:

-- LOWER function used
SELECT 
    U.DisplayName, 
    LOWER(U.DisplayName)
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 999;

The estimated plan is:

Estimated plan parameterization failed

There’s no mention of a @1 parameter anywhere now, and the Parameter List section of the properties window is missing.

The actual execution plan is the same, so I won’t bother showing it.

4. Parallel Parameterized Plan

I want to show you one more example using parallelism in the execution plan. The low estimated cost of my test queries means we need to lower the cost threshold for parallelism to 1:

EXECUTE sys.sp_configure
    @configname = 'cost threshold for parallelism',
    @configvalue = 1;
RECONFIGURE;

The example is a bit more complex this time:

SELECT 
    U.DisplayName 
FROM dbo.Users AS U 
WHERE 
    U.Reputation >= 5 
    AND U.DisplayName > N'ZZZ' 
ORDER BY 
    U.Reputation DESC;

The estimated execution plan is:

Estimated parallel parameterized plan

The query text across the top remains unparameterized while everything else is. There are two parameter markers now, @1 and @2, because simple parameterization found two suitable literal values.

The actual execution plan follows the pattern of example 1:

Actual parallel parameterized plan

The query text across the top is now parameterized and the properties window contains runtime parameter values. This parallel plan (with a Sort operator) is definitely parameterized by the server using simple parameterization.

Reliable Methods

There are reasons for all the behaviours shown so far, and a few more besides. I’ll attempt to explain many of these in the next part of this series when I cover plan compilation.

In the meantime, the situation with showplan in general, and SSMS in particular, is less than ideal. It’s confusing for people who’ve been working with SQL Server their entire careers. Which parameter markers do you trust, and which ones do you ignore?

There are several reliable methods for determining if a particular statement had simple parameterization successfully applied to it or not.

Query Store

I’ll start with one of the most convenient, the query store. Unfortunately, it’s not always as straightforward as you might imagine.

You must enable the query store feature for the database context where the statement is executed and the OPERATION_MODE must be set to READ_WRITE, allowing the query store to actively collect data.

After meeting these conditions, post-execution showplan output contains extra attributes, including the StatementParameterizationType. As the name suggests, this contains a code describing the type of parameterization used for the statement.

It’s visible in the SSMS properties window when the root node of a plan is selected:

StatementParameterizationType

The values are documented in sys.query_store_query:

0 – None
1 – User (explicit parameterization)
2 – Simple parameterization
3 – Forced parameterization

This beneficial attribute only appears in SSMS when an actual plan is requested and missing when an estimated plan is selected. It’s important to remember the plan must be cached. Requesting an estimated plan from SSMS does not cache the plan produced (since SQL Server 2012).

Once the plan is cached, the StatementParameterizationType appears in the usual places, including via sys.dm_exec_query_plan.

You can also trust the other places parameterization type is recorded in the query store, such as the query_parameterization_type_desc column in sys.query_store_query.

One important caveat. When the query store OPERATION_MODE is set to READ_ONLY, the StatementParameterizationType attribute is still populated in SSMS actual plans—but it’s always zero—giving a false impression the statement was not parameterized when it might well have been.

If you’re happy enabling query store, are sure it’s read-write, and only look at post-execution plans in SSMS, this will work for you.

Standard Plan Predicates

The query text shown across the top of the graphical showplan window in SSMS isn’t reliable, as the examples have shown. Neither can you rely on the ParameterList displayed in the Properties window when the root node of the plan is selected. The ParameterizedText attribute shown for estimated plans only is also not conclusive.

You can, however, rely on the properties associated with individual plan operators. The given examples show these are present in the tooltips when hovering over an operator.

A predicate containing a parameter marker like @1 or @2 indicates a parameterized plan. The operators most likely to contain a parameter are Index Scan, Index Seek, and Filter.

Predicates with parameter markers

If the numbering starts with @1, it uses simple parameterization. Forced parameterization begins with @0. I should mention the numbering scheme documented here is subject to change at any time:

Change warning

Nevertheless, this is the method I use most often to determine if a plan was subject to server-side parameterization. It’s generally quick and easy to check a plan visually for predicates containing parameter markers. This method also works for both types of plans, estimated and actual.

Dynamic Management Objects

There are several ways to query the plan cache and related DMOs to determine if a statement was parameterized. Naturally, these queries only work on plans in cache, so the statement must have been executed to completion, cached, and not subsequently evicted for any reason.

The most direct approach is to look for an Adhoc plan using an exact SQL textual match to the statement of interest. The Adhoc plan will be a shell containing a ParameterizedPlanHandle if the statement is parameterized by the server. The plan handle is then used to locate the Prepared plan. An Adhoc plan will not exist if the optimize for ad hoc workloads is enabled, and the statement in question has only executed once.

This type of enquiry often ends up shredding a significant amount of XML and scanning the entire plan cache at least once. It’s also easy getting the code wrong, not least because plans in cache cover an entire batch. A batch may contain multiple statements, each of which may or may not be parameterized. Not all the DMOs work at the same granularity (batch or statement) making it quite easy to come unstuck.

An efficient way to list statements of interest, together with plan fragments for just those individual statements, is shown below:

SELECT
    StatementText =
        SUBSTRING(T.[text], 
            1 + (QS.statement_start_offset / 2), 
            1 + ((QS.statement_end_offset - 
                QS.statement_start_offset) / 2)),
    IsParameterized = 
        IIF(T.[text] LIKE N'(%',
            'Yes',
            'No'),
    query_plan = 
        TRY_CONVERT(xml, P.query_plan)
FROM sys.dm_exec_query_stats AS QS
CROSS APPLY sys.dm_exec_sql_text (QS.[sql_handle]) AS T
CROSS APPLY sys.dm_exec_text_query_plan (
    QS.plan_handle, 
    QS.statement_start_offset, 
    QS.statement_end_offset) AS P
WHERE 
    -- Statements of interest
    T.[text] LIKE N'%DisplayName%Users%'
    -- Exclude queries like this one
    AND T.[text] NOT LIKE N'%sys.dm%'
ORDER BY
    QS.last_execution_time ASC,
    QS.statement_start_offset ASC;

To illustrate, let’s run a single batch containing the four examples from earlier:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
-- Example 1
SELECT U.DisplayName 
FROM dbo.Users AS U 
WHERE U.Reputation = 999;

-- Example 2
SELECT 
    U.DisplayName, 
    U.CreationDate 
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 999;

-- Example 3
SELECT 
    U.DisplayName, 
    LOWER(U.DisplayName)
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 999;

-- Example 4
SELECT 
    U.DisplayName 
FROM dbo.Users AS U 
WHERE 
    U.Reputation >= 5 
    AND U.DisplayName > N'ZZZ' 
ORDER BY 
    U.Reputation DESC;
GO

The output of the DMO query is:

DMO query output

This confirms only examples 1 and 4 were successfully parameterized.

Performance Counters

It’s possible to use the SQL Statistics performance counters to get a detailed insight into parameterization activity for both estimated and actual plans. The counters used aren’t scoped per-session, so you’ll need to use a test instance with no other concurrent activity to get accurate results.

I’m going to supplement the parameterization counter information with data from the sys.dm_exec_query_optimizer_info DMO to provide statistics on trivial plans as well.

Some care is needed to prevent statements reading the counter information from modifying those counters themselves. I’m going to address this by creating a couple of temporary stored procedures:

CREATE PROCEDURE #TrivialPlans
AS
SET NOCOUNT ON;

SELECT
    OI.[counter],
    OI.occurrence
FROM sys.dm_exec_query_optimizer_info AS OI
WHERE
    OI.[counter] = N'trivial plan';
GO
CREATE PROCEDURE #PerfCounters
AS
SET NOCOUNT ON;

SELECT
    PC.[object_name],
    PC.counter_name,
    PC.cntr_value
FROM 
    sys.dm_os_performance_counters AS PC
WHERE 
    PC.counter_name LIKE N'%Param%';

The script to test a particular statement then looks like this:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
EXECUTE #PerfCounters;
EXECUTE #TrivialPlans;
GO
SET SHOWPLAN_XML ON;
GO
-- The statement(s) under test:
-- Example 3
SELECT 
    U.DisplayName, 
    LOWER(U.DisplayName)
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 999;
GO
SET SHOWPLAN_XML OFF;
GO
EXECUTE #TrivialPlans;
EXECUTE #PerfCounters;

Comment the SHOWPLAN_XML batches out to run the target statement(s) and get actual plans. Leave them in place for estimated execution plans.

Running the whole thing as written gives the following results:

Performance counter test results

I’ve highlighted above where values changed when testing example 3.

The increase in the “trivial plan” counter from 1050 to 1051 shows a trivial plan was found for the test statement.

The simple parameterization counters increased by 1 for both attempts and failures, showing SQL Server tried to parameterize the statement, but failed.

End of Part 3

In the next part of this series, I’ll explain the curious things we’ve seen by describing how simple parameterization and trivial plans interact with the compilation process.

If you changed your cost threshold for parallelism to run the examples, remember to reset it (mine was set to 50):

EXECUTE sys.sp_configure
    @configname = 'cost threshold for parallelism',
    @configvalue = 50;
RECONFIGURE;

The post Simple Parameterization and Trivial Plans — Part 3 appeared first on SQLPerformance.com.

Simple Parameterization and Trivial Plans — Part 2

Paul White — Tue, 29 Mar 2022 09:00:40 +0000

Parameter Data Types

As mentioned in the first part of this series, one of the reasons it is better to explicitly parameterize is so you have full control over parameter data types. Simple parameterization has a number of quirks in this area, which can result in more parameterized plans being cached than expected, or finding different results compared with the unparameterized version.

When SQL Server applies simple parameterization to an ad-hoc statement, it makes a guess about the data type of the replacement parameter. I’ll cover the reasons for the guessing later in the series.

For the time being, let’s look at some examples using the Stack Overflow 2010 database on SQL Server 2019 CU 14. Database compatibility is set to 150, and the cost threshold for parallelism is set to 50 to avoid parallelism for now:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 252;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 25221;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 252552;

These statements result in six cached plans, three Adhoc and three Prepared:

Different guessed types

Notice the different parameter data types in the Prepared plans.

Data Type Inference

The details of how each data type is guessed are complex and incompletely documented. As a starting point, SQL Server infers a basic type from the textual representation of the value, then uses the smallest compatible subtype.

For a string of numbers without quotation marks or a decimal point, SQL Server chooses from tinyint, smallint, and integer. For such numbers beyond the range of an integer, SQL Server uses numeric with the smallest possible precision. For example, the number 2,147,483,648 is typed as numeric(10,0). The bigint type isn’t used for server-side parameterization. This paragraph explains the data types selected in the prior examples.

Strings of numbers with a decimal point are interpreted as numeric, with a precision and scale just large enough to contain the value provided. Strings prefixed with a currency symbol are interpreted as money. Strings in scientific notation translate to float. The smallmoney and real types aren’t employed.

The datetime and uniqueidentifer types cannot be inferred from natural string formats. To get a datetime or uniqueidentifier parameter type, the literal value must be provided in ODBC escape format. For example {d '1901-01-01'}, {ts '1900-01-01 12:34:56.790'}, or {guid 'F85C72AB-15F7-49E9-A949-273C55A6C393'}. Otherwise, the intended date or UUID literal is typed as a string. Date and time types other than datetime aren’t used.

General string and binary literals are typed as varchar(8000), nvarchar(4000), or varbinary(8000) as appropriate, unless the literal exceeds 8000 bytes in which case the max variant is used. This scheme helps avoid the cache pollution and low level of reuse that would result from using specific lengths.

It isn’t possible to use CAST or CONVERT to set the data type for parameters for reasons I’ll detail later in this series. There is an example of this in the next section.

I won’t cover forced parameterization in this series, but I do want to mention the rules for data type inference in that case have some important differences compared to simple parameterization. Forced parameterization wasn’t added until SQL Server 2005, so Microsoft had the opportunity to incorporate some lessons from the simple parameterization experience, and didn’t have to worry much about backward-compatibility issues.

Numeric Types

For numbers with a decimal point and whole numbers beyond the range of integer , the inferred type rules present special problems for plan reuse and cache pollution.

Consider the following query using decimals:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
DROP TABLE IF EXISTS dbo.Test;
GO
CREATE TABLE dbo.Test
(
    SomeValue decimal(19,8) NOT NULL
);
GO
SELECT 
    T.SomeValue 
FROM dbo.Test AS T 
WHERE 
    T.SomeValue >= 987.65432 
    AND T.SomeValue < 123456.789;

This query qualifies for simple parameterization. SQL Server chooses the smallest precision and scale for the parameters able to contain the supplied values. This means it chooses numeric(8,5) for 987.65432 and numeric(9,3) for 123456.789:

Inferred numeric data types

These inferred types don’t match the decimal(19,8) type of the column, so a conversion around the parameter appears in the execution plan:

Conversion to column type

These conversions only represent a small runtime inefficiency in this particular case. In other situations, a mismatch between the column data type and the inferred type of a parameter might prevent an index seek or require SQL Server to do extra work to manufacture a dynamic seek.

Even where the resulting execution plan seems reasonable, a type mismatch can easily affect plan quality due to the effect of the type mismatch on cardinality estimation. It’s always best to use matching data types, and to pay careful attention to the derived types resulting from expressions.

Plan Reuse

The main issue with the current plan is the specific inferred types affecting cached plan matching and therefore reuse. Let’s run a couple more queries of the same general form:

SELECT 
    T.SomeValue 
FROM dbo.Test AS T 
WHERE 
    T.SomeValue >= 98.76 
    AND T.SomeValue < 123.4567;
GO
SELECT 
    T.SomeValue 
FROM dbo.Test AS T 
WHERE 
    T.SomeValue >= 1.2 
    AND T.SomeValue < 1234.56789;
GO

Now look at the plan cache:

SELECT
    CP.usecounts,
    CP.objtype,
    ST.[text]
FROM sys.dm_exec_cached_plans AS CP
CROSS APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
WHERE 
    ST.[text] NOT LIKE '%dm_exec_cached_plans%'
    AND ST.[text] LIKE '%SomeValue%Test%'
ORDER BY 
    CP.objtype ASC;

It shows an AdHoc and Prepared statement for each query we submitted:

Separate prepared statements

The parameterized text is the same, but the parameter data types are different, so separate plans are cached, and no plan reuse occurs.

If we continue to submit queries with different combinations of scale or precision, a new Prepared plan will be created and cached each time. Remember the inferred type of each parameter isn’t limited by the column data type, so we could end up with a tremendous number of cached plans, depending on the numeric literals submitted. The number of combinations from numeric(1,0) to numeric(38,38) is already large before we think about multiple parameters.

Explicit Parameterization

This problem doesn’t arise when we use explicit parameterization, ideally choosing the same data type as the column the parameter is compared with:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
DECLARE 
    @stmt nvarchar(4000) =
        N'SELECT T.SomeValue FROM dbo.Test AS T WHERE T.SomeValue >= @P1 AND T.SomeValue < @P2;',
    @params nvarchar(4000) =
        N'@P1 numeric(19,8), @P2 numeric(19,8)';

EXECUTE sys.sp_executesql 
    @stmt, 
    @params, 
    @P1 = 987.65432, 
    @P2 = 123456.789;

EXECUTE sys.sp_executesql 
    @stmt, 
    @params, 
    @P1 = 98.76, 
    @P2 = 123.4567;

EXECUTE sys.sp_executesql 
    @stmt, 
    @params, 
    @P1 = 1.2, 
    @P2 = 1234.56789;

With explicit parameterization, the plan cache query shows only one plan cached, used three times, and no type conversions needed:

Explicit parameterization

As a final side note, I’ve used decimal and numeric interchangeably in this section. They are technically different types, though documented to be synonyms and behaving equivalently. This is usually the case, but not always:

-- Raises error 8120:
-- Column 'dbo.Test.SomeValue' is invalid in the select list
-- because it is not contained in either an aggregate function
-- or the GROUP BY clause.
SELECT CONVERT(decimal(19,8), T.SomeValue)
FROM dbo.Test AS T 
GROUP BY CONVERT(numeric(19,8), T.SomeValue);

It’s probably a small parser bug, but it still pays to be consistent (unless you’re writing an article and want to point out an interesting exception).

Arithmetic Operators

There’s one other edge case I want to address, based on an example given in the documentation, but in a bit more detail (and perhaps accuracy):

-- The dbo.LinkTypes table contains two rows

-- Uses simple parameterization
SELECT r = CONVERT(float, 1./ 7) 
FROM dbo.LinkTypes AS LT;

-- No simple parameterization due to
-- constant-constant comparison
SELECT r = CONVERT(float, 1./ 7) 
FROM dbo.LinkTypes AS LT 
WHERE 1 = 1;

The results are different, as documented:

Different results

With Simple Parameterization

When simple parameterization occurs, SQL Server parameterizes both literal values. The 1. value is typed as numeric(1,0) as expected. Somewhat inconsistently, the 7 is typed as integer (not tinyint). The rules of type inference have been built over time, by different teams. Behaviours are maintained to avoid breaking legacy code.

The next step involves the / arithmetic operator. SQL Server requires compatible types before performing the division. Given numeric (decimal) has a higher data type precedence than integer, the integer will be converted to numeric.

SQL Server needs to implicitly convert the integer to numeric. But which precision and scale to use? The answer could be based on the original literal, as SQL Server does in other circumstances, but it always uses numeric(10) here.

The data type of the result of dividing a numeric(1,0)by a numeric(10,0) is determined by another set of rules, given in the documentation for precision, scale, and length. Plugging the numbers into the formulas for result precision and scale given there, we have:

Result precision:
- p1 - s1 + s2 + max(6, s1 + p2 + 1)
- = 1 - 0 + 0 + max(6, 0 + 10 + 1)
- = 1 + max(6, 11)
- = 1 + 11
- = 12
Result scale:
- max(6, s1 + p2 + 1)
- = max(6, 0 + 10 + 1)
- = max(6, 11)
- = 11

The data type of 1. / 7 is, therefore, numeric(12, 11). This value is then converted to float as requested and displayed as 0.14285714285 (with 11 digits after the decimal point).

Without Simple Parameterization

When simple parameterization isn’t performed, the 1. literal is typed as numeric(1,0) as before. The 7 is initially typed as integer also as seen previously. The key difference is the integer is converted to numeric(1,0), so the division operator has common types to work with. This is the smallest precision and scale able to contain the value 7. Remember simple parameterization used numeric(10,0) here.

The precision and scale formulas for dividing numeric(1,0) by numeric(1,0) give a result data type of numeric(7,6):

Result precision:
- p1 - s1 + s2 + max(6, s1 + p2 + 1)
- = 1 - 0 + 0 + max(6, 0 + 1 + 1)
- = 1 + max(6, 2)
- = 1 + 6
- = 7
Result scale:
- max(6, s1 + p2 + 1)
- = max(6, 0 + 1 + 1)
- = max(6, 2)
- = 6

After the final conversion to float, the displayed result is 0.142857 (with six digits after the decimal point).

The observed difference in the results is therefore due to interim type derivation (numeric(12,11) vs. numeric(7,6)) rather than the final conversion to float.

If you need further evidence the conversion to float isn’t responsible, consider:

-- Simple parameterization
SELECT r = CONVERT(decimal(13,12), 1. / 7)
FROM dbo.LinkTypes AS LT;

-- No simple parameterization
SELECT r = CONVERT(decimal(13,12), 1. / 7)
FROM dbo.LinkTypes AS LT 
OPTION (MAXDOP 1);

Result with decimal

The results differ in value and scale as before.

This section doesn’t cover every quirk of data type inference and conversion with simple parameterization by any means. As said before, you’re better off using explicit parameters with known data types wherever possible.

End of Part 2

The next part of this series describes how simple parameterization affects execution plans.

The post Simple Parameterization and Trivial Plans — Part 2 appeared first on SQLPerformance.com.

Simple Parameterization and Trivial Plans — Part 1

Paul White — Tue, 22 Mar 2022 09:00:24 +0000

This is the first part of a series about simple parameterization and trivial plans. These two compilation features are closely connected and have similar goals. Both target performance and efficiency for workloads frequently submitting simple statements.

Despite the “simple” and “trivial” names, both have subtle behaviours and implementation details that can make how they work difficult to understand. This series doesn’t dwell too long on the basics but concentrates on less well-known aspects likely to trip up even the most experienced database professionals.

In this first part, after a quick introduction, I look at the effects of simple parameterization on the plan cache.

Simple Parameterization

It's almost always better to explicitly parameterize statements, rather than relying on the server to do it. Being explicit gives you complete control over all aspects of the parameterization process, including where parameters are used, the precise data types used, and when plans are reused.

Most clients and drivers provide specific ways to use explicit parameterization. There are also options like sp_executesql, stored procedures, and functions.

I’m not going to get into the related issues of parameter sniffing or SQL injection because, while important, they're not the focus of this series. Still, you should write code with both close to the front of your mind.

For legacy applications or other third-party code that cannot be easily changed, explicit parameterization may not always be possible. You may be able to overcome some obstacles using template plan guides. In any event, it would be an unusual workload that does not contain at least some parameterized statements server-side.

Shell Plans

When SQL Server 2005 introduced Forced Parameterization, the existing auto-parameterization feature was renamed to Simple Parameterization. Despite the change in terminology, simple parameterization works the same as auto-parameterization always did: SQL Server attempts to replace constant literal values in ad hoc statements with parameter markers. The aim is to reduce compilations by increasing cached plan reuse.

Let’s look at an example, using the Stack Overflow 2010 database on SQL Server 2019 CU 14. Database compatibility is set to 150, and the cost threshold for parallelism is set to 50 to avoid parallelism for the moment:

EXECUTE sys.sp_configure
    @configname = 'show advanced options',
    @configvalue = 1;
RECONFIGURE;
GO
EXECUTE sys.sp_configure
    @configname = 'cost threshold for parallelism',
    @configvalue = 50;
RECONFIGURE;

Example code:

-- Clear the cache of plans for this database
ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2521;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2827;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3144;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3151;
GO

Those statements feature predicates that differ only in their constant literal values. SQL Server successfully applies simple parameterization, resulting in a parameterized plan. The single parameterized plan is used four times as we can see by querying the plan cache:

SELECT
    CP.usecounts,
    CP.cacheobjtype,
    CP.objtype,
    CP.size_in_bytes,
    ST.[text],
    QP.query_plan
FROM sys.dm_exec_cached_plans AS CP
OUTER APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
OUTER APPLY sys.dm_exec_query_plan (CP.plan_handle) AS QP
WHERE 
    ST.[text] NOT LIKE '%dm_exec_cached_plans%'
    AND ST.[text] LIKE '%DisplayName%Users%'
ORDER BY 
    CP.usecounts ASC;

The results show an Adhoc plan cache entry for each original statement and a single Prepared plan:

Four adhoc plans and one prepared plan

A Prepared statement is similar to a stored procedure, with parameters inferred from literal values found in the Adhoc statement. I mention this as a useful mental model when thinking about the server-side parameterization process.

Notice that SQL Server caches both the original text and the parameterized form. When simple parameterization is successful, the plan associated with the original text is Adhoc and does not contain a full execution plan. Instead, the cached plan is a shell with very little besides a pointer to the Prepared parameterized plan.

The XML representation of the shell plans contain text like:

That's the entire plan. The ParameterizedPlanHandle points from the Adhoc shell to the full parameterized plan. The handle value is the same for all four shell plans.

Plan Stubs

Shell plans are smaller than a full compiled plan—16KB instead of 40KB in the example. This can still add up to a significant amount of memory if you have many statements using simple parameterization or lots of different parameter values. Most SQL Server instances are not so awash with memory that they can afford to waste it like this. The shell plans are considered very disposable by SQL Server, but finding and removing them consumes resources and can become a point of contention.

We can reduce the total memory consumption for shell plans by enabling the optimize for ad hoc workloads option.

EXECUTE sys.sp_configure
    @configname = 'show advanced options',
    @configvalue = 1;
RECONFIGURE;
GO
EXECUTE sys.sp_configure
    @configname = 'optimize for ad hoc workloads',
    @configvalue = 1;
RECONFIGURE;

This caches a tiny stub the first time an ad hoc statement is encountered instead of a shell. The stub serves as a bookmark so the server can remember it's seen the exact statement text before. Upon encountering the same text a second time, compilation and caching proceed as if optimize for ad hoc workloads were not enabled.

Re-running the example with optimize for ad hoc workloads enabled shows the effect on the plan cache.

Compiled Plan Stubs

No plan is cached for the ad-hoc statements, just a stub. There is no ParameterizedPlanHandle pointer to the Prepared plan, though a complete parameterized plan is cached.

Running the test batches for a second time (without clearing the plan cache) gives the same outcome as when optimize for ad hoc workloads was not enabled—four Adhoc shell plans pointing to the Prepared plan.

Before continuing, reset the optimize for ad hoc workloads setting to zero:

EXECUTE sys.sp_configure
    @configname = 'optimize for ad hoc workloads',
    @configvalue = 0;
RECONFIGURE;

Plan Cache Size Limits

Whether plan shells or plan stubs are used, there are still downsides to all these Adhoc cache entries. I've already mentioned total memory use, but each plan cache also has a maximum number of entries. Even where the total memory usage is not a concern, the sheer quantity may be.

The limits can be raised with documented trace flag 174 (number of entries) and trace flag 8032 (total size). Depending on the workload and other memory demands, this may not be the best solution. After all, it just means caching more low-value Adhoc plans, taking memory away from other needs.

Caching Only Prepared Plans

If the workload rarely issues ad-hoc batches with exactly the same statement text, caching plan shells or plan stubs is a waste of resources. It consumes memory and may cause contention when the SQL Plans cache store (CACHESTORE_SQLCP) needs to be shrunk to fit within configured limits.

The ideal would be to parameterize incoming ad-hoc batches, but only cache the parameterized version. There is a cost to doing this, because future ad-hoc statements need to be parameterized before they can be matched to the parameterized cached plan. On the other hand, this would have happened anyway since we've already stated exact textual matches are rare for the target workload.

For workloads that benefit from simple parameterization, but not the caching of Adhoc entries, there are a couple of options.

Undocumented Trace Flag

The first option is to enable undocumented trace flag 253. This prevents the caching of Adhoc plans completely. It does not simply restrict the number of such plans, or prevent them from “staying” in the cache, as has sometimes been suggested.

Trace flag 253 can be enabled at the session level—restricting its effects to just that connection—or more widely as a global or start-up flag. It also functions as a query hint, but using those prevents simple parameterization, which would be counterproductive here. There is a partial list of the things that prevent simple parameterization in the Microsoft Technical Paper, Plan Caching and Recompilation in SQL Server 2012.

With trace flag 253 active before the batch is compiled, only the Prepared statements are cached:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
-- Do not cache ad-hoc plans
DBCC TRACEON (253);
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2521;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2827;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3144;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3151;
GO
-- Cache ad-hoc plans again
DBCC TRACEOFF (253);
GO

The plan cache query confirms only the Prepared statement is cached and reused.

Only the prepared statement is cached

The Uncacheable Batch

The second option is to include a statement that marks the entire batch as uncacheable. Suitable statements are often security-related or otherwise sensitive in some way.

This might sound impractical, but there are a couple of mitigations. First, the sensitive statement need not be executed—it just needs to be present. When that condition is met, the user running the batch doesn’t even need permission to execute the sensitive statement. Note carefully, the effect is confined to the batch containing the sensitive statement.

Two suitably-sensitive statements and example usage are shown below (with the test statements now in a single batch):

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
-- Prevent caching of all statements in this batch.
-- Neither KEY nor CERTIFICATE need to exist.
-- No special permissions are needed.
-- GOTO is used to ensure the statements are not executed.
GOTO Start
    OPEN SYMMETRIC KEY Banana 
        DECRYPTION BY CERTIFICATE Banana;
Start:

/* Another way to achieve the same effect without GOTO
IF 1 = 0
BEGIN
    CREATE APPLICATION ROLE Banana 
    WITH PASSWORD = '';
END;
*/

SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2521;

SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2827;

SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3144;

SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3151;
GO

The Prepared plans created by simple parameterization are still cached and reused despite the parent batch being marked as uncacheable.

Only the prepared statement is cached

Neither solution is ideal, but until Microsoft provides a documented and supported solution for this issue, they’re the best options I’m aware of.

End of Part 1

There's a lot more ground to cover on this topic. Part two will cover the data types assigned when simple parameterization is employed.

The post Simple Parameterization and Trivial Plans — Part 1 appeared first on SQLPerformance.com.

The Adaptive Join Threshold

Paul White — Thu, 04 Nov 2021 09:00:37 +0000

First introduced in SQL Server 2017 Enterprise Edition, an adaptive join enables a runtime transition from a batch mode hash join to a row mode correlated nested loops indexed join (apply) at runtime. For brevity, I’ll refer to a “correlated nested loops indexed join” as an apply throughout the rest of this article. If you need a refresher on the difference between nested loops and apply, please see my previous article.

Whether an adaptive join transitions from a hash join to apply at runtime depends on a value labelled Adaptive Threshold Rows on the Adaptive Join execution plan operator. This article shows how an adaptive join works, includes details of the threshold calculation, and covers the implications of some of the design choices made.

Introduction

One thing I want you to bear in mind throughout this piece is an adaptive join always starts executing as a batch mode hash join. This is true even if the execution plan indicates the adaptive join expects to run as a row mode apply.

Like any hash join, an adaptive join reads all rows available on its build input and copies the required data into a hash table. The batch mode flavour of hash join stores these rows in an optimized format, and partitions them using one or more hash functions. Once the build input has been consumed, the hash table is fully populated and partitioned, ready for the hash join to start checking probe-side rows for matches.

This is the point where an adaptive join makes the decision to proceed with the batch mode hash join or to transition to a row mode apply. If the number of rows in the hash table is less than the threshold value, the join switches to an apply; otherwise, the join continues as a hash join by starting to read rows from the probe input.

If a transition to an apply join occurs, the execution plan doesn’t reread the rows used to populate the hash table to drive the apply operation. Instead, an internal component known as an adaptive buffer reader expands the rows already stored in the hash table and makes them available on-demand to the outer input of the apply operator. There’s a cost associated with the adaptive buffer reader, but it’s much lower than the cost of completely rewinding the build input.

Choosing an Adaptive Join

Query optimization involves one or more stages of logical exploration and physical implementation of alternatives. At each stage, when the optimizer explores the physical options for a logical join, it might consider both batch mode hash join and row mode apply alternatives.

If one of those physical join options forms part of the cheapest solution found during the current stage—and the other type of join can deliver the same required logical properties—the optimizer marks the logical join group as potentially suitable for an adaptive join. If not, consideration of an adaptive join ends here (and no adaptive join extended event is fired).

The normal operation of the optimizer means the cheapest solution found will only include one of the physical join options—either hash or apply, whichever had the lowest estimated cost. The next thing the optimizer does is build and cost a fresh implementation of the type of join that wasn’t chosen as cheapest.

Since the current optimization phase has already ended with a cheapest solution found, a special single-group exploration and implementation round is performed for the adaptive join. Finally, the optimizer calculates the adaptive threshold.

If any of the preceding work is unsuccessful, the extended event adaptive_join_skipped is fired with a reason.

If the adaptive join processing is successful, a Concat operator is added to the internal plan above the hash and apply alternatives with the adaptive buffer reader and any required batch/row mode adapters. Remember, only one of the join alternatives will execute at runtime, depending on the number of rows actually encountered compared with the adaptive threshold.

The Concat operator and individual hash/apply alternatives aren’t normally shown in the final execution plan. We’re instead presented with a single Adaptive Join operator. This is a just a presentation decision—the Concat and joins are still present in the code run by the SQL Server execution engine. You can find more details about this in the Appendix and Related Reading sections of this article.

The Adaptive Threshold

An apply is generally cheaper than a hash join for a smaller number of driving rows. The hash join has an extra startup cost to build its hash table but a lower per-row cost when it starts probing for matches.

There's generally a point where the estimated cost of an apply and hash join will be equal. This idea was nicely illustrated by Joe Sack in his article, Introducing Batch Mode Adaptive Joins:

Calculating the Threshold

At this point, the optimizer has a single estimate for the number of rows entering the build input of the hash join and apply alternatives. It also has the estimated cost of the hash and apply operators as a whole.

This gives us a single point on the extreme right edge of the orange and blue lines in the diagram above. The optimizer needs another point of reference for each join type so it can “draw the lines” and find the intersection (it doesn’t literally draw lines, but you get the idea).

To find a second point for the lines, the optimizer asks the two joins to produce a new cost estimate based on a different (and hypothetical) input cardinality. If the first cardinality estimate was more than 100 rows, it asks the joins to estimate new costs for one row. If the original cardinality was less than or equal to 100 rows, the second point is based on an input cardinality of 10,000 rows (so there’s a decent enough range to extrapolate).

In any case, the result is two different costs and row counts for each join type, allowing the lines to be “drawn.”

The Intersection Formula

Finding the intersection of two lines based on two points for each line is a problem with several well-known solutions. SQL Server uses one based on determinants as described on Wikipedia:

where:

The first line is defined by the points (x₁, y₁) and (x₂, y₂). The second line is given by the points (x₃, y₃) and (x₄, y₄). The intersection is at (P_x, P_y).

Our scheme has the number of rows on the x-axis and the estimated cost on the y-axis. We’re interested in the number of rows where the lines intersect. This is given by the formula for P_x. If we wanted to know the estimated cost at the intersection, it would be P_y.

For P_x rows, the estimated costs of the apply and hash join solutions would be equal. This is the adaptive threshold we need.

A Worked Example

Here’s an example using the AdventureWorks2017 sample database and the following indexing trick by Itzik Ben-Gan to get unconditional consideration of batch mode execution:

-- Itzik's trick
CREATE NONCLUSTERED COLUMNSTORE INDEX BatchMode
ON Sales.SalesOrderHeader (SalesOrderID)
WHERE SalesOrderID = -1
AND SalesOrderID = -2;

-- Test query
SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123;

The execution plan shows an adaptive join with a threshold of 1502.07 rows:

The estimated number of rows driving the adaptive join is 31,465.

Join Costs

In this simplified case, we can find estimated subtree costs for the hash and apply join alternatives using hints:

-- Hash
SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123
OPTION (HASH JOIN, MAXDOP 1);

-- Apply
SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123
OPTION (LOOP JOIN, MAXDOP 1);

This gives us one point on the line for each join type:

31,465 rows
- Hash cost 1.05083
- Apply cost 10.0552

The Second Point on the Line

Since the estimated number of rows is more than 100, the second reference points come from special internal estimates based on one join input row. Unfortunately, there’s no easy way to obtain the exact cost numbers for this internal calculation (I’ll talk more about this shortly).

For now, I’ll just show you the cost numbers (using the full internal precision rather than the six significant figures presented in execution plans):

One row (internal calculation)
- Hash cost 0.999027422729
- Apply cost 0.547927305023
31,465 rows
- Hash cost 1.05082787359
- Apply cost 10.0552890166

As expected, the apply join is cheaper than the hash for a small input cardinality but much more expensive for the expected cardinality of 31,465 rows.

The Intersection Calculation

Plugging these cardinality and cost numbers into the line intersection formula gives you the following:

-- Hash points (x = cardinality; y = cost)

DECLARE 
    @x1 float = 1, 
    @y1 float = 0.999027422729,
    @x2 float = 31465, 
    @y2 float = 1.05082787359;

-- Apply points (x = cardinality; y = cost)

DECLARE
    @x3 float = 1, 
    @y3 float = 0.547927305023,
    @x4 float = 31465, 
    @y4 float = 10.0552890166;

-- Formula:

SELECT Threshold =
    (
        (@x1 * @y2 - @y1 * @x2) * (@x3 - @x4) - 
        (@x1 - @x2) * (@x3 * @y4 - @y3 * @x4)
    )
    /
    (
        (@x1 - @x2) * (@y3 - @y4) - 
        (@y1 - @y2) * (@x3 - @x4)
    );

-- Returns 1502.06521571273

Rounded to six significant figures, this result matches the 1502.07 rows shown in the adaptive join execution plan:

Defect or Design?

Remember, SQL Server needs four points to “draw” the row count versus cost lines to find the adaptive join threshold. In the present case, this means finding cost estimations for the one-row and 31,465-row cardinalities for both apply and hash join implementations.

The optimizer calls a routine named sqllang!CuNewJoinEstimate to calculate these four costs for an adaptive join. Sadly, there are no trace flags or extended events to provide a handy overview of this activity. The normal trace flags used to investigate optimizer behaviour and display costs don’t function here (see the Appendix if you’re interested in more details).

The only way to obtain the one-row cost estimates is to attach a debugger and set a breakpoint after the fourth call to CuNewJoinEstimate in the code for sqllang!CardSolveForSwitch. I used WinDbg to obtain this call stack on SQL Server 2019 CU12:

At this point in the code, double-precision floating points costs are stored in four memory locations pointed to by addresses at rsp+b0, rsp+d0, rsp+30, and rsp+28 (where rsp is a CPU register and offsets are in hexadecimal):

The operator subtree cost numbers shown match those used in the adaptive join threshold calculation formula.

About Those One-Row Cost Estimates

You may have noticed the estimated subtree costs for the one-row joins seem quite high for the amount of work involved in joining one row:

One row
- Hash cost 0.999027422729
- Apply cost 0.547927305023

If you try to produce one-row input execution plans for the hash join and apply examples, you’ll see much lower estimated subtree costs at the join than those shown above. Likewise, running the original query with a row goal of one (or the number of join output rows expected for an input of one row) will also produce an estimated cost way lower than shown.

The reason is the CuNewJoinEstimate routine estimates the one-row case in a way I think most people wouldn’t find intuitive.

The final cost is made up of three main components:

The build input subtree cost
The local cost of the join
The probe input subtree cost

Items 2 and 3 depend on the type of join. For a hash join, they account for the cost of reading all the rows from the probe input, matching them (or not) with the one row in the hash table, and passing the results on to the next operator. For an apply, the costs cover one seek on the lower input to the join, the internal cost of the join itself, and returning the matched rows to the parent operator.

None of this is unusual or surprising.

The Cost Surprise

The surprise comes on the build side of the join (item 1 in the list). One might expect the optimizer to do some fancy calculation to scale the already-calculated subtree cost for 31,465 rows down to one average row, or something like that.

In fact, both hash and apply one-row join estimates simply use the whole subtree cost for the original cardinality estimate of 31,465 rows. In our running example, this “subtree” is the 0.54456 cost of the batch mode clustered index seek on the header table:

To be clear: the build-side estimated costs for the one-row join alternatives use an input cost calculated for 31,465 rows. That should strike you as a bit odd.

As a reminder, the one-row costs computed by CuNewJoinEstimate were as follows:

One row
- Hash cost 0.999027422729
- Apply cost 0.547927305023

You can see the total apply cost (~0.54793) is dominated by the 0.54456 build-side subtree cost, with a tiny extra amount for the single inner-side seek, processing the small number of resulting rows within the join, and passing them on to the parent operator.

The estimated one-row hash join cost is higher because the probe side of the plan consists of a full index scan, where all resulting rows must pass through the join. The total cost of the one-row hash join is a little lower than the original 1.05095 cost for the 31,465-row example because there’s now only one row in the hash table.

Implications

One would expect a one-row join estimate to be based, in part, on the cost of delivering one row to the driving join input. As we’ve seen, this isn’t the case for an adaptive join: both apply and hash alternatives are saddled with the full estimated cost for 31,465 rows. The rest of the join is costed pretty much as one would expect for a one-row build input.

This intuitively strange arrangement is why it’s difficult (perhaps impossible) to show an execution plan mirroring the calculated costs. We’d need to construct a plan delivering 31,465 rows to the upper join input but costing the join itself and its inner input as if only one row were present. A tough ask.

The effect of all this is to raise the leftmost point on our intersecting-lines diagram up the y-axis. This affects the slope of the line and therefore the intersection point.

Another practical effect is the calculated adaptive join threshold now depends on the original cardinality estimate at the hash build input, as noted by Joe Obbish in his 2017 blog post. For example, if we change the WHERE clause in the test query to SOH.SalesOrderID <= 55000, the adaptive threshold reduces from 1502.07 to 1259.8 without changing the query plan hash. Same plan, different threshold.

This arises because, as we’ve seen, the internal one-row cost estimate depends on the build input cost for the original cardinality estimate. This means different initial build-side estimates will give a different y-axis “boost” to the one-row estimate. In turn, the line will have a different slope and a different intersection point.

Intuition would suggest the one-row estimate for the same join should always give the same value regardless of the other cardinality estimate on the line (given the exact same join with the same properties and row sizes has a close-to-linear relationship between driving rows and cost). This isn’t the case for an adaptive join.

By Design?

I can tell you with some confidence what SQL Server does when calculating the adaptive join threshold. I don’t have any special insight as to why it does it this way.

Still, there are some reasons to think this arrangement is deliberate and came about after due consideration and feedback from testing. The remainder of this section covers some of my thoughts on this aspect.

An adaptive join isn’t a straight choice between a normal apply and batch mode hash join. An adaptive join always starts by fully populating the hash table. Only once this work is complete is the decision made to switch to an apply implementation or not.

By this time, we’ve already incurred potentially significant cost by populating and partitioning the hash join in memory. This may not matter much for the one-row case, but it becomes progressively more important as cardinality increases. The unexpected “boost” may be a way to incorporate these realities into the calculation while retaining a reasonable computation cost.

The SQL Server cost model has long been a bit biased against nested loops join, arguably with some justification. Even the ideal indexed apply case can be slow in practice if the data needed isn’t already in memory, and the I/O subsystem isn’t flash, especially with a somewhat random access pattern. Limited amounts of memory and sluggish I/O won’t be entirely unfamiliar to users of lower-end cloud-based database engines, for example.

It’s possible practical testing in such environments revealed an intuitively costed adaptive join was too quick to transition to an apply. Theory is sometimes only great in theory.

Still, the current situation isn’t ideal; caching a plan based on an unusually low cardinality estimate will produce an adaptive join much more reluctant to switch to an apply than it would’ve been with a larger initial estimate. This is a variety of the parameter-sensitivity problem, but it’ll be a new consideration of this type for many of us.

Now, it’s also possible using the full build input subtree cost for the leftmost point of the intersecting cost lines is simply an uncorrected error or oversight. My feeling is the current implementation is probably a deliberate practical compromise, but you’d need someone with access to the design documents and source code to know for sure.

Summary

An adaptive join allows SQL Server to transition from a batch mode hash join to an apply after the hash table has been fully populated. It makes this decision by comparing the number of rows in the hash table with a precalculated adaptive threshold.

The threshold is computed by predicting where apply and hash join costs are equal. To find this point, SQL Server produces a second internal join cost estimate for a different build input cardinality—normally, one row.

Surprisingly, the estimated cost for the one-row estimate includes the full build-side subtree cost for the original cardinality estimate (not scaled to one row). This means the threshold value depends on the original cardinality estimate at the build input.

Consequently, an adaptive join may have an unexpectedly low threshold value, meaning the adaptive join is much less likely to transition away from a hash join. It’s unclear if this behaviour is by design.

Appendix

This section covers a couple of adaptive join aspects that were difficult to include in the main text in a natural way.

The Expanded Adaptive Plan

You might try looking at a visual representation of the internal plan using undocumented trace flag 9415, as provided by Dima Pilugin in his excellent adaptive join internals article linked above. With this flag active, the adaptive join plan for our running example becomes the following:

This is a useful representation to aid understanding, but it isn’t entirely accurate, complete, or consistent. For example, the Table Spool doesn’t exist—it’s a default representation for the adaptive buffer reader reading rows directly from the batch mode hash table.

The operator properties and cardinality estimates are also a bit all over the place. The output from the adaptive buffer reader (“spool”) should be 31,465 rows, not 121,317. The subtree cost of the apply is incorrectly capped by the parent operator cost. This is normal for showplan, but it makes no sense in an adaptive join context.

There are other inconsistencies as well—too many to usefully list— but that can happen with undocumented trace flags. The expanded plan shown above isn’t intended for use by end users, so perhaps it isn’t entirely surprising. The message here is not to rely too heavily on the numbers and properties shown in this undocumented form.

I should also mention in passing the finished standard adaptive join plan operator isn’t entirely without its own consistency issues. These stem pretty much exclusively from the hidden details.

For example, the displayed adaptive join properties come from a mixture of the underlying Concat, Hash Join, and Apply operators. You can see an adaptive join reporting batch mode execution for nested loops join (which is impossible), and the elapsed time shown is actually copied from the hidden Concat, not the particular join that executed at runtime.

The Usual Suspects

We can get some useful information from the sorts of undocumented trace flags normally used to look at optimizer output. For example:

SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123
OPTION (
    QUERYTRACEON 3604,
    QUERYTRACEON 8607,
    QUERYTRACEON 8612);

Output (heavily edited for readability):

*** Output Tree: ***
PhyOp_ExecutionModeAdapter(BatchToRow) Card=121317 Cost=1.05095

PhyOp_Concat (batch) Card=121317 Cost=1.05325
PhyOp_HashJoinx_jtInner (batch) Card=121317 Cost=1.05083
- PhyOp_Range Sales.SalesOrderHeader Card=31465 Cost=0.54456
- PhyOp_Filter(batch) Card=121317 Cost=0.397185
  - PhyOp_Range Sales.SalesOrderDetail Card=121317 Cost=0.338953
PhyOp_ExecutionModeAdapter(RowToBatch) Card=121317 Cost=10.0798
- PhyOp_Apply Card=121317 Cost=10.0553
  - PhyOp_ExecutionModeAdapter(BatchToRow) Card=31465 Cost=0.544623
    - PhyOp_Range Sales.SalesOrderHeader Card=31465 Cost=0.54456 [** 3 **]
  - PhyOp_Filter Card=3.85562 Cost=9.00356

This gives some insight into the estimated costs for the full-cardinality case with hash and apply alternatives without writing separate queries and using hints. As mentioned in the main text, these trace flags aren’t effective within CuNewJoinEstimate, so we can’t directly see the repeat calculations for the 31,465-row case or any of the details for the one-row estimates this way.

Merge Join and Row Mode Hash Join

Adaptive joins only offer a transition from batch mode hash join to row mode apply. For the reasons why row mode hash join isn’t supported, see the Intelligent Query Processing Q&A in the Related Reading section. In short, it’s thought row mode hash joins would be too prone to performance regressions.

Switching to a row mode merge join would be another option, but the optimizer doesn’t currently consider this. As I understand it, it’s unlikely to be expanded in this direction in future.

Some of the considerations are the same as they are for row mode hash join. In addition, merge join plans tend to be less easily interchangeable with hash join, even if we limit ourselves to indexed merge join (no explicit sort).

There’s also a much greater distinction between hash and apply than there is between hash and merge. Both hash and merge are suitable for larger inputs, and apply is better suited to a smaller driving input. Merge join isn’t as easily parallelized as hash join and doesn’t scale as well with increasing thread counts.

Given the motivation for adaptive joins is to cope better with significantly varying input sizes—and only hash join supports batch mode processing—the choice of batch hash versus row apply is the more natural one. Finally, having three adaptive join choices would significantly complicate the threshold calculation for potentially little gain.

The post The Adaptive Join Threshold appeared first on SQLPerformance.com.

How Parallel Plans Start Up – Part 5

Paul White — Mon, 17 May 2021 09:00:00 +0000

This is the final part of a five-part series taking a deep dive into the way SQL Server row mode parallel plans start executing. Part 1 initialized execution context zero for the parent task, and part 2 created the query scan tree. Part 3 started the query scan, performed some early phase processing, and started the first additional parallel tasks in branch C. Part 4 described exchange synchronization, and the start up of parallel plan branches C & D.

Branch B Parallel Tasks Start

A reminder of the branches in this parallel plan (click to enlarge):

This is the fourth stage in the execution sequence:

Branch A (parent task).
Branch C (additional parallel tasks).
Branch D (additional parallel tasks).
Branch B (additional parallel tasks).

The only thread active right now (not suspended on CXPACKET) is the parent task, which is at the consumer side of the repartition streams exchange at node 11 in Branch B:

The parent task now returns from nested early phases calls, setting elapsed and CPU times in profilers as it goes. First and last active times are not updated during early phase processing. Remember these numbers are being recorded against execution context zero — the Branch B parallel tasks do not exist yet.

The parent task ascends the tree from node 11, through the stream aggregate at node 10 and the merge join at node 3, back to the gather streams exchange at node 2.

Early phase processing is now complete.

With the original EarlyPhases call at the node 2 gather streams exchange finally completed, the parent task returns to opening that exchange (you may just about remember that call from right at the start of this series). The open method at node 2 now calls CQScanExchangeNew::StartAllProducers to create the parallel tasks for Branch B.

The parent task now waits on CXPACKET at the consumer side of the node 2 gather streams exchange. This wait will continue until the newly-created Branch B tasks have completed their nested Open calls and returned to complete opening the producer side of the gather streams exchange.

Branch B parallel tasks open

The two new parallel tasks in Branch B start at the producer side of the node 2 gather streams exchange. Following the usual row mode iterative execution model, they call:

CQScanXProducerNew::Open (node 2 producer side open).
CQScanProfileNew::Open (profiler for node 3).
CQScanMergeJoinNew::Open (node 3 merge join).
CQScanProfileNew::Open (profiler for node 4).
CQScanStreamAggregateNew::Open (node 4 stream aggregate).
CQScanProfileNew::Open (profiler for node 5).
CQScanExchangeNew::Open (repartition streams exchange).

The parallel tasks both follow the outer (upper) input to the merge join, just as the early phase processing did.

Completing the exchange

When the Branch B tasks arrive at the consumer side of the repartition streams exchange at node 5, each task:

Registers with the exchange port (CXPort).
Creates the pipes (CXPipe) that connect this task to one or more producer side tasks (depending on the type of exchange). The current exchange is a repartition streams, so each consumer task has two pipes (at DOP 2). Each consumer may receive rows from either of the two producers.
Adds a CXPipeMerge to merge rows from multiple pipes (since this is an order-preserving exchange).
Creates row packets (confusingly named CXPacket) used for flow control and to buffer rows across the exchange pipes. These are allocated from previously granted query memory.

Once both consumer-side parallel tasks have completed that work, the node 5 exchange is ready to go. The two consumers (in Branch B) and the two producers (in Branch C) have all opened the exchange port, so the node 5 CXPACKET waits end.

Checkpoint

As things stand:

The parent task in Branch A is waiting on CXPACKET at the consumer side of the node 2 gather streams exchange. This wait will continue until both node 2 producers return and open the exchange.
The two parallel tasks in Branch B are runnable. They have just opened the consumer side of the repartition streams exchange at node 5.
The two parallel tasks in Branch C have just been released from their CXPACKET wait, and are now runnable. The two stream aggregates at node 6 (one per parallel task) can begin aggregating rows from the two sorts at node 7. Recall the index seeks at node 9 closed some time ago, when the sorts completed their input phase.
The two parallel tasks in Branch D are waiting on CXPACKET at the producer side of the repartition streams exchange at node 11. They are waiting for the consumer side of node 11 to be opened by the two parallel tasks in Branch B. The index seeks have closed down, and the sorts are ready to transition to their output phase.

Multiple active branches

This is the first time we have had multiple branches (B and C) active at the same time, which could be challenging to discuss. Luckily, the design of the demo query is such that the stream aggregates in Branch C will produce only a few rows. The small number of narrow output rows will easily fit in the row packet buffers at the node 5 repartition streams exchange. The Branch C tasks can therefore get on with their work (and eventually close down) without waiting for the node 5 repartition streams consumer side to fetch any rows.

Conveniently, this means we can let the two Branch C parallel tasks run in the background without worrying about them. We need only concern ourselves with what the two Branch B parallel tasks are doing.

Branch B opening completes

A reminder of Branch B:

The two parallel workers in Branch B return from their Open calls at the node 5 repartition streams exchange. This takes them back though the stream aggregate at node 4, to the merge join at node 3.

Because we are ascending the tree in the Open method, the profilers above node 5 and node 4 are recording last active time, as well as accumulating elapsed and CPU times (per task). We are not executing early phases on the parent task now, so the numbers recorded for execution context zero are not affected.

At the merge join, the two Branch B parallel tasks start descending the inner (lower) input, taking them through the stream aggregate at node 10 (and a couple of profilers) to the consumer side of the repartition streams exchange at node 11.

Branch D resumes execution

A repeat of the Branch C events at node 5 now occurs at the node 11 repartition streams. The consumer side of the node 11 exchange is completed and opened. The two producers in Branch D end their CXPACKET waits, becoming runnable again. We will let the Branch D tasks run in the background, placing their results in exchange buffers.

There are now six parallel tasks (two each in Branches B, C, and D) cooperatively sharing time on the two schedulers assigned to additional parallel tasks in this query.

Branch A Opening Completes

The two parallel tasks in Branch B return from their Open calls at the node 11 repartition streams exchange, up past the node 10 stream aggregate, through the merge join at node 3, and back to the producer side of the gather streams at node 2. Profiler last active and accumulated elapsed & CPU times are updated as we ascend the tree in nested Open methods.

At the producer side of the gather streams exchange, the two Branch B parallel tasks synchronize opening the exchange port, then wait on CXPACKET for the consumer side to open.

The parent task waiting on the consumer side of the gather streams is now released from its CXPACKET wait, which allows it to complete opening the exchange port on the consumer side. This in turn releases the producers from their (brief) CXPACKET wait. The node 2 gather streams has now been opened by all owners.

Completing the Query Scan

The parent task now ascends the query scan tree from the gather streams exchange, returning from the Open calls at the exchange, segment, and sequence project operators in Branch A.

This completes opening the query scan tree, initiated all that while ago by the call to CQueryScan::StartupQuery. All branches of the parallel plan have now started executing.

Returning rows

The execution plan is ready to begin returning rows in response to GetRow calls at the root of the query scan tree, initiated by a call to CQueryScan::GetRow. I am not going to go into full detail, since it is strictly beyond the scope of an article about how parallel plans start up.

Still, the brief sequence is:

The parent task calls GetRow on the sequence project, which calls GetRow on the segment, which calls GetRow on the consumer side of the gather streams exchange.
If are no rows available at the exchange yet, the parent task waits on CXCONSUMER.
Meanwhile, the independently-running Branch B parallel tasks have been recursively calling GetRow starting at the producer side of the gather streams exchange.
Rows are supplied to Branch B by the consumer sides of the repartition streams exchanges at nodes 5 and 12.
Branches C and D are still processing rows from their sorts through their respective stream aggregates. Branch B tasks may have to wait on CXCONSUMER at repartition streams nodes 5 and 12 for a complete packet of rows to become available.
Rows emerging from the nested GetRow calls in Branch B are assembled into row packets at the producer side of the gather streams exchange.
The parent task’s CXCONSUMER wait at the consumer side of the gather streams ends when a packet becomes available.
A row at a time is then processed through the parent operators in Branch A, and finally on to the client.
Eventually, the rows run out, and a nested Close call ripples down the tree, across the exchanges, and parallel execution comes to an end.

Summary and Final Notes

First, a summary of the execution sequence of this particular parallel execution plan:

The parent task opens branch A. Early phase processing begins at the gather streams exchange.
Parent task early phase calls descend the scan tree to the index seek at node 9, then ascend back to the repartitioning exchange at node 5.
The parent task starts parallel tasks for Branch C, then waits while they read all available rows into the blocking sort operators at node 7.
Early phase calls ascend to the merge join, then descend the inner input to the exchange at node 11.
Tasks for Branch D are started just as for Branch C, while the parent task waits at node 11.
Early phase calls return from node 11 as far as the gather streams. The early phase ends here.
The parent task creates parallel tasks for Branch B, and waits until the opening of branch B is complete.
Branch B tasks reach the node 5 repartition streams, synchronize, complete the exchange, and release Branch C tasks to start aggregating rows from the sorts.
When Branch B tasks reach the node 12 repartition streams, they synchronize, complete the exchange, and release Branch D tasks to start aggregating rows from the sort.
Branch B tasks return to the gather streams exchange and synchronize, releasing the parent task from its wait. The parent task is now ready to start the process of returning rows to the client.

You might like to watch the execution of this plan in Sentry One Plan Explorer. Be sure to enable the "With Live Query Profile" option of Actual Plan collection. The nice thing about executing the query directly within Plan Explorer is you will be able to step through multiple captures at your own pace, and even rewind. It will also show a graphical summary of I/O, CPU, and waits synchronized with the live query profiling data.

Additional notes

Ascending the query scan tree during early phase processing sets first and last active times at each profiling iterator for the parent task, but does not accumulate elapsed or CPU time. Ascending the tree during Open and GetRow calls on a parallel task sets last active time, and accumulates elapsed and CPU time at each profiling iterator per task.

Early phase processing is specific to row mode parallel plans. It is necessary to ensure exchanges are initialized in the correct order, and all the parallel machinery works correctly.

The parent task does not always perform the whole of early phase processing. Early phases start at a root exchange, but how those calls navigate the tree depends on the iterators encountered. I chose a merge join for this demo because it happens to require early phase processing for both inputs.

Early phases at (for example) a parallel hash join propagate down the build input only. When the hash join transitions to its probe phase, it opens iterators on that input, including any exchanges. Another round of early phase processing is initiated, handled by (exactly) one of the parallel tasks, playing the role of the parent task.

When early phase processing encounters a parallel branch containing a blocking iterator, it starts the additional parallel tasks for that branch, and waits for those producers to complete their opening phase. That branch may also have child branches, which are handled in the same way, recursively.

Some branches in a row mode parallel plan may be required to run on a single thread (e.g. due to a global aggregate or top). These ‘serial zones’ also run on an additional ‘parallel’ task, the only difference being there is only one task, execution context, and worker for that branch. Early phase processing works the same regardless of the number of tasks assigned to a branch. For example, a ‘serial zone’ reports timings for the parent task (or a parallel task playing that role) as well as the single additional task. This manifests in showplan as data for “thread 0” (early phases) as well as “thread 1” (the additional task).

Closing thoughts

All this certainly represents an extra layer of complexity. The return on that investment is in runtime resource usage (primarily threads and memory), reduced synchronization waits, increased throughput, potentially accurate performance metrics, and a minimized chance of intra-query parallel deadlocks.

Though row mode parallelism has largely been eclipsed by the more modern batch mode parallel execution engine, the row mode design still has a certain beauty to it. Most iterators get to pretend they are still running in a serial plan, with almost all of the synchronization, flow control, and scheduling handled by the exchanges. The care and attention evident in implementation details like early phase processing enables even the largest parallel plans to execute successfully without the query designer giving too much thought to the practical difficulties.

The post How Parallel Plans Start Up – Part 5 appeared first on SQLPerformance.com.

How Parallel Plans Start Up – Part 4

Paul White — Thu, 06 May 2021 09:00:58 +0000

This is the fourth part in a five-part series taking a deep dive into the way SQL Server row mode parallel plans start executing. Part 1 initialized execution context zero for the parent task, and part 2 created the query scan tree. Part 3 started the query scan, performed some early phase processing, and started the first additional parallel tasks in branch C.

Branch C execution details

This is the second step of the execution sequence:

Branch A (parent task).
Branch C (additional parallel tasks).
Branch D (additional parallel tasks).
Branch B (additional parallel tasks).

A reminder of the branches in our parallel plan (click to enlarge)

A short time after the new tasks for branch C are queued, SQL Server attaches a worker to each task, and places the worker on a scheduler ready for execution. Each new task runs inside a new execution context. At DOP 2, there are two new tasks, two worker threads, and two execution contexts for branch C. Each task runs its own copy of the iterators in branch C on its own worker thread:

The two new parallel tasks start running at a sub-procedure entry point, which initially leads to an Open call on the producer side of the exchange (CQScanXProducerNew::Open). Both tasks have identical call stacks at the start of their lives:

Exchange synchronization

Meanwhile, the parent task (running on its own worker thread) registers the new sub-processes with the sub-process manager, then waits at the consumer side of the repartition streams exchange at node 5. The parent task waits on CXPACKET* until all of the branch C parallel tasks complete their Open calls and return to the producer side of the exchange. The parallel tasks will open every iterator in their subtree (i.e. down to the index seek at node 9 and back) before returning to the repartition streams exchange at node 5. The parent task will wait on CXPACKET while this happens. Remember the parent task is executing early phases calls.

We can see this wait in the waiting tasks DMV:

Execution context zero (the parent task) is blocked by both of the new execution contexts. These execution contexts are the first additional ones to be created after context zero, so they are assigned the numbers one and two. To emphasise: Both new execution contexts need to open their subtrees and return to the exchange for the parent task’s CXPACKET wait to end.

You might have been expecting to see CXCONSUMER waits here, but that wait is reserved for waiting on row data to arrive. The current wait is not for rows — it is for the producer side to open, so we get a generic CXPACKET* wait.

* Azure SQL Database and Managed Instance use the new CXSYNC_PORT wait instead of CXPACKET here, but that improvement hasn’t made its way into SQL Server yet (as of 2019 CU9).

Inspecting the new parallel tasks

We can see the new tasks in the query profiles DMV. Profiling information for the new tasks appears in the DMV because their execution contexts were derived (cloned, then updated) from the parent (execution context zero):

There are now three entries for each iterator in Branch C (highlighted). One for the parent task (execution context zero), and one for each new additional parallel task (contexts 1 and 2). Notice that the per-thread estimated row counts ( see part 1) have arrived now, and are shown only for the parallel tasks. The first and last active times for the parallel tasks represent the time their execution contexts were created. None of the new tasks has opened any iterators yet.

The repartition streams exchange at node 5 still only has a single entry in the DMV output. This is because the associated invisible profiler monitors the consumer side of the exchange. The additional parallel tasks are on the producer side of the exchange. The consumer side of node 5 will eventually have parallel tasks, but we haven’t got to that point yet.

Checkpoint

This seems like a good point to pause for breath and summarize where everything is at the moment. There will be more of these stopping points as we go along.

The parent task is on the consumer side of the repartition streams exchange at node 5, waiting on CXPACKET. It is in the middle of executing early phases calls. It paused to start up Branch C because that branch contains a blocking sort. The parent task’s wait will continue until both parallel tasks complete opening their subtrees.
Two new parallel tasks on the producer side of the node 5 exchange are ready to open the iterators in Branch C.

Nothing outside Branch C of this parallel execution plan can make forward progress until the parent task is released from its CXPACKET wait. Remember we have only created one set of additional parallel workers so far, for Branch C. The only other thread is the parent task, and that is blocked.

Branch C Parallel Execution

The two parallel tasks start at the producer side of the repartition streams exchange at node 5. Each has a separate (serial) plan with its own stream aggregate, sort, and index seek. The compute scalar does not appear in the runtime plan because its calculations are deferred to the sort.

Each instance of the index seek is parallel-aware and operates on disjoint sets of rows. These sets are generated on demand from the parent rowset created earlier by the parent task (covered in part 1). When either instance of the seek needs a new subrange of rows, it synchronizes with the other worker threads, so that only one is allocating a new subrange at the same time. The synchronization object used was also created earlier by the parent task. When a task waits for exclusive access to the parent rowset to acquire a new subrange, it waits on CXROWSET_SYNC.

Branch C tasks open

The sequence of Open calls for each task in Branch C is:

CQScanXProducerNew::Open. Notice there is no preceding profiler on the producer side of an exchange. This is unfortunate for query tuners.
CXTransLocal::Open
CXPort::Register
CXTransLocal::ActivateWorkers
CQScanProfileNew::Open. The profiler above node 6.
CQScanStreamAggregateNew::Open (node 6)
CQScanProfileNew::Open. The profiler above node 7.
CQScanSortNew::Open (node 7)

The sort is a fully blocking operator. It consumes its entire input during its Open call. There are a great number of interesting internal details to explore here, but space is short, so I will only cover the highlights:

The sort builds its sort table by opening its subtree and consuming all the rows its children can provide. Once sorting is complete, the sort is ready to transition to output mode, and it returns control to its parent. The sort will later respond to GetRow() calls, returning the next sorted row each time. An illustrative call stack during sort input is:

Execution continues until each sort has consumed all the (disjoint ranges of) rows available from its child index seek. The sorts then call Close on the index seeks, and return control to their parent stream aggregate. The stream aggregates initialize their counters and return control to the producer side of the repartition exchange at node 5. The sequence of Open calls is now complete in this branch.

The profiling DMV at this point shows updated timing numbers, and close times for the parallel index seeks:

More exchange synchronization

Recall the parent task is waiting on the consumer side of node 5 for all producers to open. A similar synchronization process now happens among the parallel tasks on the producer side of the same exchange:

Each producer task synchronizes with the others via CXTransLocal::Synchronize. The producers call CXPort::Open, then wait on CXPACKET for all consumer-side parallel tasks to open. When the first Branch C parallel task arrives back at the producer side of the exchange and waits, the waiting tasks DMV looks like this:

We still have the parent task’s consumer-side waits. The new CXPACKET highlighted is our first producer-side parallel task waiting for all consumer-side parallel tasks to open the exchange port.

The consumer-side parallel tasks (in Branch B) do not even exist yet, so the producer task displays NULL for the execution context it is blocked by. The task currently waiting on the consumer side of the repartition streams exchange is the parent task (not a parallel task!) running EarlyPhases code, so it doesn’t count.

Parent task CXPACKET wait ends

When the second parallel task in Branch C arrives back at the producer side of the exchange from its Open calls, all producers have opened the exchange port, so the parent task on the consumer side of the exchange is released from its CXPACKET wait.

The workers on the producer side continue to wait for the consumer side parallel tasks to be created and open the exchange port:

Checkpoint

At this point in time:

There are a total of three tasks: Two in Branch C, plus the parent task.
Both producers at the node 5 exchange have opened, and are waiting on CXPACKET for the consumer side parallel tasks to open. Much of the exchange machinery (including row buffers) is created by the consumer side, so there is nowhere for the producers to put rows yet.
The sorts in Branch C have consumed all their input, and are ready to provide sorted output.
The index seeks in Branch C have completed their work and closed down.
The parent task has just been released from waiting on CXPACKET at the consumer side of the node 5 repartition streams exchange. It is still executing nested EarlyPhases calls.

Branch D Parallel Tasks Start

This is the third step in the execution sequence:

Branch A (parent task).
Branch C (additional parallel tasks).
Branch D (additional parallel tasks).
Branch B (additional parallel tasks).

Released from its CXPACKET wait at the consumer side of the repartition streams exchange at node 5, the parent task ascends the Branch B query scan tree. It returns from nested EarlyPhases calls to the various iterators and profilers on the outer (upper) input of the merge join.

As mentioned, ascending the tree updates the elapsed and CPU times recorded by the invisible profiling iterators. We are executing code using the parent task, so those numbers are recorded against execution context zero. This is the ultimate source of the “thread 0” timing numbers referred to in my previous article, Understanding Execution Plan Operator Timings.

Once back at the merge join, the parent task calls EarlyPhases for the iterators and profilers on the inner (lower) input to the merge join. These are nodes 10 to 15 (excluding 14, which is deferred):

Once the parent task’s early phases calls reach the index seek at node 15, it begins to ascend the tree again (setting profiling times) until it reaches the repartition streams exchange at node 11.

Then, just as it did on the outer (upper) input to the merge join, it starts the producer side of the exchange at node 11, creating two new parallel tasks.
This sets Branch D in motion (shown below). Branch D executes exactly as already described in detail for Branch C.

Immediately after starting tasks for Branch D, the parent task waits on CXPACKET at node 11 for the new producers to open the exchange port:

The new CXPACKET waits are highlighted. Notice the reported node id might be a bit misleading. The parent task really is waiting at the consumer side of node 11 (repartition streams), not node 2 (gather streams). This is a quirk of early phase processing.

Meanwhile, the producer threads in Branch C continue to wait on CXPACKET for the consumer side of the node 5 repartition streams exchange to open.

Branch D opening

Just after the parent task starts the producers for Branch D, the query profile DMV shows the new execution contexts (3 and 4):

The two new parallel tasks in Branch D proceed exactly as those in Branch C did. The sorts consume all their input, and the Branch D tasks return to the exchange. This releases the parent task from its CXPACKET wait. The Branch D workers then wait on CXPACKET at the producer side of node 11 for the consumer side parallel tasks to open. Those parallel workers (in Branch B) do not exist yet.

Checkpoint

The waiting tasks at this point are shown below:

Both sets of parallel tasks in Branches C and D are waiting on CXPACKET for their parallel task consumers to open, at repartition streams exchange nodes 5 and 11 respectively. The only runnable task in the whole query right now is the parent task.

The query profiler DMV at this point is shown below, with operators in Branches C and D highlighted:

The only parallel tasks we haven’t started yet are in Branch B. All the work in Branch B so far has been early phases calls performed by the parent task.

End of Part 4

In the final part of this series, I will describe how the remainder of this particular parallel execution plan starts up, and briefly cover how the plan returns results. I will conclude with a more general description that applies to parallel plans of arbitrary complexity.

The post How Parallel Plans Start Up – Part 4 appeared first on SQLPerformance.com.

How Parallel Plans Start Up – Part 3

Paul White — Tue, 20 Apr 2021 09:00:57 +0000

This is the third in a five-part series taking a deep dive into the way SQL Server row mode parallel plans start executing. Part 1 initialized execution context zero for the parent task, and part 2 created the query scan tree. We are now ready to start the query scan, perform some early phase processing, and start the first additional parallel tasks.

Query Scan Start Up

Recall that only the parent task exists right now, and the exchanges (parallelism operators) have only a consumer side. Still, this is enough for query execution to begin, on the parent task’s worker thread. The query processor begins execution by starting the query scan process via a call to CQueryScan::StartupQuery. A reminder of the plan (click to enlarge):

This is the first point in the process so far that an in-flight execution plan is available (SQL Server 2016 SP1 onward) in sys.dm_exec_query_statistics_xml. There is nothing particularly interesting to see in such a plan at this point, because all the transient counters are zero, but the plan is at least available. There is no hint that parallel tasks have not been created yet, or that the exchanges lack a producer side. The plan looks ‘normal’ in all respects.

Parallel Plan Branches

Since this is a parallel plan, it will be useful to show it broken up into branches. These are shaded below, and labelled as branches A to D:

Branch A is associated with the parent task, running on the worker thread provided by the session. Additional parallel workers will be started to run the additional parallel tasks contained in branches B, C, and D. Those branches are parallel, so there will be DOP additional tasks and workers in each one.

Our example query is running at DOP 2, so branch B will get two additional tasks. The same goes for branch C and branch D, giving a total of six additional tasks. Each task will run on its own worker thread in its own execution context.

Two schedulers (S₁ and S₂) are assigned to this query to run additional parallel workers. Each additional worker will run on one of those two schedulers. The parent worker may run on a different scheduler, so our DOP 2 query may use a maximum of three processor cores at any one moment in time.

To summarize, our plan will eventually have:

Branch A (parent)
- Parent task.
- Parent worker thread.
- Execution context zero.
- Any single scheduler available to the query.
Branch B (additional)
- Two additional tasks.
- An additional worker thread bound to each new task.
- Two new execution contexts, one for each new task.
- One worker thread runs on scheduler S₁. The other runs on scheduler S₂.
Branch C (additional)
- Two additional tasks.
- An additional worker thread bound to each new task.
- Two new execution contexts, one for each new task.
- One worker thread runs on scheduler S₁. The other runs on scheduler S₂.
Branch D (additional)
- Two additional tasks.
- An additional worker thread bound to each new task.
- Two new execution contexts, one for each new task.
- One worker thread runs on scheduler S₁. The other runs on scheduler S₂.

The question is how all these extra tasks, workers, and execution contexts are created, and when they start running.

Starting Sequence

The sequence in which additional tasks start executing for this particular plan is:

Branch A (parent task).
Branch C (additional parallel tasks).
Branch D (additional parallel tasks).
Branch B (additional parallel tasks).

That may not be the start-up order you were expecting.

There may be a significant delay between each of these steps, for reasons we will explore shortly. The key point at this stage is that the additional tasks, workers, and execution contexts are not all created at once, and they do not all start executing at the same time.

SQL Server could have been designed to start all the extra parallel bits all at once. That might be easy to comprehend, but it wouldn’t be very efficient in general. It would maximize the number of additional threads and other resources used by the query, and result in a great deal of unnecessary parallel waits.

With the design employed by SQL Server, parallel plans will often use fewer total worker threads than (DOP multiplied by the total number of branches). This is achieved by recognizing that some branches can run to completion before some other branch needs to start. This can allow reuse of threads within the same query, and generally reduces resource consumption overall.

Let’s now turn to the details of how our parallel plan starts up.

Opening Branch A

The query scan starts executing with the parent task calling Open() on the iterator at the root of the tree. This is the start of the execution sequence:

Branch A (parent task).
Branch C (additional parallel tasks).
Branch D (additional parallel tasks).
Branch B (additional parallel tasks).

We are executing this query with an ‘actual’ plan requested, so the root iterator is not the sequence project operator at node 0. Rather, it is the invisible profiling iterator that records runtime metrics in row mode plans.

The illustration below shows the query scan iterators in Branch A of the plan, with the position of invisible profiling iterators represented by the ‘spectacles’ icons.

Execution starts with a call to open the first profiler, CQScanProfileNew::Open. This sets the open time for the child sequence project operator via the operating system’s Query Performance Counter API.

We can see this number in sys.dm_exec_query_profiles:

The entries there may have the operator names listed, but the data comes from the profiler above the operator, not the operator itself.

As it happens, a sequence project (CQScanSeqProjectNew) does not need to do any work when opened, so it does not actually have an Open() method. The profiler above the sequence project is called, so an open time for the sequence project is recorded in the DMV.

The profiler’s Open method does not call Open on the sequence project (since it doesn’t have one). Instead it calls Open on the profiler for the next iterator in sequence. This is the segment iterator at node 1. That sets the open time for the segment, just as the prior profiler did for the sequence project:

A segment iterator does have things to do when opened, so the next call is to CQScanSegmentNew::Open. Once the segment has done what it needs to, it calls the profiler for the next iterator in sequence — the consumer side of the gather streams exchange at node 2:

The next call down the query scan tree in the opening process is CQScanExchangeNew::Open, which is where things start to get more interesting.

Opening the gather streams exchange

Asking the consumer side of the exchange to open:

Opens a local (parallel nested) transaction (CXTransLocal::Open). Every process needs a containing transaction, and additional parallel tasks are no exception. They can’t share the parent (base) transaction directly, so nested transactions are used. When a parallel task needs to access the base transaction, it synchronizes on a latch, and may encounter NESTING_TRANSACTION_READONLY or NESTING_TRANSACTION_FULL waits.
Registers the current worker thread with the exchange port (CXPort::Register).
Synchronizes with other threads on the consumer side of the exchange (sqlmin!CXTransLocal::Synchronize). There are no other threads on the consumer side of a gather streams, so this is essentially a no-op on this occasion.

“Early Phases” Processing

The parent task has now reached the edge of Branch A. The next step is particular to row mode parallel plans: The parent task continues execution by calling CQScanExchangeNew::EarlyPhases on the gather streams exchange iterator at node 2. This is an additional iterator method beyond the usual Open, GetRow, and Close methods that many of you will be familiar with. EarlyPhases is only called in row mode parallel plans.

I want to be clear about something at this point: The producer side of the gather streams exchange at node 2 has not been created yet, and no additional parallel tasks have been created. We are still executing code for the parent task, using the only thread running right now.

Not all iterators implement EarlyPhases, because not all of them have anything special to do at this point in row mode parallel plans. This is analogous to the sequence project not implementing the Open method because it has nothing to do at that time. The main iterators with EarlyPhases methods are:

CQScanConcatNew (concatenation).
CQScanMergeJoinNew (merge join).
CQScanSwitchNew (switch).
CQScanExchangeNew (parallelism).
CQScanNew (rowset access e.g. scans and seeks).
CQScanProfileNew (invisible profilers).
CQScanLightProfileNew (invisible lightweight profilers).

Branch B early phases

The parent task continues by calling EarlyPhases on child operators beyond the gather streams exchange at node 2. A task moving over a branch boundary might seem unusual, but remember execution context zero contains the whole serial plan, with exchanges included. Early phase processing is about initializing parallelism, so it doesn’t count as execution per se.

To help you keep track, the picture below shows the iterators in the Branch B of the plan:

Remember, we are still in execution context zero, so I am only referring to this as Branch B for convenience. We have not started any parallel execution yet.

The sequence of early phase code invocations in Branch B is:

CQScanProfileNew::EarlyPhases for the profiler above node 3.
CQScanMergeJoinNew::EarlyPhases at the node 3 merge join.
CQScanProfileNew::EarlyPhases for the profiler above node 4. The node 4 stream aggregate itself does not have an early phases method.
CQScanProfileNew::EarlyPhases on the profiler above node 5.
CQScanExchangeNew::EarlyPhases for the repartition streams exchange at node 5.

Notice we are only processing the outer (upper) input to the merge join at this stage. This is just the normal row mode execution iterative sequence. It is not particular to parallel plans.

Branch C early phases

Early phase processing continues with the iterators in Branch C:

The sequence of calls here is:

CQScanProfileNew::EarlyPhases for the profiler above node 6.
CQScanProfileNew::EarlyPhases for the profiler above node 7.
CQScanProfileNew::EarlyPhases on the profiler above node 9.
CQScanNew::EarlyPhases for the index seek at node 9.

There is no EarlyPhases method on the stream aggregate or sort. The work performed by compute scalar at node 8 is deferred (to the sort), so it does not appear in the query scan tree, and does not have an associated profiler.

About profiler timings

Parent task early phase processing began at the gather streams exchange at node 2. It descended the query scan tree, following the outer (upper) input to the merge join, all the way down to the index seek at node 9. Along the way, the parent task has called the EarlyPhases method on every iterator that supports it.

None of the early phases activity has so far updated any times in the profiling DMV. Specifically, none of the iterators touched by early phases processing have had their ‘open time’ set. This makes sense, because early phase processing is just setting up parallel execution — these operators will be opened for execution later on.

The index seek at node 9 is a leaf node – it has no children. The parent task now begins returning from the nested EarlyPhases calls, ascending the query scan tree back towards the gather streams exchange.

Each of the profilers calls the Query Performance Counter API on entry to their EarlyPhases method, and they call it again on the way out. The difference between the two numbers represents elapsed time for the iterator and all its children (since the method calls are nested).

After the profiler for the index seek returns, the profiler DMV shows elapsed and CPU time for the index seek only, as well as an updated last active time. Notice also that this information is recorded against the parent task (the only option right now):

None of the prior iterators touched by the early phases calls have elapsed times, or updated last active times. These numbers are only updated when we ascend the tree.

After the next profiler early phases call return, the sort times are updated:

The next return takes us up past the profiler for the stream aggregate at node 6:

Returning from this profiler takes us back to the EarlyPhases call at the repartition streams exchange at node 5. Remember that this is not where the sequence of early phases calls started — that was the gather streams exchange at node 2.

Branch C Parallel Tasks Enqueued

Aside from updating profiling data, the prior early phases calls didn’t seem to do very much. That all changes with the repartition streams exchange at node 5.

I am going to describe Branch C in a fair amount of detail to introduce a number of important concepts, which will apply to the other parallel branches as well. Covering this ground once now means later branch discussion can be more succinct.

Having completed nested early phase processing for its subtree (down to the index seek at node 9), the exchange can begin its own early phase work. This starts out the same as opening the gather streams exchange at node 2:

CXTransLocal::Open (opening the local parallel sub-transaction).
CXPort::Register (registering with the exchange port).

The next steps are different because branch C contains a fully blocking iterator (the sort at node 7). The early phase processing at the node 5 repartition streams does the following:

Calls CQScanExchangeNew::StartAllProducers. This is the first time we have encountered anything referencing the producer side of the exchange. Node 5 is the first exchange in this plan to create its producer side.
Acquires a mutex so no other thread can be enqueueing tasks at the same time.
Starts parallel nested transactions for the producer tasks (CXPort::StartNestedTransactions and ReadOnlyXactImp::BeginParallelNestedXact).
Registers the sub-transactions with the parent query scan object (CQueryScan::AddSubXact).
Creates producer descriptors (CQScanExchangeNew::PxproddescCreate).
Creates new producer execution contexts (CExecContext) derived from execution context zero.
Updates the linked map of plan iterators.
Sets DOP for the new context (CQueryExecContext::SetDop) so all tasks know what the overall DOP setting is.
Initializes the parameter cache (CQueryExecContext::InitParamCache).
Links the parallel nested transactions to the base transaction (CExecContext::SetBaseXact).
Queues the new sub-processes for execution (SubprocessMgr::EnqueueMultipleSubprocesses).
Creates new parallel tasks tasks via sqldk!SOS_Node::EnqueueMultipleTasksDirect.

The parent task’s call stack (for those of you that enjoy these things) around this point in time is:

End of Part Three

We have now created the producer side of the repartition streams exchange at node 5, created additional parallel tasks to run Branch C, and linked everything back to parent structures as required. Branch C is the first branch to start any parallel tasks. The final part of this series will look at branch C opening in detail, and start the remaining parallel tasks.

The post How Parallel Plans Start Up – Part 3 appeared first on SQLPerformance.com.