SQL Optimizer Archives - SQLPerformance.com
https://sqlperformance.com/category/sql-optimizer
SQL Server performance articles curated by SentryOne

Simple Parameterization and Trivial Plans — Part 6
https://sqlperformance.com/2022/08/sql-optimizer/simple-parameterization-and-trivial-plans-part-6
Published Thu, 04 Aug 2022
Paul White concludes his series with trivial plans and when simple parameterization is considered safe or unsafe.

The post Simple Parameterization and Trivial Plans — Part 6 appeared first on SQLPerformance.com.

[ This series: Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

In part 5, I described how the early phases of query compilation affect simple parameterization. To recap:

  • Normalization and decoding promote cached plan reuse
  • The decoding step has fixed and limited capabilities
  • The Failed Auto-Params/sec counter is incremented when decoding fails and the statement is not parameterized
  • Shell plans optimize for very frequent execution of simple statements, bypassing the parsing, parameter replacement, normalization, and decoding stages
  • The first application of constant folding occurs after decoding, resulting in a separate parameter for each value in a foldable expression

Let’s now continue following the compilation process to see how SQL Server decides if simple parameterization is safe or unsafe.

Code examples use the Stack Overflow 2010 database on SQL Server 2019 CU 16 with the following additional nonclustered index:

CREATE INDEX [IX dbo.Users Reputation (DisplayName)] 
ON dbo.Users (Reputation) 
INCLUDE (DisplayName);

Simplification

The simplification compilation stage aims to remove logical redundancy and reduce expression complexity.

Some opportunities for logical tree simplification arise from prior compilation activity, others when the original statement contains unnecessary elements. The latter is common for SQL generated by applications and frameworks but also often occurs in dynamic queries. Opportunities may also arise as database CHECK constraints are incorporated in the logical tree during simplification.

Timing is a key point here. Since decoding succeeded earlier, the query processor is now dealing with a parameterized (prepared) statement. Many simplifications can't be applied to parameters, only constant values.

Let’s look at an example:

SELECT 
    U.DisplayName
FROM dbo.Users AS U 
WHERE 
    U.Reputation BETWEEN 1 AND 999
    AND U.Reputation BETWEEN 999 AND 1234;

This statement qualifies for simple parameterization. Four constants are identified, with inferred integer data types shrunk to smallint or tinyint due to the comparison operator parsing context (described in part 4):

Plan with four parameters

This plan will continue to return correct results when reused with different parameter values. Let’s run the query again with an extra element to prevent simple parameterization:

SELECT 
    U.DisplayName
FROM dbo.Users AS U 
WHERE 
    U.Reputation BETWEEN 1 AND 999
    AND U.Reputation BETWEEN 999 AND 1234
OPTION (KEEP PLAN); -- New

The query hint has no effect on the statement besides disabling simple parameterization at the parsing stage. Without parameters, simplification removes logical redundancy from the predicates to leave a single equality test:

BETWEEN predicates replaced with an equality test

None of this is particularly surprising. The outcome is the same as when we choose to parameterize client-side. Reusable plans often permit fewer optimizations than ones built for specific values.

Nevertheless, there’s an important observation to make here. The query processor is not yet fully committed to producing a prepared plan. The simplification stage will not be repeated using constant values if the parameterization attempt fails later.

There are several compilation stages after simplification but they don’t interact much with simple parameterization. I’m going to skip those and move on to where the final decision to apply simple parameterization is made.

Trivial Plans

Like simple parameterization, the trivial plan compilation stage aims to reduce the cost of creating execution plans for simple statements. It’s the final step before possibly invoking the full cost-based optimizer (CBO) where alternatives are subject to detailed costing analysis.

The trivial plan stage provides a fast path through compilation for simple statements, whether parameterized or not. The CBO has significant start-up and runtime costs and may consume significant server resources. These costs are unlikely to be recovered through finding an incrementally better plan in simple cases.

Avoiding CBO can confer significant performance and resource usage advantages for workloads frequently executing simple queries.

Generating a Trivial Plan

Trivial plans are often referred to as having no cost-based choices. This is perhaps a useful shorthand, but it isn’t the whole story.

SQL Server generates trivial plans for ‘common’ statement types having an ‘obvious’ implementation and a ‘short’ expected run time. Whether a statement is ‘common’ and ‘obvious’ is largely determined by heuristics and implementation complexity. The ‘expected run time’ is approximated by estimated plan cost.

The trivial plan stage has entry conditions like CBO stages do. If the logical tree has features that can't be implemented in a trivial plan, the stage is skipped. Otherwise, a limited set of substitution and implementation rules are evaluated where they apply to the current tree.

Entering the trivial plan stage is not a guarantee of success. The limited rules available still might fail to generate a complete execution plan. In this case, the trivial plan stage fails and compilation moves on to the CBO.

Qualification

Microsoft deliberately don’t document the precise criteria used to decide if a statement qualifies for a trivial plan. The rules can and do change from time to time.

Even so, it’s useful to say a simple statement with an obvious ‘best’ index is most likely to get a trivial plan. The index need not be covering if it's unique and an equality index seek is possible. Trivial plans are possible for index scans as well as seeks. A table scan (heap or clustered) can also feature in a trivial plan.

In principle, pretty much any relatively simple statement pattern could be implemented in SQL Server at the trivial plan stage, assuming a known good plan shape exists. In practice, SQL Server is quite conservative.

This doesn’t mean SQL Server only generates a trivial plan when no cost-based choices exist. One could argue every decision is cost-based to some extent, including which index to use at the trivial plan stage.

Index analysis is a big part of the trivial plan decision. SQL Server uses heuristics to decide whether a particular index is good enough that full cost-based optimization is unlikely to be worth pursuing.

For example, a covering nonclustered index will usually be selected for a trivial plan if it involves reading fewer pages than any other index (including the heap or clustered index). This determination takes into account the selectivity of any predicates and index fill factors.

The trivial plan selection heuristics also don’t account for facilities only available in CBO, like index intersection plans. SQL Server might therefore select a single-index trivial plan when full CBO analysis would’ve found a lower-cost index intersection plan.
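As a hedged illustration of these heuristics (assuming the Stack Overflow 2010 schema and the nonclustered index created earlier), contrast a statement with one obvious best index against one whose cheapest plan might combine two indexes:

```sql
-- Likely a trivial plan: the nonclustered index on Reputation
-- (with DisplayName included) is the one obvious access path.
SELECT U.DisplayName
FROM dbo.Users AS U
WHERE U.Reputation = 5;

-- Likely full optimization: an OR across two separately-indexed
-- columns may be best served by an index-union style plan, which
-- only the cost-based optimizer can construct.
SELECT U.DisplayName
FROM dbo.Users AS U
WHERE U.Reputation = 5 OR U.Id = 42;
```

The exact outcome depends on the undocumented qualification rules, so treat these as expectations to verify rather than guarantees.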

Parameterization Safety

When a statement passes the earlier parser and decoder checks, it arrives at the trivial plan stage as a prepared (parameterized) statement. The query processor now needs to decide if the parameterization attempt is safe.

Parameterization is considered safe if the query processor would generate the same plan for all possible future parameter values. This might seem like a complex determination to make, but SQL Server takes a practical approach.

Safe Parameterization

Simply stated, a simple parameterization attempt is considered safe if the trivial plan stage produces a plan for the prepared statement.

When this occurs, the Safe Auto-Params/sec counter of the SQL Statistics object is incremented. Refer back to part 3 for a script to reliably detect simple parameterization and trivial plan outcomes.

If the trivial plan stage is disabled with trace flag 8757, simple parameterization is never applied because a trivial plan can’t be generated.

Unsafe Parameterization

If the trivial plan stage doesn’t produce a plan for the current prepared statement, the simple parameterization attempt is considered unsafe.

When this occurs, the Unsafe Auto-Params/sec counter is incremented.
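The four parameterization counters can be read directly from the SQL Statistics object using the documented sys.dm_os_performance_counters DMV. A minimal sketch — sample before and after a test statement and compare the deltas, since these are cumulative raw values despite the "/sec" naming:

```sql
SELECT
    RTRIM(PC.counter_name) AS counter_name,
    PC.cntr_value
FROM sys.dm_os_performance_counters AS PC
WHERE
    PC.[object_name] LIKE N'%SQL Statistics%'
    AND PC.counter_name IN
    (
        N'Auto-Param Attempts/sec',
        N'Safe Auto-Params/sec',
        N'Unsafe Auto-Params/sec',
        N'Failed Auto-Params/sec'
    );
```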

After an unsafe parameterization attempt, SQL Server replaces parameters in the prepared plan with the original constant values before invoking the CBO. The statement remains technically prepared but contains no parameters.

Parameters are only replaced by constants inside query plan operators. Other elements of the prepared statement aren't cleaned up. This explains why an actual execution plan produced after an unsafe attempt at simple parameterization retains the parameterized text and a parameter list (see part 3).

Parallelism

While a simple parameterization attempt is considered safe if a trivial plan is found, it doesn’t mean the final execution plan will be TRIVIAL.

If the estimated cost of the trivial plan found exceeds the cost threshold for parallelism, SQL Server will invoke the CBO to consider more alternatives, possibly including parallel execution plans. The plan will be marked as having been through FULL optimization.

The outcome of CBO need not be a parallel plan, but it will remain parameterized in any case. I showed an example of simple parameterization producing a parallel query plan in part 3.
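One way to confirm the final optimization level of a cached prepared plan is to read the StatementOptmLevel attribute (TRIVIAL or FULL) from the showplan XML. A sketch using the documented showplan namespace:

```sql
WITH XMLNAMESPACES
    (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    CP.usecounts,
    CP.objtype,
    QP.query_plan.value
        ('(//StmtSimple/@StatementOptmLevel)[1]', 'varchar(10)')
        AS optimization_level,
    ST.[text]
FROM sys.dm_exec_cached_plans AS CP
CROSS APPLY sys.dm_exec_query_plan (CP.plan_handle) AS QP
CROSS APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
WHERE
    CP.objtype = N'Prepared';
```

A safe parameterization attempt that exceeded the cost threshold for parallelism should show FULL here while remaining a prepared plan.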

Feature Interaction

Simple parameterization and the trivial plan stage are closely related but remain independent activities.

A statement can qualify for simple parameterization without producing a TRIVIAL execution plan.

This happens when a trivial plan is found with a cost exceeding the parallelism threshold. The final plan will remain parameterized after CBO but might be serial or parallel.

A statement containing constants can produce a TRIVIAL plan without also qualifying for simple parameterization.

This happens when the statement contains syntax elements considered unsuitable for simple parameterization by parsing or decoding (covered in parts four and five).

Process Summary

I’ve covered a lot of ground in this series, so I thought it would be useful to finish up with a flowchart showing the main decision points during consideration of simple parameterization and trivial plans:

Simple Parameterization and Trivial Plans Flowchart

[ This series: Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

Simple Parameterization and Trivial Plans — Part 5
https://sqlperformance.com/2022/06/sql-optimizer/simple-parameterization-and-trivial-plans-part-5
Published Wed, 08 Jun 2022
Paul White continues his series on simple parameterization and trivial plans, explaining how maximum plan reuse is achieved.

[ This series:  Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

In part 4, I described the role the parser plays turning a statement with constant values into a parameterized prepared statement. To recap briefly, the parser:

  • Identifies potential parameters
  • Assigns an initial data type
  • Accounts for both simple and forced parameterization rules
  • Sets a flag if it determines simple parameterization is impossible

When the parser allows simple parameterization to continue, SQL Server increments the Auto-Param Attempts/sec counter of the SQL Statistics object.

Let’s now continue following the compilation process noting the effects on simple parameterization as we go.

As in previous parts, code examples use the Stack Overflow 2010 database on SQL Server 2019 CU 16 with the following additional nonclustered index:

CREATE INDEX [IX dbo.Users Reputation (DisplayName)] 
ON dbo.Users (Reputation) 
INCLUDE (DisplayName);

Normalization

The output of parsing is a logical representation of the statement called a parse tree. This tree does not contain SQL language elements. It's an abstraction of the logical elements of the original query specification.

It's frequently possible to write the same logical requirement in different ways. Using a SQL analogy, x IN (4, 8) is logically the same as writing x = 4 OR x = 8. This flexibility can be useful when writing queries, but it makes implementing an optimizer more difficult.

In general terms, normalization is an attempt to standardize. It recognises common variations expressing the same logic, and rewrites them in a standard way. For example, normalization is responsible for turning x BETWEEN y AND z into x >= y AND x <= z. The normalized, or standardized, form chosen is one the query processor finds convenient to work with internally.

I’m simplifying a bit here. Like parameterization, normalization happens in stages at different points during statement compilation and optimization. Logical operator trees and expressions are normalized at different times and in different ways. It's only important to understand the broad concept for what follows.
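If the BETWEEN rewrite applies before the parameterized text is produced, two statements written with different but logically equivalent range syntax should share a single prepared plan. A sketch to try against the Stack Overflow 2010 database — the shared-plan outcome is my expectation from the rewrite rule above, so verify it in the plan cache:

```sql
SELECT U.DisplayName
FROM dbo.Users AS U
WHERE U.Reputation BETWEEN 1 AND 10;

-- The same range written explicitly. If normalization standardizes
-- both statements to the same form, the plan cache should show a
-- single Prepared entry with usecounts = 2.
SELECT U.DisplayName
FROM dbo.Users AS U
WHERE U.Reputation >= 1 AND U.Reputation <= 10;
```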

Maximizing Parameterized Plan Reuse

SQL Server could use the parameter locations determined during parsing to directly replace constants with parameter markers. This would allow plan reuse, but it would also require future statements to be written in exactly the same way (constant values aside) to benefit.

This might be useful enough in itself, but we can do better by normalizing. Expressing the parameterized statement in a standardized way allows future statements to match if they normalize to the same form, even where the original text is quite different. I’ll show you some examples shortly.

Decoding

SQL Server implements this important normalization aspect by decoding the logical operator tree back into a SQL representation. Decoding uses slightly different rules for simple and forced parameterization but in either case the normalized SQL will usually be quite different from the original.

For simple parameterization, decoding emits keywords in upper case, data types and intrinsic functions in lower case, delimits object names with square brackets, and removes unnecessary spaces and comments.

Forced parameterization emits keywords in lower case, does not use bracket delimiters, and places a space either side of the ‘dot’ between schema and object names.

Decoding also standardizes elements like comparison operators, the AS keyword for aliases, and the presence of an ASC or DESC specification in an ORDER BY clause. For example, decoding always expresses ‘not equal’ using <> even if != was used in the original statement.

Neither normalization scheme changes the casing of non-keyword elements like object names, since changing identifier case could be unsafe in case-sensitive databases.

Whichever normalization scheme is used, the key point is AdHoc statements can be written using different keyword casing, spacing, comments, and delimiters but still benefit from plan reuse through server-side parameterization. The text will normalize to the same parameterized form, allowing a cached prepared plan to be matched and reused.

Let’s look at an example.

Example 1

The following statements generate the same normalized text under simple parameterization:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
SELECT U.DisplayName FROM dbo.Users AS U 
WHERE U.Reputation IN (2) ORDER BY U.Id;
GO
SELECT U.DisplayName 
FROM dbo.Users AS U 
WHERE U.Reputation =3
ORDER BY U.Id;
GO
select 
    U.DisplayName 
FROM [dbo].[Users] as U 
WhErE 
    U.Reputation = (4)
Order By
    U.[Id];
GO
SELECT
    U.DisplayName 
    -- Note: No AS used for the alias
from [dbo] . Users U 
where
    U.Reputation = 5
order by 
    U.Id asc;
GO

Let’s look at the plan cache:

SELECT
    CP.usecounts,
    CP.objtype,
    ST.[text]
FROM sys.dm_exec_cached_plans AS CP
CROSS APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
WHERE 
    ST.[text] NOT LIKE '%dm_exec_cached_plans%'
    AND ST.[text] LIKE '%DisplayName%Users%'
ORDER BY 
    CP.objtype ASC;

The output shows four AdHoc shell plans pointing to a single Prepared plan:

Normalized statement plan reuse

Each AdHoc statement is different, but the logical trees all decode to the same form (line breaks added for clarity):

(@1 tinyint)
SELECT [U].[DisplayName]
FROM [dbo].[Users] [U]
WHERE [U].[Reputation]=@1
ORDER BY [U].[Id] ASC

The normalized and parameterized form is used in cache lookups so the prepared plan is successfully found and reused.

This is a neat feature of simple parameterization but it's still best to use a consistent SQL style. Admittedly, this can be difficult to enforce in practical work environments over time.

Normalization isn’t perfect. You can probably find any number of ways to subtly change the statements above to produce different ‘normalized’ parameterized forms. Normalization promotes reuse of parameterized plans—it doesn’t guarantee it.

Failed Parameterization

The decoding step for simple parameterization has deliberately limited capabilities. It's only able to decode certain logical tree elements into SQL. These limitations determine which clauses and intrinsic functions (among other things) are compatible with simple parameterization.

If the decoding step successfully produces a complete SQL representation, the compilation process proceeds from this point as a prepared (parameterized) statement.

When decoding fails, the Failed Auto-Params/sec counter of the SQL Statistics object is incremented and compilation continues as an unparameterized (not prepared) statement.

Statements that fail at the parsing or decoding steps don't show any parameter details in estimated or actual plans because a prepared version of the statement is never created. Refer back to part three for execution plan details.

Decoder Limitations

For a parameterization attempt to fail here, it must have passed the generic tests performed by the parser, but fail at the decoding stage. The syntax elements, global variables, and intrinsic functions supported by the simple parameterization decoder aren't documented.

The decoder limitations explain why (and when) examples shown earlier in this series failed with LOWER and CEILING built-in functions, but succeeded with FLOOR and ABS.

Constant to constant comparison is also unsupported by the simple parameterization decoder, which explains why the well-known WHERE 1 = 1 trick prevents simple parameterization. As further examples, the decoder supports the @@SPID and @@TRANCOUNT global variables, but not @@ROWCOUNT or @@IDENTITY.

The performance counter test rig in part three can be used to explore which statement features fail at the decoding stage. Such statements will increment both Auto-Param Attempts/sec and Failed Auto-Params/sec.

Remember a statement disqualified from simple parameterization by the parser doesn’t increment Auto-Param Attempts/sec.
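The counter deltas distinguish the outcomes described above. A sketch of statements to feed the part-three test rig — the expected counter behavior in the comments restates the claims in this section rather than anything newly verified:

```sql
-- Disqualified by the parser (query hint): simple parameterization
-- never starts, so no auto-param counters should increment.
SELECT U.DisplayName
FROM dbo.Users AS U
WHERE U.Reputation = 5
OPTION (KEEP PLAN);

-- Fails at the decoder (constant-to-constant comparison): both
-- Auto-Param Attempts/sec and Failed Auto-Params/sec should increment.
SELECT U.DisplayName
FROM dbo.Users AS U
WHERE 1 = 1 AND U.Reputation = 5;

-- @@SPID is supported by the decoder, so parameterization of the
-- remaining constant should be able to continue.
SELECT U.DisplayName
FROM dbo.Users AS U
WHERE U.Reputation = 5 AND U.Id <> @@SPID;
```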

Example 2 – Plan Reuse

Try to predict how many plans of which type (ad-hoc or prepared) will be cached if we remove the GO batch separators from example 1:

-- Single batch
SELECT U.DisplayName FROM dbo.Users AS U 
WHERE U.Reputation IN (2) ORDER BY U.Id;

SELECT U.DisplayName 
FROM dbo.Users AS U 
WHERE U.Reputation =3
ORDER BY U.Id;

select 
    U.DisplayName 
FROM [dbo].[Users] as U 
WhErE 
    U.Reputation = (4)
Order By
    U.[Id];

SELECT
    U.DisplayName 
    -- Note: No AS used for the alias
from [dbo] . Users U 
where
    U.Reputation = 5
order by 
    U.Id asc;

You may be surprised to find only two plans are cached, one Adhoc and one Prepared.

The prepared plan is used four times:

Two plans cached

Explanation

Back in part one, I said it can be useful to think of a prepared statement as an (unnamed) stored procedure.

Imagine example 2 had contained four calls to a single stored procedure with different parameter values each time. That would cache the ad-hoc batch once and the stored procedure plan once. The single stored procedure plan would be executed four times with the different parameter values.

Something conceptually similar happened with simple parameterization as I’ll now explain.

SQL Server compiles a single plan for all statements in a batch, but it compiles each statement in the batch sequentially. In example 2, the first statement encountered qualifies for simple parameterization so a separate prepared statement is built and cached.

The second, third, and fourth statements in the batch also qualify for simple parameterization. After normalization and decoding they match the prepared plan cached by the first statement. The single prepared plan is therefore executed a total of four times.

Shell Plans Revisited

The Adhoc plan cached in example 2 contains four shell plans (covered in part 1). Each shell plan points to the same Prepared plan.

Caching the Adhoc shell ensures that future execution of exactly the same statement text (including constant values) will quickly find the parameterized plan.

Without the shells, the statement would need to be parsed, parameterized, normalized, and decoded back to SQL representation before it could be matched to the prepared plan.

This is less work than generating even a trivial plan, but it is still less efficient than using the shell to locate the prepared statement directly from the source text.

The goal of simple parameterization is to optimize performance for simple and frequently-executed ad-hoc SQL statements that parameterize to the same form. For an OLTP workload with a very rapid submission of such statements, every millisecond counts. Besides, all compilation activity consumes some server resources.

Only the Prepared plan is cached

Algebrization and Constant Folding

The next step of the compilation process is algebrization. This complex stage performs a number of tasks, including binding object names found in the parse tree to physical database entities, constant folding, matching aggregates to grouping elements, and deriving final data types from metadata using type conversion rules as necessary. The output from this activity is a bound tree of logical operators.

The most relevant part of algebrization to simple parameterization is the initial round of constant folding it performs.

As the linked documentation states, constant folding is the early evaluation of expressions to improve runtime performance. For example, the expression DATEFROMPARTS(2022, 07, 11) can be evaluated early to the date value 11 July 2022.

The algebrization stage is the first time constant folding is applied during compilation, so earlier stages will always see unfolded expressions.

Let’s look at this with an example.

Example 3

SELECT 
    U.DisplayName
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 900 + 90 + 9;

This statement qualifies for simple parameterization. Each of the constants is identified as a separate parameter by the parser, and initially typed as integer. No integer type shrinking to smallint or tinyint is performed because the immediate parsing context is an arithmetic operator as described in part four.

The post-execution (actual) plan confirms parameterization occurred before constant folding had a chance to evaluate 900 + 90 + 9:

Simple parameterization applied before constant folding

Notice the three integer parameters and the normalized representation of the parameterized statement text in the top bar.

This example is specific to simple parameterization. Forced parameterization doesn’t parameterize constant-foldable expressions that are arguments of the +, -, *, /, and % operators.

Constant folding may be repeated several times later in the compilation and optimization process as new folding opportunities arise. These later constant folding runs may apply to the whole tree, newly generated alternative subtrees, or an individual expression.

End of Part 5

In the final part of this series, I’ll continue analysis of the compilation process with the simplification and trivial plan stages, including how and when SQL Server decides if simple parameterization is safe or unsafe.

[ This series:  Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

Simple Parameterization and Trivial Plans — Part 4
https://sqlperformance.com/2022/05/sql-optimizer/simple-parameterization-and-trivial-plans-part-4
Published Fri, 20 May 2022
Paul White continues his series, explaining how the parser affects simple parameterization and trivial plans.

[ This series:  Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

The Compilation Process

The most important things to understand about server-side parameterization are that it doesn’t happen all at once, and that the final decision to parameterize isn’t made until the end of the process.

Multiple compilation components are involved in successfully turning a SQL statement containing constants into a parameterized prepared statement with typed parameters. Each component is active at a different time during compilation and performs its own parameterization-related tasks using the information available at that time.

The scope and timing of each component’s activity explain the curious things seen in the previous parts of this series. Even when the parameterization attempt isn’t successful, there can still be effects from preparations for parameterization visible in the final execution plan.

There are a number of details to explore to gain a complete understanding. Let’s follow the main compilation stages noting the effects on parameterization as we go.

As in previous parts, most code examples use the Stack Overflow 2010 database on SQL Server 2019 CU 16 with the following additional nonclustered index:

CREATE INDEX [IX dbo.Users Reputation (DisplayName)] 
ON dbo.Users (Reputation) 
INCLUDE (DisplayName);

Database compatibility is set to 150, and the cost threshold for parallelism is set to 50 to avoid parallelism for the time being:

ALTER DATABASE StackOverflow2010
SET COMPATIBILITY_LEVEL = 150;
GO
EXECUTE sys.sp_configure
    @configname = 'show advanced options',
    @configvalue = 1;
RECONFIGURE;
GO
EXECUTE sys.sp_configure
    @configname = 'cost threshold for parallelism',
    @configvalue = 50;
RECONFIGURE;

Statement Parsing

The parser identifies constants in the statement and marks them as potential parameters if they are in an allowable context. For example, the constants in TOP (50) and CONVERT(varchar(11), x) are not parameterizable, but the literal values in WHERE x = 123 and WHERE y IN (3, 8) are.

Decisions about parameterization at this early stage are generic and conservative because the only information available is the statement itself. Names of tables and columns haven’t been resolved to database objects yet, and no type information is available.

Imagine someone handing you a query written for an unknown SQL Server database and asking you to identify parameterizable constants. That’s roughly the task facing the parser.

Parsing marks constants as potential parameters unless the current clause context forbids it, accommodating both simple and forced parameterization rules.

For example, AND x = 100 + 50 is not acceptable to forced parameterization, but simple parameterization allows it, so the parser marks both constants as potential parameters.

In contrast, the earlier example WHERE y IN (3, 8) is acceptable to forced parameterization, but not simple. Again, the parser marks both constants as potential parameters just in case.

Initial Data Types

Under simple parameterization, constants are assigned an initial data type based on the textual representation (see part two for details). The data type may then be refined depending on the context.

For example, in WHERE x = 5 the constant is initially parsed as an integer because the textual form is not surrounded by quotation marks and has no decimal point. The context is a comparison operator (equals), so the data type is shrunk to a tinyint. This is the smallest integer type able to contain the value 5.

As a second example, in the expression 123 + 456 both constants are initially typed as integer based on the textual representation. Neither is shrunk to a smaller integer subtype because the context is an arithmetical operation, not a comparison. This explains why the constant 7 was typed as an integer rather than tinyint in the arithmetic operators section of part two.

These rules might seem odd or arbitrary but they were created for SQL Server 7.0 where simple parameterization was a new feature, then called “automatic parameters” or “auto-parameterization”. The engine’s ability to match indexes and reason through implicit conversions has improved markedly since then but the parsing rules remain the same for compatibility.
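A sketch illustrating both contexts — the normalized text shown in the comments is my expectation from the rules above (and the prepared forms shown later in the series), so check the plan cache to confirm:

```sql
-- Comparison context: 5 parses as integer, then shrinks to tinyint,
-- the smallest integer type able to hold the value. Expected
-- prepared text begins: (@1 tinyint)SELECT ... WHERE [U].[Reputation]=@1
SELECT U.DisplayName
FROM dbo.Users AS U
WHERE U.Reputation = 5;

-- Arithmetic context: both constants keep the integer type because
-- the immediate parsing context is an operator, not a comparison.
-- Expected parameter list: (@1 int, @2 int)
SELECT U.DisplayName
FROM dbo.Users AS U
WHERE U.Reputation = 4 + 1;
```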

CAST and CONVERT

In part two I described how these very specific inferred parameter data types could prevent plan reuse—a particular problem with numeric and decimal data types:

Separate prepared statements

An obvious solution would be to provide an explicit type for each constant, but T-SQL doesn’t provide a way to do this for all constant values. The best we can do sometimes is add CAST or CONVERT, but this does not work well with simple parameterization for reasons I will now set out.

In principle, the parser could incorporate the CAST or CONVERT in its parameter data typing decision, but this would require either a call into the expression services component or an early round of constant folding.

Neither of these facilities are available to the parser. It’s simply too early in the process. For example, constant folding expects an operator tree that doesn’t exist yet. All things are possible with enough engineering effort of course, but as things stand the parser cannot use a wrapping CAST or CONVERT to determine the data type of a constant literal.

The end result today is a parameter with the original parser-derived type surrounded by an explicit CAST or CONVERT. This doesn’t solve the problem of plan reuse at all.

Example 1

This is a slightly simplified version of the decimal example from part two. An explicit CONVERT matching the column data type has been added around each constant in an attempt to promote plan reuse:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
DROP TABLE IF EXISTS dbo.Test;
GO
CREATE TABLE dbo.Test
(
    SomeValue decimal(19,8) NOT NULL
);
GO
SELECT T.SomeValue 
FROM dbo.Test AS T 
WHERE T.SomeValue = CONVERT(decimal(19,8), 1.23);
GO
SELECT T.SomeValue 
FROM dbo.Test AS T 
WHERE T.SomeValue = CONVERT(decimal(19,8), 12.345);
GO
SELECT T.SomeValue 
FROM dbo.Test AS T 
WHERE T.SomeValue = CONVERT(decimal(19,8), 123.4567);
GO

Let’s look at the plan cache:

SELECT
    CP.usecounts,
    CP.objtype,
    ST.[text]
FROM sys.dm_exec_cached_plans AS CP
CROSS APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
WHERE 
    ST.[text] NOT LIKE '%dm_exec_cached_plans%'
    AND ST.[text] LIKE '%SomeValue%Test%'
ORDER BY 
    CP.objtype ASC;

It shows a prepared statement for each query:

Separate prepared statements again

This is the same outcome as before we added the CONVERT. The parameter data types are still different, so separate plans are cached and no plan reuse occurs.

Example 2

This is the other example from part two with a CONVERT added to match the integer type of the Reputation column:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = CONVERT(integer, 252);
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = CONVERT(integer, 25221);
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = CONVERT(integer, 252552);
GO

As a reminder, without the CONVERT these statements resulted in three separate prepared cached plans due to the parser’s typing rules:

Different guessed types

Let’s look at the plan cache after running the statements with the CONVERT:

SELECT
    CP.usecounts,
    CP.objtype,
    ST.[text]
FROM sys.dm_exec_cached_plans AS CP
CROSS APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
WHERE 
    ST.[text] NOT LIKE '%dm_exec_cached_plans%'
    AND ST.[text] LIKE '%DisplayName%Users%'
ORDER BY 
    CP.objtype ASC;

We see one prepared statement used three times:

Single prepared statement used three times

This result should surprise you. Didn't we just establish that adding a CAST or CONVERT doesn't help plan reuse?

This case is slightly different though. The constants in these statements were initially typed by the parser as integer then shrunk to the smallest possible integer subtype (smallint or tinyint) capable of holding the specific value. The shrinking caused different prepared statements without the CONVERT.

Remember from earlier that this shrinking happens under simple parameterization only when the parse context is a comparison operator. Without the CONVERT, the immediate context for the constant is the equality comparison operator, so shrinking is applied.

With the CONVERT the immediate context is the conversion, which is not a comparison operator so no shrinking occurs. All three constants remain typed as integer resulting in a single prepared statement used three times.

As an aside, notice the explicit CONVERT to integer remains in the prepared statement text even though the parameter is typed as integer.

A CAST or CONVERT isn’t the only operator capable of preventing integer type shrinking by the parser. Anything that gets between the constant and a comparison operator will do the job, as long as the extra item is acceptable for simple parameterization.

For example, we could use FLOOR or ABS around the constant value—but not CEILING. The list of intrinsic functions compatible with simple parameterization is quite limited and undocumented.
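The context-dependent shrinking can be modelled as a thought experiment. The function below is purely illustrative (its name and the `in_comparison_context` flag are my inventions, not anything the parser exposes):

```python
def parser_integer_type(literal: str, in_comparison_context: bool) -> str:
    """Illustrative model: integer literals start typed as integer and
    shrink to the smallest subtype only when the immediate parse context
    is a comparison operator. A wrapping CONVERT, FLOOR, or ABS changes
    the immediate context, so no shrinking occurs there."""
    if not in_comparison_context:
        return "integer"
    value = int(literal)
    if 0 <= value <= 255:
        return "tinyint"
    if -32768 <= value <= 32767:
        return "smallint"
    return "integer"
```

Under this model, 252, 25221, and 252552 shrink to three different subtypes when compared directly, but all remain integer when wrapped in a CONVERT, matching the single reused prepared statement seen above.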

Parameterization Attempts

If the parser encounters a syntax element that always prevents simple parameterization it sets a flag so components involved in later stages can avoid wasted effort.

These syntax checks are not exhaustive. For example, the presence of a subquery, TOP clause, or query hint is sufficient to set the flag, but an IN clause, constant-to-constant comparison, or disallowed intrinsic function like LOWER or CEILING is not.

There is a partial list of the syntax elements that prevent simple parameterization in Appendix A of the Microsoft Technical Paper Plan Caching and Recompilation in SQL Server 2012. The list is not complete or maintained, and doesn’t detail why each item excludes parameterization, or at what stage of the compilation process the test is applied.

When the parser decides simple parameterization is impossible, none of the auto-parameterization performance counters mentioned in part 3 are incremented.

In particular, the Auto-Param Attempts/sec counter of the SQL Statistics object is not incremented. This is the primary way to detect a statement with constants was determined unsuitable for simple parameterization by the parser.

If Auto-Param Attempts/sec is incremented, it means the parser was satisfied simple parameterization might succeed. Later components will determine the eventual outcome of the parameterization attempt, either Failed, Safe, or Unsafe. I will cover these details later in this series.

In either case, the parser performs the lightweight work to identify potential parameters and assign an initial data type. Partly this is due to the streaming nature of the parser—it might encounter constants in the token stream before anything that disallows simple parameterization. The work might still prove useful if forced parameterization is active, either at the database level, via a plan guide, or due to undocumented trace flag 144.

End of Part 4

In the next part of this series, I’ll continue the compilation process at the algebrization and normalization stages, showing how these components explain some of the curious things we’ve seen with simple parameterization and trivial plans.

[ This series:  Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

The post Simple Parameterization and Trivial Plans — Part 4 appeared first on SQLPerformance.com.

Simple Parameterization and Trivial Plans — Part 3
Mon, 18 Apr 2022. Paul White continues his series on simple parameterization and trivial plans with a look at the information available in execution plans.

[ This series:  Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

Execution Plans

Telling whether a SQL statement used simple parameterization from the information provided in execution plans is more complicated than you might expect. Given the contradictory information often supplied to us, it's no surprise even highly experienced SQL Server users tend to get this wrong.

Let’s look at some examples using the Stack Overflow 2010 database on SQL Server 2019 CU 14, with database compatibility set to 150.

To begin, we’ll need a new nonclustered index:

CREATE INDEX [IX dbo.Users Reputation (DisplayName)] 
ON dbo.Users (Reputation) 
INCLUDE (DisplayName);

1. Simple Parameterization Applied

This first example query uses simple parameterization:

SELECT U.DisplayName 
FROM dbo.Users AS U 
WHERE U.Reputation = 999;

The estimated (pre-execution) plan has the following parameterization-related elements:

Estimated plan parameterization properties

Notice the @1 parameter is introduced everywhere except the query text shown across the top.

The actual (post-execution) plan has:

Actual plan parameterization properties

Notice the properties window has now lost the ParameterizedText element, while gaining information about the parameter runtime value. The parameterized query text is now shown across the top of the window with ‘@1’ instead of ‘999’.

2. Simple Parameterization Not Applied

This second example does not use simple parameterization:

-- Projecting an extra column
SELECT 
    U.DisplayName, 
    U.CreationDate -- NEW
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 999;

The estimated plan shows:

Estimated non-parameterized plan

This time, the parameter @1 is missing from the Index Seek tooltip, but the parameterized text and other parameter list elements are the same as before.

Let’s look at the actual execution plan:

Actual non-parameterized plan

The results are the same as the previous parameterized actual plan, except now the Index Seek tooltip displays the non-parameterized value ‘999’. The query text shown across the top uses the @1 parameter marker. The properties window also uses @1 and displays the runtime value of the parameter.

The query is not a parameterized statement despite all the evidence to the contrary.

3. Parameterization Failed

My third example is also not parameterized by the server:

-- LOWER function used
SELECT 
    U.DisplayName, 
    LOWER(U.DisplayName)
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 999;

The estimated plan is:

Estimated plan parameterization failed

There’s no mention of a @1 parameter anywhere now, and the Parameter List section of the properties window is missing.

The actual execution plan is the same, so I won’t bother showing it.

4. Parallel Parameterized Plan

I want to show you one more example using parallelism in the execution plan. The low estimated cost of my test queries means we need to lower the cost threshold for parallelism to 1:

EXECUTE sys.sp_configure
    @configname = 'cost threshold for parallelism',
    @configvalue = 1;
RECONFIGURE;

The example is a bit more complex this time:

SELECT 
    U.DisplayName 
FROM dbo.Users AS U 
WHERE 
    U.Reputation >= 5 
    AND U.DisplayName > N'ZZZ' 
ORDER BY 
    U.Reputation DESC;

The estimated execution plan is:

Estimated parallel parameterized plan

The query text across the top remains unparameterized while everything else is. There are two parameter markers now, @1 and @2, because simple parameterization found two suitable literal values.

The actual execution plan follows the pattern of example 1:

Actual parallel parameterized plan

The query text across the top is now parameterized and the properties window contains runtime parameter values. This parallel plan (with a Sort operator) is definitely parameterized by the server using simple parameterization.

Reliable Methods

There are reasons for all the behaviours shown so far, and a few more besides. I’ll attempt to explain many of these in the next part of this series when I cover plan compilation.

In the meantime, the situation with showplan in general, and SSMS in particular, is less than ideal. It’s confusing for people who’ve been working with SQL Server their entire careers. Which parameter markers do you trust, and which ones do you ignore?

There are several reliable methods for determining if a particular statement had simple parameterization successfully applied to it or not.

Query Store

I’ll start with one of the most convenient, the query store. Unfortunately, it’s not always as straightforward as you might imagine.

You must enable the query store feature for the database context where the statement is executed and the OPERATION_MODE must be set to READ_WRITE, allowing the query store to actively collect data.

After meeting these conditions, post-execution showplan output contains extra attributes, including the StatementParameterizationType. As the name suggests, this contains a code describing the type of parameterization used for the statement.

It’s visible in the SSMS properties window when the root node of a plan is selected:

StatementParameterizationType

The values are documented in sys.query_store_query:

  • 0 – None
  • 1 – User (explicit parameterization)
  • 2 – Simple parameterization
  • 3 – Forced parameterization

This beneficial attribute only appears in SSMS when an actual plan is requested; it's missing when an estimated plan is selected. It's important to remember the plan must be cached. Requesting an estimated plan from SSMS does not cache the plan produced (since SQL Server 2012).

Once the plan is cached, the StatementParameterizationType appears in the usual places, including via sys.dm_exec_query_plan.

You can also trust the other places parameterization type is recorded in the query store, such as the query_parameterization_type_desc column in sys.query_store_query.

One important caveat. When the query store OPERATION_MODE is set to READ_ONLY, the StatementParameterizationType attribute is still populated in SSMS actual plans—but it’s always zero—giving a false impression the statement was not parameterized when it might well have been.

If you’re happy enabling query store, are sure it’s read-write, and only look at post-execution plans in SSMS, this will work for you.

Standard Plan Predicates

The query text shown across the top of the graphical showplan window in SSMS isn’t reliable, as the examples have shown. Neither can you rely on the ParameterList displayed in the Properties window when the root node of the plan is selected. The ParameterizedText attribute shown for estimated plans only is also not conclusive.

You can, however, rely on the properties associated with individual plan operators. The given examples show these are present in the tooltips when hovering over an operator.

A predicate containing a parameter marker like @1 or @2 indicates a parameterized plan. The operators most likely to contain a parameter are Index Scan, Index Seek, and Filter.

Predicates with parameter markers

If the numbering starts with @1, the plan uses simple parameterization. Forced parameterization begins with @0. I should mention the numbering scheme documented here is subject to change at any time:

Change warning

Nevertheless, this is the method I use most often to determine if a plan was subject to server-side parameterization. It’s generally quick and easy to check a plan visually for predicates containing parameter markers. This method also works for both types of plans, estimated and actual.
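The visual check can be roughly automated: scan an operator's predicate text for parameter markers and inspect the lowest number. This is only a heuristic based on the numbering behaviour described above (which is subject to change); the regex deliberately won't match user-supplied parameter names like @P1:

```python
import re

def parameterization_kind(predicate_text: str) -> str:
    """Classify server-side parameterization from predicate text by the
    lowest-numbered marker: @0 suggests forced, @1 suggests simple.
    Heuristic only; the numbering scheme could change at any time."""
    markers = {int(m) for m in re.findall(r"@(\d+)\b", predicate_text)}
    if not markers:
        return "not parameterized"
    return "forced" if min(markers) == 0 else "simple"
```
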

Dynamic Management Objects

There are several ways to query the plan cache and related DMOs to determine if a statement was parameterized. Naturally, these queries only work on plans in cache, so the statement must have been executed to completion, cached, and not subsequently evicted for any reason.

The most direct approach is to look for an Adhoc plan using an exact SQL textual match to the statement of interest. The Adhoc plan will be a shell containing a ParameterizedPlanHandle if the statement is parameterized by the server. The plan handle is then used to locate the Prepared plan. An Adhoc plan will not exist if the optimize for ad hoc workloads server option is enabled and the statement in question has only executed once.

This type of enquiry often ends up shredding a significant amount of XML and scanning the entire plan cache at least once. It's also easy to get the code wrong, not least because plans in cache cover an entire batch. A batch may contain multiple statements, each of which may or may not be parameterized. Not all the DMOs work at the same granularity (batch or statement), making it quite easy to come unstuck.

An efficient way to list statements of interest, together with plan fragments for just those individual statements, is shown below:

SELECT
    StatementText =
        SUBSTRING(T.[text], 
            1 + (QS.statement_start_offset / 2), 
            1 + ((QS.statement_end_offset - 
                QS.statement_start_offset) / 2)),
    IsParameterized = 
        IIF(T.[text] LIKE N'(%',
            'Yes',
            'No'),
    query_plan = 
        TRY_CONVERT(xml, P.query_plan)
FROM sys.dm_exec_query_stats AS QS
CROSS APPLY sys.dm_exec_sql_text (QS.[sql_handle]) AS T
CROSS APPLY sys.dm_exec_text_query_plan (
    QS.plan_handle, 
    QS.statement_start_offset, 
    QS.statement_end_offset) AS P
WHERE 
    -- Statements of interest
    T.[text] LIKE N'%DisplayName%Users%'
    -- Exclude queries like this one
    AND T.[text] NOT LIKE N'%sys.dm%'
ORDER BY
    QS.last_execution_time ASC,
    QS.statement_start_offset ASC;
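The SUBSTRING arithmetic deserves a note: the DMO offsets are zero-based byte offsets into the UTF-16 batch text, which is why the query halves them and adds one for the 1-based SUBSTRING. A sketch of the same arithmetic in Python (also handling the special end offset of -1, meaning "to the end of the batch", which these examples don't need):

```python
def statement_from_offsets(batch_text: str, start_offset: int, end_offset: int) -> str:
    """Model of the SUBSTRING arithmetic: offsets are zero-based byte
    offsets into UTF-16 text, so halve them to get character positions.
    An end offset of -1 means the statement runs to the end of the batch."""
    start_char = start_offset // 2
    if end_offset == -1:
        return batch_text[start_char:]
    length = (end_offset - start_offset) // 2 + 1
    return batch_text[start_char:start_char + length]
```
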

To illustrate, let’s run a single batch containing the four examples from earlier:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
-- Example 1
SELECT U.DisplayName 
FROM dbo.Users AS U 
WHERE U.Reputation = 999;

-- Example 2
SELECT 
    U.DisplayName, 
    U.CreationDate 
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 999;

-- Example 3
SELECT 
    U.DisplayName, 
    LOWER(U.DisplayName)
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 999;

-- Example 4
SELECT 
    U.DisplayName 
FROM dbo.Users AS U 
WHERE 
    U.Reputation >= 5 
    AND U.DisplayName > N'ZZZ' 
ORDER BY 
    U.Reputation DESC;
GO

The output of the DMO query is:

DMO query output

This confirms only examples 1 and 4 were successfully parameterized.

Performance Counters

It’s possible to use the SQL Statistics performance counters to get a detailed insight into parameterization activity for both estimated and actual plans. The counters used aren’t scoped per-session, so you’ll need to use a test instance with no other concurrent activity to get accurate results.

I’m going to supplement the parameterization counter information with data from the sys.dm_exec_query_optimizer_info DMO to provide statistics on trivial plans as well.

Some care is needed to prevent statements reading the counter information from modifying those counters themselves. I’m going to address this by creating a couple of temporary stored procedures:

CREATE PROCEDURE #TrivialPlans
AS
SET NOCOUNT ON;

SELECT
    OI.[counter],
    OI.occurrence
FROM sys.dm_exec_query_optimizer_info AS OI
WHERE
    OI.[counter] = N'trivial plan';
GO
CREATE PROCEDURE #PerfCounters
AS
SET NOCOUNT ON;

SELECT
    PC.[object_name],
    PC.counter_name,
    PC.cntr_value
FROM 
    sys.dm_os_performance_counters AS PC
WHERE 
    PC.counter_name LIKE N'%Param%';

The script to test a particular statement then looks like this:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
EXECUTE #PerfCounters;
EXECUTE #TrivialPlans;
GO
SET SHOWPLAN_XML ON;
GO
-- The statement(s) under test:
-- Example 3
SELECT 
    U.DisplayName, 
    LOWER(U.DisplayName)
FROM dbo.Users AS U 
WHERE 
    U.Reputation = 999;
GO
SET SHOWPLAN_XML OFF;
GO
EXECUTE #TrivialPlans;
EXECUTE #PerfCounters;

Comment the SHOWPLAN_XML batches out to run the target statement(s) and get actual plans. Leave them in place for estimated execution plans.

Running the whole thing as written gives the following results:

Performance counter test results

I’ve highlighted above where values changed when testing example 3.

The increase in the “trivial plan” counter from 1050 to 1051 shows a trivial plan was found for the test statement.

The simple parameterization counters increased by 1 for both attempts and failures, showing SQL Server tried to parameterize the statement, but failed.
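The before-and-after comparison can be expressed as a small function. This is just a sketch of the interpretation logic; it assumes you've captured the four SQL Statistics counter values into dictionaries, and that exactly one statement was compiled between the snapshots:

```python
def classify_parameterization(before: dict, after: dict) -> str:
    """Interpret deltas in the auto-parameterization counters for a
    single statement compiled between the two snapshots."""
    delta = {name: after[name] - before.get(name, 0) for name in after}
    if delta.get("Auto-Param Attempts/sec", 0) == 0:
        return "not attempted"          # parser ruled it out
    if delta.get("Failed Auto-Params/sec", 0) > 0:
        return "attempted but failed"
    if delta.get("Unsafe Auto-Params/sec", 0) > 0:
        return "unsafe"
    return "safe"
```

For example 3 above, the attempts and failures counters both rose by one, which this sketch reports as "attempted but failed".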

End of Part 3

In the next part of this series, I’ll explain the curious things we’ve seen by describing how simple parameterization and trivial plans interact with the compilation process.

If you changed your cost threshold for parallelism to run the examples, remember to reset it (mine was set to 50):

EXECUTE sys.sp_configure
    @configname = 'cost threshold for parallelism',
    @configvalue = 50;
RECONFIGURE;

[ This series:  Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

The post Simple Parameterization and Trivial Plans — Part 3 appeared first on SQLPerformance.com.

Simple Parameterization and Trivial Plans — Part 2
Tue, 29 Mar 2022. Paul White continues his series on simple parameterization and trivial plans with a look at the data types assigned to parameters.

[ This series:  Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

Parameter Data Types

As mentioned in the first part of this series, one of the reasons it is better to explicitly parameterize is so you have full control over parameter data types. Simple parameterization has a number of quirks in this area, which can result in more parameterized plans being cached than expected, or finding different results compared with the unparameterized version.

When SQL Server applies simple parameterization to an ad-hoc statement, it makes a guess about the data type of the replacement parameter. I’ll cover the reasons for the guessing later in the series.

For the time being, let’s look at some examples using the Stack Overflow 2010 database on SQL Server 2019 CU 14. Database compatibility is set to 150, and the cost threshold for parallelism is set to 50 to avoid parallelism for now:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 252;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 25221;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 252552;

These statements result in six cached plans, three Adhoc and three Prepared:

Different guessed types

Notice the different parameter data types in the Prepared plans.

Data Type Inference

The details of how each data type is guessed are complex and incompletely documented. As a starting point, SQL Server infers a basic type from the textual representation of the value, then uses the smallest compatible subtype.

For a string of numbers without quotation marks or a decimal point, SQL Server chooses from tinyint, smallint, and integer. For such numbers beyond the range of an integer, SQL Server uses numeric with the smallest possible precision. For example, the number 2,147,483,648 is typed as numeric(10,0). The bigint type isn’t used for server-side parameterization. This paragraph explains the data types selected in the prior examples.
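As an illustration, here is a small Python model of the whole-number rule. The function is hypothetical (SQL Server's real logic is internal), but it reproduces the types seen in the cached plans above:

```python
def infer_whole_number_type(literal: str) -> str:
    """Model: unquoted whole numbers get the smallest of tinyint,
    smallint, and integer; beyond integer range, numeric with the
    smallest possible precision and scale 0. bigint is never chosen."""
    value = int(literal)
    if 0 <= value <= 255:
        return "tinyint"
    if -32768 <= value <= 32767:
        return "smallint"
    if -2147483648 <= value <= 2147483647:
        return "integer"
    return f"numeric({len(literal.lstrip('-+'))},0)"
```
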

Strings of numbers with a decimal point are interpreted as numeric, with a precision and scale just large enough to contain the value provided. Strings prefixed with a currency symbol are interpreted as money. Strings in scientific notation translate to float. The smallmoney and real types aren’t employed.

The datetime and uniqueidentifier types cannot be inferred from natural string formats. To get a datetime or uniqueidentifier parameter type, the literal value must be provided in ODBC escape format. For example {d '1901-01-01'}, {ts '1900-01-01 12:34:56.790'}, or {guid 'F85C72AB-15F7-49E9-A949-273C55A6C393'}. Otherwise, the intended date or UUID literal is typed as a string. Date and time types other than datetime aren't used.

General string and binary literals are typed as varchar(8000), nvarchar(4000), or varbinary(8000) as appropriate, unless the literal exceeds 8000 bytes in which case the max variant is used. This scheme helps avoid the cache pollution and low level of reuse that would result from using specific lengths.
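A sketch of that string and binary rule follows. Quote handling is deliberately naive (no escaped quotes), and the helper is my own illustration rather than anything SQL Server exposes:

```python
def infer_string_type(literal: str) -> str:
    """Model: general literals get the 'large' fixed lengths, switching
    to the max variants only when the value exceeds 8000 bytes."""
    if literal.startswith("0x"):                       # binary literal
        nbytes = (len(literal) - 2) // 2               # two hex chars per byte
        return "varbinary(8000)" if nbytes <= 8000 else "varbinary(max)"
    if literal.startswith("N'"):                       # Unicode string
        nbytes = (len(literal) - 3) * 2                # two bytes per character
        return "nvarchar(4000)" if nbytes <= 8000 else "nvarchar(max)"
    nbytes = len(literal) - 2                          # non-Unicode string
    return "varchar(8000)" if nbytes <= 8000 else "varchar(max)"
```
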

It isn’t possible to use CAST or CONVERT to set the data type for parameters for reasons I’ll detail later in this series. There is an example of this in the next section.

I won’t cover forced parameterization in this series, but I do want to mention the rules for data type inference in that case have some important differences compared to simple parameterization. Forced parameterization wasn’t added until SQL Server 2005, so Microsoft had the opportunity to incorporate some lessons from the simple parameterization experience, and didn’t have to worry much about backward-compatibility issues.

Numeric Types

For numbers with a decimal point and whole numbers beyond the range of integer, the inferred type rules present special problems for plan reuse and cache pollution.

Consider the following query using decimals:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
DROP TABLE IF EXISTS dbo.Test;
GO
CREATE TABLE dbo.Test
(
    SomeValue decimal(19,8) NOT NULL
);
GO
SELECT 
    T.SomeValue 
FROM dbo.Test AS T 
WHERE 
    T.SomeValue >= 987.65432 
    AND T.SomeValue < 123456.789;

This query qualifies for simple parameterization. SQL Server chooses the smallest precision and scale for the parameters able to contain the supplied values. This means it chooses numeric(8,5) for 987.65432 and numeric(9,3) for 123456.789:

Inferred numeric data types
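The "smallest precision and scale" inference is easy to model. A sketch (hypothetical helper, assuming a plain unsigned decimal literal):

```python
def smallest_numeric(literal: str) -> str:
    """Smallest numeric(p,s) able to hold a decimal literal:
    scale = digits after the point; precision = significant integer
    digits plus the scale (minimum precision 1)."""
    whole, _, frac = literal.partition(".")
    scale = len(frac)
    precision = max(len(whole.lstrip("0")) + scale, scale, 1)
    return f"numeric({precision},{scale})"
```
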

These inferred types don’t match the decimal(19,8) type of the column, so a conversion around the parameter appears in the execution plan:

Conversion to column type

These conversions only represent a small runtime inefficiency in this particular case. In other situations, a mismatch between the column data type and the inferred type of a parameter might prevent an index seek or require SQL Server to do extra work to manufacture a dynamic seek.

Even where the resulting execution plan seems reasonable, a type mismatch can easily affect plan quality due to the effect of the type mismatch on cardinality estimation. It’s always best to use matching data types, and to pay careful attention to the derived types resulting from expressions.

Plan Reuse

The main issue with the current plan is the specific inferred types affecting cached plan matching and therefore reuse. Let’s run a couple more queries of the same general form:

SELECT 
    T.SomeValue 
FROM dbo.Test AS T 
WHERE 
    T.SomeValue >= 98.76 
    AND T.SomeValue < 123.4567;
GO
SELECT 
    T.SomeValue 
FROM dbo.Test AS T 
WHERE 
    T.SomeValue >= 1.2 
    AND T.SomeValue < 1234.56789;
GO

Now look at the plan cache:

SELECT
    CP.usecounts,
    CP.objtype,
    ST.[text]
FROM sys.dm_exec_cached_plans AS CP
CROSS APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
WHERE 
    ST.[text] NOT LIKE '%dm_exec_cached_plans%'
    AND ST.[text] LIKE '%SomeValue%Test%'
ORDER BY 
    CP.objtype ASC;

It shows an Adhoc and a Prepared statement for each query we submitted:

Separate prepared statements

The parameterized text is the same, but the parameter data types are different, so separate plans are cached, and no plan reuse occurs.

If we continue to submit queries with different combinations of scale or precision, a new Prepared plan will be created and cached each time. Remember the inferred type of each parameter isn’t limited by the column data type, so we could end up with a tremendous number of cached plans, depending on the numeric literals submitted. The number of combinations from numeric(1,0) to numeric(38,38) is already large before we think about multiple parameters.
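To put a number on "tremendous": each precision p from 1 to 38 allows scales 0 through p, so the count of distinct numeric types for a single parameter alone is:

```python
# Distinct numeric(p,s) combinations with 1 <= p <= 38 and 0 <= s <= p
combinations = sum(p + 1 for p in range(1, 39))
print(combinations)  # 779 distinct types for a single parameter
```

With two parameters in the predicate, as in the example queries, the number of possible cached plan variations is the square of that.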

Explicit Parameterization

This problem doesn’t arise when we use explicit parameterization, ideally choosing the same data type as the column the parameter is compared with:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
DECLARE 
    @stmt nvarchar(4000) =
        N'SELECT T.SomeValue FROM dbo.Test AS T WHERE T.SomeValue >= @P1 AND T.SomeValue < @P2;',
    @params nvarchar(4000) =
        N'@P1 numeric(19,8), @P2 numeric(19,8)';

EXECUTE sys.sp_executesql 
    @stmt, 
    @params, 
    @P1 = 987.65432, 
    @P2 = 123456.789;

EXECUTE sys.sp_executesql 
    @stmt, 
    @params, 
    @P1 = 98.76, 
    @P2 = 123.4567;

EXECUTE sys.sp_executesql 
    @stmt, 
    @params, 
    @P1 = 1.2, 
    @P2 = 1234.56789;

With explicit parameterization, the plan cache query shows only one plan cached, used three times, and no type conversions needed:

Explicit parameterization

As a final side note, I've used decimal and numeric interchangeably in this section. They are technically different types, though documented as synonyms that behave equivalently. This is usually the case, but not always:

-- Raises error 8120:
-- Column 'dbo.Test.SomeValue' is invalid in the select list
-- because it is not contained in either an aggregate function
-- or the GROUP BY clause.
SELECT CONVERT(decimal(19,8), T.SomeValue)
FROM dbo.Test AS T 
GROUP BY CONVERT(numeric(19,8), T.SomeValue);

It’s probably a small parser bug, but it still pays to be consistent (unless you’re writing an article and want to point out an interesting exception).

Arithmetic Operators

There’s one other edge case I want to address, based on an example given in the documentation, but in a bit more detail (and perhaps accuracy):

-- The dbo.LinkTypes table contains two rows

-- Uses simple parameterization
SELECT r = CONVERT(float, 1./ 7) 
FROM dbo.LinkTypes AS LT;

-- No simple parameterization due to
-- constant-constant comparison
SELECT r = CONVERT(float, 1./ 7) 
FROM dbo.LinkTypes AS LT 
WHERE 1 = 1;

The results are different, as documented:

Different results

With Simple Parameterization

When simple parameterization occurs, SQL Server parameterizes both literal values. The 1. value is typed as numeric(1,0) as expected. Somewhat inconsistently, the 7 is typed as integer (not tinyint). The rules of type inference have been built over time, by different teams. Behaviours are maintained to avoid breaking legacy code.

The next step involves the / arithmetic operator. SQL Server requires compatible types before performing the division. Given numeric (decimal) has a higher data type precedence than integer, the integer will be converted to numeric.

SQL Server needs to implicitly convert the integer to numeric. But which precision and scale should it use? The answer could be based on the original literal, as SQL Server does in other circumstances, but it always uses numeric(10,0) here.

The data type of the result of dividing a numeric(1,0) by a numeric(10,0) is determined by another set of rules, given in the documentation for precision, scale, and length. Plugging the numbers into the formulas for result precision and scale given there, we have:

  • Result precision:
    • p1 - s1 + s2 + max(6, s1 + p2 + 1)
    • = 1 - 0 + 0 + max(6, 0 + 10 + 1)
    • = 1 + max(6, 11)
    • = 1 + 11
    • = 12
  • Result scale:
    • max(6, s1 + p2 + 1)
    • = max(6, 0 + 10 + 1)
    • = max(6, 11)
    • = 11

The data type of 1. / 7 is, therefore, numeric(12, 11). This value is then converted to float as requested and displayed as 0.14285714285 (with 11 digits after the decimal point).

Without Simple Parameterization

When simple parameterization isn’t performed, the 1. literal is typed as numeric(1,0) as before. The 7 is initially typed as integer also as seen previously. The key difference is the integer is converted to numeric(1,0), so the division operator has common types to work with. This is the smallest precision and scale able to contain the value 7. Remember simple parameterization used numeric(10,0) here.

The precision and scale formulas for dividing numeric(1,0) by numeric(1,0) give a result data type of numeric(7,6):

  • Result precision:
    • p1 - s1 + s2 + max(6, s1 + p2 + 1)
    • = 1 - 0 + 0 + max(6, 0 + 1 + 1)
    • = 1 + max(6, 2)
    • = 1 + 6
    • = 7
  • Result scale:
    • max(6, s1 + p2 + 1)
    • = max(6, 0 + 1 + 1)
    • = max(6, 2)
    • = 6

After the final conversion to float, the displayed result is 0.142857 (with six digits after the decimal point).

The observed difference in the results is therefore due to interim type derivation (numeric(12,11) vs. numeric(7,6)) rather than the final conversion to float.
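Both interim types follow mechanically from the documented formulas. A quick Python check (ignoring the cap at precision 38, which doesn't apply here):

```python
def numeric_division_type(p1: int, s1: int, p2: int, s2: int) -> tuple:
    """Result type of numeric(p1,s1) / numeric(p2,s2) per the documented
    formulas:  scale     = max(6, s1 + p2 + 1)
               precision = p1 - s1 + s2 + scale"""
    scale = max(6, s1 + p2 + 1)
    precision = p1 - s1 + s2 + scale
    return precision, scale

# With simple parameterization: numeric(1,0) / numeric(10,0)
print(numeric_division_type(1, 0, 10, 0))  # (12, 11)

# Without: numeric(1,0) / numeric(1,0)
print(numeric_division_type(1, 0, 1, 0))   # (7, 6)
```
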

If you need further evidence the conversion to float isn’t responsible, consider:

-- Simple parameterization
SELECT r = CONVERT(decimal(13,12), 1. / 7)
FROM dbo.LinkTypes AS LT;

-- No simple parameterization (query hint prevents it)
SELECT r = CONVERT(decimal(13,12), 1. / 7)
FROM dbo.LinkTypes AS LT 
OPTION (MAXDOP 1);

Result with decimal

The results differ in value and scale as before.

This section doesn’t cover every quirk of data type inference and conversion with simple parameterization by any means. As said before, you’re better off using explicit parameters with known data types wherever possible.

End of Part 2

The next part of this series describes how simple parameterization affects execution plans.

[ This series:  Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

The post Simple Parameterization and Trivial Plans — Part 2 appeared first on SQLPerformance.com.

Simple Parameterization and Trivial Plans — Part 1 https://sqlperformance.com/2022/03/sql-performance/simple-param-trivial-plans-1 https://sqlperformance.com/2022/03/sql-performance/simple-param-trivial-plans-1#comments Tue, 22 Mar 2022 09:00:24 +0000 https://sqlperformance.com/?p=11299 Paul White embarks on a new series covering less well-known details about simple parameterization and trivial plans. Learn more in part 1.

The post Simple Parameterization and Trivial Plans — Part 1 appeared first on SQLPerformance.com.

[ This series:  Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

This is the first part of a series about simple parameterization and trivial plans. These two compilation features are closely connected and have similar goals. Both target performance and efficiency for workloads frequently submitting simple statements.

Despite the “simple” and “trivial” names, both have subtle behaviours and implementation details that can make how they work difficult to understand. This series doesn’t dwell too long on the basics but concentrates on less well-known aspects likely to trip up even the most experienced database professionals.

In this first part, after a quick introduction, I look at the effects of simple parameterization on the plan cache.

Simple Parameterization

It's almost always better to explicitly parameterize statements, rather than relying on the server to do it. Being explicit gives you complete control over all aspects of the parameterization process, including where parameters are used, the precise data types used, and when plans are reused.

Most clients and drivers provide specific ways to use explicit parameterization. There are also options like sp_executesql, stored procedures, and functions.

I’m not going to get into the related issues of parameter sniffing or SQL injection because, while important, they're not the focus of this series. Still, you should write code with both close to the front of your mind.

For legacy applications or other third-party code that cannot be easily changed, explicit parameterization may not always be possible. You may be able to overcome some obstacles using template plan guides. In any event, it would be an unusual workload that does not contain at least some parameterized statements server-side.

Shell Plans

When SQL Server 2005 introduced Forced Parameterization, the existing auto-parameterization feature was renamed to Simple Parameterization. Despite the change in terminology, simple parameterization works the same as auto-parameterization always did: SQL Server attempts to replace constant literal values in ad hoc statements with parameter markers. The aim is to reduce compilations by increasing cached plan reuse.
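As a rough illustration of the idea only (a toy, text-based sketch—SQL Server works on the parsed statement, is far more restrictive, and infers a data type for each parameter), literal replacement might look like:

```python
import re

def toy_parameterize(sql):
    """Replace integer literals with @1, @2, ... markers so that
    statements differing only in literal values share one template.
    A deliberately naive illustration, not SQL Server's algorithm."""
    counter = 0

    def replace(match):
        nonlocal counter
        counter += 1
        return f"@{counter}"

    return re.sub(r"\b\d+\b", replace, sql)

a = toy_parameterize("SELECT DisplayName FROM dbo.Users WHERE Reputation = 2521")
b = toy_parameterize("SELECT DisplayName FROM dbo.Users WHERE Reputation = 3151")
assert a == b  # both map to the same parameterized text, so one plan can serve both
```

The point is simply that many distinct ad hoc texts collapse to one parameterized form, which is what enables the plan reuse shown next.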

Let’s look at an example, using the Stack Overflow 2010 database on SQL Server 2019 CU 14. Database compatibility is set to 150, and the cost threshold for parallelism is set to 50 to avoid parallelism for the moment:

EXECUTE sys.sp_configure
    @configname = 'show advanced options',
    @configvalue = 1;
RECONFIGURE;
GO
EXECUTE sys.sp_configure
    @configname = 'cost threshold for parallelism',
    @configvalue = 50;
RECONFIGURE;

Example code:

-- Clear the cache of plans for this database
ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2521;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2827;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3144;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3151;
GO

Those statements feature predicates that differ only in their constant literal values. SQL Server successfully applies simple parameterization, resulting in a parameterized plan. The single parameterized plan is used four times as we can see by querying the plan cache:

SELECT
    CP.usecounts,
    CP.cacheobjtype,
    CP.objtype,
    CP.size_in_bytes,
    ST.[text],
    QP.query_plan
FROM sys.dm_exec_cached_plans AS CP
OUTER APPLY sys.dm_exec_sql_text (CP.plan_handle) AS ST
OUTER APPLY sys.dm_exec_query_plan (CP.plan_handle) AS QP
WHERE 
    ST.[text] NOT LIKE '%dm_exec_cached_plans%'
    AND ST.[text] LIKE '%DisplayName%Users%'
ORDER BY 
    CP.usecounts ASC;

The results show an Adhoc plan cache entry for each original statement and a single Prepared plan:

Four adhoc plans and one prepared plan

A Prepared statement is similar to a stored procedure, with parameters inferred from literal values found in the Adhoc statement. I mention this as a useful mental model when thinking about the server-side parameterization process.

Notice that SQL Server caches both the original text and the parameterized form. When simple parameterization is successful, the plan associated with the original text is Adhoc and does not contain a full execution plan. Instead, the cached plan is a shell with very little besides a pointer to the Prepared parameterized plan.

The XML representation of the shell plans contain text like:

<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan" Version="1.539" Build="15.0.4188.2">
<BatchSequence>
<Batch>
<Statements>
<StmtSimple 
  StatementText="SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3151"
  StatementId="1" 
  StatementCompId="1" 
  StatementType="SELECT" 
  RetrievedFromCache="true" 
  ParameterizedPlanHandle="0x0600050090C8321CE04B4B079E01000001000000000000000000000000000000000000000000000000000000" 
  ParameterizedText="(@1 smallint)SELECT [U].[DisplayName] FROM [dbo].[Users] [U] WHERE [U].[Reputation]=@1" />
</Statements>
</Batch>
</BatchSequence>
</ShowPlanXML>

That's the entire plan. The ParameterizedPlanHandle points from the Adhoc shell to the full parameterized plan. The handle value is the same for all four shell plans.

Plan Stubs

Shell plans are smaller than a full compiled plan—16KB instead of 40KB in the example. This can still add up to a significant amount of memory if you have many statements using simple parameterization or lots of different parameter values. Most SQL Server instances are not so awash with memory that they can afford to waste it like this. The shell plans are considered very disposable by SQL Server, but finding and removing them consumes resources and can become a point of contention.

We can reduce the total memory consumption for shell plans by enabling the optimize for ad hoc workloads option.

EXECUTE sys.sp_configure
    @configname = 'show advanced options',
    @configvalue = 1;
RECONFIGURE;
GO
EXECUTE sys.sp_configure
    @configname = 'optimize for ad hoc workloads',
    @configvalue = 1;
RECONFIGURE;

With this option enabled, the first time an ad hoc statement is encountered, SQL Server caches a tiny stub instead of a shell plan. The stub serves as a bookmark so the server can remember it has seen the exact statement text before. Upon encountering the same text a second time, compilation and caching proceed as if optimize for ad hoc workloads were not enabled.
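The stub behaviour can be modelled as a simple two-step cache policy (a sketch of the behaviour described here, not of SQL Server internals):

```python
def get_or_compile(cache, text, optimize_for_adhoc, compile_fn):
    """First sighting of an ad hoc text caches only a stub when
    'optimize for ad hoc workloads' is on; a second sighting of the
    same text compiles and caches the full plan as usual."""
    entry = cache.get(text)
    if entry is not None and entry != "stub":
        return entry                       # full plan already cached
    if optimize_for_adhoc and entry is None:
        cache[text] = "stub"               # bookmark: we've seen this text once
        return compile_fn(text)            # compile to execute, but don't cache the plan
    plan = compile_fn(text)                # second sighting (or option off): cache normally
    cache[text] = plan
    return plan

cache = {}
compile_fn = lambda text: f"plan({text})"
get_or_compile(cache, "Q1", True, compile_fn)   # first time: only a stub is cached
assert cache["Q1"] == "stub"
get_or_compile(cache, "Q1", True, compile_fn)   # second time: the full plan is cached
assert cache["Q1"] == "plan(Q1)"
```

With the option off, the first sighting caches the plan immediately, matching the default behaviour shown earlier.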

Re-running the example with optimize for ad hoc workloads enabled shows the effect on the plan cache.

Compiled plan stubs

No plan is cached for the ad-hoc statements, just a stub. There is no ParameterizedPlanHandle pointer to the Prepared plan, though a complete parameterized plan is cached.

Running the test batches for a second time (without clearing the plan cache) gives the same outcome as when optimize for ad hoc workloads was not enabled—four Adhoc shell plans pointing to the Prepared plan.

Before continuing, reset the optimize for ad hoc workloads setting to zero:

EXECUTE sys.sp_configure
    @configname = 'optimize for ad hoc workloads',
    @configvalue = 0;
RECONFIGURE;

Plan Cache Size Limits

Whether plan shells or plan stubs are used, there are still downsides to all these Adhoc cache entries. I've already mentioned total memory use, but each plan cache also has a maximum number of entries. Even where the total memory usage is not a concern, the sheer quantity may be.

The limits can be raised with documented trace flag 174 (number of entries) and trace flag 8032 (total size). Depending on the workload and other memory demands, this may not be the best solution. After all, it just means caching more low-value Adhoc plans, taking memory away from other needs.

Caching Only Prepared Plans

If the workload rarely issues ad-hoc batches with exactly the same statement text, caching plan shells or plan stubs is a waste of resources. It consumes memory and may cause contention when the SQL Plans cache store (CACHESTORE_SQLCP) needs to be shrunk to fit within configured limits.

The ideal would be to parameterize incoming ad-hoc batches, but only cache the parameterized version. There is a cost to doing this, because future ad-hoc statements need to be parameterized before they can be matched to the parameterized cached plan. On the other hand, this would have happened anyway since we've already stated exact textual matches are rare for the target workload.

For workloads that benefit from simple parameterization, but not the caching of Adhoc entries, there are a couple of options.

Undocumented Trace Flag

The first option is to enable undocumented trace flag 253. This prevents the caching of Adhoc plans completely. It does not simply restrict the number of such plans, or prevent them from “staying” in the cache, as has sometimes been suggested.

Trace flag 253 can be enabled at the session level—restricting its effects to just that connection—or more widely as a global or start-up flag. It also functions as a query hint, but using those prevents simple parameterization, which would be counterproductive here. There is a partial list of the things that prevent simple parameterization in the Microsoft Technical Paper, Plan Caching and Recompilation in SQL Server 2012.

With trace flag 253 active before the batch is compiled, only the Prepared statements are cached:

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
-- Do not cache ad-hoc plans
DBCC TRACEON (253);
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2521;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2827;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3144;
GO
SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3151;
GO
-- Cache ad-hoc plans again
DBCC TRACEOFF (253);
GO

The plan cache query confirms only the Prepared statement is cached and reused.

Only the prepared statement is cached

The Uncacheable Batch

The second option is to include a statement that marks the entire batch as uncacheable. Suitable statements are often security-related or otherwise sensitive in some way.

This might sound impractical, but there are a couple of mitigations. First, the sensitive statement need not be executed—it just needs to be present. Second, the user running the batch doesn’t even need permission to execute the sensitive statement. Note carefully, the effect is confined to the batch containing the sensitive statement.

Two suitably-sensitive statements and example usage are shown below (with the test statements now in a single batch):

ALTER DATABASE SCOPED CONFIGURATION 
    CLEAR PROCEDURE_CACHE;
GO
-- Prevent caching of all statements in this batch.
-- Neither KEY nor CERTIFICATE need to exist.
-- No special permissions are needed.
-- GOTO is used to ensure the statements are not executed.
GOTO Start
    OPEN SYMMETRIC KEY Banana 
        DECRYPTION BY CERTIFICATE Banana;
Start:

/* Another way to achieve the same effect without GOTO
IF 1 = 0
BEGIN
    CREATE APPLICATION ROLE Banana 
    WITH PASSWORD = '';
END;
*/

SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2521;

SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 2827;

SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3144;

SELECT U.DisplayName
FROM dbo.Users AS U 
WHERE U.Reputation = 3151;
GO

The Prepared plans created by simple parameterization are still cached and reused despite the parent batch being marked as uncacheable.

Only the prepared statement is cached

Neither solution is ideal, but until Microsoft provides a documented and supported solution for this issue, they’re the best options I’m aware of.

End of Part 1

There's a lot more ground to cover on this topic. Part two will cover the data types assigned when simple parameterization is employed.

[ This series:  Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 ]

Why the Optimizer Doesn't Use Buffer Pool Knowledge https://sqlperformance.com/2021/11/sql-optimizer/buffer-pool-knowledge https://sqlperformance.com/2021/11/sql-optimizer/buffer-pool-knowledge#comments Tue, 30 Nov 2021 09:00:48 +0000 https://sqlperformance.com/?p=11121 Paul Randal describes why the optimizer doesn’t use buffer pool contents for plan generation and details the potential dangers if it did.

The post Why the Optimizer Doesn't Use Buffer Pool Knowledge appeared first on SQLPerformance.com.

SQL Server has a cost-based optimizer that uses knowledge about the various tables involved in a query to produce what it decides is the most optimal plan in the time available to it during compilation. This knowledge includes whatever indexes exist and their sizes and whatever column statistics exist. Part of what goes into finding the optimal query plan is trying to minimize the number of physical reads needed during plan execution.

One thing I’ve been asked a few times is why the optimizer doesn’t consider what’s in the SQL Server buffer pool when compiling a query plan, as surely that could make a query execute faster. In this post, I’ll explain why.

Figuring Out Buffer Pool Contents

The first reason why the optimizer ignores the buffer pool is that it’s a non-trivial problem to figure out what’s in the buffer pool because of the way the buffer pool is organized. Data file pages are controlled in the buffer pool by small data structures called buffers, which track things like (non-exhaustive list):

  • The page’s ID (file number: page-number-in-file)
  • The last time the page was referenced (used by the lazy writer to help implement the least-recently-used algorithm that creates free space when needed)
  • The memory location of the 8KB page in the buffer pool
  • Whether the page is dirty or not (a dirty page has changes on it that haven’t yet been written back to durable storage)
  • The allocation unit the page belongs to (explained here) and the allocation unit ID can be used to figure out what table and index the page is part of

For each database that has pages in the buffer pool, there is a hash list of pages, in page ID order, that’s quickly searchable to determine whether a page is already in memory or whether a physical read has to be performed. However, nothing easily allows SQL Server to determine what percentage of the leaf level for each index of a table is already in memory. Code would have to scan the entire list of buffers for the database, looking for buffers that map pages for the allocation unit in question. And the more pages in memory for a database, the longer the scan would take. It would be prohibitively expensive to do as part of query compilation.
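The asymmetry can be sketched with a plain dictionary: a page lookup by ID is cheap, but answering “what fraction of this allocation unit is resident” means walking every buffer (a simplified model, assuming one buffer entry per resident page):

```python
# Simplified buffer pool model: (file number, page number) -> allocation unit
buffer_pool = {
    (1, 100): "au_index_a",
    (1, 101): "au_index_a",
    (1, 200): "au_index_b",
}

def is_page_resident(page_id):
    """O(1): hashed lookup by page ID -- the operation the engine
    actually needs when deciding whether a physical read is required."""
    return page_id in buffer_pool

def resident_page_count(allocation_unit):
    """O(number of buffers): must scan every buffer for the database.
    The cost grows with buffer pool size -- prohibitive at compile time."""
    return sum(1 for au in buffer_pool.values() if au == allocation_unit)

assert is_page_resident((1, 100))
assert resident_page_count("au_index_a") == 2
```

The membership test stays constant-time no matter how large the pool grows, while the residency count scales with the total number of buffers, which is the point made above.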

If you’re interested, I wrote a post a while back with some T-SQL code that scans the buffer pool and gives some metrics, using the DMV sys.dm_os_buffer_descriptors.

Why Using Buffer Pool Contents Would Be Dangerous

Let’s pretend there *is* a highly efficient mechanism to determine buffer pool contents the optimizer can use to help it choose which index to use in a query plan. The hypothesis I’m going to explore is if the optimizer knows enough of a less efficient (larger) index is already in memory, compared to the most efficient (smaller) index to use, it should pick the in-memory index because it will reduce the number of physical reads required and the query will run faster.

The scenario I’m going to use is as follows: a table BigTable has two nonclustered indexes, Index_A and Index_B, both completely covering a particular query. The query requires a complete scan of the leaf level of the index to retrieve the query results. The table has 1 million rows. Index_A has 200,000 pages at its leaf level, and Index_B has 1 million pages at its leaf level, so a complete scan of Index_B requires processing five times as many pages.

I created this contrived example on a laptop running SQL Server 2019 with 8 processor cores, 32GB of memory, and solid-state disks. The code is as follows:

CREATE TABLE BigTable (
  	c1 BIGINT IDENTITY,
  	c2 AS (c1 * 2),
  	c3 CHAR (1500) DEFAULT 'a',
  	c4 CHAR (5000) DEFAULT 'b'
);
GO

INSERT INTO BigTable DEFAULT VALUES;
GO 1000000

CREATE NONCLUSTERED INDEX Index_A ON BigTable (c2) INCLUDE (c3);
-- 5 records per page = 200,000 pages
GO

CREATE NONCLUSTERED INDEX Index_B ON BigTable (c2) INCLUDE (c4);
-- 1 record per page = 1 million pages
GO

CHECKPOINT;
GO

And then I timed the contrived queries:

DBCC DROPCLEANBUFFERS;
GO

-- Index_A not in memory
SELECT SUM (c2) FROM BigTable WITH (INDEX (Index_A));
GO
-- CPU time = 796 ms, elapsed time = 764 ms

-- Index_A in memory
SELECT SUM (c2) FROM BigTable WITH (INDEX (Index_A));
GO
-- CPU time = 312 ms, elapsed time = 52 ms

DBCC DROPCLEANBUFFERS;
GO

-- Index_B not in memory
SELECT SUM (c2) FROM BigTable WITH (INDEX (Index_B));
GO
-- CPU time = 2952 ms, elapsed time = 2761 ms

-- Index_B in memory
SELECT SUM (c2) FROM BigTable WITH (INDEX (Index_B));
GO
-- CPU time = 1219 ms, elapsed time = 149 ms

You can see when neither index is in memory, Index_A is easily the most efficient index to use, with an elapsed query time of 764ms against 2,761ms using Index_B, and the same is true when both indexes are in memory. However, if Index_B is in memory and Index_A is not, the query runs faster using Index_B (149ms) than using Index_A (764ms).

Now let’s allow the optimizer to base plan choice on what’s in the buffer pool…

If Index_A is mostly not in memory and Index_B is mostly in memory, it would be more efficient to compile the query plan to use Index_B, for a query running at that instant. Even though Index_B is larger and would need more CPU cycles to scan through, physical reads are much slower than the extra CPU cycles so a more efficient query plan minimizes the number of physical reads.

This argument only holds, and a “use Index_B” query plan is only more efficient than a “use Index_A” query plan, if Index_B remains mostly in memory, and Index_A remains mostly not in memory. As soon as most of Index_A is in memory, the “use Index_A” query plan would be more efficient, and the “use Index_B” query plan is the wrong choice.

The situations when the compiled “use Index_B” plan is less efficient than the cost-based “use Index_A” plan are (generalizing):

  • Index_A and Index_B are both in memory: the compiled plan will take almost three times longer
  • Neither index is memory resident: the compiled plan takes over 3.5 times longer
  • Index_A is memory resident and Index_B isn’t: all physical reads performed by the plan are extraneous, AND it will take a whopping 53 times longer
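Those multipliers come straight from the elapsed times measured earlier (a quick arithmetic check, times in milliseconds):

```python
a_cold, a_warm = 764, 52      # Index_A elapsed times: not in memory / in memory
b_cold, b_warm = 2761, 149    # Index_B elapsed times: not in memory / in memory

# Compiled "use Index_B" plan versus the cost-based "use Index_A" plan:
both_resident = b_warm / a_warm      # both indexes in memory
neither_resident = b_cold / a_cold   # neither index in memory
only_a_resident = b_cold / a_warm    # Index_A resident, Index_B read from disk

print(round(both_resident, 1))    # almost three times longer
print(round(neither_resident, 1)) # over 3.5 times longer
print(round(only_a_resident))     # roughly 53 times longer
```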

Summary

Although in our thought exercise, the optimizer can use buffer pool knowledge to compile the most efficient query at a single instant, it would be a dangerous way to drive plan compilation because of the potential volatility of the buffer pool contents, making the future efficiency of the cached plan highly unreliable.

Remember, the optimizer’s job is to find a good plan fast, not necessarily the single best plan for 100% of all situations. In my opinion, the SQL Server optimizer does the right thing by ignoring the actual contents of the SQL Server buffer pool, relying instead on its costing rules to produce a query plan that is likely to be the most efficient most of the time.

The Adaptive Join Threshold https://sqlperformance.com/2021/11/sql-performance/adaptive-join-threshold https://sqlperformance.com/2021/11/sql-performance/adaptive-join-threshold#respond Thu, 04 Nov 2021 09:00:37 +0000 https://sqlperformance.com/?p=11080 Paul White discusses how an adaptive join decides to use a batch mode hash join or row mode apply and includes details of the threshold calculation.

The post The Adaptive Join Threshold appeared first on SQLPerformance.com.

First introduced in SQL Server 2017 Enterprise Edition, an adaptive join enables a runtime transition from a batch mode hash join to a row mode correlated nested loops indexed join (apply). For brevity, I’ll refer to a “correlated nested loops indexed join” as an apply throughout the rest of this article. If you need a refresher on the difference between nested loops and apply, please see my previous article.

Whether an adaptive join transitions from a hash join to apply at runtime depends on a value labelled Adaptive Threshold Rows on the Adaptive Join execution plan operator. This article shows how an adaptive join works, includes details of the threshold calculation, and covers the implications of some of the design choices made.

Introduction

One thing I want you to bear in mind throughout this piece is an adaptive join always starts executing as a batch mode hash join. This is true even if the execution plan indicates the adaptive join expects to run as a row mode apply.

Like any hash join, an adaptive join reads all rows available on its build input and copies the required data into a hash table. The batch mode flavour of hash join stores these rows in an optimized format, and partitions them using one or more hash functions. Once the build input has been consumed, the hash table is fully populated and partitioned, ready for the hash join to start checking probe-side rows for matches.

This is the point where an adaptive join makes the decision to proceed with the batch mode hash join or to transition to a row mode apply. If the number of rows in the hash table is less than the threshold value, the join switches to an apply; otherwise, the join continues as a hash join by starting to read rows from the probe input.
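The runtime decision itself is simple once the build side is complete (a minimal sketch of the behaviour just described, not of the real batch mode operator):

```python
def choose_join_strategy(build_rows, threshold):
    """Consume the build input into a hash table first (this always
    happens), then pick the join strategy from the actual row count."""
    hash_table = {}
    for key, payload in build_rows:
        hash_table.setdefault(key, []).append(payload)
    row_count = sum(len(rows) for rows in hash_table.values())
    if row_count < threshold:
        return "row mode apply"          # build rows will be replayed to drive the apply
    return "batch mode hash join"        # continue by reading the probe input

print(choose_join_strategy([(i, "x") for i in range(10)], threshold=1502))
```

Note the build cost is paid either way; only the probe-side strategy is adaptive.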

If a transition to an apply join occurs, the execution plan doesn’t reread the rows used to populate the hash table to drive the apply operation. Instead, an internal component known as an adaptive buffer reader expands the rows already stored in the hash table and makes them available on-demand to the outer input of the apply operator. There’s a cost associated with the adaptive buffer reader, but it’s much lower than the cost of completely rewinding the build input.

Choosing an Adaptive Join

Query optimization involves one or more stages of logical exploration and physical implementation of alternatives. At each stage, when the optimizer explores the physical options for a logical join, it might consider both batch mode hash join and row mode apply alternatives.

If one of those physical join options forms part of the cheapest solution found during the current stage—and the other type of join can deliver the same required logical properties—the optimizer marks the logical join group as potentially suitable for an adaptive join. If not, consideration of an adaptive join ends here (and no adaptive join extended event is fired).

The normal operation of the optimizer means the cheapest solution found will only include one of the physical join options—either hash or apply, whichever had the lowest estimated cost. The next thing the optimizer does is build and cost a fresh implementation of the type of join that wasn’t chosen as cheapest.

Since the current optimization phase has already ended with a cheapest solution found, a special single-group exploration and implementation round is performed for the adaptive join. Finally, the optimizer calculates the adaptive threshold.

If any of the preceding work is unsuccessful, the extended event adaptive_join_skipped is fired with a reason.

If the adaptive join processing is successful, a Concat operator is added to the internal plan above the hash and apply alternatives with the adaptive buffer reader and any required batch/row mode adapters. Remember, only one of the join alternatives will execute at runtime, depending on the number of rows actually encountered compared with the adaptive threshold.

The Concat operator and individual hash/apply alternatives aren’t normally shown in the final execution plan. We’re instead presented with a single Adaptive Join operator. This is just a presentation decision—the Concat and joins are still present in the code run by the SQL Server execution engine. You can find more details about this in the Appendix and Related Reading sections of this article.

The Adaptive Threshold

An apply is generally cheaper than a hash join for a smaller number of driving rows. The hash join has an extra startup cost to build its hash table but a lower per-row cost when it starts probing for matches.

There's generally a point where the estimated cost of an apply and hash join will be equal. This idea was nicely illustrated by Joe Sack in his article, Introducing Batch Mode Adaptive Joins:

Adaptive join threshold graph

Calculating the Threshold

At this point, the optimizer has a single estimate for the number of rows entering the build input of the hash join and apply alternatives. It also has the estimated cost of the hash and apply operators as a whole.

This gives us a single point on the extreme right edge of the orange and blue lines in the diagram above. The optimizer needs another point of reference for each join type so it can “draw the lines” and find the intersection (it doesn’t literally draw lines, but you get the idea).

To find a second point for the lines, the optimizer asks the two joins to produce a new cost estimate based on a different (and hypothetical) input cardinality. If the first cardinality estimate was more than 100 rows, it asks the joins to estimate new costs for one row. If the original cardinality was less than or equal to 100 rows, the second point is based on an input cardinality of 10,000 rows (so there’s a decent enough range to extrapolate).

In any case, the result is two different costs and row counts for each join type, allowing the lines to be “drawn.”

The Intersection Formula

Finding the intersection of two lines based on two points for each line is a problem with several well-known solutions. SQL Server uses one based on determinants as described on Wikipedia:

Line intersection formula from two points

where:

Formula for D

The first line is defined by the points (x1, y1) and (x2, y2). The second line is given by the points (x3, y3) and (x4, y4). The intersection is at (Px, Py).

Our scheme has the number of rows on the x-axis and the estimated cost on the y-axis. We’re interested in the number of rows where the lines intersect. This is given by the formula for Px. If we wanted to know the estimated cost at the intersection, it would be Py.

For Px rows, the estimated costs of the apply and hash join solutions would be equal. This is the adaptive threshold we need.
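Translated into code, the determinant form looks like this (with x as row count and y as estimated cost, matching the scheme above):

```python
def intersection_x(p1, p2, p3, p4):
    """x-coordinate where the line through points p1 and p2 meets the
    line through points p3 and p4, using the determinant formula."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    numerator = (x1 * y2 - y1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * y4 - y3 * x4)
    denominator = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    return numerator / denominator

# Simple sanity check: y = x and y = 2 - x cross at x = 1
assert intersection_x((0, 0), (2, 2), (0, 2), (2, 0)) == 1.0
```

Fed with two (rows, cost) points per join type, the returned x value is the adaptive threshold.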

A Worked Example

Here’s an example using the AdventureWorks2017 sample database and the following indexing trick by Itzik Ben-Gan to get unconditional consideration of batch mode execution:

-- Itzik's trick
CREATE NONCLUSTERED COLUMNSTORE INDEX BatchMode
ON Sales.SalesOrderHeader (SalesOrderID)
WHERE SalesOrderID = -1
AND SalesOrderID = -2;

-- Test query
SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123;

The execution plan shows an adaptive join with a threshold of 1502.07 rows:

Adaptive join plan with a threshold of 1502.07 rows

The estimated number of rows driving the adaptive join is 31,465.

Join Costs

In this simplified case, we can find estimated subtree costs for the hash and apply join alternatives using hints:

-- Hash
SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123
OPTION (HASH JOIN, MAXDOP 1);

Hash join subtree cost 1.05083

-- Apply
SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123
OPTION (LOOP JOIN, MAXDOP 1);

Apply cost 10.0552

This gives us one point on the line for each join type:

  • 31,465 rows

    • Hash cost 1.05083
    • Apply cost 10.0552

The Second Point on the Line

Since the estimated number of rows is more than 100, the second reference points come from special internal estimates based on one join input row. Unfortunately, there’s no easy way to obtain the exact cost numbers for this internal calculation (I’ll talk more about this shortly).

For now, I’ll just show you the cost numbers (using the full internal precision rather than the six significant figures presented in execution plans):

  • One row (internal calculation)

    • Hash cost 0.999027422729
    • Apply cost 0.547927305023
  • 31,465 rows

    • Hash cost 1.05082787359
    • Apply cost 10.0552890166

As expected, the apply join is cheaper than the hash for a small input cardinality but much more expensive for the expected cardinality of 31,465 rows.

The Intersection Calculation

Plugging these cardinality and cost numbers into the line intersection formula gives you the following:

-- Hash points (x = cardinality; y = cost)

DECLARE 
    @x1 float = 1, 
    @y1 float = 0.999027422729,
    @x2 float = 31465, 
    @y2 float = 1.05082787359;

-- Apply points (x = cardinality; y = cost)

DECLARE
    @x3 float = 1, 
    @y3 float = 0.547927305023,
    @x4 float = 31465, 
    @y4 float = 10.0552890166;

-- Formula:

SELECT Threshold =
    (
        (@x1 * @y2 - @y1 * @x2) * (@x3 - @x4) - 
        (@x1 - @x2) * (@x3 * @y4 - @y3 * @x4)
    )
    /
    (
        (@x1 - @x2) * (@y3 - @y4) - 
        (@y1 - @y2) * (@x3 - @x4)
    );

-- Returns 1502.06521571273

Rounded to six significant figures, this result matches the 1502.07 rows shown in the adaptive join execution plan:

Adaptive join plan with a threshold of 1502.07 rows

Defect or Design?

Remember, SQL Server needs four points to “draw” the row count versus cost lines to find the adaptive join threshold. In the present case, this means finding cost estimations for the one-row and 31,465-row cardinalities for both apply and hash join implementations.

The optimizer calls a routine named sqllang!CuNewJoinEstimate to calculate these four costs for an adaptive join. Sadly, there are no trace flags or extended events to provide a handy overview of this activity. The normal trace flags used to investigate optimizer behaviour and display costs don’t function here (see the Appendix if you’re interested in more details).

The only way to obtain the one-row cost estimates is to attach a debugger and set a breakpoint after the fourth call to CuNewJoinEstimate in the code for sqllang!CardSolveForSwitch. I used WinDbg to obtain this call stack on SQL Server 2019 CU12:

Call stack

At this point in the code, double-precision floating-point costs are stored in four memory locations pointed to by addresses at rsp+b0, rsp+d0, rsp+30, and rsp+28 (where rsp is a CPU register and offsets are in hexadecimal):

Debugger output

The operator subtree cost numbers shown match those used in the adaptive join threshold calculation formula.

About Those One-Row Cost Estimates

You may have noticed the estimated subtree costs for the one-row joins seem quite high for the amount of work involved in joining one row:

  • One row

    • Hash cost 0.999027422729
    • Apply cost 0.547927305023

If you try to produce one-row input execution plans for the hash join and apply examples, you’ll see much lower estimated subtree costs at the join than those shown above. Likewise, running the original query with a row goal of one (or the number of join output rows expected for an input of one row) will also produce an estimated cost way lower than shown.

The reason is the CuNewJoinEstimate routine estimates the one-row case in a way I think most people wouldn’t find intuitive.

The final cost is made up of three main components:

  1. The build input subtree cost
  2. The local cost of the join
  3. The probe input subtree cost

Items 2 and 3 depend on the type of join. For a hash join, they account for the cost of reading all the rows from the probe input, matching them (or not) with the one row in the hash table, and passing the results on to the next operator. For an apply, the costs cover one seek on the lower input to the join, the internal cost of the join itself, and returning the matched rows to the parent operator.

None of this is unusual or surprising.

The Cost Surprise

The surprise comes on the build side of the join (item 1 in the list). One might expect the optimizer to do some fancy calculation to scale the already-calculated subtree cost for 31,465 rows down to one average row, or something like that.

In fact, both hash and apply one-row join estimates simply use the whole subtree cost for the original cardinality estimate of 31,465 rows. In our running example, this “subtree” is the 0.54456 cost of the batch mode clustered index seek on the header table:

Adaptive join build input cost

To be clear: the build-side estimated costs for the one-row join alternatives use an input cost calculated for 31,465 rows. That should strike you as a bit odd.

As a reminder, the one-row costs computed by CuNewJoinEstimate were as follows:

  • One row
    • Hash cost 0.999027422729
    • Apply cost 0.547927305023

You can see the total apply cost (~0.54793) is dominated by the 0.54456 build-side subtree cost, with a tiny extra amount for the single inner-side seek, processing the small number of resulting rows within the join, and passing them on to the parent operator.

The estimated one-row hash join cost is higher because the probe side of the plan consists of a full index scan, where all resulting rows must pass through the join. The total cost of the one-row hash join is a little lower than the original 1.05095 cost for the 31,465-row example because there’s now only one row in the hash table.
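A quick back-of-the-envelope check makes this concrete. The arithmetic below uses the displayed (rounded) 0.54456 build-side cost, so the remainders are approximate, but it shows how much of each one-row estimate is just the unscaled build subtree:

```python
build_subtree = 0.54456            # clustered index seek, costed for 31,465 rows

apply_one_row = 0.547927305023
hash_one_row = 0.999027422729

# What remains once the (unscaled) build-side subtree cost is removed
print(apply_one_row - build_subtree)  # ~0.00337: one inner-side seek plus join overhead
print(hash_one_row - build_subtree)   # ~0.45447: mostly the full probe-side index scan
```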

Implications

One would expect a one-row join estimate to be based, in part, on the cost of delivering one row to the driving join input. As we’ve seen, this isn’t the case for an adaptive join: both apply and hash alternatives are saddled with the full estimated cost for 31,465 rows. The rest of the join is costed pretty much as one would expect for a one-row build input.

This intuitively strange arrangement is why it’s difficult (perhaps impossible) to show an execution plan mirroring the calculated costs. We’d need to construct a plan delivering 31,465 rows to the upper join input but costing the join itself and its inner input as if only one row were present. A tough ask.

The effect of all this is to raise the leftmost point on our intersecting-lines diagram up the y-axis. This affects the slope of the line and therefore the intersection point.

Another practical effect is the calculated adaptive join threshold now depends on the original cardinality estimate at the hash build input, as noted by Joe Obbish in his 2017 blog post. For example, if we change the WHERE clause in the test query to SOH.SalesOrderID <= 55000, the adaptive threshold reduces from 1502.07 to 1259.8 without changing the query plan hash. Same plan, different threshold.

This arises because, as we’ve seen, the internal one-row cost estimate depends on the build input cost for the original cardinality estimate. This means different initial build-side estimates will give a different y-axis “boost” to the one-row estimate. In turn, the line will have a different slope and a different intersection point.

Intuition would suggest the one-row estimate for the same join should always give the same value regardless of the other cardinality estimate on the line (given the exact same join with the same properties and row sizes has a close-to-linear relationship between driving rows and cost). This isn’t the case for an adaptive join.

By Design?

I can tell you with some confidence what SQL Server does when calculating the adaptive join threshold. I don’t have any special insight as to why it does it this way.

Still, there are some reasons to think this arrangement is deliberate and came about after due consideration and feedback from testing. The remainder of this section covers some of my thoughts on this aspect.

An adaptive join isn’t a straight choice between a normal apply and batch mode hash join. An adaptive join always starts by fully populating the hash table. Only once this work is complete is the decision made to switch to an apply implementation or not.

By this time, we’ve already incurred potentially significant cost by populating and partitioning the hash join in memory. This may not matter much for the one-row case, but it becomes progressively more important as cardinality increases. The unexpected “boost” may be a way to incorporate these realities into the calculation while retaining a reasonable computation cost.

The SQL Server cost model has long been a bit biased against nested loops join, arguably with some justification. Even the ideal indexed apply case can be slow in practice if the data needed isn’t already in memory, and the I/O subsystem isn’t flash, especially with a somewhat random access pattern. Limited amounts of memory and sluggish I/O won’t be entirely unfamiliar to users of lower-end cloud-based database engines, for example.

It’s possible practical testing in such environments revealed an intuitively costed adaptive join was too quick to transition to an apply. Theory is sometimes only great in theory.

Still, the current situation isn’t ideal; caching a plan based on an unusually low cardinality estimate will produce an adaptive join much more reluctant to switch to an apply than it would’ve been with a larger initial estimate. This is a variety of the parameter-sensitivity problem, but it’ll be a new consideration of this type for many of us.

Now, it’s also possible using the full build input subtree cost for the leftmost point of the intersecting cost lines is simply an uncorrected error or oversight. My feeling is the current implementation is probably a deliberate practical compromise, but you’d need someone with access to the design documents and source code to know for sure.

Summary

An adaptive join allows SQL Server to transition from a batch mode hash join to an apply after the hash table has been fully populated. It makes this decision by comparing the number of rows in the hash table with a precalculated adaptive threshold.

The threshold is computed by predicting where apply and hash join costs are equal. To find this point, SQL Server produces a second internal join cost estimate for a different build input cardinality—normally, one row.

Surprisingly, the estimated cost for the one-row estimate includes the full build-side subtree cost for the original cardinality estimate (not scaled to one row). This means the threshold value depends on the original cardinality estimate at the build input.

Consequently, an adaptive join may have an unexpectedly low threshold value, meaning the adaptive join is much less likely to transition away from a hash join. It’s unclear if this behaviour is by design.

Related Reading

Appendix

This section covers a couple of adaptive join aspects that were difficult to include in the main text in a natural way.

The Expanded Adaptive Plan

You might try looking at a visual representation of the internal plan using undocumented trace flag 9415, as provided by Dima Pilugin in his excellent adaptive join internals article linked above. With this flag active, the adaptive join plan for our running example becomes the following:

Execution plan shown with TF 9415

This is a useful representation to aid understanding, but it isn’t entirely accurate, complete, or consistent. For example, the Table Spool doesn’t exist—it’s a default representation for the adaptive buffer reader reading rows directly from the batch mode hash table.

The operator properties and cardinality estimates are also a bit all over the place. The output from the adaptive buffer reader (“spool”) should be 31,465 rows, not 121,317. The subtree cost of the apply is incorrectly capped by the parent operator cost. This is normal for showplan, but it makes no sense in an adaptive join context.

There are other inconsistencies as well—too many to usefully list— but that can happen with undocumented trace flags. The expanded plan shown above isn’t intended for use by end users, so perhaps it isn’t entirely surprising. The message here is not to rely too heavily on the numbers and properties shown in this undocumented form.

I should also mention in passing the finished standard adaptive join plan operator isn’t entirely without its own consistency issues. These stem pretty much exclusively from the hidden details.

For example, the displayed adaptive join properties come from a mixture of the underlying Concat, Hash Join, and Apply operators. You can see an adaptive join reporting batch mode execution for a nested loops join (which is impossible), and the elapsed time shown is actually copied from the hidden Concat, not the particular join that executed at runtime.

The Usual Suspects

We can get some useful information from the sorts of undocumented trace flags normally used to look at optimizer output. For example:

SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123
OPTION (
    QUERYTRACEON 3604,
    QUERYTRACEON 8607,
    QUERYTRACEON 8612);

Output (heavily edited for readability):

*** Output Tree: ***

PhyOp_ExecutionModeAdapter(BatchToRow) Card=121317 Cost=1.05095
    PhyOp_Concat (batch) Card=121317 Cost=1.05325
        PhyOp_HashJoinx_jtInner (batch) Card=121317 Cost=1.05083
            PhyOp_Range Sales.SalesOrderHeader Card=31465 Cost=0.54456
            PhyOp_Filter(batch) Card=121317 Cost=0.397185
                PhyOp_Range Sales.SalesOrderDetail Card=121317 Cost=0.338953
        PhyOp_ExecutionModeAdapter(RowToBatch) Card=121317 Cost=10.0798
            PhyOp_Apply Card=121317 Cost=10.0553
                PhyOp_ExecutionModeAdapter(BatchToRow) Card=31465 Cost=0.544623
                    PhyOp_Range Sales.SalesOrderHeader Card=31465 Cost=0.54456
                PhyOp_Filter Card=3.85562 Cost=9.00356
                    PhyOp_Range Sales.SalesOrderDetail Card=3.85562 Cost=8.94533

This gives some insight into the estimated costs for the full-cardinality case with hash and apply alternatives without writing separate queries and using hints. As mentioned in the main text, these trace flags aren’t effective within CuNewJoinEstimate, so we can’t directly see the repeat calculations for the 31,465-row case or any of the details for the one-row estimates this way.

Merge Join and Row Mode Hash Join

Adaptive joins only offer a transition from batch mode hash join to row mode apply. For the reasons why row mode hash join isn’t supported, see the Intelligent Query Processing Q&A in the Related Reading section. In short, it’s thought row mode hash joins would be too prone to performance regressions.

Switching to a row mode merge join would be another option, but the optimizer doesn’t currently consider this. As I understand it, it’s unlikely to be expanded in this direction in future.

Some of the considerations are the same as they are for row mode hash join. In addition, merge join plans tend to be less easily interchangeable with hash join, even if we limit ourselves to indexed merge join (no explicit sort).

There’s also a much greater distinction between hash and apply than there is between hash and merge. Both hash and merge are suitable for larger inputs, and apply is better suited to a smaller driving input. Merge join isn’t as easily parallelized as hash join and doesn’t scale as well with increasing thread counts.

Given the motivation for adaptive joins is to cope better with significantly varying input sizes—and only hash join supports batch mode processing—the choice of batch hash versus row apply is the more natural one. Finally, having three adaptive join choices would significantly complicate the threshold calculation for potentially little gain.

The post The Adaptive Join Threshold appeared first on SQLPerformance.com.

Finding Distinct Values Quickly https://sqlperformance.com/2020/03/sql-performance/finding-distinct-values-quickly https://sqlperformance.com/2020/03/sql-performance/finding-distinct-values-quickly#comments Tue, 17 Mar 2020 09:00:25 +0000 https://sqlperformance.com/?p=10369 Paul White explains a few different ways to retrieve distinct values from a table, including some big "it just runs faster" wins in SQL Server 2019.

The post Finding Distinct Values Quickly appeared first on SQLPerformance.com.

Back in 2014, I wrote an article called Performance Tuning the Whole Query Plan. It looked at ways to find a relatively small number of distinct values from a moderately large dataset, and concluded that a recursive solution could be optimal. This follow-up post revisits the question for SQL Server 2019, using a larger number of rows.

Test Environment

I will be using the 50GB Stack Overflow 2013 database, but any large table with a low number of distinct values would do.

I will be looking for distinct values in the BountyAmount column of the dbo.Votes table, presented in bounty amount order ascending. The Votes table has just under 53 million rows (52,928,720 to be exact). There are just 19 different bounty amounts, including NULL.

The Stack Overflow 2013 database comes without nonclustered indexes to minimize download time. There is a clustered primary key index on the Id column of the dbo.Votes table. It comes set to SQL Server 2008 compatibility (level 100), but we will start with a more modern setting of SQL Server 2017 (level 140):

ALTER DATABASE StackOverflow2013
SET COMPATIBILITY_LEVEL = 140;

The tests were performed on my laptop using SQL Server 2019 CU 2. This machine has four i7 CPUs (hyperthreaded to 8) with a base speed of 2.4GHz. It has 32GB RAM, with 24GB available to the SQL Server 2019 instance. The cost threshold for parallelism is set to 50.

Each test result represents the best of ten runs, with all required data and index pages in memory.

1. Row Store Clustered Index

To set a baseline, the first run is a serial query without any new indexing (and remember this is with the database set to compatibility level 140):

SELECT DISTINCT 
    V.BountyAmount 
FROM dbo.Votes AS V 
ORDER BY
    V.BountyAmount
OPTION (MAXDOP 1);

This scans the clustered index and uses a row-mode hash aggregate to find the distinct values of BountyAmount:

Serial Clustered Index Plan

This takes 10,500ms to complete, using the same amount of CPU time. Remember this is the best time over ten runs with all the data in memory. Automatically-created sampled statistics on the BountyAmount column were created on the first run.

About half of the elapsed time is spent on the Clustered Index Scan, and about half on the Hash Match Aggregate. The Sort only has 19 rows to process, so it consumes only 1ms or so. All the operators in this plan use row mode execution.

Removing the MAXDOP 1 hint gives a parallel plan:

Parallel Clustered Index Plan

This is the plan the optimizer chooses without any hints in my configuration. It returns results in 4,200ms using a total of 32,800ms of CPU (at DOP 8).

2. Nonclustered Index

Scanning the whole table to find just the BountyAmount seems inefficient, so it is natural to try adding a nonclustered index on just the one column this query needs:

CREATE NONCLUSTERED INDEX b 
ON dbo.Votes (BountyAmount);

This index takes a while to create (1m 40s). The MAXDOP 1 query now uses a Stream Aggregate because the optimizer can use the nonclustered index to present rows in BountyAmount order:

Serial Nonclustered Row Mode Plan

This runs for 9,300ms (consuming the same amount of CPU time). A useful improvement on the original 10,500ms but hardly earth-shattering.

Removing the MAXDOP 1 hint gives a parallel plan with local (per-thread) aggregation:

Parallel Nonclustered Row Mode Plan

This executes in 3,400ms using 25,800ms of CPU time. We might be able to do better with row or page compression on the new index, but I want to move on to more interesting options.

3. Batch Mode on Row Store (BMOR)

Now set the database to SQL Server 2019 compatibility mode using:

ALTER DATABASE StackOverflow2013
SET COMPATIBILITY_LEVEL = 150;

This gives the optimizer freedom to choose batch mode on row store if it deems it worthwhile. This can provide some of the benefits of batch mode execution without requiring a column store index. For deep technical details and undocumented options, see Dmitry Pilugin’s excellent article on the subject.

Unfortunately, the optimizer still chooses fully row mode execution using stream aggregates for both the serial and parallel test queries. In order to get a batch mode on row store execution plan, we can add a hint to encourage aggregation using hash match (which can run in batch mode):

SELECT DISTINCT
    V.BountyAmount
FROM dbo.Votes AS V
ORDER BY
    V.BountyAmount
OPTION (HASH GROUP, MAXDOP 1);

This gives us a plan with all operators running in batch mode:

Serial Batch Mode on Row Store Plan

Results are returned in 2,600ms (all CPU as usual). This is already faster than the parallel row mode plan (3,400ms elapsed) while using far less CPU (2,600ms versus 25,800ms).

Removing the MAXDOP 1 hint (but keeping HASH GROUP) gives a parallel batch mode on row store plan:

Parallel Batch Mode on Row Store Plan

This runs in just 725ms using 5,700ms of CPU.

4. Batch Mode on Column Store

The parallel batch mode on row store result is an impressive improvement. We can do even better by providing a column store representation of the data. To keep everything else the same, I will add a nonclustered column store index on just the column we need:

CREATE NONCLUSTERED COLUMNSTORE INDEX nb 
ON dbo.Votes (BountyAmount);

This is populated from the existing b-tree nonclustered index and takes only 15s to create.

SELECT DISTINCT
    V.BountyAmount
FROM dbo.Votes AS V
ORDER BY V.BountyAmount
OPTION (MAXDOP 1);

The optimizer chooses a fully batch mode plan including a column store index scan:

Serial Column Store Plan

This runs in 115ms using the same amount of CPU time. The optimizer chooses this plan without any hints on my system configuration because the estimated cost of the plan is below the cost threshold for parallelism.

To get a parallel plan, we can either reduce the cost threshold or use an undocumented hint:

SELECT DISTINCT
    V.BountyAmount
FROM dbo.Votes AS V
ORDER BY
    V.BountyAmount
OPTION (USE HINT ('ENABLE_PARALLEL_PLAN_PREFERENCE'));

In any case, the parallel plan is:

Parallel Column Store Plan

Query elapsed time is now down to 30ms, while consuming 210ms of CPU.

5. Batch Mode on Column Store with Push Down

The current best execution time of 30ms is impressive, especially when compared to the original 10,500ms. Nevertheless, it is a bit of a shame that we have to pass nearly 53 million rows (in 58,868 batches) from the Scan to the Hash Match Aggregate.

It would be nice if SQL Server could push the aggregation down into the scan and just return distinct values from the column store directly. You might think we need to express the DISTINCT as a GROUP BY to get Grouped Aggregate Pushdown, but that is logically redundant and not the whole story in any case.

With the current SQL Server implementation, we actually need to compute an aggregate to activate aggregate pushdown. More than that, we need to use the aggregate result somehow, or the optimizer will just remove it as unnecessary.

One way to write the query to achieve aggregate pushdown is to add a logically redundant secondary ordering requirement:

SELECT 
    V.BountyAmount
FROM dbo.Votes AS V 
GROUP BY
    V.BountyAmount
ORDER BY 
    V.BountyAmount,
    COUNT_BIG(*)    -- New!
OPTION (MAXDOP 1);

The serial plan is now:

Serial Column Store Plan with Aggregate Push Down

Notice that no rows are passed from the Scan to the Aggregate! Under the covers, partial aggregates of the BountyAmount values and their associated row counts are passed to the Hash Match Aggregate, which sums the partial aggregates to form the required final (global) aggregate. This is very efficient, as confirmed by the elapsed time of 13ms (all of which is CPU time). As a reminder, the previous serial plan took 115ms.
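The partial/global split can be illustrated in miniature. The Python sketch below is purely illustrative of the idea (it is not how column store segments actually store data, and the bounty values are invented): each "segment" produces (value, count) partial aggregates inside the scan, and a global step sums them.

```python
from collections import Counter

def grouped_aggregate_pushdown(segments):
    # Partial aggregates computed "inside the scan", one per segment
    partials = [Counter(segment) for segment in segments]
    # Global hash aggregate: sum the partial counts per distinct value
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    return sorted(totals.items())

segments = [[50, 50, 100], [100, 100, 25], [50, 25, 25]]
print(grouped_aggregate_pushdown(segments))
# [(25, 3), (50, 3), (100, 3)]
```

Only the small (value, count) pairs cross from the scan to the aggregate, which is why the real plan shows zero rows on that arrow.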

To complete the set, we can get a parallel version in the same way as before:

Parallel Column Store Plan with Aggregate Push Down

This runs for 7ms using 40ms of CPU in total.

It is a shame we need to compute and use an aggregate we don’t need just to get push down. Perhaps this will be improved in future so that DISTINCT and GROUP BY without aggregates can be pushed down to the scan.

6. Row Mode Recursive Common Table Expression

At the beginning, I promised to revisit the recursive CTE solution used to find a small number of duplicates in a large data set. Implementing the current requirement using that technique is quite straightforward, though the code is necessarily longer than anything we have seen up to this point:

WITH R AS
(
    -- Anchor
    SELECT
        V.BountyAmount
    FROM dbo.Votes AS V
    ORDER BY 
        V.BountyAmount ASC
        OFFSET 0 ROWS
        FETCH FIRST 1 ROW ONLY

    UNION ALL

    -- Recursive
    SELECT
        Q1.BountyAmount
    FROM 
    (
        SELECT 
            V.BountyAmount,
            rn = ROW_NUMBER() OVER (
                ORDER BY V.BountyAmount ASC)
        FROM R
        JOIN dbo.Votes AS V
            ON V.BountyAmount > ISNULL(R.BountyAmount, -1)
    ) AS Q1
    WHERE
        Q1.rn = 1
)
SELECT
    R.BountyAmount
FROM R
ORDER BY
    R.BountyAmount ASC
OPTION (MAXRECURSION 0);

The idea has its roots in a so-called index skip scan: We find the lowest value of interest at the beginning of the ascending-ordered b-tree index, then seek to find the next value in index order, and so on. The structure of a b-tree index makes finding the next highest value very efficient — there is no need to scan through the duplicates.
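The skip-scan idea can be expressed compactly over any sorted sequence. The Python sketch below is an analogy only: `bisect_right` plays the role of the b-tree seek to the next higher key, so each distinct value costs one "seek" rather than a scan through its duplicates.

```python
from bisect import bisect_right

def distinct_via_skip_scan(sorted_values):
    # sorted_values stands in for an ascending index on BountyAmount
    found = []
    i, n = 0, len(sorted_values)
    while i < n:
        value = sorted_values[i]
        found.append(value)
        # "Seek" straight past every duplicate to the next higher value
        i = bisect_right(sorted_values, value, i)
    return found

print(distinct_via_skip_scan([0, 0, 0, 50, 50, 100, 100, 100]))
# [0, 50, 100]
```

The number of "seeks" is the number of distinct values, not the number of rows, which is why the technique wins so decisively when there are only 19 distinct values in 53 million rows.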

The only real trick here is convincing the optimizer to allow us to use TOP in the ‘recursive’ part of the CTE to return one copy of each distinct value. Please see my previous article if you need a refresher on the details.

The execution plan (explained in general by Craig Freedman here) is:

Serial Recursive CTE

The query returns correct results in 1ms using 1ms of CPU, according to Sentry One Plan Explorer.

7. Iterative T-SQL

Similar logic can be expressed using a WHILE loop. The code may be easier to read and understand than the recursive CTE. It also avoids having to use tricks to work around the many restrictions on the recursive part of a CTE. Performance is competitive at around 15ms. This code is provided for interest and is not included in the performance summary table.

SET NOCOUNT ON;
SET STATISTICS XML OFF;

DECLARE @Result table
(
    BountyAmount integer NULL UNIQUE CLUSTERED
);

DECLARE @BountyAmount integer;

-- First value in index order
WITH U AS
(
    SELECT
        V.BountyAmount
    FROM dbo.Votes AS V
    ORDER BY 
        V.BountyAmount ASC
        OFFSET 0 ROWS
        FETCH NEXT 1 ROW ONLY
)
UPDATE U
SET @BountyAmount = U.BountyAmount
OUTPUT Inserted.BountyAmount
    INTO @Result (BountyAmount);

-- Next higher value
WHILE @@ROWCOUNT > 0
BEGIN
    WITH U AS
    (
        SELECT
            V.BountyAmount
        FROM dbo.Votes AS V
        WHERE
            V.BountyAmount > ISNULL(@BountyAmount, -1)
        ORDER BY 
            V.BountyAmount ASC
            OFFSET 0 ROWS
            FETCH NEXT 1 ROW ONLY
    )
    UPDATE U
    SET @BountyAmount = U.BountyAmount
    OUTPUT Inserted.BountyAmount 
        INTO @Result (BountyAmount);
END;

-- Accumulated results
SELECT
    R.BountyAmount
FROM @Result AS R 
ORDER BY
    R.BountyAmount;

Performance Summary Table

Performance Summary Table (duration / CPU in milliseconds)

The “Est. Cost” column shows the optimizer’s cost estimate for each query as reported on the test system.

Final Thoughts

Finding a small number of distinct values might seem like quite a specific requirement, but I have come across it fairly frequently over the years, usually as part of tuning a larger query.

The last several examples were quite close in performance. Many people would be happy with any of the sub-second results, depending on priorities. Even the serial batch mode on row store result of 2,600ms compares well with the best parallel row mode plans, which bodes well for significant speed-ups just by upgrading to SQL Server 2019 and enabling database compatibility level 150. Not everyone will be able to move to column store storage quickly, and it won’t always be the right solution anyway. Batch mode on row store provides a neat way to achieve some of the gains possible with column stores, assuming you can convince the optimizer to choose to use it.

The parallel column store aggregate push down result of nearly 53 million rows processed in 7ms (using 40ms of CPU) is remarkable, especially considering the hardware. The serial aggregate push down result of 13ms is equally impressive. It would be great if I hadn’t had to add a meaningless aggregate result to get these plans.

For those who cannot yet make the move to SQL Server 2019 or column store storage, the recursive CTE is still a viable and efficient solution when a suitable b-tree index exists, and the number of distinct values needed is guaranteed to be quite small. It would be neat if SQL Server could access a b-tree like this without needing to write a recursive CTE (or equivalent iterative looping T-SQL code using WHILE).

Another possible solution for the problem is to create an indexed view. This will provide distinct values with great efficiency. The down side, as usual, is that each change to the underlying table will need to update the row count stored in the materialized view.

Each of the solutions presented here has its place, depending on the requirements. Having a wide range of tools available is generally speaking a good thing when tuning queries. Most of the time, I would choose one of the batch mode solutions because their performance will be quite predictable no matter how many duplicates are present.

The post Finding Distinct Values Quickly appeared first on SQLPerformance.com.

The Eager Index Spool and The Optimizer https://sqlperformance.com/2019/11/sql-performance/eager-index-spool-optimizer Fri, 22 Nov 2019 09:00:25 +0000 https://sqlperformance.com/?p=10093 Paul White looks at the Eager Index Spool execution plan operator, and the optimizer rules capable of adding it into plans.

The post The Eager Index Spool and The Optimizer appeared first on SQLPerformance.com.

Introduction

An Eager Index Spool reads all rows from its child operator into an indexed worktable, before it starts returning rows to its parent operator. In some respects, an eager index spool is the ultimate missing index suggestion, but it is not reported as such.

Cost assessment

Inserting rows into an indexed worktable is relatively low-cost, but not free. The optimizer must consider that the work involved saves more than it costs. For that to work out in the spool’s favour, the plan must be estimated to consume rows from the spool more than once. Otherwise, it might as well skip the spool, and just do the underlying operation that one time.

  • To be accessed more than once, the spool must appear on the inner side of a nested loops join operator.
  • Each iteration of the loop should seek to a particular index spool key value provided by the outer side of the loop.

That means that the join needs to be an apply, not a nested loops join. For the difference between the two, please see my article Apply versus Nested Loops Join.
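The trade-off can be stated as simple arithmetic. In the sketch below, all cost numbers are invented for illustration; only the shape of the comparison reflects the optimizer's reasoning described above.

```python
def spool_worth_it(build_cost, seek_cost, scan_cost, iterations):
    # With the spool: build the indexed worktable once, then one cheap
    # seek per iteration of the apply.
    with_spool = build_cost + iterations * seek_cost
    # Without it: repeat the underlying scan and filter every iteration.
    without_spool = iterations * scan_cost
    return with_spool < without_spool

print(spool_worth_it(0.5, 0.005, 0.1, 1))    # False: a single access never repays the build
print(spool_worth_it(0.5, 0.005, 0.1, 100))  # True: repeated cheap seeks repay it
```

This is why the spool only makes sense on the inner side of an apply, where the same worktable is sought many times.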

Notable features

While an eager index spool may only appear on the inner side of a nested loops apply, it is not a “performance spool”. An eager index spool cannot be disabled with trace flag 8690 or the NO_PERFORMANCE_SPOOL query hint.

Rows inserted into the index spool are not normally pre-sorted into index key order, which can result in index page splits. Undocumented trace flag 9260 can be used to generate a Sort operator before the index spool to avoid this. The downside is the extra sorting cost may dissuade the optimizer from choosing the spool option at all.

SQL Server does not support parallel inserts to a b-tree index. This means that everything below a parallel eager index spool runs on a single thread. The operators below the spool are still (misleadingly) marked with the parallelism icon. One thread is chosen to write to the spool. The other threads wait on EXECSYNC while that completes. Once the spool is populated, it may be read from by parallel threads.

Index spools do not tell the optimizer they support output ordered by the spool’s index keys. If sorted output from the spool is required, you may see an unnecessary Sort operator. Eager index spools should often be replaced by a permanent index anyway, so this is a minor concern much of the time.

There are five optimizer rules that can generate an Eager Index Spool option (known internally as an index on-the-fly). We will look at three of these in detail to understand where eager index spools come from.

SelToIndexOnTheFly

This is the most common one. It matches one or more relational selections (a.k.a filters or predicates) just above a data-access operator. The SelToIndexOnTheFly rule replaces the predicates with a seek predicate on an eager index spool.

Demo

An AdventureWorks sample database example is shown below:

SELECT
    P.ProductID,
    P.[Name],
    P.SafetyStockLevel,
    TH.Quantity
FROM Production.Product AS P
CROSS APPLY
(
    SELECT MAX(TH.Quantity)
    FROM Production.TransactionHistory AS TH
    WHERE 
        TH.ProductID = P.ProductID
        AND TH.Quantity < P.SafetyStockLevel
    GROUP BY ()
) AS TH (Quantity)
WHERE
    P.[Name] LIKE N'A%';

Eager Index Spool Plan

This execution plan has an estimated cost of 3.0881 units. Some points of interest:

  • The Nested Loops Inner Join operator is an apply, with ProductID and SafetyStockLevel from the Product table as outer references.
  • On the first iteration of the apply, the Eager Index Spool is fully populated from the Clustered Index Scan of the TransactionHistory table.
  • The spool’s worktable has a clustered index keyed on (ProductID, Quantity).
  • Rows matching the predicates TH.ProductID = P.ProductID and TH.Quantity < P.SafetyStockLevel are answered by the spool using its index. This is true for every iteration of the apply, including the first one.
  • The TransactionHistory table is only scanned once.

Sorted input to the spool

It is possible to enforce sorted input to the eager index spool, but this does affect estimated cost, as noted in the introduction. For the example above, enabling the undocumented trace flag produces a plan without a spool:

SELECT
    P.ProductID,
    P.[Name],
    P.SafetyStockLevel,
    TH.Quantity
FROM Production.Product AS P
CROSS APPLY
(
    SELECT
        MAX(TH.Quantity)
    FROM Production.TransactionHistory AS TH
    WHERE 
        TH.ProductID = P.ProductID
        AND TH.Quantity < P.SafetyStockLevel
    GROUP BY ()
) AS TH (Quantity)
WHERE
    P.[Name] LIKE N'A%'
OPTION (QUERYTRACEON 9260);

Query with TF 9260

The estimated cost of this Index Seek and Key Lookup plan is 3.11631 units. This is more than the cost of the plan with an index spool alone, but less than the plan with an index spool and sorted input.

To see a plan with sorted input to the spool, we need to increase the expected number of loop iterations. This gives the spool a chance to repay the extra cost of the Sort. One way to increase the number of rows expected from the Product table is to make the Name predicate less restrictive:

SELECT
    P.ProductID,
    P.[Name],
    P.SafetyStockLevel,
    TH.Quantity
FROM Production.Product AS P
CROSS APPLY
(
    SELECT
        MAX(TH.Quantity)
    FROM Production.TransactionHistory AS TH
    WHERE 
        TH.ProductID = P.ProductID
        AND TH.Quantity < P.SafetyStockLevel
    GROUP BY ()
) AS TH (Quantity)
WHERE
    P.[Name] LIKE N'[A-P]%'
OPTION (QUERYTRACEON 9260);

This gives us an execution plan with sorted input to the spool:

Sorted input to the eager spool

JoinToIndexOnTheFly

This rule transforms an inner join to an apply, with an eager index spool on the inner side. At least one of the join predicates must be an inequality for this rule to be matched.

This is a much more specialized rule than SelToIndexOnTheFly, but the idea is much the same. In this case, the selection (predicate) being transformed to an index spool seek is associated with the join. The transformation from join to apply allows the join predicate to be moved from the join itself to the inner side of the apply.

Demo

SELECT
    P.ProductID,
    P.[Name],
    P.SafetyStockLevel,
    Quantity = MAX(TH.Quantity)
FROM Production.Product AS P
JOIN Production.TransactionHistory AS TH
    ON TH.ProductID = P.ProductID
    AND TH.Quantity < P.SafetyStockLevel
WHERE
    P.[Name] LIKE N'[A-P]%'
GROUP BY
    P.ProductID,
    P.[Name],
    P.SafetyStockLevel
OPTION (LOOP JOIN);

JoinToIndexOnTheFly

As before, we can request sorted input to the spool:

SELECT
    P.ProductID,
    P.[Name],
    P.SafetyStockLevel,
    Quantity = MAX(TH.Quantity)
FROM Production.Product AS P
JOIN Production.TransactionHistory AS TH
    ON TH.ProductID = P.ProductID
    AND TH.Quantity < P.SafetyStockLevel
WHERE
    P.[Name] LIKE N'[A-P]%'
GROUP BY
    P.ProductID,
    P.[Name],
    P.SafetyStockLevel
OPTION (LOOP JOIN, QUERYTRACEON 9260);

JoinToIndexOnTheFly with sorted input

This time, the extra cost of sorting has encouraged the optimizer to choose a parallel plan.

An unwelcome side-effect is the Sort operator spills to tempdb. The total memory grant available for sorting is sufficient, but it is evenly split between parallel threads (as usual). As noted in the introduction, SQL Server does not support parallel inserts to a b-tree index, so the operators below the eager index spool run on a single thread. This single thread only gets a fraction of the memory grant, so the Sort spills to tempdb.

This side-effect is perhaps one reason the trace flag is undocumented and unsupported.

SelSTVFToIdxOnFly

This rule does the same thing as SelToIndexOnTheFly, but for a streaming table-valued function (sTVF) row source. These sTVFs are used extensively internally to implement DMVs and DMFs among other things. They appear in modern execution plans as Table Valued Function operators (originally as remote table scans).

In the past, many of these sTVFs could not accept correlated parameters from an apply. They could accept literals, variables, and module parameters, just not apply outer references. There are still warnings about this in the documentation, but they are somewhat out of date now.

Anyway, the point is that sometimes it is not possible for SQL Server to pass an apply outer reference as a parameter to an sTVF. In that situation, it can make sense to materialize part of the sTVF result in an eager index spool. The present rule provides that ability.

Demo

The next code example shows a DMV query that is successfully converted from a join to an apply. Outer references are passed as parameters to the second DMV:

-- Transformed to an apply
-- Outer reference passed as a parameter
SELECT
    DES.session_id,
    DES.login_time,
    DESWS.waiting_tasks_count
FROM sys.dm_exec_sessions AS DES
JOIN sys.dm_exec_session_wait_stats AS DESWS
    ON DESWS.session_id = DES.session_id
OPTION (FORCE ORDER);

DMV apply

The plan properties of the wait stats TVF show the input parameters. The second parameter value is provided as an outer reference from the sessions DMV:

TVF parameters

It is a shame that sys.dm_exec_session_wait_stats is a view, not a function, because that prevents us from writing an apply directly.

The rewrite below is enough to defeat the internal conversion:

-- Rewrite to avoid TVF parameter trickery
SELECT
    DES.session_id,
    DES.login_time,
    DESWS.waiting_tasks_count
FROM sys.dm_exec_sessions AS DES
JOIN sys.dm_exec_session_wait_stats AS DESWS
    ON DESWS.session_id >= DES.session_id
    AND DESWS.session_id <= DES.session_id
OPTION (FORCE ORDER);

With the session_id predicates now not consumed as parameters, the SelSTVFToIdxOnFly rule is free to convert them to an eager index spool:

SelSTVFToIdxOnFly

I don’t want to leave you with the impression that tricky rewrites are needed to get an eager index spool over a DMV source – it just makes for an easier demo. If you do happen to encounter a query with DMV joins that produces a plan with an eager spool, at least you know how it got there.

You can’t create indexes on DMVs, so you may need to use a hash or merge join if the execution plan does not perform well enough.
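For instance, hinting a hash join on the original equality form of the query avoids the spool entirely. This is a sketch; whether it performs better depends on the row counts involved:

```sql
-- Hash join requires at least one equality join predicate,
-- so this uses the original equijoin form rather than the
-- inequality rewrite shown earlier
SELECT
    DES.session_id,
    DES.login_time,
    DESWS.waiting_tasks_count
FROM sys.dm_exec_sessions AS DES
JOIN sys.dm_exec_session_wait_stats AS DESWS
    ON DESWS.session_id = DES.session_id
OPTION (HASH JOIN, FORCE ORDER);
```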

Recursive CTEs

The remaining two rules are SelIterToIdxOnFly and JoinIterToIdxOnFly. They are direct counterparts to SelToIndexOnTheFly and JoinToIndexOnTheFly for recursive CTE data sources. These are extremely rare in my experience, so I am not going to provide demos for them. (Just so the Iter part of the rule name makes sense: It comes from the fact that SQL Server implements tail recursion as nested iteration.)

When a recursive CTE is referenced multiple times on the inside of an apply, a different rule (SpoolOnIterator) can cache the result of the CTE:

WITH R AS
(
    SELECT 1 AS n 
    UNION ALL
    SELECT R.n + 1 
    FROM R 
    WHERE R.n < 10
)
SELECT
    R1.n
FROM R AS R1
CROSS JOIN R AS R2;

The execution plan features a rare Eager Row Count Spool:

Eager Row Count Spool

Final Thoughts

Eager index spools are often a sign that a useful permanent index is missing from the database schema. This is not always the case, as the streaming table-valued function examples show.
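For the first demo, the spool’s worktable index keyed on (ProductID, Quantity) translates directly into a candidate permanent index (the index name below is illustrative):

```sql
-- A permanent index matching the eager index spool's worktable,
-- letting the optimizer seek directly instead of building a spool
-- on every compilation
CREATE NONCLUSTERED INDEX [IX TransactionHistory ProductID Quantity]
ON Production.TransactionHistory (ProductID, Quantity);
```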

The post The Eager Index Spool and The Optimizer appeared first on SQLPerformance.com.
