T-SQL Queries Archives - SQLPerformance.com
https://sqlperformance.com/category/t-sql-queries
SQL Server performance articles curated by SentryOne

DATE_BUCKET and DATETRUNC Improve Optimization of Time-Based Grouping
https://sqlperformance.com/2022/10/t-sql-queries/date-bucket-datetrunc-improve-time-based-grouping
Tue, 18 Oct 2022

Following the release of SQL Server 2022 RC1, Itzik Ben-Gan explores the new DATE_BUCKET and DATETRUNC functions for time-based grouping.

Time-based grouping and aggregation are common in analyzing data using T-SQL—for example, grouping sales orders by year or by week and computing order counts per group. When you apply time-based grouping, you often group the data by expressions that manipulate date and time columns with functions such as YEAR, MONTH, and DATEPART. Such manipulation typically inhibits the optimizer’s ability to rely on index order. Before SQL Server 2022, there was a workaround that enabled relying on index order, but besides being quite ugly, it had its cost, and the tradeoff wasn’t always acceptable.

SQL Server 2022 introduces new ways to apply time-based grouping using the DATE_BUCKET and DATETRUNC functions. The DATE_BUCKET function, besides being a flexible tool for handling time-series data, allows the optimizer to rely on index order when you use it for grouping purposes. That’s been the case since the function’s inception in the first public preview of SQL Server 2022 (CTP 2.0). Starting with SQL Server 2022 RC1, the same is true of the DATETRUNC function.

I’ll start by presenting the optimization issue found using the older functions, describe the older workaround, and then present the optimization with the newer tools. I’ll also describe cases where you might still need to use the older workaround.

In my examples, I’ll use the sample database TSQLV6. You can download the script file to create and populate this database here and find its ER diagram here.

Important: All testing of examples in this article was done on SQL Server 2022 RC1. Feature availability and optimization could change in future builds of the product.

Optimization of Traditional Time-Based Grouping

Usually, when you group data by unmanipulated columns, SQL Server can apply an order-based group algorithm (Stream Aggregate) that relies on index order. Consider the following query, which I’ll refer to as Query 1, as an example:

USE TSQLV6;

SELECT orderdate, COUNT(*) AS numorders
FROM Sales.Orders
GROUP BY orderdate;

Figure 1 shows the plan for Query 1.

Figure 1: Plan for Query 1

There’s an index called idx_nc_orderdate defined on the Sales.Orders table, with orderdate as the key. This plan scans the index in key order and applies a Stream Aggregate algorithm to the preordered data without requiring explicit sorting.
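Based on the index name and the behavior of the plan, the index definition is essentially the following (shown here for reference; the TSQLV6 installation script creates it as part of the database):

CREATE NONCLUSTERED INDEX idx_nc_orderdate ON Sales.Orders(orderdate);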

However, if you group the data by expressions that manipulate columns, as is the case with most date and time-related functions, this typically inhibits the optimizer’s ability to rely on index order. The optimizer can still rely on an index for coverage purposes, but not on its order.

Consider the following query, which I’ll refer to as Query 2:

SELECT orderyear, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate))) AS D(orderyear)
GROUP BY orderyear;

Theoretically, Microsoft could have added logic to the optimizer to recognize that it could rely on index order here since date order implies year order, but it doesn’t.

Figure 2 shows the plan for Query 2.

Figure 2: Plan for Query 2

When it doesn’t find preordered data, the optimizer can either add explicit sorting to the plan and still use a Stream Aggregate algorithm, or use a Hash Match (Aggregate) algorithm, which has no ordering requirement. In this example, the optimizer opted for the latter.

To verify that the use of a Stream Aggregate algorithm would require explicit sorting, you can force this algorithm with a hint, like so (I’ll call this Query 3):

SELECT orderyear, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate))) AS D(orderyear)
GROUP BY orderyear
OPTION(ORDER GROUP);

Or you can encourage the optimizer to use a Stream Aggregate algorithm by adding a presentation ORDER BY list that is aligned with the grouping set, like so (I’ll call this Query 4):

SELECT orderyear, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate))) AS D(orderyear)
GROUP BY orderyear
ORDER BY orderyear;

Query 3 and Query 4 get similar plans with a Stream Aggregate operator that is preceded by a Sort operator, as shown in Figure 3.

Figure 3: Plan for Query 3 and Query 4

If you’re interested in the details of the costing and scaling of the different grouping algorithms, see my earlier series covering them.

The point is that currently, the optimizer is not encoded with the logic to recognize cases where column order implies manipulated column order, and hence the potential to rely on index order, at least with the traditional grouping expressions.

The same applies to the following query, which groups the orders by year and month (I’ll call this Query 5):

SELECT orderyear, ordermonth, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate), MONTH(orderdate)))
               AS D(orderyear, ordermonth)
GROUP BY orderyear, ordermonth;

And the same applies to the following query, which groups the orders by week, assuming Sunday as the first day of the week (I’ll call this Query 6):

SELECT startofweek, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATEADD(week,
                      DATEDIFF(week, CAST('19000107' AS DATE), orderdate),
                      CAST('19000107' AS DATE)))) AS D(startofweek)
GROUP BY startofweek;

The Old and Ugly Workaround

Before SQL Server 2022, there was no way I could produce a workaround for the above queries that would rely on the order of the original index on orderdate. However, there was a workaround involving the creation of new indexed computed columns, which the optimizer could then rely upon. The workaround involves the following steps:

  1. Create computed columns based on the grouping set expressions
  2. Create an index with the computed columns forming the key

Here’s the code implementing this workaround to support Query 2, as an example:

ALTER TABLE Sales.Orders
ADD corderyear AS YEAR(orderdate);

CREATE NONCLUSTERED INDEX idx_nc_corderyear ON Sales.Orders(corderyear);

Here’s Query 2 again:

SELECT orderyear, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate))) AS D(orderyear)
GROUP BY orderyear;

The new plan for Query 2 is shown in Figure 4.

Figure 4: Plan for Query 2 with Old Workaround

The advantage of this workaround is that no code changes are needed for your original query. Notice that even though you could, you didn’t have to change the query to refer to the computed column name corderyear in order to benefit from the new index. The optimizer applied expression matching and realized that the index on the computed column corderyear is based on the same expression as the grouping expression YEAR(orderdate).

The downside of this workaround is that it requires a DDL change to the table, and the addition of a new index. You don’t always have these options available; if you do, there’s extra space needed for the new index and extra modification overhead.

Run the following code to clean up the new index and computed column:

DROP INDEX idx_nc_corderyear ON Sales.Orders;

ALTER TABLE Sales.Orders DROP COLUMN corderyear;

The New and Elegant Workaround Using the DATE_BUCKET Function

As part of a set of new T-SQL features to support time-series scenarios, SQL Server 2022 introduces support for the DATE_BUCKET function. You can find details about this function in my article Bucketizing date and time data, as well as in Aaron Bertrand’s article My Favorite T-SQL Enhancements in SQL Server 2022.

In brief, the DATE_BUCKET function has the following syntax:

DATE_BUCKET( part, width, datetimevalue[, origin] )

The function returns a date and time value, of the same type as the input date and time value, representing the beginning of the bucket that the input value belongs to. Using the first two parameters of the function, you define the bucket size, e.g., year, 1. You can explicitly define the starting point for the buckets on the timeline using the fourth parameter, origin, or rely on the default of midnight, January 1st, 1900.

As an example, consider the following expression:

SELECT DATE_BUCKET( year, 1, CAST('20220718' AS DATE) ) AS startofyear;

This expression assumes bucket arrangement starting with the default origin January 1st, 1900, with a bucket size of one year. It effectively computes the beginning of year date with respect to the input date value, July 18th, 2022. This expression produces the following output:

startofyear
-----------
2022-01-01
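To see the roles of the width and origin parameters more clearly, consider a variation with a two-year bucket size and an explicit origin (the specific values here are arbitrary, chosen purely for illustration):

SELECT DATE_BUCKET( year, 2, CAST('20220718' AS DATE), CAST('20210101' AS DATE) ) AS bucketstart;

With origin January 1st, 2021 and a bucket size of two years, the buckets cover the ranges [2021-01-01, 2023-01-01), [2023-01-01, 2025-01-01), and so on. The input date July 18th, 2022 falls in the first of these buckets, so the expression returns 2021-01-01.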

The original motivation for adding this function was to support time-series scenarios, where you often need to group data from edge devices such as sensors by date and time buckets, e.g., 15-minute buckets, and apply aggregates per bucket. The pleasant surprise about this function, beyond its flexibility, is how it gets optimized. Unlike the more traditional date and time functions, which, at least at the time of writing, typically inhibit the optimizer’s ability to rely on index order, Microsoft did encode logic in the optimizer enabling it to rely on index order when grouping by the DATE_BUCKET function.
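As a quick sketch of that time-series pattern (the SensorReadings table and its columns are hypothetical, not part of the TSQLV6 sample database):

SELECT DATE_BUCKET(minute, 15, readingtime) AS bucketstart,
  AVG(temperature) AS avgtemp
FROM dbo.SensorReadings
GROUP BY DATE_BUCKET(minute, 15, readingtime);

Each group represents a 15-minute bucket aligned to the default origin, and an index with readingtime as the key could support ordered grouping here, just as in the order-grouping examples that follow.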

So, instead of the original Query 2:

SELECT orderyear, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate))) AS D(orderyear)
GROUP BY orderyear;

You can use the DATE_BUCKET function, like so (I’ll refer to this as Query 2b):

SELECT YEAR(yearbucket) AS orderyear, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATE_BUCKET(year, 1, orderdate))) AS D(yearbucket)
GROUP BY yearbucket;

The plan for Query 2b is shown in Figure 5.

Figure 5: Plan for Query 2b with New Workaround

Similarly, instead of the original Query 5:

SELECT orderyear, ordermonth, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate), MONTH(orderdate)))
               AS D(orderyear, ordermonth)
GROUP BY orderyear, ordermonth;

Use the following (I’ll call this Query 5b):

SELECT
YEAR(yearmonthbucket) AS orderyear, 
MONTH(yearmonthbucket) AS ordermonth, 
COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATE_BUCKET(month, 1, orderdate))) AS D(yearmonthbucket)
GROUP BY yearmonthbucket;

Instead of the original Query 6:

SELECT startofweek, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATEADD(week,
                      DATEDIFF(week, CAST('19000107' AS DATE), orderdate),
                      CAST('19000107' AS DATE)))) AS D(startofweek)
GROUP BY startofweek;

Use the following (I’ll call this Query 6b):

SELECT startofweek, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATE_BUCKET(week, 1, orderdate, CAST('19000107' AS DATE))))
               AS D(startofweek)
GROUP BY startofweek;

Whereas Query 2, Query 5 and Query 6 got plans that could not rely on index order, Query 2b, Query 5b and Query 6b got plans that did.

You get the general idea!

What about the new DATETRUNC function?

SQL Server 2022 also introduces support for the DATETRUNC function. This function has the following syntax:

DATETRUNC( part, datetimevalue )

The function returns a date and time value representing the input value truncated, or floored, to the beginning of the specified part. For instance, if you specify year as the part, you get the input value floored to the beginning of the year. If you specify month as the part, you get the input value floored to the beginning of the month. If the input value’s type is a date and time type, you get an output of the same type and precision. If the input is of a character string type, it is converted to DATETIME2(7), which is also the output type.
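For example, with a DATETIME2(7) input, the output preserves both the type and the precision (the literal and cast here are just for illustration):

SELECT DATETRUNC( month, CAST('2022-07-18 12:34:56.1234567' AS DATETIME2(7)) ) AS startofmonth;

This expression floors the input to the beginning of its month, returning the DATETIME2(7) value 2022-07-01 00:00:00.0000000.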

For most parts, you could think of the DATETRUNC function as a simplified version of the DATE_BUCKET function, with the same part, width of 1, and no explicit origin.

As an example, consider the following expression:

SELECT DATETRUNC( year, CAST('20220718' AS DATE) ) AS startofyear;

It is equivalent to:

SELECT DATE_BUCKET( year, 1, CAST('20220718' AS DATE) ) AS startofyear;

Both expressions produce the following output:

startofyear
-----------
2022-01-01
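The week part is a notable exception to this equivalence. DATETRUNC(week, ...) honors the session’s DATEFIRST setting, whereas DATE_BUCKET(week, 1, ...) aligns buckets to the origin, which by default is January 1st, 1900, a Monday. A small sketch of the difference (July 18th, 2022 was a Monday):

SET DATEFIRST 7; -- week starts on Sunday

SELECT
  DATETRUNC( week, CAST('20220718' AS DATE) ) AS startofweek_datefirst,
  DATE_BUCKET( week, 1, CAST('20220718' AS DATE) ) AS startofweek_origin;

With this setting, DATETRUNC should return the preceding Sunday, 2022-07-17, while DATE_BUCKET returns 2022-07-18, the Monday aligned with the default origin. This is also why Query 6c below sets DATEFIRST to 7, matching the Sunday-based buckets that Query 6b achieves with an explicit origin.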

Similar to the rewrites you made to queries that use traditional date and time functions to use DATE_BUCKET instead, you could rewrite them to use DATETRUNC instead.

So, instead of the original Query 2:

SELECT orderyear, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate))) AS D(orderyear)
GROUP BY orderyear;

You could use the DATETRUNC function, like so (I’ll refer to this as Query 2c):

SELECT YEAR(startofyear) AS orderyear, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATETRUNC(year, orderdate))) AS D(startofyear)
GROUP BY startofyear;

Similarly, instead of the original Query 5:

SELECT orderyear, ordermonth, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate), MONTH(orderdate)))
               AS D(orderyear, ordermonth)
GROUP BY orderyear, ordermonth;

You could use the following (I’ll call this Query 5c):

SELECT
YEAR(startofmonth) AS orderyear, 
MONTH(startofmonth) AS ordermonth, 
COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATETRUNC(month, orderdate))) AS D(startofmonth)
GROUP BY startofmonth;

Instead of the original Query 6:

SELECT startofweek, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATEADD(week,
                      DATEDIFF(week, CAST('19000107' AS DATE), orderdate),
                      CAST('19000107' AS DATE)))) AS D(startofweek)
GROUP BY startofweek;

You could use the following (I’ll call this Query 6c):

SET DATEFIRST 7;

SELECT startofweek, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATETRUNC(week, orderdate))) AS D(startofweek)
GROUP BY startofweek;

Although not the case in earlier builds, starting with SQL Server 2022 RC1, the DATETRUNC function can rely on index order for grouping purposes similar to the DATE_BUCKET function. The plans generated for Query 2c, Query 5c, and Query 6c are similar to the ones generated for Query 2b, Query 5b, and Query 6b, respectively. All end up with an ordered index scan and no explicit sorting prior to a Stream Aggregate operator.

What about other manipulations?

Basic grouping is only one example where there’s potential for the optimizer to rely on index order even when manipulating the underlying columns, but where currently, it doesn’t. There are many other examples like filtering, more sophisticated hierarchical grouping with grouping sets, distinctness, joining, and others. For now, you can resolve only some of the cases by revising the code to use the DATE_BUCKET function. For some of the rest of the cases, you can still use the hack with the indexed computed columns if that’s an option and the tradeoff is acceptable. I’ll demonstrate this with filtering and hierarchical time-based grouping with grouping sets.

As an example for filtering, consider classic queries where you need to filter a consecutive period of time like a whole year or a whole month. The recommended way to go is to use a SARGable predicate without manipulating the filtered column. For example, to filter orders placed in 2021, the recommended way to write the query is like so (I’ll call this Query 7):

SELECT orderid, orderdate
FROM Sales.Orders
WHERE orderdate >= '20210101' AND orderdate < '20220101';

The plan for Query 7 is shown in Figure 6.

Figure 6: Plan for Query 7

As you can see, the plan relies on index order by applying a seek.

You often see people writing queries that filter consecutive time periods by applying functions to the filtered columns, not realizing the optimization implications. For example, using the YEAR function to filter a whole year, like so (I’ll call this Query 8):

SELECT orderid, orderdate
FROM Sales.Orders
WHERE YEAR(orderdate) = 2021;

The plan for this query, which is shown in Figure 7, shows a full scan of the index on orderdate.

Figure 7: Plan for Query 8

In case you’re curious about how filtering with the DATE_BUCKET function is optimized: unlike with grouping, using this function in a filter predicate does inhibit relying on index order. Consider the following query:

SELECT orderid, orderdate
FROM Sales.Orders
WHERE DATE_BUCKET(year, 1, orderdate) = '20210101'; 

SQL Server’s optimizer doesn’t consider this to be a SARGable predicate, resulting in a plan that scans the index on orderdate, similar to the plan shown in Figure 7.

I’m afraid that the same applies when using the DATETRUNC function:

SELECT orderid, orderdate
FROM Sales.Orders
WHERE DATETRUNC(year, orderdate) = '20210101';

Also here, SQL Server’s optimizer doesn’t consider this to be a SARGable predicate, resulting in a plan that scans the index on orderdate, similar to the plan shown in Figure 7.

Again, the optimization of the new functions could potentially be enhanced in the future to recognize such cases as SARGable and enable a seek in a supporting index, but at the time of writing, such predicates are not considered SARGable.

Naturally, the best thing you can do as a developer is to be familiar with best practices and ensure that you write your queries with SARGable predicates, such as Query 7. The problem is that sometimes you have no control over the way the code is written, and you don’t have the option to revise it. In such cases, it’s good to have the option of indexed computed columns, assuming the tradeoff is acceptable. For instance, suppose the original code in the application is Query 8, and you don’t have the option to revise it to Query 7. You could create the following computed column and supporting covering index:

ALTER TABLE Sales.Orders
ADD corderyear AS YEAR(orderdate);

CREATE NONCLUSTERED INDEX idx_nc_corderyear_i_od_oid ON Sales.Orders(corderyear)
INCLUDE(orderdate, orderid);

Rerun Query 8:

SELECT orderid, orderdate
FROM Sales.Orders
WHERE YEAR(orderdate) = 2021;

The plan for Query 8, which is shown in Figure 8, applies a seek against the new index.

Figure 8: New plan for Query 8

Run the following code for cleanup:

DROP INDEX idx_nc_corderyear_i_od_oid ON Sales.Orders;

ALTER TABLE Sales.Orders DROP COLUMN corderyear;

The same trick can also be applied in hierarchical time-based grouping with grouping sets, such as when using the ROLLUP option. Consider the following query as an example (I’ll call it Query 9):

SELECT orderyear, ordermonth, orderday, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate))) 
               AS D(orderyear, ordermonth, orderday)
GROUP BY ROLLUP(orderyear, ordermonth, orderday);

The plan for Query 9 is shown in Figure 9.

Figure 9: Plan for Query 9

The plan scans the input data only once, but then it sorts it. The first Stream Aggregate operator handles the base grouping, and the second handles the rolling up of the aggregates based on the time hierarchy. Theoretically, SQL Server’s optimizer could rely on the order of the index on orderdate and avoid a sort here. However, similar to the earlier examples of grouped queries, it doesn’t, resulting in explicit sorting in the plan.

An attempt to replace the expressions using the YEAR, MONTH and DAY functions with ones using the DATE_BUCKET function, at least in a trivial way, doesn’t help here. You still get a sort in the plan. What would prevent a sort is to emulate the use of the ROLLUP option by writing multiple grouped queries for the different grouping sets, some using the DATE_BUCKET function, and unifying the results, like so (I’ll call this Query 10):

SELECT 
YEAR(orderdate) AS orderyear, 
MONTH(orderdate) AS ordermonth,
DAY(orderdate) AS orderday,
COUNT(*) AS numorders
FROM Sales.Orders
GROUP BY orderdate

UNION ALL

SELECT 
YEAR(yearmonthbucket) AS orderyear, 
MONTH(yearmonthbucket) AS ordermonth,
NULL AS orderday,
COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATE_BUCKET(month, 1, orderdate))) AS D(yearmonthbucket)
GROUP BY yearmonthbucket

UNION ALL

SELECT 
YEAR(yearbucket) AS orderyear, 
NULL AS ordermonth,
NULL AS orderday,
COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(DATE_BUCKET(year, 1, orderdate))) AS D(yearbucket)
GROUP BY yearbucket

UNION ALL

SELECT 
NULL AS orderyear, 
NULL AS ordermonth,
NULL AS orderday,
COUNT(*) AS numorders
FROM Sales.Orders
GROUP BY GROUPING SETS(());

The plan for this query is shown in Figure 10.

Figure 10: Plan for Query 10

Indeed there’s no sorting in the plan, and this plan has linear scaling compared to N Log N scaling when sorting is involved; however, there are four scans of the input data instead of one. So, which plan is better depends on the input size. You also need to consider the loss of programmatic benefits due to the much more verbose coding here.

If it’s important for you to keep Query 9 as your solution, plus avoid the sort in the plan, and you’re willing to accept the tradeoff involved with adding computed columns and indexing them, the old trick works here. Here’s the code to add the computed columns and the indexes:

ALTER TABLE Sales.Orders
ADD corderyear AS YEAR(orderdate),
    cordermonth AS MONTH(orderdate),
    corderday AS DAY(orderdate);

CREATE NONCLUSTERED INDEX idx_nc_corderymd ON Sales.Orders(corderyear, cordermonth, corderday);

Rerun Query 9:

SELECT orderyear, ordermonth, orderday, COUNT(*) AS numorders
FROM Sales.Orders
CROSS APPLY (VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate))) 
               AS D(orderyear, ordermonth, orderday)
GROUP BY ROLLUP(orderyear, ordermonth, orderday);

Now you get the plan shown in Figure 11, with only one scan of the input and no explicit sorting.

Figure 11: New Plan for Query 9

It’s good to know that the old trick works here, but naturally, it would have been better if the optimizer recognized that it could rely on the index on the orderdate column to begin with.

Run the following code for cleanup:

DROP INDEX idx_nc_corderymd ON Sales.Orders;

ALTER TABLE Sales.Orders DROP COLUMN corderyear, cordermonth, corderday;

Conclusion

In this article I showed that when grouping data by expressions that manipulate date and time columns with traditional functions like DATEPART, YEAR, MONTH and DAY, such manipulation typically inhibits the optimizer’s ability to rely on index order even when theoretically it could have been beneficial. It’s great to see that with the new DATE_BUCKET and DATETRUNC functions in SQL Server 2022, Microsoft did add logic to the optimizer to enable relying on index order. Consequently, for now, you’d be better off in some cases revising grouped queries that use traditional functions to use DATE_BUCKET or DATETRUNC instead. The tradeoff is that the code will be a bit less natural than it was originally.

Ideally, in the future, Microsoft will add logic to the optimizer so that it can rely on index order also when manipulating the underlying columns with traditional date and time functions. The same goes for handling filtering tasks, more sophisticated grouping, and any situation where potentially index ordering could be relevant. This will enable each role in the data platform ecosystem to focus on what they’re supposed to. It will enable developers to write solutions that are more natural by avoiding query rewrites just for the sake of performance. It will enable DBAs to apply more natural tuning by creating indexes only on the original date and time columns and avoiding the need to use costly hacks like indexed computed columns.

At any rate, the addition of the DATE_BUCKET and DATETRUNC functions to T-SQL and their optimization is fantastic, and we can only hope to see many more great additions to T-SQL like this one in the future.

Emulating the GROUPS Window Frame Option
https://sqlperformance.com/2022/07/t-sql-queries/emulating-the-groups-window-frame-option
Wed, 13 Jul 2022

Itzik Ben-Gan shows two supported approaches to mimic the GROUPS window frame option, which is part of the SQL standard but currently unsupported in SQL Server.

Generally in life, it’s good to have the perfect tools to handle any given task. But sometimes a tool you need isn’t available, and you have to figure out a solution without it. The same goes specifically for handling T-SQL querying tasks in terms of supported language elements. T-SQL, the dialect, supports a subset of the features from standard SQL. Sometimes, you need to solve a T-SQL querying task, and you know the natural tool to solve it is a feature that’s part of the SQL standard but isn’t available in T-SQL. So you have to come up with a supported alternative. This could also happen if you need to migrate code written for another database platform with a different dialect of SQL to T-SQL.

This article is dedicated to such a case—specifically, a standard SQL feature related to window functions called the GROUPS window frame option. At the time of writing, T-SQL doesn’t support it, but PostgreSQL—for example—does. I’ll explain what this feature does and how to emulate it in T-SQL.

Sample Data

In my examples, I’ll use a table called Orders. Use the following code to create the Orders table and populate it with a small set of sample data to help you verify the validity of the solutions:

SET NOCOUNT ON;

USE tempdb;

DROP TABLE IF EXISTS dbo.Orders;

CREATE TABLE dbo.Orders
(
  orderid   INT         NOT NULL,
  orderdate DATE        NOT NULL,
  empid     INT         NOT NULL,
  custid    VARCHAR(10) NOT NULL,
  qty       INT         NOT NULL,
  CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED(orderid),
  INDEX idx_nc_od_oid_i_qty UNIQUE NONCLUSTERED(orderdate, orderid) INCLUDE(qty)
);

INSERT INTO dbo.Orders(orderid, orderdate, empid, custid, qty)
  VALUES( 2, '20220107', 3, 'B', 10),
        ( 3, '20220107', 1, 'C', 10),
        ( 5, '20220107', 1, 'A', 30),
        ( 7, '20220110', 4, 'A', 40),
        (11, '20220110', 1, 'C', 10),
        (13, '20220111', 2, 'B', 20),
        (17, '20220111', 4, 'A', 10),
        (19, '20220111', 2, 'C', 20),
        (23, '20220111', 3, 'B', 15),
        (29, '20220112', 3, 'B', 20),
        (31, '20220112', 3, 'C', 30),
        (37, '20220112', 3, 'C', 30);

The nonclustered index idx_nc_od_oid_i_qty is designed to support the solutions I’ll cover in this article.

To test the performance of the solutions, you’ll need a larger set of sample data. Use the following code to create a helper function called GetNums, which generates a sequence of integers in a requested range:

CREATE FUNCTION dbo.GetNums(@low AS BIGINT = 1, @high AS BIGINT)
  RETURNS TABLE
AS
RETURN
  WITH
    L0 AS ( SELECT 1 AS c 
            FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),
                        (1),(1),(1),(1),(1),(1),(1),(1)) AS D(c) ),
    L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ),
    L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ),
    L3 AS ( SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B ),
    Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
              FROM L3 )
  SELECT TOP(@high - @low + 1)
     rownum AS rn,
     @high + 1 - rownum AS op,
     @low - 1 + rownum AS n
  FROM Nums
  ORDER BY rownum;
GO

Use the following code to populate the Orders table with a large set of sample data:

DECLARE
  @numrows   AS INT = 1000000,
  @numemps   AS INT = 500,
  @numcusts  AS INT = 20000,
  @maxqty    AS INT = 100,
  @startdate AS DATE = '20180101',
  @enddate   AS DATE = '20221231';

TRUNCATE TABLE dbo.Orders;

INSERT INTO dbo.Orders WITH(TABLOCK) (orderid, orderdate, empid, custid, qty)
  SELECT N.n AS orderid,
    DATEADD(day, ABS(CHECKSUM(NEWID())) % (DATEDIFF(day, @startdate, @enddate) + 1), @startdate) AS orderdate,
    ABS(CHECKSUM(NEWID())) % @numemps + 1 AS empid,
    'C' + CAST(ABS(CHECKSUM(NEWID())) % @numcusts + 1 AS VARCHAR(9)) AS custid,
    ABS(CHECKSUM(NEWID())) % @maxqty + 1 AS qty
  FROM dbo.GetNums(1, @numrows) AS N;

Feel free, of course, to test the performance of the solutions with different data loading parameters if you’d like.

Understanding Window Frame Units

As prerequisite knowledge for this article, I’m assuming you’re familiar with T-SQL’s support for window functions and the elements in their specification, including the window frame. Still, I’ll provide a quick review of the syntax.

As a reminder, window functions supporting a frame, such as aggregate window functions and the FIRST_VALUE and LAST_VALUE functions, have the following syntax:

function_name(<arguments>) OVER(
  [ <window partition clause> ]
  [ <window order clause> [ <window frame clause> ] ] )

The optional window partition clause defines a subset of rows from the whole window of rows the function can see. The window frame clause defines a subset of rows from the partition (if present) or from the whole window (if not present).

The more detailed syntax for the window frame clause is as follows:

<window frame units> <window frame extent>

Assuming ordering of rows based on the window order clause, the frame defines a subset of rows between two delimiters. The window frame units part defines the units for the delimiters, and the window frame extent part defines the actual delimiters.

The SQL standard supports three window frame units: ROWS, RANGE, and GROUPS. T-SQL supports the ROWS option in full, the RANGE option partially, and has no support for the GROUPS option at all.

The ROWS Option

As a reminder, the ROWS option allows you to define frame delimiters in a certain offset from the current row in terms of a number of rows. Here’s a simple example:

SELECT orderid, orderdate, qty,
  SUM(qty) OVER(ORDER BY orderdate
                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sumrows
FROM dbo.Orders;

As you can see, there’s no window partition clause, so the frame is extracted from the entire window of rows. Window ordering is based on the orderdate column. Then, using the ROWS units, you define a frame starting with two rows before the current and ending with the current row. In simple English terms, the function computes the sum of quantities from the last three rows based on orderdate ordering.

Here's the output of this query when applied to the small set of sample data:

orderid     orderdate  qty         sumrows
----------- ---------- ----------- -----------
2           2022-01-07 10          10
3           2022-01-07 10          20
5           2022-01-07 30          50
7           2022-01-10 40          80
11          2022-01-10 10          80
13          2022-01-11 20          70
17          2022-01-11 10          40
19          2022-01-11 20          50
23          2022-01-11 15          45
29          2022-01-12 20          55
31          2022-01-12 30          65
37          2022-01-12 30          80

Figure 1 illustrates the applicable frame of rows and the function’s result for a sample row (with order ID 17).

Figure 1: Understanding the ROWS Option

One tricky aspect of the ROWS option is it can result in a nondeterministic calculation when the window ordering is not total ordering (i.e., when the ordering elements don’t uniquely identify a row). This is the case in our example. The orderdate column isn’t unique, so there’s no preference between rows with the same orderdate values. Actual access order to the rows, which depends on the chosen plan and physical data layout, ends up determining the preference between rows with the same orderdate values. If you run the query twice—without any data changes happening between executions—you can theoretically end up with different results.

But what if you require a deterministic definition of the frame with guaranteed repeatable results?

One way to achieve this while still using the ROWS option is to define total ordering. In our example, this can be achieved by adding the orderid column as the ordering tiebreaker:

SELECT orderid, orderdate, qty,
  SUM(qty) OVER(ORDER BY orderdate, orderid
                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sumrows
FROM dbo.Orders;

You use this option when you need both determinism and a guaranteed maximum frame size in terms of number of rows. In our case, the frame will never have more than three rows.

If you don’t need to limit the frame size, the SQL standard gives you two additional ways to define frame boundaries using the RANGE and GROUPS units.

The RANGE Option

The idea behind the RANGE option is it allows you to define a frame delimiter as an offset from the current row’s ordering value, as opposed to an offset in terms of a number of rows. For example, suppose you wanted to compute the total quantities of the orders from the last three days. The SQL standard’s syntax for it, albeit not supported currently in T-SQL, goes like this:

SELECT orderid, orderdate, qty,
  SUM(qty) OVER(ORDER BY orderdate
                RANGE BETWEEN INTERVAL '2' DAY PRECEDING AND CURRENT ROW) AS sumrange
FROM dbo.Orders;

Here’s the expected output of this query:

orderid     orderdate  qty         sumrange
----------- ---------- ----------- -----------
2           2022-01-07 10          50
3           2022-01-07 10          50
5           2022-01-07 30          50
7           2022-01-10 40          50
11          2022-01-10 10          50
13          2022-01-11 20          115
17          2022-01-11 10          115
19          2022-01-11 20          115
23          2022-01-11 15          115
29          2022-01-12 20          195
31          2022-01-12 30          195
37          2022-01-12 30          195

Figure 2 illustrates the applicable frame of rows and the function’s result for our sample row with order ID 17.

Figure 2: Understanding the RANGE Option

The first delimiter is the ordering value two days before the current row’s ordering value. The sample row’s order date is 2022-01-11, so the first delimiter is 2022-01-09. Since this date doesn’t exist in the data, the actual first delimiter is the next existing date, 2022-01-10. The second delimiter is the current row’s ordering value, meaning the order date 2022-01-11. As you can see in the figure, there are a few additional rows with the same order date as the sample row’s order date, so they’re all included in the frame. The frame ends up including six rows in this row’s case.

Currently, T-SQL supports only UNBOUNDED and CURRENT ROW as delimiters with the RANGE option. So if you try running the above query against SQL Server or Azure SQL Database, you get an error. To achieve the task with T-SQL, you need to use a supported alternative like the following, which uses joining and grouping:

SELECT O1.orderid, O1.orderdate, O1.qty, SUM(O2.qty) AS sumrange
FROM dbo.Orders AS O1
  INNER JOIN dbo.Orders AS O2
    ON O2.orderdate BETWEEN DATEADD(day, -2, O1.orderdate) AND O1.orderdate
GROUP BY O1.orderid, O1.orderdate, O1.qty;
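As a side note, the same join-and-group logic can also be expressed with CROSS APPLY, which keeps the aggregate correlated to the outer row and avoids the final GROUP BY. This variant is just an equivalent sketch, not part of the original article:

```sql
-- Equivalent sketch using CROSS APPLY: for each order, total the
-- quantities of all orders placed in the last three days
SELECT O1.orderid, O1.orderdate, O1.qty, A.sumrange
FROM dbo.Orders AS O1
  CROSS APPLY ( SELECT SUM(O2.qty) AS sumrange
                FROM dbo.Orders AS O2
                WHERE O2.orderdate BETWEEN DATEADD(day, -2, O1.orderdate)
                                       AND O1.orderdate ) AS A;
```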

The GROUPS Option

And now the focus of this article: the SQL standard’s GROUPS option. When the window ordering is total ordering, each distinct ordering value appears in only one row. But when the window ordering is not total ordering—like when you order by the orderdate column—each distinct ordering value could appear in a group of rows. Using the GROUPS option, you can define a delimiter as an offset in terms of a number of groups of distinct ordering values with respect to the current group’s ordering value. You could specify a number of groups PRECEDING, a number of groups FOLLOWING, and CURRENT ROW (meaning current group) as delimiters. You can see how the GROUPS option allows you to combine certain aspects of the ROWS and RANGE options.

Recall the example shown earlier with RANGE where you wanted to compute the total quantities of the orders from the last three days. But what if you wanted the last three days with order activity? If your company doesn’t process orders on weekends and holidays, the meaning of “last three days” and “last three days with order activity” could be different. The SQL standard gives you the RANGE option to handle the former and the GROUPS option to handle the latter.

Here’s how you’re supposed to define a frame representing the last three days with order activity using GROUPS:

SELECT orderid, orderdate, qty,
  SUM(qty) OVER(ORDER BY orderdate
                GROUPS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sumgroups
FROM dbo.Orders;

Here’s the expected output of this query:

orderid     orderdate  qty         sumgroups
----------- ---------- ----------- -----------
2           2022-01-07 10          50
3           2022-01-07 10          50
5           2022-01-07 30          50
7           2022-01-10 40          100
11          2022-01-10 10          100
13          2022-01-11 20          165
17          2022-01-11 10          165
19          2022-01-11 20          165
23          2022-01-11 15          165
29          2022-01-12 20          195
31          2022-01-12 30          195
37          2022-01-12 30          195

Figure 3 illustrates the applicable frame of rows and the function’s result for our sample row with order ID 17.

Figure 3: Understanding the GROUPS Option

Each distinct orderdate value defines an ordering group. Given the sample row with order ID 17, the first frame delimiter starts two groups prior to the current (i.e., the order date 2022-01-07). The second frame delimiter is the current row’s order date value, 2022-01-11. The frame for this sample row ends up including nine rows.

Emulating GROUPS

I’ll cover two different solutions for emulating the GROUPS option. I’ll use the last task as the specific example for what needs to be computed—the total quantities of the last three days with order activity.

Solution 1: Using Dense Ranks, Joining, and Grouping

You’ve just learned the GROUPS option deals with offsets in terms of a number of groups of distinct ordering values. One approach to emulate the option is to first number (or rank) distinct groups of ordering values. This can be done easily with the DENSE_RANK function, like so:

SELECT orderid, orderdate, qty,
  DENSE_RANK() OVER(ORDER BY orderdate) AS drk
FROM dbo.Orders;

This query generates the following output:

orderid     orderdate  qty         drk
----------- ---------- ----------- --------------------
2           2022-01-07 10          1
3           2022-01-07 10          1
5           2022-01-07 30          1
7           2022-01-10 40          2
11          2022-01-10 10          2
13          2022-01-11 20          3
17          2022-01-11 10          3
19          2022-01-11 20          3
23          2022-01-11 15          3
29          2022-01-12 20          4
31          2022-01-12 30          4
37          2022-01-12 30          4

Now, if T-SQL had complete support for the RANGE option, you’d have been able to achieve our task like so:

WITH C AS
(
  SELECT orderid, orderdate, qty,
    DENSE_RANK() OVER(ORDER BY orderdate) AS drk
  FROM dbo.Orders
)
SELECT orderid, orderdate, qty,
  SUM(qty) OVER(ORDER BY drk
                RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) AS sumgroups
FROM C;

This code defines a CTE called C based on the query computing the dense rank values in the result column drk. The outer query against C then uses a window function to compute the desired total quantities of the last three days of activity. The window function does so by defining a window order clause based on drk and then a RANGE-based frame starting with two dense rank values preceding and ending with the current dense rank value. Unfortunately, as you know, T-SQL doesn’t support 2 PRECEDING as a RANGE delimiter.

You could emulate the RANGE option in T-SQL using joining and grouping, like so:

WITH C AS
(
  SELECT orderid, orderdate, qty,
    DENSE_RANK() OVER(ORDER BY orderdate) AS drk
  FROM dbo.Orders
)
SELECT C1.orderid, C1.orderdate, C1.qty, SUM(C2.qty) AS sumgroups
FROM C AS C1
  INNER JOIN C AS C2
    ON C2.drk BETWEEN C1.drk - 2 AND C1.drk
GROUP BY C1.orderid, C1.orderdate, C1.qty;

This solution is indeed supported in T-SQL but, unfortunately, it results in a poor-performing plan, as shown in Figure 4.

Figure 4: Plan for Solution 1

This plan was produced against the large set of sample data. The Orders table was populated with one million rows, with 1,826 distinct order dates (across five years).

To understand the plan, first examine the outer input of the Nested Loops operator. It scans the data from the supporting nonclustered index and computes the dense rank values. To reduce the work, the plan locally groups the data by the dense rank values and computes total quantities per group prior to the join. This is why you see the Hash Match (Aggregate) operator delivering 1,826 rows as the outer input to the Nested Loops operator.

The inner input to the Nested Loops operator is then executed once per group—1,826 times—each of which scans the index and computes dense rank values. This results in 1,826,000,000 rows produced for all executions.

The Nested Loops operator matches the right rows between the outer and inner inputs, resulting in close to three million rows. Finally, the plan applies a global grouping and aggregation against the join’s result.

It took this plan over five minutes to complete on my system (311 seconds), with results discarded in SQL Server Management Studio (SSMS). That’s pretty slow for a one million-row input!
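For context, the supporting nonclustered index referenced in the plan discussion is defined outside this excerpt. An index along the following lines (the name and exact shape here are assumptions, not the article's actual definition) would provide the ordered scan by orderdate the plans rely on:

```sql
-- Assumed shape of the supporting index; the article's actual
-- definition appears earlier in the series
CREATE NONCLUSTERED INDEX idx_orderdate
  ON dbo.Orders(orderdate)
  INCLUDE(orderid, qty);
```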

Solution 2: Using Grouping, Joining, and Windowing

The second solution is a bit more creative and requires a bit more sophistication. In the first step, you combine grouping and windowing. You group the rows from the Orders table by orderdate and compute a grouped sum of the quantities per group (in other words, the total daily quantities). You then apply a windowed sum with the grouped sum as input. It looks a bit strange because the expression starts with SUM(SUM(qty)) OVER…, but it’s perfectly valid. The inner SUM is a grouped sum, and the outer SUM is a windowed SUM applied to the grouped SUM. The window function uses orderdate as the window ordering element, and with the ROWS option, it defines a frame based on the last three rows.

Here’s the code implementing this first step:

SELECT orderdate, SUM(qty) AS daytotal,
  SUM(SUM(qty)) OVER(ORDER BY orderdate
                     ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sumgroups
FROM dbo.Orders
GROUP BY orderdate;

This code generates the following output:

orderdate   daytotal    sumgroups
----------  ----------- -----------
2022-01-07  50          50
2022-01-10  50          100
2022-01-11  65          165
2022-01-12  80          195

The neat part of this idea is that because you applied preliminary grouping by orderdate and added the windowing, the ROWS option effectively gives you what you needed from the unsupported GROUPS option. As you can see, you get the distinct daily groups and their aggregate results (total of last three days) but without the detail rows (the orders) you also need to return. Fortunately, this part is easy to achieve with a simple join between the Orders table and the result of the first step, like so:

WITH C AS
(
  SELECT orderdate,
    SUM(SUM(qty)) OVER(ORDER BY orderdate
                       ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sumgroups
  FROM dbo.Orders
  GROUP BY orderdate
)
SELECT O.orderid, O.orderdate, O.qty, C.sumgroups
FROM dbo.Orders AS O
  INNER JOIN C
    ON O.orderdate = C.orderdate;

It’s pretty neat, if I may say so myself!

The plan for this solution is shown in Figure 5.

Figure 5: Plan for Solution 2

It’s an efficient plan. It starts by scanning the data from the supporting index and computing the grouped aggregate using a Stream Aggregate relying on index order. It then computes the window aggregate, relying on the ordered data delivered by the grouped aggregate. Finally, it uses a Merge Join between the aggregated result and the result of one more ordered scan of the supporting index, which obtains the detail rows.

It took this plan less than one second (820 ms) to complete on my system, with results discarded in SSMS. This is compared to the more than five minutes it took the first solution to complete!

The suggested solution works fine when you need to emulate a sum window aggregate with the GROUPS option. But if you need to emulate an average aggregate, applying a windowed AVG on top of a grouped AVG won’t do the trick. To emulate an average calculation correctly, you’ll need to compute a windowed SUM of a grouped SUM and divide the result by a windowed SUM of a grouped COUNT, like so:

WITH C AS
(
  SELECT orderdate,
    SUM(SUM(qty)) OVER(ORDER BY orderdate
                       ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sumgroups,
    SUM(COUNT(qty)) OVER(ORDER BY orderdate
                         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS countgroups
  FROM dbo.Orders
  GROUP BY orderdate
)
SELECT O.orderid, O.orderdate, O.qty, 1.0 * C.sumgroups / C.countgroups AS avggroups
FROM dbo.Orders AS O
  INNER JOIN C
    ON O.orderdate = C.orderdate;

This query generates the following output against the small set of sample data:

orderid     orderdate  qty         avggroups
----------- ---------- ----------- ----------------
2           2022-01-07 10          16.666666666666
3           2022-01-07 10          16.666666666666
5           2022-01-07 30          16.666666666666
7           2022-01-10 40          20.000000000000
11          2022-01-10 10          20.000000000000
13          2022-01-11 20          18.333333333333
17          2022-01-11 10          18.333333333333
19          2022-01-11 20          18.333333333333
23          2022-01-11 15          18.333333333333
29          2022-01-12 20          21.666666666666
31          2022-01-12 30          21.666666666666
37          2022-01-12 30          21.666666666666
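A similar caution applies to other nondistributive calculations. MIN and MAX, on the other hand, do distribute over groups (the minimum across a frame of groups equals the minimum of the per-group minimums), so for those the simple nesting works directly. Here's a sketch, not from the article, following the same pattern:

```sql
-- Sketch: MIN distributes over groups, so a windowed MIN of a grouped
-- MIN correctly emulates a MIN aggregate with the GROUPS option
WITH C AS
(
  SELECT orderdate,
    MIN(MIN(qty)) OVER(ORDER BY orderdate
                       ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS mingroups
  FROM dbo.Orders
  GROUP BY orderdate
)
SELECT O.orderid, O.orderdate, O.qty, C.mingroups
FROM dbo.Orders AS O
  INNER JOIN C
    ON O.orderdate = C.orderdate;
```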

Conclusion

The standard window frame unit called GROUPS allows you to define window frame delimiters as an offset in terms of a number of groups of distinct ordering values from the current row. It’s a mouthful, but hopefully having read the article, this now makes perfect sense to you. T-SQL currently doesn’t support this feature, so if you need this functionality, you have to use a supported workaround. I showed two techniques to emulate the GROUPS option. One is probably more intuitive—yet very slow—and uses dense ranks, joining, and grouping. The other is a bit more sophisticated but is very fast, using grouping, joining, and windowing.

The post Emulating the GROUPS Window Frame Option appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2022/07/t-sql-queries/emulating-the-groups-window-frame-option/feed 7
T-SQL Can Be Expressive Without Sacrificing Performance https://sqlperformance.com/2022/06/t-sql-queries/expressive https://sqlperformance.com/2022/06/t-sql-queries/expressive#respond Thu, 02 Jun 2022 09:00:38 +0000 https://sqlperformance.com/?p=11388 There are often multiple ways to express a query and get the same results, often without any change in performance. Learn about one example.

The post T-SQL Can Be Expressive Without Sacrificing Performance appeared first on SQLPerformance.com.

]]>
Even with its warts and limitations, Transact-SQL is a beautiful language, allowing for flexible declarative expression about what you’re asking the database engine to do for you.

Itzik Ben-Gan has shown time after time that you can get the same results in many different ways—look what he demonstrated recently with window ordering. I have also discussed this when dealing with anti-semi-joins, which can be solved using APPLY or EXCEPT or NOT EXISTS or LEFT JOIN. In some of those cases, different queries led to different performance because query semantics changed (either obviously or, at least, according to the engine). But writing a query differently—even one more verbose—can have any of the following impacts, some subjective, some not:

  • More or less intuitive code
  • Zero or negligible performance differences
  • Meaningful performance differences
  • Trade one resource for another (e.g., use more memory but less CPU, or more CPU but less I/O)

And in each of those cases, you can decide which version of the query is most beneficial for you based on your priorities. Maybe you have a memory- or CPU-bound system, or maybe you prefer one syntax over another for subjective reasons, or maybe you think one form will be easier for future maintainers and newcomers.

As an analogy, there are many routes you can take from New York City to Dallas. Some may be faster than others, some may be fewer miles but take longer, some are more fuel-efficient due to average speed limits, some more scenic, and some more toll-friendly. The beauty is that if you and I are independently planning the same trip, we can choose our routes based on our individual priorities. I may not like interstates, or I may prefer to drive more westerly until the sun starts setting, and you may want to see a particular tourist attraction, visit an uncle, or stop in a certain city.

A query is similar. Usually, performance is of utmost importance, but even that isn’t always true. When two or more queries give the same answer and have identical (or “close enough”) performance, the choice can come down to other factors, as mentioned above. I recently answered a question on Stack Overflow where the user was asking how to filter a grouping where an aggregate condition was true.

For some context, Stack Overflow is a place where I tend to cater to people with a wide variety of experience with T-SQL or queries in general. Sometimes solutions are necessarily complex, or they need less-used syntax that is not universally understood, so it can take more explanation and a better breakdown of the code for the user to benefit. One of the ways I like to help with this breakdown is to isolate different aspects of the query in a derived table or, more commonly, a common table expression (CTE) because it can be a lightbulb moment to think about that part independently. With that in mind, let’s look at a boiled-down version of the question:

    Given this table dbo.tablename, I want to return a single row for each name and division combination, but only where there is both a row with source = 'comp' and a row where source = 'manual':

    name   division  source
    host1  abc       comp
    host2  xy        manual
    host3  zyx       comp
    host3  zyx       manual
    host2  xy        manual

    The two host3 rows are the only rows they want to consider for aggregation, with the desired output being a single row:

    name   division
    host3  zyx

In T-SQL, setting up this sample would look like this:

  CREATE TABLE dbo.tablename
  (
    name     varchar(128),
    division varchar(128),
    source   varchar(128)
  );

  INSERT dbo.tablename(name, division, source) VALUES
  ('host1',  'abc',  'comp'),
  ('host2',  'xy',   'manual'),
  ('host3',  'zyx',  'comp'),
  ('host3',  'zyx',  'manual'),
  ('host2',  'xy',   'manual');

To get the desired result, the first (and later accepted) answer used this syntax, which is perfectly adequate:

  SELECT name, division  
  FROM dbo.tablename
  WHERE source in ('comp', 'manual')
  GROUP BY name, division
  HAVING COUNT(DISTINCT source) > 1; -- or = 2
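In the same spirit, here's one more equivalent variation, a sketch that works only because the WHERE filter limits source to exactly two possible values (if both are present for a group, the minimum and maximum must differ):

```sql
  -- One more variation: valid only because the WHERE clause restricts
  -- source to exactly two values, 'comp' and 'manual'
  SELECT name, division
  FROM dbo.tablename
  WHERE source IN ('comp', 'manual')
  GROUP BY name, division
  HAVING MIN(source) < MAX(source);
```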

Since I know users commonly have difficulty with the HAVING clause, I offered a different approach, one that breaks the logic down, as I mentioned earlier.

Another way to think about it is to calculate the counts inside a CTE and then filter:

  ;WITH cte AS
  (
    SELECT name, division, SourceCount = COUNT(DISTINCT source)
    FROM dbo.tablename
    WHERE source IN ('comp', 'manual')
    GROUP BY name, division
  )
  SELECT name, division FROM cte 
    WHERE SourceCount = 2;

And yes, my CTEs always start with ;WITH (see why).

Or, if you don’t like CTEs:

  SELECT name, division FROM
  (
    SELECT name, division, SourceCount = COUNT(DISTINCT source)
    FROM dbo.tablename
    WHERE source IN ('comp', 'manual')
    GROUP BY name, division
  ) AS q WHERE SourceCount = 2;

Yes, it’s more typing, but the intention is to think about the counting and grouping separate from the filtering, like how in an INNER JOIN you can logically think about the joining conditions (in the ON clause) separate from the filter conditions (in the WHERE clause).

As for performance, they all perform the same because SQL Server is smart and can generate the same plan. I inserted 50,000 rows into the table and ran all three queries; each had a duration of 30–33ms, a memory grant of 1,584 KB, and an estimated subtree cost of 0.5972860. Here is the plan shape in all three cases:

Identical plan for three different queries

The plan would look different if the table had a clustered index; or let’s try an index designed to support this query specifically:

  CREATE INDEX testing ON dbo.tablename (source) INCLUDE (name, division);

Now the time is down to 26–28ms, the memory grant is still 1.5MB, and the estimated subtree cost has dropped by a whopping amount, to 0.5769890. Here is the new plan (again, identical for all three queries):

Plan for index-supported version

This is not a complex example but illustrates that we can often find various ways to get to a final destination using the most expressive format we like. Variations in syntax that are identical in results and underlying meaning can help give someone that “lightbulb” moment and provide a more natural tendency to test “identical” variations for cases where the performance might be different.

The post T-SQL Can Be Expressive Without Sacrificing Performance appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2022/06/t-sql-queries/expressive/feed 0
T-SQL Windowing Improvements in SQL Server 2022 https://sqlperformance.com/2022/05/t-sql-queries/windowing-improvements-sql-server-2022 https://sqlperformance.com/2022/05/t-sql-queries/windowing-improvements-sql-server-2022#comments Wed, 25 May 2022 04:01:06 +0000 https://sqlperformance.com/?p=11391 Itzik Ben-Gan dives deep on two of the more interesting T-SQL enhancements in SQL Server 2022: the new WINDOW clause and the windowing NULL treatment clause.

The post T-SQL Windowing Improvements in SQL Server 2022 appeared first on SQLPerformance.com.

]]>
Microsoft recently released the first public preview of SQL Server 2022. This release has a number of T-SQL improvements. In this article I focus on windowing and NULL-related improvements. These include the new WINDOW clause and the windowing NULL treatment clause.

I’ll be using the sample database TSQLV6 in the examples in this article. You can download this sample database here.

The WINDOW Clause

The WINDOW clause is part of the ISO/IEC SQL standard. It allows you to name parts of a window specification—or an entire one—and then use the window name in the OVER clause of your query’s window functions. This clause allows you to shorten your code by avoiding the repetition of identical parts of your window specifications. This clause is now available in Azure SQL Database and SQL Server 2022, provided you use database compatibility level 160 or higher.
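Before trying the examples, it's worth verifying the database meets that requirement. Here's a quick sketch, using the TSQLV6 sample database, of how you might check and, if needed, raise the compatibility level:

```sql
-- Check the current compatibility level of the sample database
SELECT name, compatibility_level
FROM sys.databases
WHERE name = N'TSQLV6';

-- Raise it if it's below 160 (required for the WINDOW clause)
ALTER DATABASE TSQLV6 SET COMPATIBILITY_LEVEL = 160;
```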

The WINDOW clause is located between the query’s HAVING and ORDER BY clauses:

SELECT
FROM
WHERE
GROUP BY
HAVING
WINDOW
ORDER BY

The WINDOW clause has the following syntax:

WINDOW window_name AS ( [ reference_window_name ]   
                        [ <window partition clause> ]  
                        [ <window order clause> ]   
                        [ <window frame clause> ] )

As an example where the WINDOW clause can be handy in shortening your code, consider the following query:

USE TSQLV6;

SELECT orderid, custid, orderdate, qty, val,
  SUM(qty) OVER( PARTITION BY custid 
                 ORDER BY orderdate, orderid
                 ROWS UNBOUNDED PRECEDING ) AS runsumqty,
  SUM(val) OVER( PARTITION BY custid 
                 ORDER BY orderdate, orderid
                 ROWS UNBOUNDED PRECEDING ) AS runsumval
FROM Sales.OrderValues
WHERE custid IN (1, 2)
ORDER BY custid, orderdate, orderid;

This query generates the following output:

orderid     custid      orderdate  qty         val     runsumqty   runsumval
----------- ----------- ---------- ----------- ------- ----------- ----------
10643       1           2021-08-25 38          814.50  38          814.50
10692       1           2021-10-03 20          878.00  58          1692.50
10702       1           2021-10-13 21          330.00  79          2022.50
10835       1           2022-01-15 17          845.80  96          2868.30
10952       1           2022-03-16 18          471.20  114         3339.50
11011       1           2022-04-09 60          933.50  174         4273.00
10308       2           2020-09-18 6           88.80   6           88.80
10625       2           2021-08-08 18          479.75  24          568.55
10759       2           2021-11-28 10          320.00  34          888.55
10926       2           2022-03-04 29          514.40  63          1402.95

In this query you can see two window functions using identical window specifications, including window partitioning, ordering and framing clauses. To shorten the query, you can use the WINDOW clause to name a window specification with all three elements, say as W, and then specify OVER W in both window functions, like so:

SELECT orderid, custid, orderdate, qty, val,
  SUM(qty) OVER W AS runsumqty,
  SUM(val) OVER W AS runsumval
FROM Sales.OrderValues
WHERE custid IN (1, 2)
WINDOW W AS ( PARTITION BY custid 
              ORDER BY orderdate, orderid
              ROWS UNBOUNDED PRECEDING )
ORDER BY custid, orderdate, orderid;

As you can see, when the window name represents the whole window specification that you need and not just part of it, you specify the window name right after the OVER clause without parentheses.

You may have noticed in the WINDOW clause’s syntax one window name specification can have a reference to another window name. This is especially useful when your query has different window functions with different window specifications and one window specification is the same as part of another. Consider the following query as an example:

SELECT orderid, custid, orderdate, qty, val,
  ROW_NUMBER() OVER( PARTITION BY custid
                     ORDER BY orderdate, orderid ) AS ordernum,
  MAX(orderdate) OVER( PARTITION BY custid ) AS maxorderdate,
  SUM(qty) OVER( PARTITION BY custid 
                 ORDER BY orderdate, orderid
                 ROWS UNBOUNDED PRECEDING ) AS runsumqty,
  SUM(val) OVER( PARTITION BY custid           
                 ORDER BY orderdate, orderid   
                 ROWS UNBOUNDED PRECEDING ) AS runsumval
FROM Sales.OrderValues
WHERE custid IN (1, 2)
ORDER BY custid, orderdate, orderid;

This query generates the following output:

orderid  custid  orderdate  qty  val     ordernum  maxorderdate runsumqty   runsumval
-------- ------- ---------- ---- ------- --------- ------------ ----------- -----------
10643    1       2021-08-25 38   814.50  1         2022-04-09   38          814.50
10692    1       2021-10-03 20   878.00  2         2022-04-09   58          1692.50
10702    1       2021-10-13 21   330.00  3         2022-04-09   79          2022.50
10835    1       2022-01-15 17   845.80  4         2022-04-09   96          2868.30
10952    1       2022-03-16 18   471.20  5         2022-04-09   114         3339.50
11011    1       2022-04-09 60   933.50  6         2022-04-09   174         4273.00
10308    2       2020-09-18 6    88.80   1         2022-03-04   6           88.80
10625    2       2021-08-08 18   479.75  2         2022-03-04   24          568.55
10759    2       2021-11-28 10   320.00  3         2022-03-04   34          888.55
10926    2       2022-03-04 29   514.40  4         2022-03-04   63          1402.95

The MAX function’s window specification has only a window partition clause. The ROW_NUMBER function’s window specification has a window partition clause that is the same as the MAX function’s, plus a window order clause. Both SUM functions have the same window partition and order clauses as the ROW_NUMBER function’s, plus a window frame clause.

The recursive capability of the WINDOW clause’s syntax allows you to shorten the query’s code, like so:

SELECT orderid, custid, orderdate, qty, val,
  ROW_NUMBER() OVER PO AS ordernum,
  MAX(orderdate) OVER P AS maxorderdate,
  SUM(qty) OVER POF AS runsumqty,
  SUM(val) OVER POF AS runsumval
FROM Sales.OrderValues
WHERE custid IN (1, 2)
WINDOW P AS ( PARTITION BY custid ),
       PO AS ( P ORDER BY orderdate, orderid ),
       POF AS ( PO ROWS UNBOUNDED PRECEDING )
ORDER BY custid, orderdate, orderid;

The order of the window name definitions in the WINDOW clause is insignificant. For example, the following code is valid and has the same meaning as the above query:

SELECT orderid, custid, orderdate, qty, val,
  ROW_NUMBER() OVER PO AS ordernum,
  MAX(orderdate) OVER P AS maxorderdate,
  SUM(qty) OVER POF AS runsumqty,
  SUM(val) OVER POF AS runsumval
FROM Sales.OrderValues
WHERE custid IN (1, 2)
WINDOW POF AS ( PO ROWS UNBOUNDED PRECEDING ),
       PO AS ( P ORDER BY orderdate, orderid ),
       P AS ( PARTITION BY custid )
ORDER BY custid, orderdate, orderid;

Note, though, you can't use multiple window name references in one window name specification. You're limited to only one window name reference, plus any relevant additional window specification elements. For example, the following code isn’t valid for this reason:

SELECT orderid, custid, orderdate, qty, val,
  SUM(qty) OVER ( P O F ) AS runsumqty,
  SUM(val) OVER ( P O F ) AS runsumval
FROM Sales.OrderValues
WHERE custid IN (1, 2)
WINDOW P AS ( PARTITION BY custid ),
       O AS ( ORDER BY orderdate, orderid ),
       F AS ( ROWS UNBOUNDED PRECEDING )
ORDER BY custid, orderdate, orderid;

This code generates the following error:

Msg 102, Level 15, State 1, Line 106
Incorrect syntax near 'O'.

You're allowed to mix one window name and additional windowing elements in a window specification, like so:

SELECT orderid, custid, orderdate, qty, val,
  ROW_NUMBER() OVER ( P ORDER BY orderdate, orderid ) AS ordernum,
  MAX(orderdate) OVER P AS maxorderdate
FROM Sales.OrderValues
WHERE custid IN (1, 2)
WINDOW P AS ( PARTITION BY custid )
ORDER BY custid, orderdate, orderid;

This query generates the following output:

orderid     custid      orderdate  qty         val     ordernum             maxorderdate
----------- ----------- ---------- ----------- ------- -------------------- ------------
10643       1           2021-08-25 38          814.50  1                    2022-04-09
10692       1           2021-10-03 20          878.00  2                    2022-04-09
10702       1           2021-10-13 21          330.00  3                    2022-04-09
10835       1           2022-01-15 17          845.80  4                    2022-04-09
10952       1           2022-03-16 18          471.20  5                    2022-04-09
11011       1           2022-04-09 60          933.50  6                    2022-04-09
10308       2           2020-09-18 6           88.80   1                    2022-03-04
10625       2           2021-08-08 18          479.75  2                    2022-03-04
10759       2           2021-11-28 10          320.00  3                    2022-03-04
10926       2           2022-03-04 29          514.40  4                    2022-03-04

As I mentioned before, when a window name represents the whole window specification, like with the MAX function in this query, you specify the window name right after the OVER clause without parentheses. When the window name is only part of the window specification, like with the ROW_NUMBER function in this query, you specify the window name followed by the rest of the windowing elements within parentheses.

By now, you know you're allowed to recursively define one window name based on another. However, in case it wasn’t obvious, cyclic references aren’t allowed. For example, the following query is valid since the window name definitions aren’t cyclic:

SELECT 'This is valid'
WINDOW W1 AS (), W2 AS (W1), W3 AS (W2);

This query generates the following output:

-------------
This is valid

However, the following query is invalid since the window name definitions are cyclic:

SELECT 'This is invalid'
WINDOW W1 AS (W2), W2 AS (W3), W3 AS (W1);

This code generates the following error:

Msg 5365, Level 15, State 1, Line 108
Cyclic window references are not permitted.

Lastly, the scope of the defined window names is the immediate query/table expression and can't cross table expression boundaries. For instance, if you define a window name in the inner query of a CTE, derived table, view, or inline table-valued function, the outer query won’t recognize the inner window name. As an example, the following query is invalid for this reason:

WITH C AS
(
  SELECT orderid, custid, orderdate, qty, val,
    SUM(qty) OVER W AS runsumqtyall
  FROM Sales.OrderValues
  WHERE custid IN (1, 2)
  WINDOW W AS ( PARTITION BY custid 
                ORDER BY orderdate, orderid
                ROWS UNBOUNDED PRECEDING )
)
SELECT *,
  SUM(qty) OVER W AS runsumqty22
FROM C
WHERE orderdate >= '20220101';

This code generates the following error:

Msg 5362, Level 15, State 3, Line 172
Window 'W' is undefined.

You have to define a window name you want to use in each of the scopes where you want to use it, like so:

WITH C AS
(
  SELECT orderid, custid, orderdate, qty, val,
    SUM(qty) OVER W AS runsumqtyall
  FROM Sales.OrderValues
  WHERE custid IN (1, 2)
  WINDOW W AS ( PARTITION BY custid 
                ORDER BY orderdate, orderid
                ROWS UNBOUNDED PRECEDING )
)
SELECT *,
  SUM(qty) OVER W AS runsumqty22
FROM C
WHERE orderdate >= '20220101'
WINDOW W AS ( PARTITION BY custid 
              ORDER BY orderdate, orderid
              ROWS UNBOUNDED PRECEDING );

This query generates the following output:

orderid     custid      orderdate  qty         val     runsumqtyall runsumqty22
----------- ----------- ---------- ----------- ------- ------------ -----------
10835       1           2022-01-15 17          845.80  96           17
10952       1           2022-03-16 18          471.20  114          35
11011       1           2022-04-09 60          933.50  174          95
10926       2           2022-03-04 29          514.40  63           29

Each of the scopes defines its own window name W, and they don’t have to be based on the same specification (though they are in this example).

The Windowing NULL Treatment Clause

The NULL treatment clause is part of the ISO/IEC SQL standard and is available to the offset window functions FIRST_VALUE, LAST_VALUE, LAG and LEAD. This clause has the following syntax:

<function>(<scalar_expression>[, <other args>]) [IGNORE NULLS | RESPECT NULLS] OVER( <specification> )

RESPECT NULLS is the default behavior if you don’t specify this clause. It means you want the function to return the value of <scalar_expression> in the requested position (first, last, previous, next), whether it's NULL or non-NULL. The IGNORE NULLS option introduces a new capability people have been eagerly waiting for in T-SQL for a long time. It means you want the function to return the value of <scalar_expression> in the requested position if it's non-NULL. However, if it is NULL, you want the function to keep going in the relevant direction (backward for LAST_VALUE and LAG, forward for FIRST_VALUE and LEAD) until a non-NULL value is found. If no non-NULL value is found, the function returns a NULL.

To illustrate the utility of this clause, I’ll use a table called T1 in my examples. Use the following code to create and populate T1:

DROP TABLE IF EXISTS dbo.T1;

CREATE TABLE dbo.T1
(
  id INT NOT NULL CONSTRAINT PK_T1 PRIMARY KEY,
  col1 INT NULL,
  col2 INT NULL
);
GO

INSERT INTO dbo.T1(id, col1, col2) VALUES
  ( 2, NULL,  200),
  ( 3,   10, NULL),
  ( 5,   -1, NULL),
  ( 7, NULL,  202),
  (11, NULL,  150),
  (13,  -12,   50),
  (17, NULL,  180),
  (19, NULL,  170),
  (23, 1759, NULL);

Suppose the column id represents the chronological order of the events recorded in T1. Each row represents an event where one or more attribute values have changed. A NULL means the attribute retains whatever last non-NULL value it had up to that point.

Suppose you need to return the last known (non-NULL) col1 value per event. Without access to the NULL treatment clause, you'd need to use a fairly complex technique such as the following:

WITH C AS
(
  SELECT id, col1,
    MAX(CASE WHEN col1 IS NOT NULL THEN id END)
      OVER(ORDER BY id
           ROWS UNBOUNDED PRECEDING) AS grp
  FROM dbo.T1
)
SELECT id, col1,
  MAX(col1) OVER(PARTITION BY grp
                 ORDER BY id
                 ROWS UNBOUNDED PRECEDING) AS lastknowncol1
FROM C;

If you aren't already familiar with this technique, it can take a bit to figure out the logic here. The grp column computes, for each row, the id of the last row so far that has a non-NULL col1 value. Consequently, each distinct grp value defines a group made of one row with a non-NULL col1 value followed by its trailing NULL rows, and the MAX aggregate within the group extracts that non-NULL value.

This code generates the following output:

id          col1        lastknowncol1
----------- ----------- -------------
2           NULL        NULL
3           10          10
5           -1          -1
7           NULL        -1
11          NULL        -1
13          -12         -12
17          NULL        -12
19          NULL        -12
23          1759        1759

Having access to the NULL treatment clause, you can easily achieve the same using the LAST_VALUE function with the IGNORE NULLS option, like so:

SELECT id, col1,
  LAST_VALUE(col1) IGNORE NULLS OVER( ORDER BY id ROWS UNBOUNDED PRECEDING ) AS lastknowncol1
FROM dbo.T1;

The difference is of course more dramatic if you need to apply this logic to multiple attributes.

Without access to the NULL treatment clause, you'd use the following code to return the last known col1 and col2 values:

WITH C AS
(
  SELECT id, col1, col2,
    MAX(CASE WHEN col1 IS NOT NULL THEN id END)
      OVER(ORDER BY id
           ROWS UNBOUNDED PRECEDING) AS grp1,
    MAX(CASE WHEN col2 IS NOT NULL THEN id END)
      OVER(ORDER BY id
           ROWS UNBOUNDED PRECEDING) AS grp2
  FROM dbo.T1
)
SELECT id,
  col1,
  MAX(col1) OVER(PARTITION BY grp1
                 ORDER BY id
                 ROWS UNBOUNDED PRECEDING) AS lastknowncol1,
  col2,
  MAX(col2) OVER(PARTITION BY grp2
                 ORDER BY id
                 ROWS UNBOUNDED PRECEDING) AS lastknowncol2
FROM C;

This code generates the following output:

id          col1        lastknowncol1 col2        lastknowncol2
----------- ----------- ------------- ----------- -------------
2           NULL        NULL          200         200
3           10          10            NULL        200
5           -1          -1            NULL        200
7           NULL        -1            202         202
11          NULL        -1            150         150
13          -12         -12           50          50
17          NULL        -12           180         180
19          NULL        -12           170         170
23          1759        1759          NULL        170

I should also note that even though the table T1 has a supporting covering index with id as the key (the clustered index created by the PK_T1 primary key constraint), each of the last known attribute calculations in the query above results in an explicit sort operator in the plan, as shown in Figure 1.

Figure 1: Plan for query without the NULL treatment clause

This fact makes this solution quite expensive.

Here’s the alternative using the NULL treatment clause:

SELECT id, 
  col1, LAST_VALUE(col1) IGNORE NULLS OVER W AS lastknowncol1,
  col2, LAST_VALUE(col2) IGNORE NULLS OVER W AS lastknowncol2
FROM dbo.T1
WINDOW W AS ( ORDER BY id ROWS UNBOUNDED PRECEDING );

This solution is much shorter and more elegant. Moreover, the optimization of the functions with this option can rely on an ordered scan of a supporting index and thus avoid explicit sorting, as shown in the plan for this query in Figure 2.

Figure 2: Plan for query with the NULL treatment clause

As mentioned, the NULL treatment clause is available to all offset window functions (FIRST_VALUE, LAST_VALUE, LAG, and LEAD). Here’s an example using LAG to return the previous known value:

SELECT id, col1, 
  LAG(col1) IGNORE NULLS OVER ( ORDER BY id ) AS prevknowncol1
FROM dbo.T1;

This code generates the following output:

id          col1        prevknowncol1
----------- ----------- -------------
2           NULL        NULL
3           10          NULL
5           -1          10
7           NULL        -1
11          NULL        -1
13          -12         -1
17          NULL        -12
19          NULL        -12
23          1759        -12

Want to try to achieve the same without the NULL treatment clause? I bet you don’t!
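In case you are curious anyway, here's one possible workaround (a sketch I'm adding for illustration; it isn't part of the original demos): compute, for each row, the id of the last preceding row that has a non-NULL col1 value, and then join back to T1 to fetch that row's col1 value:

WITH C AS
(
  SELECT id, col1,
    MAX(CASE WHEN col1 IS NOT NULL THEN id END)
      OVER(ORDER BY id
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prevnonnullid
  FROM dbo.T1
)
SELECT C.id, C.col1, P.col1 AS prevknowncol1
FROM C
  LEFT OUTER JOIN dbo.T1 AS P
    ON P.id = C.prevnonnullid;

The window frame trickery plus the self join make it pretty clear how much simpler the IGNORE NULLS version is.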

Conclusion and Other T-SQL Improvements in SQL Server 2022

In this article I covered T-SQL improvements in SQL Server 2022 concerning window functions and NULL handling. I showed how to:

  • Reuse parts of—or entire—window definitions with the WINDOW clause
  • Control NULL treatment in offset window functions with the NULL treatment clause

SQL Server 2022 has additional T-SQL improvements, covered by Aaron Bertrand in this article:

  • GREATEST / LEAST
  • STRING_SPLIT
  • DATE_BUCKET
  • GENERATE_SERIES

The post T-SQL Windowing Improvements in SQL Server 2022 appeared first on SQLPerformance.com.

Are You Sorted? Tips Concerning T-SQL Window Ordering https://sqlperformance.com/2022/05/t-sql-queries/are-you-sorted-window-ordering https://sqlperformance.com/2022/05/t-sql-queries/are-you-sorted-window-ordering#comments Wed, 11 May 2022 09:00:07 +0000 https://sqlperformance.com/?p=11359 Queries with multiple ordering needs typically involve sorts in their plans. By following these tips, you can minimize the number of needed sorts. Learn more.

The post Are You Sorted? Tips Concerning T-SQL Window Ordering appeared first on SQLPerformance.com.

A supporting index can potentially help avoid the need for explicit sorting in the query plan when optimizing T-SQL queries involving window functions. By a supporting index, I mean one with the window partitioning and ordering elements as the index key, and the rest of the columns that appear in the query as the index included columns. I often refer to such an indexing pattern as a POC index as an acronym for partitioning, ordering, and covering. Naturally, if a partitioning or ordering element doesn’t appear in the window function, you omit that part from the index definition.
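For instance, a POC index supporting a window function such as SUM(val) OVER(PARTITION BY grp ORDER BY ord ROWS UNBOUNDED PRECEDING), in a query that also returns a column called othercol, could look like the following (the table and column names here are hypothetical placeholders):

CREATE NONCLUSTERED INDEX idx_POC
  ON dbo.SomeTable(grp, ord) -- P (partitioning) and O (ordering) elements as the index key
  INCLUDE(val, othercol);    -- C (covering) elements as included columns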

But what about queries involving multiple window functions with different ordering needs? Similarly, what if other elements in the query besides window functions also require arranging input data as ordered in the plan, such as a presentation ORDER BY clause? These can result in different parts of the plan needing to process the input data in different orders.

In such circumstances, you’ll typically accept that explicit sorting is unavoidable in the plan. However, you may find that the syntactical arrangement of expressions in the query affects how many explicit Sort operators you get in the plan. By following some basic tips, you can sometimes reduce the number of explicit Sort operators, which can, of course, have a major impact on the performance of the query.

Environment for Demos

In my examples, I’ll use the sample database PerformanceV5. You can download the source code to create and populate this database here.

I ran all the examples on SQL Server® 2019 Developer, where batch-mode on rowstore is available.

In this article, I want to focus on tips that help the calculation of a window function in the plan rely on ordered input data without requiring an extra explicit Sort operator. This is relevant when the optimizer uses a serial or parallel row-mode treatment of window functions, and when it uses a serial batch-mode Window Aggregate operator.

SQL Server doesn't currently support an efficient combination of a parallel order-preserving input prior to a parallel batch-mode Window Aggregate operator. So, to use a parallel batch-mode Window Aggregate operator, the optimizer has to inject an intermediary parallel batch-mode Sort operator, even when the input is already preordered.

For simplicity’s sake, you can prevent parallelism in all examples shown in this article. To achieve this without needing to add a hint to all queries, and without setting a server-wide configuration option, you can set the database scoped configuration option MAXDOP to 1, like so:

USE PerformanceV5;

ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 1;

Remember to set it back to 0 after you’re done testing the examples in this article. I’ll remind you at the end.

Alternatively, you can prevent parallelism at the session level with the undocumented DBCC OPTIMIZER_WHATIF command, like so:

DBCC OPTIMIZER_WHATIF(CPUs, 1);

To reset the option when you’re done, invoke it again with the value 0 as the number of CPUs.

When you’re done trying all of the examples in this article with parallelism disabled, I recommend enabling parallelism and trying all examples again to see what changes.

Tips 1 and 2

Before I start with the tips, let’s first look at a simple example with a window function designed to benefit from a supporting index.

Consider the following query, which I’ll refer to as Query 1:

SELECT orderid, orderdate, custid,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS sum1
  
FROM dbo.Orders;

Don’t worry about the fact that the example is contrived; there’s no good business reason to compute a running total of order IDs. The Orders table is decently sized (1,000,000 rows), and I simply wanted a basic example with a common window function, such as one that applies a running total computation.

Following the POC indexing scheme, you create the following index to support the query:

CREATE UNIQUE NONCLUSTERED INDEX idx_nc_cid_od_oid ON dbo.Orders(custid, orderdate, orderid);

The plan for this query is shown in Figure 1.

Figure 1: Plan for Query 1

No surprises here. The plan applies an index order scan of the index you just created, providing the data ordered to the Window Aggregate operator, without the need for explicit sorting.

Next, consider the following query, which involves multiple window functions with different ordering needs, as well as a presentation ORDER BY clause:

SELECT orderid, orderdate, custid,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum2,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS sum1,

  SUM(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum3

FROM dbo.Orders
ORDER BY custid, orderid;

I’ll refer to this query as Query 2. The plan for this query is shown in Figure 2.

Figure 2: Plan for Query 2

Notice there are four Sort operators in the plan.

If you analyze the various window functions and presentation ordering needs, you’ll find there are three distinct ordering needs:

  • custid, orderdate, orderid
  • orderid
  • custid, orderid

Given that one of them (the first in the list above) can be supported by the index you created earlier, you would expect to see only two sorts in the plan. So, why does the plan have four? It looks like SQL Server doesn’t try to be too sophisticated about rearranging the processing order of the functions in the plan to minimize sorts; it processes the functions in the order they appear in the query. That’s at least the case for the first occurrence of each distinct ordering need, but I’ll elaborate on this shortly.

You can remove the need for some of the sorts in the plan by applying the following two simple practices:

Tip 1: If you have an index to support some of the window functions in the query, specify those first.

Tip 2: If the query involves window functions with the same ordering need as the presentation ordering in the query, specify those functions last.

Following these tips, you rearrange the appearance order of the window functions in the query like so:

SELECT orderid, orderdate, custid,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS sum1,

  SUM(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum3,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum2

FROM dbo.Orders
ORDER BY custid, orderid;

I’ll refer to this query as Query 3. The plan for this query is shown in Figure 3.

Figure 3: Plan for Query 3

As you can see, the plan now has only two sorts.

Tip 3

SQL Server doesn’t try to be too sophisticated in rearranging the processing order of window functions in an attempt to minimize sorts in the plan. However, it’s capable of a certain simple rearrangement. It scans the window functions based on their appearance order in the query, and each time it detects a new distinct ordering need, it looks ahead for additional window functions with the same ordering need; if it finds any, it groups them together with the first occurrence. In some cases, it can even use the same operator to compute multiple window functions.

Consider the following query as an example:

SELECT orderid, orderdate, custid,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS sum1,

  SUM(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum3,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum2,

  MAX(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS max1,

  MAX(orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS max3,

  MAX(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS max2,

  AVG(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS avg1,

  AVG(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS avg3,

  AVG(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS avg2

FROM dbo.Orders
ORDER BY custid, orderid;

I’ll refer to this query as Query 4. The plan for this query is shown in Figure 4.

Figure 4: Plan for Query 4

Window functions with the same ordering needs aren’t grouped together in the query. However, there are still only two sorts in the plan. This is because what counts in terms of processing order in the plan is the first occurrence of each distinct ordering need. This leads me to the third tip.

Tip 3: Make sure to follow tips 1 and 2 for the first occurrence of each distinct ordering need. Subsequent occurrences of the same ordering need, even if nonadjacent, are identified and grouped together with the first.

Tips 4 and 5

Suppose you want to return columns resulting from windowed calculations in a certain left-to-right order in the output. But what if the order isn’t the same as the order that will minimize sorts in the plan?

For example, suppose you want the same result as the one produced by Query 2 in terms of left-to-right column order in the output (column order: other cols, sum2, sum1, sum3), but you’d rather have the same plan as the one you got for Query 3 (column order: other cols, sum1, sum3, sum2), which had two sorts instead of four.

That’s perfectly doable if you’re familiar with the fourth tip.

Tip 4: The aforementioned recommendations apply to appearance order of window functions in the code, even if within a named table expression such as a CTE or view, and even if the outer query returns the columns in a different order than in the named table expression. Therefore, if you need to return columns in a certain order in the output, and it’s different from the optimal order in terms of minimizing sorts in the plan, follow the tips in terms of appearance order within a named table expression, and return the columns in the outer query in the desired output order.

The following query, which I’ll refer to as Query 5, illustrates this technique:

WITH C AS
(
  SELECT orderid, orderdate, custid,

    SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS sum1,

    SUM(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum3,

    SUM(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum2

  FROM dbo.Orders  
)
SELECT orderid, orderdate, custid,
  sum2, sum1, sum3
FROM C
ORDER BY custid, orderid;

The plan for this query is shown in Figure 5.

Figure 5: Plan for Query 5

You still get only two sorts in the plan despite the fact that the column order in the output is: other cols, sum2, sum1, sum3, like in Query 2.

One caveat to this trick with the named table expression: if columns in the table expression aren’t referenced by the outer query, they’re excluded from the plan and therefore don’t count.

Consider the following query, which I’ll refer to as Query 6:

WITH C AS
(
  SELECT orderid, orderdate, custid,

    MAX(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS max1,

    MAX(orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS max3,

    MAX(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS max2,

    AVG(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS avg1,

    AVG(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS avg3,

    AVG(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS avg2,

    SUM(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum2,

    SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS sum1,

    SUM(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum3

  FROM dbo.Orders  
)
SELECT orderid, orderdate, custid,
  sum2, sum1, sum3,
  max2, max1, max3,
  avg2, avg1, avg3
FROM C
ORDER BY custid, orderid;

Here all table expression columns are referenced by the outer query, so optimization happens based on the first distinct occurrence of each ordering need within the table expression:

  • max1: custid, orderdate, orderid
  • max3: orderid
  • max2: custid, orderid

This results in a plan with only two sorts as shown in Figure 6.

Figure 6: Plan for Query 6

Now change only the outer query by removing the references to max2, max1, max3, avg2, avg1 and avg3, like so:

WITH C AS
(
  SELECT orderid, orderdate, custid,

    MAX(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS max1,

    MAX(orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS max3,

    MAX(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS max2,

    AVG(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS avg1,

    AVG(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS avg3,

    AVG(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS avg2,

    SUM(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum2,

    SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS sum1,

    SUM(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum3

  FROM dbo.Orders  
)
SELECT orderid, orderdate, custid,
  sum2, sum1, sum3
FROM C
ORDER BY custid, orderid;

I’ll refer to this query as Query 7. The computations of max1, max3, max2, avg1, avg3, and avg2 in the table expression are irrelevant to the outer query so they are excluded. The remaining computations involving window functions in the table expression, which are relevant to the outer query, are those of sum2, sum1, and sum3. Unfortunately, they do not appear in the table expression in optimal order in terms of minimizing sorts. As you can see in the plan for this query as shown in Figure 7, there are four sorts.

Figure 7: Plan for Query 7

If you’re thinking it’s unlikely you will have columns in the inner query you won’t refer to in the outer query, think views. Each time you query a view, you might be interested in a different subset of the columns. With this in mind, the fifth tip could help in reducing sorts in the plan.

Tip 5: In the inner query of a named table expression like a CTE or view, group all window functions with the same ordering needs together, and follow tips 1 and 2 in the order of the groups of functions.

The following code implements a view based on this recommendation:

CREATE OR ALTER VIEW dbo.MyView
AS

SELECT orderid, orderdate, custid,

  MAX(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS max1,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS sum1,

  AVG(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS avg1,

  MAX(orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS max3,

  SUM(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum3,

  AVG(1.0 * orderid) OVER(ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS avg3,

  MAX(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS max2,

  AVG(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS avg2,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum2

FROM dbo.Orders;
GO

Now query the view requesting only the windowed result columns sum2, sum1, and sum3, in this order:

SELECT orderid, orderdate, custid,
  sum2, sum1, sum3
FROM dbo.MyView
ORDER BY custid, orderid;

I’ll refer to this query as Query 8. You get the plan shown in Figure 8 with only two sorts.

Figure 8: Plan for Query 8

Tip 6

When you have a query with multiple window functions with multiple distinct ordering needs, the common wisdom is you can support only one of them with preordered data via an index. This is the case even when all window functions have respective supporting indexes.

Let me demonstrate this. Recall earlier when you created the index idx_nc_cid_od_oid, which can support window functions needing the data ordered by custid, orderdate, orderid, such as the following expression:

SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING)

Suppose, in addition to this window function, you also need the following window function in the same query:

SUM(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING)

This window function would benefit from the following index:

CREATE UNIQUE NONCLUSTERED INDEX idx_nc_cid_oid ON dbo.Orders(custid, orderid);

The following query, which I’ll refer to as Query 9, invokes both window functions:

SELECT orderid, orderdate, custid,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS sum1,

  SUM(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum2

FROM dbo.Orders;

The plan for this query is shown in Figure 9.

Figure 9: Plan for Query 9

I get the following time stats for this query on my machine, with results discarded in SSMS:

CPU time = 3234 ms,  elapsed time = 3354 ms.

As explained earlier, SQL Server scans the windowed expressions in order of appearance in the query and figures it can support the first with an ordered scan of the index idx_nc_cid_od_oid. But then it adds a Sort operator to the plan to order the data like the second window function needs. This means the plan has N log N scaling. It doesn’t consider using the index idx_nc_cid_oid to support the second window function. You’re probably thinking it can’t, but try to think a bit outside of the box. Could you not compute each of the window functions based on its respective index order and then join the results? Theoretically, you can, and depending on the size of the data, availability of indexing, and other resources available, the join version could sometimes do better. SQL Server doesn’t consider this approach, but you certainly can implement it by writing the join yourself, like so:

WITH C1 AS
(
  SELECT orderid, orderdate, custid,

    SUM(orderid) OVER(PARTITION BY custid ORDER BY orderdate, orderid ROWS UNBOUNDED PRECEDING) AS sum1

  FROM dbo.Orders
),
C2 AS
(
  SELECT orderid, custid,

    SUM(orderid) OVER(PARTITION BY custid ORDER BY orderid ROWS UNBOUNDED PRECEDING) AS sum2

  FROM dbo.Orders
)
SELECT C1.orderid, C1.orderdate, C1.custid, C1.sum1, C2.sum2
FROM C1
  INNER JOIN C2
    ON C1.orderid = C2.orderid;

I’ll refer to this query as Query 10. The plan for this query is shown in Figure 10.

Figure 10: Plan for Query 10

The plan uses ordered scans of the two indexes with no explicit sorting whatsoever, computes the window functions, and uses a hash join to join the results. This plan scales linearly, compared to the previous one, which has N log N scaling.
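To get a sense of what this scaling difference means, here's a rough back-of-the-envelope estimate (an illustrative comparison-count figure, not a measured cost). With N = 1,000,000 rows, a comparison-based sort does work on the order of

N × log2(N) ≈ 1,000,000 × 20 = 20,000,000

operations, whereas an order-preserving scan does work proportional to N = 1,000,000. The gap only widens as N grows.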

I get the following time stats for this query on my machine (again with results discarded in SSMS):

CPU time = 1000 ms,  elapsed time = 1100 ms.

To recap, here’s our sixth tip.

Tip 6: When you have multiple window functions with multiple distinct ordering needs, and you’re able to support all of them with indexes, try a join version and compare its performance to the query without the join.

Cleanup

If you disabled parallelism by setting the database scoped configuration option MAXDOP to 1, reenable parallelism by setting it to 0:

ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 0;

If you used the undocumented session option DBCC OPTIMIZER_WHATIF with the CPUs option set to 1, reenable parallelism by setting it to 0:

DBCC OPTIMIZER_WHATIF(CPUs, 0);

You can retry all examples with parallelism enabled if you like.

Use the following code to clean up the new indexes you created:

DROP INDEX IF EXISTS idx_nc_cid_od_oid ON dbo.Orders;
DROP INDEX IF EXISTS idx_nc_cid_oid ON dbo.Orders;

And the following code to remove the view:

DROP VIEW IF EXISTS dbo.MyView;

Follow the Tips to Minimize Number of Sorts

Window functions need to process the input data ordered. Indexing can help in eliminating sorting in the plan, but normally only for one distinct ordering need. Queries with multiple ordering needs typically involve some sorts in their plans. However, by following certain tips, you can minimize the number of sorts needed. Here’s a summary of the tips I mentioned in this article:

  • Tip 1: If you have an index to support some of the window functions in the query, specify those first.
  • Tip 2: If the query involves window functions with the same ordering need as the presentation ordering in the query, specify those functions last.
  • Tip 3: Make sure to follow tips 1 and 2 for the first occurrence of each distinct ordering need. Subsequent occurrences of the same ordering need, even if nonadjacent, are identified and grouped together with the first.
  • Tip 4: The aforementioned recommendations apply to appearance order of window functions in the code, even if within a named table expression such as a CTE or view, and even if the outer query returns the columns in a different order than in the named table expression. Therefore, if you need to return columns in a certain order in the output, and it’s different from the optimal order in terms of minimizing sorts in the plan, follow the tips in terms of appearance order within a named table expression, and return the columns in the outer query in the desired output order.
  • Tip 5: In the inner query of a named table expression like a CTE or view, group all window functions with the same ordering needs together, and follow tips 1 and 2 in the order of the groups of functions.
  • Tip 6: When you have multiple window functions with multiple distinct ordering needs, and you’re able to support all of them with indexes, try a join version and compare its performance to the query without the join.

The post Are You Sorted? Tips Concerning T-SQL Window Ordering appeared first on SQLPerformance.com.

Islands T-SQL Challenge https://sqlperformance.com/2022/04/t-sql-queries/islands-t-sql-challenge https://sqlperformance.com/2022/04/t-sql-queries/islands-t-sql-challenge#comments Wed, 13 Apr 2022 09:00:57 +0000 https://sqlperformance.com/?p=11335 Itzik Ben-Gan introduces another T-SQL challenge; this time, it involves finding islands without explicit sort operators.

The post Islands T-SQL Challenge appeared first on SQLPerformance.com.

Recently, I was introduced to a new islands challenge by my friend Erland Sommarskog. It’s based on a question from a public database forum. The challenge itself isn’t complicated to handle when using well-known techniques, which primarily employ window functions. However, these techniques require explicit sorting despite the presence of a supporting index. This affects the scalability and response time of the solutions. Fond of challenges, I set out to find a solution to minimize the number of explicit Sort operators in the plan, or better yet, eliminate the need for those altogether. And I found such a solution.

I’ll start by presenting a generalized form of the challenge. I’ll then show two solutions based on known techniques, followed by the new solution. Finally, I’ll compare the performance of the different solutions.

I recommend you try to find a solution before implementing mine.

The challenge

I’ll present a generalized form of Erland’s original islands challenge.

Use the following code to create a table called T1 and populate it with a small set of sample data:

SET NOCOUNT ON;

USE tempdb;

DROP TABLE IF EXISTS dbo.T1;

CREATE TABLE dbo.T1
(
  grp VARCHAR(10) NOT NULL,
  ord INT NOT NULL,
  val VARCHAR(10) NOT NULL,
  CONSTRAINT PK_T1 PRIMARY KEY(grp, ord)
);
GO

INSERT INTO dbo.T1(grp, ord, val) VALUES
  ('Group A', 1002, 'Y'),
  ('Group A', 1003, 'Y'),
  ('Group A', 1005, 'Y'),
  ('Group A', 1007, 'N'),
  ('Group A', 1011, 'N'),
  ('Group A', 1013, 'N'),
  ('Group A', 1017, 'Y'),
  ('Group A', 1019, 'Y'),
  ('Group A', 1023, 'N'),
  ('Group A', 1029, 'N'),
  ('Group B', 1001, 'X'),
  ('Group B', 1002, 'X'),
  ('Group B', 1003, 'Z'),
  ('Group B', 1005, 'Z'),
  ('Group B', 1008, 'Z'),
  ('Group B', 1013, 'Z'),
  ('Group B', 1021, 'Y'),
  ('Group B', 1034, 'Y');

The challenge is as follows:

Assuming partitioning based on the column grp and ordering based on the column ord, compute sequential row numbers starting with 1 within each consecutive group of rows with the same value in the val column. Following is the desired result for the given small set of sample data:

grp      ord   val  seqno
-------- ----- ---- ------
Group A  1002  Y    1
Group A  1003  Y    2
Group A  1005  Y    3
Group A  1007  N    1
Group A  1011  N    2
Group A  1013  N    3
Group A  1017  Y    1
Group A  1019  Y    2
Group A  1023  N    1
Group A  1029  N    2
Group B  1001  X    1
Group B  1002  X    2
Group B  1003  Z    1
Group B  1005  Z    2
Group B  1008  Z    3
Group B  1013  Z    4
Group B  1021  Y    1
Group B  1034  Y    2

Note the definition of the primary key constraint based on the composite key (grp, ord), which results in a clustered index based on the same key. This index can potentially support window functions partitioned by grp and ordered by ord.
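To make the expected computation concrete before you attempt a T-SQL solution, here's a minimal Python sketch (my own illustration, not part of the challenge; group names shortened to 'A'/'B') that derives seqno in a single pass over rows presorted by (grp, ord), mirroring the order the clustered index would deliver:

```python
from itertools import groupby

# Sample rows as (grp, ord, val), presorted by (grp, ord) as the
# clustered index on (grp, ord) would deliver them.
rows = [
    ('A', 1002, 'Y'), ('A', 1003, 'Y'), ('A', 1005, 'Y'),
    ('A', 1007, 'N'), ('A', 1011, 'N'), ('A', 1013, 'N'),
    ('A', 1017, 'Y'), ('A', 1019, 'Y'), ('A', 1023, 'N'), ('A', 1029, 'N'),
    ('B', 1001, 'X'), ('B', 1002, 'X'), ('B', 1003, 'Z'), ('B', 1005, 'Z'),
    ('B', 1008, 'Z'), ('B', 1013, 'Z'), ('B', 1021, 'Y'), ('B', 1034, 'Y'),
]

def islands_seqno(rows):
    # Group consecutive rows sharing (grp, val) -- each run is an island --
    # and number the rows 1..n within each island.
    result = []
    for _key, island in groupby(rows, key=lambda r: (r[0], r[2])):
        for seqno, (grp, ordv, val) in enumerate(island, start=1):
            result.append((grp, ordv, val, seqno))
    return result

seqnos = [r[3] for r in islands_seqno(rows)]
```

The sketch reproduces the desired result shown above; the challenge, of course, is getting SQL Server to do the equivalent single ordered pass without explicit Sort operators.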

To test the performance of your solution, you’ll need to populate the table with larger volumes of sample data. Use the following code to create the helper function GetNums:

CREATE FUNCTION dbo.GetNums(@low AS BIGINT = 1, @high AS BIGINT)
  RETURNS TABLE
AS
RETURN
  WITH
    L0 AS ( SELECT 1 AS c 
            FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),
                        (1),(1),(1),(1),(1),(1),(1),(1)) AS D(c) ),
    L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ),
    L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ),
    L3 AS ( SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B ),
    Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
              FROM L3 )
  SELECT TOP(@high - @low + 1)
     rownum AS rn,
     @high + 1 - rownum AS op,
     @low - 1 + rownum AS n
  FROM Nums
  ORDER BY rownum;
GO
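As a quick sanity check on the function's shape, here's a small Python stand-in (my own illustration) that produces the same (rn, op, n) columns, along with the row counts each cross-join level makes available:

```python
def getnums(low, high):
    # Python stand-in for dbo.GetNums: rn ascends from 1, op descends
    # from high, n ascends from low, covering high - low + 1 rows.
    return [(rn, high + 1 - rn, low - 1 + rn)
            for rn in range(1, high - low + 2)]

# Rows available at each level: L0 has 16 constants, and every subsequent
# level cross-joins the previous level with itself, squaring the count.
level_counts = [16, 16 ** 2, 16 ** 4, 16 ** 8]  # L3 covers over 4 billion rows
```

The TOP clause in the T-SQL version then trims that 4-billion-row potential down to exactly the requested range.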

Use the following code to populate T1, after changing the parameters representing the number of groups, number of rows per group and number of distinct values as you wish:

DECLARE 
  @numgroups AS INT = 1000,
  @rowspergroup AS INT = 10000, -- test with 1000 to 10000 here
  @uniquevalues AS INT = 5;

ALTER TABLE dbo.T1 DROP CONSTRAINT PK_T1;

TRUNCATE TABLE dbo.T1;

INSERT INTO dbo.T1 WITH(TABLOCK) (grp, ord, val)
  SELECT
    CAST(G.n AS VARCHAR(10)) AS grp,
    CAST(R.n AS INT) AS ord,
    CAST(ABS(CHECKSUM(NEWID())) % @uniquevalues + 1 AS VARCHAR(10)) AS val
  FROM dbo.GetNums(1, @numgroups) AS G
    CROSS JOIN dbo.GetNums(1, @rowspergroup) AS R;
    
ALTER TABLE dbo.T1 ADD CONSTRAINT PK_T1 PRIMARY KEY CLUSTERED(grp, ord);

In my performance tests, I populated the table with 1,000 groups, between 1,000 and 10,000 rows per group (so 1M to 10M rows), and 5 distinct values. I used a SELECT INTO statement to write the result into a temporary table.

My test machine has four logical CPUs, running SQL Server® 2019 Enterprise.

I’ll assume you’re using an environment designed to support batch mode on row store either directly, e.g., using SQL Server 2019 Enterprise edition like mine, or indirectly, by creating a dummy columnstore index on the table.

Remember, extra points if you manage to come up with an efficient solution without explicit sorting in the plan. Good luck!

Is a Sort operator needed in the optimization of window functions?

Before I cover solutions, a bit of optimization background so what you’ll see in the query plans later will make more sense.

The most common techniques for solving islands tasks such as ours involve using some combination of aggregate and/or ranking window functions. SQL Server can process such window functions using either a series of older row mode operators (Segment, Sequence Project, Segment, Window Spool, Stream Aggregate) or the newer and usually more efficient batch mode Window Aggregate operator.

In both cases, the operators handling the window function’s calculation need to ingest the data ordered by the window partitioning and ordering elements. If you don’t have a supporting index, naturally SQL Server will need to introduce a Sort operator in the plan. For instance, if you have multiple window functions in your solution with more than one unique combination of partitioning and ordering, you’re bound to have explicit sorting in the plan. But what if you have only one unique combination of partitioning and ordering and a supporting index?

The older row mode method can rely on preordered data ingested from an index without the need for an explicit Sort operator in both serial and parallel modes. The newer batch mode operator eliminates much of the inefficiencies of the older row mode optimization and has the inherent benefits of batch mode processing. However, its current parallel implementation requires an intermediary batch mode parallel Sort operator even when a supporting index is present. Only its serial implementation can rely on index order without a Sort operator. This is all to say when the optimizer needs to choose a strategy to handle your window function, assuming you have a supporting index, it will generally be one of the following four options:

  1. Row mode, serial, no sorting
  2. Row mode, parallel, no sorting
  3. Batch mode, serial, no sorting
  4. Batch mode, parallel, sorting

Whichever one results in the lowest plan cost will be chosen, assuming of course prerequisites for parallelism and batch mode are met when considering those. Normally, for the optimizer to justify a parallel plan, the parallelism benefits need to outweigh the extra work like thread synchronization. With option 4 above, the parallelism benefits need to outweigh the usual extra work involved with parallelism, plus the extra sort.

While experimenting with different solutions to our challenge, I had cases where the optimizer chose option 2 above. It chose it despite the row mode method's few inefficiencies, because the combined benefits of parallelism and no sorting resulted in a plan with a lower cost than the alternatives. In some of those cases, forcing a serial plan with a hint resulted in option 3 above, and in better performance.

With this background in mind, let’s look at solutions. I’ll start with two solutions relying on known techniques for islands tasks that cannot escape explicit sorting in the plan.

Solution based on known technique 1

Following is the first solution to our challenge, which is based on a technique that has been known for a while:

WITH C AS
(
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY grp ORDER BY ord) -
      ROW_NUMBER() OVER(PARTITION BY grp, val ORDER BY ord) AS island
  FROM dbo.T1
)
SELECT grp, ord, val, 
  ROW_NUMBER() OVER(PARTITION BY grp, val, island ORDER BY ord) AS seqno
FROM C;

I’ll refer to it as Known Solution 1.

The CTE called C is based on a query that computes two row numbers. The first (I’ll refer to it as rn1) is partitioned by grp and ordered by ord. The second (I’ll refer to it as rn2) is partitioned by grp and val and ordered by ord. Since rn1 has gaps between different groups of the same val and rn2 doesn’t, the difference between rn1 and rn2 (column called island) is a unique constant value for all rows with the same grp and val values. Following is the result of the inner query, including the results of the two row-number computations, which aren’t returned as separate columns:

grp      ord   val  rn1  rn2  island
-------- ----- ---- ---- ---- -------
Group A  1002  Y    1    1    0
Group A  1003  Y    2    2    0
Group A  1005  Y    3    3    0
Group A  1007  N    4    1    3
Group A  1011  N    5    2    3
Group A  1013  N    6    3    3
Group A  1017  Y    7    4    3
Group A  1019  Y    8    5    3
Group A  1023  N    9    4    5
Group A  1029  N    10   5    5
Group B  1001  X    1    1    0
Group B  1002  X    2    2    0
Group B  1003  Z    3    1    2
Group B  1005  Z    4    2    2
Group B  1008  Z    5    3    2
Group B  1013  Z    6    4    2
Group B  1021  Y    7    1    6
Group B  1034  Y    8    2    6

What’s left for the outer query to do is to compute the result column seqno using the ROW_NUMBER function, partitioned by grp, val, and island, and ordered by ord, generating the desired result.

Note you can get the same island value for different val values within the same partition, such as in the output above. This is why it’s important to use grp, val, and island as the window partitioning elements and not grp and island alone.

Similarly, if you’re dealing with an islands task requiring you to group the data by the island and compute group aggregates, you would group the rows by grp, val, and island. But this isn’t the case with our challenge. Here you were tasked with just computing row numbers independently for each island.
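The mechanics of the rn1 - rn2 difference are easy to emulate outside SQL. Here's a minimal Python sketch (my own illustration; group names shortened to 'A'/'B') that maintains both row numbers with per-partition counters and reproduces the island column shown above:

```python
rows = [  # (grp, ord, val), presorted by (grp, ord)
    ('A', 1002, 'Y'), ('A', 1003, 'Y'), ('A', 1005, 'Y'),
    ('A', 1007, 'N'), ('A', 1011, 'N'), ('A', 1013, 'N'),
    ('A', 1017, 'Y'), ('A', 1019, 'Y'), ('A', 1023, 'N'), ('A', 1029, 'N'),
    ('B', 1001, 'X'), ('B', 1002, 'X'), ('B', 1003, 'Z'), ('B', 1005, 'Z'),
    ('B', 1008, 'Z'), ('B', 1013, 'Z'), ('B', 1021, 'Y'), ('B', 1034, 'Y'),
]

def island_column(rows):
    rn1 = {}  # ROW_NUMBER() partitioned by grp
    rn2 = {}  # ROW_NUMBER() partitioned by (grp, val)
    out = []
    for grp, ordv, val in rows:
        rn1[grp] = rn1.get(grp, 0) + 1
        rn2[(grp, val)] = rn2.get((grp, val), 0) + 1
        # rn1 grows with every row of the group, rn2 only with rows of this
        # val, so the difference stays constant within one island.
        out.append((grp, val, rn1[grp] - rn2[(grp, val)]))
    return out

islands = [r[2] for r in island_column(rows)]
```

Note how the difference column alone repeats across different val values (0 appears for both Y and X islands), which is exactly why the T-SQL solution must partition the final ROW_NUMBER by grp, val, and island together.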

Figure 1 has the default plan that I got for this solution on my machine after populating the table with 10M rows.

Figure 1: Parallel plan for Known Solution 1

The computation of rn1 can rely on index order. So, the optimizer chose the no sort + parallel row mode strategy for this one (first Segment and Sequence Project operators in the plan). Since the computations of both rn2 and seqno use their own unique combinations of partitioning and ordering elements, explicit sorting is unavoidable for those irrespective of the strategy used. So, the optimizer chose the sort + parallel batch mode strategy for both. This plan involves two explicit Sort operators.

In my performance test, it took this solution 3.68 seconds to complete against 1M rows and 43.1 seconds against 10M rows.

As mentioned, I tested all solutions also by forcing a serial plan (with a MAXDOP 1 hint). The serial plan for this solution is shown in Figure 2.

Figure 2: Serial plan for Known Solution 1

As expected, this time also the computation of rn1 uses the batch mode strategy without a preceding Sort operator, but the plan still has two Sort operators for the subsequent row number computations. The serial plan performed worse than the parallel one on my machine with all input sizes I tested, taking 4.54 seconds to complete with 1M rows and 61.5 seconds with 10M rows.

Solution based on known technique 2

The second solution I’ll present is based on a brilliant technique for island detection I learned from Paul White a few years ago. Following is the complete solution code based on this technique:

WITH C1 AS
(
  SELECT *,
    CASE
      WHEN val = LAG(val) OVER(PARTITION BY grp ORDER BY ord) THEN 0
      ELSE 1
    END AS isstart
  FROM dbo.T1
),
C2 AS
(
  SELECT *,
    SUM(isstart) OVER(PARTITION BY grp ORDER BY ord ROWS UNBOUNDED PRECEDING) AS island
  FROM C1
)
SELECT grp, ord, val, 
  ROW_NUMBER() OVER(PARTITION BY grp, island ORDER BY ord) AS seqno
FROM C2;

I’ll refer to this solution as Known Solution 2.

The query defining the CTE C1 uses a CASE expression and the LAG window function (partitioned by grp and ordered by ord) to compute a result column called isstart. This column has the value 0 when the current val value is the same as the previous and 1 otherwise. In other words, it’s 1 when the row is the first in an island and 0 otherwise.

Following is the output of the query defining C1:

grp      ord   val  isstart
-------- ----- ---- --------
Group A  1002  Y    1
Group A  1003  Y    0
Group A  1005  Y    0
Group A  1007  N    1
Group A  1011  N    0
Group A  1013  N    0
Group A  1017  Y    1
Group A  1019  Y    0
Group A  1023  N    1
Group A  1029  N    0
Group B  1001  X    1
Group B  1002  X    0
Group B  1003  Z    1
Group B  1005  Z    0
Group B  1008  Z    0
Group B  1013  Z    0
Group B  1021  Y    1
Group B  1034  Y    0

The magic as far as island detection is concerned happens in the CTE C2. The query defining it computes an island identifier using the SUM window function (also partitioned by grp and ordered by ord) applied to the isstart column. The result column with the island identifier is called island. Within each partition, you get 1 for the first island, 2 for the second island, and so on. So, the combination of columns grp and island is an island identifier, which you can use in islands tasks that involve grouping by island when relevant.

Following is the output of the query defining C2:

grp      ord   val  isstart  island
-------- ----- ---- -------- -------
Group A  1002  Y    1        1
Group A  1003  Y    0        1
Group A  1005  Y    0        1
Group A  1007  N    1        2
Group A  1011  N    0        2
Group A  1013  N    0        2
Group A  1017  Y    1        3
Group A  1019  Y    0        3
Group A  1023  N    1        4
Group A  1029  N    0        4
Group B  1001  X    1        1
Group B  1002  X    0        1
Group B  1003  Z    1        2
Group B  1005  Z    0        2
Group B  1008  Z    0        2
Group B  1013  Z    0        2
Group B  1021  Y    1        3
Group B  1034  Y    0        3

Lastly, the outer query computes the desired result column seqno with a ROW_NUMBER function, partitioned by grp and island, and ordered by ord. Notice this combination of partitioning and ordering elements is different from the one used by the previous window functions. Whereas the computation of the first two window functions can potentially rely on index order, the last can’t.
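The whole chain (LAG-based isstart, running SUM producing the island identifier, and the final per-island row number) can be sketched in a few lines of Python (my own illustration; group names shortened to 'A'/'B'):

```python
rows = [  # (grp, ord, val), presorted by (grp, ord)
    ('A', 1002, 'Y'), ('A', 1003, 'Y'), ('A', 1005, 'Y'),
    ('A', 1007, 'N'), ('A', 1011, 'N'), ('A', 1013, 'N'),
    ('A', 1017, 'Y'), ('A', 1019, 'Y'), ('A', 1023, 'N'), ('A', 1029, 'N'),
    ('B', 1001, 'X'), ('B', 1002, 'X'), ('B', 1003, 'Z'), ('B', 1005, 'Z'),
    ('B', 1008, 'Z'), ('B', 1013, 'Z'), ('B', 1021, 'Y'), ('B', 1034, 'Y'),
]

def known_solution_2(rows):
    out, counts = [], {}
    prev_grp = prev_val = None
    island = 0
    for grp, ordv, val in rows:
        if grp != prev_grp:                    # new partition: reset state
            prev_val, island = None, 0
        island += 0 if val == prev_val else 1  # running SUM of isstart
        # ROW_NUMBER() partitioned by (grp, island):
        counts[(grp, island)] = counts.get((grp, island), 0) + 1
        out.append((grp, ordv, val, counts[(grp, island)]))
        prev_grp, prev_val = grp, val
    return out

seqnos = [r[3] for r in known_solution_2(rows)]
```

The first two steps fold into one ordered pass here, which is precisely what the index supports; it's only the final (grp, island) row number, with its different partitioning, that forces the Sort operator in the SQL Server plan.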

Figure 3 has the default plan that I got for this solution.

Figure 3: Parallel plan for Known Solution 2

As you can see in the plan, the computation of the first two window functions uses the no sort + parallel row mode strategy, and the computation of the last uses the sort + parallel batch mode strategy.

The run times I got for this solution ranged from 2.57 seconds against 1M rows to 46.2 seconds against 10M rows.

When forcing serial processing, I got the plan shown in Figure 4.

Figure 4: Serial plan for Known Solution 2

As expected, this time all window function computations rely on the batch mode strategy. The first two without preceding sorting, and the last with. Both the parallel plan and the serial one involved one explicit Sort operator. The serial plan performed better than the parallel plan on my machine with the input sizes I tested. The run times I got for the forced serial plan ranged from 1.75 seconds against 1M rows to 21.7 seconds against 10M rows.

Solution based on new technique

When Erland introduced this challenge in a private forum, people were skeptical of the possibility of solving it with a query that had been optimized with a plan without explicit sorting. That’s all I needed to hear to motivate me. So, here’s what I came up with:

WITH C1 AS
(
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY grp ORDER BY ord) AS rn,
    LAG(val) OVER(PARTITION BY grp ORDER BY ord) AS prev 
  FROM dbo.T1
),
C2 AS
(
  SELECT *,
    MAX(CASE WHEN val = prev THEN NULL ELSE rn END)
      OVER(PARTITION BY grp ORDER BY ord ROWS UNBOUNDED PRECEDING) AS firstrn
  FROM C1
)
SELECT grp, ord, val, rn - firstrn + 1 AS seqno
FROM C2;

The solution uses three window functions: LAG, ROW_NUMBER and MAX. The important thing here is all three are based on grp partitioning and ord ordering, which is aligned with the supporting index key order.

The query defining the CTE C1 uses the ROW_NUMBER function to compute row numbers (rn column), and the LAG function to return the previous val value (prev column).

Here’s the output of the query defining C1:

grp      ord   val  rn  prev
-------- ----- ---- --- -----
Group A  1002  Y    1   NULL
Group A  1003  Y    2   Y
Group A  1005  Y    3   Y
Group A  1007  N    4   Y
Group A  1011  N    5   N
Group A  1013  N    6   N
Group A  1017  Y    7   N
Group A  1019  Y    8   Y
Group A  1023  N    9   Y
Group A  1029  N    10  N
Group B  1001  X    1   NULL
Group B  1002  X    2   X
Group B  1003  Z    3   X
Group B  1005  Z    4   Z
Group B  1008  Z    5   Z
Group B  1013  Z    6   Z
Group B  1021  Y    7   Z
Group B  1034  Y    8   Y

Notice that when val and prev are the same, the row isn’t the first in its island; otherwise, it is.

Based on this logic, the query defining the CTE C2 uses a CASE expression that returns rn when the row is the first in an island and NULL otherwise. The code then applies the MAX window function to the result of the CASE expression, returning the first rn of the island (firstrn column).

Here’s the output of the query defining C2, including the output of the CASE expression:

grp      ord   val  rn  prev  CASE  firstrn
-------- ----- ---- --- ----- ----- --------
Group A  1002  Y    1   NULL  1     1
Group A  1003  Y    2   Y     NULL  1
Group A  1005  Y    3   Y     NULL  1
Group A  1007  N    4   Y     4     4
Group A  1011  N    5   N     NULL  4
Group A  1013  N    6   N     NULL  4
Group A  1017  Y    7   N     7     7
Group A  1019  Y    8   Y     NULL  7
Group A  1023  N    9   Y     9     9
Group A  1029  N    10  N     NULL  9
Group B  1001  X    1   NULL  1     1
Group B  1002  X    2   X     NULL  1
Group B  1003  Z    3   X     3     3
Group B  1005  Z    4   Z     NULL  3
Group B  1008  Z    5   Z     NULL  3
Group B  1013  Z    6   Z     NULL  3
Group B  1021  Y    7   Z     7     7
Group B  1034  Y    8   Y     NULL  7

What’s left for the outer query is to compute the desired result column seqno as rn minus firstrn plus 1.
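Because all three window functions share the same grp partitioning and ord ordering, the whole computation collapses into a single ordered pass. Here's a minimal Python sketch of the same logic (my own illustration; group names shortened to 'A'/'B'); since rn only ever increases, the most recent island-start rn is also the running maximum of the CASE expression, which is why MAX with ROWS UNBOUNDED PRECEDING retrieves it:

```python
rows = [  # (grp, ord, val), presorted by (grp, ord)
    ('A', 1002, 'Y'), ('A', 1003, 'Y'), ('A', 1005, 'Y'),
    ('A', 1007, 'N'), ('A', 1011, 'N'), ('A', 1013, 'N'),
    ('A', 1017, 'Y'), ('A', 1019, 'Y'), ('A', 1023, 'N'), ('A', 1029, 'N'),
    ('B', 1001, 'X'), ('B', 1002, 'X'), ('B', 1003, 'Z'), ('B', 1005, 'Z'),
    ('B', 1008, 'Z'), ('B', 1013, 'Z'), ('B', 1021, 'Y'), ('B', 1034, 'Y'),
]

def new_solution(rows):
    out = []
    prev_grp = None
    for grp, ordv, val in rows:
        if grp != prev_grp:          # new partition: reset state
            rn, prev_val, firstrn = 0, None, None
        rn += 1                      # ROW_NUMBER()
        if val != prev_val:          # LAG comparison: island start
            firstrn = rn             # running MAX of the CASE result
        out.append((grp, ordv, val, rn - firstrn + 1))
        prev_grp, prev_val = grp, val
    return out

seqnos = [r[3] for r in new_solution(rows)]
```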

Figure 5 has the default parallel plan I got for this solution when testing it using a SELECT INTO statement, writing the result into a temporary table.

Figure 5: Parallel plan for a new solution

There are no explicit Sort operators in this plan. However, all three window functions are computed using the no sort + parallel row mode strategy, so we’re missing the benefits of batch processing. The run times I got for this solution with the parallel plan ranged from 2.47 seconds against 1M rows to 41.4 seconds against 10M rows.

Here, there’s quite a high likelihood for a serial plan with batch processing to do significantly better, especially when the machine doesn’t have many CPUs. Recall I’m testing my solutions on a machine with 4 logical CPUs. Figure 6 has the plan I got for this solution when forcing serial processing.

Figure 6: Serial plan for a new solution

All three window functions use the no sort + serial batch mode strategy, and the results are quite impressive. This solution’s run times ranged from 0.5 seconds against 1M rows to 5.49 seconds against 10M rows. What’s also curious about this solution is that when testing it as a normal SELECT statement (with results discarded) as opposed to a SELECT INTO statement, SQL Server chose the serial plan by default. With the other two solutions, I got a parallel plan by default both with SELECT and with SELECT INTO.

See the next section for the complete performance test results.

Performance comparison

Here’s the code I used to test the three solutions, of course uncommenting the MAXDOP 1 hint to test the serial plans:

-- Test Known Solution 1

DROP TABLE IF EXISTS #Result;

WITH C AS
(
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY grp ORDER BY ord) -
      ROW_NUMBER() OVER(PARTITION BY grp, val ORDER BY ord) AS island
  FROM dbo.T1
)
SELECT grp, ord, val, 
  ROW_NUMBER() OVER(PARTITION BY grp, val, island ORDER BY ord) AS seqno
INTO #Result
FROM C
/* OPTION(MAXDOP 1) */; -- uncomment for serial test
GO

-- Test Known Solution 2

DROP TABLE IF EXISTS #Result;

WITH C1 AS
(
  SELECT *,
    CASE
      WHEN val = LAG(val) OVER(PARTITION BY grp ORDER BY ord) THEN 0
      ELSE 1
    END AS isstart
  FROM dbo.T1
),
C2 AS
(
  SELECT *,
    SUM(isstart) OVER(PARTITION BY grp ORDER BY ord ROWS UNBOUNDED PRECEDING) AS island
  FROM C1
)
SELECT grp, ord, val, 
  ROW_NUMBER() OVER(PARTITION BY grp, island ORDER BY ord) AS seqno
INTO #Result
FROM C2
/* OPTION(MAXDOP 1) */;
GO

-- Test New Solution

DROP TABLE IF EXISTS #Result;

WITH C1 AS
(
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY grp ORDER BY ord) AS rn,
    LAG(val) OVER(PARTITION BY grp ORDER BY ord) AS prev 
  FROM dbo.T1
),
C2 AS
(
  SELECT *,
    MAX(CASE WHEN val = prev THEN NULL ELSE rn END)
      OVER(PARTITION BY grp ORDER BY ord ROWS UNBOUNDED PRECEDING) AS firstrn
  FROM C1
)
SELECT grp, ord, val, rn - firstrn + 1 AS seqno
INTO #Result
FROM C2
/* OPTION(MAXDOP 1) */;

Figure 7 has the run times of both the parallel and serial plans for all solutions against different input sizes.

Figure 7: Performance comparison

The new solution, using serial mode, is the clear winner. It has great performance, linear scaling, and immediate response time.

Conclusion

Islands tasks are quite common in real life. Many of them involve both identifying islands and grouping the data by the island. Erland’s islands challenge, which was the focus of this article, is a bit more unique because it doesn’t involve grouping but instead sequencing each island’s rows with row numbers.

I presented two solutions based on known techniques for identifying islands. The problem with both is they result in plans involving explicit sorting, which negatively affects the performance, scalability, and response time of the solutions. I also presented a new technique that resulted in a plan with no sorting at all. The serial plan for this solution, which uses a no sort + serial batch mode strategy, has excellent performance, linear scaling, and immediate response time. It’s unfortunate, at least for now, we can’t have a no sort + parallel batch mode strategy for handling window functions.

Matching Supply With Demand — Solutions, Part 3 https://sqlperformance.com/2022/03/t-sql-queries/supply-demand-solutions-3 https://sqlperformance.com/2022/03/t-sql-queries/supply-demand-solutions-3#comments Wed, 09 Mar 2022 09:00:23 +0000 https://sqlperformance.com/?p=11313 Itzik Ben-Gan looks at additional interesting solutions to his supply and demand T-SQL challenge.

The post Matching Supply With Demand — Solutions, Part 3 appeared first on SQLPerformance.com.

[ Jump to: Original challenge | Solutions: Part 1 | Part 2 | Part 3 ]

In this article, I continue exploring solutions to the matching supply with demand challenge. Thanks to Luca, Kamil Kosno, Daniel Brown, Brian Walker, Joe Obbish, Rainer Hoffmann, Paul White, Charlie, Ian, and Peter Larsson, for sending your solutions.

Last month I covered solutions based on a revised interval intersections approach compared to the classic one. The fastest of those solutions combined ideas from Kamil, Luca, and Daniel. It unified two queries with disjoint sargable predicates. It took the solution 1.34 seconds to complete against a 400K-row input. That’s not too shabby considering the solution based on the classic interval intersections approach took 931 seconds to complete against the same input. Also recall Joe came up with a brilliant solution that relies on the classic interval intersection approach but optimizes the matching logic by bucketizing intervals based on the largest interval length. With the same 400K-row input, it took Joe’s solution 0.9 seconds to complete. The tricky part about this solution is its performance degrades as the largest interval length increases.

This month I explore fascinating solutions that are faster than the Kamil/Luca/Daniel Revised Intersections solution and are neutral to interval length. The solutions in this article were created by Brian Walker, Ian, Peter Larsson, Paul White, and me.

I tested all solutions in this article against the Auctions input table with 100K, 200K, 300K, and 400K rows, and with the following indexes:

-- Index to support solution

CREATE UNIQUE NONCLUSTERED INDEX idx_Code_ID_i_Quantity
  ON dbo.Auctions(Code, ID)
  INCLUDE(Quantity);

-- Enable batch-mode Window Aggregate

CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs
  ON dbo.Auctions(ID)
  WHERE ID = -1 AND ID = -2;

When describing the logic behind the solutions, I’ll assume the Auctions table is populated with the following small set of sample data:

ID          Code Quantity
----------- ---- ---------
1           D    5.000000
2           D    3.000000
3           D    8.000000
5           D    2.000000
6           D    8.000000
7           D    4.000000
8           D    2.000000
1000        S    8.000000
2000        S    6.000000
3000        S    2.000000
4000        S    2.000000
5000        S    4.000000
6000        S    3.000000
7000        S    2.000000

Following is the desired result for this sample data:

DemandID    SupplyID    TradeQuantity
----------- ----------- --------------
1           1000        5.000000
2           1000        3.000000
3           2000        6.000000
3           3000        2.000000
5           4000        2.000000
6           5000        4.000000
6           6000        3.000000
6           7000        1.000000
7           7000        1.000000

 

Brian Walker’s Solution

Outer joins are fairly commonly used in SQL querying solutions, but when you do use them, they’re almost always single-sided ones. When teaching about outer joins, I often get questions asking for examples of practical use cases for full outer joins, and there aren’t that many. Brian’s solution is a beautiful example of a practical use case for a full outer join.

Here’s Brian’s complete solution code:

DROP TABLE IF EXISTS #MyPairings;

CREATE TABLE #MyPairings
( 
  DemandID       INT            NOT NULL, 
  SupplyID       INT            NOT NULL, 
  TradeQuantity  DECIMAL(19,06) NOT NULL
);

WITH D AS
(
  SELECT A.ID,
    SUM(A.Quantity) OVER (PARTITION BY A.Code 
                          ORDER BY A.ID ROWS UNBOUNDED PRECEDING) AS Running
  FROM dbo.Auctions AS A
  WHERE A.Code = 'D'
),
S AS
(
  SELECT A.ID, 
    SUM(A.Quantity) OVER (PARTITION BY A.Code 
                          ORDER BY A.ID ROWS UNBOUNDED PRECEDING) AS Running
  FROM dbo.Auctions AS A
  WHERE A.Code = 'S'
),
W AS
(
  SELECT D.ID AS DemandID, S.ID AS SupplyID, ISNULL(D.Running, S.Running) AS Running
  FROM D
    FULL JOIN S
      ON D.Running = S.Running
),
Z AS
(
  SELECT 
    CASE 
      WHEN W.DemandID IS NOT NULL THEN W.DemandID 
      ELSE MIN(W.DemandID) OVER (ORDER BY W.Running 
                                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
    END AS DemandID,
    CASE
      WHEN W.SupplyID IS NOT NULL THEN W.SupplyID 
      ELSE MIN(W.SupplyID) OVER (ORDER BY W.Running 
                                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) 
    END AS SupplyID,
    W.Running
  FROM W
)
INSERT #MyPairings( DemandID, SupplyID, TradeQuantity )
  SELECT Z.DemandID, Z.SupplyID,
    Z.Running - ISNULL(LAG(Z.Running) OVER (ORDER BY Z.Running), 0.0) AS TradeQuantity
  FROM Z
  WHERE Z.DemandID IS NOT NULL
    AND Z.SupplyID IS NOT NULL;

I revised Brian’s solution, replacing his original derived tables with CTEs for clarity.

The CTE D computes running total demand quantities in a result column called D.Running, and the CTE S computes running total supply quantities in a result column called S.Running. The CTE W then performs a full outer join between D and S, matching D.Running with S.Running, and returns the first non-NULL among D.Running and S.Running as W.Running. Here’s the result you get if you query all rows from W ordered by Running:

DemandID    SupplyID    Running
----------- ----------- ----------
1           NULL         5.000000
2           1000         8.000000
NULL        2000        14.000000
3           3000        16.000000
5           4000        18.000000
NULL        5000        22.000000
NULL        6000        25.000000
6           NULL        26.000000
NULL        7000        27.000000
7           NULL        30.000000
8           NULL        32.000000 

The idea to use a full outer join based on a predicate that compares the demand and supply running totals is a stroke of genius! Most rows in this result (the first 9 in our case) represent result pairings, with a bit of extra computation still missing. Trailing rows with NULL IDs of one kind represent entries that cannot be matched. In our case, the last two rows represent demand entries that cannot be matched with supply entries. So, what’s left here is to compute the DemandID, SupplyID, and TradeQuantity of the result pairings, and to filter out the entries that cannot be matched.

The logic that computes the result DemandID and SupplyID is done in the CTE Z as follows (assuming ordering in W by Running):

  • DemandID: if DemandID is not NULL then DemandID, otherwise the minimum DemandID starting with the current row
  • SupplyID: if SupplyID is not NULL then SupplyID, otherwise the minimum SupplyID starting with the current row

Here’s the result you get if you query Z and order the rows by Running:

DemandID    SupplyID    Running
----------- ----------- ----------
1           1000         5.000000
2           1000         8.000000
3           2000        14.000000
3           3000        16.000000
5           4000        18.000000
6           5000        22.000000
6           6000        25.000000
6           7000        26.000000
7           7000        27.000000
7           NULL        30.000000
8           NULL        32.000000

The outer query filters out rows from Z representing entries of one kind that cannot be matched by entries of the other kind by ensuring both DemandID and SupplyID are not NULL. The result pairings’ trade quantity is computed as the current Running value minus the previous Running value using a LAG function.

Here’s what gets written to the #MyPairings table, which is the desired result:

DemandID    SupplyID    TradeQuantity
----------- ----------- --------------
1           1000        5.000000
2           1000        3.000000
3           2000        6.000000
3           3000        2.000000
5           4000        2.000000
6           5000        4.000000
6           6000        3.000000
6           7000        1.000000
7           7000        1.000000
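
Before looking at the plan, it may help to see the pairing semantics all of these set-based solutions implement spelled out procedurally. The following Python sketch is an illustration only (it is not part of Brian’s solution); it walks the article’s sample demand and supply entries with two pointers and trades the minimum of the remaining quantities at each step:

```python
def match_supply_demand(demands, supplies):
    """Walk both (id, quantity) lists in ID order and, at each step,
    trade min(remaining demand, remaining supply)."""
    pairings, d, s, d_rem, s_rem = [], 0, 0, 0, 0
    while d < len(demands) and s < len(supplies):
        d_rem = d_rem or demands[d][1]   # load next demand quantity when exhausted
        s_rem = s_rem or supplies[s][1]  # load next supply quantity when exhausted
        qty = min(d_rem, s_rem)
        pairings.append((demands[d][0], supplies[s][0], qty))
        d_rem -= qty
        s_rem -= qty
        if d_rem == 0:
            d += 1
        if s_rem == 0:
            s += 1
    return pairings

# The sample Auctions data from the article:
demands  = [(1, 5), (2, 3), (3, 8), (5, 2), (6, 8), (7, 4), (8, 2)]
supplies = [(1000, 8), (2000, 6), (3000, 2), (4000, 2), (5000, 4), (6000, 3), (7000, 2)]
for row in match_supply_demand(demands, supplies):
    print(row)
```

Running this yields exactly the nine (DemandID, SupplyID, TradeQuantity) rows shown above.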

The plan for this solution is shown in Figure 1.

Figure 1: Query plan for Brian’s solution

The top two branches of the plan compute the demand and supply running totals using a batch-mode Window Aggregate operator, each after retrieving the respective entries from the supporting index. As I explained in earlier articles in the series, since the index already has the rows ordered the way the Window Aggregate operators need them, you would think explicit Sort operators shouldn’t be required. But SQL Server doesn’t currently support an efficient combination of a parallel order-preserving index operation prior to a parallel batch-mode Window Aggregate operator, so an explicit parallel Sort operator precedes each of the parallel Window Aggregate operators.

The plan uses a batch-mode hash join to handle the full outer join. The plan also uses two additional batch-mode Window Aggregate operators preceded by explicit Sort operators to compute the MIN and LAG window functions.

That’s a pretty efficient plan!

Here are the run times I got for this solution against input sizes ranging from 100K to 400K rows, in seconds:

100K: 0.368
200K: 0.845
300K: 1.255
400K: 1.315

 

Itzik’s Solution

The next solution for the challenge is one of my attempts at solving it. Here’s the complete solution code:

DROP TABLE IF EXISTS #MyPairings;

WITH C1 AS
(
  SELECT *,
    SUM(Quantity)
      OVER(PARTITION BY Code 
           ORDER BY ID 
           ROWS UNBOUNDED PRECEDING) AS SumQty
  FROM dbo.Auctions
),
C2 AS
(
  SELECT *,
    SUM(Quantity * CASE Code WHEN 'D' THEN -1 WHEN 'S' THEN 1 END)
      OVER(ORDER BY SumQty, Code 
           ROWS UNBOUNDED PRECEDING) AS StockLevel,
    LAG(SumQty, 1, 0.0) OVER(ORDER BY SumQty, Code) AS PrevSumQty,
    MAX(CASE WHEN Code = 'D' THEN ID END)
      OVER(ORDER BY SumQty, Code 
           ROWS UNBOUNDED PRECEDING) AS PrevDemandID,
    MAX(CASE WHEN Code = 'S' THEN ID END)
      OVER(ORDER BY SumQty, Code 
           ROWS UNBOUNDED PRECEDING) AS PrevSupplyID,
    MIN(CASE WHEN Code = 'D' THEN ID END)
      OVER(ORDER BY SumQty, Code 
           ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS NextDemandID,
    MIN(CASE WHEN Code = 'S' THEN ID END)
      OVER(ORDER BY SumQty, Code 
           ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS NextSupplyID
  FROM C1
),
C3 AS
(
  SELECT *,
    CASE Code
      WHEN 'D' THEN ID
      WHEN 'S' THEN
        CASE WHEN StockLevel > 0 THEN NextDemandID ELSE PrevDemandID END
    END AS DemandID,
    CASE Code
      WHEN 'S' THEN ID
      WHEN 'D' THEN
        CASE WHEN StockLevel <= 0 THEN NextSupplyID ELSE PrevSupplyID END
    END AS SupplyID,
    SumQty - PrevSumQty AS TradeQuantity
  FROM C2
)
SELECT DemandID, SupplyID, TradeQuantity
INTO #MyPairings
FROM C3
WHERE TradeQuantity > 0.0
  AND DemandID IS NOT NULL
  AND SupplyID IS NOT NULL;

The CTE C1 queries the Auctions table and uses a window function to compute running total demand and supply quantities, calling the result column SumQty.

The CTE C2 queries C1, and computes a number of attributes using window functions based on SumQty and Code ordering:

  • StockLevel: The current stock level after processing each entry. This is computed by assigning a negative sign to demand quantities and a positive sign to supply quantities and summing those quantities up to and including the current entry.
  • PrevSumQty: Previous row’s SumQty value.
  • PrevDemandID: Last non-NULL demand ID.
  • PrevSupplyID: Last non-NULL supply ID.
  • NextDemandID: Next non-NULL demand ID.
  • NextSupplyID: Next non-NULL supply ID.

Here’s the contents of C2 ordered by SumQty and Code:

ID    Code Quantity  SumQty     StockLevel  PrevSumQty  PrevDemandID PrevSupplyID NextDemandID NextSupplyID
----- ---- --------- ---------- ----------- ----------- ------------ ------------ ------------ ------------
1     D    5.000000   5.000000  -5.000000    0.000000   1            NULL         1            1000
2     D    3.000000   8.000000  -8.000000    5.000000   2            NULL         2            1000
1000  S    8.000000   8.000000   0.000000    8.000000   2            1000         3            1000
2000  S    6.000000  14.000000   6.000000    8.000000   2            2000         3            2000
3     D    8.000000  16.000000  -2.000000   14.000000   3            2000         3            3000
3000  S    2.000000  16.000000   0.000000   16.000000   3            3000         5            3000
5     D    2.000000  18.000000  -2.000000   16.000000   5            3000         5            4000
4000  S    2.000000  18.000000   0.000000   18.000000   5            4000         6            4000
5000  S    4.000000  22.000000   4.000000   18.000000   5            5000         6            5000
6000  S    3.000000  25.000000   7.000000   22.000000   5            6000         6            6000
6     D    8.000000  26.000000  -1.000000   25.000000   6            6000         6            7000
7000  S    2.000000  27.000000   1.000000   26.000000   6            7000         7            7000
7     D    4.000000  30.000000  -3.000000   27.000000   7            7000         7            NULL
8     D    2.000000  32.000000  -5.000000   30.000000   8            7000         8            NULL

The CTE C3 queries C2 and computes the result pairings’ DemandID, SupplyID and TradeQuantity, before removing some superfluous rows.

The result C3.DemandID column is computed like so:

  • If the current entry is a demand entry, return DemandID.
  • If the current entry is a supply entry and the current stock level is positive, return NextDemandID.
  • If the current entry is a supply entry and the current stock level is nonpositive, return PrevDemandID.

The result C3.SupplyID column is computed like so:

  • If the current entry is a supply entry, return SupplyID.
  • If the current entry is a demand entry and the current stock level is nonpositive, return NextSupplyID.
  • If the current entry is a demand entry and the current stock level is positive, return PrevSupplyID.

The result TradeQuantity is computed as the current row’s SumQty minus PrevSumQty.

Here are the contents of the columns relevant to the result from C3:

DemandID    SupplyID    TradeQuantity
----------- ----------- --------------
1           1000        5.000000
2           1000        3.000000
2           1000        0.000000
3           2000        6.000000
3           3000        2.000000
3           3000        0.000000
5           4000        2.000000
5           4000        0.000000
6           5000        4.000000
6           6000        3.000000
6           7000        1.000000
7           7000        1.000000
7           NULL        3.000000
8           NULL        2.000000

What’s left for the outer query to do is to filter out superfluous rows from C3. Those include two cases:

  • When the running totals of both kinds are the same, the supply entry has a zero trading quantity. Remember the ordering is based on SumQty and Code, so when the running totals are the same, the demand entry appears before the supply entry.
  • Trailing entries of one kind that cannot be matched with entries of the other kind. Such entries are represented by rows in C3 where either the DemandID is NULL or the SupplyID is NULL.

The plan for this solution is shown in Figure 2.

Figure 2: Query plan for Itzik’s solution

The plan applies one pass over the input data and uses four batch-mode Window Aggregate operators to handle the various windowed computations. All Window Aggregate operators are preceded by explicit Sort operators, although only two of those are unavoidable here. The other two have to do with the current implementation of the parallel batch-mode Window Aggregate operator, which cannot rely on a parallel order-preserving input. A simple way to see which Sort operators are due to this reason is to force a serial plan and see which Sort operators disappear. When I force a serial plan with this solution, the first and third Sort operators disappear.

Here are the run times in seconds that I got for this solution:

100K: 0.246
200K: 0.427
300K: 0.616
400K: 0.841

 

Ian’s Solution

Ian’s solution is short and efficient. Here’s the complete solution code:

DROP TABLE IF EXISTS #MyPairings;

WITH A AS (
  SELECT *,
    SUM(Quantity) OVER (PARTITION BY Code 
                        ORDER BY ID 
                        ROWS UNBOUNDED PRECEDING) AS CumulativeQuantity
  FROM dbo.Auctions
), B AS (
  SELECT CumulativeQuantity,
    CumulativeQuantity 
      - LAG(CumulativeQuantity, 1, 0) 
          OVER (ORDER BY CumulativeQuantity) AS TradeQuantity,
    MAX(CASE WHEN Code = 'D' THEN ID END) AS DemandID,
    MAX(CASE WHEN Code = 'S' THEN ID END) AS SupplyID
  FROM A
  GROUP BY CumulativeQuantity, ID/ID -- bogus grouping to improve row estimate 
                                     -- (rows count of Auctions instead of 2 rows)
), C AS (
  -- fill in NULLs with next supply / demand
  -- FIRST_VALUE(ID) IGNORE NULLS OVER ... 
  -- would be better, but this will work because the IDs are in CumulativeQuantity order
  SELECT
    MIN(DemandID) OVER (ORDER BY CumulativeQuantity 
                        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS DemandID,
    MIN(SupplyID) OVER (ORDER BY CumulativeQuantity 
                        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS SupplyID,
    TradeQuantity
  FROM B
)
SELECT DemandID, SupplyID, TradeQuantity
INTO #MyPairings
FROM C
WHERE SupplyID IS NOT NULL  -- trim unfulfilled demands
  AND DemandID IS NOT NULL; -- trim unused supplies

The code in the CTE A queries the Auctions table and computes running total demand and supply quantities using a window function, naming the result column CumulativeQuantity.

The code in the CTE B queries CTE A, and groups the rows by CumulativeQuantity. This grouping achieves a similar effect to Brian’s outer join based on the demand and supply running totals. Ian also added the dummy expression ID/ID to the grouping set to improve on the grouping’s original cardinality underestimation. On my machine, this also resulted in a parallel plan instead of a serial one.

In the SELECT list, the code computes DemandID and SupplyID by retrieving the ID of the respective entry type in the group using the MAX aggregate and a CASE expression. If the ID isn’t present in the group, the result is NULL. The code also computes a result column called TradeQuantity as the current cumulative quantity minus the previous one, retrieved using the LAG window function.

Here are the contents of B:

CumulativeQuantity  TradeQuantity  DemandId  SupplyId
------------------- -------------- --------- ---------
 5.000000           5.000000       1         NULL
 8.000000           3.000000       2         1000
14.000000           6.000000       NULL      2000
16.000000           2.000000       3         3000
18.000000           2.000000       5         4000
22.000000           4.000000       NULL      5000
25.000000           3.000000       NULL      6000
26.000000           1.000000       6         NULL
27.000000           1.000000       NULL      7000
30.000000           3.000000       7         NULL
32.000000           2.000000       8         NULL

The code in the CTE C then queries the CTE B and fills in NULL demand and supply IDs with the next non-NULL demand and supply IDs, using the MIN window function with the frame ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING.
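
This “fill in NULLs with the next non-NULL value” step works because the IDs increase along the CumulativeQuantity ordering, so the minimum over the tail frame is simply the next non-NULL value. Here’s a Python sketch of that step (my illustration, not Ian’s code), applied to the DemandID column of B shown above:

```python
from itertools import accumulate

def fill_with_next(values):
    """Model of MIN(col) OVER (ORDER BY ... ROWS BETWEEN CURRENT ROW AND
    UNBOUNDED FOLLOWING): a running minimum computed over the reversed
    list, with None treated as +infinity."""
    INF = float("inf")
    tail_mins = accumulate((INF if v is None else v for v in reversed(values)), min)
    return [None if v == INF else v for v in reversed(list(tail_mins))]

# The DemandID column of B, in CumulativeQuantity order (NULL -> None):
demand_ids = [1, 2, None, 3, 5, None, None, 6, None, 7, 8]
print(fill_with_next(demand_ids))  # [1, 2, 3, 3, 5, 6, 6, 6, 7, 7, 8]
```

Applied to both ID columns of B, this reproduces the DemandID and SupplyID columns of C below.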

Here are the contents of C:

DemandID  SupplyID  TradeQuantity 
--------- --------- --------------
1         1000      5.000000      
2         1000      3.000000      
3         2000      6.000000      
3         3000      2.000000      
5         4000      2.000000      
6         5000      4.000000      
6         6000      3.000000      
6         7000      1.000000      
7         7000      1.000000      
7         NULL      3.000000      
8         NULL      2.000000

The last step handled by the outer query against C is to remove entries of one kind that cannot be matched with entries of the other kind. That’s done by filtering out rows where either SupplyID is NULL or DemandID is NULL.

The plan for this solution is shown in Figure 3.

Figure 3: Query plan for Ian’s solution

This plan performs one scan of the input data and uses three parallel batch-mode Window Aggregate operators to compute the various window functions, all preceded by parallel Sort operators. Two of those are unavoidable, as you can verify by forcing a serial plan. The plan also uses a Hash Aggregate operator to handle the grouping and aggregation in the CTE B.

Here are the run times in seconds that I got for this solution:

100K: 0.214
200K: 0.363
300K: 0.546
400K: 0.701

 

Peter Larsson’s Solution

Peter Larsson’s solution is amazingly short, sweet, and highly efficient. Here’s Peter’s complete solution code:

DROP TABLE IF EXISTS #MyPairings;

WITH cteSource(ID, Code, RunningQuantity)
AS
(
  SELECT ID, Code,
    SUM(Quantity) OVER (PARTITION BY Code ORDER BY ID) AS RunningQuantity
  FROM dbo.Auctions
)
SELECT DemandID, SupplyID, TradeQuantity
INTO #MyPairings
FROM (
       SELECT
         MIN(CASE WHEN Code = 'D' THEN ID ELSE 2147483648 END)
           OVER (ORDER BY RunningQuantity, Code 
                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS DemandID,
         MIN(CASE WHEN Code = 'S' THEN ID ELSE 2147483648 END) 
           OVER (ORDER BY RunningQuantity, Code 
                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS SupplyID,
         RunningQuantity
           - COALESCE(LAG(RunningQuantity) OVER (ORDER BY RunningQuantity, Code), 0.0)
             AS TradeQuantity
       FROM cteSource
     ) AS d
WHERE DemandID < 2147483648
  AND SupplyID < 2147483648
  AND TradeQuantity > 0.0;

The CTE cteSource queries the Auctions table and uses a window function to compute running total demand and supply quantities, calling the result column RunningQuantity.

The code defining the derived table d queries cteSource and computes the result pairings’ DemandID, SupplyID, and TradeQuantity, before removing some superfluous rows. All window functions used in those calculations are based on RunningQuantity and Code ordering.

The result column d.DemandID is computed as the minimum demand ID starting with the current row, or 2147483648 if none is found.

The result column d.SupplyID is computed as the minimum supply ID starting with the current row, or 2147483648 if none is found.

The result TradeQuantity is computed as the current row’s RunningQuantity value minus the previous row’s RunningQuantity value.

Here are the contents of d:

DemandID  SupplyID    TradeQuantity
--------- ----------- --------------
1         1000        5.000000
2         1000        3.000000
3         1000        0.000000
3         2000        6.000000
3         3000        2.000000
5         3000        0.000000
5         4000        2.000000
6         4000        0.000000
6         5000        4.000000
6         6000        3.000000
6         7000        1.000000
7         7000        1.000000
7         2147483648  3.000000
8         2147483648  2.000000

What’s left for the outer query to do is to filter out superfluous rows from d. Those are cases where the trading quantity is zero, or entries of one kind that cannot be matched with entries from the other kind (with ID 2147483648).

The plan for this solution is shown in Figure 4.

Figure 4: Query plan for Peter’s solution

Like Ian’s plan, Peter’s plan relies on one scan of the input data and uses three parallel batch-mode Window Aggregate operators to compute the various window functions, all preceded by parallel Sort operators. Two of those are unavoidable, as you can verify by forcing a serial plan. In Peter’s plan, there’s no need for a grouped aggregate operator like in Ian’s plan.

Peter’s critical insight that allowed for such a short solution was the realization that for a relevant entry of either kind, the matching ID of the other kind will always appear later (based on RunningQuantity and Code ordering). After seeing Peter’s solution, it sure felt like I had overcomplicated things in mine!
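
Peter’s windowed logic is compact enough to model procedurally. Here’s my Python sketch of it (an illustration, not Peter’s code), applied to the sample data’s (ID, Code, RunningQuantity) rows pre-sorted by (RunningQuantity, Code):

```python
SENTINEL = 2147483648  # the same "no ID found" stand-in Peter's query uses

def peter_pairings(rows):
    """rows: (id, code, running) tuples sorted by (running, code), 'D' < 'S'.
    For each row, take the minimum demand/supply ID from the current row
    onward plus the delta of running totals, then filter superfluous rows."""
    n = len(rows)
    next_d = [SENTINEL] * (n + 1)  # min demand ID from row i onward
    next_s = [SENTINEL] * (n + 1)  # min supply ID from row i onward
    for i in range(n - 1, -1, -1):
        id_, code, _ = rows[i]
        next_d[i] = min(next_d[i + 1], id_) if code == 'D' else next_d[i + 1]
        next_s[i] = min(next_s[i + 1], id_) if code == 'S' else next_s[i + 1]
    result, prev = [], 0
    for i, (_, _, running) in enumerate(rows):
        qty, prev = running - prev, running
        if qty > 0 and next_d[i] < SENTINEL and next_s[i] < SENTINEL:
            result.append((next_d[i], next_s[i], qty))
    return result

rows = [(1, 'D', 5), (2, 'D', 8), (1000, 'S', 8), (2000, 'S', 14),
        (3, 'D', 16), (3000, 'S', 16), (5, 'D', 18), (4000, 'S', 18),
        (5000, 'S', 22), (6000, 'S', 25), (6, 'D', 26), (7000, 'S', 27),
        (7, 'D', 30), (8, 'D', 32)]
for pairing in peter_pairings(rows):
    print(pairing)
```

This reproduces the nine result pairings, with the zero-quantity and trailing unmatched rows filtered out just as the outer query does.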

Here are the run times in seconds I got for this solution:

100K: 0.197
200K: 0.412
300K: 0.558
400K: 0.696

 

Paul White’s Solution

Our last solution was created by Paul White. Here’s the complete solution code:

DROP TABLE IF EXISTS #MyPairings;

CREATE TABLE #MyPairings
(
  DemandID integer NOT NULL,
  SupplyID integer NOT NULL,
  TradeQuantity decimal(19, 6) NOT NULL
);
GO

INSERT #MyPairings 
    WITH (TABLOCK)
(
    DemandID,
    SupplyID,
    TradeQuantity
)
SELECT 
    Q3.DemandID,
    Q3.SupplyID,
    Q3.TradeQuantity
FROM 
(
    SELECT
        Q2.DemandID,
        Q2.SupplyID,
        TradeQuantity =
            -- Interval overlap
            CASE
                WHEN Q2.Code = 'S' THEN
                    CASE
                        WHEN Q2.CumDemand >= Q2.IntEnd THEN Q2.IntLength
                        WHEN Q2.CumDemand > Q2.IntStart THEN Q2.CumDemand - Q2.IntStart
                        ELSE 0.0
                    END
                WHEN Q2.Code = 'D' THEN
                    CASE
                        WHEN Q2.CumSupply >= Q2.IntEnd THEN Q2.IntLength
                        WHEN Q2.CumSupply > Q2.IntStart THEN Q2.CumSupply - Q2.IntStart
                        ELSE 0.0
                    END
            END
    FROM
    (
        SELECT 
            Q1.Code, 
            Q1.IntStart, 
            Q1.IntEnd, 
            Q1.IntLength, 
            DemandID = MAX(IIF(Q1.Code = 'D', Q1.ID, 0)) OVER (
                    ORDER BY Q1.IntStart, Q1.ID 
                    ROWS UNBOUNDED PRECEDING),
            SupplyID = MAX(IIF(Q1.Code = 'S', Q1.ID, 0)) OVER (
                    ORDER BY Q1.IntStart, Q1.ID 
                    ROWS UNBOUNDED PRECEDING),
            CumSupply = SUM(IIF(Q1.Code = 'S', Q1.IntLength, 0)) OVER (
                    ORDER BY Q1.IntStart, Q1.ID 
                    ROWS UNBOUNDED PRECEDING),
            CumDemand = SUM(IIF(Q1.Code = 'D', Q1.IntLength, 0)) OVER (
                    ORDER BY Q1.IntStart, Q1.ID 
                    ROWS UNBOUNDED PRECEDING)
        FROM 
        (
            -- Demand intervals
            SELECT 
                A.ID, 
                A.Code, 
                IntStart = SUM(A.Quantity) OVER (
                    ORDER BY A.ID 
                    ROWS UNBOUNDED PRECEDING) - A.Quantity,
                IntEnd = SUM(A.Quantity) OVER (
                    ORDER BY A.ID 
                    ROWS UNBOUNDED PRECEDING),
                IntLength = A.Quantity
            FROM dbo.Auctions AS A
            WHERE 
                A.Code = 'D'

            UNION ALL 
 
            -- Supply intervals
            SELECT 
                A.ID, 
                A.Code, 
                IntStart = SUM(A.Quantity) OVER (
                    ORDER BY A.ID 
                    ROWS UNBOUNDED PRECEDING) - A.Quantity,
                IntEnd = SUM(A.Quantity) OVER (
                    ORDER BY A.ID 
                    ROWS UNBOUNDED PRECEDING),
                IntLength = A.Quantity
            FROM dbo.Auctions AS A
            WHERE 
                A.Code = 'S'
        ) AS Q1
    ) AS Q2
) AS Q3
WHERE
    Q3.TradeQuantity > 0;

The code defining the derived table Q1 uses two separate queries to compute demand and supply intervals based on running totals and unifies the two. For each interval, the code computes its start (IntStart), end (IntEnd), and length (IntLength). Here are the contents of Q1 ordered by IntStart and ID:

ID    Code IntStart   IntEnd     IntLength
----- ---- ---------- ---------- ----------
1     D     0.000000   5.000000  5.000000
1000  S     0.000000   8.000000  8.000000
2     D     5.000000   8.000000  3.000000
3     D     8.000000  16.000000  8.000000
2000  S     8.000000  14.000000  6.000000
3000  S    14.000000  16.000000  2.000000
5     D    16.000000  18.000000  2.000000
4000  S    16.000000  18.000000  2.000000
6     D    18.000000  26.000000  8.000000
5000  S    18.000000  22.000000  4.000000
6000  S    22.000000  25.000000  3.000000
7000  S    25.000000  27.000000  2.000000
7     D    26.000000  30.000000  4.000000
8     D    30.000000  32.000000  2.000000

The code defining the derived table Q2 queries Q1 and computes result columns called DemandID, SupplyID, CumSupply, and CumDemand. All window functions used by this code are based on IntStart and ID ordering and the frame ROWS UNBOUNDED PRECEDING (all rows up to the current).

DemandID is the maximum demand ID up to the current row, or 0 if none exists.

SupplyID is the maximum supply ID up to the current row, or 0 if none exists.

CumSupply is the cumulative supply quantities (supply interval lengths) up to the current row.

CumDemand is the cumulative demand quantities (demand interval lengths) up to the current row.

Here are the contents of Q2:

Code IntStart   IntEnd     IntLength  DemandID  SupplyID  CumSupply  CumDemand
---- ---------- ---------- ---------- --------- --------- ---------- ----------
D     0.000000   5.000000  5.000000   1         0          0.000000   5.000000
S     0.000000   8.000000  8.000000   1         1000       8.000000   5.000000
D     5.000000   8.000000  3.000000   2         1000       8.000000   8.000000
D     8.000000  16.000000  8.000000   3         1000       8.000000  16.000000
S     8.000000  14.000000  6.000000   3         2000      14.000000  16.000000
S    14.000000  16.000000  2.000000   3         3000      16.000000  16.000000
D    16.000000  18.000000  2.000000   5         3000      16.000000  18.000000
S    16.000000  18.000000  2.000000   5         4000      18.000000  18.000000
D    18.000000  26.000000  8.000000   6         4000      18.000000  26.000000
S    18.000000  22.000000  4.000000   6         5000      22.000000  26.000000
S    22.000000  25.000000  3.000000   6         6000      25.000000  26.000000
S    25.000000  27.000000  2.000000   6         7000      27.000000  26.000000
D    26.000000  30.000000  4.000000   7         7000      27.000000  30.000000
D    30.000000  32.000000  2.000000   8         7000      27.000000  32.000000

Q2 already has the final result pairings’ correct DemandID and SupplyID values. The code defining the derived table Q3 queries Q2 and computes the result pairings’ TradeQuantity values as the overlapping segments of the demand and supply intervals. Finally, the outer query against Q3 filters only the relevant pairings where TradeQuantity is positive.
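
Paul’s interval-overlap idea can be illustrated with a short Python sketch (mine, for illustration only). The nested loops below are O(n²) for clarity; Paul’s query derives the same overlaps in a single ordered pass:

```python
def to_intervals(entries):
    """Turn (id, quantity) entries, in ID order, into half-open intervals
    [running_total - quantity, running_total) -- Q1's IntStart/IntEnd."""
    out, total = [], 0
    for id_, qty in entries:
        out.append((id_, total, total + qty))
        total += qty
    return out

def paul_pairings(demands, supplies):
    """A demand and a supply are paired exactly when their intervals
    overlap, and the trade quantity is the overlap length."""
    supply_ints = to_intervals(supplies)
    result = []
    for d_id, d_lo, d_hi in to_intervals(demands):
        for s_id, s_lo, s_hi in supply_ints:
            qty = min(d_hi, s_hi) - max(d_lo, s_lo)  # interval overlap
            if qty > 0:
                result.append((d_id, s_id, qty))
    return result

demands  = [(1, 5), (2, 3), (3, 8), (5, 2), (6, 8), (7, 4), (8, 2)]
supplies = [(1000, 8), (2000, 6), (3000, 2), (4000, 2), (5000, 4), (6000, 3), (7000, 2)]
for pairing in paul_pairings(demands, supplies):
    print(pairing)
```

Again, the output is exactly the nine expected pairings.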

The plan for this solution is shown in Figure 5.

Figure 5: Query plan for Paul’s solution

The top two branches of the plan are responsible for computing the demand and supply intervals. Both rely on Index Seek operators to get the relevant rows based on index order, and then use parallel batch-mode Window Aggregate operators, preceded by Sort operators that theoretically could have been avoided. The plan then concatenates the two inputs and sorts the rows by IntStart and ID to support the subsequent Window Aggregate operator. Only this Sort operator is unavoidable in this plan. The rest of the plan handles the needed scalar computations and the final filter. That’s a very efficient plan!

Here are the run times in seconds I got for this solution:

100K: 0.187
200K: 0.331
300K: 0.369
400K: 0.425

These numbers are pretty impressive!

Performance Comparison

Figure 6 has a performance comparison between all solutions covered in this article.

Figure 6: Performance comparison

At this point, we can add the fastest solutions I covered in previous articles. Those are Joe’s and Kamil/Luca/Daniel’s solutions. The complete comparison is shown in Figure 7.

Figure 7: Performance comparison including earlier solutions

These are all impressively fast solutions, with the fastest being Paul’s and Peter’s.

Conclusion

When I originally introduced Peter’s puzzle, I showed a straightforward cursor-based solution that took 11.81 seconds to complete against a 400K-row input. The challenge was to come up with an efficient set-based solution. It’s so inspiring to see all the solutions people sent. I always learn so much from such puzzles both from my own attempts and by analyzing others’ solutions. It’s impressive to see several sub-second solutions, with Paul’s being less than half a second!

It's great to have multiple efficient techniques to handle such a classic need of matching supply with demand. Well done everyone!

[ Jump to: Original challenge | Solutions: Part 1 | Part 2 | Part 3 ]

The post Matching Supply With Demand — Solutions, Part 3 appeared first on SQLPerformance.com.

String Aggregation Over the Years in SQL Server
https://sqlperformance.com/2022/02/t-sql-queries/string-aggregation-over-the-years-in-sql-server
Thu, 17 Feb 2022 09:00:44 +0000

Aaron Bertrand talks about progress in string aggregation—both in the functionality offered by SQL Server and the quality of his own code samples.

The post String Aggregation Over the Years in SQL Server appeared first on SQLPerformance.com.

Since SQL Server 2005, the trick of using FOR XML PATH to denormalize strings and combine them into a single (usually comma-separated) list has been very popular. In SQL Server 2017, however, STRING_AGG() finally answered long-standing and widespread pleas from the community to simulate GROUP_CONCAT() and similar functionality found in other platforms. I recently started modifying many of my Stack Overflow answers using the old method, both to improve the existing code and to add an additional example better suited for modern versions.

I was a little appalled at what I found.

On more than one occasion, I had to double-check the code was even mine.

A Quick Example

Let’s look at a simple demonstration of the problem. Someone has a table like this:

CREATE TABLE dbo.FavoriteBands
(
  UserID   int,
  BandName nvarchar(255)
);

INSERT dbo.FavoriteBands
(
  UserID, 
  BandName
) 
VALUES
  (1, N'Pink Floyd'), (1, N'New Order'), (1, N'The Hip'),
  (2, N'Zamfir'),     (2, N'ABBA');

On the page showing each user’s favorite bands, they want the output to look like this:

UserID   Bands
------   ---------------------------------------
1        Pink Floyd, New Order, The Hip
2        Zamfir, ABBA
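
As a reference point, here is the aggregation every variant below must produce, modeled in Python (an illustration only, mirroring the sample table above):

```python
from itertools import groupby
from operator import itemgetter

rows = [(1, "Pink Floyd"), (1, "New Order"), (1, "The Hip"),
        (2, "Zamfir"), (2, "ABBA")]

# Group by UserID and join each group's band names into one string.
bands_per_user = {
    user: ", ".join(band for _, band in grp)
    for user, grp in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0))
}
print(bands_per_user)  # {1: 'Pink Floyd, New Order, The Hip', 2: 'Zamfir, ABBA'}
```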

In the SQL Server 2005 days, I would have offered this solution:

SELECT DISTINCT UserID, Bands = 
      (SELECT BandName + ', '
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH('')) 
FROM dbo.FavoriteBands AS fb;

But when I look back on this code now, I see many problems I can’t resist fixing.

STUFF

The worst flaw in the code above is that it leaves a trailing comma:

UserID   Bands
------   ---------------------------------------
1        Pink Floyd, New Order, The Hip, 
2        Zamfir, ABBA, 

To solve this, I often see people wrap the query inside another and then surround the Bands output with LEFT(Bands, LEN(Bands)-1). But this is needless additional computation; instead, we can move the comma to the beginning of the string and remove the first one or two characters using STUFF. Then, we don’t have to calculate the length of the string because it’s irrelevant.

SELECT DISTINCT UserID, Bands = STUFF(
--------------------------------^^^^^^
      (SELECT ', ' + BandName
--------------^^^^^^
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH('')), 1, 2, '')
--------------------------^^^^^^^^^^^
FROM dbo.FavoriteBands AS fb;

You can adjust this further if you’re using a longer or conditional delimiter.
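
The front-delimiter trick is language-agnostic. Here’s a Python model of it (illustration only; in real Python code you would simply use sep.join):

```python
def group_concat(items, sep=", "):
    """Model of the STUFF approach: prefix every element with the
    delimiter, then chop off the first len(sep) characters -- no need to
    measure the final string the way LEFT(Bands, LEN(Bands) - 1) would."""
    return "".join(sep + item for item in items)[len(sep):]

print(group_concat(["Pink Floyd", "New Order", "The Hip"]))  # Pink Floyd, New Order, The Hip
```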

DISTINCT

The next problem is the use of DISTINCT. The way the code works, the correlated subquery generates a comma-separated list for each UserID value, and then the duplicates are removed. We can see this by looking at the plan and seeing that the XML-related operator executes seven times, even though only three rows are ultimately returned:

Figure 1: Plan showing filter after aggregation

If we change the code to use GROUP BY instead of DISTINCT:

SELECT /* DISTINCT */ UserID, Bands = STUFF(
      (SELECT ', ' + BandName
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH('')), 1, 2, '')
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;
--^^^^^^^^^^^^^^^

It’s a subtle difference, and it doesn’t change the results, but we can see the plan improves. Basically, the XML operations are deferred until after the duplicates are removed:

Figure 2: Plan showing filter before aggregation

At this scale, the difference is immaterial. But what if we add some more data? On my system, this adds a little over 11,000 rows:

INSERT dbo.FavoriteBands(UserID, BandName)
  SELECT [object_id], name FROM sys.all_columns;

If we run the two queries again, the differences in duration and CPU are immediately obvious:

Figure 3: Runtime results comparing DISTINCT and GROUP BY

But other side effects are also obvious in the plans. In the case of DISTINCT, the UDX once again executes for every row in the table, there’s an excessively eager index spool, there’s a distinct sort (always a red flag for me), and the query has a high memory grant, which can put a serious dent in concurrency:

Figure 4: DISTINCT plan at scale

Meanwhile, in the GROUP BY query, the UDX only executes once for each unique UserID, the eager spool reads a much lower number of rows, there’s no distinct sort operator (it’s been replaced by a hash match), and the memory grant is tiny in comparison:

Figure 5: GROUP BY plan at scale

It takes a while to go back and fix old code like this, but for some time now, I’ve been very regimented about always using GROUP BY instead of DISTINCT.

N Prefix

Too many old code samples I came across assumed no Unicode characters would ever be in use, or at least the sample data didn’t suggest the possibility. I’d offer my solution as above, and then the user would come back and say, “but on one row I have 'просто красный', and it comes back as '?????? ???????'!” I often remind people they always need to prefix potential Unicode string literals with the N prefix unless they absolutely know they’ll only ever be dealing with varchar strings or integers. I started being very explicit and probably even overcautious about it:

SELECT UserID, Bands = STUFF(
      (SELECT N', ' + BandName
--------------^
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH(N'')), 1, 2, N'')
----------------------^ -----------^
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;

XML Entitization

Another “what if?” scenario not always present in a user’s sample data is XML characters. For example, what if my favorite band is named “Bob & Sheila <> Strawberries”? The output with the above query is made XML-safe, which isn’t what we always want (e.g., Bob &amp; Sheila &lt;&gt; Strawberries). Google searches at the time would suggest “you need to add TYPE,” and I remember trying something like this:

SELECT UserID, Bands = STUFF(
      (SELECT N', ' + BandName
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH(N''), TYPE), 1, 2, N'')
--------------------------^^^^^^
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;

Unfortunately, the output data type from the subquery in this case is xml. This leads to the following error message:

Msg 8116, Level 16, State 1
Argument data type xml is invalid for argument 1 of stuff function.

You need to tell SQL Server you want to extract the resulting value as a string by indicating the data type and that you want the first element. Back then, I'd express this as follows:

SELECT UserID, Bands = STUFF(
      (SELECT N', ' + BandName
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH(N''), TYPE).value(N'.', N'nvarchar(max)'), 
--------------------------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           1, 2, N'')
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;

This would return the string without XML entitization. But is it the most efficient? Last year, Charlieface reminded me Mister Magoo performed some extensive testing and found ./text()[1] was faster than the other (shorter) approaches like . and .[1]. (I originally heard this from a comment Mikael Eriksson left for me here.) I once again adjusted my code to look like this:

SELECT UserID, Bands = STUFF(
      (SELECT N', ' + BandName
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH(N''), TYPE).value(N'./text()[1]', N'nvarchar(max)'), 
------------------------------------------^^^^^^^^^^^
           1, 2, N'')
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;

You might observe extracting the value in this way leads to a slightly more complex plan (you wouldn’t know it just from looking at duration, which stays pretty constant throughout the above changes):

Figure 6: Plan with ./text()[1]

The warning on the root SELECT operator comes from the explicit conversion to nvarchar(max).

Order

Occasionally, users would express that ordering is important. Often, this is simply ordering by the column you're appending—but sometimes, the desired order comes from somewhere else. People tend to believe if they saw a specific order come out of SQL Server once, it's the order they'll always see, but there's no reliability here. Order is never guaranteed unless you say so. In this case, let's say we want to order by BandName alphabetically. We can add this instruction inside the subquery:

SELECT UserID, Bands = STUFF(
      (SELECT N', ' + BandName
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         ORDER BY BandName
---------^^^^^^^^^^^^^^^^^
         FOR XML PATH(N''),
          TYPE).value(N'./text()[1]', N'nvarchar(max)'), 1, 2, N'')
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;

Note this may add a little execution time because of the additional sort operator, depending on whether there’s a supporting index.

STRING_AGG()

As I update my old answers, which should still work on the version that was relevant at the time of the question, the final snippet above (with or without the ORDER BY) is the form you’ll likely see. But you might see an additional update for the more modern form, too.

STRING_AGG() is arguably one of the best features added in SQL Server 2017. It’s both simpler and far more efficient than any of the above approaches, leading to tidy, well-performing queries like this:

SELECT UserID, Bands = STRING_AGG(BandName, N', ')
  FROM dbo.FavoriteBands
  GROUP BY UserID;

This isn’t a joke; that’s it. Here’s the plan—most importantly, there’s only a single scan against the table:

Figure 7: STRING_AGG() plan

If you want ordering, STRING_AGG() supports this, too (as long as you are in compatibility level 110 or greater, as Martin Smith points out here):

SELECT UserID, Bands = STRING_AGG(BandName, N', ')
    WITHIN GROUP (ORDER BY BandName)
----^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  FROM dbo.FavoriteBands
  GROUP BY UserID;

The plan looks the same as the one without sorting, but the query is a smidge slower in my tests. It’s still way faster than any of the FOR XML PATH variations.

Indexes

Testing against a heap is hardly fair. If you have even a nonclustered index the query can use, the plan looks even better. For example:

CREATE INDEX ix_FavoriteBands ON dbo.FavoriteBands(UserID, BandName);

Here’s the plan for the same ordered query using STRING_AGG()—note the lack of a sort operator, since the scan can be ordered:

Figure 8: STRING_AGG() plan with a supporting index

This shaves some time off, too—but to be fair, this index helps the FOR XML PATH variations as well. Here’s the new plan for the ordered version of that query:

Figure 9: FOR XML PATH plan with a supporting index

The plan is a little friendlier than before, including a seek instead of a scan in one spot, but this approach is still significantly slower than STRING_AGG().

A Caveat

There’s a little trick to using STRING_AGG() where, if the resulting string is more than 8,000 bytes, you’ll receive this error message:

Msg 9829, Level 16, State 1
STRING_AGG aggregation result exceeded the limit of 8000 bytes. Use LOB types to avoid result truncation.

To avoid this issue, you can inject an explicit conversion:

SELECT UserID, 
       Bands = STRING_AGG(CONVERT(nvarchar(max), BandName), N', ')
--------------------------^^^^^^^^^^^^^^^^^^^^^^
  FROM dbo.FavoriteBands
  GROUP BY UserID;

This adds a compute scalar operation to the plan—and an unsurprising CONVERT warning on the root SELECT operator—but otherwise, it has little impact on performance.

Conclusion

If you're on SQL Server 2017+ and you have any FOR XML PATH string aggregation in your codebase, I highly recommend switching over to the new approach. I performed some more thorough performance testing back during the SQL Server 2017 public preview, here and here, which you may want to revisit.

A common objection I've heard is that people are on SQL Server 2017 or greater but still on an older compatibility level. The apprehension seems to stem from STRING_SPLIT(), which is invalid on compatibility levels lower than 130, so they assume STRING_AGG() works the same way, but it is a bit more lenient. It is only a problem if you are using WITHIN GROUP on a compatibility level lower than 110. So improve away!
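If you're unsure which compatibility level a given database is running under, it's quick to check (a sketch; the ALTER statement is shown commented out as a reminder of the fix, not a blanket recommendation):

```sql
-- Check the current database's compatibility level
SELECT name, compatibility_level
  FROM sys.databases
  WHERE name = DB_NAME();

-- STRING_AGG() with WITHIN GROUP requires compatibility level 110 or higher:
-- ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 140;
```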

The post String Aggregation Over the Years in SQL Server appeared first on SQLPerformance.com.

Matching Supply With Demand — Solutions, Part 2 https://sqlperformance.com/2022/02/t-sql-queries/supply-demand-solutions-2 https://sqlperformance.com/2022/02/t-sql-queries/supply-demand-solutions-2#comments Wed, 09 Feb 2022 09:00:26 +0000 https://sqlperformance.com/?p=11255 Itzik Ben-Gan continues exploring solutions to a T-SQL challenge involving matching supply with demand.

The post Matching Supply With Demand — Solutions, Part 2 appeared first on SQLPerformance.com.

[ Jump to: Original challenge | Solutions: Part 1 | Part 2 | Part 3 ]

In this article, I continue the coverage of solutions to Peter Larsson’s enticing matching supply with demand challenge. Thanks again to Luca, Kamil Kosno, Daniel Brown, Brian Walker, Joe Obbish, Rainer Hoffmann, Paul White, Charlie, and Peter Larsson, for sending your solutions.

Last month I covered a solution based on interval intersections, using a classic predicate-based interval intersection test. I’ll refer to that solution as classic intersections. The classic interval intersections approach results in a plan with quadratic scaling (N^2). I demonstrated its poor performance against sample inputs ranging from 100K to 400K rows. It took the solution 931 seconds to complete against the 400K-row input! This month I’ll start by briefly reminding you of last month’s solution and why it scales and performs so badly. I’ll then introduce an approach based on a revision to the interval intersection test. This approach was used by Luca, Kamil, and possibly also Daniel, and it enables a solution with much better performance and scaling. I’ll refer to that solution as revised intersections.

The Problem With the Classic Intersection Test

Let’s start with a quick reminder of last month’s solution.

I used the following indexes on the input dbo.Auctions table to support the computation of the running totals that produce the demand and supply intervals:

-- Index to support solution

CREATE UNIQUE NONCLUSTERED INDEX idx_Code_ID_i_Quantity
  ON dbo.Auctions(Code, ID)
  INCLUDE(Quantity);

-- Enable batch-mode Window Aggregate

CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs
  ON dbo.Auctions(ID)
  WHERE ID = -1 AND ID = -2;

The following code has the complete solution implementing the classic interval intersections approach:

-- Drop temp tables if exist

SET NOCOUNT ON;
DROP TABLE IF EXISTS #MyPairings, #Demand, #Supply;
GO

WITH D0 AS

-- D0 computes running demand as EndDemand

(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndDemand
  FROM dbo.Auctions
  WHERE Code = 'D'
),

-- D extracts prev EndDemand as StartDemand, expressing start-end demand as an interval

D AS
(
  SELECT ID, Quantity, EndDemand - Quantity AS StartDemand, EndDemand
  FROM D0
)
SELECT ID,
  CAST(ISNULL(StartDemand, 0.0) AS DECIMAL(19, 6)) AS StartDemand,
  CAST(ISNULL(EndDemand,   0.0) AS DECIMAL(19, 6)) AS EndDemand
INTO #Demand
FROM D;

WITH S0 AS

-- S0 computes running supply as EndSupply

(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndSupply
  FROM dbo.Auctions
  WHERE Code = 'S'
),

-- S extracts prev EndSupply as StartSupply, expressing start-end supply as an interval

S AS
(
  SELECT ID, Quantity, EndSupply - Quantity AS StartSupply, EndSupply
  FROM S0
)
SELECT ID, 
  CAST(ISNULL(StartSupply, 0.0) AS DECIMAL(19, 6)) AS StartSupply,
  CAST(ISNULL(EndSupply, 0.0) AS DECIMAL(19, 6)) AS EndSupply
INTO #Supply
FROM S;

CREATE UNIQUE CLUSTERED INDEX idx_cl_ES_ID ON #Supply(EndSupply, ID);

-- Trades are the overlapping segments of the intersecting intervals

SELECT
  D.ID AS DemandID, S.ID AS SupplyID,
  CASE WHEN EndDemand < EndSupply THEN EndDemand ELSE EndSupply END - CASE WHEN StartDemand > StartSupply THEN StartDemand ELSE StartSupply END
    AS TradeQuantity
INTO #MyPairings
FROM #Demand AS D
  INNER JOIN #Supply AS S WITH (FORCESEEK)
    ON D.StartDemand < S.EndSupply AND D.EndDemand > S.StartSupply;

This code starts by computing the demand and supply intervals and writing those to temporary tables called #Demand and #Supply. The code then creates a clustered index on #Supply with EndSupply as the leading key to support the intersection test as best as possible. Finally, the code joins #Supply and #Demand, identifying trades as the overlapping segments of the intersecting intervals, using the following join predicate based on the classic interval intersection test:

D.StartDemand < S.EndSupply AND D.EndDemand > S.StartSupply

The plan for this solution is shown in Figure 1.

Figure 1: Query plan for solution based on classic intersections

As I explained last month, among the two range predicates involved, only one can be used as a seek predicate as part of an index seek operation, whereas the other has to be applied as a residual predicate. You can clearly see this in the plan for the last query in Figure 1. That’s why I only bothered creating one index on one of the tables. I also forced the use of a seek in the index I created using the FORCESEEK hint. Otherwise, the optimizer could end up ignoring that index and creating one of its own using an Index Spool operator.

This plan has quadratic complexity since per row in one side—#Demand in our case—the index seek will have to access in average half the rows in the other side—#Supply in our case—based on the seek predicate, and identify the final matches by applying the residual predicate.

If it’s still unclear to you why this plan has quadratic complexity, perhaps a visual depiction of the work could help, as shown in Figure 2.

Figure 2: Illustration of work with solution based on classic intersections

This illustration represents the work done by the Nested Loops join in the plan for the last query. The #Demand heap is the outer input of the join and the clustered index on #Supply (with EndSupply as the leading key) is the inner input. The red lines represent the index seek activities done per row in #Demand in the index on #Supply and the rows they access as part of the range scans in the index leaf. Towards extreme low StartDemand values in #Demand, the range scan needs to access close to all rows in #Supply. Towards extreme high StartDemand values in #Demand, the range scan needs to access close to zero rows in #Supply. On average, the range scan needs to access about half the rows in #Supply per row in demand. With D demand rows and S supply rows, the number of rows accessed is D + D*S/2. That’s in addition to the cost of the seeks that get you to the matching rows. For example, when filling dbo.Auctions with 200,000 demand rows and 200,000 supply rows, this translates to the Nested Loops join accessing 200,000 demand rows + 200,000*200,000/2 supply rows, or 200,000 + 200,000^2/2 = 20,000,200,000 rows accessed. There’s a lot of rescanning of supply rows happening here!

If you want to stick to the interval intersections idea but get good performance, you need to come up with a way to reduce the amount of work done.

In his solution, Joe Obbish bucketized the intervals based on the maximum interval length. This way he was able to reduce the number of rows the joins needed to handle and rely on batch processing. He used the classic interval intersection test as a final filter. Joe’s solution works well as long as the maximum interval length is not very high, but the solution’s runtime increases as the maximum interval length increases.

So, what else can you do…?

Revised Intersection Test

Luca, Kamil, and possibly Daniel (there was an unclear part about his posted solution due to the website’s formatting, so I had to guess) used a revised interval intersection test that enables much better utilization of indexing.

Their intersection test is a disjunction of two predicates (predicates separated by OR operator):

   (D.StartDemand >= S.StartSupply AND D.StartDemand < S.EndSupply) OR (S.StartSupply >= D.StartDemand AND S.StartSupply < D.EndDemand)

In English, either the demand start delimiter intersects with the supply interval in an inclusive, exclusive manner (including the start and excluding the end); or the supply start delimiter intersects with the demand interval, in an inclusive, exclusive manner. To make the two predicates disjoint (not have overlapping cases) yet still complete in covering all cases, you can keep the = operator in just one or the other, for example:

   (D.StartDemand >= S.StartSupply AND D.StartDemand < S.EndSupply) OR (S.StartSupply > D.StartDemand AND S.StartSupply < D.EndDemand)
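As a quick sanity check, you can confirm on a few hypothetical interval pairs (made up for illustration, not taken from the challenge data) that the two branches never both fire and that their disjunction agrees with the classic test:

```sql
-- Each branch is disjoint from the other; together they cover exactly
-- the same cases as the classic intersection test
DECLARE @pairs TABLE
(
  StartDemand DECIMAL(19, 6), EndDemand DECIMAL(19, 6),
  StartSupply DECIMAL(19, 6), EndSupply DECIMAL(19, 6)
);

INSERT INTO @pairs VALUES
  (0, 5, 3, 8),  -- intersecting; supply start falls inside the demand interval
  (3, 8, 0, 5),  -- intersecting; demand start falls inside the supply interval
  (0, 5, 5, 8),  -- touching endpoints only; no intersection
  (0, 2, 3, 8);  -- disjoint intervals

SELECT *,
  Classic = CASE WHEN StartDemand <  EndSupply   AND EndDemand   > StartSupply THEN 1 ELSE 0 END,
  Branch1 = CASE WHEN StartDemand >= StartSupply AND StartDemand < EndSupply   THEN 1 ELSE 0 END,
  Branch2 = CASE WHEN StartSupply >  StartDemand AND StartSupply < EndDemand   THEN 1 ELSE 0 END
FROM @pairs;

-- On every row, Classic = Branch1 + Branch2
```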

This revised interval intersection test is quite insightful. Each of the predicates can potentially efficiently use an index. Consider predicate 1:

D.StartDemand >= S.StartSupply AND D.StartDemand < S.EndSupply
^^^^^^^^^^^^^                      ^^^^^^^^^^^^^

Assuming you create a covering index on #Demand with StartDemand as the leading key, you can potentially get a Nested Loops join applying the work illustrated in Figure 3.


Figure 3: Illustration of theoretical work required to process predicate 1

Yes, you still pay for a seek in the #Demand index per row in #Supply, but the range scans in the index leaf access almost disjoint subsets of rows. No rescanning of rows!

The situation is similar with predicate 2:

S.StartSupply > D.StartDemand AND S.StartSupply < D.EndDemand
^^^^^^^^^^^^^                     ^^^^^^^^^^^^^

Assuming you create a covering index on #Supply with StartSupply as the leading key, you can potentially get a Nested Loops join applying the work illustrated in Figure 4.


Figure 4: Illustration of theoretical work required to process predicate 2

Also, here you pay for a seek in the #Supply index per row in #Demand, and then the range scans in the index leaf access almost disjoint subsets of rows. Again, no rescanning of rows!

Assuming D demand rows and S supply rows, the work can be described as:

Number of index seek operations: D  + S
Number of rows accessed:         2D + 2S

So likely, the biggest portion of the cost here is the I/O cost involved with the seeks. But this part of the work has linear scaling compared to the quadratic scaling of the classic interval intersections query.

Of course, you need to consider the preliminary cost of the index creation on the temporary tables, which has n log n scaling due to the sorting involved, but you pay this part as a preliminary part of both solutions.

Let’s try and put this theory into practice. Let’s start by populating the #Demand and #Supply temporary tables with the demand and supply intervals (same as with the classic interval intersections) and creating the supporting indexes:

-- Drop temp tables if exist

SET NOCOUNT ON;
DROP TABLE IF EXISTS #MyPairings, #Demand, #Supply;
GO

WITH D0 AS

-- D0 computes running demand as EndDemand

(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndDemand
  FROM dbo.Auctions
  WHERE Code = 'D'
),

-- D extracts prev EndDemand as StartDemand, expressing start-end demand as an interval

D AS
(
  SELECT ID, Quantity, EndDemand - Quantity AS StartDemand, EndDemand
  FROM D0
)
SELECT ID, 
  CAST(ISNULL(StartDemand, 0.0) AS DECIMAL(19, 6)) AS StartDemand,
  CAST(ISNULL(EndDemand,   0.0) AS DECIMAL(19, 6)) AS EndDemand
INTO #Demand
FROM D;

WITH S0 AS

-- S0 computes running supply as EndSupply

(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndSupply
  FROM dbo.Auctions
  WHERE Code = 'S'
),

-- S extracts prev EndSupply as StartSupply, expressing start-end supply as an interval

S AS
(
  SELECT ID, Quantity, EndSupply - Quantity AS StartSupply, EndSupply
  FROM S0
)
SELECT ID, 
  CAST(ISNULL(StartSupply, 0.0) AS DECIMAL(19, 6)) AS StartSupply,
  CAST(ISNULL(EndSupply,   0.0) AS DECIMAL(19, 6)) AS EndSupply
INTO #Supply
FROM S;

-- Indexing

CREATE UNIQUE CLUSTERED INDEX idx_cl_SD_ID ON #Demand(StartDemand, ID);
CREATE UNIQUE CLUSTERED INDEX idx_cl_SS_ID ON #Supply(StartSupply, ID);

The plans for populating the temporary tables and creating the indexes are shown in Figure 5.


Figure 5: Query plans for populating temp tables and creating indexes

No surprises here.

Now to the final query. You might be tempted to use a single query with the aforementioned disjunction of two predicates, like so:

SELECT
  D.ID AS DemandID, S.ID AS SupplyID,
  CASE WHEN EndDemand < EndSupply THEN EndDemand ELSE EndSupply END - CASE WHEN StartDemand > StartSupply THEN StartDemand ELSE StartSupply END
    AS TradeQuantity
INTO #MyPairings
FROM #Demand AS D
  INNER JOIN #Supply AS S
    ON (D.StartDemand >= S.StartSupply AND D.StartDemand < S.EndSupply) OR (S.StartSupply >  D.StartDemand AND S.StartSupply < D.EndDemand);

The plan for this query is shown in Figure 6.


Figure 6: Query plan for final query using revised intersection test, try 1

The problem is that the optimizer doesn’t know how to break this logic into two separate activities, each handling a different predicate and utilizing the respective index efficiently. So, it ends up with a full cartesian product of the two sides, applying all predicates as residual predicates. With 200,000 demand rows and 200,000 supply rows, the join processes 40,000,000,000 rows.

The insightful trick used by Luca, Kamil, and possibly Daniel was to break the logic into two queries, unifying their results. That’s where using two disjoint predicates becomes important! You can use a UNION ALL operator instead of UNION, avoiding the unnecessary cost of looking for duplicates. Here’s the query implementing this approach:

SELECT
  D.ID AS DemandID, S.ID AS SupplyID,
  CASE WHEN EndDemand < EndSupply THEN EndDemand ELSE EndSupply END - CASE WHEN StartDemand > StartSupply THEN StartDemand ELSE StartSupply END
    AS TradeQuantity
INTO #MyPairings
FROM #Demand AS D
  INNER JOIN #Supply AS S
    ON D.StartDemand >= S.StartSupply AND D.StartDemand < S.EndSupply
    
UNION ALL

SELECT
  D.ID AS DemandID, S.ID AS SupplyID,
  CASE WHEN EndDemand < EndSupply THEN EndDemand ELSE EndSupply END - CASE WHEN StartDemand > StartSupply THEN StartDemand ELSE StartSupply END
    AS TradeQuantity
FROM #Demand AS D
  INNER JOIN #Supply AS S
    ON S.StartSupply > D.StartDemand AND S.StartSupply < D.EndDemand;

The plan for this query is shown in Figure 7.


Figure 7: Query plan for final query using revised intersection test, try 2

This is just beautiful! Isn’t it? And because it’s so beautiful, here’s the entire solution from scratch, including the creation of temp tables, indexing, and the final query:

-- Drop temp tables if exist

SET NOCOUNT ON;
DROP TABLE IF EXISTS #MyPairings, #Demand, #Supply;
GO

WITH D0 AS

-- D0 computes running demand as EndDemand

(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndDemand
  FROM dbo.Auctions
  WHERE Code = 'D'
),

-- D extracts prev EndDemand as StartDemand, expressing start-end demand as an interval

D AS
(
  SELECT ID, Quantity, EndDemand - Quantity AS StartDemand, EndDemand
  FROM D0
)
SELECT ID, 
  CAST(ISNULL(StartDemand, 0.0) AS DECIMAL(19, 6)) AS StartDemand,
  CAST(ISNULL(EndDemand,   0.0) AS DECIMAL(19, 6)) AS EndDemand
INTO #Demand
FROM D;

WITH S0 AS

-- S0 computes running supply as EndSupply

(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndSupply
  FROM dbo.Auctions
  WHERE Code = 'S'
),

-- S extracts prev EndSupply as StartSupply, expressing start-end supply as an interval

S AS
(
  SELECT ID, Quantity, EndSupply - Quantity AS StartSupply, EndSupply
  FROM S0
)
SELECT ID,
  CAST(ISNULL(StartSupply, 0.0) AS DECIMAL(19, 6)) AS StartSupply,
  CAST(ISNULL(EndSupply,   0.0) AS DECIMAL(19, 6)) AS EndSupply
INTO #Supply
FROM S;

-- Indexing

CREATE UNIQUE CLUSTERED INDEX idx_cl_SD_ID ON #Demand(StartDemand, ID);
CREATE UNIQUE CLUSTERED INDEX idx_cl_SS_ID ON #Supply(StartSupply, ID);

-- Trades are the overlapping segments of the intersecting intervals

SELECT
  D.ID AS DemandID, S.ID AS SupplyID,
  CASE WHEN EndDemand < EndSupply THEN EndDemand ELSE EndSupply END - CASE WHEN StartDemand > StartSupply THEN StartDemand ELSE StartSupply END
    AS TradeQuantity
INTO #MyPairings
FROM #Demand AS D
  INNER JOIN #Supply AS S
    ON D.StartDemand >= S.StartSupply AND D.StartDemand < S.EndSupply

UNION ALL

SELECT
  D.ID AS DemandID, S.ID AS SupplyID,
  CASE WHEN EndDemand < EndSupply THEN EndDemand ELSE EndSupply END - CASE WHEN StartDemand > StartSupply THEN StartDemand ELSE StartSupply END
    AS TradeQuantity
FROM #Demand AS D
  INNER JOIN #Supply AS S
    ON S.StartSupply > D.StartDemand AND S.StartSupply < D.EndDemand;

As mentioned earlier, I’ll refer to this solution as revised intersections.

Kamil’s solution

Between Luca’s, Kamil’s, and Daniel’s solutions, Kamil’s was the fastest. Here’s Kamil’s complete solution:

SET NOCOUNT ON;
DROP TABLE IF EXISTS #supply, #demand;
GO

CREATE TABLE #demand
(
  DemandID           INT NOT NULL,
  DemandQuantity     DECIMAL(19, 6) NOT NULL,
  QuantityFromDemand DECIMAL(19, 6) NOT NULL,
  QuantityToDemand   DECIMAL(19, 6) NOT NULL
);
 
CREATE TABLE #supply
(
  SupplyID           INT NOT NULL,
  QuantityFromSupply DECIMAL(19, 6) NOT NULL,
  QuantityToSupply   DECIMAL(19,6) NOT NULL
);

WITH demand AS
(
  SELECT
    a.ID AS DemandID,
    a.Quantity AS DemandQuantity,
    SUM(a.Quantity) OVER(ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
      - a.Quantity AS QuantityFromDemand,
    SUM(a.Quantity) OVER(ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
      AS QuantityToDemand
  FROM dbo.Auctions AS a WHERE Code = 'D'
)
INSERT INTO #demand
(
  DemandID,
  DemandQuantity,
  QuantityFromDemand,
  QuantityToDemand
)
SELECT
  d.DemandID,
  d.DemandQuantity,
  d.QuantityFromDemand,
  d.QuantityToDemand
FROM demand AS d;

WITH supply AS
(
  SELECT
    a.ID AS SupplyID,
    SUM(a.Quantity) OVER(ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
      - a.Quantity AS QuantityFromSupply,
    SUM(a.Quantity) OVER(ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
      AS QuantityToSupply
  FROM dbo.Auctions AS a WHERE Code = 'S'
)
INSERT INTO #supply
(
  SupplyID,
  QuantityFromSupply,
  QuantityToSupply
)
SELECT
  s.SupplyID,
  s.QuantityFromSupply,
  s.QuantityToSupply
FROM supply AS s;

CREATE UNIQUE INDEX ix_1 ON #demand(QuantityFromDemand) 
  INCLUDE(DemandID, DemandQuantity, QuantityToDemand);
CREATE UNIQUE INDEX ix_1 ON #supply(QuantityFromSupply) 
  INCLUDE(SupplyID, QuantityToSupply);
CREATE NONCLUSTERED COLUMNSTORE INDEX nci ON #demand(DemandID) 
  WHERE DemandID is null;

DROP TABLE IF EXISTS #myPairings;

CREATE TABLE #myPairings
(
  DemandID      INT NOT NULL,
  SupplyID      INT NOT NULL,
  TradeQuantity DECIMAL(19, 6) NOT NULL
);

INSERT INTO #myPairings(DemandID, SupplyID, TradeQuantity) 
  SELECT
    d.DemandID,
    s.SupplyID, 
    d.DemandQuantity
      - CASE WHEN d.QuantityFromDemand < s.QuantityFromSupply 
             THEN s.QuantityFromSupply - d.QuantityFromDemand ELSE 0 end
      - CASE WHEN s.QuantityToSupply < d.QuantityToDemand 
             THEN d.QuantityToDemand - s.QuantityToSupply ELSE 0 END AS TradeQuantity 
  FROM #demand AS d 
    INNER JOIN #supply AS s 
      ON (d.QuantityFromDemand < s.QuantityToSupply 
          AND s.QuantityFromSupply <= d.QuantityFromDemand)
 
  UNION ALL

  SELECT
    d.DemandID,
    s.SupplyID,
    d.DemandQuantity
      - CASE WHEN d.QuantityFromDemand < s.QuantityFromSupply 
             THEN s.QuantityFromSupply - d.QuantityFromDemand ELSE 0 END
      - CASE WHEN s.QuantityToSupply < d.QuantityToDemand 
             THEN d.QuantityToDemand - s.QuantityToSupply ELSE 0 END AS TradeQuantity
  FROM #supply AS s 
    INNER JOIN #demand AS d 
      ON (s.QuantityFromSupply < d.QuantityToDemand 
          AND d.QuantityFromDemand < s.QuantityFromSupply);

As you can see, it’s very close to the revised intersections solution I covered.

The plan for the final query in Kamil’s solution is shown in Figure 8.


Figure 8: Query plan for final query in Kamil’s solution

The plan looks pretty similar to the one shown earlier in Figure 7.

Performance Test

Recall that the classic intersections solution took 931 seconds to complete against an input with 400K rows. That’s 15 minutes! It’s much, much worse than the cursor solution, which took about 12 seconds to complete against the same input. Figure 9 has the performance numbers including the new solutions discussed in this article.


Figure 9: Performance test

As you can see, Kamil’s solution and the similar revised intersections solution took about 1.5 seconds to complete against the 400K-row input. That’s a pretty decent improvement compared to the original cursor-based solution. The main drawback of these solutions is the I/O cost. With a seek per row, for a 400K-row input, these solutions perform in excess of 1.35M reads. But it could also be considered a perfectly acceptable cost given the good run time and scaling you get.

What’s Next?

Our first attempt at solving the matching supply with demand challenge relied on the classic interval intersection test and got a poor-performing plan with quadratic scaling. Much worse than the cursor-based solution. Based on insights from Luca, Kamil, and Daniel, using a revised interval intersection test, our second attempt was a significant improvement that utilizes indexing efficiently and performs better than the cursor-based solution. However, this solution involves a significant I/O cost. Next month I’ll continue exploring additional solutions.

[ Jump to: Original challenge | Solutions: Part 1 | Part 2 | Part 3 ]

Matching Supply With Demand — Solutions, Part 1 https://sqlperformance.com/2022/01/t-sql-queries/supply-demand-solutions-1 https://sqlperformance.com/2022/01/t-sql-queries/supply-demand-solutions-1#comments Wed, 12 Jan 2022 09:00:08 +0000 https://sqlperformance.com/?p=11185 Itzik Ben-Gan presents initial solutions to his latest T-SQL challenge surrounding auctions (supply and demand).

The post Matching Supply With Demand — Solutions, Part 1 appeared first on SQLPerformance.com.

[ Jump to: Original challenge | Solutions: Part 1 | Part 2 | Part 3 ]

Last month, I covered Peter Larsson's puzzle of matching supply with demand. I showed Peter's straightforward cursor-based solution and explained that it has linear scaling. The challenge I left you with is to try and come up with a set-based solution to the task, and boy, have people risen to the challenge! Thanks Luca, Kamil Kosno, Daniel Brown, Brian Walker, Joe Obbish, Rainer Hoffmann, Paul White, Charlie, and, of course, Peter Larsson, for sending your solutions. Some of the ideas were brilliant and outright mind-blowing.

This month, I'm going to start exploring the submitted solutions, roughly going from the worst performing to the best performing ones. Why even bother with the poorly performing ones? Because you can still learn a lot from them; for example, by identifying anti-patterns. Indeed, the first attempt at solving this challenge for many people, including myself and Peter, is based on an interval intersection concept. It so happens that the classic predicate-based technique for identifying interval intersections performs poorly since there's no good indexing scheme to support it. This article is dedicated to this poorly performing approach. Despite the poor performance, working on the solution is an interesting exercise. It requires practicing the skill of modeling the problem in a way that lends itself to set-based treatment. It is also interesting to identify the reason for the bad performance, making it easier to avoid the anti-pattern in the future. Keep in mind, this solution is just the starting point.

DDL and a Small Set of Sample Data

As a reminder, the task involves querying a table called Auctions. Use the following code to create the table and populate it with a small set of sample data:

DROP TABLE IF EXISTS dbo.Auctions;

CREATE TABLE dbo.Auctions
(
  ID INT NOT NULL IDENTITY(1, 1)
    CONSTRAINT pk_Auctions PRIMARY KEY CLUSTERED,
  Code CHAR(1) NOT NULL
    CONSTRAINT ck_Auctions_Code CHECK (Code = 'D' OR Code = 'S'),
  Quantity DECIMAL(19, 6) NOT NULL
    CONSTRAINT ck_Auctions_Quantity CHECK (Quantity > 0)
);

SET NOCOUNT ON;

DELETE FROM dbo.Auctions;

SET IDENTITY_INSERT dbo.Auctions ON;

INSERT INTO dbo.Auctions(ID, Code, Quantity) VALUES
  (1, 'D', 5.0),
  (2, 'D', 3.0),
  (3, 'D', 8.0),
  (5, 'D', 2.0),
  (6, 'D', 8.0),
  (7, 'D', 4.0),
  (8, 'D', 2.0),
  (1000, 'S', 8.0),
  (2000, 'S', 6.0),
  (3000, 'S', 2.0),
  (4000, 'S', 2.0),
  (5000, 'S', 4.0),
  (6000, 'S', 3.0),
  (7000, 'S', 2.0);

SET IDENTITY_INSERT dbo.Auctions OFF;

Your task was to create pairings that match supply with demand entries based on ID ordering, writing those to a temporary table. Following is the desired result for the small set of sample data:

DemandID    SupplyID    TradeQuantity
----------- ----------- --------------
1           1000        5.000000
2           1000        3.000000
3           2000        6.000000
3           3000        2.000000
5           4000        2.000000
6           5000        4.000000
6           6000        3.000000
6           7000        1.000000
7           7000        1.000000

Last month, I also provided code that you can use to populate the Auctions table with a large set of sample data, controlling the number of supply and demand entries as well as their range of quantities. Make sure that you use the code from last month's article to check the performance of the solutions.

Modeling the Data as Intervals

One intriguing idea that lends itself to supporting set-based solutions is to model the data as intervals. In other words, represent each demand and supply entry as an interval starting with the running total quantities of the same kind (demand or supply) up to but excluding the current, and ending with the running total including the current, of course based on ID ordering. For example, looking at the small set of sample data, the first demand entry (ID 1) is for a quantity of 5.0 and the second (ID 2) is for a quantity of 3.0. The first demand entry can be represented with the interval start: 0.0, end: 5.0, and the second with the interval start: 5.0, end: 8.0, and so on.
Similarly, the first supply entry (ID 1000) is for a quantity of 8.0 and the second (ID 2000) is for a quantity of 6.0. The first supply entry can be represented with the interval start: 0.0, end: 8.0, and the second with the interval start: 8.0, end: 14.0, and so on.

The demand-supply pairings you need to create are then the overlapping segments of the intersecting intervals between the two kinds.

This is probably best understood with a visual depiction of the interval-based modeling of the data and the desired result, as shown in Figure 1.

Figure 1: Modeling the Data as Intervals

The visual depiction in Figure 1 is pretty self-explanatory but, in short…

The blue rectangles represent the demand entries as intervals, showing the exclusive running total quantities as the start of the interval and the inclusive running total as the end of the interval. The yellow rectangles do the same for supply entries. Then notice how the overlapping segments of the intersecting intervals of the two kinds, which are depicted by the green rectangles, are the demand-supply pairings you need to produce. For example, the first result pairing is with demand ID 1, supply ID 1000, quantity 5. The second result pairing is with demand ID 2, supply ID 1000, quantity 3. And so on.
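To make the modeling concrete before writing any T-SQL, here's a minimal Python sketch (not part of the original solution; an illustration only) that builds the running-total intervals for the small sample data and derives the pairings as the overlapping segments of the intersecting intervals:

```python
# Sample data mirroring the article's Auctions rows: (ID, Quantity)
demand = [(1, 5.0), (2, 3.0), (3, 8.0), (5, 2.0), (6, 8.0), (7, 4.0), (8, 2.0)]
supply = [(1000, 8.0), (2000, 6.0), (3000, 2.0), (4000, 2.0),
          (5000, 4.0), (6000, 3.0), (7000, 2.0)]

def to_intervals(entries):
    """Turn (ID, Quantity) rows into (ID, start, end) running-total intervals."""
    out, total = [], 0.0
    for entry_id, qty in entries:
        out.append((entry_id, total, total + qty))  # start excludes current qty
        total += qty                                # end includes current qty
    return out

d_intervals = to_intervals(demand)
s_intervals = to_intervals(supply)

# Pairings = overlapping segments of the intersecting demand/supply intervals
pairings = [
    (d_id, s_id, min(d_end, s_end) - max(d_start, s_start))
    for d_id, d_start, d_end in d_intervals
    for s_id, s_start, s_end in s_intervals
    if d_start < s_end and d_end > s_start   # classic interval-intersection test
]
```

Running this reproduces the nine pairings from the desired-result table, starting with demand ID 1 matched to supply ID 1000 for a quantity of 5.0.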

Interval Intersections Using CTEs

Before you start writing the T-SQL code with solutions based on the interval modeling idea, you should already have an intuitive sense for what indexes are likely to be useful here. Since you're likely to use window functions to compute running totals, you could benefit from a covering index with a key based on the columns Code, ID, and including the column Quantity. Here's the code to create such an index:

CREATE UNIQUE NONCLUSTERED INDEX idx_Code_ID_i_Quantity
  ON dbo.Auctions(Code, ID)
  INCLUDE(Quantity);

That's the same index I recommended for the cursor-based solution that I covered last month.

Also, there's potential here to benefit from batch processing. You can enable its consideration without meeting the requirements of batch mode on rowstore (such as using SQL Server 2019 Enterprise edition or later) by creating the following dummy filtered columnstore index:

CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs
  ON dbo.Auctions(ID)
  WHERE ID = -1 AND ID = -2;

You can now start working on the solution's T-SQL code.

The following code creates the intervals representing the demand entries:

WITH D0 AS
-- D0 computes running demand as EndDemand
(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndDemand
  FROM dbo.Auctions
  WHERE Code = 'D'
),
-- D extracts prev EndDemand as StartDemand, expressing start-end demand as an interval
D AS
(
  SELECT ID, Quantity, EndDemand - Quantity AS StartDemand, EndDemand
  FROM D0
)
SELECT *
FROM D;

The query defining the CTE D0 filters demand entries from the Auctions table and computes a running total quantity as the end delimiter of the demand intervals. Then the query defining the second CTE called D queries D0 and computes the start delimiter of the demand intervals by subtracting the current quantity from the end delimiter.

This code generates the following output:

ID   Quantity  StartDemand  EndDemand
---- --------- ------------ ----------
1    5.000000  0.000000     5.000000
2    3.000000  5.000000     8.000000
3    8.000000  8.000000     16.000000
5    2.000000  16.000000    18.000000
6    8.000000  18.000000    26.000000
7    4.000000  26.000000    30.000000
8    2.000000  30.000000    32.000000

The supply intervals are generated very similarly by applying the same logic to the supply entries, using the following code:

WITH S0 AS
-- S0 computes running supply as EndSupply
(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndSupply
  FROM dbo.Auctions
  WHERE Code = 'S'
),
-- S extracts prev EndSupply as StartSupply, expressing start-end supply as an interval
S AS
(
  SELECT ID, Quantity, EndSupply - Quantity AS StartSupply, EndSupply
  FROM S0
)
SELECT *
FROM S;

This code generates the following output:

ID    Quantity  StartSupply  EndSupply
----- --------- ------------ ----------
1000  8.000000  0.000000     8.000000
2000  6.000000  8.000000     14.000000
3000  2.000000  14.000000    16.000000
4000  2.000000  16.000000    18.000000
5000  4.000000  18.000000    22.000000
6000  3.000000  22.000000    25.000000
7000  2.000000  25.000000    27.000000

What's left is then to identify the intersecting demand and supply intervals from the CTEs D and S, and compute the overlapping segments of those intersecting intervals. Remember the result pairings should be written into a temporary table. This can be done using the following code:

-- Drop temp table if exists
DROP TABLE IF EXISTS #MyPairings;

WITH D0 AS
-- D0 computes running demand as EndDemand
(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndDemand
  FROM dbo.Auctions
  WHERE Code = 'D'
),
-- D extracts prev EndDemand as StartDemand, expressing start-end demand as an interval
D AS
(
  SELECT ID, Quantity, EndDemand - Quantity AS StartDemand, EndDemand
  FROM D0
),
S0 AS
-- S0 computes running supply as EndSupply
(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndSupply
  FROM dbo.Auctions
  WHERE Code = 'S'
),
-- S extracts prev EndSupply as StartSupply, expressing start-end supply as an interval
S AS
(
  SELECT ID, Quantity, EndSupply - Quantity AS StartSupply, EndSupply
  FROM S0
)
-- Outer query identifies trades as the overlapping segments of the intersecting intervals
-- In the intersecting demand and supply intervals the trade quantity is then 
-- LEAST(EndDemand, EndSupply) - GREATEST(StartDemand, StartSupply)
SELECT
  D.ID AS DemandID, S.ID AS SupplyID,
  CASE WHEN EndDemand < EndSupply THEN EndDemand ELSE EndSupply END
    - CASE WHEN StartDemand > StartSupply THEN StartDemand ELSE StartSupply END
      AS TradeQuantity
INTO #MyPairings
FROM D INNER JOIN S
  ON D.StartDemand < S.EndSupply AND D.EndDemand > S.StartSupply;

Besides the code that creates the demand and supply intervals, which you already saw earlier, the main addition here is the outer query, which identifies the intersecting intervals between D and S, and computes the overlapping segments. To identify the intersecting intervals, the outer query joins D and S using the following join predicate:

D.StartDemand < S.EndSupply AND D.EndDemand > S.StartSupply

That's the classic predicate to identify interval intersection. It's also the main source for the solution's poor performance, as I'll explain shortly.

The outer query also computes the trade quantity in the SELECT list as:

LEAST(EndDemand, EndSupply) - GREATEST(StartDemand, StartSupply)

If you're using Azure SQL Database (or SQL Server 2022, where the GREATEST and LEAST functions were introduced), you can use this expression. If you're using SQL Server 2019 or earlier, you can use the following logically equivalent alternative:

CASE WHEN EndDemand < EndSupply THEN EndDemand ELSE EndSupply END
  - CASE WHEN StartDemand > StartSupply THEN StartDemand ELSE StartSupply END
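To convince yourself the two forms agree, here's a quick Python property check (illustrative only; trade_case and trade_least_greatest are hypothetical helper names mirroring the two T-SQL expressions):

```python
import random

def trade_case(start_d, end_d, start_s, end_s):
    # Mirrors the T-SQL CASE expressions
    least    = end_d   if end_d   < end_s   else end_s
    greatest = start_d if start_d > start_s else start_s
    return least - greatest

def trade_least_greatest(start_d, end_d, start_s, end_s):
    # Mirrors LEAST(EndDemand, EndSupply) - GREATEST(StartDemand, StartSupply)
    return min(end_d, end_s) - max(start_d, start_s)

# Property check over random intervals: the two forms always agree
random.seed(42)
for _ in range(1000):
    sd, ss = random.uniform(0, 100), random.uniform(0, 100)
    ed, es = sd + random.uniform(0, 50), ss + random.uniform(0, 50)
    assert trade_case(sd, ed, ss, es) == trade_least_greatest(sd, ed, ss, es)
```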

Since the requirement was to write the result into a temporary table, the outer query uses a SELECT INTO statement to achieve this.

The idea to model the data as intervals is clearly intriguing and elegant. But what about performance? Unfortunately, this specific solution has a big problem related to how interval intersection is identified. Examine the plan for this solution shown in Figure 2.

Figure 2: Query Plan for Intersections Using CTEs Solution

Let's start with the inexpensive parts of the plan.

The outer input of the Nested Loops join computes the demand intervals. It uses an Index Seek operator to retrieve the demand entries, and a batch mode Window Aggregate operator to compute the demand interval end delimiter (referred to as Expr1005 in this plan). The demand interval start delimiter is then Expr1005 – Quantity (from D).

As a side note, you might find the use of an explicit Sort operator prior to the batch mode Window Aggregate operator surprising here, since the demand entries retrieved from the Index Seek are already ordered by ID, just like the window function needs them to be. This has to do with the fact that, currently, SQL Server doesn't support an efficient combination of a parallel order-preserving index operation prior to a parallel batch mode Window Aggregate operator. If you force a serial plan just for experimentation purposes, you'll see the Sort operator disappear. SQL Server decided that, overall, the use of parallelism here was preferred, despite it resulting in the added explicit sorting. But again, this part of the plan represents a small portion of the work in the grand scheme of things.

Similarly, the inner input of the Nested Loops join starts by computing the supply intervals. Curiously, SQL Server chose to use row-mode operators to handle this part. On one hand, the row mode operators used to compute running totals involve more overhead than the batch mode Window Aggregate alternative; on the other hand, SQL Server has an efficient parallel implementation of an order-preserving index operation followed by the window function's computation, avoiding explicit sorting for this part. It's curious that the optimizer chose one strategy for the demand intervals and another for the supply intervals. At any rate, the Index Seek operator retrieves the supply entries, and the subsequent sequence of operators up to the Compute Scalar operator computes the supply interval end delimiter (referred to as Expr1009 in this plan). The supply interval start delimiter is then Expr1009 – Quantity (from S).

Despite the amount of text I used to describe these two parts, the truly expensive part of the work in the plan is what comes next.

The next part needs to join the demand intervals and the supply intervals using the following predicate:

D.StartDemand < S.EndSupply AND D.EndDemand > S.StartSupply

With no supporting index, assuming DI demand intervals and SI supply intervals, this would involve processing DI * SI rows. The plan in Figure 2 was created after filling the Auctions table with 400,000 rows (200,000 demand entries and 200,000 supply entries). So, with no supporting index, the plan would have needed to process 200,000 * 200,000 = 40,000,000,000 rows. To mitigate this cost, the optimizer chose to create a temporary index (see the Index Spool operator) with the supply interval end delimiter (Expr1009) as the key. That's pretty much the best it could do. However, this takes care of only part of the problem. With two range predicates, only one can be supported by an index seek predicate. The other has to be handled using a residual predicate. Indeed, you can see in the plan that the seek in the temporary index uses the seek predicate Expr1009 > Expr1005 – D.Quantity, followed by a Filter operator handling the residual predicate Expr1005 > Expr1009 – S.Quantity.

Assuming that, on average, the seek predicate isolates half the supply rows from the index per demand row, the total number of rows emitted from the Index Spool operator and processed by the Filter operator is then DI * SI / 2. In our case, with 200,000 demand rows and 200,000 supply rows, this translates to 20,000,000,000. Indeed, the arrow going from the Index Spool operator to the Filter operator reports a number of rows close to this.
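You can imitate this seek-plus-residual access pattern outside SQL Server to see the gap between rows examined and rows returned. The following Python sketch (an illustration of the pattern, not SQL Server's actual mechanics) sorts the supply intervals on their end delimiter, applies the seek predicate with a binary search, and counts the rows the residual predicate still has to examine, using the small sample intervals shown earlier:

```python
import bisect

# (ID, start, end) intervals from the small sample data shown earlier
demand = [(1, 0, 5), (2, 5, 8), (3, 8, 16), (5, 16, 18),
          (6, 18, 26), (7, 26, 30), (8, 30, 32)]
supply = [(1000, 0, 8), (2000, 8, 14), (3000, 14, 16), (4000, 16, 18),
          (5000, 18, 22), (6000, 22, 25), (7000, 25, 27)]

# "Index" keyed on EndSupply, like the temporary index built by the Index Spool
supply_by_end = sorted(supply, key=lambda s: s[2])
end_keys = [s[2] for s in supply_by_end]

seek_rows, result_rows = 0, 0
for d_id, d_start, d_end in demand:
    lo = bisect.bisect_right(end_keys, d_start)    # seek: EndSupply > StartDemand
    candidates = supply_by_end[lo:]
    seek_rows += len(candidates)                   # rows the residual filter must examine
    result_rows += sum(1 for s_id, s_start, s_end in candidates
                       if d_end > s_start)         # residual: EndDemand > StartSupply
```

Even on this tiny sample, the seek passes 28 rows to the residual filter to produce 9 result rows; with DI demand and SI supply intervals, the examined count grows roughly as DI * SI / 2.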

This plan has quadratic scaling, compared to the linear scaling of the cursor-based solution from last month. You can see the result of a performance test comparing the two solutions in Figure 3. You can clearly see the nicely shaped parabola for the set-based solution.

Figure 3: Performance of Intersections Using CTEs Solution Versus Cursor-Based Solution

Interval Intersections Using Temporary Tables

You can somewhat improve things by replacing the CTEs for the demand and supply intervals with temporary tables and, to avoid the index spool, by creating your own index on the temp table holding the supply intervals, with the end delimiter as the key. Here's the complete solution's code:

-- Drop temp tables if exist
DROP TABLE IF EXISTS #MyPairings, #Demand, #Supply;

WITH D0 AS
-- D0 computes running demand as EndDemand
(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndDemand
  FROM dbo.Auctions
  WHERE Code = 'D'
),
-- D extracts prev EndDemand as StartDemand, expressing start-end demand as an interval
D AS
(
  SELECT ID, Quantity, EndDemand - Quantity AS StartDemand, EndDemand
  FROM D0
)
SELECT ID, Quantity, 
  CAST(ISNULL(StartDemand, 0.0) AS DECIMAL(19, 6)) AS StartDemand,
  CAST(ISNULL(EndDemand, 0.0) AS DECIMAL(19, 6)) AS EndDemand
INTO #Demand
FROM D;
WITH S0 AS
-- S0 computes running supply as EndSupply
(
  SELECT ID, Quantity,
    SUM(Quantity) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS EndSupply
  FROM dbo.Auctions
  WHERE Code = 'S'
),
-- S extracts prev EndSupply as StartSupply, expressing start-end supply as an interval
S AS
(
  SELECT ID, Quantity, EndSupply - Quantity AS StartSupply, EndSupply
  FROM S0
)
SELECT ID, Quantity, 
  CAST(ISNULL(StartSupply, 0.0) AS DECIMAL(19, 6)) AS StartSupply,
  CAST(ISNULL(EndSupply, 0.0) AS DECIMAL(19, 6)) AS EndSupply
INTO #Supply
FROM S;

CREATE UNIQUE CLUSTERED INDEX idx_cl_ES_ID ON #Supply(EndSupply, ID);

-- Outer query identifies trades as the overlapping segments of the intersecting intervals
-- In the intersecting demand and supply intervals the trade quantity is then 
-- LEAST(EndDemand, EndSupply) - GREATEST(StartDemand, StartSupply)
SELECT
  D.ID AS DemandID, S.ID AS SupplyID,
  CASE WHEN EndDemand < EndSupply THEN EndDemand ELSE EndSupply END
    - CASE WHEN StartDemand > StartSupply THEN StartDemand ELSE StartSupply END
      AS TradeQuantity
INTO #MyPairings
FROM #Demand AS D
  INNER JOIN #Supply AS S WITH (FORCESEEK)
    ON D.StartDemand < S.EndSupply AND D.EndDemand > S.StartSupply;

The plans for this solution are shown in Figure 4:

Figure 4: Query Plan for Intersections Using Temp Tables Solution

The first two plans use a combination of batch-mode Index Seek + Sort + Window Aggregate operators to compute the supply and demand intervals and write those to temporary tables. The third plan handles the index creation on the #Supply table with the EndSupply delimiter as the leading key.

The fourth plan represents by far the bulk of the work, with a Nested Loops join operator that matches, to each interval from #Demand, the intersecting intervals from #Supply. Observe that here too, the Index Seek operator relies on the predicate #Supply.EndSupply > #Demand.StartDemand as the seek predicate, and on #Demand.EndDemand > #Supply.StartSupply as the residual predicate. So in terms of complexity/scaling, you get the same quadratic complexity as with the previous solution. You just pay less per row, since you're using your own index instead of the index spool used by the previous plan. You can see the performance of this solution compared to the previous two in Figure 5.

Figure 5: Performance of Intersections Using Temp Tables Solution Compared to the Other Two Solutions

As you can see, the solution with the temp tables performs better than the one with the CTEs, but it still has quadratic scaling and does very poorly compared to the cursor-based solution.

What's Next?

This article covered the first attempt at handling the classic matching supply with demand task using a set-based solution. The idea was to model the data as intervals, match supply with demand entries by identifying intersecting supply and demand intervals, and then compute the trading quantity based on the size of the overlapping segments. Certainly an intriguing idea. The main problem with it is also the classic problem of identifying interval intersection by using two range predicates. Even with the best index in place, you can only support one range predicate with an index seek; the other range predicate has to be handled using a residual predicate. This results in a plan with quadratic complexity.

So what can you do to overcome this obstacle? There are several different ideas. One brilliant idea belongs to Joe Obbish, which you can read about in detail in his blog post. I'll cover other ideas in upcoming articles in the series.

[ Jump to: Original challenge | Solutions: Part 1 | Part 2 | Part 3 ]

The post Matching Supply With Demand — Solutions, Part 1 appeared first on SQLPerformance.com.
