Aaron Bertrand, Author at SQLPerformance.com
https://sqlperformance.com/author/abertrand
SQL Server performance articles curated by SentryOne

10 For 10: My 10 Favorite SQLPerformance Posts Over 10 Years
https://sqlperformance.com/2022/07/sql-performance/10-for-10-favorite-posts
Sun, 24 Jul 2022
On this site’s 10th birthday, Aaron Bertrand looks back on his favorite posts – one from each year.

This is a month of milestones. I was awarded my 25th MVP award on July 5th, which I believe ties me with Tibor for the longest-tenured MVP. SQL Server 2012 went out of all forms of official support about two weeks ago, on July 12th. And this site turns 10 years old today – the first post was on July 24th, 2012. I thought it would be fun to look back and comment on some of my contributions over that decade.

2012

Best approaches for running totals – updated for SQL Server 2012

In the very first post on this site, I compared the performance of long-standing running totals methods with the new windowing functions added in SQL Server 2012. In almost all cases, the new approach worked better – but there’s always a catch, right?

2013

Break large delete operations into chunks

I wrote this 2013 post as a response to seeing many people try to delete millions of rows from very large tables, and face blocking that crippled their applications. Nearly 10 years later I still reference it frequently in comments or answers on Stack Overflow.
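The core pattern from that post can be sketched in a few lines of T-SQL (a minimal sketch; the table name, predicate, and batch size here are placeholders, not the exact code from the article):

```sql
-- Delete in small batches instead of one massive transaction,
-- so each chunk commits quickly and releases its locks.
DECLARE @BatchSize int = 10000;

WHILE 1 = 1
BEGIN
  DELETE TOP (@BatchSize)
    FROM dbo.LargeTable
    WHERE CreationDate < '20200101';

  IF @@ROWCOUNT = 0 BREAK; -- nothing left to delete

  -- Optionally: CHECKPOINT or BACKUP LOG here, and/or a short
  -- WAITFOR DELAY to let other workloads breathe between chunks.
END
```

Each iteration holds locks only for the duration of one small delete, which is what keeps the blocking from crippling other sessions.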

2014

Readable Secondaries on a Budget

When Availability Groups were first introduced, they were only available in Enterprise Edition. Customers who didn’t have unlimited cap-ex funds were clamoring for a way to scale out reads. In this 2014 post, I detailed a way to do this with good old log shipping.

2015

Partitioning on a Budget

In a similar vein to my favorite post from 2014, I described a method to simulate partitioning – another feature that was Enterprise-only at the time – using filtered indexes.

2016

A Big Deal: SQL Server 2016 Service Pack 1

Toward the end of the year, I was able to spread the news that a SQL Server 2016 service pack would make most Enterprise-only features available in all editions. I detailed all the changes and explained why this was such a huge win for non-Enterprise customers.

2017

One way to get an index seek for a leading %wildcard

Leading wildcard searches are notoriously inefficient because they require inspecting every single value. One way to get a seek is to have an indexed computed column using REVERSE. In this post from 2017, I explored the use of a trigram as a similar strategy.
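The gist of the REVERSE approach mentioned above, as a hedged sketch (table and column names are illustrative, not taken from the post): persist the reversed string, index it, and the leading wildcard becomes a trailing one.

```sql
-- WHERE Word LIKE '%ing' can't seek an index on Word.
-- Persisting REVERSE(Word) lets us search for REVERSE('ing') + '%',
-- a trailing wildcard, which is seek-able.
CREATE TABLE dbo.Words
(
  Word         varchar(100) NOT NULL,
  WordReversed AS REVERSE(Word) PERSISTED
);

CREATE INDEX IX_Words_WordReversed ON dbo.Words(WordReversed);

SELECT Word
  FROM dbo.Words
  WHERE WordReversed LIKE REVERSE('ing') + '%';
```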

2018

What to do (or not do) about top wait stats

Usually, wait stats associated with an individual query don’t help you solve the problem, because the wait is actually caused by some other process. This post from 2018 took a look at the top 10 wait types observed by the sum of SQL Sentry customers at the time and, for each one, asked the question, “Can you solve this problem with just the wait type and the query that experienced it?”

2019

Filtered Indexes and Forced Parameterization (redux)

In this 2019 post, I explained a trivial workaround to a frustrating limitation with filtered indexes. This was not my first post complaining about filtered indexes (other posts are linked in the opening paragraph).

2020

Please stop using this UPSERT anti-pattern

In 2020, I talked about an all-too-common pattern: checking if a row exists, if so, update, otherwise, insert. This pattern is a shortcut to deadlocks and, in this post, I explain a slightly better approach (using graphics I’m quite proud of).
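The safer pattern boils down to taking the right locks up front in a single transaction (a simplified sketch with placeholder table and column names; see the post for the full discussion):

```sql
-- Anti-pattern: IF EXISTS(...) UPDATE ELSE INSERT as two statements.
-- Two sessions can both pass the EXISTS check, then deadlock or
-- insert duplicates. Instead, lock the key range on the first touch:
BEGIN TRANSACTION;

UPDATE dbo.t WITH (UPDLOCK, SERIALIZABLE)
  SET val = @val
  WHERE [key] = @key;

IF @@ROWCOUNT = 0
BEGIN
  INSERT dbo.t([key], val) VALUES (@key, @val);
END

COMMIT TRANSACTION;
```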

2021

Refreshing Tables With Less Interruption Using Partition Switching

Last year, I wrote about a way to use partition switching to refresh lookup data with minimal or zero user disruption. I explained the new ABORT_AFTER_WAIT option and demonstrated the tradeoffs involved.

2022 (BONUS!)

Serializing Deletes From Clustered Columnstore Indexes

Well, 2022 isn’t over yet but, so far, my favorite post has been this one about reducing the impact of deleting data from a table with a clustered columnstore index.

T-SQL Can Be Expressive Without Sacrificing Performance
https://sqlperformance.com/2022/06/t-sql-queries/expressive
Thu, 02 Jun 2022
There are often multiple ways to express a query and get the same results, often without any change in performance. Learn about one example.

Even with its warts and limitations, Transact-SQL is a beautiful language, allowing for flexible declarative expression about what you’re asking the database engine to do for you.

Itzik Ben-Gan has shown time after time that you can get the same results in many different ways—look what he demonstrated recently with window ordering. I have also discussed this when dealing with anti-semi-joins, which can be solved using APPLY or EXCEPT or NOT EXISTS or LEFT JOIN. In some of those cases, different queries led to different performance because query semantics changed (either obviously or, at least, according to the engine). But writing a query differently—even one more verbose—can have any of the following impacts, some subjective, some not:

  • More or less intuitive code
  • Zero or negligible performance differences
  • Meaningful performance differences
  • Trade one resource for another (e.g., use more memory but less CPU, or more CPU but less I/O)

And in each of those cases, you can decide which version of the query is most beneficial for you based on your priorities. Maybe you have a memory- or CPU-bound system, or maybe you prefer one syntax over another for subjective reasons, or maybe you think one form will be easier for future maintainers and newcomers.

As an analogy, there are many routes you can take from New York City to Dallas. Some may be faster than others, some may be fewer miles but take longer, some are more fuel-efficient due to average speed limits, some more scenic, and some more toll-friendly. The beauty is that if you and I are independently planning the same trip, we can choose our routes based on our individual priorities. I may not like interstates, or I may prefer to drive more westerly until the sun starts setting, and you may want to see a particular tourist attraction, visit an uncle, or stop in a certain city.

A query is similar. Usually, performance is of utmost importance, but even that isn’t always true. When two or more queries give the same answer and have identical (or “close enough”) performance, the choice can come down to other factors, as mentioned above. I recently answered a question on Stack Overflow where the user was asking how to filter a grouping where an aggregate condition was true.

For some context, Stack Overflow is a place where I tend to cater to people with a wide variety of experience with T-SQL or queries in general. Sometimes solutions are necessarily complex, or they need less-used syntax that is not universally understood, so it can take more explanation and a better breakdown of the code for the user to benefit. One of the ways I like to help with this breakdown is to isolate different aspects of the query in a derived table or, more commonly, a common table expression (CTE) because it can be a lightbulb moment to think about that part independently. With that in mind, let’s look at a boiled-down version of the question:

    Given this table dbo.tablename, I want to return a single row for each name and division combination, but only where there is both a row with source = 'comp' and a row where source = 'manual':

    name division source
    host1 abc comp
    host2 xy manual
    host3 zyx comp
    host3 zyx manual
    host2 xy manual

    I’ve highlighted the only rows they want to consider for aggregation, with the desired output being a single row:

    name division
    host3 zyx

In T-SQL, setting up this sample would look like this:

  CREATE TABLE dbo.tablename
  (
    name     varchar(128),
    division varchar(128),
    source   varchar(128)
  );

  INSERT dbo.tablename(name, division, source) VALUES
  ('host1',  'abc',  'comp'),
  ('host2',  'xy',   'manual'),
  ('host3',  'zyx',  'comp'),
  ('host3',  'zyx',  'manual'),
  ('host2',  'xy',   'manual');

To get the desired result, the first (and later accepted) answer used this syntax, which is perfectly adequate:

  SELECT name, division  
  FROM dbo.tablename
  WHERE source in ('comp', 'manual')
  GROUP BY name, division
  HAVING COUNT(DISTINCT source) > 1; -- or = 2

Since I know users commonly have difficulty with the HAVING clause, I offered a different approach, one that breaks the logic down, as I mentioned earlier.

Another way to think about it is to calculate the counts inside a CTE and then filter:

  ;WITH cte AS
  (
    SELECT name, division, SourceCount = COUNT(DISTINCT source)
    FROM dbo.tablename
    WHERE source IN ('comp', 'manual')
    GROUP BY name, division
  )
  SELECT name, division FROM cte 
    WHERE SourceCount = 2;

And yes, my CTEs always start with ;WITH (see why).

Or, if you don’t like CTEs:

  SELECT name, division FROM
  (
    SELECT name, division, SourceCount = COUNT(DISTINCT source)
    FROM dbo.tablename
    WHERE source IN ('comp', 'manual')
    GROUP BY name, division
  ) AS q WHERE SourceCount = 2;

Yes, it’s more typing, but the intention is to think about the counting and grouping separate from the filtering, like how in an INNER JOIN you can logically think about the joining conditions (in the ON clause) separate from the filter conditions (in the WHERE clause).

As for performance, they all perform the same because SQL Server is smart and can generate the same plan. I inserted 50,000 rows into the table and ran all three queries; each had a duration of 30 – 33 ms, a memory grant of 1,584 KB, and an estimated subtree cost of 0.5972860. Here is the plan shape in all three cases:

Identical plan for three different queries

The plan would look different if the table had a clustered index. Let’s instead try an index designed to support this query specifically:

  CREATE INDEX testing ON dbo.tablename (source) INCLUDE (name, division);

Now the time is down to 26 – 28ms, the memory grant is still 1.5MB, and the estimated subtree cost has dropped by a whopping amount, to 0.5769890. Here is the new plan (again, identical for all three queries):

Plan for index-supported version

This is not a complex example but illustrates that we can often find various ways to get to a final destination using the most expressive format we like. Variations in syntax that are identical in results and underlying meaning can help give someone that “lightbulb” moment and provide a more natural tendency to test “identical” variations for cases where the performance might be different.

String Aggregation Over the Years in SQL Server
https://sqlperformance.com/2022/02/t-sql-queries/string-aggregation-over-the-years-in-sql-server
Thu, 17 Feb 2022
Aaron Bertrand talks about progress in string aggregation—both in the functionality offered by SQL Server and the quality of his own code samples.

Since SQL Server 2005, the trick of using FOR XML PATH to denormalize strings and combine them into a single (usually comma-separated) list has been very popular. In SQL Server 2017, however, STRING_AGG() finally answered long-standing and widespread pleas from the community to simulate GROUP_CONCAT() and similar functionality found in other platforms. I recently started modifying many of my Stack Overflow answers using the old method, both to improve the existing code and to add an additional example better suited for modern versions.

I was a little appalled at what I found.

On more than one occasion, I had to double-check the code was even mine.

A Quick Example

Let’s look at a simple demonstration of the problem. Someone has a table like this:

CREATE TABLE dbo.FavoriteBands
(
  UserID   int,
  BandName nvarchar(255)
);

INSERT dbo.FavoriteBands
(
  UserID, 
  BandName
) 
VALUES
  (1, N'Pink Floyd'), (1, N'New Order'), (1, N'The Hip'),
  (2, N'Zamfir'),     (2, N'ABBA');

On the page showing each user’s favorite bands, they want the output to look like this:

UserID   Bands
------   ---------------------------------------
1        Pink Floyd, New Order, The Hip
2        Zamfir, ABBA

In the SQL Server 2005 days, I would have offered this solution:

SELECT DISTINCT UserID, Bands = 
      (SELECT BandName + ', '
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH('')) 
FROM dbo.FavoriteBands AS fb;

But when I look back on this code now, I see many problems I can’t resist fixing.

STUFF

The most fatal flaw in the code above is it leaves a trailing comma:

UserID   Bands
------   ---------------------------------------
1        Pink Floyd, New Order, The Hip, 
2        Zamfir, ABBA, 

To solve this, I often see people wrap the query inside another and then surround the Bands output with LEFT(Bands, LEN(Bands)-1). But this is needless additional computation; instead, we can move the comma to the beginning of the string and remove the first one or two characters using STUFF. Then, we don’t have to calculate the length of the string because it’s irrelevant.

SELECT DISTINCT UserID, Bands = STUFF(
--------------------------------^^^^^^
      (SELECT ', ' + BandName
--------------^^^^^^
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH('')), 1, 2, '')
--------------------------^^^^^^^^^^^
FROM dbo.FavoriteBands AS fb;

You can adjust this further if you’re using a longer or conditional delimiter.

DISTINCT

The next problem is the use of DISTINCT. The way the code works is the derived table generates a comma-separated list for each UserID value, then the duplicates are removed. We can see this by looking at the plan and seeing the XML-related operator executes seven times, even though only three rows are ultimately returned:

Figure 1: Plan showing filter after aggregation

If we change the code to use GROUP BY instead of DISTINCT:

SELECT /* DISTINCT */ UserID, Bands = STUFF(
      (SELECT ', ' + BandName
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH('')), 1, 2, '')
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;
--^^^^^^^^^^^^^^^

It’s a subtle difference, and it doesn’t change the results, but we can see the plan improves. Basically, the XML operations are deferred until after the duplicates are removed:

Figure 2: Plan showing filter before aggregation

At this scale, the difference is immaterial. But what if we add some more data? On my system, this adds a little over 11,000 rows:

INSERT dbo.FavoriteBands(UserID, BandName)
  SELECT [object_id], name FROM sys.all_columns;

If we run the two queries again, the differences in duration and CPU are immediately obvious:

Figure 3: Runtime results comparing DISTINCT and GROUP BY

But other side effects are also obvious in the plans. In the case of DISTINCT, the UDX once again executes for every row in the table, there’s an excessively eager index spool, there’s a distinct sort (always a red flag for me), and the query has a high memory grant, which can put a serious dent in concurrency:

Figure 4: DISTINCT plan at scale

Meanwhile, in the GROUP BY query, the UDX only executes once for each unique UserID, the eager spool reads a much lower number of rows, there’s no distinct sort operator (it’s been replaced by a hash match), and the memory grant is tiny in comparison:

Figure 5: GROUP BY plan at scale

It takes a while to go back and fix old code like this, but for some time now, I’ve been very regimented about always using GROUP BY instead of DISTINCT.

N Prefix

Too many old code samples I came across assumed no Unicode characters would ever be in use, or at least the sample data didn’t suggest the possibility. I’d offer my solution as above, and then the user would come back and say, “but on one row I have 'просто красный', and it comes back as '?????? ???????'!” I often remind people they always need to prefix potential Unicode string literals with the N prefix unless they absolutely know they’ll only ever be dealing with varchar strings or integers. I started being very explicit and probably even overcautious about it:

SELECT UserID, Bands = STUFF(
      (SELECT N', ' + BandName
--------------^
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH(N'')), 1, 2, N'')
----------------------^ -----------^
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;

XML Entitization

Another “what if?” scenario not always present in a user’s sample data is XML characters. For example, what if my favorite band is named “Bob & Sheila <> Strawberries”? The output with the above query is made XML-safe, which isn’t what we always want (e.g., Bob &amp; Sheila &lt;&gt; Strawberries). Google searches at the time would suggest “you need to add TYPE,” and I remember trying something like this:

SELECT UserID, Bands = STUFF(
      (SELECT N', ' + BandName
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH(N''), TYPE), 1, 2, N'')
--------------------------^^^^^^
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;

Unfortunately, the output data type from the subquery in this case is xml. This leads to the following error message:

Msg 8116, Level 16, State 1
Argument data type xml is invalid for argument 1 of stuff function.

You need to tell SQL Server you want to extract the resulting value as a string by indicating the data type and that you want the first element. Back then, I’d add this as the following:

SELECT UserID, Bands = STUFF(
      (SELECT N', ' + BandName
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH(N''), TYPE).value(N'.', N'nvarchar(max)'), 
--------------------------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           1, 2, N'')
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;

This would return the string without XML entitization. But is it the most efficient? Last year, Charlieface reminded me Mister Magoo performed some extensive testing and found ./text()[1] was faster than the other (shorter) approaches like . and .[1]. (I originally heard this from a comment Mikael Eriksson left for me here.) I once again adjusted my code to look like this:

SELECT UserID, Bands = STUFF(
      (SELECT N', ' + BandName
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         FOR XML PATH(N''), TYPE).value(N'./text()[1]', N'nvarchar(max)'), 
------------------------------------------^^^^^^^^^^^
           1, 2, N'')
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;

You might observe extracting the value in this way leads to a slightly more complex plan (you wouldn’t know it just from looking at duration, which stays pretty constant throughout the above changes):

Figure 6: Plan with ./text()[1]

The warning on the root SELECT operator comes from the explicit conversion to nvarchar(max).

Order

Occasionally, users would express ordering is important. Often, this is simply ordering by the column you’re appending—but sometimes, it can be added somewhere else. People tend to believe if they saw a specific order come out of SQL Server once, it’s the order they’ll always see, but there’s no reliability here. Order is never guaranteed unless you say so. In this case, let’s say we want to order by BandName alphabetically. We can add this instruction inside the subquery:

SELECT UserID, Bands = STUFF(
      (SELECT N', ' + BandName
         FROM dbo.FavoriteBands
         WHERE UserID = fb.UserID
         ORDER BY BandName
---------^^^^^^^^^^^^^^^^^
         FOR XML PATH(N''),
          TYPE).value(N'./text()[1]', N'nvarchar(max)'), 1, 2, N'')
  FROM dbo.FavoriteBands AS fb
  GROUP BY UserID;

Note this may add a little execution time because of the additional sort operator, depending on whether there’s a supporting index.

STRING_AGG()

As I update my old answers, which should still work on the version that was relevant at the time of the question, the final snippet above (with or without the ORDER BY) is the form you’ll likely see. But you might see an additional update for the more modern form, too.

STRING_AGG() is arguably one of the best features added in SQL Server 2017. It’s both simpler and far more efficient than any of the above approaches, leading to tidy, well-performing queries like this:

SELECT UserID, Bands = STRING_AGG(BandName, N', ')
  FROM dbo.FavoriteBands
  GROUP BY UserID;

This isn’t a joke; that’s it. Here’s the plan—most importantly, there’s only a single scan against the table:

Figure 7: STRING_AGG() plan

If you want ordering, STRING_AGG() supports this, too (as long as you are in compatibility level 110 or greater, as Martin Smith points out here):

SELECT UserID, Bands = STRING_AGG(BandName, N', ')
    WITHIN GROUP (ORDER BY BandName)
----^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  FROM dbo.FavoriteBands
  GROUP BY UserID;

The plan looks the same as the one without sorting, but the query is a smidge slower in my tests. It’s still way faster than any of the FOR XML PATH variations.

Indexes

A heap is hardly fair. If you have even a nonclustered index the query can use, the plan looks even better. For example:

CREATE INDEX ix_FavoriteBands ON dbo.FavoriteBands(UserID, BandName);

Here’s the plan for the same ordered query using STRING_AGG()—note the lack of a sort operator, since the scan can be ordered:

Figure 8: STRING_AGG() plan with a supporting index

This shaves some time off, too—but to be fair, this index helps the FOR XML PATH variations as well. Here’s the new plan for the ordered version of that query:

Figure 9: FOR XML PATH plan with a supporting index

The plan is a little friendlier than before, including a seek instead of a scan in one spot, but this approach is still significantly slower than STRING_AGG().

A Caveat

There’s a catch with STRING_AGG(): if the resulting string is more than 8,000 bytes, you’ll receive this error message:

Msg 9829, Level 16, State 1
STRING_AGG aggregation result exceeded the limit of 8000 bytes. Use LOB types to avoid result truncation.

To avoid this issue, you can inject an explicit conversion:

SELECT UserID, 
       Bands = STRING_AGG(CONVERT(nvarchar(max), BandName), N', ')
--------------------------^^^^^^^^^^^^^^^^^^^^^^
  FROM dbo.FavoriteBands
  GROUP BY UserID;

This adds a compute scalar operation to the plan—and an unsurprising CONVERT warning on the root SELECT operator—but otherwise, it has little impact on performance.

Conclusion

If you’re on SQL Server 2017+ and you have any FOR XML PATH string aggregation in your codebase, I highly recommend switching over to the new approach. I performed some more thorough performance testing back during the SQL Server 2017 public preview (here and here) that you may want to revisit.

A common objection I’ve heard is that people are on SQL Server 2017 or greater but still on an older compatibility level. The apprehension seems to stem from STRING_SPLIT(), which is invalid on compatibility levels lower than 130, so they assume STRING_AGG() behaves the same way; in fact, it is a bit more lenient. It is only a problem if you are using WITHIN GROUP on a compatibility level lower than 110. So improve away!
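If you’re unsure where a given database stands, the compatibility level is one catalog query away (nothing specific to this post; just the standard sys.databases view):

```sql
-- WITHIN GROUP (ORDER BY ...) needs compatibility level 110 or higher;
-- plain STRING_AGG() works at any level once you're on SQL Server 2017+.
SELECT name, compatibility_level
  FROM sys.databases
  WHERE name = DB_NAME();
```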

Serializing Deletes From Clustered Columnstore Indexes
https://sqlperformance.com/2022/01/columnstore/serializing-deletes-clustered-columnstore
Wed, 05 Jan 2022
How can we serialize deletes from clustered columnstore indexes without drastically changing the application?

At Stack Overflow, we have some tables using clustered columnstore indexes, and these work great for the majority of our workload. But we recently came across a situation where “perfect storms” — multiple processes all trying to delete from the same CCI — would overwhelm the CPU as they all went widely parallel and fought to complete their operation. Here's what it looked like in SolarWinds® SQL Sentry®:

CPU and Wait Charts from SolarWinds SQL Sentry

And here are the interesting waits associated with these queries:

Interesting Waits from SolarWinds SQL Sentry

The queries competing were all of this form:

DELETE dbo.LargeColumnstoreTable WHERE col1 = @p1 AND col2 = @p2;

The plan looked like this:

Execution plan for parallel delete from clustered columnstore

And the warning on the scan advised us of some pretty extreme residual I/O:

Warning about residual I/O

The table has 1.9 billion rows but is only 32GB (thank you, columnar storage!). Still, these single-row deletes would take 10 – 15 seconds each, with most of this time being spent on SOS_SCHEDULER_YIELD.

Thankfully, since in this scenario the delete operation could be asynchronous, we were able to solve the problem with two changes (though I’m grossly oversimplifying here):

  • We limited MAXDOP at the database level so these deletes can't go quite so parallel
  • We improved serialization of the processes coming from the application (basically, we queued deletes through a single dispatcher)
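The first change needs no application involvement at all; on SQL Server 2016 and up it can be a single database-scoped setting (the value 4 here is purely illustrative):

```sql
-- Cap parallelism for queries in this database; individual queries
-- can still override it with an OPTION (MAXDOP n) hint.
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;
```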

As a DBA, we can easily control MAXDOP, unless it’s overridden at the query level (another rabbit hole for another day). We can't necessarily control the application to this extent, especially if it’s distributed or not ours. How can we serialize the writes in this case without drastically changing the application logic?

A Mock Setup

I'm not going to try to create a two billion-row table locally — never mind the exact table — but we can approximate something on a smaller scale and try to reproduce the same issue.

Let's pretend this is the SuggestedEdits table (in reality, it's not). But it's an easy example to use because we can pull the schema from the Stack Exchange Data Explorer. Using this as a base, we can create an equivalent table (with a few minor changes to make it easier to populate) and throw a clustered columnstore index on it:

CREATE TABLE dbo.FakeSuggestedEdits
(
  Id            int IDENTITY(1,1),
  PostId        int NOT NULL DEFAULT CONVERT(int, ABS(CHECKSUM(NEWID()))) % 200,
  CreationDate  datetime2 NOT NULL DEFAULT sysdatetime(),
  ApprovalDate  datetime2 NOT NULL DEFAULT sysdatetime(),
  RejectionDate datetime2 NULL,
  OwnerUserId   int NOT NULL DEFAULT 7,
  Comment       nvarchar (800)   NOT NULL DEFAULT NEWID(),
  Text          nvarchar (max)   NOT NULL DEFAULT NEWID(),
  Title         nvarchar (250)   NOT NULL DEFAULT NEWID(),
  Tags          nvarchar (250)   NOT NULL DEFAULT NEWID(),
  RevisionGUID  uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
  INDEX CCI_FSE CLUSTERED COLUMNSTORE
);

To populate it with over 13 million rows, we can cross join sys.all_objects and sys.columns five times (on my system, this produces 2,680,350 rows each time, but YMMV):

-- 2680350 * 5 ~ 3 minutes

INSERT dbo.FakeSuggestedEdits(CreationDate)
  SELECT modify_date
  FROM sys.all_objects AS o
  CROSS JOIN sys.columns AS c;
GO 5

Then, we can check the space:

EXEC sys.sp_spaceused @objname = N'dbo.FakeSuggestedEdits';

It's only 1.3GB, but this should be sufficient:

Results of sp_spaceused

Mimicking Our Clustered Columnstore Delete

Here’s a simple query roughly matching what our application was doing to the table:

DECLARE @p1 int = ABS(CHECKSUM(NEWID())) % 10000000, @p2 int = 7;
DELETE dbo.FakeSuggestedEdits WHERE Id = @p1 AND OwnerUserId = @p2;

The plan isn't quite a perfect match, though:

Default serial plan on my laptop

To get it to go parallel and produce similar contention on my meager laptop, I had to coerce the optimizer a little with this hint:

OPTION (QUERYTRACEON 8649);

Now, it looks right:

Coerced parallel plan on my laptop

Reproducing the Problem

Then, we can create a surge of concurrent delete activity, using sqlstresscmd to delete 1,000 random rows on 16 and then 32 threads:

sqlstresscmd -s docs/ColumnStore.json -t 16
sqlstresscmd -s docs/ColumnStore.json -t 32

We can observe the strain this puts on CPU:

Initial CPU load during batch delete activity

The strain on CPU lasts for the duration of the batches, about 64 and 130 seconds, respectively:

Runtime stats for batch delete jobs

Note: The output from SQLQueryStress is sometimes a little off on iterations, but I’ve confirmed the work you ask it to do gets done precisely.

A Potential Workaround: A Delete Queue

Initially, I thought about introducing a queue table in the database, which we could use to offload delete activity:

CREATE TABLE dbo.SuggestedEditDeleteQueue
(
  QueueID       int IDENTITY(1,1) PRIMARY KEY,
  EnqueuedDate  datetime2 NOT NULL DEFAULT sysdatetime(),
  ProcessedDate datetime2 NULL,
  Id            int NOT NULL,
  OwnerUserId   int NOT NULL
);

All we need is an INSTEAD OF trigger to intercept these rogue deletes coming from the application and place them on the queue for background processing. Unfortunately, you can't create a trigger on a table with a clustered columnstore index:

Msg 35358, Level 16, State 1
CREATE TRIGGER on table 'dbo.FakeSuggestedEdits' failed because you cannot create a trigger on a table with a clustered columnstore index. Consider enforcing the logic of the trigger in some other way, or if you must use a trigger, use a heap or B-tree index instead.

We'll need a minimal change to the application code, so that it calls a stored procedure to handle the delete:

CREATE PROCEDURE dbo.DeleteSuggestedEdit
  @Id          int,
  @OwnerUserId int
AS
BEGIN
  SET NOCOUNT ON;

  DELETE dbo.FakeSuggestedEdits 
    WHERE Id = @Id AND OwnerUserId = @OwnerUserId;
END

This isn't a permanent state; this is just to keep the behavior the same while changing only one thing in the app. Once the app is changed and is successfully calling this stored procedure instead of submitting ad hoc delete queries, the stored procedure can change:

CREATE PROCEDURE dbo.DeleteSuggestedEdit
  @Id          int,
  @OwnerUserId int
AS
BEGIN
  SET NOCOUNT ON;

  INSERT dbo.SuggestedEditDeleteQueue(Id, OwnerUserId)
    SELECT @Id, @OwnerUserId;
END

Testing the Impact of the Queue

Now, if we change SqlQueryStress to call the stored procedure instead:

DECLARE @p1 int = ABS(CHECKSUM(NEWID())) % 10000000, @p2 int = 7;
EXEC dbo.DeleteSuggestedEdit @Id = @p1, @OwnerUserId = @p2;

And submit similar batches (placing 16K or 32K rows on the queue):

DECLARE @p1 int = ABS(CHECKSUM(NEWID())) % 10000000, @p2 int = 7;
EXEC dbo.DeleteSuggestedEdit @Id = @p1, @OwnerUserId = @p2;

The CPU impact is slightly higher:

CPU impact of diverted writes to delete queue

But the workloads finish much more quickly — 16 and 23 seconds, respectively:

Coerced parallel plan on my laptop

This is a significant reduction in the pain the applications will feel as they get into periods of high concurrency.

We Still Have to Perform the Delete, Though

We still have to process those deletes in the background, but we can now introduce batching and have full control over the rate and any delays we want to inject between operations. Here’s the very basic structure of a stored procedure to process the queue (admittedly without fully vested transactional control, error handling, or queue table cleanup):

CREATE PROCEDURE dbo.ProcessSuggestedEditQueue
  @JobSize        int = 10000,
  @BatchSize      int = 100,
  @DelayInSeconds int = 2      -- must be between 1 and 59
AS
BEGIN
  SET NOCOUNT ON;
  
  DECLARE @d TABLE(Id int, OwnerUserId int);
  DECLARE @rc int = 1,
          @jc int = 0, 
          @wf nvarchar(100) = N'WAITFOR DELAY ' + CHAR(39) 
              + '00:00:' + RIGHT('0' + CONVERT(varchar(2), 
                @DelayInSeconds), 2) + CHAR(39);
                
  WHILE @rc > 0 AND @jc < @JobSize
  BEGIN 
    DELETE @d; 

    UPDATE TOP (@BatchSize) q SET ProcessedDate = sysdatetime() 
      OUTPUT inserted.Id, inserted.OwnerUserId INTO @d 
      FROM dbo.SuggestedEditDeleteQueue AS q WITH (UPDLOCK, READPAST) 
       WHERE ProcessedDate IS NULL; 

    SET @rc = @@ROWCOUNT; 
    IF @rc = 0 BREAK; 

    DELETE fse 
      FROM dbo.FakeSuggestedEdits AS fse 
      INNER JOIN @d AS d 
        ON fse.Id = d.Id 
       AND fse.OwnerUserId = d.OwnerUserId; 

    SET @jc += @rc; 
    IF @jc > @JobSize BREAK;
    
    EXEC sys.sp_executesql @wf;
  END
  RAISERROR('Deleted %d rows.', 0, 1, @jc) WITH NOWAIT;
END

Now, deleting rows will take longer — the average for 10,000 rows is 223 seconds, ~100 of which is intentional delay. But no user is waiting, so who cares? The CPU profile is almost zero, and the app can continue adding items to the queue with as much concurrency as it wants, with almost zero conflict with the background job. While processing 10,000 rows, I added another 16K rows to the queue, and it used the same CPU as before — taking only a second longer than when the job wasn't running:

Queuing more deletes while deletes are happening

And the plan now looks like this, with much better estimated / actual rows:

Delete plan during processing

I can see this queue table approach being an effective way to deal with high DML concurrency, but it does require at least a little bit of flexibility with the applications submitting DML — this is one reason I really like having applications call stored procedures, as they give us much more control closer to the data.

Other Options

If you don’t have the ability to change the delete queries coming from the application — or, if you can’t defer the deletes to a background process — you can consider other options to reduce the impact of the deletes:

  • A nonclustered index on the predicate columns to support point lookups (we can do this in isolation without changing the application)
  • Using soft deletes only (still requires changes to the application)
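A minimal sketch of the first option, against the demo table from earlier (the index name here is my own; nonclustered B-tree indexes on clustered columnstore tables are supported in SQL Server 2016+):

  CREATE INDEX IX_Id_OwnerUserId
    ON dbo.FakeSuggestedEdits(Id, OwnerUserId);

The idea is the point lookup on (Id, OwnerUserId) can seek this index instead of scanning the columnstore.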

It’ll be interesting to see if these options offer similar benefits, but I’ll save them for a future post.

The post Serializing Deletes From Clustered Columnstore Indexes appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2022/01/columnstore/serializing-deletes-clustered-columnstore/feed 9
Bad Habits: Avoiding NULL in SQL Server https://sqlperformance.com/2021/10/database-design/bad-habits-avoiding-null https://sqlperformance.com/2021/10/database-design/bad-habits-avoiding-null#comments Thu, 21 Oct 2021 09:00:06 +0000 https://sqlperformance.com/?p=11068 Aaron Bertrand talks about NULLs, the logical issues with avoiding them, and potential performance impacts.

The post Bad Habits: Avoiding NULL in SQL Server appeared first on SQLPerformance.com.

]]>
A long time ago, I answered a question about NULL on Stack Exchange entitled, “Why shouldn’t we allow NULLs?” I have my share of pet peeves and passions, and the fear of NULLs is pretty high up on my list. A colleague recently said to me, after expressing a preference to force an empty string instead of allowing NULL:

"I don't like dealing with nulls in code."

I’m sorry, but that’s not a good reason. How the presentation layer deals with empty strings or NULLs shouldn’t be the driver for your table design and data model. And if you’re allowing a “lack of value” in some column, does it matter to you from a logical standpoint whether the “lack of value” is represented by a zero-length string or a NULL? Or worse, a token value like 0 or -1 for integers, or 1900-01-01 for dates?

Itzik Ben-Gan recently wrote a whole series on NULLs, and I highly recommend going through it all:

But my aim here is a little less complicated than that, after the topic came up in a different Stack Exchange question: “Add an auto now field to an existing table.” There, the user was adding a new column to an existing table, with the intention of auto-populating it with the current date/time. They wondered if they should leave NULLs in that column for all the existing rows or set a default value (like 1900-01-01, presumably, though they weren’t explicit).

It may be easy for someone in the know to filter out old rows based on a token value—after all, how could anyone believe some kind of Bluetooth doodad was manufactured or purchased on 1900-01-01? Well, I’ve seen this in current systems where they use some arbitrary-sounding date in views to act as a magic filter, only presenting rows where the value can be trusted. In fact, in every case I’ve seen so far, the date in the WHERE clause is the date/time when the column (or its default constraint) was added. Which is all fine; it’s maybe not the best way to solve the problem, but it’s a way.
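As an illustration (the view, table, and cutoff date here are all made up), such a magic-filter view might look like:

CREATE VIEW dbo.TrustedDoodads
AS
  SELECT DoodadID, ManufactureDate
    FROM dbo.Doodads
    -- the date/time the column (or its default constraint) was added:
    WHERE ManufactureDate > '20150601';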

If you’re not accessing the table through the view, though, this implication of a known value can still cause both logical and result-related problems. The logical problem is simply that someone interacting with the table has to know 1900-01-01 is a bogus, token value representing “unknown” or “not relevant.” For a real-world example, what was the average release speed, in seconds, for a quarterback who played in the 1970s, before we measured or tracked such a thing? Is 0 a good token value for “unknown”? How about -1? Or 100? Getting back to dates, if a patient without ID gets admitted to the hospital and is unconscious, what should they enter as date of birth? I don’t think 1900-01-01 is a good idea, and it certainly wasn’t a good idea back when that was more likely to be a real birthdate.

Performance Implications of Token Values

From a performance perspective, fake or “token” values like 1900-01-01 or 9999-12-31 can introduce problems. Let’s look at a couple of these with an example based loosely on the recent question mentioned above. We have a Widgets table and, after some warranty returns, we’ve decided to add an EnteredService column where we’ll enter the current date/time for new rows. In one case we’ll leave all the existing rows as NULL, and in the other we’ll update the value to our magical 1900-01-01 date. (We’ll leave any sort of compression out of the conversation for now.)

  CREATE TABLE dbo.Widgets_NULL
  (
    WidgetID     int IDENTITY(1,1) NOT NULL,
    SerialNumber uniqueidentifier NOT NULL DEFAULT NEWID(),
    Description  nvarchar(500),
    CONSTRAINT   PK_WNULL PRIMARY KEY (WidgetID)
  );

  CREATE TABLE dbo.Widgets_Token
  (
    WidgetID     int IDENTITY(1,1) NOT NULL,
    SerialNumber uniqueidentifier NOT NULL DEFAULT NEWID(),
    Description  nvarchar(500),
    CONSTRAINT   PK_WToken PRIMARY KEY (WidgetID)
  );

Now we’ll insert the same 100,000 rows into each table:

  INSERT dbo.Widgets_NULL(Description) 
  OUTPUT inserted.Description INTO dbo.Widgets_Token(Description)
  SELECT TOP (100000) LEFT(OBJECT_DEFINITION(o.object_id), 250)
    FROM master.sys.all_objects AS o 
    CROSS JOIN (SELECT TOP (50) * FROM master.sys.all_objects) AS o2
    WHERE o.[type] IN (N'P',N'FN',N'V')
      AND OBJECT_DEFINITION(o.object_id) IS NOT NULL;

Then we can add the new column and update 10% of the existing values with a distribution of current-ish dates, and the other 90% to our token date only in one of the tables:

  ALTER TABLE dbo.Widgets_NULL  ADD EnteredService datetime;
  ALTER TABLE dbo.Widgets_Token ADD EnteredService datetime;
  GO

  UPDATE dbo.Widgets_NULL  
    SET EnteredService = DATEADD(DAY, WidgetID/250, '20200101') 
    WHERE WidgetID > 90000;

  UPDATE dbo.Widgets_Token 
    SET EnteredService = DATEADD(DAY, WidgetID/250, '20200101') 
    WHERE WidgetID > 90000;

  UPDATE dbo.Widgets_Token 
    SET EnteredService = '19000101'
    WHERE WidgetID <= 90000;

Finally, we can add indexes:

  CREATE INDEX IX_EnteredService ON dbo.Widgets_NULL (EnteredService);
  CREATE INDEX IX_EnteredService ON dbo.Widgets_Token(EnteredService);

Space Used

I always hear “disk space is cheap” when we talk about data type choices, fragmentation, and token values vs. NULL. My concern isn’t so much with the disk space these extra meaningless values take up. It’s more that, when the table is queried, it’s wasting memory. Here we can get a quick idea of how much space our token values consume before and after the column and index are added:

Reserved space of table after adding a column and adding an index. Space almost doubles with token values.
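One quick way to produce that comparison yourself (sp_spaceused reports reserved, data, and index space per table):

  EXEC sys.sp_spaceused @objname = N'dbo.Widgets_NULL';
  EXEC sys.sp_spaceused @objname = N'dbo.Widgets_Token';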

Query Execution

Inevitably, someone is going to make assumptions about the data in the table and query against the EnteredService column as if all the values there are legitimate. For example:

  SELECT COUNT(*) FROM dbo.Widgets_Token  
    WHERE EnteredService <= '20210101';

  SELECT COUNT(*) FROM dbo.Widgets_NULL 
    WHERE EnteredService <= '20210101';

The token values can mess with estimates in some cases but, more importantly, they’re going to produce incorrect (or at least unexpected) results. Here is the execution plan for the query against the table with token values:

Execution plan for the token table; note the high cost.

And here’s the execution plan for the query against the table with NULLs:

Execution plan for the NULL table; wrong estimate, but much lower cost.

The same would happen the other way if the query asked for >= {some date} and 9999-12-31 was used as the magic value representing unknown.
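To sketch that reverse case (hypothetically assuming 9999-12-31 had been used as the token instead of 1900-01-01):

  -- all 90,000 "unknown" rows would satisfy this predicate,
  -- silently inflating the count:
  SELECT COUNT(*) FROM dbo.Widgets_Token
    WHERE EnteredService >= '20210101';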

Again, for the people who happen to know the results are wrong specifically because you’ve used token values, this isn’t an issue. But everyone else who doesn’t know that—including future colleagues, other inheritors and maintainers of the code, and even future you with memory challenges—is probably going to stumble.

Conclusion

The choice to allow NULLs in a column (or to avoid NULLs entirely) shouldn’t be reduced to an ideological or fear-based decision. There are real, tangible downsides to architecting your data model to make sure no value can be NULL, or using meaningless values to represent something that could easily have been not stored at all. I’m not suggesting every column in your model should allow NULLs; just that you not be opposed to the idea of NULLs.

The post Bad Habits: Avoiding NULL in SQL Server appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2021/10/database-design/bad-habits-avoiding-null/feed 1
Refreshing SQL Server Tables With Less Interruption Using Partition Switching https://sqlperformance.com/2021/09/sql-performance/refreshing-tables-partition-switching https://sqlperformance.com/2021/09/sql-performance/refreshing-tables-partition-switching#comments Wed, 29 Sep 2021 09:00:26 +0000 https://sqlperformance.com/?p=11055 Aaron Bertrand talks about using partition switching to load refreshed SQL Server tables without impacting current users.

The post Refreshing SQL Server Tables With Less Interruption Using Partition Switching appeared first on SQLPerformance.com.

]]>
A common requirement in ETL and various reporting scenarios is to quietly load a SQL Server staging table in the background, so users querying the data aren’t impacted by the writes and vice-versa. The trick is how and when you point users at the new, refreshed version of the data.

Simplified Example of a Staging Table: A Farmer’s Market Analogy

So, what is a staging table in SQL? A staging table can be more easily understood using a real-world example: Let’s say you have a table full of vegetables you're selling at the local farmer's market. As your vegetables sell and you bring in new inventory:

  • When you bring a load of new vegetables, it's going to take you 20 minutes to clear off the table and replace the remaining stock with the newer product.
  • You don't want customers to sit there and wait 20 minutes for the switch to happen, since most will get their vegetables elsewhere.

Now, what if you had a second empty table where you load the new vegetables, and while you're doing that, customers can still buy the older vegetables from the first table? (Let's pretend it's not because the older vegetables went bad or are otherwise less desirable.)

Refreshing Tables in SQL Server

There are several methods to reload entire tables while they’re actively being queried; two decades ago, I took unbridled advantage of sp_rename — I’d play a shell game with an empty shadow copy of the table, happily reloading the shadow copy and then only performing the rename inside a transaction.
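For context, a minimal sketch of that rename shell game (the table names here are illustrative):

BEGIN TRANSACTION;
  -- move the live table out of the way, then promote the freshly
  -- loaded shadow copy:
  EXEC sys.sp_rename N'dbo.Vegetables',        N'Vegetables_old';
  EXEC sys.sp_rename N'dbo.Vegetables_shadow', N'Vegetables';
COMMIT TRANSACTION;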

In SQL Server 2005, I started using schemas to hold shadow copies of tables I simply transferred around using the same shell game technique, which I wrote about in these two posts:

The only advantage of transferring objects between schemas over renaming them is that there are no warning messages about renaming an object – which isn’t even a problem, per se, except that the warning messages fill up agent history logs that much faster.

Both approaches still require a schema modification (Sch-M) lock, so they must wait for any existing transactions to release their own locks. Once they acquire their Sch-M lock, they block any subsequent queries requiring schema stability locks (Sch-S)… which is nearly every query. It can rapidly become a blocking chain nightmare, as any new queries needing Sch-S have to get in a queue behind the Sch-M. (And no, you can’t get around this by using RCSI or NOLOCK everywhere, since even those queries still require Sch-S. You can’t acquire Sch-S with a Sch-M in place, as they’re incompatible—Michael J. Swart talks about that here.)
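If you want to watch the pile-up happen, here’s a diagnostic sketch against sys.dm_tran_locks, run from a separate session while the swap is blocked:

  SELECT request_session_id, resource_type,
         request_mode, request_status
    FROM sys.dm_tran_locks
    WHERE resource_database_id = DB_ID()
      AND request_mode IN (N'Sch-S', N'Sch-M');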

Kendra Little really opened my eyes about the dangers with schema transfer in her post, “Staging Data: Locking Danger with ALTER SCHEMA TRANSFER.” There she shows why schema transfer can be worse than rename. She later detailed a third and much less impactful way of swapping tables out, which I now use exclusively: partition switching. This method allows the switch to wait at a lower priority, which isn’t even an option with the rename or schema transfer techniques. Joe Sack went into detail about this enhancement added back in SQL Server 2014: “Exploring Low Priority Lock Wait Options in SQL Server 2014 CTP1.”

SQL Server Partition Switching Example

Let’s look at a basic example, following Kendra’s thorough gist here. First, we’ll create two new databases:

CREATE DATABASE NewWay;
CREATE DATABASE OldWay;
GO

In the new database, we’ll create a table to hold our vegetable inventory, and two copies of the table for our shell game:

USE NewWay;
GO

CREATE TABLE dbo.Vegetables_NewWay
(
  VegetableID int,
  Name        sysname,
  WhenPicked  datetime,
  BackStory   nvarchar(max)
);
GO

-- we need to create two extra copies of the table.

CREATE TABLE dbo.Vegetables_NewWay_prev
(
  VegetableID int,
  Name        sysname,
  WhenPicked  datetime,
  BackStory   nvarchar(max)
);
GO

CREATE TABLE dbo.Vegetables_NewWay_hold
(
  VegetableID int,
  Name        sysname,
  WhenPicked  datetime,
  BackStory   nvarchar(max)
);
GO

We create a procedure that loads the staging copy of the table, then uses a transaction to switch the current copy out.

CREATE PROCEDURE dbo.DoTheVeggieSwap_NewWay
AS
BEGIN
  SET NOCOUNT ON;
  
  TRUNCATE TABLE dbo.Vegetables_NewWay_prev;

  INSERT dbo.Vegetables_NewWay_prev
    SELECT TOP (1000000) s.session_id, o.name, s.last_successful_logon, 
      LEFT(m.definition, 500)
    FROM sys.dm_exec_sessions AS s
    CROSS JOIN model.sys.all_objects AS o
    INNER JOIN model.sys.all_sql_modules AS m
    ON o.[object_id] = m.[object_id];

  -- need to take Sch-M locks here:

  BEGIN TRANSACTION;
    ALTER TABLE dbo.Vegetables_NewWay 
      SWITCH TO dbo.Vegetables_NewWay_hold
      WITH (WAIT_AT_LOW_PRIORITY 
            (MAX_DURATION = 1 MINUTES,
             ABORT_AFTER_WAIT = BLOCKERS));

    ALTER TABLE dbo.Vegetables_NewWay_prev
      SWITCH TO dbo.Vegetables_NewWay;
  COMMIT TRANSACTION;

  -- and now users will query the new data in dbo
  -- can switch the old copy back and truncate it 
  -- without interfering with other queries

  ALTER TABLE dbo.Vegetables_NewWay_hold
    SWITCH TO dbo.Vegetables_NewWay_prev;

  TRUNCATE TABLE dbo.Vegetables_NewWay_prev;
END
GO

The beauty of WAIT_AT_LOW_PRIORITY is you can completely control the behavior with the ABORT_AFTER_WAIT option:

ABORT_AFTER_WAIT setting: Description / symptoms

SELF: The switch will give up after n minutes. For the session attempting to perform the switch, this will surface as the error message:

Lock request time out period exceeded.

BLOCKERS: The switch will wait up to n minutes, then force itself to the front of the line by killing all the blockers ahead of it. Sessions trying to interact with the table that get bumped by the switch operation will see some combination of these error messages:

Your session has been disconnected because of a high priority DDL operation.

Cannot continue the execution because the session is in the kill state.

A severe error occurred on the current command. The results, if any, should be discarded.

NONE: The switch will happily wait until it gets its turn, regardless of MAX_DURATION. This is the same behavior you’d get with rename, schema transfer, or partition switch without WAIT_AT_LOW_PRIORITY.

The BLOCKERS option isn’t the friendliest way to handle things, since you’re already saying it’s okay through this staging/switch operation for users to see data that’s a little out of date. I’d likely prefer to use SELF and have the operation try again in cases where it couldn’t get the required locks in the allotted time. I’d keep track of how often it fails, though, especially consecutive failures, because you want to make sure the data doesn’t ever get too stale.
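A sketch of what that retry logic might look like (the retry count, delay, and error handling here are my own choices; error 1222 is the lock timeout SELF surfaces):

DECLARE @tries int = 0;

WHILE @tries < 3
BEGIN
  BEGIN TRY
    BEGIN TRANSACTION;
      ALTER TABLE dbo.Vegetables_NewWay
        SWITCH TO dbo.Vegetables_NewWay_hold
        WITH (WAIT_AT_LOW_PRIORITY
              (MAX_DURATION = 1 MINUTES,
               ABORT_AFTER_WAIT = SELF));

      ALTER TABLE dbo.Vegetables_NewWay_prev
        SWITCH TO dbo.Vegetables_NewWay;
    COMMIT TRANSACTION;
    BREAK; -- success
  END TRY
  BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    IF ERROR_NUMBER() <> 1222 THROW; -- only retry on lock timeout
    SET @tries += 1;
    WAITFOR DELAY '00:00:05';
    -- log consecutive failures here so stale data doesn't go unnoticed
  END CATCH
END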

Compared to the Old Way of Switching Between Schemas

Here’s how I would have handled the switching before:

USE OldWay;
GO

-- create two schemas and two copies of the table

CREATE SCHEMA prev AUTHORIZATION dbo;
GO

CREATE SCHEMA hold AUTHORIZATION dbo;
GO

CREATE TABLE dbo.Vegetables_OldWay
(
  VegetableID int,
  Name sysname,
  WhenPicked datetime,
  BackStory nvarchar(max)
);
GO

CREATE TABLE prev.Vegetables_OldWay
(
  VegetableID int,
  Name sysname,
  WhenPicked datetime,
  BackStory nvarchar(max)
);
GO

CREATE PROCEDURE dbo.DoTheVeggieSwap_OldWay
AS
BEGIN
  SET NOCOUNT ON;

  TRUNCATE TABLE prev.Vegetables_OldWay;

  INSERT prev.Vegetables_OldWay
    SELECT TOP (1000000) s.session_id, o.name, s.last_successful_logon, 
      LEFT(m.definition, 500)
    FROM sys.dm_exec_sessions AS s
    CROSS JOIN model.sys.all_objects AS o
    INNER JOIN model.sys.all_sql_modules AS m
    ON o.[object_id] = m.[object_id];

  -- need to take Sch-M locks here:
  BEGIN TRANSACTION;
    ALTER SCHEMA hold TRANSFER dbo.Vegetables_OldWay;
    ALTER SCHEMA dbo  TRANSFER prev.Vegetables_OldWay;
  COMMIT TRANSACTION;

  -- and now users will query the new data in dbo
  -- can transfer the old copy back and truncate it without 
  -- interfering with other queries:
  
  ALTER SCHEMA prev TRANSFER hold.Vegetables_OldWay;
  TRUNCATE TABLE prev.Vegetables_OldWay;
END
GO

I ran concurrency tests by using two windows of Erik Ejlskov Jensen's SQLQueryStress: one to repeat a call to the procedure every minute, and the other to run 16 threads like this, thousands of times:

BEGIN TRANSACTION;

UPDATE TOP (1) dbo.Vegetables_NewWay SET name += 'x';

SELECT TOP (10) name FROM dbo.Vegetables_NewWay ORDER BY NEWID();

WAITFOR DELAY '00:00:02';

COMMIT TRANSACTION;

-- (substitute dbo.Vegetables_OldWay when testing the schema transfer procedure)

You can look at the output from SQLQueryStress, or sys.dm_exec_query_stats, or Query Store, and you’ll see something along the lines of the following results (but I highly recommend using a quality SQL Server performance monitoring tool if you’re serious about proactively optimizing database environments):

Duration and Error Rates

                                 Schema Transfer   ABORT_AFTER_WAIT: SELF   ABORT_AFTER_WAIT: BLOCKERS
Avg Duration – Transfer/Switch   96.4 seconds      68.4 seconds             20.8 seconds
Avg Duration – DML               18.7 seconds      2.7 seconds              2.9 seconds
Exceptions – Transfer/Switch     0                 0.5/minute               0
Exceptions – DML                 0                 0                        25.5/minute

Note the durations and exception counts will be highly dependent on your server specs and what else is going on within your environment. Also note that, while there were no exceptions for the schema transfer tests when using SQLQueryStress, you might hit stricter timeouts depending on the consuming application. And schema transfer was so much slower on average because the blocking piled up much more aggressively. Nobody ever wants exceptions, but when there’s a trade-off like this, you might prefer a few exceptions here and there (depending on the frequency of the refresh operation) over everyone waiting longer all the time.

Partition Switching vs. Rename/Schema Transfer to Refresh SQL Server Tables

Partition switching allows you to pick which part of your process bears the cost of concurrency. You can give preference to the switching process, so the data is more reliably fresh, but this means some of your queries will fail. Conversely, you can prioritize the queries, at the cost of a slower refresh process (and the occasional failure there). The main thrust is SQL Server partition switching is a superior method to refresh SQL Server tables compared to the previous rename/schema transfer techniques on almost all points, and you can use more robust retry logic or experiment with duration tolerances to land at the sweet spot for your workload.

The post Refreshing SQL Server Tables With Less Interruption Using Partition Switching appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2021/09/sql-performance/refreshing-tables-partition-switching/feed 3
Split strings the right way – or the next best way https://sqlperformance.com/2021/09/t-sql-queries/split-strings https://sqlperformance.com/2021/09/t-sql-queries/split-strings#comments Thu, 02 Sep 2021 13:06:59 +0000 http://www.sqlperformance.com/app/?p=800 Aaron Bertrand (@AaronBertrand) explores some common approaches people use to split strings, complete with performance analysis.

The post Split strings the right way – or the next best way appeared first on SQLPerformance.com.

]]>
UPDATE: September 2nd, 2021 (Originally published July 26th, 2012.)

A lot of things change over the course of a few major versions of our favorite database platform. SQL Server 2016 brought us STRING_SPLIT, a native function that eliminates the need for many of the custom solutions we’ve needed before. It’s fast, too, but it’s not perfect. For example, it only supports a single-character delimiter, and it doesn’t return anything to indicate the order of the input elements. I’ve written several articles about this function (and STRING_AGG, which arrived in SQL Server 2017) since this post was written:

I’m going to leave the below content here for posterity and historical relevance, and also because some of the testing methodology is relevant to other problems aside from splitting strings, but please see some of the above references for information about how you should be splitting strings in modern, supported versions of SQL Server – as well as this post, which explains why splitting strings maybe isn’t a problem you want the database to solve in the first place, new function or not.
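For quick reference, basic usage of the native function looks like this (the optional enable_ordinal argument requires SQL Server 2022 or Azure SQL Database):

SELECT value FROM STRING_SPLIT(N'Patriots,Red Sox,Bruins', N',');

-- SQL Server 2022+: the ordinal column reflects input order
SELECT value, ordinal
  FROM STRING_SPLIT(N'Patriots,Red Sox,Bruins', N',', 1);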


 

I know many people are bored of the "split strings" problem, but it still seems to come up almost daily on forum and Q & A sites like Stack Overflow. This is the problem where people want to pass in a string like this:

EXEC dbo.UpdateProfile @UserID = 1, @FavoriteTeams = N'Patriots,Red Sox,Bruins';

Inside the procedure, they want to do something like this:

INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, TeamID
    FROM dbo.Teams WHERE TeamName IN (@FavoriteTeams);

This doesn't work because @FavoriteTeams is a single string, and the above translates to:

INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, TeamID 
    FROM dbo.Teams WHERE TeamName IN (N'Patriots,Red Sox,Bruins');

SQL Server is therefore going to try to find a team named Patriots,Red Sox,Bruins, and I'm guessing there is no such team. What they really want here is the equivalent of:

INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, TeamID
    FROM dbo.Teams WHERE TeamName IN (N'Patriots', N'Red Sox', N'Bruins');

But since there is no array type in SQL Server, this is not how the variable is interpreted at all – it's still a simple, single string that happens to contain some commas. Questionable schema design aside, in this case the comma-separated list needs to be "split" into individual values – and this is the question that frequently spurs a lot of "new" debate and commentary about the best solution to achieve just that.

The answer seems to be, almost invariably, that you should use CLR. If you can't use CLR – and I know there are many of you out there who can't, due to corporate policy, the pointy-haired boss, or stubbornness – then you use one of the many workarounds that exist. And many workarounds exist.

But which one should you use?

I'm going to compare the performance of a few solutions – and focus on the question everyone always asks: "Which is fastest?" I'm not going to belabor the discussion around *all* of the potential methods, because several have already been eliminated due to the fact that they simply don't scale. And I may re-visit this in the future to examine the impact on other metrics, but for now I'm just going to focus on duration. Here are the contenders I am going to compare (using SQL Server 2012, 11.00.2316, on a Windows 7 VM with 4 CPUs and 8 GB of RAM):

CLR

If you wish to use CLR, you should definitely borrow code from fellow MVP Adam Machanic before thinking about writing your own (I've blogged before about re-inventing the wheel, and it also applies to free code snippets like this). He spent a lot of time fine-tuning this CLR function to efficiently parse a string. If you are currently using a CLR function and this is not it, I strongly recommend you deploy it and compare – I tested it against a much simpler, VB-based CLR routine that was functionally equivalent, but the VB approach performed about three times worse than Adam's.

So I took Adam's function, compiled the code to a DLL (using csc), and deployed just that file to the server. Then I added the following assembly and function to my database:

CREATE ASSEMBLY CLRUtilities FROM 'c:\DLLs\CLRUtilities.dll' 
  WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.SplitStrings_CLR
(
   @List      NVARCHAR(MAX),
   @Delimiter NVARCHAR(255)
)
RETURNS TABLE ( Item NVARCHAR(4000) )
EXTERNAL NAME CLRUtilities.UserDefinedFunctions.SplitString_Multi;
GO

XML

This is the typical function I use for one-off scenarios where I know the input is "safe," but is not one I recommend for production environments (more on that below).

CREATE FUNCTION dbo.SplitStrings_XML
(
   @List       NVARCHAR(MAX),
   @Delimiter  NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
   RETURN 
   (  
      SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')
      FROM 
      ( 
        SELECT x = CONVERT(XML, '<i>' 
          + REPLACE(@List, @Delimiter, '</i><i>') 
          + '</i>').query('.')
      ) AS a CROSS APPLY x.nodes('i') AS y(i)
   );
GO

A very strong caveat has to ride along with the XML approach: it can only be used if you can guarantee that your input string does not contain any illegal XML characters. One name with <, > or & and the function will blow up. So regardless of the performance, if you're going to use this approach, be aware of the limitations – it should not be considered a viable option for a generic string splitter. I'm including it in this round-up because you may have a case where you can trust the input – for example it is possible to use for comma-separated lists of integers or GUIDs.
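As a quick illustration of that limitation (the input string is contrived), passing a value containing an ampersand raises an XML parsing error instead of returning rows:

SELECT Item FROM dbo.SplitStrings_XML(N'Steak & Cheese,BLT', N',');
-- fails with an XML parsing error (illegal character), rather than splitting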

Numbers table

This solution uses a Numbers table, which you must build and populate yourself. (We've been requesting a built-in version for ages.) The Numbers table should contain enough rows to exceed the length of the longest string you'll be splitting. In this case we'll use 1,000,000 rows:

SET NOCOUNT ON;

DECLARE @UpperLimit INT = 1000000;

WITH n AS
(
    SELECT
        x = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM       sys.all_objects AS s1
    CROSS JOIN sys.all_objects AS s2
    CROSS JOIN sys.all_objects AS s3
)
SELECT Number = x
  INTO dbo.Numbers
  FROM n
  WHERE x BETWEEN 1 AND @UpperLimit;

GO
CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers(Number) 
    WITH (DATA_COMPRESSION = PAGE);
GO

(Using data compression will drastically reduce the number of pages required, but obviously you should only use this option if you are running Enterprise Edition. In this case the compressed data requires 1,360 pages, versus 2,102 pages without compression – about a 35% savings.)

CREATE FUNCTION dbo.SplitStrings_Numbers
(
   @List       NVARCHAR(MAX),
   @Delimiter  NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
   RETURN
   (
       SELECT Item = SUBSTRING(@List, Number, 
         CHARINDEX(@Delimiter, @List + @Delimiter, Number) - Number)
       FROM dbo.Numbers
       WHERE Number <= CONVERT(INT, LEN(@List))
         AND SUBSTRING(@Delimiter + @List, Number, LEN(@Delimiter)) = @Delimiter
   );
GO

 

Common Table Expression

This solution uses a recursive CTE to extract each part of the string from the "remainder" of the previous part. As a recursive CTE with local variables, you'll note that this had to be a multi-statement table-valued function, unlike the others which are all inline.

CREATE FUNCTION dbo.SplitStrings_CTE
(
   @List       NVARCHAR(MAX),
   @Delimiter  NVARCHAR(255)
)
RETURNS @Items TABLE (Item NVARCHAR(4000))
WITH SCHEMABINDING
AS
BEGIN
   DECLARE @ll INT = LEN(@List) + 1, @ld INT = LEN(@Delimiter);

   WITH a AS
   (
       SELECT
           [start] = 1,
           [end]   = COALESCE(NULLIF(CHARINDEX(@Delimiter, 
                       @List, 1), 0), @ll),
           [value] = SUBSTRING(@List, 1, 
                     COALESCE(NULLIF(CHARINDEX(@Delimiter, 
                       @List, 1), 0), @ll) - 1)
       UNION ALL
       SELECT
           [start] = CONVERT(INT, [end]) + @ld,
           [end]   = COALESCE(NULLIF(CHARINDEX(@Delimiter, 
                       @List, [end] + @ld), 0), @ll),
           [value] = SUBSTRING(@List, [end] + @ld, 
                     COALESCE(NULLIF(CHARINDEX(@Delimiter, 
                       @List, [end] + @ld), 0), @ll)-[end]-@ld)
       FROM a
        WHERE [end] < @ll
    )
    INSERT @Items SELECT [value]
      FROM a WHERE LEN([value]) > 0
   OPTION (MAXRECURSION 0);

   RETURN;
END
GO
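To make the recursion easier to follow, here is the same start/end walk written as an iterative loop in Python (an illustrative sketch only): each pass finds the next delimiter after the previous [end] and slices out the value in between, stopping once [end] reaches LEN(@List) + 1.

```python
def split_cte(lst: str, delim: str) -> list[str]:
    """Iterative analogue of the recursive CTE's [start]/[end] walk."""
    ll, ld = len(lst) + 1, len(delim)          # @ll, @ld
    items, start = [], 1
    while True:
        pos = lst.find(delim, start - 1)       # CHARINDEX(@Delimiter, @List, start)
        end = pos + 1 if pos != -1 else ll     # COALESCE(NULLIF(..., 0), @ll)
        value = lst[start - 1:end - 1]
        if value:                              # WHERE LEN([value]) > 0
            items.append(value)
        if end >= ll:
            break
        start = end + ld                       # recursive member: [end] + @ld
    return items

print(split_cte("Patriots,Red Sox,Bruins", ","))
# ['Patriots', 'Red Sox', 'Bruins']
```

Note that, unlike the Numbers-table approach, this version filters out empty elements, matching the LEN([value]) > 0 predicate in the CTE.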

 

Jeff Moden's splitter

Over on SQLServerCentral, Jeff Moden presented a splitter function that rivaled the performance of CLR, so I thought it only fair to include a variation using a similar approach in this round-up. I had to make a few minor changes to his function in order to handle our longest string (500,000 characters), and also made the naming conventions similar:

CREATE FUNCTION dbo.SplitStrings_Moden
(
   @List NVARCHAR(MAX),
   @Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING AS
RETURN
  WITH E1(N)        AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 
                         UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 
                         UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1),
       E2(N)        AS (SELECT 1 FROM E1 a, E1 b),
       E4(N)        AS (SELECT 1 FROM E2 a, E2 b),
       E42(N)       AS (SELECT 1 FROM E4 a, E2 b),
       cteTally(N)  AS (SELECT 0 UNION ALL SELECT TOP (DATALENGTH(ISNULL(@List,1))) 
                         ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E42),
       cteStart(N1) AS (SELECT t.N+1 FROM cteTally t
                         WHERE (SUBSTRING(@List,t.N,1) = @Delimiter OR t.N = 0))
  SELECT Item = SUBSTRING(@List, s.N1, ISNULL(NULLIF(CHARINDEX(@Delimiter,@List,s.N1),0)-s.N1,8000))
    FROM cteStart s;

As an aside, for those using Jeff Moden's solution, you may consider using a Numbers table as above, and experimenting with a slight variation on Jeff's function:

CREATE FUNCTION dbo.SplitStrings_Moden2
(
   @List      NVARCHAR(MAX),
   @Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING AS
RETURN
   WITH cteTally(N)  AS 
   (
	   SELECT TOP (DATALENGTH(ISNULL(@List,1))+1) Number-1 
	      FROM dbo.Numbers ORDER BY Number
   ),
   cteStart(N1) AS 
   (
       SELECT t.N+1 
          FROM cteTally t
    	  WHERE (SUBSTRING(@List,t.N,1) = @Delimiter OR t.N = 0)
   )
   SELECT Item = SUBSTRING(@List, s.N1, 
      ISNULL(NULLIF(CHARINDEX(@Delimiter, @List, s.N1), 0) - s.N1, 8000))
   FROM cteStart AS s;

(This will trade slightly higher reads for slightly lower CPU, so may be better depending on whether your system is already CPU- or I/O-bound.)

Sanity checking

Just to be sure we're on the right track, we can verify that all five functions return the expected results:

DECLARE @s NVARCHAR(MAX) = N'Patriots,Red Sox,Bruins';

SELECT Item FROM dbo.SplitStrings_CLR     (@s, N',');
SELECT Item FROM dbo.SplitStrings_XML     (@s, N',');
SELECT Item FROM dbo.SplitStrings_Numbers (@s, N',');
SELECT Item FROM dbo.SplitStrings_CTE     (@s, N',');
SELECT Item FROM dbo.SplitStrings_Moden   (@s, N',');

And in fact, these are the results we see in all five cases...

The Test Data

Now that we know the functions behave as expected, we can get to the fun part: testing performance against various numbers of strings that vary in length. But first we need a table. I created the following simple object:

CREATE TABLE dbo.strings
(
  string_type  TINYINT,
  string_value NVARCHAR(MAX)
);

CREATE CLUSTERED INDEX st ON dbo.strings(string_type);

I populated this table with a set of strings of varying lengths, making sure that roughly the same set of data would be used for each test - first 10,000 rows where the string is 50 characters long, then 1,000 rows where the string is 500 characters long, 100 rows where the string is 5,000 characters long, 10 rows where the string is 50,000 characters long, and so on up to 1 row of 500,000 characters. I did this both to compare the same amount of overall data being processed by the functions, as well as to try to keep my testing times somewhat predictable.

I use a #temp table so that I can simply use GO <constant> to execute each batch a specific number of times:

SET NOCOUNT ON;
GO
CREATE TABLE #x(s NVARCHAR(MAX));
INSERT #x SELECT N'a,id,xyz,abcd,abcde,sa,foo,bar,mort,splunge,bacon,';
GO
INSERT dbo.strings SELECT 1, s FROM #x;
GO 10000
INSERT dbo.strings SELECT 2, REPLICATE(s,10) FROM #x;
GO 1000
INSERT dbo.strings SELECT 3, REPLICATE(s,100) FROM #x;
GO 100
INSERT dbo.strings SELECT 4, REPLICATE(s,1000) FROM #x;
GO 10
INSERT dbo.strings SELECT 5, REPLICATE(s,10000) FROM #x;
GO
DROP TABLE #x;
GO

-- then to clean up the trailing comma, since some approaches treat a trailing empty string as a valid element:
UPDATE dbo.strings SET string_value = SUBSTRING(string_value, 1, LEN(string_value)-1) + 'x';

Creating and populating this table took about 20 seconds on my machine, and the table represents about 6 MB worth of data (about 500,000 characters times 2 bytes, or 1 MB per string_type, plus row and index overhead). Not a huge table, but it should be large enough to highlight any differences in performance between the functions.
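The back-of-the-envelope math here is easy to verify: every string_type tier multiplies out to the same 500,000 characters, so each contributes roughly 1 MB of NVARCHAR payload (2 bytes per character) before row and index overhead:

```python
# Each tier stores the same total payload: rows × characters-per-row is constant.
tiers = [(10_000, 50), (1_000, 500), (100, 5_000), (10, 50_000), (1, 500_000)]
for rows, chars in tiers:
    assert rows * chars == 500_000             # 500,000 characters per tier

total_mb = sum(r * c * 2 for r, c in tiers) / 1_000_000
print(total_mb)   # 5.0 MB of raw character data, before row/index overhead
```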

The Tests

With the functions in place, and the table properly stuffed with big strings to chew on, we can finally run some actual tests to see how the different functions perform against real data. In order to measure performance without factoring in network overhead, I used SQL Sentry Plan Explorer, running each set of tests 10 times, collecting the duration metrics, and averaging.

The first test simply pulled the items from each string as a set:

DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;

DECLARE @string_type TINYINT = ; -- 1-5 from above

SELECT t.Item FROM dbo.strings AS s
  CROSS APPLY dbo.SplitStrings_(s.string_value, ',') AS t
  WHERE s.string_type = @string_type;

The results show that as the strings get larger, the advantage of CLR really shines. At the lower end, the results were mixed, but again the XML method should have an asterisk next to it, since it relies on XML-safe input. For this specific use case, the Numbers table consistently performed the worst:

Duration, in milliseconds, for retrieving items from table

After the hyperbolic 40-second performance for the numbers table against 10 rows of 50,000 characters, I dropped it from the running for the last test. To better show the relative performance of the four best methods in this test, I've dropped the Numbers results from the graph altogether:

Duration, in milliseconds, for retrieving items from table

Next, let's compare when we perform a search against the comma-separated value (e.g. return the rows where one of the strings is 'foo'). Again we'll use the five functions above, but we'll also compare the result against a search performed at runtime using LIKE instead of bothering with splitting.

DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;

DECLARE @i INT = , @search NVARCHAR(32) = N'foo';

;WITH s(st, sv) AS 
(
  SELECT string_type, string_value
    FROM dbo.strings AS s
    WHERE string_type = @i
)
SELECT s.string_type, s.string_value FROM s 
  CROSS APPLY dbo.SplitStrings_(s.sv, ',') AS t
  WHERE t.Item = @search;

SELECT string_type
  FROM dbo.strings
  WHERE string_type = @i
  AND ',' + string_value + ',' LIKE '%,' + @search + ',%';
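The LIKE predicate above wraps both the column value and the search term in delimiters for a reason: it forces the term to match a whole element, not a fragment of one. A quick sketch of the idea in Python (illustrative only):

```python
def csv_contains(csv: str, search: str, delim: str = ",") -> bool:
    """Whole-element membership test: look for ',foo,' inside ',a,foo,bar,'.
    Padding both sides means first and last elements also get delimiters."""
    return (delim + search + delim) in (delim + csv + delim)

print(csv_contains("a,foo,bar", "foo"))    # True
print(csv_contains("a,food,bar", "foo"))   # False - no partial-element match
```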

These results show that, for small strings, CLR was actually the slowest, and that the best solution is going to be performing a scan using LIKE, without bothering to split the data up at all. Again I dropped the Numbers table solution from the 5th approach, when it was clear that its duration would increase exponentially as the size of the string went up:

Duration, in milliseconds, for matching CSV column to constant

And to better demonstrate the patterns for the top 4 results, I've eliminated the Numbers and XML solutions from the graph:

Duration, in milliseconds, for matching CSV column to constant

Next, let's look at replicating the use case from the beginning of this post, where we're trying to find all the rows in one table that exist in the list being passed in. As with the data in the table we created above, we're going to create strings varying in length from 50 to 500,000 characters, store them in a variable, and then check a common catalog view for existing in the list.

DECLARE 
  @i INT = , -- value 1-5, yielding strings 50 - 500,000 characters
  @x NVARCHAR(MAX) = N'a,id,xyz,abcd,abcde,sa,foo,bar,mort,splunge,bacon,';

SET @x = REPLICATE(@x, POWER(10, @i-1));

SET @x = SUBSTRING(@x, 1, LEN(@x)-1) + 'x';

SELECT c.[object_id] 
  FROM sys.all_columns AS c
  WHERE EXISTS 
  (
    SELECT 1 FROM dbo.SplitStrings_(@x, N',') AS x 
    WHERE Item = c.name
  )
  ORDER BY c.[object_id];

SELECT [object_id]
  FROM sys.all_columns 
  WHERE N',' + @x + ',' LIKE N'%,' + name + ',%'
  ORDER BY [object_id];

These results show that, for this pattern, several methods see their duration increase exponentially as the size of the string goes up. At the lower end, XML keeps good pace with CLR, but this quickly deteriorates as well. CLR is consistently the clear winner here:

Duration, in milliseconds, for matching column to CSV variable

And again without the methods that explode upward in terms of duration:

Duration, in milliseconds, for matching column to CSV variable

Finally, let's compare the cost of retrieving the data from a single variable of varying length, ignoring the cost of reading data from a table. Again we'll generate strings of varying length, from 50 - 500,000 characters, and then just return the values as a set:

DECLARE 
  @i INT = , -- value 1-5, yielding strings 50 - 500,000 characters
  @x NVARCHAR(MAX) = N'a,id,xyz,abcd,abcde,sa,foo,bar,mort,splunge,bacon,';

SET @x = REPLICATE(@x, POWER(10, @i-1));

SET @x = SUBSTRING(@x, 1, LEN(@x)-1) + 'x';

SELECT Item FROM dbo.SplitStrings_(@x, N',');

These results also show that CLR is fairly flat-lined in terms of duration, all the way up to 110,000 items in the set, while the other methods keep decent pace until some time after 11,000 items:

Duration, in milliseconds, for retrieving items from CSV variable

Conclusion

In almost all cases, the CLR solution clearly out-performs the other approaches - in some cases it's a landslide victory, especially as string sizes increase; in a few others, it's a photo finish that could fall either way. In the first test we saw that XML and CTE out-performed CLR at the low end, so if this is a typical use case *and* you are sure that your strings are in the 1 - 10,000 character range, one of those approaches might be a better option. If your string sizes are less predictable than that, CLR is probably still your best bet overall - you lose a few milliseconds at the low end, but you gain a whole lot at the high end. Here are the choices I would make, depending on the task, with second place highlighted for cases where CLR is not an option. Note that XML is my preferred method only if I know the input is XML-safe; these may not necessarily be your best alternatives if you have less faith in your input.

1st and 2nd choices depending on task

The only real exception where CLR is not my choice across the board is the case where you're actually storing comma-separated lists in a table, and then finding rows where a defined entity is in that list. In that specific case, I would probably first recommend redesigning and properly normalizing the schema, so that those values are stored separately, rather than using it as an excuse to not use CLR for splitting.

If you can't use CLR for other reasons, there isn't a clear-cut "second place" revealed by these tests; my answers above were based on overall scale and not at any specific string size. Every solution here was runner up in at least one scenario - so while CLR is clearly the choice when you can use it, what you should use when you cannot is more of an "it depends" answer - you'll need to judge based on your use case(s) and the tests above (or by constructing your own tests) which alternative is better for you.

Addendum : An alternative to splitting in the first place

The above approaches require no changes to your existing application(s), assuming they are already assembling a comma-separated string and throwing it at the database to deal with. One option you should consider, if either CLR is not an option and/or you can modify the application(s), is using Table-Valued Parameters (TVPs). Here is a quick example of how to utilize a TVP in the above context. First, create a table type with a single string column:

CREATE TYPE dbo.Items AS TABLE
(
  Item NVARCHAR(4000)
);

Then the stored procedure can take this TVP as input, and join on the content (or use it in other ways - this is just one example):

CREATE PROCEDURE dbo.UpdateProfile
    @UserID INT,
    @TeamNames dbo.Items READONLY
AS
BEGIN
   SET NOCOUNT ON;

   INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, t.TeamID
      FROM dbo.Teams AS t
      INNER JOIN @TeamNames AS tn
      ON t.Name = tn.Item;
END
GO

Now in your C# code, for example, instead of building a comma-separated string, populate a DataTable (or use whatever compatible collection might already hold your set of values):

DataTable tvp = new DataTable();
tvp.Columns.Add(new DataColumn("Item"));

// in a loop from a collection, presumably:
tvp.Rows.Add(someThing.someValue);

using (connectionObject)
{
    SqlCommand cmd       = new SqlCommand("dbo.UpdateProfile", connectionObject);
    cmd.CommandType      = CommandType.StoredProcedure;
    SqlParameter tvparam = cmd.Parameters.AddWithValue("@TeamNames", tvp);
    tvparam.SqlDbType    = SqlDbType.Structured;
    // other parameters, e.g. @UserID
    connectionObject.Open();
    cmd.ExecuteNonQuery();
}

You might consider this to be a prequel to a follow-up post.

Of course this doesn't play well with JSON and other APIs - quite often the reason a comma-separated string is being passed to SQL Server in the first place.

The post Split strings the right way – or the next best way appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2021/09/t-sql-queries/split-strings/feed 25
Deprecated features to take out of your toolbox – Part 3 https://sqlperformance.com/2021/07/sql-performance/deprecated-features-3 https://sqlperformance.com/2021/07/sql-performance/deprecated-features-3#respond Thu, 01 Jul 2021 09:00:11 +0000 https://sqlperformance.com/?p=10968 As part of a long game on deprecated SQL Server functionality, Aaron Bertrand questions why anyone is still using text / ntext / image.

The post Deprecated features to take out of your toolbox – Part 3 appeared first on SQLPerformance.com.

]]>

I've recently discussed a few features that Microsoft advises against using, and that I think you should forget exist, too. There was the case where a colleague constantly promoted the deprecated backward compatibility view sys.sysprocesses instead of newer dynamic management views (DMVs), and another case where a different colleague took down a production server using SQL Server Profiler.

My latest run-in with things best forgotten is a new stored procedure with an ntext parameter. I checked and, sure enough, the data type matches the schema for the underlying table. My mind started racing through all the reasons we really shouldn't be using these older data types anymore:

  • image
  • ntext
  • text

These types are on the deprecated list for many reasons, and have held a permanent spot on that list since being replaced by the max types way back in SQL Server 2005. Some of these pain points include:

  • you can't use many string functions, like LEFT(), RTRIM(), UPPER(), and most comparison operators;
  • you need to use functions like TEXTPTR, WRITETEXT, and UPDATETEXT for modifications;
  • you can't use the types as local variables;
  • you can't reference the columns in DISTINCT, GROUP BY, ORDER BY, or as an included column (not that you should want to do any of these);
  • smaller values that could fit in-row can only do so with the text in row option.

That is not an exhaustive list; there are other differences that you might consider more or less important. The most pressing reason for me is that you can't rebuild the clustered index online if the table contains one of these data types.

Let's create a simple database with a few tables:

CREATE DATABASE BadIdeas;
GO

USE BadIdeas;
GO

CREATE TABLE dbo.t1(id bigint IDENTITY PRIMARY KEY, msg nvarchar(max));
CREATE TABLE dbo.t2(id bigint IDENTITY PRIMARY KEY, msg ntext);

Now, let's try to perform online operations on the tables:

ALTER TABLE dbo.t1 REBUILD WITH (ONLINE = ON);
GO
ALTER TABLE dbo.t2 REBUILD WITH (ONLINE = ON);

While the first statement succeeds, the second yields an error message like this:

Msg 2725, Level 16, State 2
An online operation cannot be performed for index 'PK__t2__3213E83FEEA1E0AD' because the index contains column 'msg' of data type text, ntext, image or FILESTREAM. For a non-clustered index, the column could be an include column of the index. For a clustered index, the column could be any column of the table. If DROP_EXISTING is used, the column could be part of a new or old index. The operation must be performed offline.

In a one-off scenario, the error message is easy enough to deal with: you either skip the table or, if the table absolutely must be rebuilt, you work with your teams to schedule an outage. When you're automating an index maintenance solution or deploying new compression settings across your environment, it is a bit more of a pain in the behind to make your solution handle present or future state.

What to do?

You can start replacing these columns with their more modern counterparts. Here is a query to help you track them down using the sys.columns catalog view, but you're on your own for any explicit references that might exist in your application code:

SELECT [Schema]    = s.name, 
       [Object]    = o.name,
       [Column]    = c.name,
       [Data Type] = TYPE_NAME(c.user_type_id) + CASE 
         WHEN c.system_type_id <> c.user_type_id 
         THEN N' (' + TYPE_NAME(c.system_type_id) + N')' 
         ELSE N'' END
  FROM sys.columns AS c
  INNER JOIN sys.objects AS o
    ON c.[object_id] = o.[object_id]
  INNER JOIN sys.schemas AS s
    ON o.[schema_id] = s.[schema_id]
  WHERE c.system_type_id IN (34, 35, 99)
  ORDER BY [Schema], [Object], [Column];

Output:

Output showing our bad columns

It may be tempting to go into SSMS and manually change the data types of these columns, but there may be other implications as well. For example, the columns might have default constraints associated with them. And you may have stored procedures with parameters that should be updated in tandem:

CREATE PROCEDURE dbo.sp1 @p1 ntext AS PRINT 1;
GO

To find all of these cases, you can adapt the above query to search against the sys.parameters catalog view instead:

SELECT [Schema]    = s.name, 
       [Object]    = o.name, 
       [Parameter] = p.name, 
       [Data Type] = TYPE_NAME(p.user_type_id) + CASE 
         WHEN p.system_type_id <> p.user_type_id 
         THEN N' (' + TYPE_NAME(p.system_type_id) + N')' 
         ELSE N'' END
  FROM sys.objects AS o
  INNER JOIN sys.schemas AS s
    ON o.[schema_id] = s.[schema_id]
  INNER JOIN sys.parameters AS p
    ON p.[object_id] = o.[object_id]
  WHERE p.system_type_id IN (34, 35, 99)
  ORDER BY [Schema], [Object], [Parameter];

Output:

Output showing our bad parameters

If you need to return this data across all databases, you can grab sp_ineachdb, a procedure I wrote (and documented here and here) to overcome several of the limitations in the buggy, undocumented, and unsupported sp_MSforeachdb. Then you can do this:

EXEC master.dbo.sp_ineachdb @command = N'SELECT [Database]  = DB_NAME(), 
       [Schema]    = s.name, 
       [Object]    = o.name,
       [Column]    = c.name,
       [Data Type] = TYPE_NAME(c.user_type_id) + CASE 
         WHEN c.system_type_id <> c.user_type_id 
         THEN N'' ('' + TYPE_NAME(c.system_type_id) + N'')'' 
         ELSE N'''' END
  FROM sys.columns AS c
  INNER JOIN sys.objects AS o
    ON c.[object_id] = o.[object_id]
  INNER JOIN sys.schemas AS s
    ON o.[schema_id] = s.[schema_id]
  WHERE c.system_type_id IN (34, 35, 99)
  ORDER BY [Schema], [Object], [Column];

SELECT [Database]  = DB_NAME(),
       [Schema]    = s.name, 
       [Object]    = o.name, 
       [Parameter] = p.name, 
       [Data Type] = TYPE_NAME(p.user_type_id) + CASE 
         WHEN p.system_type_id <> p.user_type_id 
         THEN N'' ('' + TYPE_NAME(p.system_type_id) + N'')''
         ELSE N'''' END
  FROM sys.objects AS o
  INNER JOIN sys.schemas AS s
    ON o.[schema_id] = s.[schema_id]
  INNER JOIN sys.parameters AS p
    ON p.[object_id] = o.[object_id]
  WHERE p.system_type_id IN (34, 35, 99)
  ORDER BY [Schema], [Object], [Parameter];';

An interesting side note here: if you do run that against all databases, you will discover that, even in SQL Server 2019, Microsoft is still using some of these old types.

You could further automate that by running it from PowerShell or whatever automation tool you use to manage multiple instances of SQL Server.

Of course, that is just the beginning – it only produces a list. You could further extend it to generate a draft version of the ALTER TABLE commands you would need to update all the tables, but those commands would need to be reviewed before you execute them, and you'd still need to modify the procedures yourself (generating ALTER PROCEDURE commands that only have those parameter type names replaced correctly is not an easy exercise by any means). Here is an example that generates ALTER TABLE commands, taking into account nullability but no other complications like default constraints:

SELECT N'ALTER TABLE ' + QUOTENAME(s.name)
  + N'.' + QUOTENAME(o.name)
  + N' ALTER COLUMN ' + QUOTENAME(c.name) + N' '
  + CASE c.system_type_id
      WHEN 34 THEN N'varbinary'
      WHEN 35 THEN N'varchar'
      WHEN 99 THEN N'nvarchar'
    END + N'(max)' 
  + CASE c.is_nullable 
      WHEN 0 THEN N' NOT' 
      ELSE N'' END + N' NULL;'
FROM sys.columns AS c
INNER JOIN sys.objects AS o
  ON c.[object_id] = o.[object_id]
INNER JOIN sys.schemas AS s
  ON o.[schema_id] = s.[schema_id]
  WHERE c.system_type_id IN (34, 35, 99);

Output:

ALTER TABLE [dbo].[t2] ALTER COLUMN [msg] nvarchar(max) NULL;
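The mapping logic the generator applies is simple enough to sketch outside of T-SQL. Here is an illustrative Python version, where the hypothetical input tuples stand in for the rows the sys.columns query returns (system_type_id values 34, 35, and 99, plus nullability):

```python
# Map the deprecated system_type_id values to their modern max-type counterparts.
TYPE_MAP = {34: "varbinary", 35: "varchar", 99: "nvarchar"}

def alter_statement(schema, table, column, system_type_id, is_nullable):
    """Build the ALTER TABLE ... ALTER COLUMN command for one bad column."""
    new_type = TYPE_MAP[system_type_id] + "(max)"
    null_clause = " NULL" if is_nullable else " NOT NULL"
    return (f"ALTER TABLE [{schema}].[{table}] "
            f"ALTER COLUMN [{column}] {new_type}{null_clause};")

print(alter_statement("dbo", "t2", "msg", 99, True))
# ALTER TABLE [dbo].[t2] ALTER COLUMN [msg] nvarchar(max) NULL;
```

As with the T-SQL version, this ignores complications like default constraints, so review any generated commands before running them.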

And in case you're wondering, no, you can't do this one-time operation with an explicit WITH (ONLINE = ON) option:

Msg 11427, Level 16, State 1
The online ALTER COLUMN operation cannot be performed for table 't2' because column 'msg' currently has or is getting altered into an unsupported datatype: text, ntext, image, CLR type or FILESTREAM. The operation must be performed offline.

Conclusion

Hopefully this provides some good background on why you want to eliminate these deprecated types, and a starting point to actually make the changes. Microsoft learned the hard way that there isn't much functionality they can just rip out of the product, so I'm not concerned that these will ever actually cease to exist in my lifetime. But fear of removal shouldn't be your only motivator.

The post Deprecated features to take out of your toolbox – Part 3 appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2021/07/sql-performance/deprecated-features-3/feed 0
Deprecated features to take out of your toolbox – Part 2 https://sqlperformance.com/2021/03/sql-performance/deprecated-features-2 https://sqlperformance.com/2021/03/sql-performance/deprecated-features-2#comments Mon, 15 Mar 2021 09:00:15 +0000 https://sqlperformance.com/?p=10800 With a real-world example, Aaron Bertrand discusses why you shouldn't use deprecated functionality like SQL Server Profiler.

The post Deprecated features to take out of your toolbox – Part 2 appeared first on SQLPerformance.com.

]]>
This thing is definitely deprecated.

In my last post, I illustrated one reason you should stop using deprecated system tables like sysprocesses. This wasn't for performance reasons, directly, or to simply follow Microsoft's documented best practices, but revolved more around the decisions you might make when you only have access to some of the data.

This time around, I want to talk about a feature included with SQL Server client tools that we should not be using these days – not just because it is deprecated but, more importantly, because it has the potential to completely take a server down:

SQL Server Profiler

Since SQL Server 2012 (and, to a much lesser extent, since SQL Server 2008), the most complete and flexible solution for monitoring events on a SQL Server instance has been Extended Events. On the other hand, since it was first invented (roughly right around the time I started my career), SQL Server Profiler has been one of the most prolific ways to accidentally bring a server to its knees.

Not too long ago, something like this happened to us. A developer made a change to their application to reduce the number of round-trips they were making. They ran the application locally and in our development environment, using Profiler with a filtered trace, to confirm their changes were working as expected. Let me remind you at this point that the official documentation for SQL Server Profiler includes the following warning:

SQL Trace and SQL Server Profiler are deprecated. The Microsoft.SqlServer.Management.Trace namespace that contains the Microsoft SQL Server Trace and Replay objects are also deprecated. This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.

Anyway, when they deployed the new version of their application to production, and targeted the production server with the same filtered trace, it didn't go so well. Their wildcard filter on application name didn't take into account the other (similarly-named) applications also connecting to this server, and they immediately started capturing way more info than their open Profiler window could handle. This resulted in a catastrophic uptick in connection time for all users and applications connecting to that instance. It would be an understatement to say that complaints were lodged.

When the culprit was determined, and we got a response from the developer, you can see that the connection time trended back down to normal almost immediately after they stopped their Profiler trace (click to enlarge):

A SQL Server Profiler mini-disaster

This is definitely a scenario where the old "worked on my machine" statement does not in any way mean it will work well on a busy production server. And this incident has led to an active conversation around modifying the logon trigger that already exists across all of the servers in our environment to simply block the application name Profiler passes in its connection string.

Maybe this isn't a Profiler problem. (But it kind of is.)

I am not here to suggest that other monitoring tools, including Extended Events, can't possibly be used inappropriately to bring a server down in a similar (or worse!) way. There are plenty of opportunities to inadvertently affect an instance of SQL Server, in really adverse ways, without touching SQL Server Profiler.

But Profiler is notorious for this type of symptom because of the way it consumes data. It is a user interface with a grid that presents new rows as it receives them; unfortunately, it makes SQL Server wait while it renders them before allowing SQL Server to transmit more rows. This behavior is similar to how SQL Server Management Studio can slow queries down and cause high ASYNC_NETWORK_IO waits as it tries to render a large amount of output to its own grid. That is a simplification (and is not to be confused with the way the underlying SQL Trace can be made to behave, which Greg Gonzalez (@SQLsensei) explains in "Don't Fear the Trace"), but it is exactly what leads to the type of scenario shown above: that bottleneck has a cascading effect on any other processes trying to do anything in the same code path as what you're tracing (including trying to establish a connection).

Afraid of Extended Events?

Don't be. It is high time we all ditch Profiler and embrace the present. There is no shortage of tutorials and guides out there, including Microsoft's own QuickStart and several articles right here on this site.

If you have existing traces you rely on, Jonathan Kehayias (@SQLPoolBoy) has a really handy script to convert an existing trace to Extended Events. You can still feel free to configure the original trace with the Profiler UI, if that's where you're most comfortable; just please do it without connecting to a production server. You can read about that script here and see some of Jonathan's other Extended Events articles here.

If you're having a hard time with the user experience, you are not alone, but there are some answers:

  • Erin Stellato (@erinstellato) has long been a spectacular advocate for Extended Events, and often wonders aloud why people are reluctant to let go of Profiler and SQL Trace in general. She has some insight (and inspired a lot of comments) in her 2016 post, "Why do YOU avoid Extended Events?"; perhaps there is some insight there into whether your reasons for holding out are still (as) valid in 2021.
     
  • There is an XEvent Profiler built into modern versions of SSMS, with an equivalent extension for Azure Data Studio. Though, confusingly, they called this extension – of all the things one could possibly imagine – SQL Server Profiler. Erin also has a few thoughts about that choice.
     
  • Erik Darling (@erikdarlingdata) has created sp_HumanEvents to take some of the pain out of switching to Extended Events. One of my favorite "stick to the point" folks, Erik describes sp_HumanEvents as follows: "If you want a pain-free way to profile query metrics, wait stats, blocking, compiles, or recompiles with Extended Events, this is your unicorn."

The post Deprecated features to take out of your toolbox – Part 2 appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2021/03/sql-performance/deprecated-features-2/feed 4
Deprecated features to take out of your toolbox – Part 1 https://sqlperformance.com/2021/02/sql-performance/deprecated-features-1 https://sqlperformance.com/2021/02/sql-performance/deprecated-features-1#comments Mon, 22 Feb 2021 09:00:58 +0000 https://sqlperformance.com/?p=10783 Aaron Bertrand provides evidence about SQL Server features you should stop using, like sysprocesses, but not just because they've been deprecated.

The post Deprecated features to take out of your toolbox – Part 1 appeared first on SQLPerformance.com.

]]>
This thing is definitely deprecated.

Microsoft is not in the habit of deprecating things these days, but when they do, it's for a reason – and it's certainly not because they want to make your life harder. On the contrary, it is almost always because they have developed better and more modern ways to solve those same problems.

But habits are hard to break; ask me how I know. All too often, I see people clinging to an older way of accomplishing some task, even though a better way exists.

I would like to share a couple of recent examples that help illustrate how using deprecated SQL Server features continues to bite us. In this first part, I want to talk about…

sysprocesses

The system table sys.sysprocesses was replaced way back in SQL Server 2005 by a set of dynamic management views (DMVs), most notably sys.dm_exec_requests, sys.dm_exec_sessions, and sys.dm_exec_connections. The official documentation for sys.sysprocesses warns:

This SQL Server 2000 system table is included as a view for backward compatibility. We recommend that you use the current SQL Server system views instead. To find the equivalent system view or views, see Mapping System Tables to System Views (Transact-SQL). This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.
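To see roughly how those three DMVs cover the ground sysprocesses used to, here is one common way to stitch them together. The column names are real; the join shape is just a sketch of one typical pattern, not the only valid one:

```sql
-- A rough modern equivalent of a sysprocesses row, assembled from the
-- DMVs that replaced it. dm_exec_requests only has rows for sessions
-- with an active request, hence the LEFT JOIN.
SELECT
  s.session_id,              -- was: spid
  s.[status],
  s.open_transaction_count,  -- was: open_tran (column added in SQL Server 2012)
  r.wait_type,
  r.wait_time,               -- was: waittime
  s.cpu_time,                -- was: cpu
  s.[reads],                 -- physical reads (sysprocesses lumped these
  s.writes,                  --   together with writes as physical_io)
  s.last_request_end_time    -- was: last_batch
FROM sys.dm_exec_sessions AS s
LEFT JOIN sys.dm_exec_requests AS r
  ON r.session_id = s.session_id
WHERE s.is_user_process = 1;
```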

A recent example

Recently one of our teams was investigating a log reader latency issue. We pay a lot of attention to latency here, along with any long-running transactions, because of the downstream impact on technologies that rely on the log reader – like Availability Groups and transactional replication. Our first warnings are usually spotted on a dashboard that plots log reader latency against transaction duration (I'll explain the points in time I labeled t0 and t1 shortly):

Log reader latency correlated with long-running transactions

They determined, let's say at time t0, that a certain session had an open transaction blocking the log reader process. They first checked the output of DBCC INPUTBUFFER to try to determine what this session last executed, but the results only showed that the session had issued other batches since starting its transaction:

event_type       parameters   event_info
--------------   ----------   ---------------
Language Event   0            SET ROWCOUNT 0;

Note that DBCC INPUTBUFFER also has a more capable replacement in modern versions: sys.dm_exec_input_buffer. And while it doesn't have an explicit deprecation warning, the official documentation for the DBCC command has this gentle nudge:

Starting with SQL Server 2014 (12.x) SP2, use sys.dm_exec_input_buffer to return information about statements submitted to an instance of SQL Server.
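The replacement is a table-valued function rather than a DBCC command, so you can query, filter, and join its output like any other rowset – something DBCC INPUTBUFFER never allowed without an INSERT ... EXEC workaround. For the session in this example:

```sql
-- sys.dm_exec_input_buffer takes a session_id and a request_id;
-- pass NULL for request_id to get the session's current request.
SELECT ib.event_type, ib.[parameters], ib.event_info
FROM sys.dm_exec_input_buffer(107, NULL) AS ib;
```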

After getting nothing from the input buffer, they queried sys.sysprocesses:

SELECT 
  spid, 
  [status], 
  open_tran, 
  waittime, 
  [cpu], 
  physical_io, 
  memusage, 
  last_batch
FROM sys.sysprocesses 
WHERE spid = 107;

The results were similarly useless, at least in terms of determining what the session had been doing to keep their transaction open and disrupt the log reader:

Results from sysprocesses

I'm highlighting the physical_io column because this value sparked a discussion about whether or not they wanted to risk killing the sleeping session. The thinking was that, in the event all those physical I/Os are writes, killing the transaction might result in a lengthy and disruptive rollback – potentially making the problem even worse. I'm not going to put actual times on this, but let's just say this turned into a prolonged conversation, and it left the system in this state from time t0 to time t1 on the graph above.

Why this is a problem

The issue in this specific case is that they spent that time contemplating a decision based on incomplete information. Are those I/Os reads or writes? If the user has an open transaction and has merely read a lot of data, there is far less impact in rolling that transaction back than if they have changed a lot of data. So, instead of sys.sysprocesses, let's see what the more modern DMV, sys.dm_exec_sessions, can show us about this session:

SELECT 
  session_id, 
  [status], 
  open_transaction_count, 
  cpu_time, 
  [reads], 
  writes, 
  logical_reads, 
  last_request_start_time,
  last_request_end_time
FROM sys.dm_exec_sessions 
WHERE session_id = 107;

Results:

Results from sys.dm_exec_sessions

Here we see that sys.dm_exec_sessions breaks physical I/O out into separate reads and writes columns. This allows us to make a much more informed decision about a potential rollback's impact, and far more quickly than the t1 - t0 gap above. If the I/O is all writes, and depending on how high the number is, we might hesitate a bit more and perhaps spend the time trying to locate the user (so we could smack their knuckles or ask them why they left a transaction open). If we know it is mostly reads, we can instead lean toward killing the session and forcing the transaction to roll back.

Sure, sys.sysprocesses has dbid and waittime. But dbid is unreliable and marginally useful anyway, particularly for cross-database queries; there's much better info in sys.dm_tran_locks. Wait info (time and last wait type) can be found in sys.dm_exec_requests, but much more detailed info is offered in sys.dm_exec_session_wait_stats (added in SQL Server 2016). An excuse I used to hear a lot was that sys.dm_exec_sessions was missing open_tran, but open_transaction_count was added back in SQL Server 2012. So there is very little reason to even think about using sys.sysprocesses today.
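As a quick sketch of those two richer sources (using the same session_id from the example above), the first query shows which databases and objects the session actually holds locks in, and the second shows its accumulated waits:

```sql
-- Which databases does session 107 actually hold locks in, and what kind?
-- Far more reliable than the single dbid column in sysprocesses.
SELECT
  [database]     = DB_NAME(resource_database_id),
  resource_type,
  request_mode,
  request_status,
  lock_count     = COUNT(*)
FROM sys.dm_tran_locks
WHERE request_session_id = 107
GROUP BY resource_database_id, resource_type, request_mode, request_status;

-- What has that session been waiting on? (SQL Server 2016+)
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_exec_session_wait_stats
WHERE session_id = 107
ORDER BY wait_time_ms DESC;
```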

If you want to discover how often sys.sysprocesses has been referenced since SQL Server restarted last, you can run this query against the performance counters DMV:

SELECT instance_name, cntr_value
  FROM sys.dm_os_performance_counters
  WHERE [object_name] LIKE N'%:Deprecated Features%'
    AND instance_name = N'sysprocesses' 
  ORDER BY cntr_value DESC;

If you really want to avoid sleep tonight, or you just like to constantly add to the laundry list of things you worry about, remove the predicate against instance_name. This will give you a scary, high-level idea of how many deprecated things your instances are relying on that you will eventually need to change.
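That broader version of the query looks like this (I've added a cntr_value filter, which is just a convenience to hide the many features you haven't touched):

```sql
-- Every deprecated feature used since the last restart, busiest first.
SELECT instance_name, cntr_value
  FROM sys.dm_os_performance_counters
  WHERE [object_name] LIKE N'%:Deprecated Features%'
    AND cntr_value > 0
  ORDER BY cntr_value DESC;
```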

In the meantime, go download sp_WhoIsActive, Adam Machanic's ultra useful stored procedure for monitoring and troubleshooting SQL Server processes in real time. We have deployed this stored procedure to every instance in our environment, and you should too, regardless of what other high-end monitoring tools you might also be using.
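For a scenario like the one above, an invocation along these lines would have surfaced the blocking transaction immediately. The parameter names are from the procedure's documentation; which flags you actually want is a judgment call:

```sql
-- Show transaction details per session, identify blocking chains,
-- and sort the worst blockers to the top.
EXEC dbo.sp_WhoIsActive
  @get_transaction_info = 1,
  @find_block_leaders   = 1,
  @sort_order           = '[blocked_session_count] DESC';
```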

Next time

In Part 2, I'm going to talk a little about SQL Server Profiler, an application that people use more because of familiarity than anything else – without realizing how dangerous it can be.

]]>
https://sqlperformance.com/2021/02/sql-performance/deprecated-features-1/feed 3