Rob Farley, Author at SQLPerformance.com – SQL Server performance articles curated by SentryOne
https://sqlperformance.com/author/robfarley

Filtered Indexes and INCLUDEd Columns
https://sqlperformance.com/2019/03/sql-performance/filtered-indexes-and-included-columns
Fri, 29 Mar 2019 13:38:48 +0000
Rob Farley talks about filtered indexes and when you might want to add filter columns to the index definition.

Filtered indexes are amazingly powerful, but I still see some confusion out there about them – particularly about the columns that are used in the filters, and what happens when you want to tighten the filters.

A recent question on dba.stackexchange asked for help about why columns used in the filter of a filtered index should be included in the 'included' columns of the index. Excellent question – except that I felt like it started on a poor premise, because those columns shouldn't have to be included in the index. Yes they help, but not in the way that the question seemed to suggest.

To save you looking at the question itself, here's a quick summary:

To satisfy this query…

SELECT Id, DisplayName 
FROM Users 
WHERE Reputation > 400000; 

…the following filtered index is pretty good:

CREATE UNIQUE NONCLUSTERED INDEX Users_400k_Club
ON dbo.Users ( DisplayName, Id )
INCLUDE ( Reputation )
WHERE Reputation > 400000;

But despite having this index in place, the Query Optimizer recommends the following index if the filter value is tightened to, say, 450000.

CREATE NONCLUSTERED INDEX IndexThatWasMissing
ON dbo.Users ( Reputation )
INCLUDE ( DisplayName, Id );

I'm paraphrasing the question a little here, which starts by referring to this situation and then builds a different example, but the idea is the same. I just didn't want to make things more complicated by involving a separate table.

Point is – the index suggested by the QO is the original index but turned on its head. The original index had Reputation in the INCLUDE list, and DisplayName and Id as key columns, while the new recommended index is the opposite way around, with Reputation as the key column and DisplayName and Id in the INCLUDE. Let's look into why.

The question refers to a post by Erik Darling, where he explains that he tuned the '450,000' query above by putting Reputation into the INCLUDE column. Erik shows that without Reputation in the INCLUDE list, a query which filters to a higher value of Reputation needs to do Lookups (bad!), or maybe even give up entirely on the filtered index (potentially even worse). He concludes that having the Reputation column in the INCLUDE list lets SQL have statistics, so that it can make better choices, and shows that with Reputation in the INCLUDE a variety of queries that all filter on higher Reputation values all scan his filtered index.

In an answer to the dba.stackexchange question, Brent Ozar points out that Erik's improvements aren't particularly great because they cause Scans. I'll come back to that one, because it's an interesting point in itself, and somewhat incorrect.

First let's think a little about indexes in general.

An index provides an ordered structure to a set of data. (I could be pedantic and point out that reading through the data in an index from start to end might jump you from page to page in a seemingly haphazard way, but still, as you're reading through pages, following the pointers from one page to the next, you can be confident the data is ordered. Within each page you might even jump around to read the data in order, but there is a list showing you which parts (slots) of the page should be read in which order. There really is no point in my pedantry except to answer those equally pedantic people who will comment if I don't.)

And this order is according to the key columns – that's the easy bit that everyone gets. It's useful not only for being able to avoid re-ordering the data later, but also for being able to quickly locate any particular row or range of rows by those columns.

The leaf levels of the index contain the values in any columns in the INCLUDE list, or in the case of a Clustered Index, the values across all the columns in the table (except non-persisted computed columns). The other levels in the index contain just the key columns and (if the index is not unique) the unique address of the row – which is either the keys of the clustered index (with the row's uniquifier if the clustered index isn't unique either) or the RowID value for a heap, enough to allow easy access to all the other column values for the row. The leaf levels also include all the ‘address’ information.

But that's not the interesting bit for this post. The interesting bit for this post is what I mean by "to a set of data". Remember I said "An index provides an ordered structure to a set of data".

In a clustered index, that set of data is the whole table, but it could be something else. You can probably already picture how most non-clustered indexes don't involve all the columns of the table. This is one of the things that make non-clustered indexes so useful, because they're typically a lot smaller than the underlying table.

In the case of an indexed view, our set of data could be the results of a whole query, including joins across many tables! That's for another post.

But in a filtered index, it's not just a copy of a subset of columns, but a subset of rows too. So in the example here, the index is only across the users with more than 400k reputation.

CREATE UNIQUE NONCLUSTERED INDEX Users_400k_Club_NoInclude
ON dbo.Users ( DisplayName, Id )
WHERE Reputation > 400000;

This index takes the users who have more than 400k reputation, and orders them by DisplayName and Id. It can be unique because (assumedly) the Id column is already unique. If you try something similar on your own table, you might need to be careful of that.

But at this point, the index doesn't care what the Reputation is for each user – it just cares whether the Reputation is high enough to be in the index or not. If a user's reputation gets updated and it tips over the threshold, that user's DisplayName and Id will be inserted into the index. If it drops below, it'll get deleted from the index. It's just like having a separate table for the high rollers, except that we get people into that table by increasing their Reputation value over the 400k threshold in the underlying table. It can do this without having to actually store the Reputation value itself.

So now if we want to find people who have a reputation over 450k, that index is missing some information.

Sure, we could confidently say that everyone we'll find is in that index – but the index doesn't contain enough information in itself to filter any further on Reputation. If I told you I had an alphabetical list of Best Picture Oscar winning films from the 1990s (American Beauty, Braveheart, Dances With Wolves, English Patient, Forrest Gump, Schindler's List, Shakespeare in Love, Silence of the Lambs, Titanic, Unforgiven), then I can assure you that the winners for 1994-1996 would be a subset of those, but I can't answer the question without first getting some more information.

Obviously my filtered index would be more useful if I had included the year, and potentially even more so if the year were a key column, since my new query wants to find the ones for 1994-1996. But I probably designed this index around a query to list all the movies from the 1990s in alphabetical order. That query doesn't care about what the actual year is, only whether it's in the 1990s or not, and I don't even need to return the year – just the title – so I can scan my filtered index to get the results. For that query I don't even need to reorder the results or find the starting point – my index really is perfect.
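To make the analogy concrete, a sketch of that movies index might look like this (dbo.Movies, Title, and ReleaseYear are hypothetical names, just for illustration):

```sql
-- The filter uses ReleaseYear, but the index doesn't store it anywhere.
CREATE NONCLUSTERED INDEX Movies_1990s
ON dbo.Movies ( Title )
WHERE ReleaseYear >= 1990 AND ReleaseYear < 2000;

-- This query can simply scan the filtered index from start to end:
SELECT Title
FROM dbo.Movies
WHERE ReleaseYear >= 1990 AND ReleaseYear < 2000
ORDER BY Title;
```

A query for 1994-1996, though, can't be answered from this index alone, because ReleaseYear isn't in it anywhere.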

A more practical example of not caring about the value of the column in the filter is on status, such as:

WHERE IsActive = 1

I frequently see code that moves data from one table to another when rows stop being 'active'. People don't want old rows cluttering up their table, and they recognise that their 'hot' data is only a small subset of all their data. So they move their cooling data off into an Archive table, keeping their Active table small.

A filtered index can do this for you. Behind the scenes. As soon as you update the row and change that IsActive column to something other than 1. If you only care about having active data in most of your indexes, then filtered indexes are ideal. It'll even bring rows back into the indexes if the IsActive value changes back to 1.

But you don't need to put IsActive in the INCLUDE list to achieve this. Why would you want to store the value – you already know what it is – it's 1! And why would you ask to return the value when you already know that the answer is 1, right?! Except that, frustratingly, the statistics that Erik refers to in his post take advantage of the column being in the INCLUDE list. You don't need it for the query, but you should include it for the stats.
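As a sketch (dbo.Orders and its columns are hypothetical names), carrying the filter column in the INCLUDE list costs almost nothing, since every stored value is known to be 1:

```sql
-- The index only ever contains rows where IsActive = 1, so the INCLUDEd
-- IsActive column is redundant for query results. It's there so that
-- statistics can be created on it for the optimizer.
CREATE NONCLUSTERED INDEX Orders_Active
ON dbo.Orders ( CustomerID, OrderDate )
INCLUDE ( IsActive )
WHERE IsActive = 1;
```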

Let's think about what the Query Optimizer needs to do to figure out the usefulness of an index.

Before it can do much at all, it needs to consider whether the index is a candidate. No point in using an index if it doesn't have all the rows that might be needed – not unless we have an effective way of getting the rest. If I want movies from 1985-1995, then my index of 1990s movies is pretty pointless. But for 1994-1996, maybe it's not bad.

At this point, just like any index consideration, I need to think about whether it will help enough for finding the data and getting it into an order that will help execute the rest of the query (possibly for a Merge Join, Stream Aggregate, satisfying an ORDER BY, or various other reasons). If my query filter matches the index filter exactly, then I don't need to filter any further – just using the index is enough. This sounds great, but if it doesn't match exactly, if my query filter is tighter than the index filter (like my 1994-1996 example, or Erik's 450,000), I'm going to need to have those Year values or Reputation values to check – hopefully getting them either from the INCLUDEd columns at the leaf level or from somewhere in my key columns. If they're not in the index, I'm going to have to do a Lookup for every row in my filtered index (and ideally, have an idea about how many times my Lookup will be called, which is what Erik wants the statistics on the column for).

Ideally, any index I plan to use is ordered correctly (via the keys), INCLUDEs all the columns I need to return, and is pre-filtered to just the rows I need. That would be the perfect index, and my execution plan will be a Scan.

That's right, a SCAN. Not a Seek, but a Scan. It'll start on the first page of my index and keep giving me rows until I've got as many as I need, or until there are no more rows to return. Not skipping any, not sorting them – just giving me the rows in order.

A Seek would suggest that I don't need the whole index, which means I'm wasting resources in maintaining that part of the index, and to query it I have to find the starting point and keep checking rows to see if I've hit the end or not. If my Scan has a Predicate, then sure, I'm having to look through (and test) more data than I need to, but if my index filters are perfect, then the Query Optimizer should recognise that and not have to perform those checks.

Final Thoughts

INCLUDEs aren't critical for filtered indexes. They're useful for providing easy access to columns which might be useful for your query, and if you happen to be tightening what's in your filtered index by any column, whether it's mentioned in the filter or not, you should consider having that column in the mix. But at that point you ought to be asking whether your index's filter is the right one, what else you should have in your INCLUDE list, and even what the key column(s) should be. Erik's queries weren't playing nicely because he needed information that wasn't in the index, even though he had mentioned the column in the filter. He found a good use for the statistics too, and I would still encourage you to include the filter columns for that reason. But putting them in an INCLUDE doesn't allow them to suddenly start doing a Seek, because that's not how any index works, whether filtered or not.

I want you, reader, to understand filtered indexes really well. They're incredibly useful and, when you start to picture them like tables in their own right, can become part of your overall database design. They're also a reason for always using the ANSI_NULLS and QUOTED_IDENTIFIER settings, because you'll get errors from filtered indexes unless those settings are ON, but hopefully you already make sure they're always on anyway.

Oh, and those movies were Forrest Gump, Braveheart, and The English Patient.

@rob_farley

Dealing with date and time instead of datetime
https://sqlperformance.com/2018/03/sql-optimizer/dealing-with-date-and-time
Thu, 29 Mar 2018 13:14:16 +0000
Rob Farley describes a tactic where adding multiple predicates around separate date and time columns can win you some better performance.

Data professionals don’t always get to use databases that have an optimal design. Sometimes the things that make you cry are things that we’ve done to ourselves, because they seemed like good ideas at the time. Sometimes they’re because of third party applications. Sometimes they simply predate you.

The one I’m thinking about in this post is when your datetime (or datetime2, or better still, datetimeoffset) column is actually two columns – one for the date, and one for the time. (If you have a separate column again for the offset, then I will give you a hug next time I see you, because you have probably had to deal with all kinds of hurt.)

I did a survey on Twitter, and found that this is a very real problem – about half of you have to deal with separate date and time columns from time to time.

AdventureWorks almost does this – if you look in the Sales.SalesOrderHeader table, you’ll see a datetime column called OrderDate, which always has exact dates in it. I bet that if you’re a report developer at AdventureWorks, you have probably written queries that look for the number of orders on a particular day, using GROUP BY OrderDate, or something like that. Even if you knew that this was a datetime column and there was potential for it to also store a non-midnight time, you would still say GROUP BY OrderDate just for the sake of using an index properly. GROUP BY CAST(OrderDate AS DATE) just doesn’t cut it.

I have an index on the OrderDate, as you would if you were regularly querying that column, and I can see that grouping by CAST(OrderDate AS DATE) is around four times worse from a CPU perspective.

So I understand why you would be happy to query your column as if it’s a date, simply knowing that you’ll have a world of pain if the use of that column changes. Maybe you solve this by having a constraint on the table. Maybe you just put your head in the sand.

And when someone comes along and says “You know, we should store the time that orders happen too”, well, you think of all the code that assumes OrderDate is simply a date, and figure that having a separate column called OrderTime (data type of time, please) will be the most sensible option. I understand. It’s not ideal, but it works without breaking too much stuff.

At this point, I recommend you also make OrderDateTime, which would be a computed column joining the two (which you should do by adding the number of days since day 0 – taken from OrderDate – to CAST(OrderTime AS datetime2), rather than trying to add the time to the date, which is generally a whole lot messier). And then index OrderDateTime, because that would be sensible.
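A sketch of what that might look like, assuming a hypothetical dbo.Orders table with OrderDate (date) and OrderTime (time) columns:

```sql
-- Days since day 0 (1 Jan 1900) come from OrderDate; the time-of-day
-- component comes from casting OrderTime to datetime2. The expression is
-- deterministic, so the computed column can be indexed.
ALTER TABLE dbo.Orders
ADD OrderDateTime AS
    dateadd(day, datediff(day, 0, OrderDate), cast(OrderTime as datetime2));

CREATE INDEX ixOrderDateTime ON dbo.Orders (OrderDateTime);
```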

But quite often, you’ll find yourself with date and time as separate columns, with basically nothing you can do about it. You can’t add a computed column, because it’s a third party application and you don’t know what might break. Are you sure they never do SELECT *? One day I hope they’ll let us add columns and hide them, but for the time being, you certainly risk breaking stuff.

And, you know, even msdb does this – look at the run_date and run_time columns in dbo.sysjobhistory. They're both integers. And it's because of backward compatibility, I assume. But I doubt you're considering adding a computed column to a table in msdb.
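Msdb even ships with a scalar function that combines its integer pair for you – it's undocumented, so treat it with the usual caveats:

```sql
-- run_date and run_time in msdb.dbo.sysjobhistory are ints (e.g. 20180329
-- and 131416). msdb.dbo.agent_datetime combines them into a datetime.
select j.name,
    msdb.dbo.agent_datetime(h.run_date, h.run_time) as run_datetime
from msdb.dbo.sysjobhistory h
join msdb.dbo.sysjobs j on j.job_id = h.job_id;
```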

So how do we query this? Let’s suppose we want to find the entries that were within a particular datetime range?

Let’s do some experimenting.

First, let’s create a table with 3 million rows, and index the columns we care about.

select identity(int,1,1) as ID, OrderDate, 
  dateadd(minute, abs(checksum(newid())) % (60 * 24), cast('00:00' as time)) as OrderTime
into dbo.Sales3M
from Sales.SalesOrderHeader
cross apply (select top 100 * from master..spt_values) v;

create index ixDateTime on dbo.Sales3M (OrderDate, OrderTime) include (ID);

(I could have made that a clustered index, but I figure that a non-clustered index is more typical for your environment.)

Our data looks like this, and I want to find rows between, say, Aug 2 2011 at 8:30, and Aug 5 2011 at 21:30.

[Images: sample rows from dbo.Sales3M, showing the rows in the target range]

By looking through the data, I can see that I want all the rows between IDs 48221 and 50171. That's 50171-48221+1=1951 rows (the +1 is because it's an inclusive range). This helps me be confident that my results are correct. You would probably see something similar on your machine, but not exactly the same, because I used random values when generating my table.

I know that I can’t just do something like this:

select * 
from dbo.Sales3M 
where OrderDate between '20110802' and '20110805'
and OrderTime between '8:30' and '21:30';

…because this wouldn’t include something that happened overnight on the 4th. This gives me 1268 rows – clearly not right.

One option is to combine the columns:

select * 
from dbo.Sales3M 
where dateadd(day,datediff(day,0,OrderDate),cast(OrderTime as datetime2)) 
  between '20110802 8:30' and '20110805 21:30';

This gives the correct results. It does. It’s just that this is completely non-sargable, and gives us a Scan across all the rows in our table. On our 3 million rows it might take seconds to run this.

[Image: execution plan showing a Scan across all 3 million rows]

Our problem is that we have an ordinary case, and two special cases. We know that every row that satisfies OrderDate > '20110802' AND OrderDate < '20110805' is one that we want. But we also need every row that is on-or-after 8:30 on 20110802, and on-or-before 21:30 on 20110805. And that leads us to:

select * 
from dbo.Sales3M 
where (OrderDate > '20110802' and OrderDate < '20110805')
or (OrderDate = '20110802' and OrderTime >= '8:30')
or (OrderDate = '20110805' and OrderTime <= '21:30');

OR is awful, I know. It can also lead to Scans, although not necessarily. Here I see three Index Seeks, being concatenated and then checked for uniqueness. The Query Optimizer obviously realises that it shouldn’t return the same row twice, but doesn’t realise that the three conditions are mutually exclusive. And actually, if you were doing this on a range within a single day, you’d get the wrong results.

[Image: execution plan with three Index Seeks, concatenated and then checked for uniqueness]

We could use UNION ALL on this, which would mean the QO wouldn’t care whether the conditions were mutually exclusive. This gives us three Seeks that are concatenated – that’s pretty good.

select * 
from dbo.Sales3M 
where (OrderDate > '20110802' and OrderDate < '20110805')
union all 
select * 
from dbo.Sales3M 
where (OrderDate = '20110802' and OrderTime >= '8:30')
union all 
select * 
from dbo.Sales3M 
where (OrderDate = '20110805' and OrderTime <= '21:30');

[Image: execution plan for the UNION ALL query, with three Index Seeks concatenated]

But it’s still three seeks. Statistics IO tells me it’s 20 reads on my machine.

[Image: STATISTICS IO output showing 20 reads]

Now, when I think about sargability, I don't just think about avoiding putting indexed columns inside expressions, I also think about what might help something seem sargable.

Take WHERE LastName LIKE 'Far%' for example. When I look at the plan for this, I see a Seek, with a Seek Predicate looking for any name from Far up to (but not including) FaS. And then there's a Residual Predicate checking the LIKE condition. This is not because the QO considers that LIKE is sargable. If it were, it would be able to use LIKE in the Seek Predicate. It's because it knows that everything that is satisfied by that LIKE condition must be within that range.

[Image: execution plan for the LIKE query, showing the Seek Predicate range and the residual LIKE Predicate]

Take WHERE CAST(OrderDate AS DATE) = '20110805'

[Image: execution plan for the CAST query, with a Seek Predicate between two computed range values]

Here we see a Seek Predicate that looks for OrderDate values between two values that have been worked out elsewhere in the plan, creating a range in which the right values must exist. This isn't >= 20110805 00:00 and < 20110806 00:00 (which is what I would've made it), it's something else. The value for the start of this range must be smaller than 20110805 00:00, because it's >, not >=. All we can really say is that when someone within Microsoft implemented how the QO should respond to this kind of predicate, they gave it enough information to come up with what I call a "helper predicate."

Now, I would love Microsoft to make more functions sargable, but that particular request was Closed long before they retired Connect.

But maybe what I mean is for them to make more helper predicates.

The problem with helper predicates is that they almost certainly read more rows than you want. But it’s still way better than looking through the whole index.

I know that all the rows I want to return will have OrderDate between 20110802 and 20110805. It’s just that there are some that I don’t want.

I could just remove them, and this would be valid:

select *
from dbo.Sales3M
where OrderDate between '20110802' and '20110805'
and not (OrderDate = '20110802' and OrderTime < '8:30')
and not (OrderDate = '20110805' and OrderTime > '21:30');

[Image: execution plan showing a single Seek, with the NOT conditions as residual predicates]

But I feel like this is a solution that requires some effort of thought to come up with. Less effort on the developer’s side is to simply provide a helper predicate to our correct-but-slow version.

select * 
from dbo.Sales3M 
where dateadd(day,datediff(day,0,OrderDate),cast(OrderTime as datetime2)) 
between '20110802 8:30' and '20110805 21:30'
and OrderDate between '20110802' and '20110805';

[Image: execution plan for the helper-predicate query, showing a single Seek]

Both of these queries find the 2300 rows that are on the right days, and then need to check all of those rows against the other predicates. One must check the two NOT conditions, the other must do some type conversion and maths. But both are much quicker than what we had before, and do a single Seek (13 reads). Sure, I get warnings about an inefficient RangeScan, but this is my preference over doing three efficient ones.

In some ways, the biggest problem with this last example is that some well-meaning person would see that the helper predicate was redundant and might delete it. This is the case with all helper predicates. So put a comment in.

select * 
from dbo.Sales3M 
where dateadd(day,datediff(day,0,OrderDate),cast(OrderTime as datetime2)) 
  between '20110802 8:30' and '20110805 21:30'
/* This next predicate is just a helper to improve performance */
and OrderDate between '20110802' and '20110805';

If you have something which doesn’t fit in a nice sargable predicate, work out one that is, and then figure out what you need to exclude from it. You might just come up with a nicer solution.

@rob_farley

When DISTINCT <> GROUP BY
https://sqlperformance.com/2018/03/t-sql-queries/distinct-group-by
Tue, 13 Mar 2018 13:17:56 +0000
Usually DISTINCT and GROUP BY will yield the same results, but here Rob Farley explains cases that can break the rule.

I wrote a post recently about DISTINCT and GROUP BY. It was a comparison that showed that GROUP BY is generally a better option than DISTINCT. It's on a different site, but be sure to come back to sqlperformance.com right after.

One of the query comparisons that I showed in that post was between a GROUP BY and DISTINCT for a sub-query, showing that the DISTINCT is a lot slower, because it has to fetch the Product Name for every row in the Sales table, rather than just for each different ProductID. This is quite plain from the query plans, where you can see that in the first query, the Aggregate operates on data from just one table, rather than on the results of the join. Oh, and both queries give the same 266 rows.

select od.ProductID,
    (select Name
     from Production.Product p
     where p.ProductID = od.ProductID) as ProductName
from Sales.SalesOrderDetail od
group by od.ProductID;

select distinct od.ProductID,
     (select Name
      from Production.Product p
      where p.ProductID = od.ProductID) as ProductName
from Sales.SalesOrderDetail od;

Now, it's been pointed out, including by Adam Machanic (@adammachanic) in a tweet referencing Aaron's post about GROUP BY v DISTINCT, that the two queries are essentially different: one is actually asking for the set of distinct combinations on the results of the sub-query, rather than running the sub-query across the distinct values that are passed in. It's what we see in the plan, and is the reason why the performance is so different.

The thing is that we would all assume that the results are going to be identical.

But that's an assumption, and isn't a good one.

I'm going to imagine for a moment that the Query Optimizer has come up with a different plan. I used hints for this, but as you would know, the Query Optimizer can choose to create plans in all kinds of shapes for all kinds of reasons.

select od.ProductID,
    (select Name
     from Production.Product p
     where p.ProductID = od.ProductID) as ProductName
from Sales.SalesOrderDetail od
group by od.ProductID
option (loop join);

select distinct od.ProductID,
    (select Name
     from Production.Product p
     where p.ProductID = od.ProductID) as ProductName
from Sales.SalesOrderDetail od
option (loop join);

In this situation, we either do 266 Seeks into the Product table, one for each different ProductID that we're interested in, or 121,317 Seeks, one for every row in SalesOrderDetail. So if we are thinking about a particular ProductID, we know that we're going to get a single Name back from the first one. And we assume that we're going to get a single Name back for that ProductID, even if we have to ask for it a hundred times. We just assume we're going to get the same results back.

But what if we don't?

This sounds like an isolation level thing, so let's use NOLOCK when we hit the Product table. And let's run (in a different window) a script that changes the text in the Name column. I'm going to do it over and over, to try to get some of the changes to happen while my query is running.

update Production.Product
set Name = cast(newid() as varchar(36));
go 1000

Now, my results are different. The plans are the same (except for the number of rows coming out of the Hash Aggregate in the second query), but my results are different.

Sure enough, I have more rows with DISTINCT, because it finds different Name values for the same ProductID. And I don't necessarily have 295 rows. Another time I run it, I might get 273, or 300, or possibly even 121,317.

It's not hard to find an example of a ProductID that shows multiple Name values, confirming what's going on.

Clearly, to ensure that we don't see these rows in the results, we would either need to NOT use DISTINCT, or else use a stricter isolation level.

The thing is that although I mentioned using NOLOCK for this example, I didn't need to. This situation occurs even with READ COMMITTED, which is the default isolation level on many SQL Server systems.

You see, we need the REPEATABLE READ isolation level to avoid this situation, to hold the locks on each row once it has been read. Otherwise, a separate thread might change the data, as we saw.

But… I can't show you that the results are fixed, because I couldn't manage to avoid a deadlock on the query.

So let's change the conditions, by making sure that our other query is less of a problem. Instead of updating the whole table at a time (which is far less likely in the real world anyway), let's just update a single row at a time.

declare @id int = 1;
declare @maxid int = (select count(*) from Production.Product);
while (@id <= @maxid)
begin
   with p as (select *, row_number() over (order by ProductID) as rn from Production.Product)
   update p
   set Name = cast(newid() as varchar(36))
   where rn = @id;

   set @id += 1;
end
go 100

Now, we can still demonstrate the problem under a lesser isolation level, such as READ COMMITTED or READ UNCOMMITTED (although you may need to run the query multiple times if you get 266 the first time, because the chance of updating a row during the query is less), and now we can demonstrate that REPEATABLE READ fixes it (no matter how many times we run the query).
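The demonstration is just a matter of wrapping the query in the stricter isolation level – a sketch might look something like this:

```sql
set transaction isolation level repeatable read;

begin tran;

select distinct od.ProductID,
    (select Name
     from Production.Product p
     where p.ProductID = od.ProductID) as ProductName
from Sales.SalesOrderDetail od
option (loop join);

commit;

-- back to the usual default
set transaction isolation level read committed;
```

With the single-row update script running in the other window, this should keep returning 266 rows, because each Product row stays locked once it's been read within the transaction.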

REPEATABLE READ does what it says on the tin. Once you read a row within a transaction, it's locked to make sure you can repeat the read and get the same results. The lesser isolation levels don't take out those locks until you try to change the data. If your query plan never needs to repeat a read (as is the case with the shape of our GROUP BY plans), then you're not going to need REPEATABLE READ.

Arguably, we should always use the higher isolation levels, such as REPEATABLE READ or SERIALIZABLE, but it all comes down to figuring out what our systems need. These levels can introduce unwanted locking, and SNAPSHOT isolation levels require versioning that comes with a price as well. For me, I think it's a trade-off. If I'm asking for a query that could be affected by changing data, then I might need to raise the isolation level for a while.

Ideally, you simply don't update data that has just been read and might need to be read again during the query, so that you don't need REPEATABLE READ. But it's definitely worth understanding what can happen, and recognising that this is the kind of scenario when DISTINCT and GROUP BY might not be the same.

@rob_farley

Because you need to know PowerShell
https://sqlperformance.com/2017/09/sql-performance/need-to-know-powershell
Tue, 12 Sep 2017 00:01:45 +0000
Rob Farley (@rob_farley) gives a rather thorough crash course on PowerShell for this month's installment of T-SQL Tuesday.

Hey – hope you’re doing okay.

I want to take a few minutes to get you started with PowerShell, in a way that I think is less overwhelming than how I found most PowerShell material back in the day. Because you've told me you keep hearing about PowerShell, and that you can't work out how to get into it.

The first thing to know is that it feels like Command Prompt on steroids, but a lot more useful than Command Prompt. You’ll see what I mean as soon as you open it. Let’s do that, but let’s open it using ‘Run as Administrator’, because that just helps for some of the demos. In real life, you’ll only open it with Admin rights if you really need to.

So click on the start button and start typing PowerShell. I don’t want you to open the ISE (yet), just find the “Windows PowerShell” shortcut, and right-click on it to choose ‘Run as Administrator’.

clip_image002

Now that it’s running, pin it to your Taskbar, so that you can access it more quickly next time. If you’re going to try to embrace PowerShell, you should make it easy to open.

Anyway, that window… straight away it feels like you need to have a clue what to write. Yes, that’s right. Things might’ve been a little easier in the ISE, and Aaron Nelson (@sqlvariant) has told me off for not using the ISE every time I do anything in PowerShell, but we’ll be okay here.

Let’s start by doing some basic things that we know from DOS. The commands cd and dir. Try this:

cd ..
dir

You’ll see that you move up to C:\WINDOWS, and then list the contents. No surprises here, it’s just like Command Prompt. Or DOS if you’re old enough.

The difference with PowerShell is that this is not just text output. These things are objects. And we can do stuff with that. PowerShell is a ‘typed’ environment. You already deal with objects all the time, like in Object Explorer in SSMS. PowerShell objects can include all those things, as well as .Net objects. This is where PowerShell gets its power from.

A few quick things…

  1. PowerShell commands come in verb-noun pairs, such as Set-Location or Get-Content or Get-ChildItem. These are known as cmdlets, so when you see the word ‘cmdlet’, it just means a PowerShell command.
  2. PowerShell command parameters use a hyphen. Like: Get-ChildItem -Recurse but you can also abbreviate them so long as there’s no ambiguity. So I could use -rec instead of -recurse.
  3. Variables start with a dollar sign ($). We’ll get to this later.
  4. We don’t use > in filters and tests, we use -gt. -eq for =, -ne for !=/<>, -le for <=, and so on. This is frustrating, but you’ll get used to it.
  5. There are a bunch of abbreviations and aliases. One alias for Set-Location is cd. Get-ChildItem is also known as dir, and ls too. Where-Object is ? and ForEach-Object is %. We’ll see these as we go along too.
  6. You get tab-completion on almost everything, so if you’ve typed Get-Chi, you can hit Tab and cycle through the options until you find what you want.
  7. You can use help (which is an alias for Get-Help) by simply using: help dir, and at the bottom of that help, you’ll get a nice reminder that you could’ve used help dir -examples, and that adding -online would have opened a web page with the relevant help page on it (that’s really handy).
  8. You can pass the results of one thing into another using the pipe symbol (|). Perhaps the most useful one is gm (an alias for Get-Member), which will list the Properties and Methods (the members) for whatever object you’ve passed in.
  9. # comments out the rest of a line, or you can wrap a whole block of comments between <# and #>

Let’s see some of this in action.

clip_image004

Here we can see some interesting things. Good to notice that we could use -Exclude and -Include. We’ll do that soon. We could get more information using -detailed, or -examples, or -full.

Let’s explore our SQL folder. Start by typing cd \p and then hit Tab until it shows you Program Files.

clip_image006

Now hit M and start hitting Tab again to find ‘Microsoft SQL Server’. If you go too far, press Shift-Tab to go backwards. (Oh, and if you should’ve been looking on D:, then start again with cd D:\p – but the principles are the same.)

clip_image008

Great. Let’s hit Enter and see that we’ve changed directory.

But now let’s type: dir -rec -inc ConfigurationFile.ini

clip_image010

You could’ve hit Tab after each parameter and seen them expand into the more correct versions. You could’ve used Get-ChildItem to be slightly stricter. I don’t really care though. When you run this, you’ll see some files listed. I don’t know how many you’ll get. Each one indicates that you ran an installation of SQL Server. But as a SQL expert, you know this. You just might not have known how easily PowerShell could list them for you.

clip_image012

Ok. Hopefully we’re seeing some use for this.

Now let’s do something with this output. I want to say ‘with these files’, but let’s do the data thing and think of this as a dataset of files.

Suppose we want to count them. I could loop through the results using ForEach-Object (but I’ll use the shortcut %), and increment a variable that I set up. This is a common pattern for looping – people put whole programs inside those curly brackets, across lots of lines.
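The command in the screenshot is along these lines (reconstructed from the description above, so treat it as a sketch):

$count = 0
dir -rec -inc ConfigurationFile.ini | % { $count++ }
$count

Note that the classic ForEach-Object scriptblock runs in the caller’s scope, which is why incrementing $count inside it works.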

clip_image014

This does the trick, but I’m sure there’s something built-in to let me count things.

Well, I can never remember what that command is, so let’s ask one of those fancy internet-search providers.

clip_image016

I don’t even want to open these pages. I just want to notice that there’s a command called Measure-Object. Let’s pipe our output into that. Hit the ‘up arrow’ key to go back to a previous command.
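Something like this – my reconstruction of what the screenshot shows:

dir -rec -inc ConfigurationFile.ini | Measure-Object

The Count property in the output is what we’re after.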

clip_image018

This takes a moment, and then happily tells us there are three.

It takes a moment because it’s having to search through the files again. How about we store those results somewhere?

Now, I call these things brackets, but Americans call them parentheses. And what you call brackets, I call square brackets. There are curly brackets too, which you call braces I think. I’ll try to call parentheses ‘round brackets’ to avoid confusion, but if I refer to a ‘bracket’, I mean a parenthesis.

I’m going to wrap my dir command in round brackets, and assign that to a variable.
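That looks something like this ($configfilelist is the variable name used through the rest of the post):

$configfilelist = (dir -rec -inc ConfigurationFile.ini)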

clip_image020

Now I can pipe $configfilelist into Measure-Object and it will be more responsive, because I’ve stored my results.

But because $configfilelist is an array of files, I can also look at its length to see that there are three items. That’s way easier. And I can access each individual one by using square brackets from 0 to 2.
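For example:

$configfilelist.Length    # 3 in this case
$configfilelist[0]        # the first of the three files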

clip_image022

This isn’t exactly ground-breaking stuff yet. But those files do tell us some useful things. Let’s do a Get-Member command on our array.

clip_image024

Don’t get put off by this. At the top we can see that we’re dealing with objects that are of type System.IO.FileInfo. This is a .Net thing and is where the power comes from. We can see that we could call the Delete methods of these files, or CopyTo, or Decrypt (if they were encrypted, but they’re not), and more. Down the bottom we see Properties.

clip_image026

This is more interesting. Let’s look at LastWriteTime. Because we might be interested in when that is.

We can use Select-Object to show only certain properties.
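For example (the property list here is my choice):

$configfilelist | Select-Object FullName, LastWriteTime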

clip_image028

As expected, nothing’s written to those files since the installs were actually done. But what if I wanted to filter this, only to ones since a particular point in time?

I’m going to use Get-Date to get the current date and time. I’m going to wrap that in round brackets and use its AddYears method to find a year ago. In the real world I’d be more inclined to use AddDays, but I do want to get some results.

clip_image030

And I’m going to filter $configfilelist by passing into Where-Object (except that I’m going to use a convenient shortcut which is the question-mark). My filter lives inside curly brackets (‘braces’, right?), and inside those curly brackets I refer to the things in $configfilelist using $_.

So my filter goes like this, saying that I want to list the ones that have a LastWriteTime greater than (-gt) a year before now:
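Reconstructed from the screenshot, that’s roughly:

$configfilelist | ? { $_.LastWriteTime -gt (Get-Date).AddYears(-1) }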

clip_image032

And I only get one file. Terrific.

That’s all well and good, but now I want to know what that installation was about. For that we need to look inside the file. Get-Content is our friend here. Let’s start by using Get-Content against a file directly. It’ll be easy to hook it into our proper command later.

clip_image034

Get-Content actually returns an array of lines, and has really useful parameters like -TotalCount which stops after a certain number of lines, or -Tail which gets lines at the end of the file. I love this when I’m dealing with large files – it’s not like I can open them in Notepad.
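For example, using one of the files we found earlier:

$inifile = $configfilelist[0].FullName
Get-Content $inifile -TotalCount 10   # just the first ten lines
Get-Content $inifile -Tail 5          # just the last five lines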

Let me tell you about Select-String too. It’s like grep from Unix – you use it to get the lines you want in a file.
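Something along these lines – FileInfo objects can be piped straight into it:

$configfilelist | Select-String -Pattern 'FEATURES='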

clip_image036

That output seems like a pain, but remember this is just the string representation of an object. If I pipe my command into gm (Get-Member), you’ll see what I mean.

clip_image038

So now I can tell that I can access the Path of the file, the Line, the LineNumber, etc., separately.

clip_image040

To look for multiple things, I want to pass an array of parameters to -Pattern. I do that by simply using a comma, to discover that it was an install of Analysis Services.

clip_image042

Hopefully you’re starting to see the power of this.

I can drop it all into the one line by using:

dir -rec -inc ConfigurationFile.ini | ? {$_.LastWriteTime -gt (Get-Date).AddYears(-1)} | select-string -Pattern 'FEATURES=','ACTION=' | select-object Path, Line

(Sorry, you'll likely have to scroll sideways to see all that – getting everything on a single line isn't exactly a virtue in HTML.)

clip_image044

When you’re working things out that you like, you’ll want to store the commands somewhere for later (such as in a .ps1 file that you run). For that, you want to be aware of Get-History. It lists everything you’ve run in that session.

And a neat little trick is that you can use negative numbers to hook into arrays from the other end. So you can do this:
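For example, to grab the text of the most recent command (CommandLine is the property that holds it):

(Get-History)[-1].CommandLine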

clip_image046

But instead of just displaying it, pipe it into clip:

clip_image048

…which gets it onto the clipboard.

I’m not going to go much further with this, but you could keep pushing this further and further yourself, exploring the types of things that might be interesting in your job. You can dump results into files using > and append to files using >>.

Now consider commands like:

Get-EventLog -LogName Application -After (Get-Date).AddHours(-1) -EntryType Error

…and imagine what you could do with that.

Try doing: Import-Module SQLServer

Which then lets you go:

cd SQLSERVER:

And browse the SQL environment as if it’s a file system.

clip_image050

clip_image052

I don’t tend to use PowerShell to explore SQL like this very often, I must admit. But the fact that I can certainly gives me a bunch of options. The fact that I could use PowerShell to create a Word document like they do over at https://learn-powershell.net/2014/12/31/beginning-with-powershell-and-word/, or send emails as described at https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/send-mailmessage, or call web services (including Azure Machine Learning ones) as described over at https://docs.microsoft.com/en-gb/powershell/module/Microsoft.PowerShell.Utility/Invoke-WebRequest means that with a bit of imagination, you could use PowerShell to do just about whatever you want. I love using Invoke-Sqlcmd to get data out of SQL, push it into Invoke-WebRequest to get scores from Azure ML, and then respond accordingly. PowerShell makes it really straightforward.

Oh, and to run PowerShell commands on other machines (which can be quite useful, of course), a quick ‘internet search’ showed me this useful-looking page: https://www.howtogeek.com/117192/how-to-run-powershell-commands-on-remote-computers/

You could have a list of computers inside a file, and loop through it like this:
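A sketch of that pattern – the file name and the script block are placeholders of mine:

Get-Content .\computers.txt | ForEach-Object {
    Invoke-Command -ComputerName $_ -ScriptBlock { Get-Service MSSQLSERVER }
}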

clip_image054

Try some stuff. Knock yourself out. And explore what everyone is writing for T-SQL Tuesday this month, hosted by Rob Sewell (@sqldbawithbeard)

@rob_farley

The post Because you need to know PowerShell appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2017/09/sql-performance/need-to-know-powershell/feed 8
Using AT TIME ZONE to fix an old report https://sqlperformance.com/2017/02/t-sql-queries/using-at-time-zone-to-fix-an-old-report https://sqlperformance.com/2017/02/t-sql-queries/using-at-time-zone-to-fix-an-old-report#comments Tue, 14 Feb 2017 15:37:41 +0000 https://sqlperformance.com/?p=8599 Rob Farley (@rob_farley) contributes to T-SQL Tuesday #87, showing how he would use AT TIME ZONE to simplify reports based on an end user's time zone.

The post Using AT TIME ZONE to fix an old report appeared first on SQLPerformance.com.

]]>
T-SQL TuesdayAs soon as I saw the SQL 2016 feature AT TIME ZONE, which I wrote about here at sqlperformance.com a few months ago, I remembered a report that needed this feature. This post forms a case study on how I saw it work out, which fits into this month’s T-SQL Tuesday hosted by Matt Gordon (@sqlatspeed). (It’s the 87th T-SQL Tuesday, and I really need to write more blog posts, particularly about things that are not prompted by T-SQL Tuesdays.)

The situation was this, and this may sound familiar if you read that earlier post of mine.

Long before LobsterPot Solutions existed, I needed to produce a report on incidents that occurred, and in particular, show the number of times that responses were made within SLA and the number of times that the SLA was missed. For example, a Sev2 incident that occurred at 4:30pm on a weekday would need to have a response within 1 hour, while a Sev2 incident that occurred at 5:30pm on a weekday would need to have a response within 3 hours. Or something like that – I forget the numbers involved, but I do remember that the helpdesk employees would breathe a sigh of relief when 5pm would roll around, because they wouldn’t need to respond to things so quickly. The 15-minute Sev1 alerts would suddenly stretch out to an hour, and the urgency would disappear.

But a problem would come whenever daylight saving time started or ended.

I’m sure if you’ve dealt with databases, you’ll know the pain that daylight saving time is. Supposedly Ben Franklin came up with the idea – and for that he should be struck by lightning or something. Western Australia tried it for a few years recently, and sensibly abandoned it. And the general consensus is to store date/time data in UTC.

If you don’t store data in UTC, you run the risk of having an event start at 2:45am and end at 2:15am after the clocks have gone back. Or having an SLA incident that starts at 1:59am just before the clocks go forward. Now, these times are fine if you store the time zone that they’re in, but in UTC time just works as expected.

…except for reporting.

Because how am I supposed to know whether a particular date was before daylight saving started or after? I might know that an incident occurred at 6:30am in UTC, but is that 4:30pm in Melbourne or 5:30pm? Obviously I can consider which month it’s in, because I know that Melbourne observes daylight saving time from the first Sunday in October to the first Sunday in April, but then if there are customers in Brisbane, and Auckland, and Los Angeles, and Phoenix, and various places within Indiana, things get a lot more complicated.

To get around this, there were very few time zones in which SLAs could be defined for that company. It was just considered too hard to cater for more than that. A report could then be customised to say “Consider that on a particular date the time zone changed from X to Y”. It felt messy, but there was no need for anything to look up the Windows registry, and it basically just worked.

But these days, I would’ve done it differently.

Now, I would’ve used AT TIME ZONE.

You see, now I could store the customer’s time zone information as a property of the customer. I could then store each incident time in UTC, allowing me to do the necessary calculations around the number of minutes to respond, resolve, and so on, while being able to report using the customer’s local time. Assuming my IncidentTime had actually been stored using datetime, rather than datetimeoffset, it would simply be a matter of using code like:

i.IncidentTime AT TIME ZONE 'UTC' AT TIME ZONE c.tz

…which first puts the timezoneless i.IncidentTime into UTC, before converting it to the time zone of the customer. And this time zone can be 'AUS Eastern Standard Time', or 'Mauritius Standard Time', or whatever. And the SQL Engine is left to figure out what offset to use for that.

At this point, I can very easily create a report that lists each incident across a time period, and show it in the customer’s local time zone. I can convert the value to the time data type, and then report against how many incidents were within business hours or not.
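For example, a business-hours test might look something like this – the table, the join, and the 9-to-5 window are all illustrative, not from the original system:

SELECT i.IncidentId,
       i.IncidentTime AT TIME ZONE 'UTC' AT TIME ZONE c.tz AS LocalTime,
       CASE WHEN CAST(i.IncidentTime AT TIME ZONE 'UTC' AT TIME ZONE c.tz AS time)
                 BETWEEN '09:00' AND '17:00'
            THEN 1 ELSE 0 END AS WithinBusinessHours
FROM dbo.Incidents AS i
JOIN dbo.Customers AS c ON c.CustomerId = i.CustomerId;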

And all of this is very useful, but what about the indexing to handle this nicely? After all, AT TIME ZONE is a function. But changing the time zone doesn’t change the order in which the incidents actually occurred, so it should be okay.

To test this, I created a table called dbo.Incidents, and indexed the IncidentTime column. Then I ran this query, and confirmed that an index seek was used.

select i.IncidentTime, itz.LocalTime
from dbo.Incidents i
cross apply (select i.IncidentTime AT TIME ZONE 'UTC' 
  AT TIME ZONE 'Cen. Australia Standard Time') itz (LocalTime)
where i.IncidentTime >= '20170201'
and i.IncidentTime < '20170301';

image

But I want to filter on itz.LocalTime…

select i.IncidentTime, itz.LocalTime
from dbo.Incidents i
cross apply (select i.IncidentTime AT TIME ZONE 'UTC' 
  AT TIME ZONE 'Cen. Australia Standard Time') itz (LocalTime)
where itz.LocalTime >= '20170201'
and itz.LocalTime < '20170301';

No luck. It didn’t like the index.

The warnings are because it’s having to look through a lot more than the data that I’m interested in.

I even tried using a table with a datetimeoffset field. After all, AT TIME ZONE can change the order when moving from datetime to datetimeoffset, even though the order isn’t changed when moving from datetimeoffset to another datetimeoffset. I even tried making sure that the thing I was comparing it to was in the time zone.

select i.IncidentTime, itz.LocalTime
from dbo.IncidentsOffset i
cross apply (select i.IncidentTime AT TIME ZONE 'Cen. Australia Standard Time') itz (LocalTime)
where itz.LocalTime >= cast('20170201' as datetimeoffset) 
  AT TIME ZONE 'Cen. Australia Standard Time'
and itz.LocalTime < cast('20170301' as datetimeoffset) 
  AT TIME ZONE 'Cen. Australia Standard Time';

image

Still no luck!

So now I had two options. One was to store the converted version alongside the UTC version, and index that. I think that’s a pain. It’s certainly a lot more of a database change than I’d like.

The other option was to use what I call helper predicates. These are the kind of thing that you see when you use LIKE. They’re predicates that can be used as Seek Predicates, but not exactly what you’re asking for.

I figure that no matter what time zone I’m interested in, the IncidentTimes that I care about are within a very specific range. That range is no more than a day bigger than my preferred range, on either side.

So I’ll include two extra predicates.

select i.IncidentTime, itz.LocalTime
from dbo.IncidentsOffset i
cross apply (select i.IncidentTime 
    AT TIME ZONE 'Cen. Australia Standard Time') itz (LocalTime)
where itz.LocalTime >= cast('20170201' as datetimeoffset) 
  AT TIME ZONE 'Cen. Australia Standard Time'
and itz.LocalTime < cast('20170301' as datetimeoffset) 
  AT TIME ZONE 'Cen. Australia Standard Time'
and i.IncidentTime >= dateadd(day,-1,'20170201')
and i.IncidentTime < dateadd(day, 1,'20170301');

image

Now, my index can be used. It’s having to look through 30 rows before filtering it to the 28 that it cares about – but that’s a lot better than scanning the whole thing.

image

And you know – this is the kind of behaviour that I see all the time from regular queries, like when I do CAST(myDateTimeColumns AS DATE) = @SomeDate, or use LIKE.

I’m okay with this. AT TIME ZONE is great for letting me handle my time zone conversions, and by considering what’s going on with my queries, I don’t need to sacrifice performance either.

@rob_farley

The post Using AT TIME ZONE to fix an old report appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2017/02/t-sql-queries/using-at-time-zone-to-fix-an-old-report/feed 1
Estimated Number of Rows to be Read https://sqlperformance.com/2017/01/sql-performance/estimated-number-of-rows-to-be-read https://sqlperformance.com/2017/01/sql-performance/estimated-number-of-rows-to-be-read#comments Tue, 10 Jan 2017 02:40:31 +0000 https://sqlperformance.com/?p=8504 Rob Farley (@rob_farley) uses this month's T-SQL Tuesday to show off the results of one of his Connect items : Estimated Number of Rows to be Read.

The post Estimated Number of Rows to be Read appeared first on SQLPerformance.com.

]]>
T-SQL TuesdayDon’t get me wrong – I love the Actual Rows Read property that we saw arrive in SQL Server’s execution plans in late 2015. But in SQL Server 2016 SP1, less than two months ago (and considering we’ve had Christmas in between, I don’t think much of the time since then counts), we got another exciting addition – Estimated Number of Rows to be Read (oh, and this is somewhat down to the Connect item I submitted, both demonstrating that Connect Items are worth submitting and making this post eligible for this month’s T-SQL Tuesday, hosted by Brent Ozar (@brento) on the topic of Connect items).

Let’s recap a moment… when the SQL Engine accesses data in a table, it uses either a Scan operation or a Seek operation. And unless that Seek has a Seek Predicate that can access at most one row (because it’s looking for an equality match on a set of columns – could be just a single column – which are known to be unique), then the Seek will perform a RangeScan, and behaves just like a Scan, just across the subset of rows that are satisfied by the Seek Predicate.

The rows satisfied by a Seek Predicate (in the case of a Seek operation’s RangeScan) or all the rows in the table (in the case of a Scan operation) are treated in essentially the same way. Both might get terminated early if no more rows are requested from the operator to its left, for example if a Top operator somewhere has already grabbed enough rows, or if a Merge Operator has no more rows to match against. And both might be filtered further by a Residual Predicate (shown as the ‘Predicate’ property) before the rows even get served up by the Scan/Seek operator. The “Number of Rows” and “Estimated Number of Rows” properties would tell us how many rows were expected to be produced by the operator, but we didn’t have any information about how many rows would be filtered by just the Seek Predicate. We could see the TableCardinality, but this was only really useful for Scan operators, where there was a chance that the Scan might look through the whole table for the rows it needed. It wasn’t useful at all for Seeks.

Table Cardinality

The query that I’m running here is against the WideWorldImporters database, and is:

SELECT COUNT(*)
FROM Sales.Orders
WHERE SalespersonPersonID = 7
AND YEAR(OrderDate) = 2013
AND MONTH(OrderDate) = 4;

Furthermore, I have an index in play:

CREATE NONCLUSTERED INDEX rf_Orders_SalesPeople_OrderDate 
  ON Sales.Orders (SalespersonPersonID, OrderDate);

This index is covering – the query doesn’t need any other columns to get its answer – and has been designed so that a Seek Predicate can be used on SalespersonPersonID, quickly filtering the data down to a smaller range. The functions on OrderDate mean that those last two predicates can’t be used within the Seek Predicate, so they are relegated to the Residual Predicate instead. A better query would filter those dates using OrderDate >= '20130401' AND OrderDate < '20130501', but I’m imagining a scenario here which is all too common…

Now, if I run the query, I can see the impact of the Residual Predicates. Plan Explorer even gives that useful warning that I’d written about before.

rf-tsql-plan

I can see very clearly that the RangeScan is 7,276 rows, and that the Residual Predicate filters this down to 149. Plan Explorer shows more information about this on the tooltip:

rf-tsql2-2

But without running the query, I can’t see that information. It’s simply not there. The properties in the estimated plan don’t have it:

rf-tsql2-3

And I’m sure I don’t need to remind you – this information is not present in the plan cache either. Having grabbed the plan from the cache using:

SELECT p.query_plan, t.text
FROM sys.dm_exec_cached_plans c
CROSS APPLY sys.dm_exec_query_plan(c.plan_handle) p
CROSS APPLY sys.dm_exec_sql_text(c.plan_handle) t
WHERE t.text LIKE '%YEAR%';

I opened it up, and sure enough, no sign of that 7,276 value. It looks just the same as the estimated plan I just showed.

Getting plans out of the cache is where the estimated values come into their own. It’s not just that I’d prefer to not actually run potentially-expensive queries on customer databases. Querying the plan cache is one thing, but running queries to get the actuals – that’s a lot harder.

With SQL 2016 SP1 installed, thanks to that Connect item, I can now see the Estimated Number of Rows to be Read property in estimated plans, and in the plan cache. The operator tooltip shown here is taken from the cache, and I can easily see that Estimated property showing 7,276, as well as the residual warning:

rf-ertbr-pe1

This is something which I could do on a customer box, looking in the cache for situations in problematic plans where the ratio of Estimated Number of Rows to be Read to Estimated Number of Rows isn’t great. Potentially, someone could make a process that checked every plan in the cache, but it’s not something that I’ve done.
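If you did want to sketch such a process, a starting point might look something like this – the showplan attribute names here are my best understanding of the schema, so verify them against your SQL Server version before relying on it:

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT t.text,
       op.value('@EstimateRows', 'float')      AS EstimatedRows,
       op.value('@EstimatedRowsRead', 'float') AS EstimatedRowsRead
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS t
CROSS APPLY qp.query_plan.nodes('//RelOp') AS r(op)
WHERE op.value('@EstimatedRowsRead', 'float') > 10 * op.value('@EstimateRows', 'float');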

Astute readers will have noticed that the Actual Rows count that came out of this operator was 149, which was much smaller than the estimated 1,382.56. But when I’m looking for Residual Predicates that are having to check too many rows, the ratio of 1,382.56 : 7,276 is still significant.

Now that we've found that this query is inefficient without even needing to run it, the way to fix it is to make sure that the Residual Predicate is sufficiently SARGable. This query…

SELECT COUNT(*) 
FROM Sales.Orders
WHERE SalespersonPersonID = 7 
AND OrderDate >= '20130401' 
AND OrderDate <  '20130501';

...gives the same results, and doesn’t have a Residual Predicate. In this situation, the Estimated Number of Rows to be Read value is identical to the Estimated Number of Rows, and the inefficiency is gone:

rf-ertbr-pe-2

As mentioned earlier, this post is part of this month’s T-SQL Tuesday. Why not head over there to see what other feature requests have been granted recently?

The post Estimated Number of Rows to be Read appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2017/01/sql-performance/estimated-number-of-rows-to-be-read/feed 2
Implementing a custom sort https://sqlperformance.com/2016/10/sql-plan/implementing-custom-sort https://sqlperformance.com/2016/10/sql-plan/implementing-custom-sort#comments Mon, 17 Oct 2016 15:16:02 +0000 https://sqlperformance.com/?p=8397 Rob Farley (@rob_farley) discusses some solutions and gotchas for implementing a custom sort using ORDER BY in T-SQL queries.

The post Implementing a custom sort appeared first on SQLPerformance.com.

]]>

I saw this tweet come through…

And it made me look at what it was referring to, because I hadn't written anything 'recently' on StackOverflow about ordering data. Turns out it was this answer I'd written, which, although it wasn't the accepted answer, has collected over a hundred votes.

The person asking the question had a very simple problem – wanting to get certain rows to appear first. And my solution was simple:

ORDER BY CASE WHEN city = 'New York' THEN 1 ELSE 2 END, City;

It seems to have been a popular answer, including for Joel Sacco (according to that tweet above).

The idea is to form an expression, and order by that. ORDER BY doesn't care whether it's an actual column or not. You could've done the same using APPLY, if you really prefer to use a 'column' in your ORDER BY clause.

SELECT Users.*
FROM Users
CROSS APPLY 
(
  SELECT CASE WHEN City = 'New York' THEN 1 ELSE 2 END 
  AS OrderingCol
) o
ORDER BY o.OrderingCol, City;

If I use some queries against WideWorldImporters, I can show you why these two queries really are exactly the same. I'm going to query the Sales.Orders table, asking for the Orders for Salesperson 7 to appear first. I'm also going to create an appropriate covering index:

CREATE INDEX rf_Orders_SalesPeople_OrderDate 
ON Sales.Orders(SalespersonPersonID) INCLUDE (OrderDate);

The plans for these two queries look identical. They perform identically – same reads, same expressions, they really are the same query. If there's a slight difference in the actual CPU or Duration, then that's a fluke because of other factors.

SELECT OrderID, SalespersonPersonID, OrderDate
FROM Sales.Orders
ORDER BY CASE WHEN SalespersonPersonID = 7 THEN 1 ELSE 2 END, SalespersonPersonID;

SELECT OrderID, SalespersonPersonID, OrderDate
FROM Sales.Orders
CROSS APPLY 
(
  SELECT CASE WHEN SalespersonPersonID = 7 THEN 1 ELSE 2 END 
  AS OrderingCol
) o
ORDER BY o.OrderingCol, SalespersonPersonID;

And yet this is not the query that I would actually use in this situation. Not if performance were important to me. (It usually is, but it's not always worth writing a query the long way if the amount of data is small.)

What bothers me is that Sort operator. It's 96.4% of the cost!

Consider if we simply want to order by SalespersonPersonID:

We see that this simpler query's estimated CPU cost is 1.4% of the batch, while the custom-sorted version's is 98.6%. That's SEVENTY TIMES worse. Reads are the same though – that's good. Duration is way worse, and so is CPU.

I'm not fond of Sorts. They can be nasty.

One option I have here is to add a computed column to my table and index that, but that's going to have an impact on anything which looks for all the columns on the table, such as ORMs, Power BI, or anything that does SELECT *. So that's not so great (although if we ever get to add hidden computed columns, that would make for a really nice option here).
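For what it's worth, that option would look something like this – a sketch only, with an index name of my own invention:

ALTER TABLE Sales.Orders
  ADD OrderingCol AS CASE WHEN SalespersonPersonID = 7 THEN 1 ELSE 2 END;

CREATE INDEX rf_Orders_CustomSort
  ON Sales.Orders (OrderingCol, SalespersonPersonID)
  INCLUDE (OrderDate);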

Another option, which is more longwinded (some might suggest that would suit me – and if you thought that: Oi! Don't be so rude!), and uses more reads, is to consider what we'd do in real life if we needed to do this.

If I had a pile of 73,595 orders, sorted by Salesperson order, and I needed to return them with a particular Salesperson first, I wouldn't disregard the order they were in and simply sort them all. I'd start by diving in and finding the ones for Salesperson 7 – keeping them in the order they were in. Then I'd find the ones that weren't for Salesperson 7 – putting them next, and again keeping them in the order they were already in.

In T-SQL, that's done like this:

SELECT OrderID, SalespersonPersonID, OrderDate
FROM
(
  SELECT OrderID, SalespersonPersonID, OrderDate, 
     1 AS OrderingCol
  FROM Sales.Orders  
  WHERE SalespersonPersonID = 7
  UNION ALL
  SELECT OrderID, SalespersonPersonID, OrderDate, 
     2 AS OrderingCol
  FROM Sales.Orders
  WHERE SalespersonPersonID != 7
) o
ORDER BY o.OrderingCol, o.SalespersonPersonID;

This gets two sets of data and concatenates them. But the Query Optimizer can see that it needs to maintain the SalespersonPersonID order, once the two sets are concatenated, so it does a special kind of concatenation that maintains that order. It's a Merge Join (Concatenation) operator, and the plan looks like this:

You can see it's a lot more complicated. But hopefully you'll also notice that there's no Sort operator. The Merge Join (Concatenation) pulls the data from each branch, and produces a dataset which is in the right order. In this case, it will pull all 7,276 rows for Salesperson 7 first, and then pull the other 66,319, because that's the required order. Within each set, the data is in SalespersonPersonID order, which is maintained as the data flows through.

I mentioned earlier that it uses more reads, and it does. If I show the SET STATISTICS IO output, comparing the two queries, I see this:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 1, logical reads 157, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 

Table 'Orders'. Scan count 3, logical reads 163, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Using the "Custom Sort" version, it's just one scan of the index, using 157 reads. Using the "Union All" method, it's three scans – one for SalespersonPersonID = 7, one for SalespersonPersonID < 7, and one for SalespersonPersonID > 7. We can see those last two by looking at the properties of the second Index Seek:

For me, though, the benefit comes through in the lack of a Worktable.

Look at the estimated CPU cost:

It's not as small as our 1.4% when we avoid the sort completely, but it's still a vast improvement over our Custom Sort method.

But a word of warning…

Suppose I had created that index differently, and had OrderDate as a key column rather than as an included column.

CREATE INDEX rf_Orders_SalesPeople_OrderDate 
ON Sales.Orders(SalespersonPersonID, OrderDate);

Now, my "Union All" method doesn't work as intended at all.

Despite using exactly the same queries as before, my nice plan now has two Sort operators, and it performs nearly as badly as my original Scan + Sort version.

The reason for this is a quirk of the Merge Join (Concatenation) operator, and the clue is in the Sort operator.

It's ordering by SalespersonPersonID followed by OrderID – which is the clustered index key of the table. It chooses this because this is known to be unique, and it's a smaller set of columns to sort by than SalespersonPersonID followed by OrderDate followed by OrderID, which is the dataset order produced by three index range scans. One of those times when the Query Optimizer doesn't notice a better option that's right there.

With this index, we would need our dataset ordered by OrderDate as well to produce our preferred plan.

SELECT OrderID, SalespersonPersonID, OrderDate
FROM 
(
  SELECT OrderID, SalespersonPersonID, OrderDate, 
    1 AS OrderingCol
  FROM Sales.Orders
  WHERE SalespersonPersonID = 7
  UNION ALL
  SELECT OrderID, SalespersonPersonID, OrderDate, 
    2 AS OrderingCol
  FROM Sales.Orders
  WHERE SalespersonPersonID != 7
) o
ORDER BY o.OrderingCol, o.SalespersonPersonID, o.OrderDate;

So it's definitely more effort. The query is longer for me to write, it's more reads, and I have to have an index without extra key columns. But it's certainly quicker. With even more rows, the impact is bigger still, and I don't have to risk a Sort spilling to tempdb either.

For small sets, my StackOverflow answer is still good. But when that Sort operator is costing me in performance, then I'm going with the Union All / Merge Join (Concatenation) method.

The post Implementing a custom sort appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2016/10/sql-plan/implementing-custom-sort/feed 9
AT TIME ZONE – a new favourite feature in SQL Server 2016 https://sqlperformance.com/2016/07/sql-plan/at-time-zone https://sqlperformance.com/2016/07/sql-plan/at-time-zone#comments Thu, 28 Jul 2016 13:32:38 +0000 http://sqlperformance.com/?p=8284 Rob Farley explains why AT TIME ZONE is his new favourite feature in SQL Server 2016, but also shows that it can make a mess of cardinality estimates.

The post AT TIME ZONE – a new favourite feature in SQL Server 2016 appeared first on SQLPerformance.com.

]]>

Image © Mark Boyle | Australia Day Council of NSW.
Images property of respective artist(s). All rights reserved.

AT TIME ZONE is such a cool feature and I hadn’t noticed it until recently, even though Microsoft have had a page about it since December.

I live in Adelaide in Australia. And like over a billion other people in the world, Adelaide people have to cope with being on a half-hour time zone. In winter time, we’re UTC+9:30, and in summer time we’re UTC+10:30. Except that if you’re reading this in the northern hemisphere, you’ll need to remember that by ‘winter’, I mean April to October. Summer time is October to April, and Santa Claus sits on the beach with a cold drink, sweating through his thick red suit and beard. Unless he’s out saving lives, of course.

Within Australia, we have three main time zones (Western, Central, and Eastern), but this extends to five in the summer, as the three states which extend to the northern end of Australia (WA, Qld, and NT) don’t try to save any daylight. They’re close enough to the equator to not care, or something like that. It’s loads of fun for the Gold Coast airport, whose runway crosses the NSW-QLD border.

Database servers often run in UTC, because it’s simply easier to not have to deal with converting between UTC and local (and the reverse) in SQL Server. Many years ago I remember having to fix a report which listed incidents that occurred along with response times (I have blogged about this since). Measuring SLA was quite straightforward – I could see that one incident happened during the customer’s working hours, and that they responded within one hour. I could see that another incident occurred outside working hours, and the response was within two hours. The problem came when a report was produced at the end of a period when the time zone changed, causing an incident that actually happened at 5:30pm (outside hours) to be listed as if it had occurred at 4:30pm (inside hours). The response had taken about 90 minutes, which was okay, but the report was showing otherwise.

All this is fixed in SQL Server 2016.

How to use AT TIME ZONE in SQL Server 2016

Now, with AT TIME ZONE, instead of saying: '20160101 00:00 +10:30', I can start with a datetime value which does not have a time zone offset, and use AT TIME ZONE to explain that it’s in Adelaide.

SELECT CONVERT(datetime,'20160101 00:00') 
    AT TIME ZONE 'Cen. Australia Standard Time';

-- 2016-01-01 00:00:00.000 +10:30

And this can be converted to the American time by appending AT TIME ZONE again.

SELECT CONVERT(datetime,'20160101 00:00') 
    AT TIME ZONE 'Cen. Australia Standard Time' 
    AT TIME ZONE 'US Eastern Standard Time';

-- 2015-12-31 08:30:00.000 -05:00

Now, I know this is a lot more longwinded. And I need to explicitly convert the string to datetime, to avoid an error saying:

Argument data type varchar is invalid for argument 1 of AT TIME ZONE function.

But despite the longwindedness of it, I love it, because at no point did I need to figure out that Adelaide was in +10:30, or that Eastern was -5:00 – I simply needed to know the time zone by name. Figuring out whether daylight saving should apply or not was handled for me, and I did not have to do any conversion from local to UTC to establish some baseline.
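For anyone wanting to play with the same idea outside SQL Server: Python’s zoneinfo module does the equivalent lookups, though from the IANA database, so the zone names differ from the Windows ones SQL Server uses. A rough sketch:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+; IANA zone names, not Windows ones

# Start with a "naive" local time and attach a zone, like AT TIME ZONE on a datetime.
adelaide = datetime(2016, 1, 1, 0, 0, tzinfo=ZoneInfo("Australia/Adelaide"))
print(adelaide.isoformat())   # 2016-01-01T00:00:00+10:30 (daylight saving applied)

# Chain a second conversion, like appending another AT TIME ZONE.
eastern = adelaide.astimezone(ZoneInfo("America/New_York"))
print(eastern.isoformat())    # 2015-12-31T08:30:00-05:00
```

As with AT TIME ZONE, at no point did I have to know the offsets myself – the library worked out +10:30 and -05:00, daylight saving included.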

It works by using the Windows registry, which has all that information in it, but sadly, it’s not perfect when looking back in time. Australia changed the dates in 2008, and the US changed its dates in 2005 – both countries saving daylight for more of the year. AT TIME ZONE understands this. But it doesn’t seem to appreciate that in Australia in the year 2000, thanks to the Sydney Olympics, Australia started daylight saving about two months earlier. This is a little frustrating, but it’s not SQL’s fault – we need to blame Windows for that. I guess the Windows registry doesn’t remember the hotfix that went around that year. (Note to self: I might need to ask someone in the Windows team to fix that…)

The usefulness just continues though!

That time zone name doesn’t even need to be a constant. I can pass variables in, and even use columns:

WITH PeopleAndTZs AS
(
  SELECT * FROM (VALUES 
    ('Rob',   'Cen. Australia Standard Time'),
    ('Paul',  'New Zealand Standard Time'),
    ('Aaron', 'US Eastern Standard Time')
  ) t (person, tz)
)
SELECT tz.person, SYSDATETIMEOFFSET() AT TIME ZONE tz.tz
  FROM PeopleAndTZs tz;

/*
  Rob      2016-07-18 18:29:11.9749952 +09:30
  Paul     2016-07-18 20:59:11.9749952 +12:00
  Aaron    2016-07-18 04:59:11.9749952 -04:00
*/

(Because I ran that just before 6:30pm here in Adelaide, which happens to be nearly 9pm in New Zealand where Paul is, and nearly 5am this morning in the eastern bit of America where Aaron is.)

This would let me easily see what time it is for people wherever they are in the world, and to see who would be best to respond to some issue, without having to perform any manual datetime conversions. And even more so, it would let me do it for people in the past. I could have a report which analyses which time zones would allow the greatest number of events to occur during business hours.

Those time zones are listed in sys.time_zone_info, along with what the current offset is, and whether daylight saving is currently applied.

name                          current_utc_offset  is_currently_dst
Singapore Standard Time       +08:00              0
W. Australia Standard Time    +08:00              0
Taipei Standard Time          +08:00              0
Ulaanbaatar Standard Time     +09:00              1
North Korea Standard Time     +08:30              0
Aus Central W. Standard Time  +08:45              0
Transbaikal Standard Time     +09:00              0
Tokyo Standard Time           +09:00              0

Sampling of rows from sys.time_zone_info

I’m only really interested in what the name is, but anyway. And it’s interesting to see that there is a time zone called “Aus Central W. Standard Time” which is on the quarter-hour. Go figure. Also worth noting that places are referred to using their Standard Time name, even if they’re currently observing DST. Such as Ulaanbaatar in that list above, which isn’t listed as Ulaanbaatar Daylight Time. This may throw people for a loop when they start using AT TIME ZONE.

Can AT TIME ZONE cause performance issues?

Now, I’m sure you’re wondering what the impact of using AT TIME ZONE might be on indexing.

In terms of the shape of the plan, it’s no different to dealing with datetimeoffset in general. If I have datetime values, such as in the AdventureWorks column Sales.SalesOrderHeader.OrderDate (upon which I created an index called rf_IXOD), then running both this query:

select OrderDate, SalesOrderID
  from Sales.SalesOrderHeader
  where OrderDate >= convert(datetime,'20110601 00:00') at time zone 'US Eastern Standard Time' 
    and OrderDate <  convert(datetime,'20110701 00:00') at time zone 'US Eastern Standard Time' ;

And this query:

select OrderDate, SalesOrderID
  from Sales.SalesOrderHeader
  where OrderDate >= convert(datetimeoffset,'20110601 00:00 -04:00')
    and OrderDate <  convert(datetimeoffset,'20110701 00:00 -04:00');

In both cases, you get plans that look like this:

Identical plans for above queries

But if we explore a little more closely, there is a problem.

The one that uses AT TIME ZONE doesn’t use the statistics very well. It thinks it’s going to see 5,170 rows come out of that Index Seek, when there’s actually only 217. Why 5,170? Well, Aaron’s recent post, “Paying Attention To Estimates,” explains it, by referring to the post “Cardinality Estimation for Multiple Predicates” from Paul. 5,170 is 31,465 (rows in the table) * 0.3 * sqrt(0.3).

Estimates

The second query gets it right, estimating 217. No functions involved, you see.
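That arithmetic is easy to check – a quick Python sanity check of the exponential backoff formula quoted above:

```python
from math import sqrt

table_rows = 31465

# Two range predicates with no usable statistics each get a 30% selectivity guess;
# under exponential backoff the second predicate only contributes its square root.
estimate = table_rows * 0.3 * sqrt(0.3)
print(round(estimate))  # 5170
```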

This is probably fair enough. I mean – at the point when it’s producing the plan, it won’t have asked the registry for the information it needs, so it really doesn’t know how many to estimate. But there is potential for it to be a problem.

If I add extra predicates which I know can’t be a problem, then my estimates actually drop even further – down to 89.9 rows.

select OrderDate, SalesOrderID
  from Sales.SalesOrderHeader
  where OrderDate >= convert(datetime,'20110601 00:00') at time zone 'US Eastern Standard Time' 
  and OrderDate   <  convert(datetime,'20110701 00:00') at time zone 'US Eastern Standard Time'
  and OrderDate   >= convert(datetimeoffset,'20110601 00:00 +14:00')
  and OrderDate   <  convert(datetimeoffset,'20110701 00:00 -12:00');

Estimating too many rows means too much memory is allocated, but estimating too few can cause too little memory, with potentially needing a spill to correct the problem (which can often be disastrous from a performance perspective). Go read Aaron’s post for more information about how poor estimates can be bad.

When I consider how to handle displaying values for those people from before, I can use queries like this:

WITH PeopleAndTZs AS
(
  SELECT * FROM (VALUES 
    ('Rob',   'Cen. Australia Standard Time'),
    ('Paul',  'New Zealand Standard Time'),
    ('Aaron', 'US Eastern Standard Time')
  ) t (person, tz)
)
SELECT tz.person, o.SalesOrderID, o.OrderDate AT TIME ZONE 'UTC' AT TIME ZONE tz.tz
FROM PeopleAndTZs tz
CROSS JOIN Sales.SalesOrderHeader o
WHERE o.SalesOrderID BETWEEN 44001 AND 44010;

And get this plan:

image

…which has no such concerns – the right-most Compute Scalar is converting the datetime OrderDate into datetimeoffset for UTC, and the left-most Compute Scalar is converting it into the appropriate time zone for the person. The warning is because I’m doing a CROSS JOIN, and that was fully intentional.

Advantages and Disadvantages of other time conversion methods

Before AT TIME ZONE, one of my favourite, but often unappreciated, features of SQL 2008 was the data type datetimeoffset. This allows date/time data to be stored with the time zone as well, such as '20160101 00:00 +10:30', which is when we celebrated New Year in Adelaide this year. To see when that was in US Eastern, I can use the function SWITCHOFFSET.

SELECT SWITCHOFFSET('20160101 00:00 +10:30', '-05:00');

-- 2015-12-31 08:30:00.0000000 -05:00

This is the same moment in time, but in a different part of the world. If I were on the phone to someone in North Carolina or New York, wishing them a Happy New Year because it was just past midnight in Adelaide, they would be saying “What do you mean? It’s still breakfast time here on New Year’s Eve!”

The problem is that to do this, I need to know that in January, Adelaide is +10:30 and US Eastern is –5:00. And that’s often a pain. Especially if I’m asking about late March, early April, October, early November - those times of year when people can’t be sure which time zone people in other countries were in because they change by an hour for daylight saving, and they all do so according to different rules. My computer tells me what time zone people are in now, but it’s much harder to tell what time zone they will be in at other times of the year.
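That pain shows up clearly if you sketch the same fixed-offset arithmetic in Python (illustrative only – this is what SWITCHOFFSET does, not AT TIME ZONE):

```python
from datetime import datetime, timedelta, timezone

# SWITCHOFFSET keeps the same instant but re-expresses it at another fixed offset.
# Both offsets are hard-coded -- nothing here knows about daylight saving rules.
new_year_adelaide = datetime(2016, 1, 1, 0, 0,
                             tzinfo=timezone(timedelta(hours=10, minutes=30)))
us_eastern = new_year_adelaide.astimezone(timezone(timedelta(hours=-5)))
print(us_eastern.isoformat())  # 2015-12-31T08:30:00-05:00
```

I had to supply +10:30 and -05:00 myself, which is exactly the part that goes wrong near daylight saving transitions.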

For some other information about converting between time zones, and how folks like Aaron have dealt with daylight saving, see the following links:

And some official Microsoft documentation:

Should you use AT TIME ZONE?

AT TIME ZONE isn’t perfect. But it is really useful – incredibly so. It’s flexible enough to accept columns and variables as input, and I can see a huge amount of potential for it. But if it’s going to cause my estimates to be out, then I’m going to need to be careful. For display purposes, this shouldn’t matter at all though, and that’s where I can see it being most useful. Making it easier to convert a display value to or from UTC (or to or from a local time), without anyone having to bust their brains about offsets and DST, is a big win.

This really is one of my favourite features of SQL Server 2016. I’ve been crying out for something like this for a very long time.

Oh, and most of those billion people on the half-hour time zone are in India. But you probably already knew that…

The post AT TIME ZONE – a new favourite feature in SQL Server 2016 appeared first on SQLPerformance.com.

]]>
https://sqlperformance.com/2016/07/sql-plan/at-time-zone/feed 34
SQL Server 2016 Temporal Table Query Plan Behaviour https://sqlperformance.com/2016/06/sql-server-2016/temporal-table-query-plan-behaviour Tue, 14 Jun 2016 00:20:09 +0000 http://sqlperformance.com/?p=8194 For this month's T-SQL Tuesday, Rob Farley takes a look at a couple of unexpected aspects of query plans you might observe when using Temporal Tables in SQL Server 2016.

The post SQL Server 2016 Temporal Table Query Plan Behaviour appeared first on SQLPerformance.com.

]]>
T-SQL Tuesday #79

Isn’t it great to have a new version of SQL Server available? This is something that only happens every couple of years, and this month we saw one reach General Availability. (Ok, I know we get a new version of SQL Database in Azure almost continuously, but I count this as different.) Acknowledging this new release, this month’s T-SQL Tuesday (hosted by Michael Swart, @mjswart) is on the topic of all things SQL Server 2016!

So today I want to look at SQL 2016’s Temporal Tables feature, and have a look at some query plan situations you could end up seeing. I love Temporal Tables, but have come across a bit of a gotcha that you might want to be aware of.

Now, despite the fact that SQL Server 2016 is now in RTM, I’m using AdventureWorks2016CTP3, which you can download here – but don’t just download AdventureWorks2016CTP3.bak, also grab SQLServer2016CTP3Samples.zip from the same site.

You see, in the Samples archive, there are some useful scripts for trying out new features, including some for Temporal Tables. It’s win-win – you get to try a bunch of new features, and I don’t have to repeat so much script in this post. Anyway, go and grab the two scripts about Temporal Tables, running AW 2016 CTP3 Temporal Setup.sql, followed by Temporal System-Versioning Sample.sql.

These scripts set up temporal versions of a few tables, including HumanResources.Employee. It creates HumanResources.Employee_Temporal (although, technically, it could’ve been called anything). At the end of the CREATE TABLE statement, this bit appears, adding two hidden columns to indicate when the row is valid, and indicating that a table called HumanResources.Employee_Temporal_History should be created to store the old versions.

  ...
  ValidFrom datetime2(7) GENERATED ALWAYS AS ROW START HIDDEN NOT NULL,
  ValidTo datetime2(7)   GENERATED ALWAYS AS ROW END   HIDDEN NOT NULL,
  PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
) WITH (SYSTEM_VERSIONING = ON 
  (HISTORY_TABLE = [HumanResources].[Employee_Temporal_History])
);

What I want to explore in this post is what happens with query plans when the history is used.

If I query the table to see the latest row for a particular BusinessEntityID, I get a Clustered Index Seek, as expected.

SELECT e.BusinessEntityID, e.ValidFrom, e.ValidTo
FROM HumanResources.Employee_Temporal AS e
WHERE e.BusinessEntityID = 4;

Expected clustered index seek

I’m sure I could query this table using other indexes, if it had any. But in this case, it doesn’t. Let’s create one.

CREATE UNIQUE INDEX rf_ix_Login 
on HumanResources.Employee_Temporal(LoginID);

Now I can query the table by LoginID, and will see a Key Lookup if I ask for columns other than LoginID or BusinessEntityID. None of this is surprising.

SELECT * FROM HumanResources.Employee_Temporal e
WHERE e.LoginID = N'adventure-works\rob0';

Key Lookup to cover additional columns

Let’s use SQL Server Management Studio for a minute, and have a look at how this table looks in Object Explorer.

Temporal Table representation in SSMS Object Explorer

We can see the History table mentioned under HumanResources.Employee_Temporal, and the columns and indexes from both the table itself and the history table. But while the indexes on the proper table are the Primary Key (on BusinessEntityID) and the index I had just created, the History table doesn’t have matching indexes.

The index on the history table is on ValidTo and ValidFrom. We can right-click the index and select Properties, and we see this dialog:

History table clustered index properties

A new row is inserted into this History table whenever a row in the main table stops being valid, because it has just been deleted or changed. The values in the ValidTo column are naturally populated with the current time, so ValidTo acts as an ascending key, like an identity column, and new inserts appear at the end of the b-tree structure.

But how does this perform when you want to query the table?

If we want to query our table for what was current at a particular point in time, then we should use a query structure such as:

SELECT * FROM HumanResources.Employee_Temporal
FOR SYSTEM_TIME AS OF '20160612 11:22';

This query needs to concatenate the appropriate rows from the main table with the appropriate rows from the history table.

Concatenation from history table

In this scenario, the rows that were valid for the moment I picked were all from the history table, but nonetheless, we see a Clustered Index Scan against the main table, which was filtered by a Filter operator. The predicate of this filter is:

[HumanResources].[Employee_Temporal].[ValidFrom] <= '2016-06-12 11:22:00.0000000' 
AND [HumanResources].[Employee_Temporal].[ValidTo] > '2016-06-12 11:22:00.0000000'

Let’s revisit this in a moment.

The Clustered Index Seek on the History table must clearly be leveraging a Seek Predicate on ValidTo. The Start of the Seek’s Range Scan is HumanResources.Employee_Temporal_History.ValidTo > Scalar Operator('2016-06-12 11:22:00'), but there is no End, because every row that has a ValidTo after the time we care about is a candidate row, and must be tested for an appropriate ValidFrom value by the Residual Predicate, which is HumanResources.Employee_Temporal_History.ValidFrom <= '2016-06-12 11:22:00'.
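As a toy model of that seek-plus-residual combination (made-up rows and dates, Python rather than anything SQL Server actually runs):

```python
from datetime import datetime

as_of = datetime(2016, 6, 12, 11, 22)

# Made-up history rows as (ValidTo, ValidFrom), in clustered (ValidTo, ValidFrom) order.
history = [
    (datetime(2016, 6, 5),  datetime(2016, 6, 1)),
    (datetime(2016, 6, 20), datetime(2016, 6, 5)),
    (datetime(2016, 7, 1),  datetime(2016, 6, 20)),
]

# Seek Predicate: the range scan starts at the first ValidTo > as_of and has no end,
# so every later row in the b-tree is a candidate.
candidates = [row for row in history if row[0] > as_of]

# Residual Predicate: each candidate is still tested row-by-row on ValidFrom.
matches = [row for row in candidates if row[1] <= as_of]
print(matches)  # just the version valid from 2016-06-05 until 2016-06-20
```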

Now, intervals are hard to index for; that’s a known thing that has been discussed on many blogs. Most effective solutions consider creative ways to write queries, but no such smarts have been built into Temporal Tables. You can, though, put indexes on other columns too, such as on ValidFrom, or even have indexes that match the types of queries you might have on the main table. With a clustered index being a composite key on both ValidTo and ValidFrom, these two columns get included on every other column, providing a good opportunity for some Residual Predicate testing.

If I know which loginid I’m interested in, my plan forms a different shape.

Different plan with specific LoginID requested

The top branch of the Concatenation operator looks similar to before, although that Filter operator has entered the fray to remove any rows that are not valid, but the Clustered Index Seek on the lower branch has a Warning. This is a Residual Predicate warning, like the examples in an earlier post of mine. It’s able to filter to entries that are valid until some point after the time we care about, but the Residual Predicate now filters to the LoginID as well as ValidFrom.

[HumanResources].[Employee_Temporal_History].[ValidFrom] <= '2016-06-12 11:22:00.0000000' 
AND [HumanResources].[Employee_Temporal_History].[LoginID] = N'adventure-works\rob0'

Changes to rob0’s rows are going to be a tiny proportion of the rows in the History. This column won’t be unique like in the main table, because the row may have been changed multiple times, but there is still a good candidate for indexing.

CREATE INDEX rf_ixHist_loginid
ON HumanResources.Employee_Temporal_History(LoginID);

This new index has a notable effect on our plan.

It’s now changed our Clustered Index Seek into a Clustered Index Scan!!

Seek changed to a scan

You see, the Query Optimizer now works out that the best thing to do would be to use the new index. But it also decides that the effort in having to do lookups to get all the other columns (because I was asking for all columns) would be simply too much work. The tipping point was reached (sadly an incorrect assumption in this case), and a Clustered Index Scan was chosen instead. Even though, without the non-clustered index, the best option would have been a Clustered Index Seek, once the non-clustered index has been considered and rejected for tipping-point reasons, it chooses to scan.

Frustratingly, I’ve only just created this index and its statistics should be good. It should know that a Seek that requires exactly one lookup should be better than a Clustered Index Scan (only by statistics – if you were thinking it should know this because LoginID is unique in the main table, remember that it may not always have been). So I suspect that lookups should be avoided in history tables, although I haven’t done quite enough research into this yet.

Were we to query only columns that appear in our non-clustered index, though, we would get much better behaviour. With no lookup required, our new index on the history table is happily used. It still needs to apply a Residual Predicate, based on only being able to filter to LoginID and ValidTo, but it behaves much better than dropping into a Clustered Index Scan.

SELECT LoginID, ValidFrom, ValidTo
FROM HumanResources.Employee_Temporal
FOR SYSTEM_TIME AS OF '20160612 11:22'
WHERE LoginID = N'adventure-works\rob0'

Using covered columns eliminates scans, lookup, filter, and residual warnings

So do index your history tables in extra ways, considering how you will be querying them. Include the necessary columns to avoid lookups, because you’re really avoiding Scans.

These history tables can grow large if data is changing frequently. So be mindful of how they are being handled. This same situation occurs when using the other FOR SYSTEM_TIME constructs, so you should (as always) review the plans your queries are producing, and index to make sure that you are well positioned to leverage what is a very powerful feature of SQL Server 2016.

The post SQL Server 2016 Temporal Table Query Plan Behaviour appeared first on SQLPerformance.com.

]]>
What's "Actually" going on with that Seek? https://sqlperformance.com/2016/06/sql-plan/whats-actually-going-on-with-that-seek https://sqlperformance.com/2016/06/sql-plan/whats-actually-going-on-with-that-seek#comments Fri, 10 Jun 2016 12:57:25 +0000 http://sqlperformance.com/?p=8180 Rob Farley (@rob_farley) digs deeper into seeks and actual rows read, demonstrating a case where you want a non-sargable expression in the WHERE clause.

The post What's "Actually" going on with that Seek? appeared first on SQLPerformance.com.

]]>
I wrote previously about the Actual Rows Read property. It tells you how many rows are actually read by an Index Seek, so that you can see how selective the Seek Predicate is, compared to the selectiveness of the Seek Predicate plus Residual Predicate combined.

But let’s have a look at what’s actually going on inside the Seek operator. Because I’m not convinced that “Actual Rows Read” is necessarily an accurate description of what’s going on.

I want to look at an example that queries addresses of particular address types for a customer, but the principle here would easily apply to many other situations if the shape of your query fits, such as looking up attributes in a Key-Value Pair table, for example.

SELECT AddressTypeID, FullAddress
FROM dbo.Addresses
WHERE CustomerID = 783
AND AddressTypeID IN (2,4,5);

I know I haven’t shown you anything about the metadata – I’ll come back to that in a minute. Let’s have a think about this query and what kind of index we’d like to have for it.

Firstly, we know the CustomerID exactly. An equality match like this generally makes it an excellent candidate for the first column in an index. If we had an index on this column we could dive straight into the addresses for that customer – so I’d say that’s a safe assumption.

The next thing to consider is that filter on AddressTypeID. Adding a second column to the keys of our index is perfectly reasonable, so let’s do that. Our index is now on (CustomerID, AddressTypeID). And let’s INCLUDE FullAddress too, so that we don’t need to do any lookups to complete the picture.

And I think we’re done. We should be able to safely assume that the ideal index for this query is:

CREATE INDEX ixIdealIndex 
ON dbo.Addresses (CustomerID, AddressTypeID)
INCLUDE (FullAddress);

We could potentially declare it as a unique index – we’ll look at the impact of that later.

So let’s create a table (I’m using tempdb, because I don’t need it to persist beyond this blog post) and test this out.

CREATE TABLE dbo.Addresses (
  AddressID INT IDENTITY(1,1) PRIMARY KEY,
  CustomerID INT NOT NULL,
  AddressTypeID INT NOT NULL,
  FullAddress NVARCHAR(MAX) NOT NULL,
  SomeOtherColumn DATE NULL
);

I’m not interested in foreign key constraints, or what other columns there might be. I’m only interested in my Ideal Index. So create that too, if you haven’t already.

My plan seems pretty perfect.

image

I have an index seek, and that’s it.

Granted, there’s no data, so there’s no reads, no CPU, and it runs pretty quickly too. If only all queries could be tuned as well as this.

Let’s see what’s going on a little closer, by looking at the properties of the Seek.

image

We can see the Seek Predicates. There are six. Three about the CustomerID, and three about the AddressTypeID. What we actually have here are three sets of seek predicates, indicating three seek operations within the single Seek operator. The first seek is looking for Customer 783 and AddressType 2. The second is looking for 783 and 4, and the last 783 and 5. Our Seek operator appeared once, but there were three seeks going on inside it.

We don’t even have data, but we can see how our index is going to be used.

Let’s put some dummy data in, so that we can look at some of the impact of this. I’m going to put addresses in for types 1 to 5. Every customer (over 2000, based on the size of master..spt_values) will have an address of type 1. Maybe that’s the Primary Address. I’m letting 80% have a type 2 address, 60% a type 3, and so on, down to 20% for a type 5 address. Customer 783 will get addresses of type 1, 2, 3, and 4, but not 5. I’d rather have gone with random values, but I want to make sure we’re on the same page for the examples.

WITH nums AS (
    SELECT row_number() OVER (ORDER BY (SELECT 1)) AS num
    FROM master..spt_values
)
INSERT dbo.Addresses (CustomerID, AddressTypeID, FullAddress)
SELECT num AS CustomerID, 1 AS AddressTypeID, N'Some sample text for the address' AS FullAddress
FROM nums
UNION ALL
SELECT num AS CustomerID, 2 AS AddressTypeID, N'Some sample text for the address' AS FullAddress
FROM nums
WHERE num % 10 < 8
UNION ALL
SELECT num AS CustomerID, 3 AS AddressTypeID, N'Some sample text for the address' AS FullAddress
FROM nums
WHERE num % 10 < 6
UNION ALL
SELECT num AS CustomerID, 4 AS AddressTypeID, N'Some sample text for the address' AS FullAddress
FROM nums
WHERE num % 10 < 4
UNION ALL
SELECT num AS CustomerID, 5 AS AddressTypeID, N'Some sample text for the address' AS FullAddress
FROM nums
WHERE num % 10 < 2
;
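As a quick sanity check of those thresholds – a Python sketch mirroring the num % 10 logic above:

```python
# Customer 783 has num % 10 == 3; each address type is granted when that value
# falls below the type's threshold (type 1 -> everyone, type 5 -> 20% of customers).
customer = 783
thresholds = {1: 10, 2: 8, 3: 6, 4: 4, 5: 2}
address_types = [t for t, limit in thresholds.items() if customer % 10 < limit]
print(address_types)  # [1, 2, 3, 4]
```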

Now let’s look at our query with data. Two rows are coming out. It’s like before, but we now see the two rows coming out of the Seek operator, and we see six reads (in the top-right).

image

Six reads makes sense to me. We have a small table, and the index fits on just two levels. We’re doing three seeks (within our one operator), so the engine is reading the root page, finding out which page to go down to and reading that, and doing that three times.

If we were to just look for two AddressTypeIDs, we’d see just 4 reads (and in this case, a single row output). Excellent.

image

And if we were looking for 8 address types, then we’d see 16.

image
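The arithmetic is easy to model: each IN-list value becomes its own seek inside the one Seek operator, and each seek reads one page per index level (a sketch, not tied to any real engine internals):

```python
# One seek per IN-list value; each seek reads one page per index level.
index_levels = 2  # root plus leaf, as described above
for address_types in (2, 3, 8):
    print(address_types, "address types ->", address_types * index_levels, "reads")
```

That reproduces the 4, 6, and 16 reads seen above.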

Yet each of these show that the Actual Rows Read matches the Actual Rows exactly. No inefficiency at all!

image

Let’s go back to our original query, looking for address types 2, 4, and 5, (which returns 2 rows) and think about what’s going on inside the seek.

I’m going to assume the Query Engine has already done the work to figure out that the Index Seek is the right operation, and that it has the page number of the index root handy.

At this point, it loads that page into memory, if it’s not already there. That’s the first read that gets counted in the execution of the seek. Then it locates the page number for the row it’s looking for, and reads that page in. That’s the second read.

But we often gloss over that ‘locates the page number’ bit.

By using DBCC IND(2, N'dbo.Address', 2); (the first 2 is the database id, because I’m using tempdb; the second 2 is the index id of ixIdealIndex), I can discover that page 712 in file 1 is the page with the highest IndexLevel – that’s the root page. In the screenshot below, I can also see the pages at IndexLevel 0, the leaf level, such as page 668.

[Screenshot: DBCC IND output, showing page 712 at the highest IndexLevel]

So now I can use DBCC TRACEON(3604); DBCC PAGE (2,1,712,3); to see the contents of page 712. On my machine, I get 84 rows coming back, and I can tell that CustomerID 783 is going to be on page 1004 of file 5.
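Putting those two commands together as a runnable recipe (database id 2 is tempdb here; the file and page numbers will almost certainly differ on your system, so take them from the DBCC IND output):

```sql
-- List the pages belonging to index id 2 (ixIdealIndex) on dbo.Address in tempdb (db id 2)
DBCC IND (2, N'dbo.Address', 2);

-- Send DBCC PAGE output to the client instead of the error log
DBCC TRACEON (3604);

-- Dump the root page: database id 2, file 1, page 712, detail level 3
DBCC PAGE (2, 1, 712, 3);
```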

[Screenshot: DBCC PAGE output for page 712]

But I know this by scrolling through my list until I see the one I want. I started by scrolling down a bit, and then came back up, until I found the row I wanted. A computer calls this a binary search, and it’s a bit more precise than me. It’s looking for the row where the (CustomerID, AddressTypeID) combination is smaller than the one I’m looking for, with the next row being larger or the same as it. I say “the same” because there could be two that match, spread across two pages. It knows there are 84 rows (0 to 83) of data in that page (it reads that in the page header), so it’ll start by checking row 41. From there, it knows which half to search in, and (in this example) it will read row 20. A few more reads (making 6 or 7 in total)* and it knows that row 25 (look at the column called ‘Row’ for this value, not the row number provided by SSMS) is too small, but row 26 is too big – so 25 is the answer!

*In a binary search, the number of probes can vary slightly: it can get marginally lucky when splitting a block that has no exact middle slot, and it depends on whether the middle slot can be eliminated at each step.
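That probing can be sketched as a plain binary search over the page’s 84 slots. This is illustrative pseudologic only – the engine compares keys, not slot numbers – but it shows why roughly 6 or 7 comparisons are needed:

```sql
-- Illustrative only: binary search over slots 0..83 for the first slot whose
-- key is >= the key being sought (here, pretend that's slot 25).
DECLARE @lo int = 0, @hi int = 83, @mid int, @probes int = 0;
DECLARE @target int = 25;

WHILE @lo < @hi
BEGIN
    SET @mid = (@lo + @hi) / 2;
    SET @probes += 1;
    IF @mid < @target        -- stands in for "key at slot @mid < key sought"
        SET @lo = @mid + 1;  -- answer is in the upper half
    ELSE
        SET @hi = @mid;      -- answer is at @mid or below
END;

SELECT @lo AS FoundSlot, @probes AS ComparisonsMade;  -- 25 and 7 for this example
```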

Now it can go into page 1004 in file 5. Let’s use DBCC PAGE on that one.
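Following the pattern above, that would be something like (again, your page numbers will differ):

```sql
DBCC TRACEON (3604);
DBCC PAGE (2, 5, 1004, 3);  -- database id 2 (tempdb), file 5, page 1004, detail level 3
```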

[Screenshot: DBCC PAGE output for page 1004]

This one gives me 94 rows. It does another binary search to find the start of the range that it’s looking for. It has to look through 6 or 7 rows to find that.

“Start of the range?” I can hear you ask. But we’re looking for address type 2 of customer 783.

Right, but we didn’t declare this index as unique, so there could be two. If it were unique, the seek could do a singleton search, and might stumble across the row during the binary search, but in this case it must complete the binary search to find the first row in the range. Here, that’s row 71.

But we don’t stop here. Now we need to see if there really is a second one! So it reads row 72 as well, finds that the CustomerID + AddressTypeID pair is indeed too big, and its seek is done.

And this happens three times. The third time, it doesn’t find a row for customer 783 and address type 5, but it doesn’t know this ahead of time, and still needs to complete the seek.

So the number of rows actually being read across these three seeks (to find two rows to output) is a lot more than the number being returned. There are about 7 at index level 1, and about 7 more at the leaf level just to find the start of the range. Then it reads the row we care about, and then the row after that. That sounds more like 16 to me, and it does this three times, making about 48 rows.

But Actual Rows Read isn’t counting the rows the engine actually touched along the way – it’s the number of rows returned by the Seek Predicate, which then get tested against the Residual Predicate. And by that measure, it’s only the 2 rows found by the 3 seeks.

You might be thinking at this point that there’s a certain amount of inefficiency here. The second seek would’ve also read page 712, checked the same 6 or 7 rows there, and then read page 1004 and hunted through it... as would the third seek.

So perhaps it would’ve been better to get this in a single seek, reading page 712 and page 1004 only once each. After all, if I were doing this with a paper-based system, I would’ve done a seek to find customer 783, and then scanned through all their address types, because I know that a customer doesn’t tend to have many addresses. That’s an advantage I have over the database engine. The database engine knows through its statistics that a seek will be best, but it doesn’t know that the seek should only go down one level, even when it can tell that it has what seems like the Ideal Index.

If I change my query to grab a range of address types, from 2 to 5, then I get almost the behaviour I want:
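Sketched against the assumed query shape from earlier:

```sql
-- One contiguous range: a single seek, but address type 3 comes back too
SELECT CustomerID, AddressTypeID, FullAddress
FROM dbo.Address
WHERE CustomerID = 783
AND AddressTypeID BETWEEN 2 AND 5;
```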

[Screenshot: plan for the range query, showing two reads]

Look – the reads are down to 2, and I know which pages they are...

...but my results are wrong, because I only want address types 2, 4, and 5 – not 3. I need to tell it not to include 3, but I have to be careful how I do this. Look at the next two examples.

[Screenshot: plan with the “not 3” predicate first – two seeks, four reads]

[Screenshot: plan with the “not 3” predicate second – a single seek, two reads]

Normally I can assure you that predicate order doesn’t matter, but here it clearly does. If we put the “not 3” first, it does two seeks (4 reads), but if we put the “not 3” second, it does a single seek (2 reads).

The problem is that AddressTypeID != 3 gets converted to (AddressTypeID > 3 OR AddressTypeID < 3), which is then seen as two very useful seek predicates.
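The two variations would look something like this (sketched; the plans above don’t show the query text, so the exact form is an assumption):

```sql
-- "not 3" first: != 3 becomes (< 3 OR > 3), giving two seek ranges (4 reads)
SELECT CustomerID, AddressTypeID, FullAddress
FROM dbo.Address
WHERE CustomerID = 783
AND AddressTypeID != 3
AND AddressTypeID BETWEEN 2 AND 5;

-- "not 3" second: a single seek over the range (2 reads)
SELECT CustomerID, AddressTypeID, FullAddress
FROM dbo.Address
WHERE CustomerID = 783
AND AddressTypeID BETWEEN 2 AND 5
AND AddressTypeID != 3;
```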

And so my preference is to use a non-sargable predicate to tell it that I only want address types 2, 4, and 5. And I can do that by modifying AddressTypeID in some way, such as adding zero to it.
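For example (the + 0 is the only change; it stops the predicate from being usable as a seek predicate):

```sql
-- AddressTypeID + 0 != 3 is non-sargable, so it becomes a residual predicate
-- applied within a single range seek from 2 to 5
SELECT CustomerID, AddressTypeID, FullAddress
FROM dbo.Address
WHERE CustomerID = 783
AND AddressTypeID BETWEEN 2 AND 5
AND AddressTypeID + 0 != 3;
```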

[Screenshot: plan using the non-sargable predicate, showing a single seek]

Now I have a nice and tight range scan within a single seek, and I’m still making sure that my query is returning only the rows that I want.

Oh, but that Actual Rows Read property? That’s now higher than the Actual Rows property, because the Seek Predicate finds address type 3, which the Residual Predicate rejects.

I’ve traded three perfect seeks for a single imperfect seek, which I’m fixing up with a residual predicate.

[Screenshot: Seek properties showing Actual Rows Read higher than Actual Rows]

And for me, that’s sometimes a price worth paying, as it gets me a query plan that I’m much happier about. It’s not considerably cheaper – even though it has only a third of the logical reads, there would only ever be two physical reads either way – but when I think about the work it’s doing, I’m much more comfortable with what I’m asking it to do this way.

The post What's "Actually" going on with that Seek? appeared first on SQLPerformance.com.
