Scalar functions and improvements in SQL Server 2019

Now that SQL Server 2019 is released, I want to discuss one of the more interesting improvements, at least from a performance perspective. I believe that this alone can result in greatly improved performance and reduced resource usage. For some installations, I should add.

Just a couple of quick notes:

  • This post is about scalar UDFs that do SELECT. A forthcoming post will be about scalar functions that don't do SELECT.
  • Jump towards the end to see a table with the performance differences. That is why you are here, after all! Then you can come back and read the introduction paragraphs.

For a couple of decades, “we all” have known that scalar functions are bad. Really bad. To the extent that they are verboten in many installations. Why are they so bad? For two reasons:

  1. The function is called once per row. Imagine if you have a SELECT in there, and the function is called 100,000 times. You have now executed that SELECT 100,000 times.
  2. The overhead of calling the function is also significant. So even if it does something very simple, like adding two numbers, the overhead of calling it many times can be huge.

Don’t believe me? Read on.

I don't want to get too theoretical regarding how this works, digging into execution plans etc. But the big thing here in SQL Server 2019 is that SQL Server can inline those functions for you! I won't go into the prerequisites for this, just check out the product documentation for those things. You will also see that this requires database compatibility level 150 (2019) and that the database option for inlining isn't turned off.
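
By the way, if you want to check whether SQL Server considers a function inlineable in the first place, you can look at the is_inlineable column in sys.sql_modules. A quick sketch, listing all scalar UDFs in the database:

SELECT OBJECT_NAME(m.object_id) AS function_name, m.is_inlineable
FROM sys.sql_modules AS m
 JOIN sys.objects AS o ON o.object_id = m.object_id
WHERE o.type = 'FN'  --T-SQL scalar functions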

I’m using the AdventureworksDW database, and two tables:

  • DimProduct, 606 rows, 6 MB.
  • FactResellerSalesXL, approx 11,000,000 rows. I have three versions of this.
    • Row-based clustered index. 2.48 GB.
    • Same as above but with page compression: 0.68 GB. I don’t use this table for my tests here.
    • Same as the first with a clustered columnstore index: 0.48 GB.

UDF with I/O

Let’s start with a UDF that performs I/O. It is dead-simple, it will just summarize the sales amount from the fact table for a certain product. Here’s the first version:

CREATE OR ALTER FUNCTION ProdSalesAmt(@ProductKey int)
RETURNS int
AS
BEGIN
DECLARE @ret int
SET @ret =
(
SELECT SUM(d.UnitPrice * d.OrderQuantity)
FROM dbo.FactResellerSalesXL AS d
WHERE d.ProductKey = @ProductKey
)
IF @ret IS NULL
SET @ret = 0
RETURN @ret
END
GO

You probably say that the function can be simplified. So here is such a simplified version. The idea is to get rid of all procedural code constructs:

CREATE OR ALTER FUNCTION ProdSalesAmt_Simple(@ProductKey int)
RETURNS int
AS
BEGIN
RETURN (
SELECT ISNULL(SUM(d.UnitPrice * d.OrderQuantity), 0)
FROM dbo.FactResellerSalesXL AS d
WHERE d.ProductKey = @ProductKey
)
END
GO

Some of you know the trick of not using a scalar UDF but an inline table-valued function instead. The calling query has to use CROSS APPLY or OUTER APPLY, so the optimization isn't transparent to the client code. Anyhow, here is such a version:

CREATE OR ALTER FUNCTION ProdSalesAmt_TVF(@ProductKey int)
RETURNS table
AS
RETURN (
SELECT ISNULL(SUM(d.UnitPrice * d.OrderQuantity), 0) AS SumSales
FROM dbo.FactResellerSalesXL AS d
WHERE d.ProductKey = @ProductKey
)
GO

I’m going to test with different compatibility levels for the database:

ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 140
ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 150

And I'm also going to test with and without inlining turned on.

ALTER DATABASE SCOPED CONFIGURATION SET TSQL_SCALAR_UDF_INLINING  = ON 
ALTER DATABASE SCOPED CONFIGURATION SET TSQL_SCALAR_UDF_INLINING  = OFF 

Automatic inlining requires compatibility level 150 and also that inlining is turned on. I did not see any difference between turning off inlining one way or the other, so I will just report it as inlining turned on or off.
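
The "other way" I refer to is to control inlining closer to the code instead of for the whole database: either per query with a hint, or per function with the INLINE option. A sketch of both (check the documentation for the details):

--Turn off inlining for one particular query
SELECT p.ProductKey, p.EnglishProductName, dbo.ProdSalesAmt(p.ProductKey) AS SumSales
FROM DimProduct AS p
ORDER BY SumSales
OPTION (USE HINT('DISABLE_TSQL_SCALAR_UDF_INLINING'))

--Or turn it off (or on) for one particular function, in the function definition:
--CREATE OR ALTER FUNCTION ... RETURNS int WITH INLINE = OFF AS BEGIN ... END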

But what if we didn't use a function in the first place? Sure, functions are a neat programming construct. But neat programming constructs and performance don't necessarily go hand in hand, as we learned the hard way. So, I will also use a simple join as well as a correlated subquery.

So here are the queries I will use:

-- Q1 Basic scalar function
 SELECT p.ProductKey, p.EnglishProductName, dbo.ProdSalesAmt(p.ProductKey) AS SumSales
 FROM DimProduct AS p
 ORDER BY SumSales

-- Q2 Simplified scalar function
 SELECT p.ProductKey, p.EnglishProductName, dbo.ProdSalesAmt_Simple(p.ProductKey) AS SumSales
 FROM DimProduct AS p
 ORDER BY SumSales

-- Q3 Inline TVF instead of scalar function
 SELECT p.ProductKey, p.EnglishProductName, s.SumSales AS SumSales
 FROM DimProduct AS p
  OUTER APPLY dbo.ProdSalesAmt_TVF(p.ProductKey) AS s
 ORDER BY SumSales

-- Q4 Join instead of calling a function
 SELECT p.ProductKey, p.EnglishProductName, SUM(f.UnitPrice* f.OrderQuantity) AS SumSales
 FROM DimProduct AS p
  LEFT OUTER JOIN FactResellerSalesXL AS f ON f.ProductKey = p.ProductKey
 GROUP BY p.ProductKey, p.EnglishProductName
 ORDER BY SumSales

-- Q5 Correlated subquery
 SELECT p.ProductKey, p.EnglishProductName, 
 ( SELECT SUM(d.UnitPrice * d.OrderQuantity) 
   FROM dbo.FactResellerSalesXL AS d
   WHERE d.ProductKey = p.ProductKey ) AS SumSales
 FROM DimProduct AS p
 ORDER BY SumSales

-- Q6 Join using a table with a columnstore index
 SELECT p.ProductKey, p.EnglishProductName, SUM(f.UnitPrice* f.OrderQuantity) AS SumSales
 FROM DimProduct AS p
  LEFT OUTER JOIN FactResellerSalesXL_CCI AS f ON f.ProductKey = p.ProductKey
 GROUP BY p.ProductKey, p.EnglishProductName
 ORDER BY SumSales

I measured three things: the execution time as shown in SSMS, plus duration and logical reads captured with an Extended Events trace. I limit duration and logical reads to three significant figures in the table below.
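
If you want to capture the same kind of measures yourself, an Extended Events session along these lines will do. This is just a minimal sketch; adjust session name, file path and filters to your environment:

CREATE EVENT SESSION udf_test ON SERVER
 ADD EVENT sqlserver.sql_statement_completed
 ADD TARGET package0.event_file(SET filename = N'C:\Temp\udf_test')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS)

ALTER EVENT SESSION udf_test ON SERVER STATE = START
--Run the queries, stop the session, and then look at duration and logical_reads in the file target
ALTER EVENT SESSION udf_test ON SERVER STATE = STOP

Anyhow, here are the results: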

Query  Description     Inlining  Time (mm:ss)  Duration (ms)  Logical reads
Q1     Basic UDF       N         3:23          204,000        200,000,000
Q1     Basic UDF       Y         0:32          32,100         35,600,000
Q2     Simplified UDF  Y         0:32          32,100         35,600,000
Q3     Inline TVF      -         0:32          33,200         35,600,000
Q4     Join            -         0:00          692            331,000
Q5     Join corr sub   -         0:00          708            331,000
Q6     Join col-store  -         0:00          153            4,960

Conclusions

Let us focus on the time it takes, the "Time" column above. This is the end-user perspective, and it is what matters in the end.

Without inlining, the end-user would have time to grab a cup of coffee and chat with a friend while waiting. This is the pre-2019 behavior, assuming you didn’t do any optimization.

With inlining, the user can swap to the email program and check for new mail while waiting. We had to do manual work prior to 2019 to get from three and a half minutes to half a minute. This meant re-writing the function and the SQL that called the function. SQL Server 2019 basically does that for us! Note that I didn't see any difference from simplifying the scalar function to contain only one SELECT.

But we could be smarter than that. For this simple example we could either do a join or a correlated subquery. It might not be that simple in all situations, but evidently there can still be room for query optimizations. Going from half a minute to half a second is a big thing. Now the end-user will barely notice the execution time.

And imagine if the dev/dba created a columnstore index to support the query. Now we basically have an instant result.
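
If you want to try the columnstore angle on a table of your own, it can be as simple as the sketch below (table and index names are just examples; my _CCI table is a separate version of the fact table with a clustered columnstore index):

--A separate copy of the fact table, with a clustered columnstore index
SELECT * INTO dbo.MyFact_CCI FROM dbo.FactResellerSalesXL
CREATE CLUSTERED COLUMNSTORE INDEX cci ON dbo.MyFact_CCI

--Or add a nonclustered columnstore index to the original table
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci ON dbo.FactResellerSalesXL (ProductKey, UnitPrice, OrderQuantity)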

Also consider the resource consumption. I'm thinking about the logical reads column here. I didn't measure CPU time for this test, but I would expect it to reflect the execution time and logical reads to some extent.

I will discuss a scalar UDF that doesn’t do SELECT (I/O) in a later blog post.

Disclaimers

You can do many more combinations than what I did, for instance scalar inlining combined with a columnstore index. My goal here was not to test all possible permutations. Please give the others a try if you want!

This was only one example. I'm sure that there are examples where inlining doesn't work that well, and perhaps also where the benefits are even greater. My aim here was to give it a try, with an example that was simple enough and similar to things I have seen in the wild.

Fragmentation, the final installment

Measuring things can be difficult. I have been painfully reminded of that with my attempts to measure whether fragmentation matters with new hardware. I created a test script in which I read data seeking the index (following the linked list of the leaf level of the index).

My thoughts were that “jumping back and forth” doesn’t matter with modern I/O subsystems. But perhaps read-ahead reads more (unnecessary) data into cache and less useful data remains in memory because of this?

In my first post, I came to the conclusion that it didn't matter for execution time, but I noticed that lots of memory was brought into cache when the index was fragmented. There's some background info in there: http://sqlblog.karaszi.com/does-index-fragmentation-matter/ .

In my second blog post, I focused on how much memory was brought into cache. And I found that when the index was fragmented, I noticed 1000 times more memory usage. Surely read-ahead cannot cause this!? http://sqlblog.karaszi.com/index-fragmentation-part-2/

Because of the absurd figures, I had to dig deeper. Paul White told me about a lesser-known optimization where SQL Server will do "large I/O" (my own term for this) if you have plenty of unused memory. Instead of reading single pages, it reads whole extents. This makes sense, warming up the cache. https://blogs.msdn.microsoft.com/ialonso/2011/12/09/the-read-ahead-that-doesnt-count-as-read-ahead/

But still, that can't account for 1000 times more memory usage for a fragmented index. I can live with 5-10 times, but not 1000 times. And then Simon Sabin pointed out the obvious: auto-update of statistics! My first set of SELECTs was done after updating a bunch of rows (to fragment the index), and that triggered auto-update statistics (bringing the whole index into memory). Mystery solved.
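
A simple way to catch this kind of thing is to check the statistics dates on the table before and after the test run. A sketch, using my narrow_index table as an example:

SELECT s.name AS stats_name, STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('narrow_index')
--If last_updated changes during the test, then auto-update statistics (or something else) updated the statistics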

My findings

As I said, I found it extremely difficult to measure this. Things happen in the machine which can muddle the results. I did the test over and over again, looking at both details and averages. My conclusion is that with the same number of pages, you will see a pretty marginal effect of fragmentation. "Marginal" is in the eye of the beholder, though. More details further down. Please read my earlier posts, especially the first one, for background information.

What about page fullness?

To be honest, I find this to be a much more interesting aspect. I'd like to split this into two sub-sections: increasing and "random" indexes.

Increasing index

This can be an identity column or an ever-increasing datetime column (think, for instance, order_date). This will not become fragmented from INSERTs, and assuming that you don't delete "every other row", you will end up with a nice unfragmented index with a high degree of page fullness. If you rebuild it with a fillfactor of 100%, you have just wasted resources. If you rebuild it with a lower fillfactor, you also waste space and cause SQL Server to read more data – causing worse performance.

Random index

By this I mean an index where the data distribution is all over the place, for instance an index on the LastName column. As you insert data, pages are split and you end up with a page fullness of about 75% (which is in between a half-full and a full page). If you rebuild it with a similar fillfactor, you didn't gain anything. If you rebuild it with a high fillfactor, then the following activity in your database will generate a high number of page splits. It seems reasonable to let it be, accepting about 75% page fullness and a page split every now and then.
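
If you want to see where you actually are regarding page fullness, check avg_page_space_used_in_percent. A sketch (note that you need SAMPLED or DETAILED mode, which is more expensive than LIMITED; the table name is just an example):

SELECT OBJECT_NAME(s.object_id) AS table_name
 ,s.index_id
 ,CAST(s.avg_page_space_used_in_percent AS decimal(5,1)) AS page_fullness_pct
 ,CAST(s.avg_fragmentation_in_percent AS decimal(5,1)) AS frag_pct
 ,s.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.SomeTable'), NULL, NULL, 'SAMPLED') AS s
WHERE s.index_level = 0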

But we also delete data!

Do you? Really? OK, let us say that you do. What is the delete pattern? And the degree of deletes? How does this spread over your particular index? Deleting old data over some ever-increasing index will just deallocate those pages. Deleting data which doesn't match the index key (rows deleted are spread all over the index) will leave empty space. This will be re-used as you insert new rows. But if you delete a lot of data, then you will have a low degree of page fullness. IMO, this is a rather unusual situation and should warrant a manually triggered index maintenance, in contrast to rebuilding indexes every week or so.
(And don't get me started on heaps, but surely you don't have those…? I once had a SELECT of all rows for a table with only 3000 rows, which took minutes. It had allocated more extents than the number of rows in the table. After a rebuild it took, of course, sub-second. Needless to say, this didn't stay as a heap for long. And no, the table wasn't wide at all.)

Bottom line?

Being who I am, I don't want to draw too hard conclusions from some tests and reasoning. But I must say that I doubt the value of regular index defrag – assuming you are on some modern I/O subsystem, of course. I think that folks focus on index defrag and forget about statistics. I like to provide the optimizer with as high-quality statistics as possible. Sure, an index rebuild will give you new statistics. But if you condition it based on fragmentation, then you end up not rebuilding a bunch of indexes and having stale statistics. Auto-update statistics is OK, but you have to modify a decent amount of data before it kicks in. And when it kicks in, the user waits for it. Unless you do async update…. But I digress. Statistics management is a topic of its own. My point is that you probably don't want to lose track of statistics management because you keep staring at index fragmentation.
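
If you want to keep an eye on statistics instead of staring at fragmentation, something like this sketch will list statistics with lots of modifications since they were last updated (using sys.dm_db_stats_properties):

SELECT OBJECT_NAME(s.object_id) AS table_name, s.name AS stats_name
 ,sp.last_updated, sp.rows, sp.modification_counter
FROM sys.stats AS s
 CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
 AND sp.modification_counter > 0
ORDER BY sp.modification_counter DESC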

My tests – details

Check out this section if you are curious about how I did it, want to see the numbers, and perhaps even want to do tests of your own. I welcome constructive criticism, for instance about weaknesses in my test scripts. Or perhaps different interpretations of the outcome!

The basic flow is:

  • Create a proc which does bunch of reads to fill the buffer pool with data.
  • Create some other supporting stuff, like trace definition, trace control procedure etc.
  • Grow the data and log files for the Stackoverflow (10 GB) database.
  • Create table to hold my measures.
  • Create the narrow_index table which is a table with a clustered index which is narrow. A row is some 40 bytes (very rough calculation off the top of my head).
  • Create the wide_index table, which has a clustered index that is wider – some 340 bytes per row.
  • Set max server memory to 1 GB.
  • Turn off auto-update statistics.
  • Create the proc that will do the tests.
  • Generate fragmentation in both indexes/tables.
  • The narrow table has 20,000,000 rows and is about 970 MB in size.
  • The wide table has 10,000,000 rows and is about 6.8 GB in size.
  • Run the proc that does the test with 1 row selectivity.
  • Run the proc that does the test with 10000 rows selectivity.
  • Run the proc that does the test with 100000 rows selectivity.
  • Each test runs 99 SELECTs, spread over the data (so not to read the same row).
  • Average the result, into a table named resx.
  • I did 4 such iterations, so I have the tables res1, res2, res3 and res4.
  • Check the numbers from these tables and see if they correspond reasonably. They did.
  • Average the results from these four tables to get the end results.

The procedure which executes the SELECTs and captures the measures has this basic flow:

  • Stop the (XE) trace if it is running.
  • Delete all old trace files.
  • Do a first SELECT to get rid of auto-stats (yeah, I disabled it, but just in case).
  • Empty the cache.
  • Read lots of other data into cache.
  • Start the trace
  • Do the SELECT, in a loop, reading as many rows as requested (param to the proc), as many times as requested (another param to the proc). Don’t read the same data over and over again. This is done both for the narrow and wide table/index.
  • Capture memory usage for each table (buffer pool usage).
  • Get the trace raw data into a temp table.
  • Parse the trace data/above table using XQuery into the measures table.
  • I.e., we now have one row for each SELECT, with a bunch of measures.

The measures

  • Microseconds as captured with the Extended Event trace.
  • Microseconds as “wall clock time”. I.e., capture datetime before the loop in a variable and calculate microseconds after the loop into a variable, divided by number of executions. This value will then be the same for each execution in the loop. I want to have two different ways to calculate execution time, as a sanity check. In the end, they corresponded pretty well.
  • How much data of the table was brought into cache.
  • Physical reads. This is of lower interest, since read-ahead will muddle the result.
  • Logical reads. This is more interesting and can, among other things, be used to check that we read about the same number of pages for the fragmented and for the non-fragmented index. Which we should – they have the same fillfactor!
  • CPU microseconds. I didn’t find this very interesting.

Outcome

I ignored the reading of 1 row; this shouldn't differ, and the small difference I found I attribute to imperfections in the ability to capture exact numbers.

When reading 10,000 rows for the narrow table, about 75 pages, the fragmented index was about 30% slower.

When reading some 700-900 pages (100,000 rows for the narrow table and 10,000 rows for the wide table), the fragmented index was about 16-17% slower.

When reading about 9,000 pages (100,000 rows for the wide table), the fragmented index was about 15% slower.

Here are the averages, one row per execution (remember that I did 99 SELECTs in a loop, so this is the average of those), first for the narrow index and then for the wide index, and finally the average of those averages.

The code

------------------------------------------------------------------------------------------------
--Measure cost of fragmentation in terms of execution time and memory usage
--Tibor Karaszi, 2019
--Requires the Stackoverflow database. I was using the 10 GB version.
------------------------------------------------------------------------------------------------

SET NOCOUNT ON

------------------------------------------------------------------------------------------------
--Setup section
------------------------------------------------------------------------------------------------
USE StackOverflow
GO

--Proc to fill buffer pool, reads lots of data, takes a while to execute, especially if on spinning disk
CREATE OR ALTER PROC fill_bp
AS
DECLARE @a bigint
SET @a = (SELECT AVG(CAST(PostId AS bigint)) FROM Comments)
SET @a = (SELECT AVG(CAST(CommentCount AS bigint)) FROM Posts)
SET @a = (SELECT AVG(CAST(PostId AS bigint)) FROM Comments)
SET @a = (SELECT AVG(CAST(VoteTypeId AS bigint)) FROM Votes)
SET @a = (SELECT AVG(CAST(PostId AS bigint)) FROM Comments)
SET @a = (SELECT AVG(CAST(VoteTypeId AS bigint)) FROM Votes)
SET @a = (SELECT AVG(CAST(VoteTypeId AS bigint)) FROM Votes)
GO

--Disable IAM order scan, so we know that SQL Server will follow the linked list
--See https://sqlperformance.com/2015/01/t-sql-queries/allocation-order-scans
EXEC sp_configure 'cursor threshold', 1000000 RECONFIGURE
GO

--Proc to change trace status
CREATE OR ALTER PROC change_trace_status 
	 @state varchar(50) --start, stop, delete_all_files
	AS

	IF @state = 'start'
	 BEGIN
	   IF NOT EXISTS (SELECT * FROM sys.dm_xe_sessions WHERE name = 'frag_test')
		ALTER EVENT SESSION frag_test ON SERVER STATE = START
	 END

	IF @state = 'stop'
	BEGIN
	   IF EXISTS (SELECT * FROM sys.dm_xe_sessions WHERE name = 'frag_test')
		 ALTER EVENT SESSION frag_test ON SERVER STATE = STOP
	 END

	--Delete XE file, using xp_cmdshell (ugly, I know)
	IF @state = 'delete_all_files'
	BEGIN
	  EXEC sp_configure 'xp_cmdshell', 1 RECONFIGURE WITH OVERRIDE
	  EXEC xp_cmdshell 'DEL R:\frag_test*.xel', no_output 
	  EXEC sp_configure 'xp_cmdshell', 0 RECONFIGURE WITH OVERRIDE
	END

	/*
	EXEC change_trace_status @state = 'start'
	EXEC change_trace_status @state = 'stop'
	EXEC change_trace_status @state = 'delete_all_files'
	*/
GO


--Drop and create event session to keep track of execution time
EXEC change_trace_status @state = 'stop'
WAITFOR DELAY '00:00:01'
IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'frag_test')
 DROP EVENT SESSION frag_test ON SERVER
EXEC change_trace_status @state = 'delete_all_files'
CREATE EVENT SESSION frag_test ON SERVER 
	ADD EVENT sqlserver.sp_statement_completed()
	ADD TARGET package0.event_file(SET filename=N'R:\frag_test')
	WITH (MAX_DISPATCH_LATENCY=10 SECONDS)
GO


--Create proc to report progress
CREATE OR ALTER PROC #status 
 @msg varchar(200)
AS
RAISERROR(@msg, 10, 1) WITH NOWAIT
GO

--Grow the data and log files for StackOverflow database.
EXEC #status 'Grow the data and log files for StackOverflow database...'
IF EXISTS(	SELECT size * 8/(1024*1024), * 
			FROM sys.database_files 
			WHERE name = N'StackOverflow2010' AND size * 8/(1024*1024) < 30)
	ALTER DATABASE [StackOverflow] MODIFY FILE ( NAME = N'StackOverflow2010', SIZE = 30GB )
IF EXISTS(	SELECT size * 8/(1024*1024), * 
			FROM sys.database_files 
			WHERE name = N'StackOverflow2010_log' AND size * 8/(1024*1024) < 15)
	ALTER DATABASE [StackOverflow] MODIFY FILE ( NAME = N'StackOverflow2010_log', SIZE = 15GB )
GO


--Table to hold output 
DROP TABLE IF EXISTS measures

CREATE TABLE measures (
 id int identity(1,1) PRIMARY KEY NOT NULL
,rows_to_read int NOT NULL
,index_width varchar(20) NOT NULL
,fragmented varchar(2) NOT NULL
,usec_xe bigint NOT NULL
,usec_clock bigint NOT NULL
,cpu_microsec bigint NOT NULL
,physical_reads bigint NOT NULL
,logical_reads bigint NOT NULL
,mb_cache decimal(9,2) NOT NULL
);



--Create the table for the narrow index
EXEC #status 'Setup section. Create table with narrow index...'
DROP TABLE IF EXISTS narrow_index


--Adjust numbers. 20,000,000 rows means about 970 MB in the end
SELECT TOP(1000*1000*20) ROW_NUMBER() OVER( ORDER BY (SELECT NULL)) AS c1, CAST('Hello' AS varchar(80)) AS c2
INTO narrow_index
FROM sys.columns AS a, sys.columns AS b, sys.columns AS c


CREATE CLUSTERED INDEX x ON narrow_index(c1)


--Create the table for the wide index
EXEC #status 'Setup section. Create table with wide index...'
DROP TABLE IF EXISTS wide_index

--Adjust numbers. 10,000,000 rows means about 6.8 GB in the end
SELECT 
 TOP(1000*1000*10) ROW_NUMBER() OVER( ORDER BY (SELECT NULL)) AS c1
,CAST('Hi' AS char(80)) AS c2
,CAST('there' AS char(80)) AS c3
,CAST('I''m' AS char(80)) AS c4
,CAST('on' AS char(80)) AS c5
,CAST('my' AS varchar(200)) AS c6
INTO wide_index
FROM sys.columns AS a, sys.columns AS b, sys.columns AS c

CREATE CLUSTERED INDEX x ON wide_index(c1)
GO


------------------------------------------------------------------------------------------------
--Investigate the data if you want

/*
--wide index
SELECT TOP(100) * FROM wide_index
EXEC sp_indexinfo wide_index		--Found on my web-site
EXEC sp_help 'wide_index'

--narrow index
SELECT TOP(100) * FROM narrow_index
EXEC sp_indexinfo narrow_index		--Found on my web-site
EXEC sp_help 'narrow_index'
*/

--Execute this if you want to have a rather full BP, restricts memory to 1 GB
EXEC sp_configure 'max server memory', 1000 RECONFIGURE

--Turn off auto-update statistics
ALTER DATABASE Stackoverflow SET AUTO_UPDATE_STATISTICS OFF

------------------------------------------------------------------------------------------------
--/Setup section
------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------
--Create the proc that executes our SQL
EXEC #status 'Create the proc that executes our SQL...'
GO
CREATE OR ALTER PROC run_the_sql
 @fragmented varchar(20)
,@rows_to_read int
,@range_iterations int
,@fill_bp char(1)			--'y' or 'n'
,@report_frag char(1) = 'n'
AS
DECLARE 
 @sql varchar(1000)
,@sql_condition varchar(1000)
,@bp_wide_mb decimal(9,2)
,@bp_narrow_mb decimal(9,2)
,@range_start int
,@range_iterations_counter int
,@a int
,@start_time datetime2
,@exec_time_wide_index_usec bigint
,@exec_time_narrow_index_usec bigint

EXEC change_trace_status @state = 'stop'
EXEC change_trace_status @state = 'delete_all_files'

--Do a first execution to get autostats etc out of the way
DBCC FREEPROCCACHE
SET @range_iterations_counter = 1
SET @range_start = 1000
SET @a = (SELECT COUNT_BIG(c1) AS c1_count FROM wide_index WHERE c1 BETWEEN @range_start AND (@range_start + @rows_to_read - 1))
SET @a = (SELECT COUNT_BIG(c1) AS c1_count FROM narrow_index WHERE c1 BETWEEN @range_start AND (@range_start + @rows_to_read - 1))

--Empty data cache (buffer pool, BP)
CHECKPOINT
DBCC DROPCLEANBUFFERS

--Run proc to read stuff into BP if requested
IF @fill_bp = 'y'
  EXEC fill_bp

EXEC change_trace_status @state = 'start'

--Do the SELECTs, narrow index
SET @range_iterations_counter = 1
SET @range_start = 1000
SET @start_time = SYSDATETIME()
WHILE @range_iterations_counter <= @range_iterations
BEGIN
	SET @a = (SELECT COUNT_BIG(c1) AS c1_count FROM narrow_index WHERE c1 BETWEEN @range_start AND (@range_start + @rows_to_read - 1))
	SET @range_start = @range_start + 100000
	SET @range_iterations_counter += 1
END
SET @exec_time_narrow_index_usec = DATEDIFF_BIG(microsecond, @start_time, SYSDATETIME()) / @range_iterations

--Do the SELECTs, wide index
SET @range_iterations_counter = 1
SET @range_start = 1000
SET @start_time = SYSDATETIME()
WHILE @range_iterations_counter <= @range_iterations
BEGIN
	SET @a = (SELECT COUNT_BIG(c1) AS c1_count FROM wide_index WHERE c1 BETWEEN @range_start AND (@range_start + @rows_to_read - 1))
	SET @range_start = @range_start + 100000
	SET @range_iterations_counter += 1
END
SET @exec_time_wide_index_usec = DATEDIFF_BIG(microsecond, @start_time, SYSDATETIME()) / @range_iterations

EXEC change_trace_status @state = 'stop'

--Keep track of BP usage
SET @bp_wide_mb = 
(
			SELECT 
			CAST((COUNT(*) * 8.00) / 1024 AS DECIMAL(9,2)) AS MB 
			FROM  sys.allocation_units AS a  
			  JOIN sys.dm_os_buffer_descriptors AS b 
				ON a.allocation_unit_id = b.allocation_unit_id
			  JOIN sys.partitions AS p
				ON a.container_id = p.hobt_id
			WHERE p.object_id = OBJECT_ID('wide_index')
			  AND b.database_id = DB_ID()
)

SET @bp_narrow_mb = 
(
			SELECT 
			CAST((COUNT(*) * 8.00) / 1024 AS DECIMAL(9,2)) AS MB 
			FROM  sys.allocation_units AS a  
			  JOIN sys.dm_os_buffer_descriptors AS b 
				ON a.allocation_unit_id = b.allocation_unit_id
			  JOIN sys.partitions AS p
				ON a.container_id = p.hobt_id
			WHERE p.object_id = OBJECT_ID('narrow_index')
			  AND b.database_id = DB_ID()
)


--Wait for trace data to arrive in target
WAITFOR DELAY '00:00:10'

--Get the trace data into our table
SELECT CAST(event_data AS XML) AS StatementData
INTO #myRawXeData
FROM sys.fn_xe_file_target_read_file('R:\frag_test*.xel', NULL, NULL, NULL);

--Done with trace file, delete it
EXEC change_trace_status @state = 'delete_all_files';

--Transform raw trace data into our measures table
WITH t AS(
SELECT 
 StatementData.value('(event/data[@name="duration"]/value)[1]','bigint') AS duration_microsec
,StatementData.value('(event/data[@name="cpu_time"]/value)[1]','bigint') AS cpu_microsec
,StatementData.value('(event/data[@name="physical_reads"]/value)[1]','bigint') AS physical_reads
,StatementData.value('(event/data[@name="logical_reads"]/value)[1]','bigint') AS logical_reads
,StatementData.value('(event/data[@name="statement"]/value)[1]','nvarchar(500)') AS statement_
FROM #myRawXeData AS evts
WHERE StatementData.value('(event/data[@name="statement"]/value)[1]','nvarchar(500)') LIKE '%index WHERE c1%'
),
t2 AS (
SELECT 
 CASE WHEN t.statement_ LIKE '%wide_index%' THEN 'wide_index' ELSE 'narrow_index' END AS index_width
,CASE @fragmented WHEN 'high_frag_level' THEN 'y' ELSE 'n' END AS fragmented
,duration_microsec
,CASE WHEN t.statement_ LIKE '%wide_index%' THEN @exec_time_wide_index_usec ELSE @exec_time_narrow_index_usec END AS usec_clock
,cpu_microsec
,physical_reads
,logical_reads
,CASE WHEN t.statement_ LIKE '%wide_index%' THEN @bp_wide_mb ELSE @bp_narrow_mb END AS mb_cache
FROM t)
INSERT INTO measures(rows_to_read, index_width, fragmented, usec_xe, usec_clock, cpu_microsec, physical_reads, logical_reads, mb_cache)
SELECT @rows_to_read, index_width, fragmented, duration_microsec, usec_clock, cpu_microsec, physical_reads, logical_reads, mb_cache
FROM t2;

--Report fragmentation level, if requested
IF @report_frag = 'y'
	--Note size of index and frag level, should be comparative between executions
	SELECT
	  OBJECT_NAME(object_id) AS table_name
	 ,index_type_desc
	 ,CAST(avg_fragmentation_in_percent AS decimal(5,1)) AS frag_level
	 ,page_count/1000 AS page_count_1000s
	FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') 
	WHERE index_level = 0 AND alloc_unit_type_desc = 'IN_ROW_DATA' AND OBJECT_NAME(object_id) IN('narrow_index', 'wide_index')
	ORDER BY index_id
GO


------------------------------------------------------------------------------------------------
--1: Generate fragmentation in both indexes

--Fragment wide ix
EXEC #status 'Generate fragmentation in wide index...'
UPDATE wide_index SET c6 = REPLICATE('a', 200) WHERE c1 % 20 = 0
UPDATE STATISTICS wide_index WITH FULLSCAN

--Fragment narrow ix
EXEC #status 'Generate fragmentation in narrow index...'
UPDATE narrow_index SET c2 = REPLICATE('a', 20) WHERE c1 % 100 = 0
UPDATE STATISTICS narrow_index WITH FULLSCAN

--Run the queries
EXEC #status 'Run queries with high frag level...'
EXEC run_the_sql @fragmented = 'high_frag_level', @rows_to_read = 1,		@range_iterations = 99, @fill_bp = 'y'
EXEC run_the_sql @fragmented = 'high_frag_level', @rows_to_read = 10000,	@range_iterations = 99, @fill_bp = 'y'
EXEC run_the_sql @fragmented = 'high_frag_level', @rows_to_read = 100000,	@range_iterations = 99, @fill_bp = 'y'


-----------------------------------------------------------------------------------
--2: no frag in either index, fillfactor set to make same size as when fragmented

EXEC #status 'Eliminate fragmentation in wide index...'
ALTER INDEX x ON wide_index REBUILD WITH (FILLFACTOR = 48)

EXEC #status 'Eliminate fragmentation in narrow index...'
ALTER INDEX x ON narrow_index REBUILD WITH (FILLFACTOR = 50)

--Run the queries
EXEC #status 'Run queries with low frag level...'
EXEC run_the_sql @fragmented = 'low_frag_level', @rows_to_read = 1,		 @range_iterations = 99, @fill_bp = 'y'
EXEC run_the_sql @fragmented = 'low_frag_level', @rows_to_read = 10000,	 @range_iterations = 99, @fill_bp = 'y'
EXEC run_the_sql @fragmented = 'low_frag_level', @rows_to_read = 100000, @range_iterations = 99, @fill_bp = 'y'


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Reset
EXEC sp_configure 'cursor threshold', -1
EXEC sp_configure 'max server memory', 2147483647
RECONFIGURE
GO

-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Run below manually and investigate the output

--Raw data from the trace
SELECT * FROM measures ORDER BY rows_to_read, index_width, fragmented

--Average the data.
--Note that I ran this script 4 times and used INTO here to create 4 tables. Res1, res2, res3, and res4.
SELECT
 m.rows_to_read
,m.index_width
,m.fragmented
,AVG(m.usec_xe) AS usec_xe
,AVG(m.usec_clock) AS usec_clock
,AVG(m.mb_cache) AS mb_cache
,AVG(m.physical_reads) AS physical_reads
,AVG(m.logical_reads) AS logical_reads
,AVG(m.cpu_microsec) AS cpu_microsec
-- INTO res4						--if you want to persist the results
FROM measures AS m
GROUP BY m.rows_to_read, m.index_width, m.fragmented
ORDER BY index_width, rows_to_read, fragmented;


--Check the details from the averages of each execution
WITH x AS(
SELECT 1 AS ord, * FROM res1
UNION ALL
SELECT 2 AS ord, * FROM res2
UNION ALL
SELECT 3 AS ord, * FROM res3
UNION ALL
SELECT 4 AS ord, * FROM res4
)
SELECT * FROM x
WHERE index_width = 'narrow_index'
ORDER BY index_width, rows_to_read, fragmented, ord;

--Average the averages
WITH x AS(
SELECT 1 AS ord, * FROM res1
UNION ALL
SELECT 2 AS ord, * FROM res2
UNION ALL
SELECT 3 AS ord, * FROM res3
UNION ALL
SELECT 4 AS ord, * FROM res4
)
SELECT
 m.rows_to_read
,m.index_width
,m.fragmented
,AVG(m.usec_xe) AS usec_xe
,AVG(m.usec_clock) AS usec_clock
,AVG(m.mb_cache) AS mb_cache
,AVG(m.physical_reads) AS physical_reads
,AVG(m.logical_reads) AS logical_reads
,AVG(m.cpu_microsec) AS cpu_microsec
 FROM x AS m
GROUP BY m.rows_to_read, m.index_width, m.fragmented
ORDER BY index_width, rows_to_read, fragmented;





-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Cleanup
/*
DROP TABLE IF EXISTS narrow_index
DROP TABLE IF EXISTS myXeData
DROP TABLE IF EXISTS myXeData2
DROP TABLE IF EXISTS bp_usage
*/

Looking for strange (in your SQL Server)

Do you know if you have weird stuff going on in your SQL Server? Perhaps you walk through the errorlog every now and then? (Do you, really? You should!).

But there are lots of things that don't end up in the errorlog. For these kinds of things, I created what I refer to as my KMFDM trace. This is from a tune of theirs, "Looking for Strange".

I decided to have this as an article on my web-site. The reason is that I think of it as something that may evolve over time. Hopefully I'll get feedback, with suggestions to refine the trace (adding filters, events, etc.).

Needless to say, I'm using Extended Events. Instead of capturing each event, which can get pretty noisy, I use the event_counter target. This just counts how many times each event has happened.

You find the article here, with the details: https://karaszi.com/looking-for-strange
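
To give you a flavor of it, here is a minimal sketch with just a few of the events and the event_counter target. The real trace in the article has more events and filters, and the session name here is just an example:

CREATE EVENT SESSION looking_for_strange ON SERVER
 ADD EVENT sqlserver.attention
,ADD EVENT sqlserver.database_file_size_change
,ADD EVENT sqlserver.lock_deadlock
,ADD EVENT sqlserver.lock_escalation
 ADD TARGET package0.event_counter
WITH (STARTUP_STATE = ON)

ALTER EVENT SESSION looking_for_strange ON SERVER STATE = START

--Check the counters every now and then
SELECT CAST(t.target_data AS XML) AS event_counts
FROM sys.dm_xe_sessions AS s
 JOIN sys.dm_xe_session_targets AS t ON t.event_session_address = s.address
WHERE s.name = 'looking_for_strange' AND t.target_name = 'event_counter'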

Here is some info from that article. In particular, I have included below the events I added, and why:

General events

Attention. This fires when a client times out. The client might do a re-try, so the end-user isn't even aware that this is happening, except that the performance sucks. The time-out period is specified at the client side, in the database API. "Indicates that a cancel operation, client-interrupt request, or broken client connection has occurred. Be aware that cancel operations can also occur as the result of implementing data access driver time-outs."

Auto_stats. If you have lots of auto-update statistics going on, then you might want to do something about that. The user submitting a query when auto-update occurs will notice a halt while the update happens (unless it is set to async, but then the query is run with inferior statistics). Perhaps you want to run manual updates more frequently? Or remove some table/index/stats from the auto-update handling? I had to work quite heavily with filtering to get rid of noise. "Occurs when index and column statistics are automatically updated. This event can be generated multiple times per statistics collection when the update is asynchronous."

Database_file_size_change. The change can be for several reasons. A grow or shrink can be automatic or manual; this should capture all of them. You should be aware of manual size changes, and they shouldn't be frequent. But frequent autogrow you might want to handle by manually growing the file or changing the increment. And frequent autoshrink is so bad that I doubt I have to explain any further. Hopefully you don't have autoshrink turned on for any of your databases. If this happens a lot, consider adding a filter to only catch shrink operations and/or add a file target. "Occurs when any of the data or log files for a database changes size. Use this event to monitor changes to the size of database files."

Database_started. You probably want to know if databases are created or attached frequently. Not to mention if you have autoclose turned on for any databases and that causes a database to be started. "Database started."

Lock_timeout_greater_than_0. If a programmer specified a time-out while waiting for a lock, then this can fire. Why not the event simply named "lock_timeout"? For two reasons: one is that I noticed it fires more than once for each occurrence, and the other is that a developer might have deliberately set a timeout of 0, to immediately get an exception, and those you probably don't want to catch. "Occurs when a request for a lock times out because of a blocking lock that is being held by a different resource. This event is identical to lock_timeout, but excludes cases in which the time-out value is 0, such as when lock probes are used. Time-outs with a duration of 0 do not necessarily indicate a blocking problem."

Long_io_detected. You might have seen messages in the errorlog that an I/O operation took longer than 15 seconds. This event fires when that happens. I didn't include this in the performance section since 15 seconds is such a ridiculously long time for an I/O operation, and something is clearly not normal if this event fires. "Long IO detected"

Lock_deadlock. Hopefully this is obvious. If you have lots of deadlocks, then you want to do something about it. Perhaps run a trace and catch the xml_deadlock_report event, and then pass this on to the developers. “Occurs when an attempt to acquire a lock is canceled for the victim of a deadlock.

Lock_escalation. When SQL Server escalates a lock to a higher level (table), it has first acquired lots of fine-grained locks and then uses the sledgehammer by acquiring a table-level lock. You don't want this to happen too frequently; adding indexes or other measures can reduce lock escalations. "Occurs when a fine-grain lock is converted into a coarser-grain lock, such as when a row lock is converted into an object lock. Use this event to monitor how locks are being used on the server."

I also wanted to have the plan_affecting_convert event, to capture when some conversion happens in a WHERE clause so that it makes it impossible to use an index seek. But SQL Server seems to include anything with a conversion here, including conversions in the column list. This makes the event, IMO, useless.

Performance related events

Check the trace definitions for these. Some might happen a lot. If you aren’t interested in these, then just skip them. Perhaps you can’t do anything about it in the first place?

Index fragmentation, part 2

Does fragmentation hurt?

Check out my earlier post in this series: http://sqlblog.karaszi.com/does-index-fragmentation-matter/

[EDIT] Some of the conclusions in this post turned out to be invalid, due to auto-update statistics kicking in. Check out the final post: http://sqlblog.karaszi.com/fragmentation-the-final-installment/

In my last blog post, I wanted to focus on the sequential vs random I/O aspect and how that part should be more or less irrelevant with modern hardware. So I did a test that did a full scan (following the linked list of an index) to see if I could notice any performance difference on my SSD. I couldn't.

That isn't the end of the story, it turns out. Another aspect is how the data is brought into memory. You might know that SQL Server can do "read-ahead", meaning it does larger reads per I/O instead of single-page I/Os. I.e., fewer but larger I/O operations. This sounds fine, but what happens when we have fragmentation?
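
By the way, an easy way to see read-ahead in action is SET STATISTICS IO ON, which reports read-ahead reads separately from regular physical reads. A quick sketch, using my narrow_index table and an arbitrary range:

SET STATISTICS IO ON
CHECKPOINT
DBCC DROPCLEANBUFFERS  --cold cache; don't do this on a production server
SELECT COUNT_BIG(c1) FROM narrow_index WHERE c1 BETWEEN 1000 AND 100000
--The Messages tab shows something like: "... physical reads N, read-ahead reads M ..."
SET STATISTICS IO OFF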

I have had this in the back of my mind for ages now. Since nobody else seemed to care about or discuss this topic, I had this feeling that I was barking up the wrong tree. But since I now have a test script, I can just tweak it a bit so it does less than a full scan, to see if fragmentation seems to make a difference for read-ahead. Spoiler alert: It sure can do! (But possibly not because of read-ahead per se.)

My thinking was that read-ahead will read stuff into memory which it turns out it doesn't need. Something like bringing an extent into memory when it only needs 1 of those 8 pages, and then jumping somewhere else. I never had a real confirmation whether this applies, or if read-ahead is smarter than that.

So, basically, I parameterized my query so I can play with the selectivity for each SELECT. I have two tables, or perhaps I should say two B-Tree indexes, to play with:

Name          Rows (millions)  Size (GB)
wide index    3.7              8.4
narrow index  20               1

So, what selectivity to choose? I did three tests:

  • 1000 rows per query.
  • 2 rows per query. For this I suspected no difference. After all, each query will just navigate the tree and find those 2 rows most probably on the same page.
  • 10000 rows per query.

This should be enough to see a pattern, I hope. We already know that for a full scan/seek we couldn’t see a difference. I’m measuring the following attributes:

  • Duration.
  • Physical reads.
  • Size of index in memory. I.e., how much memory was consumed by the query load.

I did not find any difference for a warm index, i.e., when the data is already in cache. So, I didn’t clutter this post with such measures here. Feel free to run my test and play with it if you want to see for yourself. This is why I did not measure logical reads. Also, CPU usage was so low it wasn’t measurable.

Again, the rebuild was done with a fillfactor so that the indexes are the same size both when fragmented and when not fragmented. This is probably not how you do things, but it is so obvious that fitting more rows on a page will make reads quicker, so we don’t need to investigate that aspect.

1000 rows per SELECT:

index_width   fragmented  duration_ms  physical_reads  mb_cache
narrow_index  n           951          8               0.64
narrow_index  y           1090         131             18.01
wide_index    n           5156         191             14.94
wide_index    y           5656         204             16.27

What stands out above is how much of the narrow table was brought into memory by the workload. The first few times I ran the test, I thought that something was wrong. I also ran the test and checked whether I had physical I/O for the "warm" execution – no. The thinking was that for the no-frag executions, data was evicted from cache for whatever reason, but there was no physical I/O for the warm execution. I also did a run where I removed the wide_index from the test, thinking that perhaps having this in there would skew the test data – but no.

I also investigated the BP using sys.dm_os_buffer_descriptors with the thinking that just a few MB of used cache cannot contain the data I’m looking for. I fit 162 rows per data page when the index isn’t fragmented. I had 72 such pages in cache. This gives me about 11000 rows. In my workload, I did 10 iterations where each read 1000 rows. I.e., 10000 rows in total. In other words, the data that I worked with did indeed fit in those few MB of memory. I also used sys.dm_os_buffer_descriptors after an execution with only the fragmented narrow table and now there are 17,000 pages in cache instead of 72 pages.

My conclusion is that fragmentation can cause bringing lots and lots of unnecessary data into cache. If you have other thoughts or theories, let me know.

As for the other numbers: Duration increases a bit with fragmentation, and so do physical reads. But not that much.

2 rows per SELECT:

index_width   fragmented  duration_ms  physical_reads  mb_cache
narrow_index  n           673          2               0.16
narrow_index  y           840          131             24.41
wide_index    n           941          2               0.19
wide_index    y           1421         17              1.58

The above doesn't make sense at all. Why would a two (2) row SELECT bring so much data into cache when the index happens to be fragmented? I.e., why is read-ahead kicking in for a query with such high selectivity? So I decided to dig deeper. I extracted the query and ran it by itself, just a single execution, without any fancy dynamic SQL going on. I compared execution plans between the fragmented index and the non-fragmented index, but both are identical: index seek. The estimated number of rows was also pretty much on target (1.2 rows).

A single query, reading a single row, brings 20 data pages into cache when the table is fragmented (32 pages in total, including index pages). When defragmenting that index, the very same query brought 6 pages into cache. That is still more than expected, but less than when we had fragmentation. There is obviously more at play here… The funny thing is that STATISTICS IO doesn't report any read-ahead, and trace flag 652 (disable read-ahead) doesn't make any difference for the numbers above in this paragraph. So maybe it isn't read-ahead, after all? But there is something fishy going on.

10,000 rows per SELECT:

index_width   fragmented  duration_ms  physical_reads  mb_cache
narrow_index  n           2851         65              5.14
narrow_index  y           2251         58              120.04
wide_index    n           31371        1883            147.12
wide_index    y           34697        1908            149.26

Same pattern for the narrow index. Not much for me to add, except that here we got less I/O for the narrow index when the table was fragmented. I ran it a few times and got a consistent result. I can’t explain this, so I just leave it for any of you to pick your guesses.

Disabling read-ahead

What if I disable read-ahead? I used trace flag 652 to disable read-ahead and ran the workload with 1000 rows selectivity:

index_width   fragmented  duration_ms  physical_reads  mb_cache
narrow_index  n           1117         15              1.19
narrow_index  y           1723         221             20.65
wide_index    n           14333        204             15.95
wide_index    y           16143        213             16.83

Same pattern for memory usage. But now we see a very different number for physical reads. It seems like Extended Events, the sp_statement_completed event and its physical_reads value, suffers from the same defect as Performance Monitor: it doesn't include the reads done by the read-ahead thread. On the other hand, we can also see that read-ahead can be our friend. Look at the difference in duration.

Bottom line

It turns out that fragmentation can have a pretty significant impact. How much? As usual, it depends on your workload. If you scan entire indexes (see my earlier blog post), then the memory aspect doesn’t seem to differ, which also makes sense. Unless you are so memory constrained that a page is removed from cache, while the query is still running and that page turns out to be needed again.

But when we are more selective than doing a full scan, fragmentation can make a difference for memory usage, evidently. My test is only one workload. And I only noticed a relevant difference for the narrow index. YMMV, of course.

The code

SET NOCOUNT ON


-----------------------------------------------------------------------------------
--Create table to keep track of buffer pool usage
USE StackOverflow

DROP TABLE IF EXISTS bp_usage

CREATE TABLE bp_usage (
 index_width varchar(30) NOT NULL
,fragmented varchar(20) NOT NULL
,mb_cache decimal(9,2) NOT NULL)


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Disable IAM order scan, so we know that SQL Server will follow the linked list
--See https://sqlperformance.com/2015/01/t-sql-queries/allocation-order-scans

EXEC sp_configure 'cursor threshold', 1000000
RECONFIGURE


-----------------------------------------------------------------------------------
--Drop and create event session to keep track of execution time
IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'frag_test')
DROP EVENT SESSION frag_test ON SERVER

--Delete XE file, using xp_cmdshell (bad, I know)
EXEC sp_configure 'xp_cmdshell', 1 RECONFIGURE WITH OVERRIDE
EXEC xp_cmdshell 'DEL R:\frag_test*.xel', no_output 
EXEC sp_configure 'xp_cmdshell', 0 RECONFIGURE WITH OVERRIDE

CREATE EVENT SESSION frag_test ON SERVER 
ADD EVENT sqlserver.sp_statement_completed()
ADD TARGET package0.event_file(SET filename=N'R:\frag_test')
WITH (MAX_DISPATCH_LATENCY=2 SECONDS)
GO

ALTER EVENT SESSION frag_test ON SERVER STATE = START


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Setup section

--Grow the data and log files for StackOverflow database.
IF EXISTS(SELECT size * 8/(1024*1024), * FROM sys.database_files WHERE name = N'StackOverflow2010' AND size * 8/(1024*1024) < 20)
	ALTER DATABASE [StackOverflow] MODIFY FILE ( NAME = N'StackOverflow2010', SIZE = 20GB )
IF EXISTS(SELECT size * 8/(1024*1024), * FROM sys.database_files WHERE name = N'StackOverflow2010_log' AND size * 8/(1024*1024) < 10)
	ALTER DATABASE [StackOverflow] MODIFY FILE ( NAME = N'StackOverflow2010_log', SIZE = 10GB )
GO


--Create the table for the narrow index
RAISERROR('Setup section. About to create table with narrow index...', 10, 1) WITH NOWAIT

DROP TABLE IF EXISTS narrow_index

SELECT TOP(1000*1000*20) ROW_NUMBER() OVER( ORDER BY (SELECT NULL)) AS c1, CAST('Hello' AS char(8)) AS c2
INTO narrow_index
FROM sys.columns AS a, sys.columns AS b, sys.columns AS c

CREATE CLUSTERED INDEX x ON narrow_index(c1)

--Modify Posts table, so we condense cl ix and make it non-fragmented. So we (soon) can fragment it.
RAISERROR('Setup section. About to modify Posts table, so we condense cl ix and make it non-fragmented...', 10, 1) WITH NOWAIT
UPDATE Posts SET LastEditorDisplayName = ''
ALTER INDEX PK_Posts__Id ON Posts REBUILD WITH (FILLFACTOR = 100)
GO


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Investigate the data if you want

/*
--wide index
SELECT TOP(100) * FROM Posts
EXEC sp_indexinfo Posts				--Found on my web-site
EXEC sp_help 'Posts'

--narrow index
SELECT TOP(100) * FROM narrow_index
EXEC sp_indexinfo narrow_index		--Found on my web-site
EXEC sp_help 'narrow_index'
*/



-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Create the proc that executes our SQL
RAISERROR('About to create the proc that executes our SQL...', 10, 1) WITH NOWAIT
GO
CREATE OR ALTER PROC run_the_sql
 @fragmented varchar(20)
,@rows_to_read int
,@range_iterations int
AS
DECLARE 
 @sql varchar(1000)
,@sql_condition varchar(1000)
,@bp_wide_mb decimal(9,2)
,@bp_narrow_mb decimal(9,2)
,@range_start int
,@range_iterations_counter int


--Empty cache
CHECKPOINT
DBCC DROPCLEANBUFFERS

--Cold cache
SET @range_iterations_counter = 1
SET @range_start = 1000
WHILE @range_iterations_counter <= @range_iterations
BEGIN
	SET @sql_condition = ' BETWEEN ' + CAST(@range_start AS varchar(10)) + ' AND ' + CAST(@range_start + @rows_to_read AS varchar(10))
	SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(OwnerUserId) AS [cl_ix_scan ' + @fragmented + '] FROM Posts WHERE Id ' + @sql_condition + ')'
	EXEC (@sql)
	SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(c1) AS [nc_ix_scan ' + @fragmented + '] FROM narrow_index WHERE c1 ' + @sql_condition + ')'
	EXEC (@sql)
	SET @range_start = @range_start + 100000
	SET @range_iterations_counter += 1
END

/*
--Warm cache (same as above, don't clear cache first)
SET @range_iterations_counter = 1
SET @range_start = 1000
WHILE @range_iterations_counter <= @range_iterations
BEGIN
	SET @sql_condition = ' BETWEEN ' + CAST(@range_start AS varchar(10)) + ' AND ' + CAST(@range_start + @rows_to_read AS varchar(10))
	SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(OwnerUserId) AS [cl_ix_scan ' + @fragmented + '] FROM Posts WHERE Id ' + @sql_condition + ')'
	EXEC (@sql)
	SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(c1) AS [nc_ix_scan ' + @fragmented + '] FROM narrow_index WHERE c1 ' + @sql_condition + ')'
	EXEC (@sql)
	SET @range_start = @range_start + 100000
	SET @range_iterations_counter += 1
END
*/


--Keep track of BP usage
SET @bp_wide_mb = 
(
			SELECT 
			CAST((COUNT(*) * 8.00) / 1024 AS DECIMAL(9,2)) AS MB 
			FROM  sys.allocation_units AS a  
			  JOIN sys.dm_os_buffer_descriptors AS b 
				ON a.allocation_unit_id = b.allocation_unit_id
			  JOIN sys.partitions AS p
				ON a.container_id = p.hobt_id
			WHERE p.object_id = OBJECT_ID('Posts')
			  AND b.database_id = DB_ID()
)

SET @bp_narrow_mb = 
(
			SELECT 
			CAST((COUNT(*) * 8.00) / 1024 AS DECIMAL(9,2)) AS MB 
			FROM  sys.allocation_units AS a  
			  JOIN sys.dm_os_buffer_descriptors AS b 
				ON a.allocation_unit_id = b.allocation_unit_id
			  JOIN sys.partitions AS p
				ON a.container_id = p.hobt_id
			WHERE p.object_id = OBJECT_ID('narrow_index')
			  AND b.database_id = DB_ID()
)

INSERT INTO bp_usage (index_width, fragmented, mb_cache)
VALUES
 ('wide_index', CASE WHEN @fragmented = 'high_frag_level' THEN 'y' ELSE 'n' END, @bp_wide_mb)
,('narrow_index', CASE WHEN @fragmented = 'high_frag_level' THEN 'y' ELSE 'n' END, @bp_narrow_mb)


--Note size of index and frag level, should be comparative between executions
SELECT OBJECT_NAME(object_id), index_type_desc, CAST(avg_fragmentation_in_percent AS decimal(5,1)) AS frag, page_count/1000 AS page_count_1000s
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') 
WHERE index_level = 0 AND alloc_unit_type_desc = 'IN_ROW_DATA' AND OBJECT_NAME(object_id) IN('Posts', 'narrow_index')
ORDER BY index_id
GO


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--1: Cause fragmentation in both indexes

--Fragment wide ix
RAISERROR('About to cause fragmentation in wide index...', 10, 1) WITH NOWAIT
UPDATE Posts SET LastEditorDisplayName = REPLICATE('x', 39)

--Fragment narrow ix
RAISERROR('About to cause fragmentation in narrow index...', 10, 1) WITH NOWAIT
UPDATE narrow_index SET c1 = c1 + 1 WHERE c1 % 100 = 0

--Run the queries
RAISERROR('About to run queries with high frag level...', 10, 1) WITH NOWAIT
EXEC run_the_sql @fragmented = 'high_frag_level', @rows_to_read = 1000, @range_iterations = 10


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--2: no frag in either index, fillfactor set to make same size as when fragmented

RAISERROR('About to eliminate fragmentation in wide index...', 10, 1) WITH NOWAIT
ALTER INDEX PK_Posts__Id ON Posts REBUILD WITH (FILLFACTOR = 60)

RAISERROR('About to eliminate fragmentation in narrow index...', 10, 1) WITH NOWAIT
ALTER INDEX x ON narrow_index REBUILD WITH (FILLFACTOR = 50)

--Run the queries
RAISERROR('About to run queries with low frag level...', 10, 1) WITH NOWAIT
EXEC run_the_sql @fragmented = 'low_frag_level', @rows_to_read = 1000, @range_iterations = 10


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Reset
EXEC sp_configure 'cursor threshold', -1
RECONFIGURE

--Stop trace
ALTER EVENT SESSION frag_test ON SERVER STATE = STOP

-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Work the trace data

--Extract into a temp table
DROP TABLE IF EXISTS myXeData
DROP TABLE IF EXISTS myXeData2
DROP TABLE IF EXISTS myXeData3

SELECT CAST(event_data AS XML) AS StatementData
INTO myXeData
FROM sys.fn_xe_file_target_read_file('R:\frag_test*.xel', NULL, NULL, NULL);

--SELECT * FROM myXeData;

--Use XQuery to transform XML to a table
WITH t AS(
SELECT 
 StatementData.value('(event/data[@name="duration"]/value)[1]','bigint') AS duration_microsec
,StatementData.value('(event/data[@name="cpu_time"]/value)[1]','bigint') AS cpu_microsec
,StatementData.value('(event/data[@name="physical_reads"]/value)[1]','bigint') AS physical_reads
,StatementData.value('(event/data[@name="logical_reads"]/value)[1]','bigint') AS logical_reads
,StatementData.value('(event/data[@name="statement"]/value)[1]','nvarchar(500)') AS statement_
FROM myXeData AS evts
WHERE StatementData.value('(event/data[@name="statement"]/value)[1]','nvarchar(500)') LIKE '%frag_level%'
),
t2 AS (
SELECT 
 CASE WHEN t.physical_reads = 0 THEN 'warm' ELSE 'cold' END AS cold_or_warm
,CASE WHEN t.statement_ LIKE '%cl_ix_scan_%' THEN 'wide_index' ELSE 'narrow_index' END AS index_width
,CASE WHEN t.statement_ LIKE '%low_frag_level%' THEN 'n' ELSE 'y' END AS fragmented
,duration_microsec
,cpu_microsec
,physical_reads
,logical_reads
FROM t)
SELECT *
INTO myXeData2
FROM t2;




-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Run below manually and investigate the output

--Raw data from the trace
SELECT * FROM myXeData2 ORDER BY cold_or_warm, index_width, fragmented
--Verify pretty consistent values in each quartet. 
--If not, then something special occurred (checkpoint, or something external to SQL) - delete that row.

--Get avg values into a new temp table
SELECT
-- t2.cold_or_warm
 t2.index_width
,t2.fragmented
,AVG(t2.duration_microsec) AS duration_microsec
--,AVG(t2.cpu_microsec) AS cpu_microsec
,AVG(t2.physical_reads) AS physical_reads
--,AVG(t2.logical_reads) AS logical_reads
INTO myXeData3
FROM myXeData2 AS t2
GROUP BY t2.index_width, t2.fragmented
ORDER BY index_width, fragmented

--Check it out
SELECT * FROM myXeData3 ORDER BY index_width, fragmented

--Memory usage
SELECT * FROM bp_usage

--Bottom line, join the two
SELECT p.index_width, p.fragmented, p.duration_microsec, p.physical_reads, b.mb_cache
FROM myXeData3 AS p INNER JOIN bp_usage AS b ON p.index_width = b.index_width AND p.fragmented = b.fragmented
ORDER BY index_width, fragmented


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Cleanup
/*
DROP TABLE IF EXISTS narrow_index
DROP TABLE IF EXISTS myXeData
DROP TABLE IF EXISTS myXeData2
DROP TABLE IF EXISTS myXeData3
DROP TABLE IF EXISTS bp_usage
*/

Possibly the worst analogy in the world – columnstore and ordering

This post is about a topic that everybody who uses columnstore should think about, and that is ordering. The analogy I will be using is flawed in many ways, but it helps to get my point across. I hope.

You now say to yourself that columnstore indexes aren’t sorted, so what am I yapping about? Hear me out.

Imagine that I will organize my underpants. I don't know if the word underpants is, sort of, reserved for female underwear, but I'll use that term anyhow. (I'm Swedish and don't get all nuances of the English language.) Also imagine that I have a bunch of drawers. Say I fit about 50 underpants in each drawer and I have some 1000 underpants. How should I do this? The drawer is the rowgroup, although it would be ridiculous to say I have about 1,000,000 underpants in each drawer (the approximate number of rows in a col-store rowgroup). The 1000 underpants are the rows in my "underpants" table, and even 1000 underpants seems like a ridiculous number. Anyhow…

Now imagine that I'm very picky with what underpants I'm wearing. The type (boxer, briefs, string, and whatnot), color, size, material etc. So how do I decide what underpants go into each drawer? I could just shove them in there without much thought about that process. I think that is how many handle their col-store indexes.

Also imagine that we have a device that registers attributes about the underpants as we add them to the drawer. Let's call this device u47. So, the u47 will label the drawer with some info on what briefs are in there. Like range of sizes (xxs to xxl). OK, strange perhaps, but say that my size varies a lot over time. And also range of colors. And range of models. You have now figured out that I like to classify things in ranges. I.e., this particular drawer has m-xl, white-green, string to briefs and cotton to satin. This would be the meta-data that SQL Server has for each rowgroup/segment, as seen in sys.column_store_segments.
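If you want to peek at the labels on your own drawers, a query along the lines of the sketch below should do. This is just a sketch: dbo.FactResellerSalesXL is an example table name, the join on partition_id is how I recall the catalog views fitting together, and mapping column_id to an actual column name may require sys.index_columns as well.

--Sketch: the per-segment "labels" (min/max value ids) for a columnstore index
--dbo.FactResellerSalesXL is just an example table name
SELECT
 p.partition_number
,s.column_id
,s.segment_id
,s.row_count
,s.min_data_id   --the "smallest" value in the drawer, in encoded form
,s.max_data_id   --the "largest" value in the drawer, in encoded form
FROM sys.column_store_segments AS s
INNER JOIN sys.partitions AS p ON p.partition_id = s.partition_id
WHERE p.object_id = OBJECT_ID('dbo.FactResellerSalesXL')
ORDER BY s.column_id, s.segment_id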

Let's go back to how you populated your drawers. Did you just shove the underpants in there? Then you probably have close to the full range of all attribute values in each drawer. I.e., every drawer has underpants of all sizes, all colors, etc.

Now I want to wear all black underpants. Let’s say I’m a bit strange in that sense. Since there are black underpants in all drawers, it will take a while to find them all. I have to go through each drawer. Or I want all satin underpants. Or perhaps all medium size black satin briefs. I still have to go through each drawer.

But what if I took the time to sort the underpants before adding them to the drawers? I can only sort them by one attribute. I should pick the attribute that most often shows up in my "what underpants should I wear today?" decision process. You achieve this by having a row-clustered index on that column when building the col-store clustered index, and building the col-store index using WITH DROP_EXISTING. Also, instead of having my friends over and each of us grabbing some underwear and adding them to each person's drawer (parallelism when building the index), I would do this myself. Yes, that means MAXDOP = 1. Without it, say you have 4 threads/friends and sort by size when populating the drawers/building the index, then you have 4 drawers with xs-s, 4 drawers with m-l, etc.
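In T-SQL, the "sort before you fill the drawers" trick might look something like the sketch below. Table, column and index names are made up for the example; the point is the combination of a row-clustered index on the sort column, DROP_EXISTING and MAXDOP = 1.

--Sketch: pre-sort the data by building a row-based clustered index on the chosen column
CREATE CLUSTERED INDEX cix_Sales ON dbo.Sales(OrderDateKey)

--Convert it to a clustered columnstore index, single-threaded to keep the sort order
CREATE CLUSTERED COLUMNSTORE INDEX cix_Sales ON dbo.Sales
WITH (DROP_EXISTING = ON, MAXDOP = 1)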

For non-clustered col-store index, it would be enough with a non-clustered index that covers the clustered index when converting it from row to col-store index (DROP_EXISTING).

The underpants inside the drawer aren't sorted; this is a col-store index we're talking about, remember? But, you might have some fancy vacuum-tech that allows you to fit many underpants in each drawer (the various columnstore compression technologies), something you probably ordered when you bought the u47, btw.

Now, imagine me walking from the shower to my dressing room and thinking about how to find the underpants of today. I.e., the execution plan. “Today, I want to wear underpants based on color, size and material – model doesn’t matter”. I can’t say what drawers to go through, since that information isn’t available at this stage. It is a run-time decision to do drawer-elimination based on the notes on each drawer (what ranges it contains for each attribute). I.e., rowgroup/segment elimination is a run-time decision.

The underwear will only be partitioned over the drawers in an effective manner for one attribute, the one I sorted them by before populating my drawers. If that is size, then one drawer has only xs, another has only s, etc. But the "xs drawer" has the full range of colors, materials etc. There might be some correlation between attributes (like rubber underpants tending to be black), but that correlation isn't usable for us.

How would row indexes fit in this analogy? You have a bunch of post-it notes with which you build a tree-like structure that you lay out on the floor. You use this tree (over size, for instance) to navigate to the ones you are interested in (medium size, for instance), and finally you reach the sub-section in the drawer. You can have several such trees, for various attributes, of course. But the smart vacuum-tech you bought with the u47 isn't available for these indexes. If you are very selective (remember the rubber pants?), then you can very quickly find that pair using the tree structure. But it is quicker to find all black underpants using the col-store index, because you have many of those and the vacuum-tech allows you to fit so many in each drawer.

And the vacuum-tech also magically allows you to put on as many as some 900 pairs of underpants at a time. Remember that I want to wear all my black underpants. This is batch-mode in the execution plan. (Rumor has it that at the end of 2019 we will have the option to put on more than one pair at a time even without this magical vacuum tech. That would be batch mode on row-store. We'll see.)

The bottom line? Think about how the data is sorted when building your col-store indexes, and consider rebuilding them if your loading/modification routines cause them to degrade over time. Oh, I should add that you shouldn't mess about with col-store indexes without having Niko's blog post series handy, and also having read Hugo's series first.

Disclaimer 1: I don’t own any rubber underpants.
Disclaimer 2: If I did own any rubber underpants, I would still have disclaimer 1 in there…

Does index fragmentation matter?

Short answer: probably not.

[Edit] I suggest you also check out the final post in this series: http://sqlblog.karaszi.com/fragmentation-the-final-installment/

[Edit 2] Because of the flood of spam, I've closed comments on this. Please comment on the final installment post instead.

Long answer: read on.

You know the story. Every week or so, we defragment the indexes. Many of us use Ola Hallengren's great script for this, some use Maintenance Plans, and there are of course other alternatives as well. But are we just wasting time and effort? Quite probably we are. I'm going to start with some basics, then do some reasoning, and finally give you some numbers from a very simple test that I ran. The T-SQL code is available. If you give it a try, please let us know your findings for your environment by adding a comment. I will do some generalizations and simplifications, to avoid this post being 10 times longer.

What is index fragmentation?

By index here I mean b-tree indexes. You know, the regular types of indexes. I'm not referring to "odd" indexes such as columnstore, hash, geospatial, fulltext or xml indexes. For a b-tree index (from now on, I will just say "index"), the leaf level is stored according to the order of the index key. I.e., the leaf is sorted. As you modify data, you can get page splits. Imagine inserting a row "in the middle" when there's no room on the page. SQL Server will split the page: it links in a new page from somewhere, moves 50% of the rows to that new page, and then adds the row where it should be.

  • If you now follow the linked list, you jump back and forth. Aka external fragmentation.
  • The page split left us with two (nearly) half-full pages. Aka internal fragmentation.
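If you want to see both types of fragmentation for an index of your own, a query such as the sketch below should do. Note that dbo.SomeTable is a placeholder, and that you need the 'SAMPLED' or 'DETAILED' mode to get the page fullness.

--Sketch: external fragmentation (logical vs physical order) and internal fragmentation (page fullness)
--dbo.SomeTable is a placeholder name
SELECT
 OBJECT_NAME(s.object_id) AS table_name
,s.index_id
,s.index_type_desc
,s.avg_fragmentation_in_percent    --external fragmentation
,s.avg_page_space_used_in_percent  --internal fragmentation (page fullness)
,s.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.SomeTable'), NULL, NULL, 'SAMPLED') AS s
WHERE s.index_level = 0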

 

What do we do about it?

Duh, we defragment the indexes, of course! 🙂 The two commonly used methods are:

  • Rebuild the indexes. ALTER INDEX … REBUILD. This will create a new index and then remove the old index. It requires free space in the database file(s) of about the same size as the index.
  • Reorganize the index. ALTER INDEX … REORGANIZE. Basically, it walks the linked list, and every time it would have to go backward in the file to reach the next page, it swaps those two pages. It does a bit more than that (it re-introduces the fillfactor), but essentially it swaps pages so that the list goes forward physically in the file.

While doing defrag, we typically leave some free space on each page (aka fillfactor). Rebuild allows us to specify the fillfactor, while reorg will re-apply whatever value you specified when creating the index.
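For reference, the two alternatives look something like the sketch below (index and table names are placeholders):

--Rebuild: creates a new index; the fillfactor can be specified here
ALTER INDEX ix_SomeIndex ON dbo.SomeTable REBUILD WITH (FILLFACTOR = 80)

--Reorganize: shuffles pages in place and re-applies the fillfactor defined on the index
ALTER INDEX ix_SomeIndex ON dbo.SomeTable REORGANIZE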

 

Why might fragmentation hurt us?

Several aspects here. Let us discuss them one by one:

Lots of empty space

Yes, a page split leaves us with two half-full pages. At that point in time. But as more rows are added, they will fit on that free space.

In the end, you will probably average at about 75% fullness (75% being in between half-full and full). That is for indexes where the inserts (and possibly updates) are inserted “all over the place”.

But we also have increasing indexes. Think columns such as "OrderDate". Or identity/sequence columns, which are pretty often primary keys, which in turn pretty often are also the clustered index. For such an index, inserts will always add rows to the end of the linked list, and SQL Server will just add pages to the end – not "splitting" any page. (Yeah, adding a page to the end of the linked list is technically called a page split in SQL Server lingo, but that says more about SQL Server's problems with terminology than anything else.) The point here is that increasing indexes will not have free space in them caused by inserts, since the inserts go to the end.

Common values I’ve seen for fillfactor are between 70 and 80%.

  • For increasing indexes, that just makes the indexes bigger. And remember that it isn't unlikely that the actual data is an increasing index (the clustered index). So, instead of having a 100 GB table, you now have a 120 GB table.
  • For other indexes, you probably get about the same size “naturally” (without doing defrag) as when doing a defrag (depending on your fillfactor, of course).

Pages are out of order

Ask your SAN people how much difference there is between sequential I/O and random I/O in your SAN. The answer to this question will tell you whether this point is relevant. For a single spinning disk, there is a big difference. With sequential I/O you don’t have the disk head movement. But as you introduce more disks in various RAID configs, add in a SAN in the mix and now have a bunch of disks and also other concurrent activity apart from your SQL Server in that SAN, the picture isn’t so clear anymore. Now add SSD or some similar technology to the mix. In the end, there might not be that much of a difference between random and sequential I/O on a modern storage solution.

Splitting a page takes time

Yup, it sure does. If you can defragment frequently enough, with a chosen fillfactor for each index, so that they never split, then you avoid this. But how many of us do that? I.e., hand-craft the fillfactor for each index by observing the amount of fragmentation for each index since the last defrag, and repeat this week after week until we have tuned the "perfect" fillfactor for each index. Some of us might do that, but as a consultant, I can tell you that none of my clients have opted for me to do that type of work. In the end, I'm not that certain that we reduce page splits by doing our regular defrag, at least not to the extent that we hope for.

Splitting writes to the transaction log.

Yes, it does. But again, how much difference is there in reality between defragmenting and not defragmenting, regarding the page splits that occur during business hours?

 

Why might doing defrag hurt us?

Of course there is another side of the coin! Defrag isn't free. When people ask me whether to defrag, a short question back from me is "How much does it hurt?" If the answer is "None at all", then sure, go ahead! If the answer is "A lot!", then we want to think twice.

It reduces concurrency because of locking

  • Rebuild has an ONLINE option available if you are on Enterprise Edition (EE), which makes the rebuild essentially online. But using ONLINE causes the rebuild to take longer compared to offline.
  • If not on EE, then the whole table will be locked.
    • A shared lock if the rebuild is on a nonclustered index – this allows concurrent reads but not modifications.
    • An exclusive lock if the index is a clustered index – disallowing both reads and modifications.
  • Reorg will only take a brief lock on the two pages "where it currently is". This makes reorg much more online-ly than rebuild on non-EE.

 

It puts load on your system

Say you have a 200 GB database and rebuild everything. Now you shuffle 200 GB of data. This isn't free. Even if you use smarter scripts to only defrag what is actually fragmented in the first place, defrag will probably shuffle quite a lot of data. Your VM/SAN/SQL/whatever probably has other things to do at night-time. We all fight over these precious night/weekend hours. And for systems that are expected to be online pretty much all the time, this is even more pressing.

You will need a larger LDF file

If you rebuild everything (blindly), then the ldf file needs to be the same size as the (sum of the) data file(s). A smart script perhaps only rebuilds 30% of the data, cutting this down to some 30% of the data size. (This doesn't apply to simple recovery model.)

The following log backup will be large

Again, if you rebuild 100 GB worth of indexes, then the subsequent log backup will be 100 GB in size. Etc. You get the picture by now. (Again this doesn’t apply to simple recovery model.)

Is there a bottom line?

I don't know. Maybe there is. But as you already know, we need to throw a bunch of "it depends" into the mix. Still, I believe there is a lot of defragmenting going on for a lot of SQL Servers just because "that is the way we always did it". Our environments aren't the same as they were 20 years ago.

Can we quantify this?

Now we are getting to the really hard part. Our environments aren’t the same. The HW, SAN, data, VMs, load pattern. Etc.

Still, I was curious to see if I could measure any difference between a fragmented and a non-fragmented index, on an SSD. I used my Ultrabook, a 2-core proc with 4 logical cores, with an SSD disk. (I did not disable parallelism.) This is not how your production SQL Servers are set up, I hope. But I still wanted to test the SSD aspect, and I think that many of you are moving towards flash and tiered SANs, if you didn't do that transformation already. Anyhow, at the end of this post, you'll find the SQL I ran. In short:

  • I used the Stack Overflow database, available to download. The smallest one (roughly 10 GB of data). Here I used the Posts table, which has a clustered index on an identity column. This is my wide index. The table (i.e., the index) has 37 million rows and is about 8.5 GB in size (when I run my tests).
  • I created my own table for the narrow index. I wanted to test both. This has 20 million rows, and is only about 1 GB in size.
  • I have a stored procedure that scans the whole index following the linked list (one query for each of the two tables above). No data is returned to the client.
  • I run the queries twice in the proc, so I can compare cold (physical I/O) and warm (no physical I/O).
  • I did this first for when the indexes are fragmented, ran the proc 4 times.
  • I then did a defrag, and ran the proc 4 times again.
  • I selected the fillfactor so we end up with same index size as in the first run. This can be debated, but what I wanted to look at is the random vs sequential aspect.
  • I was prepared to delete any execution that had weird values because of external things going on. That was not necessary in the end.
  • I averaged duration_ms, cpu_ms, physical_reads and logical_reads.
  • I captured the metrics using an Extended Events trace, which I post-processed using T-SQL and XQuery.

If you want to use the script at the end, you will need to go through it and adapt it to your environment. It will perhaps take 30 minutes to understand what it does and adapt the file name for the trace etc. And then some 30 minutes for the execution.

My findings

For my environment, I found no relevant difference between a fragmented index and a non-fragmented index when doing a scan over the whole index, from the first page in the linked list to the last page. I'm not surprised, considering that I'm on SSD, but I still wanted to see this with my own eyes, in a sort of controlled environment.

Update: Magnus Ahlkvist did a test at a SQL Saturday event on a spinning disk. He did find a significant difference for the wider table. However, there was no noticeable difference for the narrow index (something one would want to dig deeper into).

Here are the numbers:

cold_or_warm  index_width   fragmented  duration_ms  cpu_ms  physical_reads  logical_reads
cold          narrow_index  n                  1848    2543          123787         124471
cold          narrow_index  y                  1877    2550          124307         125571
cold          wide index    n                 15067    2758         1019433        1020312
cold          wide index    y                 15217    2722         1018283        1019295
warm          narrow_index  n                  1149    4104               0         124942
warm          narrow_index  y                  1105    4101               0         126517
warm          wide index    n                   475    1867               0        1028017
warm          wide index    y                   490    1945               0        1031488

References

Paul White: https://sqlperformance.com/2015/01/t-sql-queries/allocation-order-scans

The Code:

-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Setup trace section

SET NOCOUNT ON

--Stop the trace session if it is still running from a previous execution
ALTER EVENT SESSION frag_test ON SERVER STATE = STOP
WAITFOR DELAY '00:00:02'

--Delete XE file, using xp_cmdshell (bad, I know)
EXEC sp_configure 'xp_cmdshell', 1 RECONFIGURE WITH OVERRIDE
EXEC xp_cmdshell 'DEL R:\frag_test*.xel', no_output 
EXEC sp_configure 'xp_cmdshell', 0 RECONFIGURE WITH OVERRIDE


IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'frag_test')
DROP EVENT SESSION frag_test ON SERVER


CREATE EVENT SESSION frag_test ON SERVER 
ADD EVENT sqlserver.sp_statement_completed()
ADD TARGET package0.event_file(SET filename=N'R:\frag_test')
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=2 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
GO

ALTER EVENT SESSION frag_test ON SERVER STATE = START



-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Disable IAM order scan, so we know that SQL Server will follow the linked list
--See https://sqlperformance.com/2015/01/t-sql-queries/allocation-order-scans

EXEC sp_configure 'cursor threshold', 1000000
RECONFIGURE


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Setup section

--You might want to grow the data and log files for StackOverflow database pre-running this test. Or run it twice.

USE StackOverflow

DROP TABLE IF EXISTS narrow_index

--Create the table for the narrow index
SELECT TOP(1000*1000*20) ROW_NUMBER() OVER( ORDER BY (SELECT NULL)) AS c1, CAST('Hello' AS char(8)) AS c2
INTO narrow_index
FROM sys.columns AS a, sys.columns AS b, sys.columns AS c

CREATE CLUSTERED INDEX x ON narrow_index(c1)

--Modify the Posts table, so we condense the clustered index and make it non-fragmented. So we (soon) can fragment it.
UPDATE Posts SET LastEditorDisplayName = ''
ALTER INDEX PK_Posts__Id ON Posts REBUILD WITH (FILLFACTOR = 100)
GO


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Investigate the data if you want

/*
--wide index
SELECT TOP(100) * FROM Posts
EXEC sp_indexinfo Posts				--Found on my web-site
EXEC sp_help 'Posts'

--narrow index
SELECT TOP(100) * FROM narrow_index
EXEC sp_indexinfo narrow_index		--Found on my web-site
EXEC sp_help 'narrow_index'
*/



-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Create the proc that executes our SQL

CREATE OR ALTER PROC run_the_sql
 @fragmented varchar(20)
AS
DECLARE 
 @sql varchar(1000)

--Empty cache
CHECKPOINT
DBCC DROPCLEANBUFFERS


--Cold cache
SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(OwnerUserId) AS [cl_ix_scan ' + @fragmented + '] FROM Posts WHERE Id between 1000 AND 20000)'
EXEC (@sql)
SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(c1) AS [nc_ix_scan ' + @fragmented + '] FROM narrow_index WHERE c1 BETWEEN 50000 AND 60000)'
EXEC (@sql)

----Warm cache
SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(OwnerUserId) AS [cl_ix_scan ' + @fragmented + '] FROM Posts WHERE Id between 1000 AND 20000)'
EXEC (@sql)
SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(c1) AS [nc_ix_scan ' + @fragmented + '] FROM narrow_index WHERE c1 BETWEEN 50000 AND 60000)'
EXEC (@sql)

--Note size of index and frag level; should be comparable between executions
SELECT OBJECT_NAME(object_id), index_type_desc, CAST(avg_fragmentation_in_percent AS decimal(5,1)) AS frag, page_count/1000 AS page_count_1000s
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') 
WHERE index_level = 0 AND alloc_unit_type_desc = 'IN_ROW_DATA' AND OBJECT_NAME(object_id) IN('Posts', 'narrow_index')
ORDER BY index_id
GO



-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--1: cause fragmentation in both indexes

--Fragment cl ix
UPDATE Posts SET LastEditorDisplayName = REPLICATE('x', 39)

--Fragment nx ix
UPDATE narrow_index SET c1 = c1 + 1 WHERE c1 % 100 = 0

--Run the queries
EXEC run_the_sql @fragmented = 'high_frag_level'
EXEC run_the_sql @fragmented = 'high_frag_level'
EXEC run_the_sql @fragmented = 'high_frag_level'
EXEC run_the_sql @fragmented = 'high_frag_level'




-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--2: no frag in either index, fillfactor set to make same size as when fragmented

ALTER INDEX PK_Posts__Id ON Posts REBUILD WITH (FILLFACTOR = 60)
ALTER INDEX x ON narrow_index REBUILD WITH (FILLFACTOR = 50)

--Run the queries
EXEC run_the_sql @fragmented = 'low_frag_level'
EXEC run_the_sql @fragmented = 'low_frag_level'
EXEC run_the_sql @fragmented = 'low_frag_level'
EXEC run_the_sql @fragmented = 'low_frag_level'



-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Reset
EXEC sp_configure 'cursor threshold', -1
RECONFIGURE

--Stop trace
ALTER EVENT SESSION frag_test ON SERVER STATE = STOP

-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Work the trace data

--Extract into a temp table
DROP TABLE IF EXISTS myXeData
DROP TABLE IF EXISTS myXeData2

SELECT CAST(event_Data AS XML) AS StatementData
INTO myXeData
FROM sys.fn_xe_file_target_read_file('R:\frag_test*.xel', NULL, NULL, NULL);

--SELECT * FROM myXeData;

--Use XQuery to transform XML to a table
WITH t AS(
SELECT 
 StatementData.value('(event/data[@name="duration"]/value)[1]','bigint') / 1000 AS duration_ms
,StatementData.value('(event/data[@name="cpu_time"]/value)[1]','bigint') /1000 AS cpu_ms
,StatementData.value('(event/data[@name="physical_reads"]/value)[1]','bigint') AS physical_reads
,StatementData.value('(event/data[@name="logical_reads"]/value)[1]','bigint') AS logical_reads
,StatementData.value('(event/data[@name="statement"]/value)[1]','nvarchar(500)') AS statement_
FROM myXeData AS evts
WHERE StatementData.value('(event/data[@name="statement"]/value)[1]','nvarchar(500)') LIKE '%frag_level%'
),
t2 AS (
SELECT 
 CASE WHEN t.physical_reads = 0 THEN 'warm' ELSE 'cold' END AS cold_or_warm
,CASE WHEN t.statement_ LIKE '%cl_ix_scan_%' THEN 'wide index' ELSE 'narrow_index' END AS index_width
,CASE WHEN t.statement_ LIKE '%low_frag_level%' THEN 'n' ELSE 'y' END AS fragmented
,duration_ms
,cpu_ms
,physical_reads
,logical_reads
FROM t)
SELECT *
INTO myXeData2
FROM t2;

--Raw data from the trace
SELECT * FROM myXeData2 ORDER BY index_width, cold_or_warm, fragmented
--Verify pretty consistent values in each quartet. 
--If not, then something special occurred (checkpoint, or something external to SQL) - delete that row.

--Get avg values and compare them
SELECT
 t2.cold_or_warm
,t2.index_width
,t2.fragmented
,AVG(t2.duration_ms) AS duration_ms
,AVG(t2.cpu_ms) AS cpu_ms
,AVG(t2.physical_reads) AS physical_reads
,AVG(t2.logical_reads) AS logical_reads
FROM myXeData2 AS t2
GROUP BY t2.cold_or_warm, t2.index_width, t2.fragmented
ORDER BY cold_or_warm, index_width, fragmented



-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Cleanup
/*
DROP TABLE IF EXISTS narrow_index
DROP TABLE IF EXISTS myXeData
DROP TABLE IF EXISTS myXeData2
*/

Make sure you update the statistics

You might know that statistics can have a high impact on performance. As you add, remove and modify data, the statistics will be more and more outdated.

Sure, there's the database option that updates statistics automatically, but it is a bit … rough. For a table with 10,000,000 rows, you have to modify 2,000,000 rows prior to 2016 (20%), but only 100,000 rows as of 2016 with db compat level 2016 (SQRT(1000*@rows)). Also, when auto-update stats kicks in, it will sample the underlying data, so that it doesn't take too long, etc.
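If you want to see how outdated the statistics are for a table, a query such as the sketch below shows when each statistics object was last updated and how many modifications have happened since. dbo.SomeTable is a placeholder name.

--Sketch: last update time and modifications since then, per statistics object
SELECT
 st.name AS stats_name
,sp.last_updated
,sp.rows
,sp.modification_counter
FROM sys.stats AS st
CROSS APPLY sys.dm_db_stats_properties(st.object_id, st.stats_id) AS sp
WHERE st.object_id = OBJECT_ID('dbo.SomeTable')
ORDER BY sp.modification_counter DESC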

Many of us use Ola Hallengren's excellent maintenance solution. I'm sure that some of us have our own favorite modifications we make to the jobs that the installation script creates for us. My point here is that the installation script does not update statistics by default. Here's how the job step in the "IndexOptimize – USER_DATABASES" job looks on my SQL Server 2017 instance, all by default:

EXECUTE [dbo].[IndexOptimize]
@Databases = 'USER_DATABASES',
@LogToTable = 'Y'

The default value for the @UpdateStatistics parameter is NULL, which means "Do not perform statistics maintenance". See this page for reference.

So, by using the defaults, you end up defragmenting your indexes (which might not give you that much nowadays, considering that there isn't that much of a difference between random and sequential I/O on modern disk subsystems). For some of the indexes that you defragment, you will get new statistics as a by-product. Those are the indexes that you rebuild – rebuild internally creates a new index and then drops the old one. But you likely have many indexes that don't reach the 30% fragmentation threshold. And consider indexes over a key which is steadily increasing or decreasing: inserting new rows will not cause fragmentation, but the statistics will become out-of-date.

What to do? Easy, just add a new job in which you call the IndexOptimize procedure with options to update all statistics where at least one row has been modified. Ola even has such an example on his web-site, Example D. Here it is; I just added a parameter to log the commands to the CommandLog table:

EXECUTE dbo.IndexOptimize
@Databases = 'USER_DATABASES',
@FragmentationLow = NULL,
@FragmentationMedium = NULL,
@FragmentationHigh = NULL,
@UpdateStatistics = 'ALL',
@OnlyModifiedStatistics = 'Y',
@LogToTable = 'Y'

Schedule the above as you wish. I prefer to do it early every morning if I can. But your circumstances, like database size and how the data is modified, will influence the frequency.

Another option is to modify the existing "IndexOptimize – USER_DATABASES" job and just add the parameter below. This will both defrag your indexes and update statistics.

@UpdateStatistics = 'ALL'

What about Maintenance Plans, you might think? Well, we all know that they aren't that … smart – which is why we use scripts like Ola's in the first place. Regarding statistics updates, they will update all stats, regardless of whether any rows at all have been modified since last time. It is a complete waste of resources to update stats if nothing has changed. Even sp_updatestats is smarter in this regard.

Note: This is in no way a criticism of the maintenance solution that Ola provides. He has no way of knowing our requirements in the job he creates. If the installation script could read our minds, I’m sure that it would schedule the jobs for us as well. OTOH, if Ola could create a T-SQL installation script that could read our minds, then he probably would do something else in the first place. 🙂

Rebuilding indexes online, wasting time and resources?

Many SQL Server installations do regular index rebuilds or reorgs in order to defragment indexes. One can question how much we really gain by doing this when using modern SANs, SSDs, flash disks, etc. But that is not the topic for this blog post. What I want to raise with this blog post is whether to use ONLINE when rebuilding indexes. Some background first:

  • With ONLINE = OFF (default), you will be blocked from accessing the data while the index rebuild is being performed. Depending on the type of index, you either get a schema modification lock on the table, which prohibits any type of access, or a shared lock, which allows reads only.
  • With ONLINE = ON (requires Enterprise Edition), you get a very brief lock at beginning and end of operation. I will refer to the modes as ONLINE and OFFLINE from now on.

There are more details to what locks are acquired, depending on various factors. Check out https://docs.microsoft.com/en-us/sql/t-sql/statements/alter-index-transact-sql for more info. The point is that ONLINE = ON allows users to both read and modify data while the operation is ongoing. That is a good thing, right? Well, it comes with a price…

The time it takes to rebuild the index can be substantially longer with ONLINE. Many of us have other things to do with the database during night-time and/or weekends. Yes, these are the typical windows in which we try to fit things such as index rebuilds. Say that you do it at night-time and it currently takes 4 hours. Wouldn't it be nice if you could cut that time down to 1.5 hours? That would leave more time for imports, massaging of data, CHECKDB and other things you want to do. Sure, you can do it ONLINE, but it will slow down access during the rebuild. Also, the more data you modify during the rebuild, the more space you need in tempdb.
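For reference, here is roughly what the two variants look like in T-SQL (dbo.SomeTable is a placeholder; ONLINE = ON requires Enterprise Edition):

--Offline rebuild (the default): faster, but blocks access to the data
ALTER INDEX ALL ON dbo.SomeTable REBUILD WITH (ONLINE = OFF)

--Online rebuild: the data stays accessible, but the rebuild takes longer
ALTER INDEX ALL ON dbo.SomeTable REBUILD WITH (ONLINE = ON)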

My point is to not do ONLINE unless you actually need it. Some systems don't have to be "open" during night-time anyhow. Only you can tell what your requirements are, of course. To get some numbers, I decided to create a few indexes on a copy of the FactResellerSalesXL_PageCompressed table found in the AdventureworksDW2016 database, and I named that table FactResellerSalesXL. I created the copy without data compression and also without any columnstore index. And then I created a few non-clustered indexes. Here's the index and space usage for the table, using my sp_indexinfo (with some columns omitted and others re-arranged).

Not a large table by any means, but it gives us something to play with. And it allowed me to do the rebuilds several times in order to rule out figures out of the ordinary. I tested this in both FULL and SIMPLE recovery. The numbers are averages over several executions, in minutes:seconds:

  • FULL recovery
    • ONLINE: 3:45
    • OFFLINE: 1:03
  • SIMPLE recovery
    • ONLINE: 1:40
    • OFFLINE: 0:31

 

I frequently use Ola Hallengren’s scripts for database maintenance. (We all do, right? :-)) The default for Ola’s solution (IndexOptimize) is to do the rebuild ONLINE if possible. Or, to be more specific, the defaults are:

  • < 5% fragmentation: do nothing
  • Fragmentation between 5 and 30 %: do REORGANIZE.
  • Fragmentation over 30%: do REBUILD with ONLINE = ON

I.e., when using Ola’s scripts you will rebuild your indexes ONLINE, assuming you are on Enterprise Edition (or similar). For Maintenance Plans (anybody remember those?), the checkbox for ONLINE is not checked by default.

Bottom line: If you have problems with index rebuilds taking a long time and you don't need other people accessing the data while this is being done, consider doing it OFFLINE instead. Or perhaps not defragment in the first place? Or raise the bar to some 60-70% so you only defragment the indexes that are really fragmented in the first place.
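If you use Ola's IndexOptimize and want to force OFFLINE rebuilds and/or raise that bar, a call along the lines of the sketch below should do it. The parameter names and values are from my memory of Ola's documentation, so verify them against the version you are running.

--Sketch: rebuild offline only, and only indexes above 60% fragmentation
--Verify parameter names/values against Ola's documentation for your version
EXECUTE dbo.IndexOptimize
 @Databases = 'USER_DATABASES',
 @FragmentationHigh = 'INDEX_REBUILD_OFFLINE',
 @FragmentationLevel2 = 60,
 @LogToTable = 'Y'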

Free book about execution plans

Grant Fritchey has released the third edition of his book about execution plans. And it is freely available thanks to the cooperation with Redgate.

There’s a lot of good stuff in there. Introduction to execution plans, plan caching, reading plans, etc. and then of course perhaps the “meat of the book”: a discussion about the various operators that you can find in an execution plan. Sure, these are documented in Books Online, but far from the level of this book. Say you find a Hash Match in the plan. Is it good? Is it bad? Under what circumstances? And, exactly what does it really do? Things like that…

Get it here.

Managing tempdb

This post is all about the tempdb database. It is not about capacity planning, like size or number of data files – there is plenty of info about that out there. Instead, this is about managing it, in the sense of changing the initial file size, adding a file, removing a file etc. SSMS isn't obvious about some of these things for the tempdb database, and for some things SSMS is flat out wrong.

I tested all operations on SQL Server 2017 and SSMS 17.8.1.

 

Background

Tempdb is re-created at start-up. Or, rather, re-initialized (for lack of a better word). SQL Server does not delete the files and create new ones; that would take ages for large files if you don't have Instant File Initialization (which you never have for log files). So how does SQL Server know what files to create, the logical and physical names, size etc.? The answer is sys.master_files.

 

sys.master_files

This is the "template" information for the tempdb files. I will refer to sys.master_files as the template below (not to be confused with model, which is the template for other things, but not for the file aspects of tempdb). Here you see how the tempdb files will be created/initialized at the next start-up. Don't confuse this with tempdb.sys.database_files, which shows the current state of the tempdb files. The two queries below will show you the current state of the tempdb files and also the template state:

--Current
SELECT
'tempdb' AS db_name_
,file_id
,name
,physical_name
,size * 8/1024 AS size_MB
,type_desc
,CASE WHEN is_percent_growth = 1 THEN CAST(growth AS varchar(3)) + ' %' ELSE CAST(growth * 8/1024 AS varchar(10)) + ' MB' END AS growth
,max_size * 8/1024 AS max_size_MB
FROM tempdb.sys.database_files
ORDER BY type, file_id

--Template
SELECT
DB_NAME(database_id) AS db_name_
,file_id
,name
,physical_name
,size * 8/1024 AS size_MB
,type_desc
,CASE WHEN is_percent_growth = 1 THEN CAST(growth AS varchar(3)) + ' %' ELSE CAST(growth * 8/1024 AS varchar(10)) + ' MB' END AS growth
,max_size * 8/1024 AS max_size_MB
FROM master.sys.master_files
WHERE DB_NAME(database_id)  = 'tempdb'
ORDER BY db_name_, type, file_id

 

Increasing current and template file size to larger than what it currently is

This is pretty straight-forward, both using T-SQL directly and also using SSMS. Here’s what it looks like in SSMS:

(The word “Initial” for the header “Initial Size (MB)” is pure nonsense. What you see is the current file size, picked up from sys.database_files. I am happy to see that the word “Initial” is removed in SSMS 18.0.)

To increase the file size you just type the desired, larger, size for the tempdb file. The T-SQL command to do this is also very straightforward, for example:

ALTER DATABASE tempdb 
 MODIFY FILE ( NAME = N'tempdev', SIZE = 5GB )

ALTER DATABASE tempdb 
 MODIFY FILE ( NAME = N'temp2', SIZE = 5GB )

Note that SSMS will generate the file size in KB for the T-SQL command, but the T-SQL command accepts KB, MB, GB and even TB.

This operation increases the current file size as well as the template size.

 

Decreasing current file size

To reduce the current size, you can attempt a shrink operation, for instance using a T-SQL command such as the one below:

USE tempdb
DBCC SHRINKFILE(tempdev, 100)

The second parameter is the desired size in MB. This will change the current size as well as the template size. In my experience, you should be prepared for limited success when shrinking a tempdb file to reduce the current size. In many cases, you will have to re-start SQL Server so the file is created with the template size. Or hunt down whatever is using it and take it from there.
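A starting point for that hunt can be the session space usage DMV, as in the sketch below. It shows how many tempdb pages each session currently has allocated for user objects (temp tables etc.) and internal objects (sorts, spools etc.).

--Sketch: which sessions hold the most allocated tempdb pages?
USE tempdb
SELECT
 su.session_id
,su.user_objects_alloc_page_count - su.user_objects_dealloc_page_count AS user_object_pages
,su.internal_objects_alloc_page_count - su.internal_objects_dealloc_page_count AS internal_object_pages
FROM sys.dm_db_session_space_usage AS su
ORDER BY (su.user_objects_alloc_page_count - su.user_objects_dealloc_page_count)
       + (su.internal_objects_alloc_page_count - su.internal_objects_dealloc_page_count) DESC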

 

Changing template file size to smaller than what it currently is

One way is to change the size in the GUI to a smaller value; SSMS will then try to make the current size smaller as well as change the template size. If you don't want to perform the shrink operation, then specify a larger size than the current size, script the command to a T-SQL query window, and in there just specify whatever size you want. It will generate a command such as:

ALTER DATABASE tempdb 
 MODIFY FILE ( NAME = N'tempdev', SIZE = 100MB )

If the specified size is smaller than the current file size, then it will only change the template, not the current size.

 

Adding file

This is dead-easy. In SSMS, just press the “Add” button and specify the desired name, size etc. This will generate T-SQL such as:

ALTER DATABASE tempdb 
 ADD FILE ( NAME = N'tempdb3', FILENAME = N'R:\SqlData\a\tempdb_mssql_3.ndf' , SIZE = 5GB , FILEGROWTH = 200MB )

This will add it immediately and also to the template (both sys.database_files and sys.master_files).

 

Remove file

This is simple enough if the file is empty. In SSMS you select the file and press the “Remove” button.  It will generate T-SQL such as:

ALTER DATABASE tempdb 
 REMOVE FILE tempdb3

But if the file is currently used, you will get an error message and nothing is changed: not the current state, nor the template.

You can always attempt to shrink first using the EMPTYFILE option:

USE tempdb
DBCC SHRINKFILE(tempdb3, EMPTYFILE)

If you're lucky, the file was emptied and you can now use the ALTER command with the REMOVE FILE option. If not, then you are back where you started.

You can try to re-start SQL Server and see if that allows you to remove the file. If not, then perhaps use the "failsafe" startup switch: /f (see Erin Stellato's blog about this: https://www.sqlskills.com/blogs/erin/remove-files-from-tempdb/). Not being connected from SSMS when executing the commands might also help.

 

Changing other file attributes

If you want to change things like autogrow or maxsize, then just use SSMS or T-SQL directly. Here’s an example T-SQL command to change both autogrow and maxsize:

ALTER DATABASE tempdb 
 MODIFY FILE ( NAME = N'tempdev', MAXSIZE = 500GB , FILEGROWTH = 500MB )

 

Moving a file

This operation requires a re-start. SSMS doesn’t allow you to do this, so use T-SQL directly, such as:

ALTER DATABASE tempdb 
 MODIFY FILE (NAME = tempdev, FILENAME = 'R:\SqlData\b\tempdb.mdf')

Warning: Henrik reminded me in the comments section to add a warning here: "Make sure that R:\SqlData\b is present and that SQL Server has sufficient rights to create files here!" He is of course correct. In the worst case you end up with SQL Server refusing to start because it cannot create tempdb where you have specified. In such a case you can try the -f startup switch, perhaps a topic for another blog post.

Now, this is a sneaky one. It does change both sys.database_files and sys.master_files. So, looking only at the catalog views, you will be fooled into thinking that the file was moved immediately. But that didn't happen; check the disk. Fortunately, the command comes with a message such as:

The file "tempdev" has been modified in the system catalog.
The new path will be used the next time the database is started.

That makes it pretty obvious. Note that the old file will not be removed when you restart your SQL Server. Go delete that file, after verifying that the new file was indeed created (if the “old” one is still the current one, then you won’t be able to delete it as long as SQL Server is running).

 

Summary

Managing tempdb isn't that hard. It gets easier as soon as you understand that we have the current state (tempdb.sys.database_files) and the template (sys.master_files). And, yes, SSMS is a bit quirky for these operations. The reason, methinks, is that it uses the same GUI as for other databases, which don't have the aspect of a "template".

Did I miss any operation? Let me know and I might add it.