Fragmentation, the final installment

Measuring things can be difficult. I have been painfully reminded of that with my attempts to measure whether fragmentation matters with new hardware. I created a test script in which I read data seeking the index (following the linked list of the leaf level of the index).

My thoughts were that “jumping back and forth” doesn’t matter with modern I/O subsystems. But perhaps read-ahead reads more (unnecessary) data into cache and less useful data remains in memory because of this?

In my first post, came to the conclusion that it didn’t matter for execution time, but I noticed that lots of memory was brought into cache when the index was fragmented. There’s some background info in there: http://sqlblog.karaszi.com/does-index-fragmentation-matter/ .

In my second blog post, I focused on how much memory was brought into cache. And I found that when the index was fragmented, I noticed 1000 times more memory usage. Surely, read-ahead, cannot cause this!? http://sqlblog.karaszi.com/index-fragmentation-part-2/

Because of the absurde figures, I had to dig deeper. Paul White told me about a lesser known optimization where SQL server will do “large I/O” (my own term for this) if you have plenty of unused memory. Instead of reading single pages, it reads whole extents. This makes sense, warming up the cache. https://blogs.msdn.microsoft.com/ialonso/2011/12/09/the-read-ahead-that-doesnt-count-as-read-ahead/

But still, that can’t account for 1000 times more memory usage for a fragmented index. I can live with 5-10 times, but not 1000 times. And then Simon Sabin pointed out the obvious: auto-update of statistics! My first set of SELECTs was done after a updating a bunch of rows (so to fragment the index), and that triggered auto update statistics (bringing the whole index into memory). Mystery solved.

My findings

As I said, I found it extremely difficult to measure this. Things happens in the machine which can muddle the results. I did the test over and over again. Looking at both details and averages. My conclusion is that with same amount of pages, you will see a pretty marginal effect of fragmentation. “Marginal” is in the eye of the beholder, though. More details further down. Please read my earlier posts, especially the first one for background information.

What about page fullness?

To be honest, I find this to be a much more interesting aspect. I’d like to split this into two sub-sections. Increasing and “random” index.

Increasing index

This can be like an identity or ever increasing datetime column (think for instance order_date). This will not become fragmented from INSERTS and assuming that you don’t delete “every other row”, then you will end up with a nice unfragmented index with a high degree of page fullness. If you rebuild it with a fillfactor of 100%, you have just wasted resources. If you rebuild it with a lower fillfactor, you also waste space and cause SQL Server to read more data – causing worse performance.

Random index

By this I mean index where the data distribution is all over the place. For instance an index on the LastName column. As you insert data, pages are split and you end up with a fillfactor of about 75% (which is in between a half-full and a full page). If you rebuild it with a similar fillfactor, you didn’t gain anything. If you rebuild it with a high fillfactor, then the following activity in your database will generate a high number of page splits. It seems reasonable to let it be, accepting a fillfactor of about 75% and accept a page split every now and then.

But we also delete data!

Do you? Really? OK, let us say that you do. What is the delete pattern? And the degree of deletes? How does this spread over your particular index? Deleting old data over some ever increasing index will just deallocate those pages. Deleting data which doesn’t match with the index key (rows deleted are spread all over the index) will leave empty space. This will be re-used as you insert new rows. But if you delete a lot of data, then you will have a low degree of page fullness. IMO, this is a rather unusual situation and should warrant a manually triggered index maintenance, in contrast to rebuilding indexes every week or so.
(And don’t get me started on heaps, but surely you don’t have those…? I once had a SELECT of all rows for a table with only 3000 rows, which took minutes. It had allocated more extents than number of rows in the table. Rebuild it, and it took, of course, sub-second. Needless to say, this didn’t stay as a heap for long. And no, the table wasn’t wide, at all.)

Bottom line?

Being who I am, I don’t want to draw too hard conclusions from some tests and reasoning. But I must say that I doubt the value of regular index defrag – assuming you are on some modern I/O subsystem, of course. I think that folks focus on index defrag and forget about statistics. I like to provide the optimizer with as high-quality statistics as possible. Sure, index rebuild will give you new statistics. But if you condition it based on fragmentation, then you end up with not rebuilding a bunch of indexes and have stale statistics. Auto update statistics is OK, but you have to modify a decent amount of data before it kicks in. And when it kicks in, the user waits for it. Unless you do async update…. But I digress. Statistics management is a topic of its own. My point is that you probably don’t want to lose track of statistics management because you keep staring as index fragmentation.

My tests – details

Check out this section if you are curious of how I did it, want to see the numbers and perhaps even do tests of your own. I welcome constructive criticism which can be weaknesses etc in my test scripts. Or perhaps different interpretations of the outcome!

The basic flow is:

  • Create a proc which does bunch of reads to fill the buffer pool with data.
  • Create some other supporting stuff, like trace definition, trace control procedure etc.
  • Grow the data and log files for the Stackoverflow (10 GB) database.
  • Create table to hold my measures.
  • Create the narrow_index table which is a table with a clustered index which is narrow. A row is some 40 bytes (very rough calculation off the top of my head).
  • Create the wide_index is a table with a clustered index which is wider – some 340 bytes.
  • Set max server memory to 1 GB.
  • Turn off auto-update statistics.
  • Create the proc that will do the tests.
  • Generate fragmentation in both indexes/tables.
  • The narrow table has 20,000,000 rows and is about 970 MB in size.
  • The wide table has 10,000,000 rows and is about 6.8 GB in size.
  • Run the proc that does the test with 1 row selectivity.
  • Run the proc that does the test with 10000 rows selectivity.
  • Run the proc that does the test with 100000 rows selectivity.
  • Each test runs 99 SELECTs, spread over the data (so not to read the same row).
  • Average the result, into a table named resx.
  • I did 4 such iterations, so I have the tables res1, res2, res3 and res4.
  • Check the numbers from these tables and see if they correspond reasonably. They did.
  • Average the results from these four table to get the end results.

The procedure which executes the SELECTS and captures the measures has this basic flow

  • Stop the (XE) trace if it is running.
  • Delete all old trace files.
  • Do a first SELECT to get rid of auto-stats (yeah, I disabled it, but just in case).
  • Empty the cache.
  • Reads lots of other data into cache.
  • Start the trace
  • Do the SELECT, in a loop, reading as many rows as requested (param to the proc), as many times as requested (another param to the proc). Don’t read the same data over and over again. This is done both for the narrow and wide table/index.
  • Capture memory usage for each table (buffer pool usage).
  • Get the trace raw data into a temp table.
  • Parse the trace data/above table using XQuery into the measures table.
  • Ie., we now have one row for each SELECT with a bunch of measures.

The measures

  • Microseconds as captured with the Extended Event trace.
  • Microseconds as “wall clock time”. I.e., capture datetime before the loop in a variable and calculate microseconds after the loop into a variable, divided by number of executions. This value will then be the same for each execution in the loop. I want to have two different ways to calculate execution time, as a sanity check. In the end, they corresponded pretty well.
  • How much data of the table was brought into cache.
  • Physical reads. This is of lower interest, since read-ahead will muddle the result.
  • Logical reads. This is more interesting and can among other things be used to check t at we read about the same number of pages for the fragmented and for the non-fragmented index. Which we should – they have the same fillfactor!
  • CPU microseconds. I didn’t find this very interesting.

Outcome

I ignored the reading of 1 row, this shouldn’t differ, and the small difference I found I attribute to imperfections in the ability to capture exact numbers.

When reading 1000 rows for the narrow table, about 75 pages, the fragmented index was about 30% slower.

When reading some 700-900 pages (100,000 rows for the narrow table and 10,000 rows for the wide table) the fragmented table was about 16-17 % slower.

When reading about 9,000 pages (100,000 rows for the wide table), the fragmented index was about 15% slower.

Here are the averages, one row per execution (remember that I did 99 SELECTs in a loop, so this is the average of those):

Narrow index
Wide index

And here is the average of those averages

The code

------------------------------------------------------------------------------------------------
--Measure cost of fragmentation in terms of execution time and memory usage
--Tibor Karaszi, 2019
--Requires the Stackoverflow database. I was using the 10 GB version.
------------------------------------------------------------------------------------------------

SET NOCOUNT ON

------------------------------------------------------------------------------------------------
--Setup section
------------------------------------------------------------------------------------------------
USE StackOverflow
GO

--Proc to fill buffer pool, reads lots of data, takes a while to execute, especially if on spinning disk
CREATE OR ALTER PROC fill_bp
AS
DECLARE @a bigint
SET @a = (SELECT AVG(CAST(PostId AS bigint)) FROM Comments)
SET @a = (SELECT AVG(CAST(CommentCount AS bigint)) FROM Posts)
SET @a = (SELECT AVG(CAST(PostId AS bigint)) FROM Comments)
SET @a = (SELECT AVG(CAST(VoteTypeId AS bigint)) FROM Votes)
SET @a = (SELECT AVG(CAST(PostId AS bigint)) FROM Comments)
SET @a = (SELECT AVG(CAST(VoteTypeId AS bigint)) FROM Votes)
SET @a = (SELECT AVG(CAST(VoteTypeId AS bigint)) FROM Votes)
GO

--Disable IAM order scan, so we know that SQL Server will follow the linked list
--See https://sqlperformance.com/2015/01/t-sql-queries/allocation-order-scans
EXEC sp_configure 'cursor threshold', 1000000 RECONFIGURE
GO

--Proc to change trace status
CREATE OR ALTER PROC change_trace_status 
	 @state varchar(50) --start, stop, delete_all_files
	AS

	IF @state = 'start'
	 BEGIN
	   IF NOT EXISTS (SELECT * FROM sys.dm_xe_sessions WHERE name = 'frag_test')
		ALTER EVENT SESSION frag_test ON SERVER STATE = START
	 END

	IF @state = 'stop'
	BEGIN
	   IF EXISTS (SELECT * FROM sys.dm_xe_sessions WHERE name = 'frag_test')
		 ALTER EVENT SESSION frag_test ON SERVER STATE = STOP
	 END

	--Delete XE file, using xp_cmdshell (ugly, I know)
	IF @state = 'delete_all_files'
	BEGIN
	  EXEC sp_configure 'xp_cmdshell', 1 RECONFIGURE WITH OVERRIDE
	  EXEC xp_cmdshell 'DEL R:\frag_test*.xel', no_output 
	  EXEC sp_configure 'xp_cmdshell', 0 RECONFIGURE WITH OVERRIDE
	END

	/*
	EXEC change_trace_status @state = 'start'
	EXEC change_trace_status @state = 'stop'
	EXEC change_trace_status @state = 'delete_all_files'
	*/
GO


--Drop and create event session to keep track of execution time
EXEC change_trace_status @state = 'stop'
WAITFOR DELAY '00:00:01'
IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'frag_test')
 DROP EVENT SESSION frag_test ON SERVER
EXEC change_trace_status @state = 'delete_all_files'
CREATE EVENT SESSION frag_test ON SERVER 
	ADD EVENT sqlserver.sp_statement_completed()
	ADD TARGET package0.event_file(SET filename=N'R:\frag_test')
	WITH (MAX_DISPATCH_LATENCY=10 SECONDS)
GO


--Create proc to report progress
CREATE OR ALTER PROC #status 
 @msg varchar(200)
AS
RAISERROR(@msg, 10, 1) WITH NOWAIT
GO

--Grow the data and log files for StackOverflow database.
EXEC #status 'Grow the data and log files for StackOverflow database...'
IF EXISTS(	SELECT size * 8/(1024*1024), * 
			FROM sys.database_files 
			WHERE name = N'StackOverflow2010' AND size * 8/(1024*1024) < 30)
	ALTER DATABASE [StackOverflow] MODIFY FILE ( NAME = N'StackOverflow2010', SIZE = 30GB )
IF EXISTS(	SELECT size * 8/(1024*1024), * 
			FROM sys.database_files 
			WHERE name = N'StackOverflow2010_log' AND size * 8/(1024*1024) < 15)
	ALTER DATABASE [StackOverflow] MODIFY FILE ( NAME = N'StackOverflow2010_log', SIZE = 15GB )
GO


--Table to hold output 
DROP TABLE IF EXISTS measures

CREATE TABLE measures (
 id int identity(1,1) PRIMARY KEY NOT NULL
,rows_to_read int NOT NULL
,index_width varchar(20) NOT NULL
,fragmented varchar(2) NOT NULL
,usec_xe bigint NOT NULL
,usec_clock bigint NOT NULL
,cpu_microsec bigint NOT NULL
,physical_reads bigint NOT NULL
,logical_reads bigint NOT NULL
,mb_cache decimal(9,2) NOT NULL
);



--Create the table for the narrow index
EXEC #status 'Setup section. Create table with narrow index...'
DROP TABLE IF EXISTS narrow_index


--Adjust numbers. 20,000,000 rows means about 970 MB in the end
SELECT TOP(1000*1000*20) ROW_NUMBER() OVER( ORDER BY (SELECT NULL)) AS c1, CAST('Hello' AS varchar(80)) AS c2
INTO narrow_index
FROM sys.columns AS a, sys.columns AS b, sys.columns AS c


CREATE CLUSTERED INDEX x ON narrow_index(c1)


--Create the table for the wide index
EXEC #status 'Setup section. Create table with wide index...'
DROP TABLE IF EXISTS wide_index

--Adjust numbers. 10,000,000 rows means about 6.8 GB in the end
SELECT 
 TOP(1000*1000*10) ROW_NUMBER() OVER( ORDER BY (SELECT NULL)) AS c1
,CAST('Hi' AS char(80)) AS c2
,CAST('there' AS char(80)) AS c3
,CAST('I''m' AS char(80)) AS c4
,CAST('on' AS char(80)) AS c5
,CAST('my' AS varchar(200)) AS c6
INTO wide_index
FROM sys.columns AS a, sys.columns AS b, sys.columns AS c

CREATE CLUSTERED INDEX x ON wide_index(c1)
GO


------------------------------------------------------------------------------------------------
--Investigate the data if you want

/*
--wide index
SELECT TOP(100) * FROM wide_index
EXEC sp_indexinfo wide_index		--Found on my web-site
EXEC sp_help 'wide_index'

--narrow index
SELECT TOP(100) * FROM narrow_index
EXEC sp_indexinfo narrow_index		--Found on my web-site
EXEC sp_help 'narrow_index'
*/

--Execute this if you want to have a rather full BP, restricts memory to 1 GB
EXEC sp_configure 'max server memory', 1000 RECONFIGURE

--Turn off auto-update statistics
ALTER DATABASE Stackoverflow SET AUTO_UPDATE_STATISTICS OFF

------------------------------------------------------------------------------------------------
--/Setup section
------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------
--Create the proc that executes our SQL
EXEC #status 'Create the proc that executes our SQL...'
GO
CREATE OR ALTER PROC run_the_sql
 @fragmented varchar(20)
,@rows_to_read int
,@range_iterations int
,@fill_bp char(1)			--'y' or 'n'
,@report_frag char(1) = 'n'
AS
DECLARE 
 @sql varchar(1000)
,@sql_condition varchar(1000)
,@bp_wide_mb decimal(9,2)
,@bp_narrow_mb decimal(9,2)
,@range_start int
,@range_iterations_counter int
,@a int
,@start_time datetime2
,@exec_time_wide_index_usec bigint
,@exec_time_narrow_index_usec bigint

EXEC change_trace_status @state = 'stop'
EXEC change_trace_status @state = 'delete_all_files'

--Do a first execution to get autostats etc out of the way
DBCC FREEPROCCACHE
SET @range_iterations_counter = 1
SET @range_start = 1000
SET @a = (SELECT COUNT_BIG(c1) AS c1_count FROM wide_index WHERE c1 BETWEEN @range_start AND (@range_start + @rows_to_read - 1))
SET @a = (SELECT COUNT_BIG(c1) AS c1_count FROM narrow_index WHERE c1 BETWEEN @range_start AND (@range_start + @rows_to_read - 1))

--Empty data cache (buffer pool, BP)
CHECKPOINT
DBCC DROPCLEANBUFFERS

--Run proc to read stuff into BP if requested
IF @fill_bp = 'y'
  EXEC fill_bp

EXEC change_trace_status @state = 'start'

--Do the SELECTs, narrow index
SET @range_iterations_counter = 1
SET @range_start = 1000
SET @start_time = SYSDATETIME()
WHILE @range_iterations_counter <= @range_iterations
BEGIN
	SET @a = (SELECT COUNT_BIG(c1) AS c1_count FROM narrow_index WHERE c1 BETWEEN @range_start AND (@range_start + @rows_to_read - 1))
	SET @range_start = @range_start + 100000
	SET @range_iterations_counter += 1
END
SET @exec_time_narrow_index_usec = DATEDIFF_BIG(microsecond, @start_time, SYSDATETIME()) / @range_iterations

--Do the SELECTs, wide index
SET @range_iterations_counter = 1
SET @range_start = 1000
SET @start_time = SYSDATETIME()
WHILE @range_iterations_counter <= @range_iterations
BEGIN
	SET @a = (SELECT COUNT_BIG(c1) AS c1_count FROM wide_index WHERE c1 BETWEEN @range_start AND (@range_start + @rows_to_read - 1))
	SET @range_start = @range_start + 100000
	SET @range_iterations_counter += 1
END
SET @exec_time_wide_index_usec = DATEDIFF_BIG(microsecond, @start_time, SYSDATETIME()) / @range_iterations

EXEC change_trace_status @state = 'stop'

--Keep track of BP usage
SET @bp_wide_mb = 
(
			SELECT 
			CAST((COUNT(*) * 8.00) / 1024 AS DECIMAL(9,2)) AS MB 
			FROM  sys.allocation_units AS a  
			  JOIN sys.dm_os_buffer_descriptors AS b 
				ON a.allocation_unit_id = b.allocation_unit_id
			  JOIN sys.partitions AS p
				ON a.container_id = p.hobt_id
			WHERE p.object_id = OBJECT_ID('wide_index')
			  AND b.database_id = DB_ID()
)

SET @bp_narrow_mb = 
(
			SELECT 
			CAST((COUNT(*) * 8.00) / 1024 AS DECIMAL(9,2)) AS MB 
			FROM  sys.allocation_units AS a  
			  JOIN sys.dm_os_buffer_descriptors AS b 
				ON a.allocation_unit_id = b.allocation_unit_id
			  JOIN sys.partitions AS p
				ON a.container_id = p.hobt_id
			WHERE p.object_id = OBJECT_ID('narrow_index')
			  AND b.database_id = DB_ID()
)


--Wait for trace data to arrive in target
WAITFOR DELAY '00:00:10'

--Get the trace data into our table
SELECT CAST(event_data AS XML) AS StatementData
INTO #myRawXeData
FROM sys.fn_xe_file_target_read_file('R:\frag_test*.xel', NULL, NULL, NULL);

--Done with trace file, delete it
EXEC change_trace_status @state = 'delete_all_files';

--Transform raw trace data into our measures table
WITH t AS(
SELECT 
 StatementData.value('(event/data[@name="duration"]/value)[1]','bigint') AS duration_microsec
,StatementData.value('(event/data[@name="cpu_time"]/value)[1]','bigint') AS cpu_microsec
,StatementData.value('(event/data[@name="physical_reads"]/value)[1]','bigint') AS physical_reads
,StatementData.value('(event/data[@name="logical_reads"]/value)[1]','bigint') AS logical_reads
,StatementData.value('(event/data[@name="statement"]/value)[1]','nvarchar(500)') AS statement_
FROM #myRawXeData AS evts
WHERE StatementData.value('(event/data[@name="statement"]/value)[1]','nvarchar(500)') LIKE '%index WHERE c1%'
),
t2 AS (
SELECT 
 CASE WHEN t.statement_ LIKE '%wide_index%' THEN 'wide_index' ELSE 'narrow_index' END AS index_width
,CASE @fragmented WHEN 'high_frag_level' THEN 'y' ELSE 'n' END AS fragmented
,duration_microsec
,CASE WHEN t.statement_ LIKE '%wide_index%' THEN @exec_time_wide_index_usec ELSE @exec_time_narrow_index_usec END AS usec_clock
,cpu_microsec
,physical_reads
,logical_reads
,CASE WHEN t.statement_ LIKE '%wide_index%' THEN @bp_wide_mb ELSE @bp_narrow_mb END AS mb_cache
FROM t)
INSERT INTO measures(rows_to_read, index_width, fragmented, usec_xe, usec_clock, cpu_microsec, physical_reads, logical_reads, mb_cache)
SELECT @rows_to_read, index_width, fragmented, duration_microsec, usec_clock, cpu_microsec, physical_reads, logical_reads, mb_cache
FROM t2;

--Report fragmentation level, if requested
IF @report_frag = 'y'
	--Note size of index and frag level, should be comparative between executions
	SELECT
	  OBJECT_NAME(object_id) AS table_name
	 ,index_type_desc
	 ,CAST(avg_fragmentation_in_percent AS decimal(5,1)) AS frag_level
	 ,page_count/1000 AS page_count_1000s
	FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') 
	WHERE index_level = 0 AND alloc_unit_type_desc = 'IN_ROW_DATA' AND OBJECT_NAME(object_id) IN('narrow_index', 'wide_index')
	ORDER BY index_id
GO


------------------------------------------------------------------------------------------------
--1: Generate fragmentation in both indexes

--Fragment wide ix
EXEC #status 'Generate fragmentation in wide index...'
UPDATE wide_index SET c6 = REPLICATE('a', 200) WHERE c1 % 20 = 0
UPDATE STATISTICS wide_index WITH FULLSCAN

--Fragment narrow ix
EXEC #status 'Generate fragmentation in narrow index...'
UPDATE narrow_index SET c2 = REPLICATE('a', 20) WHERE c1 % 100 = 0
UPDATE STATISTICS narrow_index WITH FULLSCAN

--Run the queries
EXEC #status 'Run queries with high frag level...'
EXEC run_the_sql @fragmented = 'high_frag_level', @rows_to_read = 1,		@range_iterations = 99, @fill_bp = 'y'
EXEC run_the_sql @fragmented = 'high_frag_level', @rows_to_read = 10000,	@range_iterations = 99, @fill_bp = 'y'
EXEC run_the_sql @fragmented = 'high_frag_level', @rows_to_read = 100000,	@range_iterations = 99, @fill_bp = 'y'


-----------------------------------------------------------------------------------
--2: no frag in either index, fillfactor set to make same size as when fragmented

EXEC #status 'Eliminate fragmentation in wide index...'
ALTER INDEX x ON wide_index REBUILD WITH (FILLFACTOR = 48)

EXEC #status 'Eliminate fragmentation in narrow index...'
ALTER INDEX x ON narrow_index REBUILD WITH (FILLFACTOR = 50)

--Run the queries
EXEC #status 'Run queries with low frag level...'
EXEC run_the_sql @fragmented = 'low_frag_level', @rows_to_read = 1,		 @range_iterations = 99, @fill_bp = 'y'
EXEC run_the_sql @fragmented = 'low_frag_level', @rows_to_read = 10000,	 @range_iterations = 99, @fill_bp = 'y'
EXEC run_the_sql @fragmented = 'low_frag_level', @rows_to_read = 100000, @range_iterations = 99, @fill_bp = 'y'


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Reset
EXEC sp_configure 'cursor threshold', -1
EXEC sp_configure 'max server memory', 2147483647
RECONFIGURE
GO

-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Run below manually and investigate the output

--Raw data from the trace
SELECT * FROM measures ORDER BY rows_to_read, index_width, fragmented

--Average the data.
--Note that I ran this script 4 times and used INTO here to create 4 tables. Res1, res2, res3, and res4.
SELECT
 m.rows_to_read
,m.index_width
,m.fragmented
,AVG(m.usec_xe) AS usec_xe
,AVG(m.usec_clock) AS usec_clock
,AVG(m.mb_cache) AS mb_cache
,AVG(m.physical_reads) AS physical_reads
,AVG(m.logical_reads) AS logical_reads
,AVG(m.cpu_microsec) AS cpu_microsec
-- INTO res4						--if you want to persist the results
FROM measures AS m
GROUP BY m.rows_to_read, m.index_width, m.fragmented
ORDER BY index_width, rows_to_read, fragmented;


--Check the details from the averages of each execution
WITH x AS(
SELECT 1 AS ord, * FROM res1
UNION ALL
SELECT 2 AS ord, * FROM res2
UNION ALL
SELECT 3 AS ord, * FROM res3
UNION ALL
SELECT 4 AS ord, * FROM res4
)
SELECT * FROM x
WHERE index_width = 'narrow_index'
ORDER BY index_width, rows_to_read, fragmented, ord;

--Average the averages
WITH x AS(
SELECT 1 AS ord, * FROM res1
UNION ALL
SELECT 2 AS ord, * FROM res2
UNION ALL
SELECT 3 AS ord, * FROM res3
UNION ALL
SELECT 4 AS ord, * FROM res4
)
SELECT
 m.rows_to_read
,m.index_width
,m.fragmented
,AVG(m.usec_xe) AS usec_xe
,AVG(m.usec_clock) AS usec_clock
,AVG(m.mb_cache) AS mb_cache
,AVG(m.physical_reads) AS physical_reads
,AVG(m.logical_reads) AS logical_reads
,AVG(m.cpu_microsec) AS cpu_microsec
 FROM x AS m
GROUP BY m.rows_to_read, m.index_width, m.fragmented
ORDER BY index_width, rows_to_read, fragmented;





-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Cleanup
/*
DROP TABLE IF EXISTS narrow_index
DROP TABLE IF EXISTS myXeData
DROP TABLE IF EXISTS myXeData2
DROP TABLE IF EXISTS bp_usage
*/

How selective do we need to be for an index to be used?

You know the answer already: It depends. But I often see some percentage value quoted and the point of this post is to show that there is no such percentage value.

To get the most out of this blog post, you should understand the basic structure for an index, i.e. how the b+ tree look like. You should also understand the difference between a clustered and a non-clustered index. In essence, you should be able to visualize these structures and searches through them as you read the text. If you find that difficult, draw a few versions on a piece of paper and “navigate” through them by tracing through them with a pen or your finger. After a while, you will do this in your mind. For instance, check out the sections under this.

I’m not saying that we shouldn’t consider selectivity when designing indexes – of course we should! I’m not saying that one shouldn’t have some vague “feeling” about how much data to be return when making such decisions. What I will prove is that there is in reality no set percentage that the optimizer uses. The comment we usually see is something like:

“If we return more than 5% of the rows, then an index will not be used.”

Where did that 5% number came from? I can assure you that this is not some hard-wired number in the optimizer (except for an edge-case, see below). The optimizer aims at running the query with as low cost as possible. Considering the data access part (think WHERE clause and the condition), this is pretty much about reading as few pages as possible (few page-accesses).

Just to cut down a bit on the thread that might follow these types of blogs (“Hey, when I do this, your observations doesn’t match, your blog post is incorrect!”), let us first consider some special cases:

Clustered index
The clustered index *is* the data. If the search condition (SARG) is SEEKable, then SQL Server will obviously seek through a clustered index instead of scan it. Anything else would be stupid. There can be *other* search conditions that are more efficient, but we are considering one search condition at-a-time.

Non-clustered index that covers the query
This is pretty much the same argument as for above. Since all data is in the index (“covers the query”), not seeking it would be stupid. Again, there can be cheaper alternatives for any of the other search conditions, but we are considering one condition at-a-time.

The value is not known to the optimizer
This is what happens when you have a TSQL variable you compare against. Something like “colname = @v”. The optimizer has no knowledge of the contents of this variable. Either it uses density (where applicable, like “=”), as stored in the statistics information of the index. Where not applicable (like “>”, “<“, “BETWEEN” etc), then the optimizer actually do use some hard-wired percentage value. This value can change between versions so give it a spin of you want to know what value you have for your version/build number. Note that a variable is not the same thing as a parameter. SQL Server sniffs parameters (parameter sniffing). Read this for elaboration: http://msdn.microsoft.com/en-us/library/ee343986.aspx.

The search expression is not seek-able
I hope you know this already, but just to point it out. In most cases, having some calculation at the column side will void the ability to seek through the index. This should ideally be known to all T-SQL developers: Never do calculations at the column side! So, things to avoid are like “colname * 23 > 45” or “SOMEFUNCTION(colname) = 44”.

Hopefully by now we all understand that there are always special cases and exceptions. The more of Kalen’s books you have read, the more you understand this. What we are discussing here is the typical situation. OK? Fine. So, “Why is there no percentage value that the optimizer uses?”, you ask. Because the value will differ. In short, SQL Server wants to read as few pages as possible. In the most simple example, the alternative to an index seek is a table scan. So we will use this as basis for your discussion. There can be other alternatives to the table scan (using some other index for some other condition), but that doesn’t change the principal “it depends” concept.

In essence, it is all about the alternative. As I said, our example wil use a table scan as alternative. A table scan (or clustered index scan if it is a clustered table) means that SQL Server will look at every page and see what rows satisfies the search condition on each page.

My example has two different tables, both with 100,000 rows. These tables both have an integer column with consecutive increasing unique values, which also has a non-clustered index. I will see how selective I need to be when searching on this column in order for an index search to be done, compared to a table scan. I.e, find this percentage cut-off value.

The fewrows table only fit one row per data page. This means 100,000 data pages. My tests show that the cutoff for fewrows is about 31,000 rows. I.e., we need to be more selective than 31% for the index to be used.

The manyrows table fit 384 rows per page. This means 260 pages. My tests show that the cutoff for fewrows is about 155 rows. I.e., we need to be more selective than 0.16% for the index to be used.

You might end up with different exact numbers, depending on what you have in the statistics, the build number of your SQL Server etc. But what you will see that a similar pattern. A huge difference between the two.

It is all about the alternative
If I look at my indexes using sys.dm_db_index_physical_stats, I will see that the non-clustered index on the int column for the two tables are exactly the same (same number of pages in the index, etc). So, two indexes with the same characteristics have very different cut-off values. How can that be? It is because the alternative differs. The alternative for this example is a table scan. For the bigrows table, the alternative means reading 100,000 pages. But for the smallrows table, the alternative means reading only 260 pages. There can of course be other alternatives, like using some other index for some other search condition. This is, in the end, why we don’t have a set percentage value: it is all about the alternative!

Conclusion
The typical cases will of course fall somewhere between my more extreme examples. But my examples show that there is no set percentage value used by the optimizer. I showed that for my test, the percentage value can be as low as 0.15% or as high as 31%. What matter is the alternative!

T-SQL

USE tempdb
GO

IF OBJECT_ID('manyrows'IS NOT NULL DROP TABLE manyrows
IF OBJECT_ID('fewrows'IS NOT NULL DROP TABLE fewrows
GO

CREATE TABLE manyrows(c1 INT IDENTITY PRIMARY KEYc2 INTc3 INT)
CREATE TABLE fewrows(c1 INT IDENTITY PRIMARY KEYc2 INTc3 CHAR(4500))
GO

INSERT INTO manyrows
SELECT TOP 100000 ROW_NUMBER() OVER (ORDER BY a.OBJECT_IDAS c2AS c3
FROM sys.columns AS asys.columns AS b

INSERT INTO fewrows
SELECT TOP 100000 ROW_NUMBER() OVER (ORDER BY a.OBJECT_IDAS c2'hi' AS c3
FROM sys.columns AS asys.columns AS b

CREATE INDEX ON manyrows (c2)
CREATE INDEX ON fewrows (c2)

--Number of pages etc:
EXEC sp_indexinfo 'manyrows'
-- Data: 265 pages (2.07 MB)
-- Index x: 187 pages (1.46 MB)

EXEC sp_indexinfo 'fewrows'
-- Data: 100385 pages (784 MB)
-- Index x: 187 pages (1.46 MB)

SELECT OBJECT_NAME(OBJECT_ID), *, OBJECT_NAME(OBJECT_ID)
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED')

--Run below with showplan:
SELECT FROM manyrows
WHERE c2 BETWEEN AND 155
--155 = ix search, 156 = ts
--Cut-off is 0.16%

SELECT FROM fewrows
WHERE c2 BETWEEN AND 31000
--31000 = ix search, 32000 = ts
--Cut-off is 31%

Adding a PK online?

I just read in a forum about a user who want to replikate a table, but the table doesn’t have a PK. The table is pretty large, and having the table not available while adding the PK is undesireable. The table has a clustered index already, and there are other columns which are known to be unique (presence of unique indexes).

What I wanted to test is whether we can just add the PK constraint using the ONLINE option. Show answer is “yes”. We can’t turn a unique index into a PK using some meta-data only operation, unfortunately. That would be the easiest step. But we can add a unique constraint using the ONLINE option – there’s even an example syntax for this in BOL. We can then remove the pre-existing unique index using ONLINE. Since we are using ONLINE, we need to be on Enterprise or Developer Edition.

I wanted to test this, and below is my test script:

USE tempdb
SET NOCOUNT ON
GO

IF OBJECT_ID('t'IS NOT NULL DROP TABLE t
GO
CREATE TABLE t(c1 INT NOT NULL, c2 CHAR(100))
CREATE UNIQUE CLUSTERED INDEX ON t(c1)

INSERT INTO t
SELECT TOP(5000000ROW_NUMBER() OVER(ORDER BY a.id), 'x'
FROM syscolumns AS a
CROSS JOIN syscolumns AS b
CROSS JOIN syscolumns AS c
GO

-----------------------------------------------------
--Now try to add a PK "online"...:
-----------------------------------------------------

--Add a nullable identity?
ALTER TABLE ADD c3 INT IDENTITY NULL
--Msg 8147, Level 16, State 1, Line 1
--Could not create IDENTITY attribute on nullable column 'c3', table 't'.
GO

--Add a PK using ONLINE?
--Prepare a new connection with following INSERTs
--to verify it can run simultaneously:
--INSERT INTO t(c1, c2) VALUES(5000001, 't')
--INSERT INTO t(c1, c2) VALUES(5000001, 't')
--INSERT INTO t(c1, c2) VALUES(5000002, 't')
--GO
--INSERT INTO t(c1, c2) VALUES(5000003, 't')

--Above prepared? OK, execute below and jump to
--other window to verify it is online
ALTER TABLE ADD CONSTRAINT PK_t PRIMARY KEY NONCLUSTERED (c1WITH(ONLINE = ON)
GO

--Verify the indexes using my own sp_indexinfo
EXEC sp_indexinfo 't'

Key count in sys[.]indexes

The old sysindexes table (as of 2005 implemented as a compatibility view) has a useful column named keycnt. This is supposed to give us the number of columns (keys) in the index. However, to make heads and tails out of the numbers, we need to understand how a non-clustered index is constructed. For a heap, the pointer to a row is the physical file/page/row address (aka “rowid”). This is counted as a key in the keycnt column:

IF OBJECT_ID('t1'IS NOT NULL DROP TABLE t1
GO
CREATE TABLE T1 (c1 INTc2 datetimec3 VARCHAR(3))
CREATE INDEX ix_T1_c1 ON T1 (c1)
CREATE INDEX ix_T1_c1_c2 ON T1 (c1c2)
CREATE INDEX ix_T1_c1_c2_c3 ON T1 (c1c2c3)
CREATE INDEX ix_T1_c2 ON T1 (c2)
SELECT namekeycntindidid
FROM sys.sysindexes
WHERE id OBJECT_ID('T1')

For the index on column (c2), you see a keycnt of 2. This is the key in the index plus the rowid.

For a nonclustered index on a clustered table, the row locator is the clustering key. Note, though, that if the clustered index is not defined as unique (PK, UQ etc), then another “uniqueifier” key/column is added. Building on above example:

CREATE UNIQUE CLUSTERED INDEX ix_T1_c1 ON T1 (c1WITH DROP_EXISTING
SELECT namekeycntindidid
FROM sys.sysindexes
WHERE id OBJECT_ID('T1')
GO
CREATE CLUSTERED INDEX ix_T1_c1 ON T1 (c1WITH DROP_EXISTING
SELECT namekeycntindidid
FROM sys.sysindexes
WHERE id OBJECT_ID('T1')

Consider the (non-clustered) index on column c2. For the first one, where the table has a unique clustered index, we see a keycnt of 2, the column c2 plus the clustered key. But when we define the clustered index as non-unique, we see +1 for the keycnt column; the uniqueifier. The uniqueifier is 4 byte, and only populated for rows which are duplicates of an existing clustered key (i.e. no extra cost if no duplicates).

But we want to stay away from compatibility views, right? Since we no longer have a key count column in sys.indexes, we need to grab that from sys.index_columns. This do not, however include the internal columns, only the explicitly defined columns:

SELECT
i.name
,i.index_id
,(SELECT COUNT(*)
FROM sys.index_columns AS ic
WHERE i.OBJECT_ID ic.OBJECT_ID
AND i.index_id ic.index_idAS keycnt
FROM sys.indexes AS i
WHERE OBJECT_ID OBJECT_ID('T1')

Fed up with hunting physical index details?

I am. I find myself endlessly hunting for index information when working against the various SQL Servers I come in contact with. And, sure, the information is there. You just need to go and get it. This generally means that I start with sp_helpindex. Then some SELECT from sys.indexes. Then some more against sys.partitions and sys.allocation units (we want space usage stats as well). And perhaps general usage stats (sys.dm_index_usage_stats). (I sometimes might even use the GUI (SSMS) reports and index dialog – but you might already know that I’m not much of a GUI person.)

The good news with all this is that I learn to use these catalog and dynamic management views. Bad news is that it is kind of … boring to do the same thing again and again.

This is why wrote sp_indexinfo. You might have your own index information procedures (which you wrote yourself or found on the Internet). If not, you are welcome to use this one. I aim to improve it over time, so suggestions are welcome. Possible improvements include:

  1. Make it a function. Functions are nice since we can order the results, aggregate, and basically do whatever we want to when we SELECT from the function. But for this I need to find out how we install a user-defined global system function – there’s no supported way to do this. I’m not sure I want to go there…
  2. Reurn missing index information as well. For this we probably want two resultsets, and only return missing index information when we targeted *one* table (no wildcards). If we do this, then function is out since a function can only return *one* result set.

If you care to give it a spin, please let me know. I just wrote the procedure, so I haven’t tested it much yet. If you do find bugs, please leave a comment and I will incorporate into the source (let me know if you want to be acknowledged). Any comments are welcome.

You find the proc at: http://karaszi.com/spindexinfo-enhanced-index-information-procedure

Find table and index name for fragmented indexes

Got this question from a newsgroup today. The answer is pretty simple, just use the dynamic management view sys.dm_db_index_physical_stats. I’m posting this here mostly so I have somewhere to refer to when asked this question…

I prefer to have a helper function to get the index name:

CREATE FUNCTION dbo.index_name (@object_id int@index_id int)
RETURNS sysname
AS
BEGIN
RETURN
(SELECT name FROM sys.indexes WHERE object_id @object_id and index_id @index_id)
END;
GO

And then a simple query:

SELECT
OBJECT_NAME(object_idAS tblName
,dbo.index_name(object_idindex_idAS ixName
,avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, NULL)
WHERE avg_fragmentation_in_percent 20
AND index_type_desc IN('CLUSTERED INDEX''NONCLUSTERED INDEX')

Then you just adjust the search clause to your liking. One hint is to exclude nindexes with few pages (the page_count column).