Index fragmentation, part 2 – Tibor Karaszi's SQL Server blog

Does fragmentation hurt?

Checkout my earlier post in this series http://sqlblog.karaszi.com/does-index-fragmentation-matter/.

[EDIT] Some of the conclusions in this post turned out to be invalide, due to auto-update statistics kicking in. Check out the final post: http://sqlblog.karaszi.com/fragmentation-the-final-installment/

In my last blog post, I wanted to focus on the sequential vs random I/O aspect and how that part should be more or less irrelevant with modern hardware. So I did a test that did a full scan (following the linked list of an index) and see if I could notice any performance difference on my SSD. I couldn’t.

That isn’t the end of the story, it turns out. Another aspect is how the data is brought into memory. You might know that SQL server can do “read ahead”, meaning it does larger reads per I/O instead if single-page I/O. I.e., fewer but larger I/O operations. This sounds fine, but what happens when we have fragmentation?

I have had this in the back of my mind for ages now. Since nobody else seemed to care about or discuss this topic, I had this feeling that I was barking up the wrong tree. But, since I now have a test-script I can just tweak it a bit so it does less than a full scan to see if fragmentation seems to make a difference for read-ahead. Spoiler alert: It sure can do! (But possibly not for read-ahead per se.)

My thinking was that read-ahead will read stuff into memory, which it turns out it doesn’t need. Something like bring an extent into memory, but it turns it only need 1 of those 8 pages, and then it jumps somewhere else. I never had a real confirmation whether this applies, or if read-ahead is smarter than that.

So, basically, I parameterized my query so I can play with the selectivity for each SELECT. I have two tables, or perhaps I should say two B-Tree indexes, to play with:

Name	Rows (millions)	Size GB
wide index	3.7	8.4
narrow index	20	1

So, what selectivity to choose? I did three tests:

1000 rows per query.
2 rows per query. For this I suspected no difference. After all, each query will just navigate the tree and find those 2 rows most probably on the same page.
10000 rows per query.

This should be enough to see a pattern, I hope. We already know that for a full scan/seek we couldn’t see a difference. I’m measuring the following attributes:

Duration.
Physical reads.
Size of index in memory. I.e., how much memory was consumed by the query load.

I did not find any difference for a warm index, i.e., when the data is already in cache. So, I didn’t clutter this post with such measures here. Feel free to run my test and play with it if you want to see for yourself. This is why I did not measure logical reads. Also, CPU usage was so low it wasn’t measurable.

Again, the rebuild was done with a fillfactor so that the indexes are the same size both when fragmented and when not fragmented. This is probably not how you do things, but it is so obvious that fitting more rows on a page will make reads quicker, so we don’t need to investigate that aspect.

1000 rows per SELECT:

index_width	fragmented	duration_ms	physical_reads	mb_cache
narrow_index	n	951	8	0.64
narrow_index	y	1090	13	118.01
wide_index	n	5156	191	14.94
wide_index	y	5656	204	16.27

What stands out above is how much of the narrow table was brought into memory from the workload. First few times I ran the test, I thought that something was wrong. I also ran the test and looked if I had physical I/O for the “warm” execution – no. The thinking was that the for the no frag executions, data was brought out of cache for whatever reason, but no physical I/O for the warm execution. I also did a run where I removed the wide_index from the test, thinking that perhaps having this in there will skew the test data – but no.

I also investigated the BP using sys.dm_os_buffer_descriptors with the thinking that just a few MB of used cache cannot contain the data I’m looking for. I fit 162 rows per data page when the index isn’t fragmented. I had 72 such pages in cache. This gives me about 11000 rows. In my workload, I did 10 iterations where each read 1000 rows. I.e., 10000 rows in total. In other words, the data that I worked with did indeed fit in those few MB of memory. I also used sys.dm_os_buffer_descriptors after an execution with only the fragmented narrow table and now there are 17,000 pages in cache instead of 72 pages.

My conclusion is that fragmentation can cause bringing lots and lots if unnecessary data into cache. If you have other thoughts or theories, let me know.

As for the other numbers: Duration increases a bit with fragmentation, and so does physical reads. But not that much.

2 rows per SELECT:

index_width	fragmented	duration_ms	physical_reads	mb_cache
narrow_index	n	673	2	0.16
narrow_index	y	840	13	124.41
wide_index	n	941	2	0.19
wide_index	y	1421	17	1.58

Above doesn’t make sense at all. Why would a two (2) row SELECT bring so much data into cache when the index happens to be fragmented? I.e., why is read-ahead kicking in for a query with such high selectivity? So I decided to dig deeper. I extracted the query and ran it for itself, just a single execution, without any fancy dynamic SQL going on. I compared execution plans between the fragmented index and the non-fragmented index, but both are identical, index seek. Estimated numbers of rows were also pretty much on the target (1.2 rows).

A single query, reading a single row bring 20 data pages into cache when the table was fragmented (32 pages in total, including index pages). When defragmenting that index, the very same query brought 6 pages into cache. That was still more than expected, but less then when we had fragmentation. There is obviously more at play here… The funny thing is that STATISTICS IO doesn’t report any read-ahead and trace flag 652 (disable read ahead) doesn’t do any difference for the numbers above in this paragraph. So maybe it isn’t read ahead, after all? But there is something fishy going on.

10,000 rows per SELECT:

index_width	fragmented	duration_ms	physical_reads	mb_cache
narrow_index	n	2851	65	5.14
narrow_index	y	2251	58	120.04
wide_index	n	31371	1883	147.12
wide_index	y	34697	1908	149.26

Same pattern for the narrow index. Not much for me to add, except that here we got less I/O for the narrow index when the table was fragmented. I ran it a few times and got a consistent result. I can’t explain this, so I just leave it for any of you to pick your guesses.

Disabling read-ahead

What if I disable read-ahead? I used trace flag 652 to disable read-ahead and ran the the workload with 1000 rows selectivity:

index_width	fragmented	duration_ms	physical_reads	mb_cache
narrow_index	n	1117	15	1.19
narrow_index	y	1723	22	120.65
wide_index	n	14333	204	15.95
wide_index	y	16143	213	16.83

Same pattern for memory usage. But now we see a very different number for physical reads. It seems like Extended Event, the sp_statement_completed event and physical_reads value, suffers from the same defect as performance monitor. It doesn’t include the reads done my the read-ahead thread. On the other hand, we can also see that read-ahead can be our friend. Look at the difference in duration.

Bottom line

It turns out that fragmentation can have a pretty significant impact. How much? As usual, it depends on your workload. If you scan entire indexes (see my earlier blog post), then the memory aspect doesn’t seem to differ, which also makes sense. Unless you are so memory constrained that a page is removed from cache, while the query is still running and that page turns out to be needed again.

But when we are more selective than doing a full scan, fragmentation can make a difference for memory usage, evidently. My test is only one workload. And I only noticed a relevant difference for the narrow index. YMMV, of course.

The code

SET NOCOUNT ON


-----------------------------------------------------------------------------------
--Create table to keep track of buffer pool usage
USE StackOverflow

DROP TABLE IF EXISTS bp_usage

CREATE TABLE bp_usage (
 index_width varchar(30) NOT NULL
,fragmented varchar(20) NOT NULL
,mb_cache decimal(9,2) NOT NULL)


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Disable IAM order scan, so we know that SQL Server will follow the linked list
--See https://sqlperformance.com/2015/01/t-sql-queries/allocation-order-scans

EXEC sp_configure 'cursor threshold', 1000000
RECONFIGURE


-----------------------------------------------------------------------------------
--Drop and create event session to keep track of execution time
IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'frag_test')
DROP EVENT SESSION frag_test ON SERVER

--Delete XE file, using xp_cmdshell (bad, I know)
EXEC sp_configure 'xp_cmdshell', 1 RECONFIGURE WITH OVERRIDE
EXEC xp_cmdshell 'DEL R:\frag_test*.xel', no_output 
EXEC sp_configure 'xp_cmdshell', 0 RECONFIGURE WITH OVERRIDE

CREATE EVENT SESSION frag_test ON SERVER 
ADD EVENT sqlserver.sp_statement_completed()
ADD TARGET package0.event_file(SET filename=N'R:\frag_test')
WITH (MAX_DISPATCH_LATENCY=2 SECONDS)
GO

ALTER EVENT SESSION frag_test ON SERVER STATE = START


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Setup section

--Grow the data and log files for StackOverflow database.
IF EXISTS(SELECT size * 8/(1024*1024), * FROM sys.database_files WHERE name = N'StackOverflow2010' AND size * 8/(1024*1024) < 20)
	ALTER DATABASE [StackOverflow] MODIFY FILE ( NAME = N'StackOverflow2010', SIZE = 20GB )
IF EXISTS(SELECT size * 8/(1024*1024), * FROM sys.database_files WHERE name = N'StackOverflow2010_log' AND size * 8/(1024*1024) < 10)
	ALTER DATABASE [StackOverflow] MODIFY FILE ( NAME = N'StackOverflow2010_log', SIZE = 10GB )
GO


--Create the table for the narrow index
RAISERROR('Setup section. About to create table with narrow index...', 10, 1) WITH NOWAIT

DROP TABLE IF EXISTS narrow_index

SELECT TOP(1000*1000*20) ROW_NUMBER() OVER( ORDER BY (SELECT NULL)) AS c1, CAST('Hello' AS char(8)) AS c2
INTO narrow_index
FROM sys.columns AS a, sys.columns AS b, sys.columns AS c

CREATE CLUSTERED INDEX x ON narrow_index(c1)

--Modify Posts table, so we condense cl ix and make it non-fragmented. So we (soon) can fragment it.
RAISERROR('Setup section. About to modify Posts table, so we condense cl ix and make it non-fragmented...', 10, 1) WITH NOWAIT
UPDATE Posts SET LastEditorDisplayName = ''
ALTER INDEX PK_Posts__Id ON Posts REBUILD WITH (FILLFACTOR = 100)
GO


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Investigate the data if you want

/*
--wide index
SELECT TOP(100) * FROM Posts
EXEC sp_indexinfo Posts				--Found on my web-site
EXEC sp_help 'Posts'

--narrow index
SELECT TOP(100) * FROM narrow_index
EXEC sp_indexinfo narrow_index		--Found on my web-site
EXEC sp_help 'narrow_index'
*/



-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Create the proc that executes our SQL
RAISERROR('About to create the proc that executes our SQL...', 10, 1) WITH NOWAIT
GO
CREATE OR ALTER PROC run_the_sql
 @fragmented varchar(20)
,@rows_to_read int
,@range_iterations int
AS
DECLARE 
 @sql varchar(1000)
,@sql_condition varchar(1000)
,@bp_wide_mb decimal(9,2)
,@bp_narrow_mb decimal(9,2)
,@range_start int
,@range_iterations_counter int


--Empty cache
CHECKPOINT
DBCC DROPCLEANBUFFERS

--Cold cache
SET @range_iterations_counter = 1
SET @range_start = 1000
WHILE @range_iterations_counter <= @range_iterations
BEGIN
	SET @sql_condition = ' BETWEEN ' + CAST(@range_start AS varchar(10)) + ' AND ' + CAST(@range_start + @rows_to_read AS varchar(10))
	SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(OwnerUserId) AS [cl_ix_scan ' + @fragmented + '] FROM Posts WHERE Id ' + @sql_condition + ')'
	EXEC (@sql)
	SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(c1) AS [nc_ix_scan ' + @fragmented + '] FROM narrow_index WHERE c1 ' + @sql_condition + ')'
	EXEC (@sql)
	SET @range_start = @range_start + 100000
	SET @range_iterations_counter += 1
END

/*
--Warm cache (same as above, don't clear cache first)
SET @range_iterations_counter = 1
SET @range_start = 1000
WHILE @range_iterations_counter <= @range_iterations
BEGIN
	SET @sql_condition = ' BETWEEN ' + CAST(@range_start AS varchar(10)) + ' AND ' + CAST(@range_start + @rows_to_read AS varchar(10))
	SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(OwnerUserId) AS [cl_ix_scan ' + @fragmented + '] FROM Posts WHERE Id ' + @sql_condition + ')'
	EXEC (@sql)
	SET @sql = 'DECLARE @a int SET @a = (SELECT COUNT_BIG(c1) AS [nc_ix_scan ' + @fragmented + '] FROM narrow_index WHERE c1 ' + @sql_condition + ')'
	EXEC (@sql)
	SET @range_start = @range_start + 100000
	SET @range_iterations_counter += 1
END
*/


--Keep track of BP usage
SET @bp_wide_mb = 
(
			SELECT 
			CAST((COUNT(*) * 8.00) / 1024 AS DECIMAL(9,2)) AS MB 
			FROM  sys.allocation_units AS a  
			  JOIN sys.dm_os_buffer_descriptors AS b 
				ON a.allocation_unit_id = b.allocation_unit_id
			  JOIN sys.partitions AS p
				ON a.container_id = p.hobt_id
			WHERE p.object_id = OBJECT_ID('Posts')
			  AND b.database_id = DB_ID()
)

SET @bp_narrow_mb = 
(
			SELECT 
			CAST((COUNT(*) * 8.00) / 1024 AS DECIMAL(9,2)) AS MB 
			FROM  sys.allocation_units AS a  
			  JOIN sys.dm_os_buffer_descriptors AS b 
				ON a.allocation_unit_id = b.allocation_unit_id
			  JOIN sys.partitions AS p
				ON a.container_id = p.hobt_id
			WHERE p.object_id = OBJECT_ID('narrow_index')
			  AND b.database_id = DB_ID()
)

INSERT INTO bp_usage (index_width, fragmented, mb_cache)
VALUES
 ('wide_index', CASE WHEN @fragmented = 'high_frag_level' THEN 'y' ELSE 'n' END, @bp_wide_mb)
,('narrow_index', CASE WHEN @fragmented = 'high_frag_level' THEN 'y' ELSE 'n' END, @bp_narrow_mb)


--Note size of index and frag level, should be comparative between executions
SELECT OBJECT_NAME(object_id), index_type_desc, CAST(avg_fragmentation_in_percent AS decimal(5,1)) AS frag, page_count/1000 AS page_count_1000s
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') 
WHERE index_level = 0 AND alloc_unit_type_desc = 'IN_ROW_DATA' AND OBJECT_NAME(object_id) IN('Posts', 'narrow_index')
ORDER BY index_id
GO


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--1: Cause fragmentation in both indexes

--Fragment wide ix
RAISERROR('About to cause fragmentation in wide index...', 10, 1) WITH NOWAIT
UPDATE Posts SET LastEditorDisplayName = REPLICATE('x', 39)

--Fragment narrow ix
RAISERROR('About to cause fragmentation in narrow index...', 10, 1) WITH NOWAIT
UPDATE narrow_index SET c1 = c1 + 1 WHERE c1 % 100 = 0

--Run the queries
RAISERROR('About to run queries with high frag level...', 10, 1) WITH NOWAIT
EXEC run_the_sql @fragmented = 'high_frag_level', @rows_to_read = 1000, @range_iterations = 10


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--2: no frag in either index, fillfactor set to make same size as when fragmented

RAISERROR('About to eliminate fragmentation in wide index...', 10, 1) WITH NOWAIT
ALTER INDEX PK_Posts__Id ON Posts REBUILD WITH (FILLFACTOR = 60)

RAISERROR('About to eliminate fragmentation in narrow index...', 10, 1) WITH NOWAIT
ALTER INDEX x ON narrow_index REBUILD WITH (FILLFACTOR = 50)

--Run the queries
RAISERROR('About to run queries with low frag level...', 10, 1) WITH NOWAIT
EXEC run_the_sql @fragmented = 'low_frag_level', @rows_to_read = 1000, @range_iterations = 10


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Reset
EXEC sp_configure 'cursor threshold', -1
RECONFIGURE

--Stop trace
ALTER EVENT SESSION frag_test ON SERVER STATE = STOP

-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Work the trace data

--Extract into a temp table
DROP TABLE IF EXISTS myXeData
DROP TABLE IF EXISTS myXeData2
DROP TABLE IF EXISTS myXeData3

SELECT CAST(event_data AS XML) AS StatementData
INTO myXeData
FROM sys.fn_xe_file_target_read_file('R:\frag_test*.xel', NULL, NULL, NULL);

--SELECT * FROM #myXeData;

--Use XQuery to transform XML to a table
WITH t AS(
SELECT 
 StatementData.value('(event/data[@name="duration"]/value)[1]','bigint') AS duration_microsec
,StatementData.value('(event/data[@name="cpu_time"]/value)[1]','bigint') AS cpu_microsec
,StatementData.value('(event/data[@name="physical_reads"]/value)[1]','bigint') AS physical_reads
,StatementData.value('(event/data[@name="logical_reads"]/value)[1]','bigint') AS logical_reads
,StatementData.value('(event/data[@name="statement"]/value)[1]','nvarchar(500)') AS statement_
FROM myXeData AS evts
WHERE StatementData.value('(event/data[@name="statement"]/value)[1]','nvarchar(500)') LIKE '%frag_level%'
),
t2 AS (
SELECT 
 CASE WHEN t.physical_reads = 0 THEN 'warm' ELSE 'cold' END AS cold_or_warm
,CASE WHEN t.statement_ LIKE '%cl_ix_scan_%' THEN 'wide_index' ELSE 'narrow_index' END AS index_width
,CASE WHEN t.statement_ LIKE '%low_frag_level%' THEN 'n' ELSE 'y' END AS fragmented
,duration_microsec
,cpu_microsec
,physical_reads
,logical_reads
FROM t)
SELECT *
INTO myXeData2
FROM t2;




-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Run below manually and investigate the output

--Raw data from the trace
SELECT * FROM myXeData2 ORDER BY cold_or_warm, index_width, fragmented
--Verify pretty consistent values in each quartet. 
--If not, then something special occurred (checkppoint, or something external to SQL) - delete that row.

--Get avg values into a new temp table
SELECT
-- t2.cold_or_warm
 t2.index_width
,t2.fragmented
,AVG(t2.duration_microsec) AS duration_microsec
--,AVG(t2.cpu_microsec) AS cpu_microsec
,AVG(t2.physical_reads) AS physical_reads
--,AVG(t2.logical_reads) AS logical_reads
INTO myXeData3
FROM myXeData2 aS t2
GROUP BY t2.index_width, t2.fragmented
ORDER BY index_width, fragmented

--Check it out
SELECT * FROM myXeData3 ORDER BY index_width, fragmented

--Memory usage
SELECT * FROM bp_usage

--Bottom line, join the two
SELECT p.index_width, p.fragmented, p.duration_microsec, p.physical_reads, b.mb_cache
FROM myXeData3 AS p INNER JOIN bp_usage AS b ON p.index_width = b.index_width AND p.fragmented = b.fragmented
ORDER BY index_width, fragmented


-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
--Cleanup
/*
DROP TABLE IF EXISTS narrow_index
DROP TABLE IF EXISTS myXeData
DROP TABLE IF EXISTS myXeData2
DROP TABLE IF EXISTS bp_usage
*/