In SQL server 2022, we finally have an ORDER option when we create a columnstore index. But what does it really do?
Some basics first:
If you are familiar with rowgroups, segments and segment elimination, then you can skip this section.
Data for a columnstore index is divded in groups of approximate 1 million rows, rowgroups. Each rowgroup has a set of pages for each column. The set of pages for a column in a rowgroup is called a segment. SQL Server has meta-data for the lowest and highest value for a segment. There are no SEEKs in a columnstore index. But, SQL Server can use this meta-data to skip reading segments, with the knowledge that “this segment cannot contain any data that I need based on my predicates in my WHERE clause”.
Also, you might want to do these operations using MAXDOP 1, so we don’t have several threads muddling our neat segment alignment.
Making sure the data is sorted when building the index
Say that you want the rows with the earliest OrderDates in the first rowgroup, then the next 1 million rows (based on OrderDate) in the next rowgroup… and finally the most recent rows in the last rowgroup. We can do that without ORDER by “happening” to have a row-index on the same set of columns sorted in the way that we want the columnstore index to be sorted when building the index. I.e., SQL Server takes the path of least resistance and uses that row index when building the columnstore index. You achieve this by having a rowstore index with matching column key and re-create it as a columnstore index using DROP_EXISTING.
The new ORDER specification in SQL Server 2022 for the clustered columnstore index takes care of this for you. Nice.
Making sure the data is sorted when adding more data
The ORDER you specified for the clustered columnstore index is reflected in sys.index_columns, the column_store_order_ordinal column. This is used by the engine when you load data. Data will be sorted and loaded in that order.
This is fine if you add data aligned as how you have the ORDER specified. Say you have it on OrderDate and as you add new orders, you get new rowgroups for the new orders based on new data coming in will have increasing OrderDate.
This is a common situation, but what if the ORDER for the index doesn’t match with how the new data arrives? Then over time, you have added rowgroups containing data all over the place, including the segment(s) for the column(s) that you specified for the ORDER clause.
Making sure the data is sorted when rebuilding the index
This is what surprised me. I was expecting the column_store_order_ordinal column in sys.index_columns to be respected when I do ALTER INDEX REBUILD. But it isn’t. Nor is it respected if I do CREATE INDEX … WITH DROP_EXISTING.
So, bottom line is that we still have to use DROP_EXISTING to convert the table to a rowstore table, having a clustered index key that will match our ORDER. And then convert it again, to a columnstore table using that ORDER definition. Just like we are doing today.
Above is what my tests indicates. Let me know if I missed something obvious.
[Update 1] Paul White pointed out that this has been brought up before and a sort is performed for rebuild, but it is a soft sort which doesn’t sort the full set of data. Check out these for more info:
https://www.erikdarlingdata.com/fixing-ordered-column-store-sorting-in-sql-server-2022/
https://www.brentozar.com/archive/2022/07/columnstore-indexes-are-finally-sorted-in-sql-server-2022/
https://sqlperformance.com/2015/04/sql-plan/internals-of-the-seven-sql-server-sorts-part-1
Here is the T-SQL I ran, requires the AdventureworksDW database.
USE AdventureworksDW GO --Supporting proc CREATE OR ALTER PROC GetSegmentAlignment @tablename sysname AS SELECT COL_NAME(ic.object_id, ic.column_id) as ColumnName ,s.segment_id ,s.min_data_id ,s.max_data_id ,s.row_count ,s.on_disk_size FROM sys.column_store_segments AS s INNER JOIN sys.partitions AS p ON p.hobt_id = s.hobt_id INNER JOIN sys.indexes AS i ON i.object_id = p.object_id AND i.index_id = p.index_id LEFT JOIN sys.index_columns AS ic ON ic.object_id = i.object_id AND ic.index_id = i.index_id AND ic.index_column_id = s.column_id WHERE OBJECT_NAME(p.object_id) = @tablename AND ic.column_id IS NOT NULL ORDER BY s.column_id, p.partition_number, s.segment_id GO DROP TABLE IF EXISTS NewFact GO --Create heap using SELECT INTO SELECT * INTO NewFact FROM FactResellerSalesXL_CCI WHERE UnitPrice < 400 --Create the ORDERed clustered columnstore index CREATE CLUSTERED COLUMNSTORE INDEX NewFact ON NewFact ORDER(OrderDateKey) WITH(MAXDOP = 1) --Verify index is ORDERed SELECT i.name, i.type_desc, c.column_id, COL_NAME(c.object_id, c.column_id) as ColumnName, c.column_store_order_ordinal FROM sys.indexes AS i INNER JOIN sys.index_columns AS c ON i.index_id = c.index_id AND i.object_id = c.object_id WHERE i.object_id = OBJECT_ID('NewFact') --Check segment alignment, nicely aligned on OrderDateKey EXEC GetSegmentAlignment 'NewFact' --Load the rest of the data INSERT INTO NewFact SELECT * FROM FactResellerSalesXL WHERE UnitPrice >= 400 EXEC GetSegmentAlignment 'NewFact' --The whole table is not re-aligned, but *new* data is aligned based on indexe's ORDER --Try any of below ALTER INDEX NewFact ON NewFact REBUILD WITH (MAXDOP = 1) EXEC GetSegmentAlignment 'NewFact' --Segments are not aligned CREATE CLUSTERED COLUMNSTORE INDEX NewFact ON NewFact ORDER(OrderDateKey) WITH(DROP_EXISTING = ON, MAXDOP = 1) EXEC GetSegmentAlignment 'NewFact' --Segments are not aligned