ORDER for columnstore index

In SQL server 2022, we finally have an ORDER option when we create a columnstore index. But what does it really do?

Some basics first:

If you are familiar with rowgroups, segments and segment elimination, then you can skip this section.

Data for a columnstore index is divded in groups of approximate 1 million rows, rowgroups. Each rowgroup has a set of pages for each column. The set of pages for a column in a rowgroup is called a segment. SQL Server has meta-data for the lowest and highest value for a segment. There are no SEEKs in a columnstore index. But, SQL Server can use this meta-data to skip reading segments, with the knowledge that “this segment cannot contain any data that I need based on my predicates in my WHERE clause”.

Also, you might want to do these operations using MAXDOP 1, so we don’t have several threads muddling our neat segment alignment.

Making sure the data is sorted when building the index

Say that you want the rows with the earliest OrderDates in the first rowgroup, then the next 1 million rows (based on OrderDate) in the next rowgroup… and finally the most recent rows in the last rowgroup. We can do that without ORDER by “happening” to have a row-index on the same set of columns sorted in the way that we want the columnstore index to be sorted when building the index. I.e., SQL Server takes the path of least resistance and uses that row index when building the columnstore index. You achieve this by having a rowstore index with matching column key and re-create it as a columnstore index using DROP_EXISTING.

The new ORDER specification in SQL Server 2022 for the clustered columnstore index takes care of this for you. Nice.

Making sure the data is sorted when adding more data

The ORDER you specified for the clustered columnstore index is reflected in sys.index_columns, the column_store_order_ordinal column. This is used by the engine when you load data. Data will be sorted and loaded in that order.

This is fine if you add data aligned as how you have the ORDER specified. Say you have it on OrderDate and as you add new orders, you get new rowgroups for the new orders based on new data coming in will have increasing OrderDate.

This is a common situation, but what if the ORDER for the index doesn’t match with how the new data arrives? Then over time, you have added rowgroups containing data all over the place, including the segment(s) for the column(s) that you specified for the ORDER clause.

Making sure the data is sorted when rebuilding the index

This is what surprised me. I was expecting the column_store_order_ordinal column in sys.index_columns to be respected when I do ALTER INDEX REBUILD. But it isn’t. Nor is it respected if I do CREATE INDEX … WITH DROP_EXISTING.

So, bottom line is that we still have to use DROP_EXISTING to convert the table to a rowstore table, having a clustered index key that will match our ORDER. And then convert it again, to a columnstore table using that ORDER definition. Just like we are doing today.

Above is what my tests indicates. Let me know if I missed something obvious.

[Update 1] Paul White pointed out that this has been brought up before and a sort is performed for rebuild, but it is a soft sort which doesn’t sort the full set of data. Check out these for more info:
https://www.erikdarlingdata.com/fixing-ordered-column-store-sorting-in-sql-server-2022/
https://www.brentozar.com/archive/2022/07/columnstore-indexes-are-finally-sorted-in-sql-server-2022/
https://sqlperformance.com/2015/04/sql-plan/internals-of-the-seven-sql-server-sorts-part-1

Here is the T-SQL I ran, requires the AdventureworksDW database.

USE AdventureworksDW
GO

--Supporting proc
CREATE OR ALTER PROC GetSegmentAlignment 
@tablename sysname
AS
SELECT
 COL_NAME(ic.object_id, ic.column_id) as ColumnName
,s.segment_id
,s.min_data_id
,s.max_data_id
,s.row_count
,s.on_disk_size
FROM sys.column_store_segments AS s
INNER JOIN sys.partitions AS p ON p.hobt_id = s.hobt_id
INNER JOIN sys.indexes AS i ON   i.object_id = p.object_id AND  i.index_id = p.index_id
LEFT  JOIN sys.index_columns AS ic ON ic.object_id = i.object_id AND  ic.index_id = i.index_id AND  ic.index_column_id = s.column_id
WHERE OBJECT_NAME(p.object_id) = @tablename
AND ic.column_id IS NOT NULL
ORDER BY s.column_id, p.partition_number, s.segment_id
GO


DROP TABLE IF EXISTS NewFact
GO

--Create heap using SELECT INTO
SELECT * INTO NewFact 
FROM FactResellerSalesXL_CCI
WHERE UnitPrice < 400

--Create the ORDERed clustered columnstore index
CREATE CLUSTERED COLUMNSTORE INDEX NewFact ON NewFact ORDER(OrderDateKey) WITH(MAXDOP = 1)

--Verify index is ORDERed
SELECT i.name, i.type_desc, c.column_id, COL_NAME(c.object_id, c.column_id) as ColumnName, c.column_store_order_ordinal 
FROM sys.indexes AS i
INNER JOIN sys.index_columns AS c ON i.index_id = c.index_id AND i.object_id = c.object_id
WHERE i.object_id = OBJECT_ID('NewFact')

--Check segment alignment, nicely aligned on OrderDateKey
EXEC GetSegmentAlignment 'NewFact'

--Load the rest of the data
INSERT INTO NewFact
SELECT * 
FROM FactResellerSalesXL 
WHERE UnitPrice >= 400

EXEC GetSegmentAlignment 'NewFact'
--The whole table is not re-aligned, but *new* data is aligned based on indexe's ORDER

--Try any of below

ALTER INDEX NewFact ON NewFact REBUILD WITH (MAXDOP = 1)
EXEC GetSegmentAlignment 'NewFact'
--Segments are not aligned

CREATE CLUSTERED COLUMNSTORE INDEX NewFact ON NewFact ORDER(OrderDateKey) WITH(DROP_EXISTING = ON, MAXDOP = 1)
EXEC GetSegmentAlignment 'NewFact'
--Segments are not aligned