Partitions in SQL Server: Partition an Existing Fact Table (by date)
Picking up on my series on partitioning, previously I looked at Partitions in SQL Server: Creating a Partitioned Fact Table (by date) now I’m going to look into splitting an existing table into partitions. A key thing to remember is that the moving of data on disk can be time consuming and cause locks.
The approaches to look at are:
- Partition the table by applying a function and schema to it.
- Create a new table, and shuffle the data into it.
Unfortunately, there is no way (that I’m aware of in present versions of SQL Server) to partition an existing table bit by bit (only shuffle a batch of rows into partitions). Any modification to a partition structure will require all rows to be moved.
Components of a Partitioned Table
- Partition Function: The definition of how data is to be split. It includes the data type, and the value ranges to use in each partition.
- Partition Scheme: The definition of how a partition function is to be applied to data files. This allows DBAs to split data across logical storage locations if required, however in most modern environments with large SANs most SQL Server implementations and their DBAs will just use ‘primary’.
- Table & Index Definition: The last part is the definition of the table, including the partition scheme to use, and the index with which to apply it.
The Fact Table
Sample Table
I’ll create a table using some made up data:
-------------------------------------------------------------------------------- -- Create Date List -------------------------------------------------------------------------------- DECLARE @DateStart DATE = '01-JAN-17'; DECLARE @DateEnd DATE = GetDate(); WITH number_tbl AS ( SELECT 0 AS a UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 ), number_sequences AS ( SELECT (t3.a + (10 * t2.a) + (100 * t1.a)) AS concatenate_calculation FROM number_tbl t1 CROSS JOIN number_tbl t2 CROSS JOIN number_tbl t3 ), date_sequence AS ( SELECT CAST(DateAdd(DAY, concatenate_calculation, @DateStart) AS DATE) AS [TheDate] FROM number_sequences WHERE CAST(DateAdd(DAY, concatenate_calculation, @DateStart) AS DATE) <= @DateEnd ) SELECT [TheDate] INTO #DateList FROM date_sequence ORDER BY [TheDate] ; CREATE UNIQUE CLUSTERED INDEX PK_DateList ON #DateList ([TheDate]) ; -------------------------------------------------------------------------------- -- Create Dummy Fact table (we will keep coming back to this) -- Using cartesian join, giving 4 rows for each date -------------------------------------------------------------------------------- WITH s_data AS ( SELECT 'Data 1' AS [c_attribute], 100 AS [c_metric] UNION ALL SELECT 'Data 2' AS [c_attribute], 200 AS [c_metric] UNION ALL SELECT 'Data 3' AS [c_attribute], 300 AS [c_metric] UNION ALL SELECT 'Data 4' AS [c_attribute], 400 AS [c_metric] ) SELECT [TheDate], [c_attribute], [c_metric] INTO #FactTable FROM #DateList, s_data ; CREATE CLUSTERED INDEX SK_DateList ON #FactTable ([TheDate]) ;
Now I’ll combine the sample data (multiple times) into a fact table:
-------------------------------------------------------------------------------- -- Create Fact Table (without partitions) -------------------------------------------------------------------------------- CREATE TABLE [dbo].[FactTable_unpartitioned] ( [Date] DATE, [attribute] VARCHAR(100), [metric] BIGINT ); CREATE CLUSTERED INDEX [PK_FactTable_unpartitioned] ON [dbo].[FactTable_unpartitioned] ([Date], [c_attribute]) ; INSERT INTO [dbo].[FactTable_unpartitioned] SELECT * FROM #FactTable ; -- Create more rows DECLARE @START_INT BIGINT = 6; DECLARE @END_INT BIGINT = 0; WHILE @START_INT > @END_INT BEGIN INSERT INTO [dbo].[FactTable_unpartitioned] SELECT * FROM [dbo].[FactTable_unpartitioned] ; SET @START_INT = @START_INT - 1; END;
Creating The Partitioned Table
Partition Function
For this dataset, I’m going to add a new partition for each month (functions can be for any time frame you chose):
-------------------------------------------------------------------------------- -- Partition Function -- Using RIGHT (so the value given is the LOWER value) -------------------------------------------------------------------------------- CREATE PARTITION FUNCTION [FactTable_PF] (DATE) -- Datatype to use AS RANGE RIGHT -- See previous post FOR VALUES -- Values on which to 'split' the data ( N'2017-01-01T00:00:00', -- This will contain all data for 2017 Jan onwards (until the next partition). All data before then goes into a 'catch all' partition N'2018-01-01T00:00:00', -- This partition will contain all data from Jan 2018 onwards (until the next partition). N'2019-01-01T00:00:00' -- This partition will contain all data from Feb 2018 onwards (unless a subsequent partition is added in the future) );
Partition Scheme
Next, we create the scheme to tell SQL Server how to organise the data on the disk(s):
-------------------------------------------------------------------------------- -- Partition Scheme -------------------------------------------------------------------------------- CREATE PARTITION SCHEME [FactTable_PS] AS PARTITION [FactTable_PF] ALL TO ([primary]) ;
Partitioning by Applying a Function & Schema
A table ban have a partition scheme applied to it, by applying a CLUSTERED index:
-------------------------------------------------------------------------------- -- Apply scheme using CLUSTERED INDEX -------------------------------------------------------------------------------- CREATE CLUSTERED INDEX [CIDX_FactTable] ON [dbo].[FactTable_2] ([Date]) ON [FactTable_PS] ([Date]) ;
Note, in our example the above will fail. There is already a clustered index on the table (as there probably should be)! In this instance, that index has to be dropped and a new one created.
If we are working with a very large table,this can take a lot of time. However, we can keep data available using: ONLINE = ON
DROP INDEX [PK_FactTable_unpartitioned] ON [dbo].[FactTable_unpartitioned] ; CREATE CLUSTERED INDEX DK_FactTable_unpartitioned ON [dbo].[FactTable_unpartitioned] ([Date]) WITH (ONLINE = ON ) ON [FactTable_PS] ([Date]) ;
Partitioning by Copying to a New Table
This might be preferable for very large tables in busy environments… We will have two tables, one with the original data and one new partitioned table. The movement is best to be done in batches, this saves redo log and other overheads on the server during the migration. The migration can be done using a date loop with a statement such as:
-- In a loop INSERT INTO [target_table] SELECT * FROM [source_table] WHERE [partition_column] = @Date ;
As this will take a long time (in some cases, many hours) and processes/users will potentially still access the table. The point at which the tables are “switched over” is a key decision to make! There are benefits and dangers to switching (renaming) the tables before or after moving the data.
- Before:
- Advantages: Any process which adds data will automatically add it to the new table.
- Disadvantages: Specific update statements could be missed. Processes which use the table will not have access to the full dataset until it’s copied.
- After:
- Advantages: All processes reading and writing data from the table will continue uninterrupted.
- Disadvantages: Processes will have to be managed so that all data changes made during the data copy process are also copied (even if an update is made to data for a date which has already been copied).
This is a good task for a weekend!