Partitions in SQL Server: Partition an Existing Fact Table (by date)

2019-04-14

Picking up on my series on partitioning, previously I looked at Partitions in SQL Server: Creating a Partitioned Fact Table (by date) now I’m going to look into splitting an existing table into partitions. A key thing to remember is that the moving of data on disk can be time consuming and cause locks.

The approaches to look at are:

Partition the table by applying a function and schema to it.
Create a new table, and shuffle the data into it.

Unfortunately, there is no way (that I’m aware of in present versions of SQL Server) to partition an existing table bit by bit (only shuffle a batch of rows into partitions). Any modification to a partition structure will require all rows to be moved.

Components of a Partitioned Table

Partition Function: The definition of how data is to be split. It includes the data type, and the value ranges to use in each partition.
Partition Scheme: The definition of how a partition function is to be applied to data files. This allows DBAs to split data across logical storage locations if required, however in most modern environments with large SANs most SQL Server implementations and their DBAs will just use ‘primary’.
Table & Index Definition: The last part is the definition of the table, including the partition scheme to use, and the index with which to apply it.

The Fact Table

Sample Table

I’ll create a table using some made up data:

--------------------------------------------------------------------------------
-- Create Date List
--------------------------------------------------------------------------------
DECLARE @DateStart DATE = '01-JAN-17';
DECLARE @DateEnd   DATE = GetDate();
WITH number_tbl AS (
SELECT 0 AS a UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
SELECT 4      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL
SELECT 8      UNION ALL SELECT 9
), number_sequences AS (
SELECT (t3.a + (10 * t2.a) + (100 * t1.a)) AS concatenate_calculation
  FROM number_tbl t1
  CROSS JOIN number_tbl t2
  CROSS JOIN number_tbl t3
), date_sequence AS (
SELECT CAST(DateAdd(DAY, concatenate_calculation, @DateStart) AS DATE) AS [TheDate]
  FROM number_sequences
 WHERE CAST(DateAdd(DAY, concatenate_calculation, @DateStart) AS DATE) <= @DateEnd
)
SELECT [TheDate]
  INTO #DateList
  FROM date_sequence
 ORDER BY [TheDate]
;
CREATE UNIQUE CLUSTERED
 INDEX PK_DateList
    ON #DateList
      ([TheDate])
;
--------------------------------------------------------------------------------
-- Create Dummy Fact table (we will keep coming back to this)
-- Using cartesian join, giving 4 rows for each date
--------------------------------------------------------------------------------
WITH s_data AS (
SELECT 'Data 1' AS [c_attribute], 100 AS [c_metric] UNION ALL
SELECT 'Data 2' AS [c_attribute], 200 AS [c_metric] UNION ALL
SELECT 'Data 3' AS [c_attribute], 300 AS [c_metric] UNION ALL
SELECT 'Data 4' AS [c_attribute], 400 AS [c_metric]
)
SELECT [TheDate],
       [c_attribute],
       [c_metric]
  INTO #FactTable
  FROM #DateList,
       s_data
;
CREATE CLUSTERED
 INDEX SK_DateList
    ON #FactTable
      ([TheDate])
;

Now I’ll combine the sample data (multiple times) into a fact table:

--------------------------------------------------------------------------------
-- Create Fact Table (without partitions)
--------------------------------------------------------------------------------
CREATE TABLE [dbo].[FactTable_unpartitioned]
(
    [Date]      DATE,
    [attribute] VARCHAR(100),
    [metric]    BIGINT
);

CREATE CLUSTERED
INDEX [PK_FactTable_unpartitioned]
    ON [dbo].[FactTable_unpartitioned]
      ([Date], [c_attribute])
;

INSERT
  INTO [dbo].[FactTable_unpartitioned]
SELECT *
  FROM #FactTable
;

-- Create more rows
DECLARE @START_INT BIGINT = 6;
DECLARE @END_INT   BIGINT = 0;

WHILE @START_INT > @END_INT
BEGIN
    INSERT
      INTO [dbo].[FactTable_unpartitioned]
    SELECT *
      FROM [dbo].[FactTable_unpartitioned]
    ;
    SET @START_INT = @START_INT - 1;
END;

Creating The Partitioned Table

Partition Function

For this dataset, I’m going to add a new partition for each month (functions can be for any time frame you chose):

--------------------------------------------------------------------------------
-- Partition Function
-- Using RIGHT (so the value given is the LOWER value)
--------------------------------------------------------------------------------
CREATE PARTITION FUNCTION [FactTable_PF]
      (DATE) -- Datatype to use
AS
RANGE RIGHT  -- See previous post
 FOR VALUES  -- Values on which to 'split' the data
(
    N'2017-01-01T00:00:00', -- This will contain all data for 2017 Jan onwards (until the next partition). All data before then goes into a 'catch all' partition
    N'2018-01-01T00:00:00', -- This partition will contain all data from Jan 2018 onwards (until the next partition).
    N'2019-01-01T00:00:00'  -- This partition will contain all data from Feb 2018 onwards (unless a subsequent partition is added in the future)
);

Partition Scheme

Next, we create the scheme to tell SQL Server how to organise the data on the disk(s):

--------------------------------------------------------------------------------
-- Partition Scheme
--------------------------------------------------------------------------------
CREATE PARTITION SCHEME [FactTable_PS]
    AS PARTITION [FactTable_PF]
ALL TO ([primary])
;

Partitioning by Applying a Function & Schema

A table ban have a partition scheme applied to it, by applying a CLUSTERED index:

--------------------------------------------------------------------------------
-- Apply scheme using CLUSTERED INDEX
--------------------------------------------------------------------------------
CREATE CLUSTERED
 INDEX [CIDX_FactTable]
    ON [dbo].[FactTable_2]
      ([Date])
    ON [FactTable_PS] ([Date])
;

Note, in our example the above will fail. There is already a clustered index on the table (as there probably should be)! In this instance, that index has to be dropped and a new one created.

If we are working with a very large table,this can take a lot of time. However, we can keep data available using: ONLINE = ON

 DROP
INDEX [PK_FactTable_unpartitioned]
   ON [dbo].[FactTable_unpartitioned]
;

CREATE CLUSTERED
INDEX DK_FactTable_unpartitioned
    ON [dbo].[FactTable_unpartitioned]
      ([Date])
WITH (ONLINE = ON
      )
    ON [FactTable_PS] ([Date])
;

Partitioning by Copying to a New Table

This might be preferable for very large tables in busy environments… We will have two tables, one with the original data and one new partitioned table. The movement is best to be done in batches, this saves redo log and other overheads on the server during the migration. The migration can be done using a date loop with a statement such as:

-- In a loop
INSERT
  INTO [target_table]
SELECT *
  FROM [source_table]
 WHERE [partition_column] = @Date
;

As this will take a long time (in some cases, many hours) and processes/users will potentially still access the table. The point at which the tables are “switched over” is a key decision to make! There are benefits and dangers to switching (renaming) the tables before or after moving the data.

Before:
1. Advantages: Any process which adds data will automatically add it to the new table.
2. Disadvantages: Specific update statements could be missed. Processes which use the table will not have access to the full dataset until it’s copied.
After:
1. Advantages: All processes reading and writing data from the table will continue uninterrupted.
2. Disadvantages: Processes will have to be managed so that all data changes made during the data copy process are also copied (even if an update is made to data for a date which has already been copied).

This is a good task for a weekend!