Identity column vs ETL Process

Creating ETL process can pose so many challenges we will only fully discover them when really building the ETL.

Before you complain about modelling, the example here has no relation to data warehouse modelling. The ETL in question was a migration from a data source to another. The need for executing multiple times was mainly for a testing purpose.

Let’s get back a bit and understand what I’m talking about.

Imagine you are building an ETL. Importing a lot of data to a set of tables and implementing many rules inside the ETL. Your ETL will be executed many times, but the tables need to be always empty when the process starts. As a result your ETL needs a cleaning task, in the beginning, clearing all the tables.

Some tables have an identity field. Since you are clearing the tables and ensuring everything will start again from the same point, you need to reset the identity as well.

Probably your first thought is to use Truncate Table statement. The problem is Truncate Table is not supported in tables referenced by a foreign key.

Check the example below: The Truncate Table is blocked by the foreign key.

CREATE TABLE class
  (
     id    INT IDENTITY(1, 1) PRIMARY KEY,
     title VARCHAR(50)
  )
go

CREATE TABLE students
  (
     id      INT IDENTITY(1, 1) PRIMARY KEY,
     NAME    VARCHAR(50),
     classid INT,
     FOREIGN KEY (classid) REFERENCES class(id)
  )
go

TRUNCATE TABLE class 

You need to use Delete or drop and re-create the constraints. Choosing the simplest solution, let’s talk about using the Delete.

Let’s consider the scenario where your ETL will need to run many times and need to start with a cleaning task. This scenario, sometimes, is used for staging servers in the middle of a bigger ETL process. I used for data migration and had to make the execution many times during development, always returning to the start point before executing again.

That’s why you need a cleaning process, deleting all the records on the tables, but that’s not enough. The tables have identity fields and you also need to reset the identity seed before executing again the ETL.

Reseting the Identity Seed: Regular behaviour

Reset an identity seed seems simple:

DBCC checkident(‘testTable’, reseed, 1) 

The problem is the combination of Delete and the DBCC CheckIdent. This combination can lead to very strange results. Let’s see what can happen below:

The first execution will be in an empty table, never used before.

CREATE TABLE testtable
  (
     id    INT IDENTITY(1, 1) PRIMARY KEY,
     price NUMERIC(15, 2)
  )
go

DELETE testtable

DBCC checkident(‘testTable’, reseed, 1)

INSERT INTO testtable
VALUES      (10)

SELECT *
FROM   testtable 

The following executions will be in a table already used, having passed by a delete and dbcc checkident. It’s the same code executed again:

 

DELETE testtable

DBCC checkident(‘testTable’, reseed, 1)

INSERT INTO testtable
VALUES      (10)

SELECT *
FROM   testtable 

As you may notice, even making the dbcc checkident using the same seed value, the initial identity value in the table becomes different, what can cause all sort of problems for your software.

The root cause of the problem is one statement, DBCC CheckIdent, behaving in different ways according to how the table is at the moment. Our cleaning script needs a consistent result and it’s not achieving this.

 

Investigating the root cause

We will need to search deeper in SQL Server to discover why the checkident has these differences and where is SQL Server storing the current number for the Identity column of a table.

The DMV sys.identity_columns has a column called LAST_VALUE which contains the last value used for the identity column. Let’s analyse the value this column stores during the execution of the script.

DROP TABLE testtable
go

CREATE TABLE testtable
  (
     id    INT IDENTITY(1, 1) PRIMARY KEY,
     price NUMERIC(15, 2)
  )
go

DELETE testtable

SELECT object_id,
       NAME,
       column_id,
       last_value
FROM   sys.identity_columns
WHERE  Object_name(object_id) = ‘testTable’ 

Just after have been created, the Last_Value column contains NULL, even after the delete statement has been executed.

 

DBCC checkident(‘testTable’, reseed, 1)

SELECT object_id,
       NAME,
       column_id,
       last_value
FROM   sys.identity_columns
WHERE  Object_name(object_id) = ‘testTable’ 

 

Even after the execution of the checkident, the Last_Value column remains with NULL value.

Let’s insert a record:

INSERT INTO testtable
VALUES      (10)
go 2

SELECT *
FROM   testtable 

 

The Null value on Last_Value results in a first identity as 1, which is the exact value set as a seed for the identity.

Let’s repeat again:

DELETE testtable

SELECT object_id,
       NAME,
       column_id,
       last_value
FROM   sys.identity_columns
WHERE  Object_name(object_id) = ‘testTable’ 

 

After the delete, the last_value is 2, the last identity used. We need the checkident:

 

DBCC checkident(‘testTable’, reseed, 1)

SELECT object_id,
       NAME,
       column_id,
       last_value
FROM   sys.identity_columns
WHERE  Object_name(object_id) = ‘testTable’ 

Now we can easily see the difference: When the table is empty the last_value is NULL and it continues to be NULL after the checkident. However, when the table already had some records included, the last_value is filled and the checkident will reseed it to the value we are providing. So, it’s a difference between last_value field being NULL or the seed value.

Solving the problem

Knowing these details, we can create a query to calculate the next identity value for a table. If the last_value is NULL the next value will be the seed value, if last_value has a value, the next one will be the last_value + the increment

 

SELECT Isnull(CONVERT(INT, last_value)
              + CONVERT(INT, increment_value), CONVERT(INT, seed_value))
FROM   sys.identity_columns
WHERE  Object_name(object_id) = ‘testTable’ 

 

The Converts are needed because on this DMV the columns are of type sql_variant, so we need to convert them to an integer.

Knowing how to find the correct value we can now fix our cleaning code in order to always have the same result, with the identity starting in 1.

 

DECLARE @lastValue INT

DELETE testtable

SELECT @lastValue = last_value
FROM   sys.identity_columns
WHERE  Object_name(object_id) = ‘testTable’

IF ( @lastValues IS NULL )
  DBCC checkident(‘testTable’, reseed, 1)
ELSE
  DBCC checkident(‘testTable’, reseed, 0)