Empty Thoughts: Working with NULL

Whether or not to have NULLable columns in a table can be a religious debate, and how missing data is represented should be carefully considered during database design. In this article, Joe Celko considers the ways that SQL Server handles NULLs in several situations.

One of the hardest concepts in learning SQL is the meaning of a NULL. Traditionally, programming languages had no concept of missing or UNKNOWN data. The closest example that most programmers ran into was the ‘not applicable’ flags in spreadsheets, or the classic TBD flags for undetermined instructors, locations, or other things in printouts.

Dr. Codd defined a NULL as the lack of a value, so talking about NULL values is wrong. In SQL, however, we have to worry about physical storage. This means we have to know the data type of the column that is holding the NULL so the compiler can do its job. From that requirement, it logically follows that we can write CAST(NULL AS <data type>) and not just depend on automatic type conversions.
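As a minimal sketch in T-SQL (the column aliases are only illustrative), a typed NULL looks like this:

SELECT CAST(NULL AS INTEGER) AS qty_missing,
       CAST(NULL AS VARCHAR(10)) AS city_missing;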

SQL is notorious for its three-valued logic {TRUE, FALSE, UNKNOWN}, which results from trying to compare a NULL to something, including another NULL. Since it is not a value, NULL <> NULL is UNKNOWN, but likewise so is NULL = NULL! This is why we have the predicate <expression> IS [NOT] NULL to check for a NULL. This predicate and the EXISTS() function are among the few predicates in SQL that can only return {TRUE, FALSE}.
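A minimal sketch of that logic: comparing NULL to anything yields UNKNOWN, so the first two queries return no rows; only the IS NULL predicate is TRUE.

SELECT 'equal' AS test_result WHERE NULL = NULL;       -- no row returned
SELECT 'not equal' AS test_result WHERE NULL <> NULL;  -- no row returned
SELECT 'is null' AS test_result WHERE NULL IS NULL;    -- returns one row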

NULLs in DDL

SQL defines a PRIMARY KEY(<column list>) as being implicitly declared NOT NULL. It is probably a good idea to go ahead and put the NOT NULL in your table declarations anyway; if the PRIMARY KEY constraint changes, then you are still safe. According to Dr. Codd and relational theory, a real table must have a key. Originally, Dr. Codd said a PRIMARY KEY had to be designated, but he later realized that a key is a key, so there is no need to designate something special about one of them. This was another leftover from trying to implement relational systems on top of old file systems. In file systems, records (which are nothing like rows) come in a linear search sequence in physical storage, so tapes had to be sorted. The original sort key became the PRIMARY KEY in the new SQL products. Random access on unsorted magnetic tapes is technically possible, but it really doesn’t work.
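A sketch of that advice, with illustrative names:

CREATE TABLE Personnel
(emp_id   INTEGER     NOT NULL PRIMARY KEY,
 emp_name VARCHAR(35) NOT NULL);

If the PRIMARY KEY constraint is ever dropped or moved to another column, emp_id still cannot hold a NULL.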

But the relational model accommodates multiple keys in the same table. The syntax we picked for non-primary keys was a little strange. We added the UNIQUE(<column list>) constraint, which guarantees that no two rows in the table can agree on all the columns in the list. Like PRIMARY KEY, it allows more than one column, but the columns involved can have NULLs. Remember that in the relational model a key can’t have NULLs, but in SQL Server a UNIQUE constraint allows you one NULL-ed row, as if NULL were a value.
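A sketch of that SQL Server behavior, with an illustrative table:

CREATE TABLE Badges
(badge_nbr INTEGER NULL UNIQUE);

INSERT INTO Badges VALUES (1), (2);
INSERT INTO Badges VALUES (NULL);  -- accepted
INSERT INTO Badges VALUES (NULL);  -- rejected as a duplicate 'value'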

All of these strange rules come from the GROUP BY clause. Without lapsing into a college algebra lesson, we have two equivalence relations, as they are called in set theory. The first is just regular old equals (=) with the extra rules about NULLs. The second relation is the GROUP BY, in which all the NULLs are put into one equivalence class.
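Here is a quick sketch of that second relation, using illustrative inline data:

SELECT color, COUNT(*) AS color_cnt
  FROM (VALUES ('red'), ('red'), (NULL), (NULL)) AS Swatches(color)
 GROUP BY color;
-- two groups come back: ('red', 2) and (NULL, 2)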

However, in the DDL, the UNKNOWN result of the search condition in a CHECK(<search condition>) constraint is given the ‘benefit of the doubt’ and treated the same as TRUE.

NULLs in DML

In the DML statements, however, an UNKNOWN result in a search condition is treated the same as a FALSE, so queries, updates, and inserts apply a stronger criterion than the CHECK() constraints do.
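The original code listing is not reproduced here, so what follows is only a sketch of the kind of example being described, with illustrative table and column names:

CREATE TABLE Tests
(test_name  CHAR(5) NOT NULL PRIMARY KEY,
 test_score INTEGER CHECK (test_score + 1 + 1 < 10));

INSERT INTO Tests VALUES ('test1', 5);    -- CHECK is TRUE, row inserted
INSERT INTO Tests VALUES ('test2', 12);   -- CHECK is FALSE, insert fails
INSERT INTO Tests VALUES ('test3', NULL); -- CHECK is UNKNOWN, benefit of the doubt, row inserted

SELECT test_name
  FROM Tests
 WHERE test_score + 1 + 1 < 10;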

The SELECT returns only test1 as a result. The test2 row failed to insert because its CHECK() constraint was FALSE. The test3 row was inserted, but in the query its search condition reduces step by step to UNKNOWN:

WHERE (NULL + 1 + 1) < 10
WHERE (NULL + 2) < 10
WHERE NULL < 10
WHERE UNKNOWN, which is treated as WHERE FALSE

Replacing NULLs with actual values is very often handy. The SQL Server/Sybase family originally had the function ISNULL(<expression>, <replacement value>). If the <expression> is NULL, then it returns the <replacement value>. The data type of the result is taken from the first parameter, and that particular choice can make for some funny results when the two parameters are not of the same data type.

Microsoft now has the ANSI/ISO Standard COALESCE(<expression list>). The first thing to notice is that COALESCE takes a list of expressions. The list is parsed left to right to determine the highest data type in the list, which becomes the data type of the result; the list is then parsed again from left to right to find the first non-NULL value, which is returned and cast to that result data type. One common mistake beginners make with this is to include an expression of a higher data type than they really meant, which silently changes the data type of the result.
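A small sketch of the difference (the variable is illustrative): ISNULL() takes its result type from the first parameter, so the replacement string is truncated, while COALESCE() works out the result type from the whole list.

DECLARE @town CHAR(3) = NULL;
SELECT ISNULL(@town, 'Austin');    -- 'Aus'    : result type is CHAR(3), from the first parameter
SELECT COALESCE(@town, 'Austin');  -- 'Austin' : result type comes from the highest type in the list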

I’m often asked why we chose the word “coalesce” instead of something else. We were trying to come up with the word that would be descriptive but not so common that it might be misunderstood or used as a column name. Phil Shaw of IBM pulled out a pocket thesaurus and started going down synonyms until he came up with this.

Grouping NULLs

A common way to simulate the grouping relation is to use COALESCE() to replace NULLs with a ‘weird value’ that cannot occur in the real data, so that two expressions compare as equal when they are both NULL or their values match:
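Something along these lines, where the tables, columns, and sentinel value are all illustrative:

SELECT A.part_nbr, B.part_nbr
  FROM Alpha AS A
  INNER JOIN Beta AS B
    ON COALESCE(A.lot_code, '{{none}}') = COALESCE(B.lot_code, '{{none}}');
-- matches when both lot codes are NULL or when their values are equal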

What if there is no ‘weird value’ you can use for the COALESCE check? You just need to add more search conditions:
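For example, a sketch with the same illustrative names as before:

SELECT A.part_nbr, B.part_nbr
  FROM Alpha AS A
  INNER JOIN Beta AS B
    ON (A.lot_code = B.lot_code)
    OR (A.lot_code IS NULL AND B.lot_code IS NULL);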

While SQL Server does not yet have this construct, the SQL Standard added another comparison operator for this problem. The feature was introduced in two steps: SQL:1999 added T151, the DISTINCT predicate, and the optional negation with NOT was added by SQL:2003 as feature T152, DISTINCT predicate with negation.
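In Standard SQL, the same test would be written with the DISTINCT predicate (a sketch only; check whether your version of SQL Server supports it before relying on it):

SELECT A.part_nbr, B.part_nbr
  FROM Alpha AS A
  INNER JOIN Beta AS B
    ON A.lot_code IS NOT DISTINCT FROM B.lot_code;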

Note that you have to use the negated form to get the results you want. The un-negated form is not really equality. This is easy to see with a truth table:

A      B      A = B      A IS NOT DISTINCT FROM B
0      0      TRUE       TRUE
0      1      FALSE      FALSE
0      NULL   UNKNOWN    FALSE
NULL   NULL   UNKNOWN    TRUE

Set Operators and NULLs

SQL has set operators (UNION, INTERSECT, EXCEPT) which work on table expressions. Both tables have to be what we call ‘union compatible,’ which means the tables have the same structure (the corresponding columns in each table are in the same order and have compatible data types), and the result will have that structure. Most people don’t know that technically the result table does not have a name, nor do its columns, unless you actually assign them with a <set expression> AS <table name> (<column name list>) construct.

The UNION operator discards duplicate rows and uses the grouping relation rather than equality, so multiple NULLs are discarded along with duplicate values. The EXCEPT and INTERSECT operators treat NULLs in the same way.
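A small sketch with inline literal rows:

SELECT x FROM (VALUES (1), (NULL), (NULL)) AS T1(x)
UNION
SELECT x FROM (VALUES (2), (NULL)) AS T2(x);
-- result: 1, 2, and a single NULL row; the duplicate NULLs collapse into one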

OUTER JOIN and NULLs

OUTER JOINs come in three flavors (LEFT, RIGHT, FULL) and were designed to solve an actual problem. Before the SQL-99 Standard, there was no standardized syntax for them nor a Standard definition of how they would work. The Sybase/SQL Server family did it one way and Oracle did it another way; then there was a product from a company called Gupta Technologies LLC that let you pick which one you wanted to use. I’m going to assume that everyone knows how an outer join works. The ‘preserved table’ is the one on the left (or right, or both) side of the join operator, and the unpreserved table, if any, is the one on the opposite side. Since all data types in SQL must be NULLable, the rows in the preserved table that did not match the join condition are kept, and the columns from the unpreserved table are padded out with NULLs for those rows. It doesn’t matter if the original columns in the unpreserved table were declared NOT NULL, because the result of the join is technically a whole new table.
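A sketch with illustrative names: the columns that come from the unpreserved side are padded with NULLs, even though Staff.emp_name is declared NOT NULL.

CREATE TABLE Departments
(dept_name VARCHAR(20) NOT NULL PRIMARY KEY);

CREATE TABLE Staff
(emp_name  VARCHAR(20) NOT NULL PRIMARY KEY,
 dept_name VARCHAR(20) NOT NULL REFERENCES Departments (dept_name));

INSERT INTO Departments VALUES ('Sales'), ('Audit');
INSERT INTO Staff VALUES ('Alice', 'Sales');

SELECT D.dept_name, S.emp_name
  FROM Departments AS D
  LEFT OUTER JOIN Staff AS S
    ON D.dept_name = S.dept_name;
-- returns ('Sales', 'Alice') and ('Audit', NULL)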

OLAP and NULLs

When OLAP (online analytical processing) databases first came in, ANSI responded by defining some basic hierarchical aggregations in ANSI/ISO Standard SQL-99. They are defined as extensions to the GROUP BY clause. The original three were GROUPING SETS, ROLLUP and CUBE; the last two are defined in the Standard using the GROUPING SETS construct. ROLLUP and CUBE are often called ‘super groups’ because they could be written out using the regular GROUP BY and UNION operators. As expected, the NULLs form their own group. However, we now have a special function, GROUPING(<column reference>), which returns a one if the NULL in that column was generated by the grouping operation and a zero otherwise. Then SQL:2003 added a multicolumn version, GROUPING_ID, that constructs a binary number from the zeros and ones of the columns in the list, using an implementation-defined exact numeric data type; this is handier than you might think.

These constructs allow you to do what we used to call ‘control break reports’ back in the pre-SQL days. You put a sequential file in sorted order and pass it through a program that would keep running totals in accumulators (an old term that actually used to refer to a physical feature in unit record equipment). When the controls (the columns at various levels in the hierarchy) changed, the accumulators were printed out and reset. Most of the time, frankly, this was doing running totals.

This is probably easier to see with an example. Imagine we have a file that gives us a region number and the city name along with the total sales for that city. We want to get a report that shows us the totals by region, the totals by city within the regions, and finally, a grand total for all sales in the company.
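The query itself is not reproduced here, but a sketch of what is being described, with illustrative table and column names, is a ROLLUP over the region and city:

SELECT region_nbr, city_name, SUM(sales_amt) AS total_sales
  FROM CitySales
 GROUP BY ROLLUP (region_nbr, city_name);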

Assuming there is some sample data, the output might look like this (the right-most column is a comment, not part of the output):

 

region   city            total sales   comment
'006'    'Austin'        500.13        city within region total
'006'    'Dallas'        2010060.5
'006'    'San Antonio'   475.01
'006'    NULL            1190902.75    region total
NULL     NULL            3426563.75    grand total

The general rule in SQL is not to do data formatting in the database tier. You pass the results of the database layer to a presentation layer, and that layer adds the colors and labels, does any special filtering and calculations, and so on. The purpose of the database tier is just the raw data. However, having said that, you can write something like this:
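As a sketch only, reusing the illustrative CitySales names from above and the GROUPING() function described earlier:

SELECT region_nbr, city_name, SUM(sales_amt) AS total_sales,
       CASE
         WHEN GROUPING(region_nbr) = 1 THEN 'grand total'
         WHEN GROUPING(city_name) = 1 THEN 'region total'
         ELSE 'city within region total'
       END AS total_type
  FROM CitySales
 GROUP BY ROLLUP (region_nbr, city_name);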

Again, you are doing something that is not recommended.

Avoiding NULLs

We debated this in the early days of ANSI X3H2. Some of the early products used regular equality for their groupings, so each NULL became its own group. This did not work out so well. The example we had in the committee was a table of traffic tickets issued in California. Quite logically, the database designer used NULL for missing auto tags; the words none, nothing, missing, etc. (in multiple languages!) had been used on prestige tags, and the way the law was written, they were perfectly legal. There was no special checkbox on the traffic tickets for a missing tag; it had to go in the space for the tag number. The huge number of missing tags made reporting impossible when each of them became its own row in the summary reports.

As a default, you need to assume that all your columns will be NOT NULL, then go back and decide exactly what a NULL would mean for each particular column. If your column is on a nominal or categorical scale (see this article), you can create encodings for the missing values. For example, the ISO 5218 sex codes are (‘0’ = unknown, ‘1’ = male, ‘2’ = female, ‘9’ = not applicable or lawful person). A lawful person includes things like corporations, governments, and so forth. If a column is a temporal data type, then NULL is often used as a symbol for eternity when marking the open end of (start_timestamp, end_timestamp) intervals that have not closed yet.
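A sketch of both of those ideas, with illustrative names:

CREATE TABLE Job_Assignments
(emp_id          INTEGER   NOT NULL,
 sex_code        CHAR(1)   NOT NULL DEFAULT '0'
     CHECK (sex_code IN ('0', '1', '2', '9')),  -- ISO 5218: unknown, male, female, not applicable
 start_timestamp DATETIME2 NOT NULL,
 end_timestamp   DATETIME2,                     -- NULL = 'eternity', the interval is still open
 CHECK (start_timestamp < end_timestamp));      -- UNKNOWN when end_timestamp is NULL, so open rows pass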

Using zero for numeric data elements may or may not work as a missing value token. Blanks or empty strings may or may not work as a missing value for text. Neither of these options has the built-in special characteristics that NULLs have in SQL.

Conclusion

The simple fact is that in SQL, you really can’t escape NULLs. But think of them the way you would think about any other data design decision. Is it necessary? Is it sufficient to express the nature of the model? Is it easily understood by someone who is going to have to maintain this after you’re gone?