A UNIQUE experience

A properly designed table must have a key to uniquely identify each row. In this article, Joe Celko explains unique and primary keys.

If you look at the postings on SQL forums, you will notice that people are sloppy. They expect that the reader can read their minds and figure out everything that’s not been posted. They seldom post DDL, and when they do, it is also wrong. I don’t mean a little bit wrong. I mean fundamentally wrong. RDBMS has the concepts of (Relational) keys, so DDL without at least one key means you don’t have a table at all. If all the columns are NULL-able, they cannot ever be part of any key. This non-table is really a deck of 1960s punch cards hidden as an SQL table.

Dr. Codd defined a key as being a subset of columns in a table. A column is defined as an attribute of the entity that the table models. The IDENTITY table property is left over from sequential file days is not an attribute of an entity; it is a metadata attribute of the table. The GUID suffers the same problem of not being a proper attribute of the entities within the table in which they appear. Yes, they are UNIQUE, but they’re not attributes. Just sticking the constraint of PRIMARY KEY on them doesn’t avoid the fundamental problem. Novices forget that the “G” in GUID stands for “global,” and they are not local. Such attributes are specifically not in the table! These things cannot be keys by definition.

What is a key?

The definition of the key:

1) A key must be a non-empty subset of columns in the table (attributes of the entity). This means it cannot be an external pointer or physical locator to rows in the table. As an analogy, when you identify an automobile, you don’t use a parking space number in some particular garage; you use the VIN.

2) A key must be unique within the table. This is necessary but not sufficient. In fact, the best keys are unique across the whole schema and then some. This is another reason to use VIN for automobiles; it’s useful to the DMV, your insurance company, and any other use you should have for that automobile. It can be validated by using a really ugly regular expression or verified by physically going to your car and looking at the chassis or the etchings on your windshield. Validation and verification of data is a slightly different topic, so I’ll just skip over them for now.

3) A key must be precise, not approximate or vague. You don’t think about it, but this automatically eliminates approximate numeric types as keys. Two floating-point numbers are considered equal if they differ by a small value called the epsilon. This means it two different floating-point numbers can be treated as equal. This need for precision is also the reason that NULLs are not allowed in a key. Remember that a NULL is not a value. I’ll get into some of the problems with NULLs shortly.

A bit of history

When Dr. Codd created the relational model, data was still in file systems and a few early networks and hierarchical databases. Our mindset was still stuck in these pre-relational tools, so the good doctor came up with the idea that every table should have a PRIMARY KEY. It was defined as a key that we anointed as somehow special. A table could only have one of these (Wow! Just like the sort order in sequential files!).

In the early days of the SQL Standards, we considered requiring a PRIMARY KEY on a table but decided against it. The idea was that legacy data without keys (think mag tapes and punch cards) would be easier to move into SQL products without that constraint.

Shortly after that, Dr. Codd realized that a key is a key. There was no need to mark one as special. This is what led to the UNIQUE constraint.

UNIQUE and PRIMARY KEY constraints

The UNIQUE and PRIMARY KEY constraints say that no duplicate values are allowed in the column. It comes in two forms. The syntax is:

There are some subtle differences between UNIQUE and PRIMARY KEY. There still can be only one PRIMARY KEY per table but many UNIQUE columns. A PRIMARY KEY is automatically declared to have a NOT NULL constraint on it, but a UNIQUE column can have a single NULL in it unless you explicitly add a NOT NULL constraint. Adding the NOT NULL whenever possible is a good idea, as it makes the column into a proper relational key. I also add NOT NULL to PRIMARY KEY declarations to document the table and to be sure it stays there when the key changes.

File system programmers understand the concept of a PRIMARY KEY, but for the wrong reasons. Their mindset is a sequential file, which can have only one key, because that key is used to determine the physical sort order of the records within the file. There is no ordering in a table; the term PRIMARY KEY in SQL has to do with defaults in referential actions, which is another topic.

A single NULL is allowed in the UNIQUE constraint to be able to match a NULL as if it were a value. The same type of logic is applied to the GROUP BY clause so that each NULL doesn’t become its own group. There’s a cute story about this decision that has to do with traffic tickets. Before SQL, someone got a prestige tag reading missing and was inundated by hundreds of traffic tickets because that’s what the police officers wrote for the missing auto tag everywhere in California.

Multi-column UNIQUE constraints

There is also a multiple-column form of the <UNIQUE specification>, which is usually written at the end of the column declarations. It is a list of columns in parentheses after the appropriate keyword; it means that the combination of those columns is unique.

Broadly speaking, there are two patterns for multi-column keys. One pattern is like coordinates, where each attribute is strong and has meaning in its own right. For example, I might declare PRIMARY KEY (city, department) so I can be sure that although I have offices in many cities and many identical departments in those offices, there is only one personnel department in Chicago.

The other pattern is hierarchical: one attribute is subordinate to another attribute. A classic example is an order number and an order detail number subordinated to that particular order. This has to do with weak and strong entities, and that’s another topic.

Super keys

A super key is a key that is too big. It has more columns in it than are needed to identify an entity in a table. In practice, you generally don’t like these things. The indexes or hash tables or whatever your SQL engine uses to implement uniqueness will have to do extra work carrying the redundant attributes. The problem is it isn’t always obvious which attributes are redundant. Imagine a super key UNIQUE (a, b, c, d) In which any three of the four columns is also unique. Which possible combination(s) do you want to keep?

Overlapping keys

You can also have overlapping keys, as in UNIQUE (a, b, c) and UNIQUE (b, c, d), which share columns (b, c).

Nested unique constraints

One of the basic tricks in SQL is representing a one-to-one or many-to-many relationship with a table that references the two (or more) entity tables related by their primary keys. This third table has several popular names such as “junction table,” “Associative Entity,” or “join table,” but we know that it is a relationship. The term “junction table” is a pointer structure from network databases, not part of an RDBMS. For example, given two tables for a Cafeteria database:

Yes, I know using names for a key is a bad practice, but it will make my examples easier to read. The simplest relationship table looks like this:

The Meals table allows creating simple one entree/one side dish meals insert rows like this:

Oops! ‘Meat Loaf’ is shown twice with ‘Mashed Potatoes’ because the Meals table does not have its own key. This is an easy mistake to make, but fixing it is not obvious.

The Menu table gets rid of the duplicated rows and makes this a proper table. The PRIMARY KEY for the table is made up of two or more columns and is called a compound key because of that fact. These are valid rows now.

But the only restriction on the Meals is that they appear only once. Every entree can be paired with every side dish. I want a rule that entrees can have as many side dishes as they want, but the side dishes have to stick to one entree.

The way I do this is to use a NOT NULL UNIQUE constraint on the sidedish_name column, which makes it a key. It is a simple key since it is only one column, but it is also a nested key because it appears as a subset of the compound PRIMARY KEY.

Plate_Combos is a proper table, without duplicated rows, but it also enforces the condition that each entree can have one or more side dishes.

Instead, you might want the side dishes to be combined with more than one entree.

The Single_Side_Plates table would permit these rows from the original set.

Think about all of these possible keys for a minute. The compound PRIMARY KEY is now redundant. If each entree appears only once in the table or each side dish appears only once in the table, then each (entree_name, sidedish_name) pair can appear only once. A rather dull option of a single entree with a single side dish can be enforced with this DDL:

The Meals table allows inserting these rows from the original set.

Making special provisions for the PRIMARY KEY in the SQL engine is not a bad assumption because the REFERENCES clause uses the PRIMARY KEY of the referenced table as the default. Many new SQL programmers are unaware that a FOREIGN KEY constraint can also reference any UNIQUE constraint in the same table or another table. Such things can get very complicated and require that you defer constraints. Again, that’s another topic.

Overlapping keys

Getting back to the nested keys, just how far can you go with them? My favorite example is a teacher’s schedule kept in a table like this (I am leaving off REFERENCES clauses and CHECK() constraints):

That choice of a PRIMARY KEY is the most obvious one — use all the columns. Typical rows would look like this:

The rules to be enforced are:

1) A teacher is in only one room each period.

2) A teacher teaches only one class each period.

3) A room has only one class each period.

4) A room has only one teacher in it each period.

Stop reading and see what you come up with for an answer. Okay, now consider using one constraint for each rule in the list.

There are four ways to pick three things from a set of four things; it is called a combination. I could drop the PRIMARY KEY as redundant if I have all four of these constraints in place. But what happens if I drop the PRIMARY KEY and then one of the constraints?

I can now insert these rows in the second version of the table:

This gives me a very tough sixth-period teaching load since I have to be in two different rooms at the same time. Things can get even worse when another teacher is added to the schedule:

Ms. Shields and I are both in room 223, trying to teach different classes at the same time. Matthew Burr looked at the constraints and rules and came up with this analysis.

If a teacher is in only one room each period, then given a period and a teacher, I should be able to determine only one room, i.e., the room is functionally dependent upon the combination of teacher and period. Likewise, if a teacher teaches only one class each period, then class is functionally dependent upon the combination of teacher and period. The same thinking holds for the last two rules: class is functionally dependent upon the combination of room and period, and teacher is functionally dependent upon the combination of room and period.

With the constraints provided in the first version, you will find that the rules are not enforced. For example, I could enter the following rows:

These rows violate rule #1 and rule #2.

However, the UNIQUE constraints first provided in Class_Schedule_2 do not capture this violation and will allow the rows to be entered.

The constraint

is checking the complete combination of teacher, room, and period. Since (‘Mr. Celko‘, 222, 6) is different from (‘Mr. Celko‘, 223, 6), the DDL does not find any problem with both rows being entered, even though that means that Mr. Celko is in more than one room during the same period.

doesn’t catch its associated rule either since (‘Mr. Celko‘, ‘Database 101‘, 6) is different from (‘Mr. Celko‘, ‘Database 102‘, 6). Mr. Celko can teach more than one class during the same period, thus violating rule two. It seems that it’s also possible to add the following row:

which violates rules #3 and #4.

Conclusion

Try to imagine enforcing this with procedural code. This is why I say that most of the work in SQL is done in the DDL. You will not always need complicated, nested, overlapping uniqueness constraints, but they can be incredibly powerful when you do.