Codd’s Twelve Rules

Dr. Codd proposed the normalization rules we use to design databases, but did you know that he also came up with rules that vendors must meet to call their products relational database systems? In this article, Joe Celko explains the thirteen RDBMS rules.

Let me take you back to those thrilling days of yesteryear when the relational model and relational databases were the fads that all hip kids were getting into. Yes, we have fads in IT. Sometimes it’s a new programming language that will solve all of our problems. Think about PL/1, Ada, and dozens of others that have come and gone. Sometimes it’s a new technique that will solve our problems. Structured Programming, Software Engineering, Agile, and others I have forgotten. We used to joke that the current “fad du jour” was like teenage sex. Everyone claimed they knew what it was; they were already doing it and had always done it, but they were a little weak on some of the details.

Software vendors were particularly bad about this. Any database product they had was relational! Just ask the salesperson; you know salespeople never lie. Dr. Codd, the creator of relational databases, was bothered by this, so he drew up a set of 13 rules that a product had to meet to be considered relational. The paper is referred to as “Codd’s Twelve Rules” or sometimes as “Codd’s Twelve Commandments”, despite the fact that there were actually 13 of them because the numbering started with zero. In particular, Rule 12 was created to prevent some of this marketing hype.

I’ll begin by going through the rules.

Rule 0: The Foundation Rule

For any system that is advertised as, or claimed to be, a relational database management system, that system must be able to manage databases entirely through its relational capabilities. This means you don’t get to use a host language to do anything inside the database. If you remember the original database systems, they had to be embedded in COBOL or some other host language. They were essentially collections of procedure calls for data access, rather than what we would think of as a database today.

Rule 1: The Information Rule

All information in a relational database is represented explicitly at the logical level and in exactly one way – by values in tables.

Notice the phrase “logical level” and that there’s nothing about how physical storage is done. There’s nothing about pointer chains. There’s nothing about the physical position of data in arrays, files, or anything else. Remember that Dr. Codd started as a mathematician, so he thought abstractly. Today, this rule is worded as “scalar values in the columns of rows in tables”, but various vendor implementations allow non-scalar data in the table.

There is also some confusion about “atomic” versus “scalar”; a value is atomic if it cannot be further decomposed without losing information. For example, an American shoe size of 8½ B is a complete measurement. Either the length (8½) or the width (B) by itself has lost information about the actual shoe size. The term scalar has to do with scales and measurements, and I have written separate articles on this topic.

Rule 2: The Guaranteed Access Rule

Each and every datum (atomic value) in a relational database is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

Please note Dr. Codd was still talking about a “primary key” at this time. This term is also part of SQL. We still agree that to have a table, you must have a key because that’s how data is located in an RDBMS.

But the concept of a primary key was a leftover from sequential files; it was so ubiquitous that we didn’t even think about it. To use a sequential file, the data has to be sorted on a key. Obviously, you can have only one such sort key. A bit later, Dr. Codd realized that all keys are keys at the logical level. To paraphrase “Animal Farm”, we realized you couldn’t have some keys “more equal than others”, so there was no need to make one special.

Rule 3: Systematic Treatment of NULL Values

NULL values (distinct from the empty character string or a string of blank characters and distinct from zero or any other number) are supported in fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type.

This is another concept that’s grown since Codd’s original paper. In the second version of the relational model, Dr. Codd defined “applicable” and “inapplicable” forms of NULL. An applicable NULL, a type A NULL, is used when the entity has the attribute, but we don’t know what its value is right now. An inapplicable NULL, a type I NULL, is used when the entity simply doesn’t have the attribute, so its value cannot ever be resolved.

Because these two NULLs did not come along until after we had gotten pretty far into SQL, the language only has one NULL which serves both purposes. Unfortunately, it gets more complicated after that.

The first complication is that Dr. Codd defined the NULL as having no data type. When you’re implementing a compiler in a strongly typed language like SQL, you really need a data type for all of the data elements. This is why we have to write CAST(NULL AS <data type>) to play safe, and why being NULL-able is a part of the declaration of a column in the DDL.
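For example, the branches of a UNION have to agree on their data types. With hypothetical Shipped_Orders and Pending_Orders tables, you might write something like:

SELECT order_nbr, ship_date
  FROM Shipped_Orders
UNION ALL
SELECT order_nbr, CAST(NULL AS DATE) -- pending orders have no ship date yet
  FROM Pending_Orders;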

The second complication is how a NULL sorts. Is it always higher than any value in the data type of the column? Always lower? Originally, each vendor could have its own rules. Currently, the ANSI/ISO standards let the programmer explicitly determine how NULLs are sorted.
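Since SQL:2003, the standard syntax lets you say it explicitly. A sketch, on a hypothetical Sales_Staff table:

SELECT emp_name, commission_amt
  FROM Sales_Staff
 ORDER BY commission_amt DESC NULLS LAST; -- unknown commissions sort to the bottom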

The third complication is that even if you have declared everything in your schema to be NOT NULL, SQL will generate NULLs. There is no escape! The OUTER JOINs create NULLs in the unpreserved table, to pad out the rows in the result. In some cases, returning an empty set will be shown as a NULL. We also have NULLs created by an OLAP operation. OLAP is worth an article in its own right, but for now, consider the basic operations that are options in the GROUP BY clause: CUBE, ROLLUP, and GROUPING SETS.

Looking at just the ROLLUP as a representative example, this lets us write what we would have called a “control break” report in the old days. These are the basic reports that would list the details, and then at each level of the hierarchy, print out aggregate functions (usually summations) between the groups. For decades, this was the major use of computers and green-bar continuous printer paper in data processing.

But unlike a report, the result of a query has to return a table. And by definition, the rows in a table all have the same structure. This is one of those things that is easier to see with an example. Assume I have a simple table of sales data:
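Something like this minimal skeleton will do; the names are just for illustration:

CREATE TABLE Sales
(state_code CHAR(2) NOT NULL,
 city_name VARCHAR(25) NOT NULL,
 sales_amt DECIMAL(10,2) NOT NULL,
 PRIMARY KEY (state_code, city_name));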

My OLAP query gives me the totals at the state level and the city level:
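Using the skeleton above, it might look like this:

SELECT state_code, city_name, SUM(sales_amt) AS sales_tot
  FROM Sales
 GROUP BY ROLLUP (state_code, city_name);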

The resulting output from this query looks like this:

AL    Birmingham   200.00      ← detail rows
...
AL    NULL         4500.00     ← state-level total rows
...
NULL  NULL         115000.00   ← grand total row

Much like the OUTER JOINs, the OLAP operations create NULLs which were not in the original data. As is usual with grouping in SQL, the NULLs are treated as a group in their own right. But there’s a serious problem here. What does the NULL at these various levels mean? It’s not quite the same as an A-type or I-type missing value. For example, look at a row that gives the city-level totals. There are no NULLs at that level, but when I go up to the state level, the NULL in the city column is shorthand for the whole subset of city names it summarizes. This is a very different meaning from a NULL used as a scalar value.

As expected, the NULLs form their own group. But how would I tell the difference between a created NULL and one that was in the original data? The solution, which we got in the SQL:2003 standard, is a function:

GROUPING(<column reference 1>, ..., <column reference n>)

This function returns a vector in which a one means the NULL was created by the query, and a zero means it came from the original data.
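Applied to the sales skeleton, it looks something like this:

SELECT state_code, city_name, SUM(sales_amt) AS sales_tot,
       GROUPING(state_code) AS state_grp, -- 1 = NULL created by the rollup
       GROUPING(city_name) AS city_grp
  FROM Sales
 GROUP BY ROLLUP (state_code, city_name);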

Another consideration with NULLs is when we use them with temporal data. The ISO-8601 standards are based on a half-open interval model of time. That means we always know the starting point in time of an interval, but there may or may not be an ending point yet. If the event is still ongoing, we can’t terminate it. This is why there is no such thing as 24:00:00 Hrs today because it is actually 00:00:00 Hrs of the next day.

The classical way of modeling ongoing events is to use NULL for the ending timestamp. This lets us write things like COALESCE(event_ending_timestamp, CURRENT_TIMESTAMP) in our queries. But it also means that a NULL can be interpreted as a symbol for “eternity”, which we don’t have in SQL.
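For example, with a hypothetical Tasks table, this finds the tasks that are still active right now:

SELECT task_id
  FROM Tasks
 WHERE CURRENT_TIMESTAMP
       BETWEEN task_start_timestamp
           AND COALESCE(task_end_timestamp, CURRENT_TIMESTAMP);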

Rule 4: Dynamic Online Catalog Based on the Relational Model

The database description is represented at the logical level in the same way as ordinary data, so that authorized users can apply the same relational language to its interrogation as they apply to the regular data.

SQL products are generally pretty good about this, but you have to be careful that you don’t mix data and metadata. There are schema information tables defined in the standards. In the real world, each vendor has its own schema information tables, which also include things that are particular to its implementation.
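For example, the standards define an INFORMATION_SCHEMA, so listing the base tables is an ordinary query (the exact names vary by product):

SELECT table_name
  FROM INFORMATION_SCHEMA.TABLES
 WHERE table_type = 'BASE TABLE';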

Rule 5: The Comprehensive Data Sublanguage Rule

A relational system may support several languages and various modes of terminal use (for example, the fill-in-the-blanks mode). However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and that is comprehensive in supporting all of the following items:

1. Data definition.

2. View definition.

3. Data manipulation (interactive and by program).

4. Integrity constraints.

5. Authorization.

6. Transaction boundaries (begin, commit and rollback).

Again, SQL products are pretty good about this. Unless you’re a bit older, you have probably never seen QBE (Query By Example) or other such tools. SQL is divided into three sublanguages: DDL (data definition language), DCL (data control language), and DML (data manipulation language). When we were designing the sublanguages in the SQL standards, we very deliberately decided that the languages would be LALR(1). This has to do with the type of grammar used by the parsers in a computer language. If you like having flashbacks to your compiler-writing classes as a freshman, you could look up one of these two articles:
https://en.wikipedia.org/wiki/LALR_parser
https://web.cs.dal.ca/~sjackson/lalr1.html

Rule 6: The View Updating Rule

All views that are theoretically updatable are also updatable by the system.

This requirement simply does not work. Updatable views are incredibly complicated, and it’s not practical to implement them fully. In particular, you might want to read a book by Chris Date, “View Updating & Relational Theory: Solving the View Update Problem” (ISBN 9781449357849).

Vendors have pretty much settled for an updatable VIEW having to map to distinct rows in a single base table. Trying to do an “un-JOIN” on a VIEW built on more than one table is problematic. If the VIEW has computations, trying to come up with the inverse functions is not always possible. Throw in some CASE expressions, and it’s an incredible mess.

The WITH CHECK OPTION has been in the language almost since the beginning, but it’s not well understood. Again, it’s one of those things which is easier to explain with an example. Let’s assume we have a skeleton table of our salespersons:
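The names here are just for illustration:

CREATE TABLE Salespersons
(emp_id INTEGER NOT NULL PRIMARY KEY,
 emp_name VARCHAR(30) NOT NULL,
 city_name VARCHAR(25) NOT NULL,
 state_code CHAR(2) NOT NULL);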

Now create a VIEW of the salespersons in Austin, Texas:
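CREATE VIEW Austin_Salespersons
AS
SELECT emp_id, emp_name, city_name, state_code
  FROM Salespersons
 WHERE city_name = 'Austin'
   AND state_code = 'TX';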

This is a perfectly good VIEW of all the guys working in Austin, Texas. Since it obviously has one row in the VIEW mapping to one row in the base table, we can write:
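UPDATE Austin_Salespersons
   SET city_name = 'Boston',
       state_code = 'MA';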

Oops! We just moved everybody out of Texas to Massachusetts. However, if we had put a WITH CHECK OPTION on the end of the CREATE VIEW statement, then the WHERE clause would be re-evaluated, and such an update would be disallowed because it moves rows out of the VIEW.
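The safer declaration adds one clause at the end:

CREATE VIEW Austin_Salespersons
AS
SELECT emp_id, emp_name, city_name, state_code
  FROM Salespersons
 WHERE city_name = 'Austin'
   AND state_code = 'TX'
WITH CHECK OPTION;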

Rule 7: Possible for High-Level Insert, Update, and Delete

The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data but also to the insertion, update and deletion of data.

Obviously, insertion, update, and deletion can be done on base tables in every SQL product. However, the most complicated options are in the MERGE statement.
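As a sketch, assume a hypothetical Daily_Sales staging table with the same structure as the Sales skeleton from earlier; this folds a day’s figures into the base table in one statement:

MERGE INTO Sales AS tgt
USING Daily_Sales AS src
   ON tgt.state_code = src.state_code
  AND tgt.city_name = src.city_name
WHEN MATCHED THEN
     UPDATE SET sales_amt = tgt.sales_amt + src.sales_amt
WHEN NOT MATCHED THEN
     INSERT (state_code, city_name, sales_amt)
     VALUES (src.state_code, src.city_name, src.sales_amt);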

Rule 8: Physical Data Independence

Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods.

Again, SQL is pretty good about this feature. The older network and hierarchical databases required rewriting the code when a new index, hash table, or whatever was added. You had to open a given index and use it explicitly.

Rule 9: Logical Data Independence

Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit unimpairment are made to the base tables.

Notice the idea is that the information is preserved, even when its representation is altered. For example, if you change a column from the INTEGER to the DECIMAL(n, 0) data type, and don’t permit fractional values, then you should expect either representation to behave the same way.
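In ANSI/ISO syntax (products spell this differently), such a change on a hypothetical Inventory table might look like:

ALTER TABLE Inventory
ALTER COLUMN qty_on_hand SET DATA TYPE DECIMAL(10, 0);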

Again, SQL is pretty good about this feature. A SELECT statement runs as written, even after I’ve altered the tables. This concept was very hard for programmers who grew up with conventionally compiled languages. If I compiled a Fortran program with a particular Fortran compiler, the executable code would always be the same. The same query run on the same version of SQL can produce different execution plans, depending on the other users (can we share data among them?), the current data types (has a column been altered within its data type family?), changed access methods, and data statistics.

Rule 10: Integrity Independence

Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.

Data integrity is part of the DDL in SQL. SQL engines allow column-level constraints (CHECK(), DEFAULT, NOT NULL) and simple DRI table constraints (REFERENCES). In the ANSI/ISO standards, we also have a CREATE ASSERTION statement that is like a schema-level CHECK() constraint. A table constraint is always true for an empty table, but an assertion can handle empty tables and span multiple tables.
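Here is a sketch; the Orders and Customers tables are hypothetical, and be warned that very few products actually implement CREATE ASSERTION:

CREATE TABLE Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY,
 cust_nbr INTEGER NOT NULL
    REFERENCES Customers (cust_nbr),
 order_qty INTEGER DEFAULT 1 NOT NULL
    CHECK (order_qty > 0));

CREATE ASSERTION Orders_Exist -- a table-level CHECK() cannot enforce this
CHECK ((SELECT COUNT(*) FROM Orders) > 0);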

Rule 11: Distribution Independence

The end-user must not be able to see that the data is distributed over various locations. Users should always get the impression that the data is located at one site only.

This is a little more vendor-dependent, but SQL has no syntax to locate the physical storage of the data. When we look at RAID storage, we have no idea where the physical data is kept or even which copy of the data we are currently using. But if we mean a truly distributed database, in the sense of the Cloud or other network configurations, they didn’t exist when Dr. Codd set up these rules. In the years since then, managing a distributed database has become a separate topic in itself.

Rule 12: The Non-subversion Rule

If a relational system has a low-level (single-record-at-a-time) language, that low level cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher-level relational language (multiple-records-at-a-time).

We have cursors for this! In fact, the SQL model is based on the IBM magnetic tape drive functions. We didn’t have much choice when we were setting up the first RDBMS products since they were built on top of existing file systems and hardware. There was not much parallelism, column-oriented data storage, advanced hashing, or any of the other advances in computer science and hardware. Later, we added SQL/PSM, PL/SQL, Informix 4GL, T-SQL, and other procedural languages that work on a particular vendor’s product.

While it is possible in SQL products to turn off constraints, the rule has always been that at the end of the session, the database must return to a consistent state with all of the constraints turned back on. The ANSI/ISO SQL Standard lets you declare constraints as DEFERRABLE INITIALLY DEFERRED, DEFERRABLE INITIALLY IMMEDIATE, or NOT DEFERRABLE. SQL Server has commands to turn the constraints on and off explicitly.

Deferring constraints is usually done when you need to get an initial state in the database that involves self-referencing constraints. Such things are referred to as “the Garden of Eden constraints”, and they can be a bit tricky. For example, a constraint on a new row may have to refer back to rows that already exist in the table. But if you’ve just created the table, there are no rows to reference! So we need to defer the constraint, insert the initial rows, and then let the constraint be checked, as sketched below.
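A sketch of such a Garden of Eden situation, using the standard deferral syntax:

CREATE TABLE Personnel
(emp_id INTEGER NOT NULL PRIMARY KEY,
 boss_emp_id INTEGER NOT NULL
    REFERENCES Personnel (emp_id)
    DEFERRABLE INITIALLY DEFERRED);

START TRANSACTION;
INSERT INTO Personnel (emp_id, boss_emp_id)
VALUES (2, 1); -- boss #1 does not exist yet
INSERT INTO Personnel (emp_id, boss_emp_id)
VALUES (1, 1); -- the president is his own boss
COMMIT; -- the deferred constraint is checked here and passes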

Conclusion

One of our problems in IT is that terms drift. We are not as bad as politics, but it can get pretty bad if you don’t have some kind of guidepost. I think that Dr. Codd did a pretty good job of preserving a definition of the relational model. And while it took us a little while, I think SQL does a good job of meeting his goals as well.

References:

1. Codd, E. F., “Is Your DBMS Really Relational?”, ComputerWorld (1985-10-14).

2. Codd, E. F., “Does Your DBMS Run By the Rules?”, ComputerWorld (1985-10-21).

3. Codd, E. F., “The Relational Model for Database Management: Version 2”, Addison-Wesley (1990), ISBN 978-0-201-14192-4.