“On a scale from 1 to 10, what color is your favorite letter of the alphabet?”
–Slogan on a T-shirt
This is one of my favorite T-shirt gags because it teaches something about data types, scales and measurements. When you hear it, you automatically start to answer it as if it was a real question. The humor for us nerds comes from the absurdity of mixing different kinds of scales and data types.
We could compute “furlongs per fortnight” and actually get an answer that has some meaning. It would not be very useful, but at least it makes some kind of sense (distance per unit of time).The T-shirt slogan makes no sense whatsoever.
The simplest possible type of scale is called a “nominal” scale and all it does is give a name to an entity or data element. It’s really more important for what properties it does not have.
In fact, the nominal scale is so weak that originally people did not like to even include as a scale. After all, all you’ve done is name something. Its most important property is that it’s elements are unique. That is, we can use a name to identify an element in the set. But it establishes a logical principle called “The Law of Identity”.
The Law of Identity
The Law of Identity is the basis of all Western thought, and states that each and every thing is identical to itself. The shorthand version of this law, thanks to Ayn Rand, is that “A is A”. A more complete version, coupled with the concept of existence is that “to be is to be something in particular; to be nothing in particular or to be anything in general, Is to be nothing at all.” Or, if you prefer the idea is that every entity in the universe has an identity which makes it unique and associates certain properties (attributes) with it. An acorn can become an oak tree, but it cannot become an elephant and each acorn is a separate entity.
Identity is fundamentally important. There is an old Garfield comic strip in which Jon is talking to a friend about cats illustrates why the Law of Identity is so important.
Friend: “We had three cats when I was a kid.”
Jon: “What were their names?”
Friend: “Cat, cat and cat.”
Jon: “That doesn’t seem very useful. How did you tell them apart?”
Friend: “Who cares? Cats do not come when they’re called!”
What You Cannot do with Nominal Scales
To make this clearer, lets look at a couple of concepts.
1) There is no Inherent single display format.
Any unique string of symbols can serve as a name. In theory, if you give every employee his own emoji, instead of an employee id number. Historically, this was actually done during the renaissance by illiterate Italian stone cutters. Each of the sculptors had a glyph which looked like an alchemist symbol. These glyphs were used to label blocks of marble and hold them for various artists. If you like Ray Bradbury science fiction, in the movie “Fahrenheit 451” (1966, Oskar Werner, Julie Christie, Cyril Cusack), official personnel records used photographs instead of text.
In fact, the Unicode standard is very much concerned with display. First their concern with the font that is used, then they are concerned with the equivalent encodings which produce the same display character. Do you read this string from right to left, or left to right? And you still must pick a collation for your application.
For example, German has three collation systems that are in use (DIN, federal government and IBM) it was only a few decades ago that Spanish stopped treating “LL
” and “CH
” as a individual, separate letters for sorting dictionaries.
2) Nominal scales have no innate ordering
The relational model is based on sets, and sets have no concept of an ordering. They are completed wholes. The idea of an ordering is not part of the nominal scale, but it’s a very useful thing to have in the representation of the attribute is measured on the scale.
Many decades ago, I have a friend who taught American culture in China. Knowing that she could not read Chinese, the school Romanized the names of the students and printed them out next to the Chinese characters on a roster for her, so she could pronounce them (well try to pronounce them).
The problem was that they had no concept of alphabetical order, so each printout of the 150+ names in her class were arranged differently every time. Any ordering on a nominal scale is an ordering of the representation and not of the values of that data element.
Editor note: check here and here for further information if you are as curious as I was.
3) Nominal scales have no computations.
Yes, you can play with the representation of a data element, but it really has nothing to do with the nature of the data element itself. It would make absolutely no sense to take the squares of the employee id numbers or to average them. To quote a textbook from the 1800s, a number represents a magnitude or quantity. A name for an entity is neither of these things.
4) Digits are very useful symbols, but they have baggage.
As long as I’m doing these stories, let me repeat a classic folk story that’s been told and credited to several different sources over the decades. The version I first learned was that one school broke into the schoolhouse of their football rival and turned three pigs (or bulls or other large messy animals, depending on who is telling the tale) loose. Painted on the side of the animals were the digits “1”, “2”, and “4” respectively; The rival school spent quite some time looking for the animal with a “3” painted on it.
The baggage with using a string of digits to represent a nominal scale is that they are often seen as ordinal numbers, or as part of a sequence. They might well be for some side purpose (like as a numeric sequence useful mind tool), but it’s not required. This is why for a few centuries, we have had ordinal markers really didn’t order things.
You know that “Chanel no. 5”, “WD-40”, “#2 Pencil” or “Gate #9” are names because they include markers like “-“, “no.”, “nbr” or “#” (Computer people call it a hash mark, and Telephone people call it an octothorpe. It used to be a pound sign and several other things).
In fact, there were punctuation and display conventions that tell you the digits are part of a nominal scale this without special symbols. When you see the string “5” you are not really sure what it means by itself. But when you see “005” you know it has to be a nominal scale value. It may have some other properties; such as when you see”007” you immediately think of James Bond novels; if you’re a real fan you know there’s a “008” and immediately think of the whole fictional British spy apparatus.
INTEGER Is Not A Validated Data Type For A Nominal Scale
The TV show “Married With Children” briefly had a kid named 7 staying with the Bundys. One of the Borg was named “7 of 9” on “Star Trek: The Next Generation.” In the Peanuts comic strip there was a child named 555 95472. There seems to be a history of using numbers for names on in pop culture when you want to make something funny or unusual.
Why do people think it’s funny? Because instinctively you know that an entity is a totally different sort of thing from a magnitude, quantity, or ordering. Nominal scales are for entities.
Designing an Encoding
The Unicode people have done us a big favor. Every alphabet or symbol system you can find in Unicode must also include a set of symbols made up of Latin letters, digits, and a few punctuation marks. They exist so that all of the standard SI abbreviations four units can be represented in any text on earth. That means this character set is where you should limit your choice of symbols.
The string you pick can be a mix of alphanumeric symbols, and it can be fixed or varying length. Worst example of a confusing encoding is the alphanumeric strings allowed by the British postal system. They were constructed from the names of postal offices, most of which no longer exist. Compare this to the United States ZIP Code which can be validated with the simple regular expression “zip_code LIKE [0-9]0-9]0-9]0-9]0-9]’
.
Note: Please remember there’s a difference between validation and verification. Some ZIP code is valid when it is a string of five digits. But you cannot be sure that this code is actually in use; you have to verify it with the Postal Service.
IDENTITY Is the Worst Choice
The absolute worst choice you can make in designing a table is to use the old Sybase IDENTITY
table property column. It is a non-relational vendor extension that exist only because decades ago we built SQL on top of existing file systems. Just as a deck of punch cards or sequential tape file can have only one sordid order, tables can have only one IDENTITY
column.
First of all, this is a table property and not a relational column. Let’s drop back two absolutely basic definitions and relational theory. A column is an attribute of the entity modeled by the row in the table. but the integers in an IDENTITY
column have nothing to do with the entity at all. By way of analogy, imagine you have a parking garage. The parking spaces in this garage are sequentially numbered and as cars come in one at a time, they are assigned to a parking space. Think of punch cards and magnetic tape.
The sequential entry has another problem. SQL and the relational model are supposed to be set oriented. That means when I insert a set of rows into a table, it is supposed to occur all at once, and not in a sequence as with IDENTITY
, I can take the same set of inputs, shuffle them a little bit and then carry that data over to a second table or the second IDENTITY
column will not have the same values assigned to the new rows.
An actual correctly designed, key, such as the VIN (Vehicle Identification Number) would be a property of the automobile and not the parking space, so it would remain the same no matter where that vehicle shows up. Notice that includes automobile that appears outside of the parking garage completely. The VIN will be the same on your insurance papers, auto tags, police reports and so forth – it is a proper attribute and not some local table property.
Ideally, we would like a nominal scale to be as universal as possible. IDENTITY
is as local as possible. We would also like to be able to validate and verify the values of something for which we’ve bothered using nominal scales.
Conclusion
If you look at the overwhelming majority of designs for industry-standard encodings, you will see that they use the limited Unicode character set I just discussed, are fixed length strings (so that we can design simple computer files, paper forms and display screens easily) and over half of them have some kind of check digit, regular expression or other built-in validation.
Yes, doing your job correctly has some overhead and requires professional mind tools. Just because using integers for nominal scales is quick and easy, do not become “Larry the Cable Guy” and decide that you just need to “gitt’er done!!” when you’re programming.