Data, N-Tiles & Medians

Joe Celko takes a look at how SQL Server can do some of the work for you, such as calculating medians and enforcing referential integrity.

Many decades ago, when I was an honest programmer, I had a consulting job at the Atlanta headquarters of the Coca-Cola Company. Structured programming was the current fad, and the company was trying to switch over from COBOL to PL/1. We were trying to move historical records about Coca-Cola syrup sales into some kind of database. When I say “historical records”, I mean invoices and sales records that were written with beautiful penmanship, delivered on horse-drawn wagons and measured in firkins (a unit of liquid volume equal to half a kilderkin or 40.91481 litres, in case you forgot).

While this information made a nice exhibit at the Coca-Cola Museum in Atlanta, it had absolutely no value for making sales predictions and marketing. A lot of things had changed since the company was founded in 1892 (we will ignore removing cocaine from the product).

Most data has value based on how current it is. In some situations, like an avionics system, the data has to be immediate to be useful. That immediacy is enforced by blinking red lights and screaming sirens. When the sexy robot voice says, “The wings are a sheet of flames! The bomb is stuck in the bomb bay! There are incoming enemy missiles! We have run out of toilet paper!”, the last item in the list is really not that important, because of everything else going on. You can and will worry about toilet paper later. If you’re still alive.

Cost of getting data

There is also a cost to getting immediate data. The avionics system is expensive, but it makes sense in that situation. If you have not worked in the gaming industry, you might not be aware that slot machines are networked. In large casinos, management can send a query to this network and find out what the slot machines on the floor are doing. These networks are set up so that management can sample data at any given point in time. This lets the casino see which machines are being played and how many players are on them, and move the guests to where it wants them to be with promotional offers.

Obviously, the avionics real-time system needs more immediate data than a slot machine network. But more than that, the casino systems have to answer to more people. One of the problems with these systems was that if anything interesting was happening on the floor (a lot of really happy big winners jumping up and down, people walking off in groups, etc.), everyone in management could log on to get the full reports. Because the displays were pretty and easy to invoke, managers who had no real need for the data promptly tried to look at everything. This, of course, choked the system with “Lookie Lou” traffic, which meant that the rate of coins being fed into the slot machines slowed and cost the casino money. The trick in this case was to limit access to the people who really needed the data to make a decision.

Cost of keeping data

My third and final anecdote has to do with the fact that in America we have approximately one lawyer for every 400 people. We are a very litigious society. If you keep data too long, it spoils and goes from being valuable to being dangerous. This is why you need to have lawyers and professional records managers working for your company. Everyone knows, vaguely, what a lawyer is, but I have found that database and computer-oriented people often have no idea what a records manager does, or that such a profession even exists.

These are the people who set up the retention schedules and rules for a company’s data. Database people don’t want to give up any data at all, and that’s how Coca-Cola wound up with all of those paper invoices in an obscure warehouse. But life has gotten worse because we now have to worry about privacy rules. You’ll want to look at https://www.red-gate.com/simple-talk/sysadmin/data-protection-and-privacy/gdpr-in-the-usa/, which quickly discusses the General Data Protection Regulation (GDPR), Sarbanes-Oxley (SOX), Payment Card Industry (PCI) compliance, the Health Insurance Portability and Accountability Act (HIPAA), and the Family Educational Rights and Privacy Act (FERPA) as legal restrictions and obligations created by bureaucrats who have no idea how IT works.

The general rule of thumb is that if your records are subpoenaed, you have to produce them, if you have them. It doesn’t matter whether they have expired. So if you fail to get rid of the incriminating evidence, you can be criminally liable. These days, the term ROI has come to mean “risk of incarceration” instead of “return on investment”.

Discrete partitioning

A year-month combination is called an interval data type in ANSI/ISO Standard SQL, but Microsoft does not yet support these things, so we need to fake it.
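One way to fake it is a calendar table of month names kept in a character column with a CHECK constraint on the format. This is only a minimal sketch; the table and column names (Month_Periods, month_name) are illustrative assumptions, not part of any particular schema.

-- A calendar of valid year-month markers in 'yyyy-mm-00' form
CREATE TABLE Month_Periods
(month_name CHAR(10) NOT NULL PRIMARY KEY
   CHECK (month_name LIKE '[12][0-9][0-9][0-9]-[01][0-9]-00'
          AND SUBSTRING(month_name, 6, 2) BETWEEN '01' AND '12'));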

The next trick is to reference it from other tables.
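Something along these lines, where Bikini_Sales and its columns (account_nbr, sales_month, sales_qty) are assumed names that the later sketches will keep re-using:

-- Monthly sales rows may only use months that exist in Month_Periods
CREATE TABLE Bikini_Sales
(account_nbr INTEGER NOT NULL,
 sales_month CHAR(10) NOT NULL
   REFERENCES Month_Periods (month_name)
   ON UPDATE CASCADE,
 sales_qty INTEGER NOT NULL
   CHECK (sales_qty >= 0),
 PRIMARY KEY (account_nbr, sales_month));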

When a REFERENCES clause points to a key (PRIMARY KEY or UNIQUE) in the referenced table, those key values are the only values the referencing column can use. This is a lot better than doing something like adding
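a CHECK constraint on the format, written directly into the referencing table, along these lines (a sketch only; the constraint name is illustrative):

ALTER TABLE Bikini_Sales
  ADD CONSTRAINT valid_month_format
  CHECK (sales_month LIKE '[12][0-9][0-9][0-9]-[01][0-9]-00');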

or
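an explicit list of the allowed values, again only as a sketch:

ALTER TABLE Bikini_Sales
  ADD CONSTRAINT valid_month_list
  CHECK (sales_month IN ('2019-01-00', '2019-02-00', '2019-03-00',
                         '2019-04-00', '2019-05-00', '2019-06-00',
                         '2019-07-00', '2019-08-00', '2019-09-00',
                         '2019-10-00', '2019-11-00', '2019-12-00'));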

Neither of these approaches will allow DRI referential actions (the ON UPDATE and ON DELETE clauses) to be added to them.

I am using the MySQL convention of putting double zeros in the date fields (a field is a part of a column that has partial meaning in itself; it is not a column). This is not an ANSI/ISO standard yet, but it is up for consideration and is found in use because of the popularity of MySQL. The advantages are that it is language independent and sorts correctly with ISO-8601 formatted dates.

One of the rules of good database design is that you don’t store computations, especially computations that go over different levels of aggregation. You will find that you have to redo them or undo them to retrieve the atomic values that built them. Here’s a simple query that will unpack the year-to-date sales totals into monthly sales.
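A sketch of such a query, assuming the year-to-date totals were stored in a hypothetical table Bikini_Sales_YTD (account_nbr, sales_month, ytd_sales_qty); the names are assumptions for illustration:

SELECT account_nbr, sales_month,
       ytd_sales_qty
       - LAG(ytd_sales_qty, 1, 0)
           OVER (PARTITION BY account_nbr, LEFT(sales_month, 4)
                 ORDER BY sales_month) AS monthly_sales_qty
  -- the running total restarts each year, so partition by account
  -- and by the year part of the month marker
FROM Bikini_Sales_YTD;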

This is what we should have had in the first place, and not the year-to-date totals, which depend on a different level of aggregation as well as a temporal ordering.

The Pareto Principle

The Pareto Principle, informally known as the “80-20 rule”, is an aphorism which asserts that roughly 80% of outcomes (or outputs) result from 20% of all causes (or inputs) for any given event. The name comes from the heuristic that 80% of your profit comes from 20% of your customers, 80% of your expenses come from 20% of your personnel, and so on. In business, a goal of the 80-20 rule is to identify the inputs that are potentially the most productive and make them the priority. You can put the distribution into an algebraic expression, but it is best not used as an exact formula; this is not a firm law of the universe, only a heuristic.

You’ll see this idea also worded as the “90-10 rule” or “Sturgeon’s Law”, named after science fiction/fantasy author Theodore Sturgeon. It comes from a remark he made during an interview when he was asked why a man with his talents wrote for pulp magazines and other “lowbrow” publications. His response was that 90% of everything written is crap; that’s how you know the good stuff when you see it.

You can use the bikini sales for 2019 as an example. We know that account #1 has a total of 19 sales. 80% of this would be 15.2 bikinis, which would fall somewhere between ‘2019-09-00’ and ‘2019-10-00’. Essentially, what this says is that by the end of 2019 October, you’ve sold about as many bikinis as you’re going to sell for the year.

Now look at account #2. Doing the same math, we get to 80% of our annual volume when we’ve sold 20.8 bikinis. This would fall during 2019 August, which makes more sense intuitively than the October we had in the prior account.
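The same arithmetic can be done in SQL, shown here as a sketch using the assumed Bikini_Sales names from earlier: for each account, find the first month in which the running total crosses 80% of the annual total.

WITH Running_Sales
AS
(SELECT account_nbr, sales_month,
        SUM(sales_qty)
          OVER (PARTITION BY account_nbr
                ORDER BY sales_month
                ROWS UNBOUNDED PRECEDING) AS running_qty,
        SUM(sales_qty)
          OVER (PARTITION BY account_nbr) AS annual_qty
 FROM Bikini_Sales
 WHERE sales_month LIKE '2019-%')
SELECT account_nbr,
       MIN(sales_month) AS pareto_month   -- first month at or past the 80% point
FROM Running_Sales
WHERE running_qty >= 0.8 * annual_qty
GROUP BY account_nbr;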

The NTILE() function

NTILE() is a window function that distributes the rows of an ordered partition into a specified number of approximately equal groups, or buckets. It assigns each bucket a number starting at one. It takes its somewhat unusual name from the statistical practice of partitioning a set of values into n-tiles; look for the terms “quartile” (four buckets), “quintile” (five buckets) and “decile” (ten buckets). It depends on having an ORDER BY clause.

For each row in a group, the NTILE() function assigns a bucket number representing the group within its partition to which the row belongs. The goal is that each bucket will be the same size or differ by one.

The syntax is:
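NTILE (<bucket count>)
  OVER ([PARTITION BY <partition expression list>]
        ORDER BY <sort expression list>)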

The <bucket count> is the number of buckets into which the rows are divided. The bucket count can be an expression or subquery that evaluates to a positive integer. It cannot be a window function.

The PARTITION BY clause and the ORDER BY clause behave in the usual way that they do in other window functions. The ORDER BY clause is not optional because this function would not make sense otherwise.

The real trick in this function is in the bucket count. If the count of rows in the partition is a multiple of the bucket count, life is good, and each bucket gets the same number of rows.

If the number of rows is not evenly divisible by the bucket count, the NTILE() function returns groups of two sizes that differ by one row. The larger groups always come before the smaller groups in the order specified by the ORDER BY in the OVER() clause. For example, if you have 10 rows and a bucket count of three, then the first bucket will have four rows and the following two buckets will have three rows each.

It’s obvious that you can have up to (bucket count - 1) extra rows. Here’s a simple example demonstrating the use of the NTILE() function:
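This sketch, built on the assumed Bikini_Sales names, puts each month of each account into one of four buckets ordered by quantity sold:

SELECT account_nbr, sales_month, sales_qty,
       NTILE(4)
         OVER (PARTITION BY account_nbr
               ORDER BY sales_qty) AS quartile_nbr
FROM Bikini_Sales;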

Since you want to be able to get to the bucket numbers, this window function is probably going to be called in a CTE, so that the result, with its bucket numbers, is available by name to the query that follows.
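A sketch of that pattern, again with the assumed names; the CTE exposes the bucket number so that the outer query can filter on it:

WITH Quartiles
AS
(SELECT account_nbr, sales_month, sales_qty,
        NTILE(4)
          OVER (PARTITION BY account_nbr
                ORDER BY sales_qty) AS quartile_nbr
 FROM Bikini_Sales)
SELECT account_nbr, sales_month, sales_qty
FROM Quartiles
WHERE quartile_nbr = 4;   -- the strongest quarter of months for each account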

The median

The median is a value separating the higher half from the lower half of a data sample. It does not have to actually be in the set. The median is a better description of a typical value, or central tendency, than the simple mean (average) because it is not skewed so much by a few extreme values. Think about what happens to the average income in your neighborhood if Bill Gates moves in. For example, the data set {1, 2, 3, 4, 5} has a median of 3. For the set {1, 2, 3, 4, 5, 6}, the usual rule is to find the two middle values and average them; in this case, that gives us {3, 4}, which averages out to (3+4)/2 = 3.5. However, this is still not really good, and a better measure of central tendency is the weighted median, which includes the duplicates of the central values in the averaging. Consider the data set {1, 2, 2, 3, 3, 3}. The usual median computation would be (2+3)/2 = 2.5, but the weighted median is (2+2+3+3+3)/5 = 13/5 = 2.6, which shows the data set is slightly skewed toward three.

You can obviously use the NTILE() function to compute a median. This is a very quick version, and you might make some objections to the way it handles values in the middle. I’ll leave it as an exercise to the reader to improve this code:
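Here is one way to do it, sketched against the assumed Bikini_Sales column: NTILE(2) splits the values into a lower and an upper half, and the median is taken as the average of the top of the lower half and the bottom of the upper half. With an odd row count, the extra row lands in the first bucket, which is where the objection comes in.

WITH Halves
AS
(SELECT sales_qty,
        NTILE(2) OVER (ORDER BY sales_qty) AS half_nbr
 FROM Bikini_Sales)
SELECT (MAX(CASE WHEN half_nbr = 1 THEN sales_qty END)
        + MIN(CASE WHEN half_nbr = 2 THEN sales_qty END)) / 2.0
       AS median_qty
FROM Halves;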

Just a note, you can also get this from PERCENTILE_CONT:
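A sketch with the same assumed names; PERCENTILE_CONT() is a window function in SQL Server, so it repeats the answer on every row and needs DISTINCT (or TOP) to return a single value:

SELECT DISTINCT
       PERCENTILE_CONT(0.5)
         WITHIN GROUP (ORDER BY sales_qty)
         OVER () AS median_qty
FROM Bikini_Sales;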

A bit of history

Back in 1992 and 1993, there were two popular newsstand magazines devoted to databases: DBMS and Database Programming and Design. Window functions did not exist in the language at that time, so computing a median was not immediately obvious. Chris Date was writing in one magazine, and I was writing in the other. Computing the median became the “interview question” for advanced SQL programmers. Chris and I each did three variations; Philip Vaughn, Anatoly Abramovich (along with Yelena Alexandrova and Eugene Birger), and the late Ken Henderson also made contributions to this problem over the next few years. If you really want to look at the solutions, you can either go back to the archives of those magazines or look at the section in the third edition of my SQL for Smarties. But I will be honest and tell you that the code in those articles is of historical interest, not practical programming today.