Data Quality

Problems with data quality are easier and cheaper to prevent than to solve later. In this article, Joe Celko talks about data quality and where the issues tend to happen.

Around the years 2000 to 2005, the trade press was filled with articles on data quality. Several good books (see references) were written on the subject, and then it seemed to fade from the popular view. It really is something that should be taught in database classes, so that people begin to design databases with quality in mind instead of trying to bolt it on after something’s gone into production.

Data quality should be enforced at the system level, but this assumes that the individual tables and data sources within the system are themselves free of data quality problems. Quality is essentially a bottom-up process: if the inputs in the raw sources of data are clean and trustworthy, then the system as a whole can produce clean and trustworthy results. This is not guaranteed, however. Clean data is necessary but not sufficient for a quality database.

Data Quality at the System Level

Let’s look back to a time before most technologists thought in terms of databases and using software to ensure quality. There is actually an international job title, Certified Records Manager (CRM), which has been granted by the Institute of Certified Records Managers (ICRM) since 1975.

These people worked mostly with paper records and microfilm, largely because 30 or 40 years ago the laws required paper copies. In some states, even the color of the ink used was specified by law. One of the most important jobs of the records manager, particularly in insurance companies, was to make sure that you never kept anything too long. As a general rule, data is like vegetables: the fresher it is, the better it is, and if it gets too old, it’s useless. How long you retained old records was also a matter of law. In the insurance industry in the United States, this meant state-by-state differences in the laws and the retention cycles. The truth is that old data can be worse than useless. You could be required to produce documents in court cases if you still had them, so all retained data became potential evidence against your company.

It was also expensive to store lots of paper or microfilm records. Magnetic media is pretty resilient and very dense, and, if worse comes to worst, the data can be moved onto laser disks. It’s also a lot easier to move a bunch of bytes than it is to move a pile of paper or microfiche. And I haven’t even started to talk about what searching paper can be like! I have horror stories about driving across the state of Georgia to a warehouse and spending weeks digging through banker boxes.

When digital storage with relatively simple retrieval became cheap, most companies stopped getting rid of obsolete data. The idea was that you could do data mining and get information out of it. After all, you had paid to collect that data; you should get something back for the expense. However, now you have to worry about not just discovery in court cases, but also privacy requirements. You now have to worry about PII (personally identifiable information), GDPR (the General Data Protection Regulation), and whatever the next piece of legislation is.

Instead of putting your data through a shredder, as was done with the paper records, it can be anonymized in a legally acceptable way and sent over to a data warehouse for analysis. One approach is to scramble the unique identifiers, so the individual rows are still represented. Another approach is to aggregate the data in some way that is still granular enough to be useful for analysis, but so aggregated that you cannot find individuals even under a constructed identifier.
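
As a minimal sketch of the aggregation approach (the Claims source table, the Warehouse_Claim_Summary target, and their columns are invented for illustration), the anonymized extract can be little more than a GROUP BY that never carries the customer identifier across:

-- Copy only aggregated facts to the warehouse; no customer identifier survives the trip.
INSERT INTO Warehouse_Claim_Summary
       (diagnosis_code, claim_year, claim_count, total_amount)
SELECT diagnosis_code, claim_year,
       COUNT(*) AS claim_count,
       SUM(claim_amount) AS total_amount
  FROM (SELECT diagnosis_code,
               EXTRACT(YEAR FROM claim_date) AS claim_year,
               claim_amount
          FROM Claims) AS C
 GROUP BY diagnosis_code, claim_year;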

Data Quality at the Table and Schema Level

SQL has several advantages for data quality built into the language. The simple punch card and mag tape files used by COBOL depended on application programs to do any quality checks and to give meaning to the data. With SQL, much of this work can be put into the DDL. For example, decades ago a credit card company bought public records information and transcribed it onto punch cards. There was a small error in the punch card layouts. This resulted in the last column of a town name, which ended in the letter “D”, being punched one column over. Unfortunately, that was the encoding for “deceased” in the company files. The system began doing just what it was supposed to do: closing out the accounts of dead people. Whoops!

This sort of error would not happen in a correctly designed SQL database. First of all, data types and sizes in SQL are a property of each column, not dependent on the host program reading the data. The use of DEFAULT clauses, CHECK() constraints and REFERENCES clauses prevents certain low-level kinds of bad data from ever getting into the schema in the first place. It is my experience that these features are underused. Look at some of your own schemas: have you prevented zero or negative values in the integer columns in your tables? Have you added CHECK(x IN (..)) on columns where appropriate? You want to be able to get these restrictions and constraints from a data dictionary when you are designing the system.

Unfortunately, people very often knowingly put bad data into their tables and then try to scrub it in place. Another choice is to set up staging tables and clean the data before it gets into the database. If you’re lucky, you will have a good ETL tool that matches your problem. In the real world, you should probably expect to have to write code to clean the data. But first look for tools.
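
To make the DDL point concrete, here is a minimal sketch (the Orders and Customers tables and their columns are invented for illustration) of the kind of declarations that keep low-level bad data out:

CREATE TABLE Orders
(order_nbr    INTEGER NOT NULL PRIMARY KEY
              CHECK (order_nbr > 0),               -- no zero or negative keys
 customer_id  INTEGER NOT NULL
              REFERENCES Customers (customer_id),  -- only customers we know about
 order_status VARCHAR(10) DEFAULT 'pending' NOT NULL
              CHECK (order_status IN ('pending', 'shipped', 'cancelled')),
 order_qty    INTEGER NOT NULL
              CHECK (order_qty > 0));              -- no zero or negative quantities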

Some of these tools are third-party products designed for particular types of data. The most common ones deal with mailing addresses and standardized industry codes, along with regular-expression software available from trade groups or in the public domain.

Characteristics of Quality Data

At a higher level of abstraction, there are at least four characteristics that quality data must have.

Completeness

Is a concept missing? This error is going to be hard to correct. If you didn’t properly design your schema at the start, then you’ve got to go back and add the missing parts.

Are there missing values in a column or missing rows in a table? Values can be missing because nobody knows what they are or because nobody bothers to collect them. A classic example is when people do not fill out the repair history of capital equipment. Because the repair was quick and easy, part of a routine maintenance program, or minor for some other reason, it gets ignored.

Very often missing values are not skewed in one direction or the other, so the aggregate is close enough to be useful. In my equipment example, the mean time between failure (MTBF) might not be significantly affected. What might be important is the type of repair being done: was it part of planned routine maintenance? Was it a very unplanned disaster? Or was it within tolerance (those light bulbs were getting old, and they were expected to start burning out about now)?

The rate of missing values is also significant. Once the missing data rate gets too high, the data has no value because the sample is not valid. You can do imputation with statistical models, but any time you try to use statistical smoothing techniques, you’re smuggling in many assumptions. How do you know what the statistical distribution of your data is? How do you know that what’s in your database currently is representative of the whole population?

A quick check for whether you may have a problem is to look at how many columns in your database are NULLable and how many NULLs each of those columns holds. Lots of NULLs are not automatically proof of low quality, but they’re a pretty good code smell.
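
As a hedged sketch, the NULLable columns can be listed from the standard INFORMATION_SCHEMA views, and the NULLs in any suspect column can be counted directly (the Personnel table and middle_name column here are hypothetical):

-- Which columns allow NULLs at all?
SELECT table_name, column_name
  FROM INFORMATION_SCHEMA.COLUMNS
 WHERE is_nullable = 'YES'
 ORDER BY table_name, column_name;

-- How many NULLs does one suspect column actually hold?
SELECT COUNT(*) AS total_rows,
       COUNT(middle_name) AS known_values,
       COUNT(*) - COUNT(middle_name) AS missing_values
  FROM Personnel;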

Accuracy

How does the database compare with reality? Some things are easy to keep consistent in the database. Once you get a correct birthdate, it’s not going to change. However, you can expect about 8% of the street addresses to be in error at any given time. Nobody’s particularly evil, but many people are mobile. This is especially true if the population is younger. Even if someone doesn’t move in a given year, there can be ZIP Code changes or streets renamed.

Sticking with the example of street addresses, you’ll want to put in procedures for maintaining the accuracy of the data. The easiest way is to simply do a mailing and wait for the post office to return whatever was undeliverable or to send you change-of-address notices. However, you would probably prefer not to make your customers into your quality assurance group.

The bad news is that all too often, no one checks the accuracy of the data until the middle of an audit of some kind. There are sampling techniques which were developed for manufacturing applications. For example, during World War II, munitions had to be tested. Testing munitions is what’s called destructive testing; you cannot test 100% of your output. But what’s the right sample size? Sequential analysis is a technique that adjusts the sample size based on the results of each test. The more errors you find, the larger your sample gets; fewer errors mean smaller samples.

Consistency

Data is going to come from many different sources. Ideally, if multiple sources are reporting on the same reality, then they ought to be consistent with each other. Since I’m writing this piece just after the US midterm elections, my current favorite example is the number of counties here in Texas where the number of registered voters is significantly less than the number of ballots cast in the election.
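
Once both sources are loaded into tables, that kind of cross-check is one query. As a sketch (County_Registration and County_Ballots are hypothetical table names), flag every county where the two counts cannot both be right:

SELECT R.county_name,
       R.registered_voters,
       B.ballots_cast
  FROM County_Registration AS R
  JOIN County_Ballots AS B
    ON R.county_name = B.county_name
 WHERE B.ballots_cast > R.registered_voters;   -- impossible if both sources are accurate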

The different data sources don’t have to be external. Different parts of the same system can lack consistency. One common problem occurs when employees or customers are counted in several different categories in such a way that they are double counted. Another source of bad counts is work-in-process differences, in which the data simply gets lost in the shuffle.
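
A sketch of a double-counting check (Customer_Category is a hypothetical assignment table) is to look for anyone who appears under more than one category when the categories are supposed to be mutually exclusive:

SELECT customer_id,
       COUNT(DISTINCT category_code) AS category_count
  FROM Customer_Category
 GROUP BY customer_id
HAVING COUNT(DISTINCT category_code) > 1;   -- these people inflate the category totals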

Relevancy

Is the data useful for the task at hand? Imagine that you have a pretty typical inventory system. You’re pretty sure that your data is correct, current, and measured consistently. However, you have nothing in the database to tell you who supplied the parts; they are treated as a commodity. Then you get a recall notice from one of your suppliers telling you that one of their shipments is defective. The database is not going to help you track down the bad parts and remove them from inventory or do the recall. You are missing a critical attribute that’s relevant to making decisions.

Data Quality at the Environment Level

Another concern for data quality is the environment you work in. If you are in an environment that has legal restrictions, then data quality is defined as keeping your database in a way that will not put you in jail. Today, ROI no longer means return on investment; it stands for risk of incarceration. In the United States, two of the most important sets of regulations a company can face are defined by the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the Sarbanes-Oxley Act of 2002 (SOX). If you happen to work internationally, then you also run into the GDPR regulations from Europe. Frankly, a database administrator is probably not qualified to handle the legal aspects of all of these things. This is why companies hire lawyers, accountants, and lobbyists.

SOX leaves it up to the individual corporation to come up with internal control mechanisms for complying with the regulations. My recommendation is that most of us should not even try this. Look for a package that has been developed by a financial services company and make them criminally liable for any errors, omissions or illegal procedures.

For an introduction, look at a series of online articles by Robert Sheldon (see references). The articles provide an overview of HIPAA and SOX and explain how these regulations affect DBAs.

Conclusion

If you are old enough, you might have bought a television from Zenith Electronics (part of LG Electronics since 1999), which had the famous slogan “The quality goes in before the name goes on.” This company got it exactly right! Quality, specifically data quality, has to be good from the start. Unlike physical goods, data cannot be isolated, easily replaced, or repaired. Data permeates everything in the enterprise, so bad data affects operations, decisions, and projections. Fix problems before they become problems!

REFERENCES

Olson, J. E., “Data Quality: The Accuracy Dimension”, ISBN-10: 1-55860-891-5; 2003.

Redman, Thomas C., “Data Quality: The Field Guide”, ISBN-10: 1-55558-251-6; 2001.

Wald, Abraham, “Sequential Analysis”, ISBN-13: 978-0-486-61579-0; 1947.

Wang, R.Y., Ziad, M., Lee, Y.W., “Data Quality”, ISBN-10: 0-7923-7215-8; 2001.

Sheldon, R., “Introduction to HIPAA and SOX” (series of online articles).