Episode #3

Data Integrity: SQL Cascades, Transactions, DB Design, and Continuous Validation

In this third episode about data integrity, I talk about how to store your data such that computers have an easier time working with it (e.g. for filters), how to ensure that data stays valid by your own rules as you add more and more validations over the years, how to avoid the mistake of duplicating data across tables and then having to keep those records in sync, how to use database transactions to ensure that state changes to multiple rows either all occur or all fail as a unit (preventing invalid intermediate states from cropping up), and lastly how to use cascades to ensure the deletion of associated records that should not exist once their 'parent' records are deleted.

June 28, 2020

Show Notes

No notes available for this episode.

Screencast.txt

Transcribed by Rugo Obi

1. Constrain Data To A Limited Set Of Values To Allow For Filters

In many websites, the user interface is able to filter data.

For example, in the "Find A Law Tutor" area of my website, you can use the select boxes to filter by law level - for example, postgraduate or undergraduate - and by institution, that is, the particular school at which someone studied law.

This is made possible by the fact that each of these fields in the database, each of these columns, contains a limited number of values, and these values are cohesive in meaning.

This is something achieved through database constraints. To show you a situation where filters won't work, let me inspect a different part of my codebase, the subjects table.

You can see within this schema that it has a grade field which is a VARCHAR, and that there are no database constraints on it.

There are no Ruby-land constraints on it either. Let's inspect what the values are by grouping by grade and counting the number of entries in each of these groupings.

You can see at the top here that there are many different ways of denoting high grades. For example, an A, first, first-class, distinction, and so on.

This is no good for filtering. If I scroll all the way down, you'll see that there's a very long tail with many entries having only one occurrence in the database. There's a total of 265 possible ways to represent grades in my system.

This will be relatively annoying to fix at this point, but not impossible, since there are only 265 distinct values to clean up. However, it would have been better if I’d enforced some sort of database constraint or even Ruby-level constraints, to ensure that these fell into a narrow set of categories which would then enable me to better filter the data.

If I bring up the Subject model here in Rails, I can add a Rails-level validation on the grade column to ensure that it's included within this kind of enum, this set of values. It's useful to have a Rails validation, as well as the database validation.

That's because the Rails validation will give prettier error messages, instead of something like constraint failed in the database, which isn’t really something you can show to an end-user.
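To sketch what that pair of checks might look like - the grade categories here are hypothetical and would depend on how the existing 265 values get normalized:

class Subject < ApplicationRecord
  # Hypothetical normalized grade categories.
  GRADES = %w[first upper_second lower_second third pass distinction merit].freeze

  # Rails-level validation: produces a friendly error message for end-users.
  validates :grade, inclusion: { in: GRADES }, allow_nil: true
end

# A matching database-level check constraint (add_check_constraint is
# Rails 6.1+; on older versions you'd execute the equivalent ALTER TABLE).
class AddGradeCheckToSubjects < ActiveRecord::Migration[6.1]
  def change
    add_check_constraint :subjects,
      "grade IN ('first', 'upper_second', 'lower_second', 'third', 'pass', 'distinction', 'merit')",
      name: "subjects_grade_check"
  end
end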

2. Continuous Validation

When you start off a project or a feature, you tend to have fewer validations in your model because there are edge cases you haven't thought of yet -- or it's simply unfinished.

Here, for example, I just have three validations: the presence of each of these three fields. And then with time (I've copied and pasted the later version back in), other validations appear: a citation should be present and should be unique within the scope of a given year; there should be a slug; there should be a year as a separate field; there should be law_disciplines; and there should also be a custom Ruby-world validation, content_is_present.
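A rough sketch of what that fuller set of validations might look like in a model presumably called something like LawCase - the specific attribute names such as name and content are my guesses from the description:

class LawCase < ApplicationRecord
  # Hypothetical field names; the real model has its own set.
  validates :name, :citation, :year, presence: true
  validates :citation, uniqueness: { scope: :year }
  validates :slug, presence: true
  validates :law_disciplines, presence: true
  validate :content_is_present

  private

  # Custom Ruby-world check that has no easy database-constraint equivalent.
  def content_is_present
    errors.add(:base, "content must be present") if content.blank?
  end
end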

Now these are added after I’ve had some law cases in the database. However, these validations only get run when you save or create a law case.

Therefore, it's very possible that the old law cases in the database would not validate with the new validations, and this can lead to all sorts of confusing problems.

What makes these problems even worse is that the validation might fail at a completely different time from when someone is entering the data - for example, when the timestamp on a law case is touched in order to invalidate some cache. You wouldn't expect that to trigger a validation error; rather you’d expect a validation error to happen when someone is editing the record in the admin area.

This problem wouldn't happen with the equivalent validations expressed as database constraints, because a constraint must hold for every existing row before it can even be created.

However, at least one of these validations doesn't really make sense in the database-only world, since it’s a custom Ruby one: the check that content is present.

In order to tackle this problem, I created a sort of task that's run on a scheduler to validate all the data from the Ruby world.

The code you see here isn’t exactly what I have. What I have is more messy and complicated, but this is a clearer version that captures the essence of what's going on.

Basically, I cycle through all the classes, all the records in my database, and then run the valid? check on them and then output whichever ones have errors.
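Here is a simplified sketch of that kind of scheduled task (not my exact code, as mentioned above):

namespace :data_integrity do
  desc "Run every record through its ActiveRecord validations and report failures"
  task validate_all: :environment do
    Rails.application.eager_load! # make sure every model class is loaded

    ApplicationRecord.descendants.each do |model|
      next if model.abstract_class?

      model.find_each do |record|
        next if record.valid?

        puts "#{model.name}##{record.id}: #{record.errors.full_messages.join(', ')}"
      end
    end
  end
end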

This helps me ensure that my data stays valid according to my own rules going forward. It also makes me realize if I've added a validation too eagerly - something that I'm not able to actually fulfill given the current gaps in my data and whatnot.

3. Having One Source Of Data Truth (Even If Slightly Awkward To Use) Is Vastly Superior To Syncing Across Tables

The biggest regret I have in terms of database design for Oxbridge Notes was a structure that I thankfully no longer have. Essentially, this involved me syncing data from one object to another, i.e. duplicating data.

So the structure at the time was that a Seller had many notes_packs. A notes pack represented basically something you studied, like law or history.

Then each notes_pack had many subjects. A subject represented something like a sub-module - for example, contract law or constitutional law.

A subject could also be converted into a product, so there was a relationship between subject and product: a subject had one product.
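In Rails terms, that old structure looked roughly like this:

class Seller < ApplicationRecord
  has_many :notes_packs
end

class NotesPack < ApplicationRecord
  belongs_to :seller
  has_many :subjects
end

class Subject < ApplicationRecord
  belongs_to :notes_pack
  has_one :product # the duplicated data lived over on the product
end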

In order to build these products, I have the code you see here that sets the product's institution_level to the notes_pack's level, and its institution to the notes_pack's institution. Uh, that's weird - there's a mistake in this version of the code, in that product.discipline should probably be notes_pack.discipline. And then it sets product.state to subject.notes_pack.state, and so on.

Why did I do this? Well, I think I was just scared of join operations at the time, and I was worried about the complication of fetching the subject. This led to an ungodly number of confusions and bugs.

Basically, there were so many ways for the underlying subject to change, and some of them did not cause the related product to change in the expected way. Syncing data between two tables is really, really annoying and requires painstaking attention to detail.

I ended up coming up with this class, SynchronizedProductService, which would be called any time there was a change to a subject. You can see that the code updates product.grade, the exam_year, the description and so on, and then reassigns taxons according to the new information - all that kind of stuff.

This method then gets called within the Admin::NotesPackController which you're seeing here - in the update method in particular. But - and this is one of the bugs that caused me issues - it does not get called when the seller, i.e. the author, changes their notes_pack.

This was just an oversight: I guess I added the feature when I noticed the problem in the admin area, but failed to think about what happened when the original authors wanted to change things. This is no longer the case. I have now rewritten things such that a product delegates :name, :exam_year, :grade, :description - a bunch of stuff - to the subject.
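Roughly speaking, the current version looks something like this (assuming the product belongs to its subject, as in the structure above):

class Product < ApplicationRecord
  belongs_to :subject

  # These attributes are no longer stored on products; they are read
  # straight from the single source of truth, the subject.
  delegate :name, :exam_year, :grade, :description, to: :subject
end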

This means that the products table no longer contains this information, so my overall database has fewer columns - always a win. And on top of that, there is no longer any synchronization problem to deal with.

The cost of this is that whenever I look up a product, I should also remember to look up the associated subject - or subjects, as is the case here. Otherwise there's a risk of N+1 queries in the view whenever some field that's delegated to the subject is called.
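For example, when listing products, eager-loading the subject keeps the delegated reads to a single extra query:

# Without includes(:subject), each call to product.grade in the view
# would trigger a separate query to load that product's subject.
products = Product.includes(:subject)
products.each { |product| puts product.grade } # no N+1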

4. DB Transactions - Especially For Job Idempotency Under Retry

I assume most viewers are familiar with the normal use of database transactions: preventing state from becoming somehow inconsistent, where one part of an operation gets executed but another part does not -- for example, because of a crash.

The classic use-case is in banking, where I might withdraw 100 euro and then deposit it into Mary's account. But imagine there's a crash after I've withdrawn the money but before it arrives in Mary's account - then that money is lost forever, so to speak.

This of course can be avoided by wrapping the whole set of operations in a database transaction, such that, either they all execute, or none of them execute.
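In ActiveRecord terms, that looks something like this (the account objects and balance column are purely illustrative):

ActiveRecord::Base.transaction do
  my_account.update!(balance: my_account.balance - 100)
  marys_account.update!(balance: marys_account.balance + 100)
  # If either update! raises, both changes are rolled back together.
end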

Here I have some code in order to build up a product on my consumer-facing part of the website, based on some data that an author uploads. I've got it all wrapped in a transaction.

This is because a product has associated records - for example, taxons that get created based on the product's discipline name and the institution at which the author studied - and then some sort of hidden join records that assign those either existing or newly created taxons to the product.

It doesn't make sense, however, for some of this work to get done without the rest - so I want atomicity.

Moreover, this job, along with many other things in my codebase, happens asynchronously in background jobs. I want these background jobs to be idempotent in case there's some sort of transient error - for example, in extracting full text to Elasticsearch.

If I allowed some of these records to continue existing after a failure, I'd have to rewrite the code and replace any calls to create(x) with find_or_create(x).

Rather than do that in every single location where I'm potentially creating a record, it's easier for me to just wrap the whole lot in a transaction, such that everything gets rolled back unless everything gets properly created.
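A stripped-down sketch of that idea - the job and method names here are illustrative, not my actual code:

class BuildProductJob < ApplicationJob
  def perform(subject_id)
    subject = Subject.find(subject_id)

    # One transaction around the whole build: if any step raises (or the
    # process dies halfway), every record created so far is rolled back,
    # so a retried job never trips over half-built leftovers.
    ActiveRecord::Base.transaction do
      product = Product.create!(subject: subject)
      product.taxons.create!(name: subject.notes_pack.discipline)
      # ...assign institution taxons, extract full text, and so on.
    end
  end
end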

That means all this state-modifying code can be run again. I basically get idempotency for free.

5. Cascades For Ensuring Deletion Of Associated Records That Should Not Exist Independently

I also use database constraints to ensure that child records are removed, whenever their parent records are removed. The idea being that we get rid of any sort of records that don't make sense on their own.

For example, I have a pivot table here, products_taxons that connects product_ids to taxon_ids and enables the many-to-many connection.

Now let's say I wanted to delete some arbitrary taxon. I'm going to go for this one here, 148.

So I'm going to delete from taxons where id is equal to that, and we get this error: DETAIL: Key (id)=(148) is still referenced from table "products_taxons".

I don't really want this... this is just the default behavior of a foreign key constraint (NO ACTION, which behaves essentially like RESTRICT). But in this case, I'd rather it just deleted the corresponding entries in the products_taxons table, instead of preventing me from deleting the overarching taxon.

The way for me to do that is to add a foreign key constraint with ON DELETE CASCADE. What I'm gonna do first is drop the current constraint, because it is a foreign key constraint but not an ON DELETE CASCADE one. So let's have a look.

All right, I've dropped that one. Now I'm going to add another one - let's have a look at it. This is basically what I had before, and now I just add a bit more: ON DELETE CASCADE. And now the constraint has been added.
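For what it's worth, the same change can be expressed as a Rails migration rather than hand-written SQL in psql:

class CascadeDeleteProductsTaxonsOnTaxonDelete < ActiveRecord::Migration[6.0]
  def change
    # Swap the default foreign key for one that deletes the join rows
    # whenever their taxon is deleted.
    remove_foreign_key :products_taxons, :taxons
    add_foreign_key :products_taxons, :taxons, on_delete: :cascade
  end
end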

Let me go back and run that piece of code again, where I try to delete from taxons where id is 148. But before I do that, let me pull up the products_taxons table and show you that this particular taxon_id is in fact being used.

Here you can see that there are 24 rows referencing it. Now, I'm going to drop just the taxon: drop from taxons where id=148. What have I done wrong here? Oh yeah, sorry, it's always delete from. And okay, I've deleted from the taxons table - what's going on in the products_taxons table? None found. As you can see, the delete cascaded through and prevented me from having to think about this at any later stage.

Well, that's all I've got time to discuss today.

See you next time.