In the modern world all software interacts with data. That might be a simple user-provided input, any information fetched live from the outside world, local filesystem, etc. But by far the most common source of data is the database. When the operations become more complicated, and there is a risk of losing the consistency between different parts of it, we start using transactions. This is the first (of hopefully a few) post describing what are the transactions and what is a "isolation level", and how does it affect our interaction with the database.

Transaction

In the simplest terms, a transaction is a collection of database operations grouped in a way that they cannot be executed separately. In order for them to make an actual effect, we need to confirm them (after the last one), at what point all of them are executed. If, for some reason one of the operations fail, we can cancel all of them, or leave it to the database engine to do that automatically for us.

In order to work with transactions we need to use three SQL keywords:

  • BEGIN starts a transaction,
  • COMMIT confirms that all the operations during the transaction should be executed,
  • ROLLBACK cancels all the operations of the transaction (so that no changes are being made).

Each transaction can have a specific isolation level which describes how it should treat everything that is happening outside (in other operations) during the time between BEGIN and COMMIT (or ROLLBACK). The SQL standard provides the database engines architects with a pretty direct set of requirements and informations what are the available levels and what could happen when using each of them.

SQL Standards

The standard mentions two main concepts: a phenomena and a level. The former describes a quirky behaviour that might take place when choosing a particular isolation level. The level contains a set of the phenomena and it's name roughly explains what behaviour can be expected.

There are four phenomenas that we need to be aware of:

  • dirty read - the operations are read from a so-called transaction log of the database (meaning that the operations have not been yet stored to the disk or cache) and if the operation is not stored for some reason, the results might not be accurate for other read attempts,
  • nonrepeatable read - the situation when two identical read attempts result with the same number of rows, but the values inside is different
  • phantom read - the situation when two identical read attempts return different number of rows
  • serialization anomaly - the situation when committing multiple operations is inconsistent. To be honest I'm not sure what does this mean, so we won't touch on this. If you know more, please share in the comments.

The levels define what phenomena can occur, but it does not mean they have to. In other words, the given isolation level can be no worse than what the standard dictates, but the database engine might make the particular level a bit more bulletproof.

The available levels are defined as follows:

  • READ UNCOMMITTED - can have all four phenomena,
  • READ COMMITTED - can have nonrepeatable read, phantom read and serialization anomaly
  • REPEATABLE READ - can have phantom read and serialization anomaly
  • SERIALIZABLE - no phenomena allowed

Lookahead

In the next posts I will try to present how do those levels work in the real database engines like PostgreSQL and MySQL, I will also show the subtle differences between these two. Spoiler alert, I like the solutions from PostgreSQL much more :)