Temporal Databases Annotated Bibliography

2017-12-05

I’ve been reading about temporal databases for a few years now, so I think it’s time I share my bibliography and notes. This is presented in “narrative order”, so that you can get a sense of how the research has developed. This article somewhat overlaps a mini literature review I wrote on the Postgres hackers mailing list, but this article is more complete and in a place where I can keep it updated.

Temporal databases let you track the history of things over time: both the history of changes to the database (e.g. for auditing) and the history of the thing itself. They are not the same thing as time-series databases: whereas a time-series database has time-stamped events, a temporal database stores the history of things, typically by adding a start/end time to each row (so two timestamps, not one). With time-series the challenge is typically scale; with temporal the challenge is with complexity and correctness.

Research

Abdullah Tansel, James Clifford, Shashi Gadia, Sushil Jajodia, Arie Segev, and Richard T. Snodgrass (editors). Temporal Databases: Theory, Design, and Implementation. 1993. I only became aware of this recently. I own a copy now but haven’t read it. I’m curious to see the variety of ideas back then, before things got standardized. It’s amazing how many old ideas are just forgotten. Codd’s “Extending the data base relational model to capture more meaning” from 1979 is famous for introducing NULL and outer joins, but no one seems to remember the second half, with graph databases and queries that dynamically build themselves by querying the system catalog. So I bet this book has creative ideas that could inspire modern developments.

Snodgrass, Richard T. Developing Time-Oriented Database Applications in SQL. 1999. The seminal work on temporal databases and still the most useful introduction I know. Covers the “combinatorial explosion” of non-temporal/state-temporal/system-temporal/bi-temporal tables, current/sequenced/non-sequenced queries, SELECT/INSERT/UPDATE/DELETE, different RDBMS vendors, etc. Very similar to the proposed TSQL2 standard that was ultimately not accepted but still influenced Teradata’s temporal support. Available as a free PDF from his website.

Hugh Darwen and C. J. Date. “An Overview and Analysis of Proposals Based on the TSQL2 Approach.” Latest draft 2005, but originally written earlier. Criticizes the TSQL2 proposal’s use of “statement modifiers”, especially their problems with composability when a view/subquery/CTE/function returns a temporal result. Available as a PDF.

Ralph Kimball and Margy Ross. The Data Warehouse Toolkit. 3rd edition, 2013. (2nd edition 2002, 1st 1996.) (My notes are based on reading the 2nd edition, but I don’t think there are major relevant changes.) This book is not about temporal databases per se, but in Chapter 4 (and scattered around elsewhere) he talks about dealing with data that changes over time (“Slowly Changing Dimensions”). His first suggestion (Type 1) is to ignore the problem and overwrite old data with new. His Type 2 approach (make a new row) is better but loses the continuity between the old row and the new. Type 3 fixes that but supports only one change, not several. This writing is evidence for the need to handle temporal data, and the contortions that result from not having a systematic approach. (These criticisms and the realization that Kimball is trying to solve the same problems as temporal databases come from Johnston’s second book below; I’m glad he made the connection between Kimball and temporal databases!) (pdf)

C. J. Date, Hugh Darwen, Nikos Lorentzos. Time and Relational Theory, Second Edition: Temporal Databases in the Relational Model and SQL. 2nd edition, 2014. (First edition published in 2002.) I read this in March 2018. If you’ve read Date & Darwen’s other works you know they are quite rigorous, often critical of SQL’s compromises vs the pure relational model (e.g. NULL and non-distinct results), and not always very practical. Nonetheless this book is full of great ideas, and I hope anyone thinking about implementing temporal features in an RDBMS product will read it. They mostly deal with state-time, not transaction-time, which is great since that’s where the field most needs attention, although they also show how to apply their ideas to transaction-time tables too.

The core idea is that a temporal table should really have a row for each second (or millisecond or day or whatever is your finest granularity of time), saying what was true at that moment. Then the primary key is just the second plus the regular primary key. This idea isn’t new, but they use it to define everything else in the book. This makes thinking about temporal queries a lot easier, so that if you ever have any hesitation about something, you can think about it in terms of one-row-per-second and it starts to be easy.

Of course no database could adopt that in practice, so they offer operators to transform rows with start/end ranges to one-row-per-second and back again. In almost every case you don’t have to use those operators, but they are a solid theoretical basis for the next step: defining interval-aware primary keys, foreign keys, and all the relational operators. Finally they show how you implement things without “materializing” the expanded version of the data.

Another idea implicit in that is that temporal concerns belong in the SQL operators, not as “statement modifiers” like in TSQL2. Really that might be the most important idea in the whole book, so I’m glad to see it being applied to Postgres in the papers by Dignös et al (below).

They also argue (vs SQL:2011) that a single interval column type is better than separate start/end columns, which I agree with. Strangely they go against all the existing research by using closed/closed intervals (e.g. [Jan2017,Dec2017]) instead of closed/open ([Jan2017,Jan2018)), without really giving much justification. They also avoid using NULL for an unbounded side (e.g. [Jan2017,NULL)), preferring a magic “end of time” date ([Jan2017,Jan3000)). I wasn’t surprised by the second decision, given the authors’ history, but the first was less explicable. In both cases their approach sadly marginalizes their work and imposes barriers to adopting it in real SQL products.

I really appreciate how these authors have insisted that valid time (and the same with transaction time) should be a regular column, not something different. That was a big part of their complaint against TSQL2. To be honest when I first read Snodgrass the idea of using pseudo-columns seemed so suboptimal it was hard to take seriously (“Of course they don’t mean it.”), so I’m glad these authors have insisted on pointing out that approach’s shortcomings. Unfortunately in SQL:2011 valid-time is still a pseudo-column made up of two regular date/time columns. The problem is in composability: using a temporal query as a subquery, view, or function result: it all works cleanly if the interval is just another input to your interval-aware operators, but not if you need some kind of extra pseudo-column metadata. I hope implementers will take their advice seriously and not build temporal features on such a distorting idea.

Something I disagreed with was their suggestion to use tables in sixth-normal form—basically every column gets its own table—since attributes can have different lifespans. I can see how that is purer, but it seems like just too much complexity. They probably suspected the same because they always show how to do things with either that approach or tables in a more traditional third-normal form (or BCNF if you prefer). Even that is slightly distorted in order to avoid NULLs, but you can easily look past that.

Finally, I appreciated that on page 282 they mention DDL on temporal databases. Like everyone they say it’s beyond the scope of their current discussion, but it’s a penetrating insight to say, “the database catalog might itself need to be treated as a temporal database.”

SQL:2011 Draft standard. (pdf) Personally I find the standard pretty disappointing. It uses separate start/end columns instead of built-in range types, although range types offer benefits like exclusion constraints and convenient operators for things like “overlaps” that are verbose to code correctly by hand. It only mentions inner joins, not the various outer joins, semi-joins (EXISTS), anti-joins (NOT EXISTS), or aggregates. Many of its features apply only to system-time, not application-time, even though applicaion-time is the more interesting and less-available feature. (There are lots of auditing add-ons, but almost nothing for tracking the history of things.) The syntax seems too specific, lacking appropriate generality. A lot of these drawbacks seem motivated by a goal that goes back to TSQL2: to let people add temporal support to old tables without breaking any existing queries. That has always seemed to me like an unlikely possibility, and an unfortunate source of distortions. I don’t expect something for free, and I don’t mind doing work to migrate a table to a temporal format, as long as the result is good. Instead we get an (ostensible) one-time benefit for a prolonged compromise in functionality and ease-of-use.

Krishna Kulkarni and Jan-Eike Michels. “Temporal Features in SQL:2011”. SIGMOD Record, September 2012. Nice overview of the temporal features included in the SQL:2011 standard. Here is a PDF of the paper. See also these slides by Kulkani.

Peter Vanroose. “Temporal Data & Time Travel in PostgreSQL,” FOSDEM 2015. (Slides as a pdf) Lots of good detail here about SQL:2011. I’d love to see a recording of this talk if it’s available, but I haven’t found it yet.

Tom Johnston and Randall Weis. Managing Time in Relational Databases: How to Design, Update and Query Temporal Data. 2010. I finished this in May/June 2019. Although it concentrates on the authors’ proprietary middleware add-on to MS SQL Server called Asserted Versioning, it nonetheless has many ideas for building practical temporal features on any RDBMS. I liked their creativity and willingness to improve on SQL:2011 when they could. (Technically their book predates the final form of the standard.) Their middleware approach did sometimes lead them to behavior that would be strange in a more SQL-oriented solution. For example their temporal updates and deletes can only target a single record by ID (but they also mention they have a fix for this), and if that ID is not found you get an error. In ordinary SQL you would not get an error if an UPDATE or DELETE didn’t change any rows, and on the other hand you could hit more rows than one by not using the primary key. They mention that at the time of writing they were trying to replace their middleware approach with an implemention using views and instead-of triggers (something that has been done elsewhere too), and I expect that would wind up behaving more like ordinary SQL. They also introduce an interesting idea of future-dating records’ system-time (“assertion-time” in their lingo) so that the database will not make them “effective” until you’re ready. This has some nice practical use-cases, but there are pretty serious drawbacks. In their system a future-dated record is essentially immutable, so you had better not change your mind! Despite this book’s many contributions, I think everything here was also covered in Johnston’s second book, which is probably what I’d read if I had to choose one.

Tom Johnston. Bitemporal Data: Theory and Practice. 2014. I felt like I found a kindred soul when I read how he connects database design and ontology, as I’ve always thought of programming as “applied philosophy.” Databases as Aristotelian propositional logic is inseparable from the mathematical set-based theory. Johnston gives helpful distinctions between the physical rows in the table, the assertions they represent, and the things themselves. Eventually this leads to a grand vision of connecting every assertion’s bitemporal (or tritemporal) history to its speaker, somewhat like some ideas in the Semantic Web, although this doesn’t sound very practical. Like Date he seems to be landing on something like sixth-normal form, with a view-like presentation layer to bring all the attributes back together again. (Also like Date he seems to acknowledge 6NF may not be practical.) He points out how unsatisfactory Kimball’s suggestions are. He also criticizes the limitations of SQL:2011 and offers some amendments to make it more useful. Describes a (patented) idea of “episodes” to optimize certain temporal queries.

Anton Dignös, Michael H. Böhlen, and Johann Gamper. “Temporal Alignment”, SIGMOD ’12. Amazing! Shows how to define temporal versions of every relational operator by a combination of the traditional operators and just two simple transforms, which they call “align” and “split”. Gives a very readable exposition of the new theory and then describes how they patched Postgres 9.0 and benchmarked the performance. I think this solves the composability problems Date objected to in TSQL2, and unlike SQL:2011 it is general and comprehensive. The focus is on state-time, and I’m not sure how it will map onto bi-temporal, but even just having good state-time functionality would be tremendous. And the paper is only 12 easy-to-read pages! (pdf)

Anton Dignös, Michael Hanspeter Böhlen, Johann Gamper, and Christian S. Jensen. “Extending the kernal of a relational DBMS with comprehensive support for sequenced temporal queries,” ACM Transactions on Database Systems, 41(4):1-46. Continues the previous paper but adds support for scaling the inputs to aggregate groups according to how much of their time period goes into each group. Gives more benchmarks against a patched Postgres 9.5. (pdf) These researchers are now trying to contribute their work to the Postgres core project, of which I am very much in favor. :-)

Tools

Finally here are some tools for temporal support in Postgres. The sad theme is that pretty much everything gives audit support but not history support:

Postgres

The most complete implementation of bi-temporal tables is Vik Fearing’s periods extension. Since Postgres uses a bison grammar, an extension can’t add new syntax, but this one does a great job working around that limitation to provide almost everything from the SQL:2011 standard.

Henrietta (Hettie) Dombrovskaya has a pg_temporal repository with talks, slides, and code. I haven’t tried building it but I believe her company is using this in production. She is an awesome speaker, so her talks are great intros to this topic too.

The pgaudit extension looks pretty useful but I haven’t tried it yet. According to the AWS docs you can even use this on RDS.

Vlad Arkhipov’s temporal tables extension only supports system-time (auditing). Also on Github and a nice writeup by Clark Dave.

Magnus Hagander presented an approach to capturing system-time history in a separate schema at PGConf US 2015 and PGDay’15 Russia. Here are slides and video. Quite elegant if you want to ask questions like “what did we think as of time t?” If I recall correctly this is similar to one of the ideas proposed at the end of Snodgrass, although I haven’t compared them carefully. Hagander points out that DDL changes against temporal databases are challenging and hopefully infrequent. This is a topic that is almost completely absent from the literature, except for a brief mention in Johnston 2014.

I’ve heard about a Julia project for temporal data in Postgres. I’m not sure how it works, but it looks interesting!

Ruby

Chronomodel extends the ActiveRecord ORM to record system-time history. The pg_audit_log gem works fine but like many audit solutions is rather write-only. I wouldn’t want to build any functionality that has to query its tables to reconstruct history. You could also try paper_trail or audited (formerly acts_as_audited). Of these projects only Chronomodel seems to be aware of temporal database research.

Further Research

Temporal databases are exciting because there is still so much to do. For example:

  • What should the UI look like? Even one dimension adds a lot of complexity, let alone bi-temporal. How do you present this to users? As usual an audit history is easier, and it’s possible to find existing examples, whereas a state-time history is more challenging but probably more valuable. How should we let people view and edit the history of something? How does it work if there is a “Save” button vs save-as-you-type?

  • What does “full stack” temporal support look like? Do we extend REST? What would be a nice ORM interface? Should we use triggers to hide the temporal behavior behind regular-looking SQL? Or maybe extend SQL so you can more explicitly say what you want to do?

  • SELECT support for “as of” semantics or “over time” semantics.

  • Temporal foreign keys. I have a POC Postgres extension and am making progress adding it to core.

  • DDL changes. For example if you want to add a NOT NULL column, what do you do with the old data? Could there be built-in support to apply constraints only to a given time span?

  • Future time. There are some cool ideas about this in Johnson’s books, and I’ve heard of fintech folks doing something here.

  • Temporal upsert. This would be an INSERT that falls back to FOR PORTION OF for spans that already exist. Quite often this is what an application really wants. Johnson talks about how to stanard should have included something here.

  • Outer joins and aggregates. Dignös et al show how to do this, but it’s not in the standard. I’d love to get their old patch working again and try to get something similar into Postgres, once the basics are finished.

blog comments powered by Disqus Prev: Counting Topologically Distinct Directed Acyclic Graphs with Marshmallows Next: Postgres isn't running the archive_command on my standby