A glance at pg_cron to automatically schedule database tasks

pg_cron is an interesting PostgreSQL extension by Citus Data: it does include a background worker (i.e., a PostgreSQL managed process) to execute database tasks on the server side. This is something I’ve done for years, I mean managing automated tasks using operating system wide cron(1) and schedulers alike, but having the scheduler within the database sounds really cool, since I can keep it tied to the data itself.

An example scenario

I’ve one server pulling data regularly out of another server, via a foreign data wrapper. No matter how this design choice sounds to you, it works for me!
In order to constantly pull data, I have set up a cron(1) task in my user crontab to execute a function that does all the business logic I need. Therefore my crontab file looks like:
$ crontab -l
10,20,30,40,50,0 * * * * \
 /usr/bin/psql -U postgres -h 127.0.0.1 \
 -c "SELECT f_pull('crontab import');" mydb
So I’m executing the function f_pull on database mydb specifying a label crontab import. Let’s see how this can be done using pg_cron too.

Installing pg_cron

While there are some packages for major Linux distributions, I find it quite easily to install it from the official repository with the following short commands:
$ git clone https://github.com/citusdata/pg_cron.git
$ cd pg_cron
$ export PATH=/usr/pgsql-11/bin:$PATH
$ make
$ sudo PATH=$PATH make install
and in the case it does matter, I’m using a CentOS 7 Linux here. Now, in order to make pg_cron working it must be loaded as a shared library, so you have to adjust the PostgreSQL configuration (usually ~postgresql.conf~) as follows:
shared_preload_libraries = 'pg_cron' 
cron.database_name = 'mydb'
Here I use mydb as the database on which store the pg_cron data. In fact, pg_cron will create a cron schema with a table job in there that will do the same as your crontab file on any Unix machine. Unluckily, you need to restart the PostgreSQL cluster in order to apply changes.
$ sudo service postgresql-11 restart
It is now time to decide who will execute the cron jobs, and in my case it is the postgres superuser. This could be not the optimal choice, so choose the user that fits the need for you. In my case I was already using cron(1) with postgres user, so it sounded to me the right and faster way to migrate from regular cron to pg_cron. Why does it matter choosing the user in advance? Because pg_cron requires such user to be able to connect to the database without providing any password, so you either should adjust the pg_hba.conf properly or add a .pgpass in the home the user. Yes, even if a background worker is used to implement the pg_cron features, the connection happens thru libpq, so the need for the user to be granted to connect withou providing a password. Therefore I changed the pg_hba.conf as follows:
host    all          all        127.0.0.1/32     md5
host    mydb    postgres   localhost        trust
adding the specific line for postgres (line ordering does not matter) and leaving all other connections requiring a password. Now, you can issue a reload and test your user connectivity. Once this is done, you can configure pg_cron.

Configuring a job

Configuring pg_cron is really simple: all jobs are kept in the cron.job table and you can either edit such table with standard SQL or use the cron.schedule() function to get an initial entry to work later on. Since I was migrating a cron(1) entry, things were as simple as copy and paste the cron(1) entry line with dollar quoting:
mydb=# SELECT cron.schedule('5,15,25,35,45,55 * * * *',
  $CRON$ SELECT f_pull('pg_cron import'); $CRON$
  );
pg_cron replies with the identifier of the job, in my case 1 because it is the very first job inserted in the scheduler. I can inspect it with an ordinary SELECT against the cron.job table.
mydb=# SELECT * FROM cron.job;
-[ RECORD 1 ]------------------------------------------------------------
jobid    | 1
schedule | 5,15,25,35,45,55 * * * *
command  | SELECT public.f_pull('pg_cron import');
nodename | localhost
nodeport | 5432
database | mydb
username | postgres
active   | t
All the fields are perfectly understandable, and please note that the schedule field reports the string in the exact same format of cron(1); this is due to the fact the pg_cron uses the very same parser as cron(1), making migration from cron(1) to pg_cron really easy. By feault cron.schedule() uses the current PostgreSQL instance parameters and the current username, but you can than adjust them to something else. While I haven’t tested it, this means you could execute cron task from one PostgreSQL into a remote one.

And that’s all!
Now you can sit down and check your cron jobs.

pg_cron logging

Things never works the first time! In the case you need inspection, consider that pg_cron logs at the LOG level and provides a statement for job begin and end. A succesfully executed job prints log statements as
LOG:  cron job 1 starting: SELECT public.f_pull('pg_cron import');
LOG:  cron job 1 completed: 1 row
while a failing job prints lines as
LOG:  cron job 1 starting: SELECT public.f_pull('pg_cron import');
LOG:  cron job 1 connection failed
Often problems arise due to connection permissions or grants, so double check your cron user is really able to do what you are exepcting to do. In my case it was really simple because I was already using a cron(1) job, so the user was already granted to do its job.

Conclusions

pg_cron is an awesome tool to keep in your toolbag because it makes really easy to migrate from cron(1) to pg_cron (and back!). Moreover, being an extension, it makes all schedule configuration available within the database, and since cron.job is added to the backup from the extension installation instruction, this means you will get scheduler backups for free!

The article A glance at pg_cron to automatically schedule database tasks has been posted by Luca Ferrari on May 21, 2019