Cancel a hanging PostgreSQL query

Posted on 2022-05-24 by Laurenz Albe

cancel a statement because it is sexist — © Laurenz Albe 2022

Sometimes a PostgreSQL query takes forever. Usually, it is easy to interrupt (cancel) such a statement, but sometimes it can turn into a problem, and a query cannot be canceled. This article discusses what might be the cause. I’ll also show you a trick how to solve the problem (not for the faint of heart!)

How to cancel a running query

The PostgreSQL protocol has provisions for interrupting a running statement. This is done by opening a new connection and sending a CancelRequest message with a secret key. That secret key was sent by the server during the start of the original connection. Without that key, everybody could cancel your query, which would be an inacceptable security problem.

The C library libpq provides the functions PQgetCancel() and PQcancel() to cancel queries, and other database APIs should have similar provisions. In an interactive psql session, you can simply hit Ctrl+C to send a cancel request, and GUI clients usually have a button for that purpose.

But it is also possible to cancel somebody else’s query by calling the database function pg_cancel_backend(). Another option is pg_terminate_backend(), which goes one step further and terminates someone else’s database session. To be able to use these functions, you must either be a superuser or a member of the default role pg_signal_backend (the next section will explain why), or you must have connected to the database as the same database user as the session you want to torpedo (you are allowed to cancel your own statements).

How the server responds to a cancel request

Inter-process communication in PostgreSQL relies a lot on signals.

When the postmaster process receives a CancelRequest, it sends the signal SIGINT to the backend process of the corresponding database session. This is also what the function pg_cancel_backend() does. pg_terminate_backend() sends the signal SIGTERM.

Now each PostgreSQL process has a signal handler that processes these signals when they are received. This signal handler does not immediately interrupt the backend process, but it sets global variables for the process. SIGINT will set QueryCancelPending and SIGTERM will set ProcDiePending. These variables act as flags, and it is the responsibility of the backend process to react to them as soon as is convenient. This ensures that no process is interrupted at an inconvenient time, when it would for example leave shared memory in an inconsistent state.

Calls to the CHECK_FOR_INTERRUPTS() macro, which invoke the ProcessInterrupts() function, are sprinkled all over the PostgreSQL code at safe places. This function will then throw the error that cancels the current statement or it will terminate the backend process, depending on which flag was set.

Reasons why canceling a query may not work

There are several possible causes:

Execution is stuck in a loop that does not contain CHECK_FOR_INTERRUPTS(). That would be a PostgreSQL bug, and the fix would be to add another call to the macro.
Execution is stuck in a third-party C function that was called in an SQL statement. In this case, you should report it as a bug to the author of the function.
Execution is stuck in a system call that cannot be interrupted. That would indicate a problem on the operating system or hardware level. Note that the delivery of signals is held off while a process is in kernel space.

Don’t use `kill -9` unless you are desperate

It is perfectly fine to use plain kill on a PostgreSQL backend process. That will send the SIGTERM signal, so it is the same as calling pg_terminate_backend() for that backend. If that has no effect, it is tempting to use kill -9, which will send SIGKILL. That signal cannot be caught and immediately terminates the process. The problem is that the postmaster detects if one of its child processes did not shut down cleanly. It will then kill all other PostgreSQL processes and undergo crash recovery, which causes an outage of the whole database that can take seconds to minutes.

Note that while using kill -9 on a backend causes a short down time, kill -9 on the postmaster process itself has even worse effects and should be avoided at any cost. It opens a time window during which a new postmaster could be started while some of the old postmaster’s children are still alive, which is likely to lead to corruption of the data on disk. Never, ever, kill the postmaster process with kill -9!

Sometimes even kill -9 won’t be able to kill a PostgreSQL backend. That means that the backend is stuck in an uninterruptible system call, for exampling performing I/O on network attached storage that is no longer available. If that condition persists, the only way to get rid of the process is a reboot of the operating system.

A trick to cancel a stuck query without crashing the server

Sometimes you can avoid crash recovery and an outage by proceeding as follows. This example uses the GNU debugger on Linux; you’ll have to adapt it to other environments as necessary.

An example of a hanging function

We write this simple C function (source file loop.c):

#include "postgres.h"
#include "fmgr.h"

#include <unistd.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(loop);

Datum loop(PG_FUNCTION_ARGS)
{
	/* an endless loop */
	while(1)
		sleep(2);
}

Build the shared library like this (change the include path as appropriate):

gcc -I /usr/pgsql-14/include/server \
	-fPIC -shared -o loop.so loop.c

and copy the file into the PostgreSQL shared library directory (which you can get with “pg_config --libdir”).

Define and call the function

Define the function in SQL as superuser:

CREATE FUNCTION loop() RETURNS void
   LANGUAGE c AS 'loop';

Then, as an arbitrary user, call the function:

SELECT loop();

Execution will hang. You can try to cancel the query, but it will keep running.

Identify the hanging backend process and send it a termination signal

Open another database connection with the same database user. Then find out the process ID of the background process, which identifies the database session:

SELECT pid, query
FROM pg_stat_activity
WHERE query LIKE '%loop%';

Once you know the process ID, send the process a SIGTERM:

SELECT pg_terminate_backend(12345);

Of course, the argument should be the process ID obtained in the previous step. The function returns TRUE, since the signal was sent, but the query continues to execute.

Attach with the debugger

Make sure that the GNU debugger gdb is installed. To get a readable stack trace, you should have the debugging symbols for the PostgreSQL server installed, although that is not necessary for the trick I will show you. Log into the database server machine as the PostgreSQL user, which is typically called postgres. Call gdb like this (using the proper path to the postgres executable and the proper process ID):

gdb /usr/pgsql-14/bin/postgres 12345

As soon as you get the prompt “(gdb)”, you generate a stack trace with the command “bt”. That will look similar to the following:

#0  __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, 
    req=req@entry=0x7ffdaf61cde0, rem=rem@entry=0x7ffdaf61cde0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:71
#1  0x00007f113d864897 in __GI___nanosleep (req=req@entry=0x7ffdaf61cde0, 
    rem=rem@entry=0x7ffdaf61cde0) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007f113d8647ce in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007f113e623139 in loop () from /usr/pgsql-14/lib/loop.so
#4  0x00000000006d71fb in ExecInterpExpr (state=0x13837b8, econtext=0x13834e0, 
    isnull=<optimized out>) at executor/execExprInterp.c:1260
#5  0x000000000070e391 in ExecEvalExprSwitchContext (isNull=0x7ffdaf61ced7, 
    econtext=0x13834e0, state=0x13837b8)
    at executor/../../../src/include/executor/executor.h:339
#6  ExecProject (projInfo=0x13837b0)
    at executor/../../../src/include/executor/executor.h:373
#7  ExecResult (pstate=<optimized out>) at executor/nodeResult.c:136
#8  0x00000000006da8b2 in ExecProcNode (node=0x13833d0)
    at executor/../../../src/include/executor/executor.h:257
#9  ExecutePlan (execute_once=<optimized out>, dest=0x137f4c0, direction=<optimized out>, 
    numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, 
    use_parallel_mode=<optimized out>, planstate=0x13833d0, estate=0x13831a8)
    at executor/execMain.c:1551
[...]

The stack trace is a valuable help for determining where the problem is. Include it if you report a bug to PostgreSQL!

If you don’t want to perform the next step, you can chicken out and enter “detach” to detach the debugger from the process and allow it to continue.

Cancel execution by making the hanging backend exit cleanly

The above stack trace shows that execution currently is not inside PostgreSQL code, but in a custom function (in loop () from /usr/pgsql-14/lib/loop.so). That means that it is pretty safe to let the process exit. If execution is somewhere inside the PostgreSQL server, there is a small risk that PostgreSQL may be in the middle of modifying shared state, holding a spinlock or something similar. If you know the PostgreSQL source, a look at the call stack will help you to assess that risk. Now, if you dare, call ProcessInterrupts(), which will cause the process to exit, since ProcDiePending is set:

(gdb) print ProcessInterrupts()
[Inferior 1 (process 12345) exited with code 01]
The program being debugged exited while in a function called from GDB.
Evaluation of the expression containing the function
(ProcessInterrupts) will be abandoned.
(gdb) quit

Fix the function to allow the user to cancel execution

To improve the situation, the function code should be modified like this:

#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"

#include <unistd.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(loop);

Datum loop(PG_FUNCTION_ARGS)
{
	/* an endless loop */
	while(1)
	{
		CHECK_FOR_INTERRUPTS();
		sleep(2);
	}
}

The above changes will cause the function to check for interrupts every two seconds, so that execution can be canceled safely.

Conclusion

Canceling queries works by sending the backend a SIGINT signal. If neither that nor sending SIGTERM can interrupt a backend, you can attach to the hanging backend with gdb and call ProcessInterrupts() directly to make it exit.

If you found this article interesting, perhaps you will also like my article about how to use TCP keepalive to keep abandoned queries from running forever, and to keep idle sessions from closing.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

How to cancel a hanging PostgreSQL query