How to troubleshoot and fix PostgreSQL performance issues

PostgreSQL database performance is critical for application performance. EzzEddin Abdullah shows how to get information about a query’s performance from the execution plan.

PostgreSQL is a popular database that is used by many applications. It’s been in active development for 30 years! PostgreSQL (or Postgres) is a powerful database management system (DBMS) that you can use and worry less about performance.

Although Postgres has a powerful optimizer and can fix performance issues for you, sometimes you have to tweak your query to improve your query performance. You will find why in the next section.

Many developers tend to write the query first and then optimize it later. This is often not a good practice because it makes the query more complex and will make it harder to understand and thus difficult to optimize. This is why it’s better to write the query the right way from the start.

This article talks about specific techniques to improve query performance; however, the most important thing is to understand how a database engine processes the query, what a query planner is, and how it decides what execution plan to use.

In this tutorial, you will learn about optimization techniques when you imagine yourself as a database engine. Think from the perspective of a database engine, and then decide what your query should be optimized for.

It’s not an easy task to optimize a query, but if you understand the basics of query optimization, you can go far and make your queries performant.

Understanding the SQL nature

Why would two queries that give the same result drastically differ in execution time? It’s a difficult question.

The source of the problem is that SQL is a declarative language. That means you describe what the result of the query should be and not how the query should be executed.

The reverse is true for the imperative language. It means you describe how the result will be obtained through the sequence of the steps that will be executed.

Under the hood, the database optimizer chooses the best way to execute the query. How is best determined? Many factors come into play such as storage structures, indexes, and data statistics.

Setting up a Postgres database

First set up the environment used in this tutorial so that you can follow up on the performance issues that will be tackled.

I’m assuming that you have already installed Postgres on your machine. Begin by creating a new database.

Note: The shell commands run in this tutorial are used on a Linux/Mint system.

Now, the testdb database has been created.

Setting up Postgres tables

To be able to create tables, you need to be connected to the database through a SQL client or a command line tool like psql.

If you want to access the database from the command line, you can use the following command:

and to connect to the database, you can use:

The prompt postgres=# indicates that my machine is, by default, connected to the postgres database. Once that prompt is changed to testdb=#, you are connected to the testdb database.

This example creates two tables: genders and employees. Feel free to name them with your conventions. I just prefer the plural nouns for tables.

For the rest of this tutorial, I’ll use a SQL client like DBeaver, so don’t be surprised when you no longer see the prompt testdb=#.

Create the genders table:

Confirm with SELECT *:

Editor’s note: For simplicity of the examples, only two genders are listed, but, in practice, there will be more.

Create the employees table:

Now populate 5 million entries in the employees table:

Note that the first column gender_id in the previous query contains 1’s and 2’s and is evenly distributed across the 5,000,000 records. However, the hire_date column is distributed non-uniformly to be able to play with different distributions in this tutorial.

To see the differences in performance, look at execution time. If you’re using psql, turn on the execution time with \timing. The DBeaver SQL client calculates the execution time for you.

This example aggregates the genders table by how many females and males are employees.

This query takes around 10.7 seconds on my machine.

Tracking the execution time is the easiest way to diagnose a performance issue. Next, find out more ways to diagnose and fix performance issues.

Using EXPLAIN

One of the most important tools for debugging performance issues is the EXPLAIN command. It’s a great way to understand what Postgres is doing behind the scenes. The result would be the execution plan for the query. This plan is a tree of nodes that Postgres uses to execute the query.

Run this to see the plan for the previous query:

which outputs the following in my case:

 

When you explore each node of the previous execution plan, you’ll find different data access algorithms whether it is a sorting, grouping, hash aggregating, or hash joining. These algorithms are computed by the Postgres optimizer. The optimizer chooses the best way to execute the query. This decision depends on the data and the query itself.

But how to read such execution plan? Look at the bottom of that plan. You’ll see a sequential scan is done on genders table. This means that the optimizer has chosen a simple sequential scan on all rows of genders table.

This seq scan is followed by parentheses and includes some information: the cost, rows, and width. The cost has two numbers; the estimated startup cost (e.g., if you have a sort node, it would be the time to do the sorting), and the estimated total cost which assumes all available rows, related to this node, are retrieved.

The rows value defines the estimated number of rows output by this node.

The width indicates the estimated average width of rows output size (in bytes) by this plan node.

How fast the query is executed depends on the selectivity ratio. This ratio is the ratio of the number of retained rows to the total number of rows in the stored table.

It’s better to seek a low selectivity ratio to avoid the read operation cost.

The worst-case scenario is to have a selectivity ratio of 1. In this case, the optimizer will do a full scan of all the rows in the table. The engine consecutively reads all the rows in a table and checks the filter condition on each block.

Note: A block is the unit of storage that is used to transfer data from disk to memory. The block size is fixed, and it’s 8192 bytes in Postgres.

To understand how the optimizer makes different decisions depending on the selectivity ratio, filter a big chunk of the employees table by the following query:

The following execution plan is returned:

Note that this big chunk that is filtered out is around 4 million rows out of the total 5 million rows. That’s why there is a high selectivity ratio. Since the optimizer is smart, it has chosen the sequential scan algorithm.

However, the optimizer can choose a different execution plan for the same query if you just change the filter condition in the WHERE clause.

which results in the following execution plan:

This is because the selectivity ratio is low (100 rows out of 5 million rows returned), so the Postgres optimizer plans two workers to execute the query in parallel.

Note that you can look at a detailed execution time of the query by the EXPLAIN ANALYZE command. Just write it before the SELECT statement.

Using indexes

PostgreSQL (like any relational database) allows additional, redundant data structures to speed up the table access more than a sequential scan.

These data structures are called indexes.

Indexes are described as “redundant” because they do not store new information than the data is already stored in the table.

What exactly do indexes do? They provide data access paths to the table. They allow you to access the rows without having to read the entire table.

If there is a filtering condition on the table, then the index allows you to only read the blocks that match the condition. These are determined by the pointers. The underlying data structure is a heap. A heap is a tree-like data structure which means rows are stored in an unordered fashion.

The cost of the heap algorithm becomes higher than the full scan if the selectivity ratio is high.

For small values of selectivity, the cost is proportional to the number of blocks that match the condition.

The next section demonstrates how creating an index will speed up queries in different scenarios.

Unique indexes

A unique index supports any primary key or a unique constraint on the table.

A table can have a single primary key (though it’s not a requirement) and multiple unique constraints. Note that the definition of the genders table didn’t specify id as the primary key.

You can alter the genders table with a new primary key:

If, by mistake, you try to alter another primary key in the same table as the following query:

you will see the following error:

SQL Error [42P16]: ERROR: multiple primary keys for table “genders” are not allowed

What you can do instead is to create a unique constraint on the name column:

For the employees table, the primary key is done here:

and the index is created on the same pkey:

To utilize these features, create unique indexes on these constraints.

and another index on the name column:

It is also possible to create a unique index without formally creating a unique constraint. Just add the keyword unique to the index statement:

Note that if there are any duplicate values in the column, the index will not be created.

How about foreign keys? Alter the gender_id column in the employees table:

If searches by the foreign key gender_id become slow, you need to explicitly create an index on the column:

Originally, the query took 10.7 seconds on my machine:

Now, I get an execution plan that makes use of the index employees_gender_id executed at around 1.4 seconds:

Note that although there’s an index for the genders table, the optimizer decides on a sequential scan as indicated in the execution plan in this part:

This happens because the genders table is only two rows, so there is no need to create an index on that table because the optimizer is smart enough to choose a full scan algorithm.

Partial indexes

One of the best features of PostgreSQL is partial indexes. Partial indexes are indexes that only cover a subset of the table.

For the unlikely case that the hire date is on an early date like April 4, 2020, create a partial index on the hire_date column:

And here is the execution plan:

Note that before the index was created, the query took about 400ms to run on my machine.

Conclusion

You’ve have seen how to optimize queries in PostgreSQL and fix the performance issues that you’ve have been facing. This article covered how the optimizer works and how to create indexes, constraints, and partial indexes.