Citus Tips: How to undistribute a distributed Postgres table

This post by Halil Ozan Akgul about Citus was originally published on the Azure Database for PostgreSQL Blog on Microsoft TechCommunity.

Update in October 2022: Citus has a new home on Azure! The Citus database is now available as a managed service in the cloud as Azure Cosmos DB for PostgreSQL. Azure documentation links have been updated throughout the post, to point to the new Azure docs.

Once you start using the Citus extension to distribute your Postgres database, you may never want to go back. But what if you just want to experiment with Citus and want to have the comfort of knowing you can go back? Well, as of Citus 9.5, now there is a new undistribute_table() function to make it easy for you to, well, to revert a distributed table back to being a regular Postgres table.

If you are familiar with Citus, you know that Citus is an open source extension to Postgres that distributes your data (and queries) to multiple machines in a cluster—thereby parallelizing your workload and scaling your Postgres database horizontally. When you start using Citus—whether you’re using Citus open source or whether you’re using Citus as part of a managed service in the cloud—usually the first thing you need to do is distribute your Postgres tables across the cluster.

What is undistribute_table()?

To distribute your Postgres tables with the create_distributed_table() function of Citus, you first need to make some decisions, such as: which column to choose as the distribution column, how many shards you need, and which Postgres tables you need to distribute.

If you just want to try different settings and go back when you want to, you're now in luck. Our Citus team introduced the undistribute_table() function in the Citus 9.5 release—enabling you to turn distributed Citus tables back into regular Postgres tables.

If you are one of the Citus users who has asked for the ability to undistribute your Citus tables—like in the request below from Matt Watson of Stackify—we hope this new feature will help you.

Also, is there a way to convert a distributed table to not being distributed? I could then change it back to distributed and fix my colocate… without having to drop the table.

The new undistribute_table() function will:

return all the data of a distributed table from the Citus worker nodes back to the Citus coordinator node,
remove all the shards of the distributed table from the Citus workers,
make the previously distributed table a local Postgres table on the Citus coordinator node

Here is the simplest code example of going distributed with Citus and coming back:

-- First distribute your table
SELECT create_distributed_table ('my_table', 'id');
-- Now your table has shards on the worker nodes and any data that was in the table is distributed to those shards.

-- To go back to local, just call the undistribute_table function with your table as parameter
SELECT undistribute_table('my_table');
-- Now your table is only on the coordinator node just like before you distributed.

Undistributing a Citus table is as simple as the one line of SQL code in the code block above.

Note that when you distribute a Postgres table with Citus you need to pass the distribution column into the create_distributed_table() function—but when undistributing, the only parameter you need to pass into the undistribute_table() function is the table name itself.

After undistributing, the distribution column becomes a regular column. If in the future, you want to distribute your Postgres table again, you can just pick another distribution column (or use the same one).

In the past, before we introduced the undistribute_table() function in Citus 9.5, if you wanted to turn a distributed table back into a local table, you would have had to create a new Postgres table on your coordinator node. Then, you would have needed to move all the data from the distributed table to this new local table. However, Citus did not have an easy way to move data from distributed Citus tables to local Postgres tables so you would have had to do some workarounds. Let me explain:

The usefulness of INSERT INTO local SELECT .. FROM distributed

To undistribute a table, distributed data needs to be moved back to the Citus coordinator from all the shards in the cluster. But prior to the Citus 9.4 release, Citus did not support queries that SELECT from distributed tables and INSERT into local tables. So, there was a need to implement support for:

INSERT INTO local_table SELECT * FROM distributed_table;

In fact, the INSERT INTO local SELECT .. FROM distributed feature was introduced in Citus 9.4 to make the undistribute_table() function possible.

Other than being necessary for undistributing tables, inserting distributed data into local tables has some more beneficial use cases.

Rollup Tables

A rollup table in Postgres is a table that you pre-aggregate your data into. Before we introduced INSERT INTO local SELECT .. FROM distributed in Citus 9.4, you could still have rollup tables. (And many of you did!) But your rollup tables had to be distributed tables, which may not have been the best option in every case. Especially if your rollup table was a very small table.

Let me give you an example.

Let's say you have a distributed table and a graph that shows some daily statistics of the data on that table. Instead of calculating the statistics from scratch every time you open the graph, you can now create a local Postgres table on the Citus coordinator that you will rollup into. Every night, you can calculate the statistics value for the day and insert the result of the calculations to the rollup table. When you open the graph, the data will be readily available.

-- Every midnight
INSERT INTO rollup_table SELECT your_analysis_function(statistics_column) FROM distributed_table WHERE date = CURRENT_DATE;

-- When you need the graph
SELECT * FROM rollup_table;

ETL in the Database

ETL (Extract, Transform, Load) is the process of gathering data from a data source, transforming the data into a more meaningful form, and then storing the transformed data. Imagine running an online store, and imagine you have a distributed table for customer data and another distributed table for purchases the customers made. What if you need to find the best 100 customers and send them e-mails about a special discount for the top customers?

With the new INSERT INTO local SELECT .. FROM distributed feature and the ETL logic, you can create a local Postgres table for your best customers.

-- Create the table for the top customers
CREATE TEMP TABLE top_customers (customer_id bigint primary key, email text, total_purchase money);

-- Find the best customers and put their data into the top_customers table
INSERT INTO top_customers
SELECT customer_id, email, total_purchase
FROM customers JOIN
(
  SELECT sum(amount) AS total_purchase, customer_id
  FROM purchases
  GROUP BY customer_id
) total_purchases ON customers.id = total_purchases.customer_id
ORDER BY total_purchase DESC
LIMIT 100;

-- Load the top customer IDs back into the distributed table
UPDATE customers SET is_top_customer = true WHERE id IN (SELECT customer_id FROM top_customers);

Increased support for INSERT SELECT in Citus

As of Citus 9.4 any INSERT SELECT command works!

The logic for INSERT INTO local SELECT .. FROM distributed queries is quite similar to the logic for SELECT .. FROM distributed. When you just want to get the distributed data with SELECT, Citus will:

gather data from the Citus distributed worker nodes
combine the data, on the Citus coordinator node
return the combined data back to you

If you want to INSERT INTO local SELECT .. FROM distributed, Citus does all the steps the same way, except for the last one. In the last step, instead of returning the combined data to you, Citus inserts the data to the local Postgres table on the Citus coordinator node.

After all the engineering effort, it would be selfish to keep the INSERT INTO local SELECT .. FROM distributed feature just for internal use. So, we added support for the feature in Citus 9.4.

What does the Citus undistribute_table() function do, under the hood?

So as of Citus 9.4, with help from the new INSERT INTO local SELECT .. FROM distributed feature, you could undistribute your tables manually, if you needed to revert. To undistribute Citus tables manually, you used to have to:

Create a new Postgres table
Insert Select everything from your old, distributed table into the new Postgres table
Drop the old table and rename the new table.

That might seem easy enough, but that's not all. Some of the things you might have also had to deal with:

keep the Postgres indexes you had on the old distributed Citus table in mind…
create partitions again, if your table was a partitioned table…
deal with the fact that while dropping your table, you also dropped any views that depended on your distributed table—and the views that depended on those views, too.

The good news is that as of Citus 9.5 or later—you can now use the new undistribute_table() function and let Citus seamlessly handle everything. Specifically, when you use the undistribute_table() function, Citus automatically:

creates the indexes you had for the distributed table,
handles sequences owned by the table so they continue from where you left off,
recursively finds the views that directly or indirectly depend on your table and moves them to the new Postgres table,
preserves constraints, and the table owner,
if your table was a partitioned table, does all these steps for the partitions,
and more…

Bottom line: undistribute_table() makes it easier to experiment with Citus distributed tables

Hopefully it's interesting to know a bit more about why our Citus team introduced the INSERT INTO local SELECT .. FROM distributed feature in Citus 9.4—and the undistribute_table() function in Citus 9.5.

The most important thing to know is that distributing a Postgres table with Citus is not a one-way street. It's easy to go back and to undistribute a Citus table. So if you want to get started with Citus, it's now easier to experiment—as long as you're running Citus 9.5 or later. After downloading the Citus open source packages—or provisioning a Hyperscale (Citus) server group on Azure—you can distribute your tables or make your tables reference tables and then undistribute back to local Postgres tables—and find what data model works best for you and your application. And if you change your mind later, you can just undistribute again.

Written by Halil Ozan Akgul

Computer engineer, Citus developer, Postgres lover, piano player, plant grower—all while working at Microsoft.