In a previous post, I mentioned that Postgres databases often have text indexes sorted linguistically rather than bytewise, which is why they need to be reindexed on libc or ICU upgrades. In this post, let’s discuss how to use bytewise sorts, and the upsides and downsides of doing so.

Sorting strings in binary means comparing the bytes inside the strings, without caring at all about what characters they represent. For instance, in a UTF-8 database, when considering the strings Beta and alpha:

  • a bytewise comparison says that 'Beta' < 'alpha', since the code point of the upper-case letter B is 0x42 and the code point of the lower-case letter a is 0x61.
  • a linguistic comparison says that 'alpha' < 'Beta' because it understands that the letter a comes before B even when cases are mixed. More generally, linguistic collations have sorting rules for accents, punctuation, and symbols, plus potentially regional tailorings.
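
To see the two orders side by side, here is a quick comparison that can be run directly in SQL. The second query assumes that a libc collation named en_US.utf8 exists in pg_collation; the exact name depends on the operating system:

-- bytewise order: Beta, alpha
SELECT v FROM (VALUES ('alpha'), ('Beta')) AS t(v)
 ORDER BY v COLLATE "C";

-- linguistic order: alpha, Beta (assuming the en_US.utf8 collation exists)
SELECT v FROM (VALUES ('alpha'), ('Beta')) AS t(v)
 ORDER BY v COLLATE "en_US.utf8";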

A brief pros and cons comparison of these sorts could look like this:

                       Linguistic order    Binary order
  Ease of use          ✅ better           ❌ worse
  Human readability    ✅ better           ❌ worse
  Range search (*)     ✅ better           ❌ worse
  Performance          ❌ worse            ✅ better
  Portability          ❌ worse            ✅ 100%
  Real immutability    ❌ No               ✅ Yes
  LIKE prefix search   ❌ No               ✅ Yes

(*) Locating strings between two bounds, for instance to output paginated results

How to have text in binary order in an index?

  1. The first case is when the text column or expression has an explicit collation whose collcollate property is set to one of these locales:
    • C, which is always available in Postgres
    • POSIX (which is an alias for C)
    • C.utf8 (or C.UTF-8), available on some operating systems (Linux, FreeBSD, …). For instance, in
      CREATE TABLE product(
       product_id uuid,
       product_code text COLLATE "C" UNIQUE
      );
      

      the index created to support the uniqueness constraint on product_code will be sorted in binary order. It also implies that ORDER BY product_code, if it appears in a query, will use the binary sort order. To sort otherwise, an explicit COLLATE clause would need to be added to the query.

  2. The second case is when no COLLATE clause is specified when creating the column, but the default locale of the database is one of these binary-sorting locales. This happens either when creating the database with
    CREATE DATABASE dbname LOCALE=binary-sorting-locale TEMPLATE=template0;
    -- or
    CREATE DATABASE dbname LC_COLLATE=binary-sorting-locale TEMPLATE=template0;
    

    This implies that all text sorting in the database will be in binary except when using explicit COLLATE clauses with a different collation.

  3. The third case is when the database is not created with these options, but the template1 database was initialized with a binary-sorting locale. For instance on a modern Linux system (glibc 2.35 or newer) where C.utf8 is now a built-in locale, initdb --locale=C.utf8 would initialize the Postgres instance with text sorted in binary by default in all databases.

Note that ICU collations cannot be used for this, as the ICU library does not expose any locale that would sort in binary.

Why do most installations not use any binary sort?

In most Postgres installations, initdb is not called with a --locale argument, so it’s taken from the environment (typically from $LANG; see Locales in GLIBC or the Locale chapter of the POSIX specification for more).

As the documentation says:

initdb will initialize the database cluster with the locale setting of its execution environment by default, so if your system is already set to use the locale that you want in your database cluster then there is nothing else you need to do

Often $LANG has the language and region set up when the OS was installed, for instance en_US.utf8. That choice of locale goes into template0 and template1. Then the user-created databases inherit from it, and inside these databases the text columns inherit from it too. Finally, indexes based on these columns are sorted with the collation of their column, so they inherit from this locale too.

In short, by following the defaults when creating all these objects, we have this chain of events, where everything is set up to use the default locale of the operating system.

$LANG -> initdb -> template dbs -> user-created dbs -> columns -> indexes
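
Each step of that chain can be checked on an existing instance. As a sketch, the first query below shows the per-database defaults inherited from initdb, and the second lists the columns that carry an explicit collation (a NULL collation_name means the column simply inherits the database default):

-- per-database default "collate" and "ctype" settings
SELECT datname, datcollate, datctype FROM pg_database;

-- columns that override the database default collation
SELECT table_schema, table_name, column_name, collation_name
  FROM information_schema.columns
 WHERE collation_name IS NOT NULL;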

But a developer or even a DBA does not necessarily decide or expect any particular value in $LANG at the start of this chain. When producing SQL scripts to create databases, schemas, tables, developers will generally not be prescriptive about locales and assume that the default installed locale will do. It’s also simpler to just not care about collations and locales in creation scripts and queries.

Yet when the libraries that implement en_US.utf8 or any other linguistically-sorting locale must be upgraded, it becomes a database/DBA problem: any index depending on that locale must be recreated before going back online.

So to reduce the exposure to that problem, we might ask in advance, at creation time, for each of the indexed columns: does that column really need a linguistic sort order? Very often, the answer is No, bytewise would be fine. And sometimes it’s not just fine, but actually better.

Or considering the database itself, we might ask: which queries run in this database require a linguistic sort order? If the answer is none, then why not default to a binary collation for the database?

However, because “binary” is not a property that we set on an index, but the result of using a collation that happens to sort in binary, there are a few issues to understand before switching to binary collations. Let’s talk about these in the rest of this post.

The weakness of C/POSIX

In Postgres, the choice of collation drives not only the sort order, but also the character classification and case conversion: specifically, the results of functions like upper(), lower(), and initcap(), and the decisions in pattern matching about which characters are letters, digits, and punctuation.

As libc allows two separate locales for sorting and classification, so do Postgres libc collations: the locale for sorting is found in pg_collation.collcollate, and the one for character classification and case conversion is in pg_collation.collctype, corresponding respectively to the LC_COLLATE and LC_CTYPE options of CREATE COLLATION.

At the database level, the default libc collation also comes as a dual setting: pg_database.datcollate and pg_database.datctype, which correspond respectively to the LC_COLLATE and LC_CTYPE options of the CREATE DATABASE command.
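
These catalog columns can be inspected directly. For instance, the query below shows the “collate” and “ctype” parts of the binary-sorting libc collations (C.utf8 only appears if that locale was imported from the operating system); the database-level pair can be checked in pg_database as shown above:

SELECT collname, collprovider, collcollate, collctype
  FROM pg_collation
 WHERE collname IN ('C', 'POSIX', 'C.utf8');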

Using C or POSIX is fine for sorting (the “collate” part) all character sets that are supported by Postgres, including UTF-8, but when it comes to character classification and case conversion (the “ctype” part), it knows only about the US-ASCII character set. So basically it gives correct results for English, but not beyond that. Example:

-- show the inability of "C" to uppercase accented characters
test=> select initcap('élysée' collate "C");
 initcap 
---------
 éLyséE

The C.utf8 collation

C.utf8 is meant to sort text in binary like C, but give correct results for ctype-like functions with all Unicode characters (at least up to the Unicode version supported by your libc). Example:

-- show the ability of "C.utf8" to uppercase accented characters
test=> select initcap('élysée' collate "C.utf8");
 initcap 
---------
 Élysée

The main problem with C.utf8 is that it’s not part of the POSIX standard, is not supported everywhere, and in the case of the GNU libc generally used on Linux systems, it’s only fully supported since GNU libc 2.35, released in February 2022. Major distributions like Debian or Red Hat had patches implementing this locale before that, but those patched locales do not sort all code points in binary, so these pre-2.35 versions cannot be trusted.

Another problem, more subtle, is that Postgres does not assume or “know” that C.utf8 sorts in binary, even when it truly does, and it’s thus not going to use some optimisations that it would use with the C collation, for instance the use of an index to match a left-anchored LIKE pattern.
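
To illustrate, with the product table from the first example, whose product_code column uses the C collation, a left-anchored LIKE can use the unique index; with a C.utf8-collated column, the same query would typically fall back to a sequential scan (the prefix value here is arbitrary):

EXPLAIN SELECT * FROM product
 WHERE product_code LIKE 'AB%';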

On systems that do not have a proper C.utf8 locale, or to overcome this missed-optimisation problem, a Postgres collation can be made from lc_collate=C and lc_ctype=some-locale, where it’s up to you to find which some-locale is suitable for your operating system.
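
As a sketch, assuming the OS provides en_US.utf8, such a collation could be created and used like this (the name binary_enus is arbitrary):

CREATE COLLATION binary_enus (
  provider = libc,
  lc_collate = 'C',          -- binary sort order
  lc_ctype = 'en_US.utf8'    -- full UTF-8 character classification
);

-- binary sort order, but correct case conversion for accented characters
SELECT initcap('élysée' COLLATE "binary_enus");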

For instance, to get a database that defaults to binary sorts with correct UTF-8 support for ctype functions, assuming the OS has en_US.utf8, this would do:

CREATE DATABASE dbname
   TEMPLATE='template0'
   ENCODING='UTF8'
   LC_COLLATE='C'
   LC_CTYPE='en_US.utf8';

The future built-in C.utf8 collation

The concept of C.utf8 is fine, but the fact that it’s not available universally, and that some of its behavior is still OS-dependent, is problematic. A solution currently being worked on is an OS-independent Postgres binary collation that is pretty much like C.utf8, but internalized in Postgres. This is referred to as Built-in collation provider for “C” and “C.UTF-8” in the current CommitFest.

Hopefully this work will end up in the next PostgreSQL version and provide the simplest choice for a binary collation in the future. But at the moment, there’s a bit more “Do It Yourself” involved.

The “unicode” or “und-x-icu” collations

Now suppose that we have created a database whose default collation sorts in binary. Occasionally, we may want to output text sorted the “human way”, so that ‘alpha’ comes before ‘Beta’, not to mention the more sophisticated linguistic rules.

As mentioned above, an explicit COLLATE clause is needed in that case. But with what collation? With the libc collations, aside from C/POSIX, SQL scripts and queries cannot always know what libc locales are defined in the operating system, and hence in pg_collation. Also the COLLATE clause expects an identifier as the name of the collation, not a parameter, which kind of forces us to hardcode that name.

It happens that Postgres installations that support ICU have hardcoded language-agnostic collations that are guaranteed to exist. Starting with Postgres 16, the name "unicode" is assigned to the root unicode collation, so basically any query can do:

SELECT ... FROM ... ORDER BY colname COLLATE "unicode";

and get a linguistic-aware order.

Before Postgres 16, we can use the more weirdly named "und-x-icu" collation, which is also guaranteed to exist.
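
For instance, both names designate the ICU root collation and give the same linguistic order (this assumes the server was built with ICU support):

-- Postgres 16 and later
SELECT v FROM (VALUES ('alpha'), ('Beta'), ('élysée')) AS t(v)
 ORDER BY v COLLATE "unicode";

-- any ICU-enabled version
SELECT v FROM (VALUES ('alpha'), ('Beta'), ('élysée')) AS t(v)
 ORDER BY v COLLATE "und-x-icu";

-- both return: alpha, Beta, élysée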

Conclusion

Just as the Unicode character repertoire grows with each Unicode release (about once a year), the versions of the UTF-8 collations change with it, and the collation upgrade issue for Postgres is never going away.

While it’s necessary for Postgres to have the good support of linguistic collations that it does, that does not mean they have to be used for every piece of text in our databases, nor that we need to pay the reindexing cost that comes with them.

Because linguistic sort tends to come by default, using binary collations or a mix of binary and linguistic collations is a bit more involved, but it may be worth considering.