Migrating to utf8mb4The utf8mb4 character set is the new default as of MySQL 8.0, and this change neither affects existing data nor forces any upgrades.

Migration to utf8mb4 has many advantages including:

  • It can store more symbols, including emojis
  • It has new collations for Asian languages
  • It is faster than utf8mb3

Still, you may wonder how migration affects your existing data. This blog covers multiple aspects of it.

Storage Requirements

As the name suggests, the maximum number of bytes that one character can take with character set utf8mb4 is four bytes. This is larger than the requirements for utf8mb3 which takes three bytes and many other MySQL character sets.

Fortunately, utf8mb3 is a subset of utf8mb4, and migration of existing data does not increase the size of the data stored on disk: each character takes as many bytes as needed. For example, any digit or letter in the Latin alphabet will require one byte. Characters from other alphabets can take up to four bytes. This can be verified with a simple test.

As a result, all your data that uses a maximum of three bytes would not change and you will be able to store characters that require 4-bytes encoding.

Maximum Length of the Column

While the data storage does not change, when MySQL calculates the maximum amount of data that the column can store, it may fail for some column size definitions that work fine for utf8mb3. For example, you can have a table with this definition:

If you decide to convert this table to use the utf8mb4 character set, the operation will fail:

The reason for this is that the maximum number of bytes that MySQL can store in a VARCHAR column is 65,535, and that is 21845 characters for utf8mb3 character set and 16383 characters for the utf8mb4 character set.

Therefore, if you have columns that could contain more than 16383 characters, you will need to convert them to the TEXT or LONGTEXT data type.

You can find all such columns if you run the query:

For example, in my test environment, it returns:

 

Index Storage Requirement

MySQL does not know in advance which characters you will store in the column when you are creating indexes. Therefore, when it calculates the storage required for the index, it takes the maximum value for the character set chosen. As a result, you may hit the index storage limit when converting from another character set to utf8mb4. For InnoDB, the maximum size of the index is 767 bytes for REDUNDANT and COMPACT row formats, and 3072 bytes for DYNAMIC and COMPRESSED row formats. See The User Reference Manual for details.

That means you need to check if you have indexes that could grow to exceed these values before performing the update. You can do this with the following query:

Here is the result of running the query in my test environment:

Once you have identified such indexes, check the columns and adjust the table definition accordingly.

Note: The query uses CTE, available as of MySQL 8.0. If you are still on version 5.7 or earlier, you will need to rewrite the query.

Temporary Tables

One more issue you can hit after converting to the utf8mb4 character set is an increased size of the implicit temporary tables that MySQL creates to resolve queries. Since utf8mb4 may store more data than other character sets, the column size of such implicit tables will also be bigger. To figure out if you are affected by this issue, watch the global status variable Created_tmp_disk_tables. If this starts significantly increasing after the migration, you may consider updating RAM on your machine and increasing the maximum size of the temporary tables. Note that this issue could be a symptom that some of your queries are poorly optimized.

Conclusion

Converting to the utf8mb4 character set brings you the advantages of better performance, a larger range of characters that you can use, including emojis and new collations (sorting rules). This conversion comes at almost no price, and it can be done smoothly.

Ensure:

  • You converted all VARCHAR columns that could store more than 16383 characters to the TEXT or LONGTEXT data type
  • You adjusted index definitions that could take more than 767 bytes for the REDUNDANT and COMPACT row formats, and 3072 bytes for DYNAMIC and COMPRESSED row formats after migration.
  • You optimized your queries so that they should not start using internal disk-based temporary tables

Subscribe
Notify of
guest

8 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Bernt Marius Johnsen

Another issue worth mentioning: Migrating from utf8(mb3) to utf8mb4 may break unique constraints since the collations are not identical. See https://dev.mysql.com/blog-archive/mysql-8-0-collations-migrating-from-older-collations/ and https://dev.mysql.com/blog-archive/mysql-8-0-collations-migrating-from-older-collations-part-2/

dimus

>It is faster than utf8mb3
And what about performance utf8mb3 vs utf8mb4 in MariaDB?

Mark Gruenberg

Is this an error in the query :

FROM information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLES t USING(TABLE_ID)

should it be
FROM information_schema.INNODB_SYS_INDEXES i JOIN information_schema.INNODB_SYS_TABLES t USING(TABLE_ID)

Natarajan Chidhambharam

Under the Index storage requirement – “Once you have identified such indexes, check the columns and adjust the table definition accordingly.” Can you please advise what we should do if we have identify such indexes from the query?

Vito Meuli

The query to check index length problems fails for me on my MySql 8.0.36, I can’t understand why. It says “syntax error”.
I rewrote it as following (I hope I did it well):