Online DDL Tools and Metadata LocksOne thing I commonly hear when working with my clients is “I want to change my DDL strategy in order to avoid locking in my database! The last time I used the same old method I ended up in a metadata lock situation!”

I agree that metadata locks can be painful, but unfortunately, it’s completely unavoidable, and changing from one tool to another won’t help with this. That said, it’s still worth it to examine how metadata locks work and what the impact is for each of the common tools and processes. In doing so we will see that all these tools will require metadata locks, but knowing more about how they work and how the use locking can help us determine the right tool for your specific use case.

Any time you make a change to a table a metadata lock is needed to ensure consistency between the table itself and MySQL’s data dictionary. In order for MySQL to establish this lock it has to wait for any query against the table in question to finish, and in turn, any query on the table that occurs after the lock request has to wait for the long-running query and the lock to process before continuing. This will become a lot more apparent in our first example where we’re going to cover the online DDL method using the instant algorithm. I will also illustrate how this works with other tools such as pt-online-schema-change and gh-ost.

Lab setup

In order to demonstrate each technology, the first thing we have to do is set up the environment. I’ve used my VM lab to create a single host using Percona Server for MySQL 8.0. I created a test database and an example table and then created three dummy records.

Online DDL, Algorithm=INSTANT

In order to demonstrate how metadata locks work I have several terminals running against my host.

Terminal One

In the first terminal, I have created a long-running query that will prevent the formation of a metadata lock on the table in question.

Terminal Two

In terminal two, I attempted to run the direct alter against MySQL, which hung.

Terminal three

In terminal three we can see the processlist which shows the offending query and the alter that is seeking, but unable to obtain, the metadata lock.

Terminal four

And in terminal four I have created another select statement against the table, but it hangs because it’s queued behind the metadata lock.

Terminal one

So now we have fully recreated the issue. In order to resolve this I have gone back into terminal one to kill the offending long-running query.

Terminal two

In terminal two we can see that the alter table completed now that it was able to get the lock it needed.

Terminal three

In terminal three we can see that the processlist is now clear.

Terminal four

And in terminal four we can see the query that was queued behind the lock request was able to complete.

As such we can conclude that Online DDL was impacted by a long-running query given that it was unable to get the metadata lock it needed. Once the long-running query was killed the alter table was able to complete.

Before moving on to the next example, I’ll undo the changes in my lab so we have a fresh start.

Terminal one

pt-online-schema-change

Now that we know a little more about metadata locks, I’ll simplify the lab output in this example and the gh-ost example to follow in order to keep things a little more brief and concise.

For pt-online-schema change, there are at least four places where metadata locks are found. One for the creation of each of the three associated triggers, and one for the table swap.

Just as the same as before, in terminal one I created the long-running query against the table in question. In terminal two I ran the following pt-online-schema-change command.

Terminal two

Terminal three

As you can see, this process is hung on the creation of triggers that will keep the original table in sync with the new table that pt-osc has created. This is because creating triggers on a table changes the table definition, which impacts the data dictionary. We can see this in the processlist as shown below…

Terminal two

Once I killed the offending long-running query, the pt-osc process continued:

Terminal two

But you’ll notice in our command that we noted that we would swap tables later using the –no-swap-tables –no-drop-new-table options. So now we have to swap the table. This is impacted by metadata locks as well. In terminal one, I created the long-running query again, and in terminal two I attempted to do the table swap.

Terminal three

This was hung due to waiting for a metadata lock, which we can see in the processlist:

Terminal three

Once the long-running query was killed, the table swap was able to continue. Further work on this process was not interrupted by long-running queries because all work at this point only happens on the swapped _old table.

So, as you can see, pt-osc is impacted by metadata locks as well. In fact, three more than direct alters.

gh-ost

You may think that using gh-ost will help you get around metadata locks, but this isn’t the case. As you will see below, if there is a long-running query, it will hang when it tries to do the table swap similar to how pt-osc swaps tables.

And once this lock clears, the process is able to complete.

Conclusion

As you can see above, all schema change tools have to work with metadata locks. Some more than others, but this is something that we always need to consider when executing changes. Even if you opt to set a table swap for a later date/time, you still need to be aware that metadata locks can interrupt that process. The key here is to monitor for long-running queries at the time that that change occurs and kill them if they are getting in the way, if that’s possible.

This brings me to the final point: selecting a tool to use for a change. Personally, I do not believe there is such a thing as a single best tool for schema changes. You pick the best tool that will be the lightest weight and the least interruptive for the change you want to make.

Subscribe
Notify of
guest

3 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Bill Karwin

At my last job, we invoke pt-osc with a short timeout (1 second) for the metadata lock. If it can’t get the lock, then retry. This helps because other queries won’t be blocked by pt-osc for longer than the timeout duration, if pt-osc is blocked by a long-running query.

We still had cases where pt-osc fails after 200 retries, because the database was too busy. In those cases, we must run the DDL at some other time with less load.

If the load is truly 24/7 and there is no time in the week with less traffic, then sorry, there is no way to run the DDL without downtime.

Morgan Tocker

I recently contributed a feature to gh-ost which attempts to us ALGORITHM=INSTANT before proceeding with the gh-ost OSC algorithm: https://github.com/github/gh-ost/blob/master/doc/command-line-flags.md#attempt-instant-ddl

It’s currently disabled by default, but that might change in future. I tried to make it clear in the docs that this requires a MDL upfront, which I agree with you is something users might not understand or be expecting 🙂

Federico Razzoli

Note that with MariaDB you can add NOWAIT to the ALTER TABLE. If the algorithm is INSTANT, this avoids any long wait.