Small Datum: Fun with innodb_page

Thursday, July 6, 2023

Fun with innodb_page_cleaners

I have been running the insert+delete benchmark for Postgres, InnoDB and MyRocks and then trying to improve the configurations I use for each DBMS. With InnoDB a common problem is write stalls when user sessions get stuck doing single-page flushing.

This post is truthy rather than true. I have begun reading InnoDB source but my expertise isn't what it used to be and I am still figuring things out on my latest round of benchmarking.

The single-page flushing problem occurs when background threads cannot keep enough clean pages at the tail of the LRU. Any user session that needs a clean page might have to do write back, including writes to the doublewrite buffer, to find a clean page that can be evicted and reused. From PMP stacks I frequently see most of the page cleaner threads stuck on writes and then most of the user sessions all stuck doing single-page flush writes -- and the extra work from the doublewrite buffer just makes this worse.

Upstream has good docs on configuring the buffer pool and Percona has many great posts including here and here. But I have yet to discover how to avoid single-page flushes via my.cnf tuning.

Architecture

I have a few complaints and/or feature requests because avoiding single-page flush is a good idea:

Provide always-on counters that make the problem easy to spot
Use the same value by default for innodb_buffer_pool_instances and innodg_page_cleaners
Wake the page cleaner threads more often (more than once/second) so they can run more frequently but do less work per run.
For users, consider reducing the value for innodb_max_dirty_pages_pct. The default, 90%, is too high (reducing the default is another request). Perhaps 50% is a better default. The page cleaners will have an easier time keeping up when they only have to write back 50% of the pages at the tail of the LRU versus writing back 90% of them.
In theory using a larger value for innodb_lru_scan_depth means that more pages will be cleaned per second. The expected number of dirty pages cleaned per second is innodb_max_dirty_pages_pct X innodb_lru_scan_depth X innodb_buffer_pool_instances. However a larger value for lru_scan_depth means there will be more mutex contention.
Make page cleaning rates adaptive based on the workload rather than hardcoded based on innodb_lru_scan_depth

My first request is to make the default values the same for innodb_buffer_pool_instances (currently 8) and innodb_page_cleaners (currently 4). When there are more buffer pool instances than page cleaner threads (which is true by default today) then InnoDB is likely to have a harder time avoiding single-page flushes.

The page cleaner threads wake once per second, scan innodb_lru_scan_depth pages from the tail of the LRU, and writeback the dirty pages to make the clean and faster to evict. A given page cleaner thread works on one buffer pool instance at a time. The page cleaner thread can also do some writeback from the flush list, but I didn't try to understand that code path (see here).

To browse the source, start with buf_flush_page_coordinator_thread, pc_request and pc_flush_slot.

4 comments:

Jean-François GagnéJuly 6, 2023 at 4:47 PM
> The single-page flushing problem occurs when background threads cannot keep enough clean pages at the tail of the LRU.

For a query thread to try finding a clean page at the tail of the LRU, the Free List first needs to be empty. Have you tried maintaining more free pages in the free list by increasing innodb_lru_scan_depth ? Many details about this in [1] (probably also relevant even if you are not using Percona Server).

[1]: https://jfg-mysql.blogspot.com/2022/11/tail-latencies-in-percona-server-because-innodb-stalls-on-empty-free-list.html

> Wake the page cleaner threads more often (more than once/second) so they can run more frequently but do less work per run.

Percona has this, details in [1b] and [1].

[1b]: https://docs.percona.com/percona-server/8.0/performance/xtradb_performance_improvements_for_io-bound_highly-concurrent_workloads.html#multi-threaded-lru-flusher

> In theory using a larger value for innodb_lru_scan_depth means that more pages will be cleaned per second.

This applies to the tail of the LRU, but has other effects, details in [1], [2] and [3].

[2]: https://bugs.mysql.com/bug.php?id=108927

[3]: https://bugs.mysql.com/bug.php?id=108928

> However a larger value for lru_scan_depth means there will be more mutex contention.

I do not understand this, could you elaborate ?

> Make page cleaning rates adaptive based on the workload rather than hardcoded based on innodb_lru_scan_depth

I think Percona Server has this, details in [1b] and [1].

While writing [1], I thought about ways to improve Flushing. My idea would boil-down to:
- keep doing List Flushing once per second
- split LRU Flushing in two: LRU Cleaning and Free List Refilling
- do Free List Refilling very often, and LRU Cleaning not too often

Basically, above is because the "only" waste in flushing is scanning the LRU (because needing to scan clean pages, which is a waste), all the rest is "useful". So I would have LRU flushing waking-up often to refill the Free List (probably in an adaptative way as implemented in Percona Server), but only scan / clean the LRU every 10 to 100 iteration of Free List Refilling, or when refilling the Free List finds a dirty page at the tail of the LRU.

Also, I would like LRU Cleaning to go further than the size of the Free List, to be able to maintain a "long" tail of the LRU clean, to prevent Free List refilling to hit a dirty page, or prevent a query thread looking for a clean page in the LRU to have to scan too many dirty pages. This longer LRU cleaning is described in [3] (look for innodb_lru_scan_size, I changed my mind a few time in this feature request, you will have to read all the comments).

Also, with a longer tail of the LRU clean, I can assume that a query thread starting to scan the LRU when the Free List is Empty will quickly find a clean page. This way, I can also aggressively reduce the Free List Size, because the pages in the Free List is wasted RAM.

Also, while cleaning the tail of the LRU, a page cleaner should check that the Free List is still relatively full, because if it drains, query threads will try getting the LRU Mutex, which is held by the page cleaner, and this is a stall because the page cleaner was too busy cleaning and did not keep refilling.

This is a complex subject, and I would be happy to bounce ideas with you on this if you want. Ping me on Messenger if you want to schedule a chat.
ReplyDelete
Replies

Add comment