This post is mainly inspired by findings from the previous testing of MySQL 8.0 on TPCC workload(s) and by observations from IO-bound Sysbench OLTP on Optane -vs- SSD. But also by several "urban myths" I often hear when discussing IO-bound OLTP performance problems with users :

Myth #1 : "if I double the number of my storage drives -- I'll get x2 better TPS !"
  • this was mostly true during the "HDD era", and even then..
  • (ex.: a single-thread app doing single random IO reads from a single HDD will not go faster by doing the same from 2x HDD -- similar to how a single-thread workload will not run faster on 8 CPU cores -vs- 2 CPU cores, etc.)
  • it all depends on your workload and how many parallel IO operations it involves..
  • indeed, it is much easier to saturate an HDD, but much harder to do so with a modern SSD/NVMe
  • NOTE : we're speaking about OLTP (e.g. if you start to observe full table scans in your workload -- it means you're already doing something wrong ;-))
  • simple rule : if you're not saturating your storage on any of its limits => you'll not see any gain by adding more drives; you'll just get more storage space, that's all (a quick way to check this is sketched just below).
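(just to illustrate the "simple rule" above -- one quick way to see whether you're anywhere near your storage limits is to watch the live IO rates and latencies while your workload is running and compare them to your drive's specs, for ex. with iostat; the device name here is hypothetical :)

  # watch IO rates and latencies of your data drive, every 5 sec
  iostat -xm nvme0n1 5
  # r/s + w/s = IOps, rMB/s + wMB/s = throughput, r_await / w_await = IO latency in ms
  # if all of these stay well below the drive's specs, adding more drives will not bring more TPS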

Myth #2 : "I'll go faster with a flash drive which is claimed in the specs to be capable of x2 more IOps than my current drive !"
  • if you're expecting to run an OLTP workload, pay attention to IO latency first !
  • sometimes it is not mentioned that these x2 more IOps were obtained with x4 more IO threads ;-))
  • e.g. a drive capable of x2 more IOps but with x4 higher latency will still be x4 slower than your current drive at 1 user load, and at 2, and 4, and .. maybe up to 64 ? (depends on where your current drive reaches its limit) -- and if you never have more than 64 concurrent users ? -- then you'll never see your x2 more IOps, but rather x4 worse TPS ;-)) (a small worked example follows just below)
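a tiny worked example (the numbers are hypothetical, just to show the simple math : IOps per user ≈ 1 / IO latency) :

  drive A : 0.1ms read latency => 1 user can drive ~10000 reads/sec (and 64 users up to ~640K, if the drive can follow)
  drive B : 0.4ms read latency => 1 user can drive only ~2500 reads/sec
  => even if drive B's specs claim x2 more peak IOps (measured at a very high queue depth),
     at low and mid concurrency drive A is still winning by x4..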

Test it yourself - this is the only advice I can give you !
  • because "only a real test will give you a real answer" (and I keep repeating it again and again.. ;-))
  • testing your own workload will give you the best answer !
  • otherwise you may still use some generic benchmark workloads which are representative of your case
  • for ex. your new flash drive may look better from all points of view and pass all generic pure IO tests very well, but show x3 worse results once used by MySQL -- and this just because, for ex., every involved fsync() takes x3 more time, etc. (based on a real story, no kidding ;-)) (a quick low-level check of latency and fsync() time is sketched below)
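for ex. a very quick low-level comparison of two drives on the metrics which matter most for OLTP -- single-IO latency and fsync() time -- could look like the following (only a sketch; device names, paths and sizes are hypothetical, adapt them to your setup) :

  # single-thread random read latency (what 1 user will feel)
  fio --name=lat1 --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=16k --iodepth=1 --runtime=60 --time_based
  # peak IOps at high queue depth (usually what the specs are quoting)
  fio --name=peak --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=16k --iodepth=32 --numjobs=4 --runtime=60 --time_based
  # random writes with fsync() after every write (closer to what REDO flushing will see)
  fio --name=sync1 --directory=/mnt/ssd --size=1G --direct=1 --rw=randwrite --bs=16k --iodepth=1 --fsync=1 --runtime=60 --time_based

(and still, keep in mind that even such tests remain "generic" -- only your own workload will give you the final answer)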

So far, for the following story I'll use :
  • Sysbench-TPCC workload, 10x100W (x10 of 100 warehouses, ~100GB data, here is why)
  • Same Skylake server as before, same config, etc.
  • and the same Optane & SSD as in the previous testing, except that I'll also add to the game x2 Optane drives used as a single RAID-0 MDM (Linux mdadm) volume ! (setup sketched just below)
  • EXT4 is used on top of each drive or volume
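for reference, the x2 Optane volume is just a plain mdadm RAID-0 stripe with EXT4 on top -- something like the following (a sketch only; device names and mount point are hypothetical, and mount options are not the point here) :

  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
  mkfs.ext4 /dev/md0
  mount /dev/md0 /mnt/optane-x2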

Starting Test scenario :
  • concurrent users : 1, 2, 4, .. 1024
  • trx_commit : 1 (flush REDO on every COMMIT)
  • Buffer Pool (BP) : 128GB (in-memory), 32GB (IO-bound) -- the corresponding my.cnf bits are sketched just below
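just to make it explicit, the settings discussed here map to the following my.cnf options (only a partial sketch, not the full config used in these tests) :

  [mysqld]
  innodb_flush_log_at_trx_commit = 1    # flush REDO on every COMMIT
  innodb_buffer_pool_size = 128G        # or 32G for the IO-bound runs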

Sorry, no "user friendly" graphs this time, but I hope it'll still be easy to understand the results ;-))

On the first graph you'll see :
  • 3 test results : with x2 Optane first, then with a single Optane, and then a single SSD
  • in each test the load goes from 1, 2, 4 .. up to 1024 concurrent users
  • the Commits/sec curve shows the obtained TPS level

and the first result is with 128GB BP :


InnoDB Buffer Pool 128GB

Comments :
  • first of all, as you can see, TPS is not better with two -vs- one Optane
  • as we're not hitting any limit of a single Optane drive, there is no gain from using x2 ;-))
  • also, even though the data volume used is ~100GB, we're not observing here such a big difference between Optane -vs- SSD as it was with Sysbench OLTP on a similar data size and still "in-memory"..
  • this is because the TPCC workload is much less aggressive on IO writes, so REDO flushing is less impacted..

now, how will TPS be impacted if the BP size is smaller, just 32GB ?

InnoDB Buffer Pool 32G

Comments :
  • x2 Optane : 20% peak TPS drop..
  • x1 Optane : 40% peak TPS drop..
  • x1 SSD : ~80% peak TPS drop..
  • definitely we're hitting single-Optane limits here, so using x2 Optane helps, but don't expect a x2 gain, as we're not hitting the single-Optane limits that hard ;-))

and you can see it clearly from the corresponding IO rate stats :
NOTE : as the graph shows a SUM over devices, for the x2 Optane test you need to divide the rate number by 2, because the MDM "device" (the raid0 Optane+Optane volume) is automatically counted in as well -- so in the SUM both Optane drives are counted twice..

The next step now will be to do the same, but with InnoDB page size = 4K :


InnoDB Buffer Pool 128GB, page 4K

Comments :
  • absolutely no impact on Optane
  • but a slightly better result on SSD (compared to the previous result with the default 16K page)
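(side note : innodb_page_size can only be set when the MySQL instance is initialized, so switching to 4K pages means re-creating the instance and reloading the data -- roughly something like this sketch, with a hypothetical datadir path :)

  # my.cnf
  [mysqld]
  innodb_page_size = 4k

  # then initialize a fresh datadir with this setting and reload the data
  mysqld --initialize --datadir=/path/to/new/datadir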

But the main gain is rather expected on the IO-bound test with 32GB BP, because with 4K pages we'll have more useful data cached in the Buffer Pool :


InnoDB Buffer Pool 32GB, page 4K

Comments :
  • Optane : only a 7% peak TPS drop ! (both x2 and x1 Optane)
  • SSD : 60% peak TPS drop..
  • while at the same time with 4K page size, TPS on SSD is slightly higher than with 16K page even with 128GB BP, and nearly x2 higher with 32GB BP -- so it's definitely something to consider, especially for IO-bound workloads ;-))

and the IO rate stats confirm it :


And now let's see how things will change if we add InnoDB Double Write to the game ! (yes ;-))

Note, just as a reminder :
  • the InnoDB Double Write (DBLWR) feature is the only "on board" solution to protect your data from partial page writes in case of an accidental power-off, etc.
  • so, each page is literally written twice (first to the DBLWR "buffer" (file), and then to its dedicated place in the datafile)
  • and as these writes must be serialized for every page, your page write latency will double
  • also, in the tested MySQL 8.0 version the DBLWR code is not yet fixed and has its own internal bottleneck
  • (so, your page write latency will double "at best", while in reality the impact may be much bigger ;-)) -- the on/off switch used in these tests is sketched just below
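for reference, DBLWR is simply switched on/off at server startup -- something like this (a sketch; in this MySQL 8.0 version innodb_doublewrite is a startup option, not a dynamic one) :

  [mysqld]
  innodb_doublewrite = OFF    # for the DBLWR=off runs
  # innodb_doublewrite = ON   # for the DBLWR=on runs (ON is the default)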

so far, what about results ?

the following graphs will now be per device :
  • you'll see 4 tests :
    • 1) default 16K page with DBLWR=off
    • 2) 4K page, DBLWR=off
    • 3) 16K page, DBLWR=on
    • 4) 4K page, DBLWR=on
  • same as before, load is going from 1, 2, 4 .. up to 1024 concurrent users

and we're starting with a single Optane device :


InnoDB Buffer Pool 128G, 1xOptane

Comments :
  • mostly zero impact of DBLWR here
  • the writes are not intensive enough to hit the DBLWR bottleneck
  • while Optane is fast enough to "hide" the doubled page write latency + keep REDO flushing unaffected..


InnoDB Buffer Pool 32G, 1xOptane

Comments :
  • first of all, the TPS gain with 4K page is pretty clearly seen !
  • and with DBLWR=off the gain is nearly 60% (!!)
  • while with DBLWR=on it's even bigger than 60% ! ;-))
  • however, DBLWR here is lowering TPS by roughly 30% (it will be interesting to compare once it's fixed in 8.0 ;-))
  • NOTE : in these tests I did not try to "protect" peak TPS from going down at higher load (it could be done by limiting concurrency or by using a pooling solution like ProxySQL -- see the small note just below) -- so, it's possible to keep peak TPS up to the 1024 users load, but I'd rather see the "natural limits" of the code ;-))
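(just for illustration -- one simple way to cap concurrency inside InnoDB itself, without any external pooling, is the innodb_thread_concurrency setting; the value here is only an example, it was not used in these tests :)

  [mysqld]
  innodb_thread_concurrency = 64    # max threads concurrently inside InnoDB (0 = unlimited, the default)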

And now the same, but with SSD :


InnoDB Buffer Pool 128G, 1xSSD

Comments :
  • amazing, but there is NO error in the graphs.. -- yes, enabling DBLWR is increasing TPS here ! ;-))
  • and you may ask -- "HOW is this ever possible ?.."
  • the story here is pretty similar to the x2 lower TPS observed on IO-bound Sysbench OLTP_RW with the same SSD, just that the impact goes in the opposite direction ;-))
    • enabling DBLWR has more than doubled the write time of every page
    • but it also has a bottleneck in the current code
    • (e.g. even if we want to send more pages to write -- we cannot ;-))
    • which lowered the overall IO activity around page writes
    • which resulted in leaving more "room" for REDO writes ;-))

and you can see it clearly from the page flushing stats :
and then from the REDO flush times :
While it looks surprising initially, there is no mystery so far. And once DBLWR is fixed in MySQL 8.0 -- this kind of artificial "optimization" will be gone ;-))

what about 32GB BP now ?

InnoDB Buffer Pool 32G, 1xSSD

Comments :
  • with 32GB BP much less data remains cached in the BP, so the whole IO activity becomes much higher
  • and it's much harder for the given SSD to keep up with this IO load than for Optane, so the SSD is lagging far behind..
  • enabling DBLWR also has a pretty big impact here, but what is interesting -- the "relative" impact is very similar to Optane (around 30%) -- again, it will be very interesting to compare once the DBLWR code is fixed..
  • seeing this, I better understand why Vadim observed only 2K TPS in his TPCC tests with MySQL under similar test conditions -- storage matters ;-)) (however, I have no idea why with 128GB BP he got only 6K TPS)..
  • on the other side -- using 4K page makes a huge difference here as well : more than x2 better performance !! -- because on an OLTP workload with 4K pages you have more chances to bring more "useful data" into the BP at the same time than with 16K pages..


INSTEAD OF SUMMARY

  • indeed, having super performant storage like Intel Optane helps a lot on IO-bound OLTP !
  • still, IO latency is extremely important for MySQL on OLTP workloads (especially IO-bound)
  • at the same time, it also depends on whether you can reduce your IO traffic or not..
  • using InnoDB 4K page is definitely something to consider !
  • InnoDB Double Write has a big impact.. -- even once its bottleneck is fixed in MySQL 8.0, it'll still remain a "double write" (e.g. still writing the same data twice, and still doubling the expected page write latency) -- it would be so cool if we could get rid of DBLWR totally.. ;-))
  • and, of course, there are still several InnoDB internal contentions remaining to be fixed -- as we're not yet scaling on the TPCC workload (while, again, everything is relative..)
  • seems like a lot of fun ahead ! ;-))

Thank you for using MySQL ! - stay tuned ;-))

Rgds,
-Dimitri