The performance characteristics of Amazon’s Elastic Block Store are moody, technically opaque and, at times, downright confounding. At Heroku we’ve spent a lot of time managing EBS disks, and I recently spent a very long night trying to squeeze the best performance out of them; little did I know I was testing on a night when they were particularly ornery. On a good day an EBS disk will give you 7,000 seeks per second; on a not-so-good day it will give you only 200. On the good days you’ll be singing its praises, and on the bad days you’ll be cursing its name. What I stumbled on that night was a set of techniques that seem to even out the bumps and get decent performance out of EBS disks even when they are behaving badly.
Under perfect circumstances, a totally untweaked EBS drive running an ext3 filesystem will get you about 80MB/s of read or write throughput and 7,000 seeks per second. Two disks in a RAID 0 configuration will get you about 140MB/s read or write and about 10,000 seeks per second. Those are the best numbers I’ve been able to get out of any EBS setup, and they appear to saturate the IO channel on the EC2 instance (which makes sense, as it’s about what you’d expect from gigabit ethernet). However, when the EBS drives are NOT running at their best, which is often, you need a lot more tweaking to get good performance out of them.
The tool I used to benchmark was bonnie++ (-f skips the slow per-character tests; -d sets the working directory), specifically:

bonnie++ -u nobody -f -d /disk/bonnie
Saturating reads and writes was not very hard, but seeks per second, which is critical for databases, was much more sensitive, and that is what I was optimizing for in my tests.
For my tests I built RAID arrays. I’ve been using mdadm RAID 0:
mdadm --create /dev/md0 --metadata=1.1 --level=0 ...
Each EBS disk is already claimed to be redundant under the hood, so I felt safe striping purely for speed.
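Filled out, the command I was running looked roughly like this. It’s a sketch: the disk count and device names are assumptions for illustration, and the 256k chunk size reflects the sweet spot I describe below:

```shell
# Stripe 8 EBS volumes into a single RAID 0 array with a 256k chunk.
# /dev/sdh through /dev/sdo are hypothetical attachment points;
# substitute whatever devices your EBS volumes are attached as.
mdadm --create /dev/md0 --metadata=1.1 --level=0 --chunk=256 \
      --raid-devices=8 \
      /dev/sdh /dev/sdi /dev/sdj /dev/sdk \
      /dev/sdl /dev/sdm /dev/sdn /dev/sdo
```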
Now, I just need to take a moment to point something out: performance testing on EBS is very hard. The disks speed up and slow down on their own. A lot. Telling whether your tweak is actually helping, or you just got lucky, is not easy. It feels a bit like trying to clock the speed of passing cars with a radar gun from the back of a rampaging bull. I fully expect to find that some of my discoveries here are just a mare’s nest, but hopefully others will prove enduring.
After testing, what I found surprised me:
- More disks are better than fewer. I’ve had people tell me that performance maxed out for them at 20 to 30 disks, but I could not measure any improvement above 8 disks. Most importantly, lots of disks seem to smooth out the flaky performance of a single EBS disk that might be busy chewing on someone else’s data.
- Your IO scheduler matters (but not as much as I thought it would). Do not use noop; use cfq or deadline. I found deadline to be a little better, but YMMV.
- Larger chunk sizes on the RAID made a (shockingly) HUGE difference in performance. The sweet spot seemed to be 256k.
- A larger read-ahead buffer on the RAID also made a HUGE difference. I bumped it from the default of 256 to 65536 (blockdev --setra counts 512-byte sectors, so that’s 128k to 32MB of read-ahead).
- Use XFS or JFS. The biggest surprise to me was how much better XFS and JFS performed on these moody disks. I am used to seeing only minimal gains from switching filesystems, but something about the way XFS and JFS group reads and writes plays very nicely with EBS drives.
- Mounting with noatime helps, but only by about 5%.
- Different EC2 instance sizes, much to my surprise, did not make a noticeable difference in disk IO.
- I was not able to reproduce Ilya’s results, where a disk performed poorly when newly created but faster after being zeroed out with dd (due to lazy allocation of sectors).
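Put together, the winning combination above (deadline scheduler, big read-ahead, XFS, noatime) can be applied with something like the following. This is a sketch under assumed device names and mount point; adjust paths to match your own setup:

```shell
# Switch the underlying EBS devices to the deadline IO scheduler.
# /dev/sdh through /dev/sdo are hypothetical attachment points.
for dev in sdh sdi sdj sdk sdl sdm sdn sdo; do
    echo deadline > /sys/block/$dev/queue/scheduler
done

# Bump read-ahead on the array: --setra counts 512-byte sectors,
# so 65536 sectors is 32MB of read-ahead.
blockdev --setra 65536 /dev/md0

# XFS was the single biggest win on the bad days.
mkfs.xfs /dev/md0
mkdir -p /disk
mount -o noatime /dev/md0 /disk
```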
I’ve included my notes from that day below. I was not running every test three times in a row and taking the standard deviation into account (although I wish I had), and these results aren’t easy to reproduce, since it’s been a while since the EBS drives were having such a bad day.
| Scheduler | FS | Disks | Settings | Seq Block Write | Seq Block Rewrite | Seq Block Read | Random Seeks/s |
|---|---|---|---|---|---|---|---|
| cfq | ext3 | 24 | noatime, stride=16, blockdev --setra 65536 /dev/md0 | 124k | 31k | 38k | 3939 |
| deadline | ext3 | 24 | noatime, chunksize=256, stride=64, blockdev --setra 65536 /dev/md0 | 126k | 37k | 44k | 1720 |
| cfq | ext3 | 24 | noatime, chunksize=256, stride=64, blockdev --setra 65536 /dev/md0 | 124k | 34k | 44k | 4560 |
| cfq | ext3 | 24 | noatime, chunksize=256, stride=64, blockdev --setra 393216 /dev/md0 | 126k | 35k | 43k | 1860 |
| cfq | ext3 | 24 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 129k | 34k | 43k | 1285 |
| cfq | ext3 | 24 | noatime, chunksize=256, blockdev --setra 65536 /dev/sd* | 125k | 35k | 44k | 2557 |
| cfq | ext3 | 16 | noatime, chunksize=256, stride=64, blockdev --setra 65536 /dev/md0 | 125k | 40k | 48k | 2770 |
| noop | ext3 | 16 | noatime, chunksize=256, stride=64, blockdev --setra 65536 /dev/md0 | 124k | 38k | 47k | 2504 |
| deadline | ext3 | 16 | noatime, chunksize=256, stride=64, blockdev --setra 65536 /dev/md0 | 125k | 41k | 46k | 1886 |
| cfq | xfs | 16 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 126k | 62k | 93k | 7428 |
| deadline | xfs | 16 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 118k | 63k | 92k | 10723 |
| cfq | xfs | 16 | noatime, chunksize=512, blockdev --setra 65536 /dev/md0 | 122k | 63k | 92k | 10099 |
| cfq | xfs | 16 | noatime, chunksize=512, blockdev --setra 131072 /dev/md0 | 116k | 64k | 99k | 9664 |
| deadline | xfs | 16 | noatime, chunksize=512, blockdev --setra 131072 /dev/md0 | 118k | 66k | 99k | 6396 |
| cfq | xfs | 24 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 117k | 62k | 89k | 7657 |
| cfq | xfs | 8 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 117k | 62k | 91k | 3059 |
| deadline | xfs | 8 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 139k | 63k | 85k | 10403 |
| deadline | xfs | 8 | noatime, chunksize=256, blockdev --setra 32768 /dev/md0 | 124k | 60k | 82k | 9308 |
| deadline | xfs | 4 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 88k | 48k | 77k | 1133 |
| deadline | xfs | 12 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 119k | 67k | 90k | 8590 |
| cfq | xfs | 12 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 85k | 64k | 91k | 10340 |
| cfq | ext2 | 8 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 141k | 51k | 51k | 3242 |
| deadline | xfs | 8 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 112k, 112k, 115k | 67k, 61k, 61k | 85k, 83k, 86k | 9568, 8541, 8339 |
| deadline | jfs | 8 | noatime, chunksize=256, blockdev --setra 65536 /dev/md0 | 135k, 138k, 85k | 66k, 64k, 33k | 92k, 87k, 79k | 9785, 10109, 8615 |