I am managing a bare-metal server running a MySQL database as well as some other software. We are seeing performance issues with some MySQL queries, but the issues occur intermittently, and almost exclusively affect UPDATE queries.

As an example, I have a particular query that runs about every 15 seconds. Most of the time it completes in milliseconds, but sometimes query profiling shows this:

Status                          Duration
starting                        0.000026
checking permissions            0.000003
Opening tables                  0.000011
System lock                     0.000002
init                            0.000007
update                          0.000052
Waiting for query cache lock    0.000001
update                          0.000002
end                             0.000001
query end                       15.198987
closing tables                  0.000023
freeing items                   0.000026
logging slow query              0.000014
logging slow query              0.000001
cleaning up                     0.000002
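
For reference, a profile like the one above can be captured with MySQL's SHOW PROFILE facility; a minimal sketch from the shell, where mydb, mytable and the UPDATE statement are placeholders for the real database and query:

mysql mydb -e "
  SET profiling = 1;
  UPDATE mytable SET updated_at = NOW() WHERE id = 1;  -- placeholder for the slow UPDATE
  SHOW PROFILES;                                       -- lists profiled statements and their query IDs
  SHOW PROFILE FOR QUERY 1;                            -- per-stage timings, using the ID from SHOW PROFILES
"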

I tried the suggestions in this question, without effect. I then started looking at lower-level items, since I noticed that some of the other software was also exhibiting intermittent slow response.

Here is top output when performance is good:

top - 12:09:48 up 138 days, 21:57,  2 users,  load average: 2.35, 1.73, 1.55
Tasks: 392 total,   2 running, 388 sleeping,   2 stopped,   0 zombie
%Cpu0  : 16.4 us,  1.3 sy,  0.0 ni, 80.3 id,  1.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu1  :  9.3 us,  0.7 sy,  0.0 ni, 89.1 id,  0.7 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu2  :  4.0 us,  0.7 sy,  0.0 ni, 95.0 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  3.0 us,  0.7 sy,  0.0 ni, 96.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 10.0 us,  0.7 sy,  0.0 ni, 89.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 14.9 us,  0.7 sy,  0.0 ni, 84.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  2.6 us,  0.3 sy,  0.0 ni, 96.4 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  1.3 us,  0.0 sy,  0.0 ni, 98.0 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32836768 total, 24516984 used,  8319784 free,   143868 buffers
KiB Swap:        0 total,        0 used,        0 free. 15166836 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26264 root      20   0 1255040  81148  17792 S  19.3  0.2 367:38.32 node
 1450 root      20   0 1249548 106060  17760 S   7.6  0.3   2612:39 node
30815 root      20   0 1245184  71652  17780 S   5.6  0.2 898:00.21 node
20700 root      20   0 1238644  63004  16812 S   2.3  0.2 635:20.34 node
  281 root      20   0  193348 101184  97840 S   2.0  0.3   2832:15 systemd-journal
 2974 root      20   0   48404  21688   3812 S   1.3  0.1   1329:51 openvpn
26283 postgres  20   0  235384  45408  42648 S   1.3  0.1  10:17.88 postgres
26284 postgres  20   0  235380  44780  42008 S   1.3  0.1  10:14.81 postgres
25764 root      20   0 1232976  58792  17716 S   1.0  0.2  42:51.01 node
26279 postgres  20   0  235356  45772  43028 S   1.0  0.1  10:13.60 postgres
12793 root      20   0   23956   3300   2488 R   0.7  0.0   0:00.11 top
26280 postgres  20   0  235384  44764  41992 S   0.7  0.1  10:16.64 postgres
27135 postgres  20   0  235380  44932  42156 S   0.7  0.1  10:15.24 postgres
27136 postgres  20   0  235360  44288  41532 S   0.7  0.1  10:15.45 postgres

And when it is bad:

top - 12:13:19 up 138 days, 22:00,  2 users,  load average: 3.34, 2.45, 1.88
Tasks: 393 total,   2 running, 389 sleeping,   2 stopped,   0 zombie
%Cpu0  : 11.1 us,  0.0 sy,  0.0 ni,  0.0 id, 87.8 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu1  :  1.3 us,  0.0 sy,  0.0 ni, 75.5 id, 22.8 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu2  :  5.7 us,  0.3 sy,  0.0 ni,  8.4 id, 85.6 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 11.0 us,  0.3 sy,  0.0 ni, 74.3 id, 14.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.7 us,  0.0 sy,  0.0 ni, 87.3 id, 12.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  1.3 us,  0.0 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  1.0 us,  0.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  3.0 us,  0.3 sy,  0.0 ni, 93.3 id,  3.4 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32836768 total, 24870588 used,  7966180 free,   143868 buffers
KiB Swap:        0 total,        0 used,        0 free. 15073140 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26264 root      20   0 1252992 109712  17792 S  10.0  0.3 368:16.02 node
 1450 root      20   0 1255692 112632  17760 S   4.3  0.3   2612:57 node
30815 root      20   0 1240064  81032  17780 S   4.0  0.2 898:13.71 node
20700 root      20   0 1243764  74472  16812 S   2.3  0.2 635:24.22 node
  281 root      20   0   58644  22468  19124 S   1.0  0.1   2832:18 systemd-journal
25764 root      20   0 1240144  64560  17716 S   1.0  0.2  42:53.01 node
 2974 root      20   0   48404  21688   3812 S   0.7  0.1   1329:53 openvpn
    7 root      20   0       0      0      0 S   0.3  0.0 728:52.82 rcu_sched
  770 postgres  20   0  235648  45920  42780 S   0.3  0.1   8:11.67 postgres
 5476 root      20   0       0      0      0 S   0.3  0.0   0:00.15 kworker/u16:1
12329 root      20   0       0      0      0 D   0.3  0.0   0:00.18 kworker/2:1
12778 root      20   0       0      0      0 R   0.3  0.0   0:00.25 kworker/2:2
26278 postgres  20   0  232340  17764  15544 S   0.3  0.1   5:38.38 postgres
28478 mysql     20   0 7210212 3.677g   8560 S   0.3 11.7   1849:14 mysqld
30829 postgres  20   0  232340  17484  15272 S   0.3  0.1  14:01.09 postgres
    1 root      20   0   28968   5188   2864 S   0.0  0.0   5:44.60 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:12.38 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0  75:39.66 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
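
Because the stalls are intermittent, it can help to log CPU/iowait continuously in batch mode and match the timestamps against the slow queries afterwards; a rough sketch (intervals and counts are arbitrary):

vmstat 5 720 > /tmp/vmstat.log &    # one hour of samples; the "wa" column is iowait
top -b -d 5 -n 720 > /tmp/top.log   # per-process view over the same hour, for correlation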

Following some advice on troubleshooting high I/O wait in Linux, I ran iostat -xm 5.

Here's a normal output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.05    0.00    0.63    0.80    0.00   92.52

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     2.60    0.00   58.40     0.00     0.64    22.58     0.02    0.32    0.00    0.32   0.29   1.68
sdb               0.00     2.80    0.00   58.20     0.00     0.64    22.66     0.12    2.10    0.00    2.10   1.70   9.92
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00   24.00     0.00     0.10     8.53     0.00    0.00    0.00    0.00   0.00   0.00
md3               0.00     0.00    0.00   27.80     0.00     0.53    39.19     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00   24.20     0.00     0.53    45.02     0.22    9.06    0.00    9.06   2.64   6.40

and a slow output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.90    0.00    0.28   51.05    0.00   44.77

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     7.20    0.00   42.20     0.00    15.24   739.82     0.35    8.27    0.00    8.27   0.74   3.12
sdb               0.00     7.40    0.00   62.00     0.00    24.69   815.55   110.16 2142.10    0.00 2142.10  16.13 100.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.40     0.00     0.00    24.00     0.00    0.00    0.00    0.00   0.00   0.00
md3               0.00     0.00    0.00    0.20     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.20     0.00     0.00     8.00   219.50 3880180.00    0.00 3880180.00 5000.00 100.00
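
To pin down which processes are actually issuing the writes during one of these episodes, something like pidstat or iotop can be run alongside iostat (both are packaged for Debian; the interval is arbitrary):

pidstat -d 5    # per-process read/write throughput (sysstat package)
iotop -o -d 5   # only show processes/threads currently doing I/O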

sda and sdb are Samsung SSD 850 EVO 1TB drives. Both have identical partition tables, and the matching partitions are paired into four RAID1 md devices:

Personalities : [raid1] 
md3 : active raid1 sda4[0] sdb4[1]
      940987072 blocks super 1.2 [2/2] [UU]
      bitmap: 2/8 pages [8KB], 65536KB chunk

md2 : active raid1 sda3[0] sdb3[1]
      19514368 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      15617024 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      487104 blocks super 1.2 [2/2] [UU]
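
(The layout above is /proc/mdstat; per-array detail, including member state and any pending resync, can also be checked with mdadm:)

cat /proc/mdstat
mdadm --detail /dev/md3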

md0 is /boot, type xfs. md1 is for swap, but is currently not in use. md2 is /, type xfs. Finally, md3 is the backing device for an encrypted volume (dm-0) where the data is stored:

LUKS header information for /dev/md3

Version:        1
Cipher name:    aes
Cipher mode:    xts-plain64
Hash spec:      sha1
Payload offset: 4096
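
For reference, the header details above can be dumped with cryptsetup:

cryptsetup luksDump /dev/md3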

What bothers me particularly is that the I/O load on sdb seems so much higher than on sda. Since both disks are only providing member partitions for RAID1 arrays, this doesn't seem right.

I checked the SMART data for sdb and ran both short and extended self-tests; no problems were detected:

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8719         -
# 2  Short offline       Completed without error       00%      8711         -
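
For reference, these are the usual smartctl invocations for running the tests and reading the logs (smartmontools package):

smartctl -t short /dev/sdb      # queue a short self-test
smartctl -t long /dev/sdb       # queue an extended self-test
smartctl -l selftest /dev/sdb   # self-test log, as shown above
smartctl -l error /dev/sdb      # error log
smartctl -A /dev/sdb            # vendor attributes (wear, temperature, etc.)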

System logs show no entries for sdb other than smartd logging temperature changes every 30 minutes.

kernel version:

Linux no 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux

What am I missing?

1 Answer

Based on your iostat output, sdb seems to have a very long I/O queue, with a correspondingly high wait time.

Why does this only happen on sdb rather than on both disks? Here is a possible explanation: when you first configured the RAID device, the entire content of sda (including invalid/garbage data) was copied to sdb. This marked every block on sdb as used, leaving very little space (only a small amount of over-provisioning) for flash block/page recycling. This can wreak havoc on consumer-grade disks, with the controller literally pausing all I/O while it garbage-collects flash blocks.

This is exacerbated by the encrypted partition, which basically looks like one big binary blob to the disk's controller.

What to do now? If your LUKS setup (and the entire I/O stack below it) supports ATA TRIM passdown, you can simply issue an fstrim -v <mntpoint> command to mark all free space as, well, not used. If that works, I suggest scheduling a nightly cron job to run it automatically during low-load hours. Have a look here for more information.
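
A rough sketch of what that looks like, assuming the encrypted volume is mounted at /data (the mount point and mapping name are examples; for LUKS to pass TRIM down, the device must be opened with --allow-discards or have the discard option set in /etc/crypttab):

lsblk --discard                                            # DISC-GRAN/DISC-MAX of 0 means discards are not passed down
cryptsetup luksOpen --allow-discards /dev/md3 data_crypt   # example mapping name
fstrim -v /data                                            # trim all free space on the mounted filesystem

# example nightly entry in root's crontab, run during low-load hours
0 3 * * * /sbin/fstrim -v /data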

If your setup does not support TRIM passdown, you are out of luck: in that case you would have to secure-erase the disks and re-partition them, leaving at least 25% of the space unassigned. In some cases it is even necessary to use hdparm to set an appropriately sized Host Protected Area, manually reserving more over-provisioning; however, that should not be needed on Samsung 850 EVOs.
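
For completeness, a heavily hedged sketch of that last-resort path (this is destructive; the device name and sector count below are examples only, so back up everything first):

hdparm --user-master u --security-set-pass p /dev/sdb   # set a temporary ATA security password
hdparm --user-master u --security-erase p /dev/sdb      # ATA secure erase: destroys ALL data on the drive

# optionally reserve extra over-provisioning via a Host Protected Area
hdparm -N /dev/sdb              # show the current visible/native max sector counts
hdparm -Np1800000000 /dev/sdb   # example value only: permanently limit visible sectors below the native max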
