I have a server running CentOS 6 with two Crucial M500 SSDs configured in mdadm RAID1. This server is also virtualized with Xen.
Recently, I started seeing iowait
percentages creep up in the top -c
stats of our production VM. I decided to investigate and ran iostat on the dom0 so I could inspect activity on the physical disks (e.g., /dev/sda and /dev/sdb). This is the command I used: iostat -d -x 3 3
Here's an example of the output I received (scroll to the right for %util
numbers):
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.33 0.00 38.67 0.00 337.33 8.72 0.09 2.22 0.00 2.22 1.90 7.33
sdb 0.00 0.33 0.00 38.67 0.00 338.00 8.74 1.08 27.27 0.00 27.27 23.96 92.63
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 1.00 0.00 8.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md127 0.00 0.00 0.00 29.33 0.00 312.00 10.64 0.00 0.00 0.00 0.00 0.00 0.00
drbd5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
drbd3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
drbd4 0.00 0.00 0.00 8.67 0.00 77.33 8.92 2.03 230.96 0.00 230.96 26.12 22.63
dm-0 0.00 0.00 0.00 29.67 0.00 317.33 10.70 5.11 171.56 0.00 171.56 23.91 70.93
dm-1 0.00 0.00 0.00 8.67 0.00 77.33 8.92 2.03 230.96 0.00 230.96 26.12 22.63
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-6 0.00 0.00 0.00 20.00 0.00 240.00 12.00 3.03 151.55 0.00 151.55 31.33 62.67
dm-7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
To my alarm, I noticed that there was a significant difference between /dev/sda
and /dev/sdb
in await
(2ms vs 27ms) and %util
(7% vs 92%). These drives are mirrors of one another and are the same Crucial M500 SSD so I don't understand how this could be. There is no activity on /dev/sda
that should not also occur on /dev/sdb
.
I've been regularly checked the SMART values for both of these disks and I've noticed that the Percent_Lifetime_Used
for /dev/sda
indicates 66% used while /dev/sdb
reports a non-sensical value (454% used). I hadn't been too concerned up until this point because the Reallocated_Event_Count
has remained relatively low for both drives and hasn't changed quickly.
Could there be a hardware issue with our /dev/sdb
disk? Any other possible explanations?