Erratic SMART readings on one member of a RAID 1 array

Question

I manage a server that uses two NVMe SSDs in a RAID 1 array. At one point I lost access to one of the two and got the usual degraded-array emails from mdadm.

I asked the hosting company to check it out. They said that the drive's contacts needed cleaning to make a better connection, and once they did that the machine picked up the NVMe drive and started rebuilding the array.

When the rebuild finished I went in and checked the results. The SSDs are not new; they are used, so the SMART readings should reflect that.

When I ran nvme list I got the following result.

| => nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          S************1       SAMSUNG MZVKW512HMJP-00000               1          36.70  GB / 512.11  GB    512   B +  0 B   CXA7500Q
/dev/nvme1n1          S************5       SAMSUNG MZVL2512HCJQ-00B00               1         511.95  GB / 512.11  GB    512   B +  0 B   GXA7801Q

Now, the server is pretty old, but I got it second-hand and reformatted it a couple of weeks ago, so it is nearly empty. 36.70 GB of used space on the first member seems correct. The second member is the one that was rebuilt, and it reports 511.95 GB used. That makes no sense for a RAID 1 array (or does it?). Please correct me if I'm wrong.

I mean, the system works just fine. When I run:

| => cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[2] nvme0n1p1[0]
      33520640 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
      1046528 blocks super 1.2 [2/2] [UU]

md2 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
      465370432 blocks super 1.2 [2/2] [UU]
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>
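
The `[2/2] [UU]` fields in the output above are what indicate a healthy mirror: the first number is how many members the array expects, the second is how many are active, and an underscore in place of a `U` marks a missing member. As a minimal sketch (not part of the original post, and with `degraded_arrays` being a hypothetical helper name), the same check could be done in Python on the text of /proc/mdstat:

```python
import re

# Abbreviated copy of the /proc/mdstat output shown above.
MDSTAT = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[2] nvme0n1p1[0]
      33520640 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
      1046528 blocks super 1.2 [2/2] [UU]

md2 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
      465370432 blocks super 1.2 [2/2] [UU]
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+) : ", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[2/2] [UU]": expected members, active members, per-member flags.
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


print(degraded_arrays(MDSTAT))
```

On a live system the text would come from reading /proc/mdstat instead of the embedded string. Note that this only confirms that all members are present; it says nothing about the Usage column of nvme list.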

So the software RAID arrays are working fine, and the two drives should be identical. What does the 511.95 GB Usage on the second NVMe drive mean? Is it normal?

I checked what smartmontools reports for it and got this:

| => smartctl -A /dev/nvme1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-52-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    25,639 [13.1 GB]
Data Units Written:                 2,127,320 [1.08 TB]
Host Read Commands:                 101,600
Host Write Commands:                8,203,941
Controller Busy Time:               239
Power Cycles:                       7
Power On Hours:                     26
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               31 Celsius
Temperature Sensor 2:               31 Celsius

(Yes, I know, Power On Hours is 26. This NVMe drive is brand new; the hosting company confirmed it.)
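
For reading the figures above: in the NVMe SMART log, one data unit stands for 1000 blocks of 512 bytes, which is how smartctl arrives at the values in square brackets. A minimal sketch of that conversion follows; the function name is a hypothetical one chosen for illustration, and the 512.11 GB capacity is taken from the nvme list output earlier in the question.

```python
# One NVMe "data unit" represents 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512


def data_units_to_bytes(units):
    """Convert an NVMe SMART 'Data Units' counter to bytes."""
    return units * BYTES_PER_DATA_UNIT


# Counters from the smartctl output for /dev/nvme1 above.
read_bytes = data_units_to_bytes(25_639)
written_bytes = data_units_to_bytes(2_127_320)

# Capacity as reported by nvme list (512.11 GB, decimal gigabytes).
capacity_bytes = 512.11e9

print(f"read:    {read_bytes / 1e9:.1f} GB")
print(f"written: {written_bytes / 1e12:.2f} TB")
print(f"written / capacity: {written_bytes / capacity_bytes:.2f}")
```

The last line simply relates total host writes to the drive's size, which can help judge whether the write counter is plausible for a drive that has just gone through a full resync.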

Everything else on the drive seems fine. The other drive is much older, and its smartmontools report is:

| => smartctl -A /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-52-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    26%
Data Units Read:                    115,783,912 [59.2 TB]
Data Units Written:                 281,087,251 [143 TB]
Host Read Commands:                 1,142,872,239
Host Write Commands:                8,039,604,613
Controller Busy Time:               38,359
Power Cycles:                       519
Power On Hours:                     16,843
Unsafe Shutdowns:                   496
Media and Data Integrity Errors:    0
Error Information Log Entries:      154
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               27 Celsius
Temperature Sensor 2:               33 Celsius

That also looks fine and as expected. Yet for some reason nvme list shows the new drive, /dev/nvme1n1, as using 512 GB. How can this be? Did the rebuild not complete properly?

What do you think?

Why are you using consumer-grade SSDs in a server? Is this just the boot drive? — Chopper3, Dec 21, 2022 at 10:10
This was offered by Hetzner. I got it really cheap, so it's done. Isn't it a bit irrelevant though? — escozul, Dec 27, 2022 at 22:07
No, not at all. Server Fault is a site for professionals, who inherently wouldn't use consumer-grade parts, with their much lower MTBFs, in a professional setting. — Chopper3, Dec 28, 2022 at 10:35
This is a server meant for a professional installation. It will host about 50 websites that currently live on a VPS. Hetzner, in turn, is undeniably a professional hosting provider. The array is RAID 1, which reduces the chance of failure significantly. The server is set up with a Xeon and ECC memory. I do this for a living myself, so I don't get your "not professional" comment. I only asked about the output of one particular command that seemed obscure to me. Let me clarify again: the server works. I just get 512 GB used when running nvme list on one disk. — escozul, Dec 28, 2022 at 15:36

Robert Hrovat · Accepted Answer · 2023-04-14 05:46:12Z


I now get similar results as well:

    Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
    --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
    /dev/nvme0n1          S69xxxxxxxxxxxxx      Samsung SSD 980 PRO 2TB                  1           2.00  TB /   2.00  TB    512   B +  0 B   5B2QGXA7
    /dev/nvme1n1          S69xxxxxxxxxxxxx      Samsung SSD 980 PRO 2TB                  1         381.65  GB /   2.00  TB    512   B +  0 B   5B2QGXA7

And mdstat looks ok:

    Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
    md0 : active raid1 nvme0n1p2[1] nvme1n1p2[0]
          1952279552 blocks super 1.2 [2/2] [UU]
          bitmap: 2/15 pages [8KB], 65536KB chunk

Does anybody know why that is?


Is the /dev/nvme0n1 disk much older than /dev/nvme1n1?
– escozul
Apr 15 at 12:38
Both were bought on the same day. Their production dates differ by one month.
– Robert Hrovat
Apr 18 at 7:01
Listen, I never got an answer or a suggestion here. Instead, I got an irrelevant comment about what hardware I should be using. I fail to see the point of diverting the discussion from what the Usage column means to whether my SSD is pro or consumer grade. Eventually, I figured out what that "Usage" column means. Take it with a grain of salt, please:
– escozul
Apr 19 at 12:53
The Usage column actually shows how much of the available space on the SSD has been used, that is, how much of the physical NAND has been written at least once(?). It is OK if, at a certain point, these values are not the same for both drives. On my system they eventually matched. Right now the NVMe usage is about 137 GB on both of my drives, but du -h shows that only around 32 GB are occupied. So both drives have used only 137 of the 512 GB in their NVMe physical address space. That's how I interpreted it.
– escozul
Apr 19 at 12:54
I also don't think anything is wrong. It's just strange how it's shown. Maybe it's just a reading bug.
– Robert Hrovat
Apr 20 at 13:04
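
To make the interpretation in the comments above more concrete: as far as I understand nvme-cli, the Usage column is computed from the Identify Namespace fields NUSE (namespace utilization, the number of logical blocks currently allocated) and NSZE (namespace size in logical blocks), each multiplied by the LBA size. If that is right, a RAID 1 resync, which writes every block of the mirrored partitions, would leave the rebuilt drive reporting nearly full until those blocks are deallocated again (for example by TRIM), without anything being wrong with the array. The sketch below only shows the arithmetic. The function name is hypothetical and the block counts are illustrative values chosen to reproduce the "36.70 GB / 512.11 GB" row from the question; they were not read from the actual drives.

```python
def namespace_usage(nuse_blocks, nsze_blocks, lba_bytes=512):
    """Return (used_bytes, total_bytes, percent_used) for an NVMe namespace.

    nuse_blocks: NUSE, logical blocks currently allocated (assumed input)
    nsze_blocks: NSZE, total logical blocks in the namespace (assumed input)
    lba_bytes:   LBA data size, 512 B in the nvme list output above
    """
    used = nuse_blocks * lba_bytes
    total = nsze_blocks * lba_bytes
    return used, total, 100.0 * used / total


# Illustrative values only, picked to match the first row of the question's output.
used, total, pct = namespace_usage(nuse_blocks=71_679_688, nsze_blocks=1_000_215_216)
print(f"{used / 1e9:.2f} GB / {total / 1e9:.2f} GB ({pct:.1f}% allocated)")
```

On a real system the two block counts would come from the drive itself, for instance from the output of nvme id-ns for the namespace in question.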
