
We are a small business that analyzes a lot of biological data, currently reading and writing somewhere in the range of 500 GB to 1.5 TB per run to produce final results that are much smaller. I personally am a data scientist, but most of my colleagues are either biologists or business people, or a mix of the two. To make things harder, I work remotely, so trying out many things is tricky.

Now about the process. We store raw data on Amazon, so we don't deal with that part, but we do the calculations in-house on a business-grade tower (Meshify C case, ASUS Pro WS WRX80E-SAGE SE motherboard, Threadripper PRO 5975WX, RTX 5000, 4 TB NVMe for runtime data, and ZFS zraid5 with the default configuration on 3x 16 TB Seagate IronWolf Pro). Initially we thought this was a reasonable setup, but we keep having problems with the hard drives. They live on average just 2-3 months, after which they simply get faulted. So far we have replaced two under warranty, but we feel bad about the whole situation: we are never sure that we won't spend the next day replacing a disk, or that the manufacturer will refuse the warranty claim. Our setup is such that we receive data with rsync over 2.5 Gbit/s Ethernet, copy it to the NVMe drive, process it there, and once the files are processed they are copied back to the HDDs. The copying is done by a workflow system called Nextflow, which should theoretically queue the file copies, but I wouldn't be surprised if that is not always the case and we sometimes copy up to twenty 500 MB files from the HDDs simultaneously.
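To see how hard the copy step actually hits the pool during a run, I was planning to watch the disks with something like this (the pool name `tank` is just a placeholder, not our actual pool):

    # Per-vdev throughput and IOPS on the ZFS pool, refreshed every 5 seconds
    zpool iostat -v tank 5

    # Per-device utilisation, queue depth and latency (from the sysstat package)
    iostat -x 5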

The thing that raises a flag for me is that the hard drives run at around 38-40 °C when the system is idle (the motherboard SYSTIN sensor reads about the same, even with the side panels off) and around 45 °C when it is processing. Strangely, removing the side panels doesn't help with this either. There are a bunch of fans in the system and we can probably figure out a way to cool it down further, but maybe somebody has experience with similar systems: where else could such a high failure rate come from? Another thing I noticed is that one disk was holding up pretty well (a single 16 TB Seagate Exos that was probably bought by mistake). It was in service the longest, almost 9 months, and had 5 times fewer errors than a disk that ran for only 1 month. But now I have swapped both cables (power and SATA) from a failed disk to this one, and its error count rose dramatically. How likely is it that either the SATA controller or the power (though it was on the same power cable, just the previous connector) is causing this?
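For reference, the temperatures and error counts above come from SMART, checked roughly like this (device names are examples, not our exact layout):

    # Full SMART report for one of the pool disks
    smartctl -a /dev/sdb

    # Attributes most relevant here: 194 Temperature_Celsius,
    # 197 Current_Pending_Sector, 199 UDMA_CRC_Error_Count
    # (199 usually points at cabling rather than the disk itself)
    smartctl -A /dev/sdb | grep -Ei 'temperature|pending|crc'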

Any ideas or advice on how to properly build such a setup are welcome!

  • I have my doubts whether the disks you are using are a good choice for the type of work they need to do. How did you decide on these disks, and for which step in the process are the Seagates used? These Seagates are NAS drives, not drives for heavy data analysis.

2 Answers


It does require some investment, but I would definitely recommend looking into a Pure Storage all-flash array.

I worked for a small SaaS company in finance, where the core of the application was very old. Since they were in the financial business, they required very fast storage, and from what I saw there, Pure was able to deliver that and more.

  • The cheapest Pure array probably costs more than everything mentioned in the question. While it would easily exceed most performance requirements, it is also likely to destroy the budget. Admittedly, it would make maintenance of the thing mostly the vendor's problem.

But now I have swapped both cables (power and SATA) from a failed disk to this one, and its error count rose dramatically. How likely is it that either the SATA controller or the power (though it was on the same power cable, just the previous connector) is causing this?

Yes, cable faults can definitely cause problems. Replace the data and power cables on any suspect storage. If problems persist, replace everything on the path from the disks to the main board, possibly including the main board itself, although at that point you are essentially building an entirely new chassis.
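A rough way to tell cabling apart from a dying disk, assuming the pool is named `tank` and the suspect disk shows up as `/dev/sdb` (both placeholders):

    # Note the current CRC error count; it is cumulative and never resets
    smartctl -A /dev/sdb | grep UDMA_CRC_Error_Count

    # Replace the SATA/power cables, then clear ZFS's error counters
    zpool clear tank

    # Force a full read of every block so new errors surface quickly
    zpool scrub tank
    zpool status -v tank

    # If UDMA_CRC_Error_Count stops climbing, the old cable or port was the culprit;
    # if reallocated/pending sectors keep growing instead, the disk itself is failing
    smartctl -A /dev/sdb | grep -E 'UDMA_CRC|Reallocated|Pending'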

While monitoring environmental conditions, also watch power quality. Put an uninterruptible power supply on the power input, and replace the power supply unit if it is at all suspect.
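If you do put the box behind a managed UPS, you can also log what it sees on the input side, for example with Network UPS Tools (the UPS name `myups` is an assumption about your setup, and NUT needs a driver configured for your model):

    # Dump everything the UPS reports
    upsc myups@localhost

    # Just the incoming line voltage and the current load
    upsc myups@localhost input.voltage
    upsc myups@localhost ups.load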


"zraid5" for ZFS is confusing two different naming schemes for arrays. RAID 5 is the textbook distributed parity, survives one failure. The most similar thing for ZFS is raidz1.

Beware: this array might not survive a rebuild. Many of us on Server Fault think terabyte-class disks are too big for a single parity disk's worth of redundancy, even on ZFS. When you rebuild this array, consider layouts with more redundancy, such as raidz2, or mirrors, which have different trade-offs.
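For comparison, the more redundant layouts mentioned above look roughly like this (again with placeholder device names, shown here over four disks):

    # Double parity: survives any two disk failures
    zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd

    # Striped mirrors: faster rebuilds and resilvers, but only 50% usable capacity
    zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd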


Doubts about whether you can replace parts under warranty could mean you do not have the right service contract, or that your vendor doesn't understand heavily used storage arrays. Disks are basically consumables that might live 3 years, plus or minus.
