We are a small business that analyzes a lot of biological data, currently reading and writing somewhere in the range of 500 GB to 1.5 TB to produce final results that are much smaller. I'm a data scientist, but most of my colleagues are biologists, business people, or a mix of the two. To make things worse, I work remotely, so experimenting with the machine is tricky.
Now, about the process. We store raw data on Amazon, so we don't deal with that part, but the calculations themselves we do in-house on a business-grade tower: Meshify C case, ASUS Pro WS WRX80E-SAGE SE motherboard, Threadripper PRO 5975WX, RTX 5000, a 4 TB NVMe drive for runtime data, and a ZFS raidz1 pool with default configuration (3x16 TB Seagate IronWolf Pro). Initially we thought this was a reasonable setup, but we keep having problems with the hard drives. They live on average just 2-3 months, after which they get faulted. So far we have replaced two under warranty, but we feel bad about the whole situation: we are never 100% sure that we won't spend the next day replacing a disk, or that the manufacturer will honor the warranty. Our setup is such that we pull data in with rsync over 2.5 Gbps Ethernet, copy it to the NVMe drive, process it there, and once the files are processed they are copied back to the HDDs. The copying is done by a workflow system called Nextflow, which should in theory queue the file copies, but I wouldn't be surprised if that is not always the case and we sometimes copy up to twenty 500 MB files simultaneously from the HDDs.
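If Nextflow isn't actually serializing those copies, it can be capped explicitly. A minimal sketch for `nextflow.config`, assuming the copy-back step is its own process (the process name `copy_to_hdd` is hypothetical; substitute whatever your pipeline calls it):

```groovy
// nextflow.config — limit how many tasks of the copy step run at once,
// so the raidz pool never sees 20 parallel writers
process {
    withName: 'copy_to_hdd' {
        maxForks = 2   // at most 2 concurrent copies to the HDD pool
    }
}
```

`maxForks` is a standard Nextflow process directive; setting it per-process like this throttles only the disk-heavy step without slowing the compute stages on the NVMe.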
The thing that raises a flag for me is that the hard drives sit at around 38-40 °C when the system is idle (the motherboard SYSTIN sensor reads about the same) and 45 °C when it is processing. Strangely, removing the side panels doesn't help with this either. There are a bunch of fans in the system and we can probably figure out how to cool it down further, but maybe somebody has experience with similar systems: what else could be the reason for such a high failure rate? Another thing I noticed is that one disk was holding up pretty well (a single Seagate Exos 16 TB, probably bought by mistake). It ran the longest, almost 9 months, and had 5 times fewer errors than a disk that ran for just 1 month. But after I swapped both cables (power and SATA) from a failed disk onto this one, its error count rose dramatically. How likely is it that either the SATA controller or the power cable (it was on the same power line, just the previous connector) is causing this?
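The jump in errors right after the cable swap smells like a link problem rather than the disk itself. SMART attribute 199 (UDMA_CRC_Error_Count) counts errors on the SATA link, so a rising raw value points at the cable or controller port, not the platters. A small shell sketch for pulling it out of `smartctl -A` output (the device name in the usage line is just an example):

```shell
#!/bin/sh
# Flag SATA-link (cable/controller) trouble: UDMA_CRC_Error_Count rises when
# data is corrupted on the wire, independent of the disk's own health.
check_crc() {
    # expects the attribute table from `smartctl -A /dev/sdX` on stdin;
    # column 2 is the attribute name, column 10 the raw value
    awk '$2 == "UDMA_CRC_Error_Count" { print "CRC errors:", $10; found = 1 }
         END { if (!found) print "no UDMA_CRC_Error_Count attribute found" }'
}

# usage (as root; /dev/sda is an example device):
#   smartctl -A /dev/sda | check_crc
```

If `zpool status` shows checksum errors concentrated on one disk and its CRC raw value is climbing, the cable or port is the usual suspect; reallocated/pending sector counts, by contrast, implicate the drive itself.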
Any ideas or advice on how to properly build such a setup are welcome!