
I'm trying to benchmark an all-NVMe ZFS disk array. I'm familiar with the usual pattern of extremely fast benchmark results with very little disk activity, thanks to efficient ZFS caching. This time I ran into the reverse: a massive amount of ZFS disk activity, but fio shows very little bandwidth. How do I explain this behaviour?

zpool iostat -v 1 shows only one-second samples, but the output is consistent from second to second: 16 to 20 GiB/s of write bandwidth.

                capacity     operations     bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
my_pool-10    1.92T  19.0T      2   155K  95.7K  16.9G
  mirror-0     329G  3.16T      0  26.2K  3.99K  2.84G
    nvme2n1       -      -      0  13.2K      0  1.42G
    nvme3n1       -      -      0  13.1K  3.99K  1.42G
  mirror-1     328G  3.16T      0  26.0K      0  2.83G
    nvme4n1       -      -      0  13.0K      0  1.41G
    nvme5n1       -      -      0  13.0K      0  1.41G
  mirror-2     329G  3.16T      0  25.8K      0  2.83G
    nvme6n1       -      -      0  12.9K      0  1.42G
    nvme7n1       -      -      0  12.8K      0  1.42G
  mirror-3     328G  3.16T      0  25.8K  3.99K  2.81G
    nvme8n1       -      -      0  12.9K  3.99K  1.40G
    nvme9n1       -      -      0  12.9K      0  1.40G
  mirror-4     328G  3.16T      0  26.0K  87.7K  2.82G
    nvme10n1      -      -      0  13.0K  87.7K  1.41G
    nvme11n1      -      -      0  13.0K      0  1.41G
  mirror-5     327G  3.16T      0  25.6K      0  2.82G
    nvme12n1      -      -      0  12.9K      0  1.41G
    nvme13n1      -      -      0  12.8K      0  1.41G
------------  -----  -----  -----  -----  -----  -----

As far as I understand, the write stats in this table are doubled: the 16.9G at the top counts writes to all devices, and since these are mirrored vdevs, only half of that is data while the other half is redundancy. Obviously there is still some overhead on top. In that sense, I would expect my usable bandwidth to drop from 16.9 GiB/s to 8.45 GiB/s, and with roughly 20% overhead to about 6.75 GiB/s.

My fio benchmark, on the other hand, shows about 240 MiB/s. Either I don't understand the output, or something is gobbling up my bandwidth.

I run 25 parallel jobs of 4k writes (simulating an annoying network workload). The command is: sync; fio --name=asyncio --ioengine=posixaio --rw=randwrite --bs=4k --fsync=1 --size=100g --numjobs=25 --iodepth=64 --runtime=60 --time_based --filename=/my_pool-10/unbuffwrite2 --group_reporting

asyncio: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64
...
fio-3.28
Starting 25 processes
asyncio: Laying out IO file (1 file / 102400MiB)
Jobs: 25 (f=25): [w(25)][100.0%][w=240MiB/s][w=61.4k IOPS][eta 00m:00s]
asyncio: (groupid=0, jobs=25): err= 0: pid=33672: Sat Aug 26 20:20:18 2023
  write: IOPS=61.2k, BW=239MiB/s (251MB/s)(14.0GiB/60015msec); 0 zone resets
    slat (nsec): min=90, max=5052.1k, avg=810.57, stdev=5386.98
    clat (usec): min=52, max=47390, avg=9496.96, stdev=6324.85
     lat (usec): min=53, max=47392, avg=9497.77, stdev=6324.94
    clat percentiles (usec):
     |  1.00th=[ 2442],  5.00th=[ 2835], 10.00th=[ 3195], 20.00th=[ 3785],
     | 30.00th=[ 4555], 40.00th=[ 5866], 50.00th=[ 7635], 60.00th=[ 9765],
     | 70.00th=[12256], 80.00th=[15139], 90.00th=[18744], 95.00th=[21890],
     | 99.00th=[27395], 99.50th=[29754], 99.90th=[34341], 99.95th=[36439],
     | 99.99th=[40633]
   bw (  KiB/s): min=211040, max=272704, per=100.00%, avg=244851.53, stdev=462.61, samples=2975
   iops        : min=52760, max=68176, avg=61212.84, stdev=115.66, samples=2975
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.06%, 4=23.36%, 10=37.74%, 20=31.19%, 50=7.64%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=267, max=53554, avg=16564.62, stdev=7006.70
    sync percentiles (usec):
     |  1.00th=[ 3589],  5.00th=[ 5145], 10.00th=[ 6849], 20.00th=[ 9896],
     | 30.00th=[12518], 40.00th=[14877], 50.00th=[16909], 60.00th=[18482],
     | 70.00th=[20317], 80.00th=[22414], 90.00th=[25297], 95.00th=[27919],
     | 99.00th=[33817], 99.50th=[35914], 99.90th=[40109], 99.95th=[42206],
     | 99.99th=[44827]
  cpu          : usr=0.99%, sys=0.26%, ctx=916290, majf=0, minf=702924
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=1.9%, 16=11.8%, 32=161.4%, >=64=25.1%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=95.3%, 8=1.4%, 16=1.6%, 32=1.1%, 64=0.6%, >=64=0.0%
     issued rwts: total=0,3672021,0,3683331 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=239MiB/s (251MB/s), 239MiB/s-239MiB/s (251MB/s-251MB/s), io=14.0GiB (15.0GB), run=60015-60015msec

Any insights on where my bandwidth went?

  • Interesting question. I have more questions about the application, since fio may not provide the right data about the capabilities of the system. Depending on the application and what you're storing, ZFS mirrors may not be the best choice. There's also quite a bit of tuning needed to get the most out of NVMe on ZFS/Linux.
    – ewwhite
    Aug 26 at 23:20
  • I agree that fio does not provide the right data; I'll simulate the real workload as well. The 4k block size was more of a worst-case-scenario test. Anyway, I think I figured out my answer while writing my follow-up questions: how does zpool iostat know MY block size? I guess it can't, and it is calculating bandwidth based on the ZFS recordsize. If I max out at 155k IOPS with a recordsize of 128k -> 155k*128k = 18.9 GiB/s. That looks a lot closer to the 16.9 that I'm seeing. If I take my block size instead: 155k*4k = 605 MiB/s, and halved for the mirrored disks, ~300 MiB/s is pretty close to fio's results. Is it that simple? Aug 27 at 3:29
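
One way to check that guess directly, assuming the pool name from the output above: zpool iostat can print a per-vdev request-size histogram, which reports the sizes of the I/Os actually issued to the disks rather than relying on any assumed block size.

# request-size histogram for the leaf vdevs, sampled every second (run while fio is active)
zpool iostat -r my_pool-10 1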

1 Answer


You are probably running fio on a recordsize=128k dataset. When reading/writing 4K blocks you are going to see up to 32x I/O amplification.
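A quick sanity check, assuming the fio file lives directly on the pool's root dataset as --filename=/my_pool-10/unbuffwrite2 suggests: confirm the recordsize, and note that the amplification factor roughly bridges the two numbers in the question.

# confirm the recordsize (and compression) of the dataset backing /my_pool-10
zfs get recordsize,compression my_pool-10

# rough amplification arithmetic:
#   128 KiB record / 4 KiB write      -> up to 32x read-modify-write amplification
#   x2 for the mirror copies          -> ~64x at the vdev level
#   239 MiB/s (fio) * 64 ≈ 15 GiB/s   -> in the ballpark of the 16.9 GiB/s zpool iostat shows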

To get maximum IOPS you can create a new dataset with recordsize=4k and retry your fio benchmark; however, in the real world, I would recommend using a 16K/32K recordsize and lz4 compression.
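
A minimal sketch of that retest; the child dataset name fio_rs4k is hypothetical, and the fio command is the one from the question with only the filename changed.

# dedicated dataset with a small recordsize for the 4k benchmark (dataset name is hypothetical)
zfs create -o recordsize=4k my_pool-10/fio_rs4k
# for a real workload, the recommendation above would instead look like:
#   zfs create -o recordsize=16k -o compression=lz4 my_pool-10/data

# same job as in the question, pointed at the new dataset's mountpoint
sync; fio --name=asyncio --ioengine=posixaio --rw=randwrite --bs=4k --fsync=1 \
    --size=100g --numjobs=25 --iodepth=64 --runtime=60 --time_based \
    --filename=/my_pool-10/fio_rs4k/unbuffwrite2 --group_reporting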
