
nvme/050: add support for NVMe multipath devices #214

Open
yizhanglinux wants to merge 2 commits into linux-blktests:master from yizhanglinux:nvme-050-fix

Conversation

@yizhanglinux
Contributor

$ ./check nvme/050
nvme/050 => nvme1n1 (test nvme-pci timeout with fio jobs) [failed]
runtime 94.236s ... 62.734s
--- tests/nvme/050.out 2025-11-17 00:23:56.086469327 -0500
+++ /root/blktests/results/nvme1n1/nvme/050.out.bad 2025-11-19 03:17:45.389644408 -0500
@@ -1,2 +1,3 @@
Running nvme/050
-Test complete
+Test failed
+tests/nvme/050: line 50: /sys/bus/pci/devices//remove: Permission denied

@igaw
Contributor

igaw commented Nov 25, 2025

There are enterprise PCI disks with multipath. I think it would be good to test these devices as well. It seems you are testing with such a device; maybe we could get this test working with these types too?

I think it's possible to make _get_pci_dev_from_blkdev a bit smarter so it returns the correct PCI device?

kawasaki added a commit to kawasaki/blktests that referenced this pull request Jan 3, 2026
The test case nvme/032 sets the value "pci" to the global variable
nvme_trtype to ensure that the test case runs only when TEST_DEV is a
NVME device using PCI transport. However, this approach was not working
as intended since the global variable is not referred to. The test case
was run for NVME devices using non-PCI transport, and reported false-
positive failures.

Commit c634b8a ("nvme/032: skip on non-PCI devices") introduced the
helper function _require_test_dev_is_nvme_pci(). This function ensures
that the test case nvme/032 is skipped when TEST_DEV is not a NVME
device with PCI transport. Despite this improvement, the unused global
variable nvme_trtype continued to be set. Remove the unnecessary
substitution code.

In the same manner, the test case nvme/050 is expected to be run only
when TEST_DEV is a NVME device with PCI transport. It also sets the
global variable nvme_trtype, but it caused unexpected failure as
reported in the Link. Modify the test case to use
_require_test_dev_is_nvme_pci() to ensure the requirement.

Fixes: c634b8a ("nvme/032: skip on non-PCI devices")
Link: linux-blktests#214
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
@kawasaki
Collaborator

kawasaki commented Jan 3, 2026

IIUC what @igaw suggests,

  • nvme/050 requires TEST_DEV which is NVME with PCI transport, and,
  • nvme/050 does not require TEST_DEV to not have multipath

Assuming this guess is correct, I created a patch. It just checks that TEST_DEV has PCI transport.

@yizhanglinux , does this patch avoid the failure you face?

@yizhanglinux
Contributor Author

@kawasaki This can skip the test when the disk is an enterprise PCI disk with multipath.
I think what @igaw suggests is that we still need to test such enterprise PCI disks with multipath.

Maybe something like below to return the PCI device when the disk supports multipath:

diff --git a/tests/nvme/050 b/tests/nvme/050
index 91f3564..ba09956 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -25,8 +25,7 @@ test_device() {
        local i

        echo "Running ${TEST_NAME}"
-
-       pdev=$(_get_pci_dev_from_blkdev)
+       pdev=$(_nvme_get_pci_from_dev_sysfs)
        nvme_ns="$(basename "${TEST_DEV}")"
        echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail

diff --git a/tests/nvme/rc b/tests/nvme/rc
index a8f80d8..e25cda2 100644
--- a/tests/nvme/rc
+++ b/tests/nvme/rc
@@ -87,6 +87,24 @@ _require_test_dev_is_not_nvme_multipath() {
        return 0
 }

+_nvme_dev_support_native_multipath() {
+       if [[ "$(readlink -f "$TEST_DEV_SYSFS/device")" =~ /nvme-subsystem/ ]]; then
+               return 0
+       fi
+       return 1
+}
+
+_nvme_get_pci_from_dev_sysfs() {
+       if _nvme_dev_support_native_multipath; then
+               readlink -f /sys/block/$(basename "${TEST_DEV}")/multipath/nvme*c*n*/device | \
+                       grep -Eo '[0-9a-f]{4,5}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | \
+                       tail -1
+       else
+               readlink -f "$TEST_DEV_SYSFS/device" | \
+                       grep -Eo '[0-9a-f]{4,5}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | \
+                       tail -1
+       fi
+}
 _require_test_dev_support_sed() {
        if ! nvme sed discover "$TEST_DEV" &> /dev/null; then
                SKIP_REASONS+=("$TEST_DEV doesn't support SED operations")

@kawasaki
Collaborator

kawasaki commented Jan 4, 2026

@yizhanglinux Thanks for the clarification. Now I have a better understanding.

@igaw Please comment if the suggested change suits your comment.

As to the change by @yizhanglinux , I have a few comments:

  • The readlink short option "-f" is to be replaced with the long option "--canonicalize".
  • In _nvme_get_pci_from_dev_sysfs(), "/sys/block/$(basename "${TEST_DEV}")" can be replaced with "${TEST_DEV_SYSFS}".
  • In _nvme_get_pci_from_dev_sysfs(), the "else" block can probably be replaced with a call to _get_pci_dev_from_blkdev().

@yizhanglinux
Contributor Author

@yizhanglinux Thanks for the clarification. Now I have a better understanding.

@igaw Please comment if the suggested change suits your comment.

As to the change by @yizhanglinux , I have a few comments:

  • The readlink short option "-f" is to be replaced with the long option "--canonicalize".

OK, I'll replace all the "-f" occurrences with "--canonicalize" across all files in one patch.

  • In _nvme_get_pci_from_dev_sysfs(), "/sys/block/$(basename "${TEST_DEV}")" can be replaced with "${TEST_DEV_SYSFS}".
  • In _nvme_get_pci_from_dev_sysfs(), the "else" block can probably be replaced with a call to _get_pci_dev_from_blkdev().

How about the change below:

diff --git a/tests/nvme/050 b/tests/nvme/050
index 91f3564..b6eba8b 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -26,7 +26,7 @@ test_device() {

        echo "Running ${TEST_NAME}"

-       pdev=$(_get_pci_dev_from_blkdev)
+       pdev=$(_nvme_get_pci_from_dev_sysfs)
        nvme_ns="$(basename "${TEST_DEV}")"
        echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail

diff --git a/tests/nvme/rc b/tests/nvme/rc
index a8f80d8..9314671 100644
--- a/tests/nvme/rc
+++ b/tests/nvme/rc
@@ -87,6 +87,25 @@ _require_test_dev_is_not_nvme_multipath() {
        return 0
 }

+_nvme_dev_support_native_multipath() {
+       if [[ "$(readlink -f "$TEST_DEV_SYSFS/device")" =~ /nvme-subsystem/ ]]; then
+               return 0
+       fi
+       return 1
+}
+
+_nvme_get_pci_from_dev_sysfs() {
+       if _nvme_dev_support_native_multipath; then
+               readlink -f $TEST_DEV_SYSFS/multipath/nvme*c*n* | \
+                       grep -Eo '[0-9a-f]{4,5}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | \
+                       tail -1
+       else
+              _get_pci_dev_from_blkdev
+       fi
+}
+

@igaw
Contributor

igaw commented Jan 5, 2026

Yes, this looks like what I had in mind. Thanks for taking care!

@kawasaki
Collaborator

kawasaki commented Jan 5, 2026

@yizhanglinux Thanks. Overall, the suggested change looks good. Posting it as a proper patch or PR would be appreciated. Again, please use "--canonicalize" instead of "-f". Also, I think it would be good to have another patch replacing "-f" with "--canonicalize" in the other places, which can be done in the same PR/series or later.

@yizhanglinux
Contributor Author

@igaw @kawasaki

I just found the case still fails: there is no "Input/output error" output, and also no error log output from dmesg [2].

[1]

	nvme_ns="$(basename "${TEST_DEV}")"
	echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail

	echo 100 > /sys/kernel/debug/fail_io_timeout/probability
	echo   1 > /sys/kernel/debug/fail_io_timeout/interval
	echo  -1 > /sys/kernel/debug/fail_io_timeout/times
	echo   0 > /sys/kernel/debug/fail_io_timeout/space
	echo   1 > /sys/kernel/debug/fail_io_timeout/verbose

	fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
	    --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
	    --time_based --runtime=1m >& "$FULL"

	if grep -q "Input/output error" "$FULL"; then
		echo "Test complete"
	else
		echo "Test failed"
	fi

[2]

# ./check nvme/050
nvme/050 => nvme0n1 (test nvme-pci timeout with fio jobs)    [failed]
    runtime    ...  62.913s
    --- tests/nvme/050.out	2026-01-05 01:05:11.924877002 -0500
    +++ /root/blktests/results/nvme0n1/nvme/050.out.bad	2026-01-05 07:41:41.764764187 -0500
    @@ -1,2 +1,2 @@
     Running nvme/050
    -Test complete
    +Test failed
nvme/050 => nvme3n1 (test nvme-pci timeout with fio jobs)    [failed]
    runtime  63.098s  ...  63.110s
    --- tests/nvme/050.out	2026-01-05 01:05:11.924877002 -0500
    +++ /root/blktests/results/nvme3n1/nvme/050.out.bad	2026-01-05 07:42:46.482290283 -0500
    @@ -1,2 +1,2 @@
     Running nvme/050
    -Test complete
    +Test failed
# dmesg
[ 7090.147822] run blktests nvme/050 at 2026-01-05 07:40:40
[ 7152.050970] pci 0000:41:00.0: [144d:a826] type 00 class 0x010802 PCIe Endpoint
[ 7152.051113] pci 0000:41:00.0: BAR 0 [mem 0xa4a00000-0xa4a07fff 64bit]
[ 7152.467124] pci 0000:41:00.0: VF BAR 0 [mem 0x00000000-0x00007fff 64bit]
[ 7152.467147] pci 0000:41:00.0: VF BAR 0 [mem 0x00000000-0x001fffff 64bit]: contains BAR 0 for 64 VFs
[ 7152.467907] pci 0000:41:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:40:03.1 (capable of 126.028 Gb/s with 32.0 GT/s PCIe x4 link)
[ 7152.604706] pci 0000:41:00.0: Adding to iommu group 48
[ 7152.641542] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: can't assign; no space
[ 7152.641558] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: failed to assign
[ 7152.641664] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: can't assign; no space
[ 7152.641675] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: failed to assign
[ 7152.641864] pci 0000:41:00.0: BAR 0 [mem 0xa4a00000-0xa4a07fff 64bit]: assigned
[ 7152.641895] pci 0000:41:00.0: VF BAR 0 [mem size 0x00200000 64bit]: can't assign; no space
[ 7152.641906] pci 0000:41:00.0: VF BAR 0 [mem size 0x00200000 64bit]: failed to assign
[ 7152.656355] nvme nvme0: pci function 0000:41:00.0
[ 7152.753246] nvme nvme0: D3 entry latency set to 10 seconds
[ 7152.865881] nvme nvme0: 16/0/0 default/read/poll queues
[ 7154.555876] run blktests nvme/050 at 2026-01-05 07:41:44
[ 7216.661760] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: can't assign; no space
[ 7216.661779] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: failed to assign
[ 7216.661888] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: can't assign; no space
[ 7216.661898] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: failed to assign
[ 7216.663032] pci 0000:c7:00.0: [1e0f:0013] type 00 class 0x010802 PCIe Endpoint
[ 7216.663162] pci 0000:c7:00.0: BAR 0 [mem 0xdb500000-0xdb50ffff 64bit]
[ 7216.663238] pci 0000:c7:00.0: ROM [mem 0xffff0000-0xffffffff pref]
[ 7217.083827] pci 0000:c7:00.0: VF BAR 0 [mem 0x00000000-0x0000ffff 64bit]
[ 7217.083849] pci 0000:c7:00.0: VF BAR 0 [mem 0x00000000-0x001fffff 64bit]: contains BAR 0 for 32 VFs
[ 7217.084456] pci 0000:c7:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:c0:03.3 (capable of 126.028 Gb/s with 32.0 GT/s PCIe x4 link)
[ 7217.217478] pci 0000:c7:00.0: Adding to iommu group 13
[ 7217.253368] pci 0000:c7:00.0: BAR 0 [mem 0xdb500000-0xdb50ffff 64bit]: assigned
[ 7217.253404] pci 0000:c7:00.0: ROM [mem 0xdb510000-0xdb51ffff pref]: assigned
[ 7217.253416] pci 0000:c7:00.0: VF BAR 0 [mem size 0x00200000 64bit]: can't assign; no space
[ 7217.253428] pci 0000:c7:00.0: VF BAR 0 [mem size 0x00200000 64bit]: failed to assign
[ 7217.268252] nvme nvme3: pci function 0000:c7:00.0
[ 7217.502216] nvme nvme3: D3 entry latency set to 10 seconds
[ 7217.610849] nvme nvme3: 16/0/0 default/read/poll queues

@igaw
Contributor

igaw commented Jan 5, 2026

First, is the line below correct now?

 echo 1 > "/sys/bus/pci/devices/${pdev}/remove"

If so, then there is another problem. But from the kernel logs, it looks like the remove works.

@igaw
Contributor

igaw commented Jan 5, 2026

Actually, I wonder why the fio job is supposed to succeed at all. The device is removed at the PCI level -> the block device should also be gone. It's not a reset, where the block layer would not see any device remove/add operation.

@igaw
Contributor

igaw commented Jan 5, 2026

Ah wait, the test expects fio to fail, but it doesn't, and that is due to the nature of the multipath device. In this configuration the head nvme device might not be removed, and the block layer buffers the IOs until the driver is ready again. Though I am not totally sure how nvme-pci works here. Need to check the source.

Could you check the output of fio? If it doesn't fail, then it is very likely all in-flight IOs are buffered at the block layer. If so, then we have to figure out what we want to test here.

@yizhanglinux
Contributor Author

First, is the line below correct now?

 echo 1 > "/sys/bus/pci/devices/${pdev}/remove"

If so, then there is another problem. But from the kernel logs, it looks like the remove works.

Yes, the disk was removed and initialized again after the PCI rescan.
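For reference, the remove/rescan cycle being exercised here uses the standard PCI sysfs interface. A minimal sketch (the pci_remove_rescan helper and the PCI_SYSFS override are illustrative, not blktests code):

```shell
#!/bin/bash
# Illustrative helper: hot-remove a PCI function, then rescan the bus.
# PCI_SYSFS defaults to the real sysfs location but can be pointed at a
# fake directory tree to exercise the write logic without root or hardware.
PCI_SYSFS="${PCI_SYSFS:-/sys/bus/pci}"

pci_remove_rescan() {
	local pdev="$1"   # PCI address, e.g. 0000:41:00.0

	# detach the device from its driver and remove the PCI function
	echo 1 > "$PCI_SYSFS/devices/$pdev/remove"
	# ask the PCI core to rediscover devices on the bus
	echo 1 > "$PCI_SYSFS/rescan"
}
```

On a real system both writes require root, and the block device only comes back once the rescan has re-enumerated and re-probed the function.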

@yizhanglinux
Contributor Author

The fio run passed with no errors, as seen in the full log.

# cat results/nvme0n1/nvme/050.full
reads: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.36
Starting 16 processes

reads: (groupid=0, jobs=16): err= 0: pid=33838: Mon Jan  5 10:14:28 2026
  read: IOPS=4525, BW=17.7MiB/s (18.5MB/s)(1061MiB/60005msec)
    clat (usec): min=784, max=86139, avg=3533.19, stdev=815.08
     lat (usec): min=784, max=86139, avg=3533.39, stdev=815.09
    clat percentiles (usec):
     |  1.00th=[ 2507],  5.00th=[ 2769], 10.00th=[ 2933], 20.00th=[ 3097],
     | 30.00th=[ 3195], 40.00th=[ 3294], 50.00th=[ 3359], 60.00th=[ 3458],
     | 70.00th=[ 3556], 80.00th=[ 3687], 90.00th=[ 4293], 95.00th=[ 5080],
     | 99.00th=[ 6587], 99.50th=[ 7177], 99.90th=[ 8848], 99.95th=[ 9503],
     | 99.99th=[12780]
   bw (  KiB/s): min=16118, max=19067, per=100.00%, avg=18111.24, stdev=28.86, samples=1904
   iops        : min= 4023, max= 4760, avg=4521.34, stdev= 7.24, samples=1904
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.04%, 4=86.78%, 10=13.14%, 20=0.03%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=0.05%, sys=19.27%, ctx=291366, majf=0, minf=215
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=271550,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=17.7MiB/s (18.5MB/s), 17.7MiB/s-17.7MiB/s (18.5MB/s-18.5MB/s), io=1061MiB (1112MB), run=60005-60005msec

Disk stats (read/write):
  nvme0n1: ios=270757/0, sectors=2166056/0, merge=0/0, ticks=412618/0, in_queue=412618, util=99.26%

@igaw
Contributor

igaw commented Jan 7, 2026

Ah, I remember what's happening with the PCI device removal and the handling of the nvme head device. When we have a multipath device, the lifetime of the nvme head device is coupled to the PCI subsystem hotplug behavior. That is, nvme_remove might not be executed, and thus the nvme head device will not be removed (this depends on the hardware hotplug support of the architecture). Only the nvme devices which represent the paths are removed.

As shown, fio will not fail for such configurations, as the block layer will requeue the failing IOs instead of reporting an error to the upper layers. So this test as written is for single-path devices.

This means we could either disable it for multipath nvme devices or, better, extend it and expect no IO errors when it is a multipath device. I'd prefer the second approach. WDYT?

EDIT: note, the behavior could differ between architectures. I think it would be good to collect this information and document it. We could make the test consider the architecture if necessary. Or maybe recent kernels have addressed this problem after all :)

@kawasaki
Collaborator

kawasaki commented Jan 8, 2026

@igaw Thanks for the clarification.

  • The explanation about the device removal dependency between the PCI device and the NVME device was interesting to me (multipath capability and architecture are relevant!), and I think it's worth documenting, somewhere under driver/nvme//.c as a block comment or in Documentation/nvme/*.
  • Also, it's worth testing, so I think the ultimate goal of this test case should be to "extend it and expect no IO errors when it is a multipath device".
  • If it takes time to reach the ultimate goal, I think it's better to take a two-step approach to suppress the error that @yizhanglinux is facing soon: 1) disable it for multipath nvme devices, 2) extend it to check for no IO errors on multipath devices.

@igaw
Contributor

igaw commented Jan 8, 2026

FWIW, an nvme-pci multipath device behaves similarly to a fabrics device. The paths can go away, and as long as the ctrl loss timeout doesn't expire (or, in this case, the PCI device isn't removed) the block device will stay around.

I had something like this in mind:

diff --git a/tests/nvme/050 b/tests/nvme/050
index 91f356422f63..4320c00d0a81 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -19,10 +19,22 @@ requires() {
 	_have_kernel_options FAIL_IO_TIMEOUT FAULT_INJECTION_DEBUG_FS
 }
 
+is_multipath_device() {
+	local nvme_ns cmic
+
+	nvme_ns="$1"
+
+	cmic="$(nvme id-ctrl "$nvme_ns" --output-format=json | jq -r '.cmic')"
+
+	if (( cmic & 0x1 )); then
+		return 0
+	fi
+
+	return 1
+}
+
 test_device() {
-	local nvme_ns
-	local pdev
-	local i
+	local nvme_ns pdev io_error i
 
 	echo "Running ${TEST_NAME}"
 
@@ -40,10 +52,13 @@ test_device() {
 	    --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
 	    --time_based --runtime=1m >& "$FULL"
 
-	if grep -q "Input/output error" "$FULL"; then
-		echo "Test complete"
+	io_error=false
+	grep -q "Input/output error" "$FULL" && io_error=true
+
+	if is_multipath_device "$nvme_ns"; then
+		$io_error && echo "Test complete" || echo "Test failed"
 	else
-		echo "Test failed"
+		$io_error && echo "Test failed" || echo "Test complete"
 	fi
 
 	# Remove and rescan the NVME device to ensure that it has come back

Or, if we don't want to trust that a device with the cmic bit set is a multipath device, then we should check what sysfs is telling us.

@yizhanglinux
Contributor Author

@yizhanglinux Thanks for the clarification. Now I have a better understanding.

@igaw Please comment if the suggested change suits your comment.

As to the change by @yizhanglinux , I have a few comments:

  • The readlink short option "-f" is to be replaced with the long option "--canonicalize".
  • In _nvme_get_pci_from_dev_sysfs(), "/sys/block/$(basename "${TEST_DEV}")" can be replaced with "${TEST_DEV_SYSFS}".

Here are the $TEST_DEV_SYSFS values for an nvme disk and an nvme multipath device:
disk: /dev/nvme1n1
TEST_DEV_SYSFS: /sys/devices/pci0000:40/0000:40:03.4/0000:44:00.0/nvme/nvme1/nvme1n1
disk: /dev/nvme2n1
TEST_DEV_SYSFS: /sys/devices/virtual/nvme-subsystem/nvme-subsys2/nvme2n1

The nvme2n1's TEST_DEV_SYSFS cannot be used to get the PCI address.

  • In _nvme_get_pci_from_dev_sysfs(), the "else" block can be replaced with a call to _get_pci_dev_from_blkdev(), probably.

@yizhanglinux
Contributor Author

or if we don't want to trust that when a cmic bit is set it's multipath device then we should check what the sysfs is telling us.

Checking cmic may not be enough; we also need to check that nvme_core.multipath=Y.
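A minimal sketch of such a module-parameter check (the helper name and the path argument are hypothetical; nvme_core exposes the parameter under /sys/module/nvme_core/parameters/multipath):

```shell
#!/bin/bash
# Illustrative helper: native NVMe multipath is only in effect when the
# nvme_core module parameter reads "Y". The parameter file path is taken
# as an argument so the logic can be exercised without the module loaded.
nvme_core_multipath_enabled() {
	local param="${1:-/sys/module/nvme_core/parameters/multipath}"

	[[ -r "$param" && "$(cat "$param")" == Y ]]
}
```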

Maybe it's better to just check the sysfs:

diff --git a/tests/nvme/050 b/tests/nvme/050
index 91f3564..2fdf62e 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -20,13 +20,11 @@ requires() {
 }

 test_device() {
-       local nvme_ns
-       local pdev
-       local i
+       local nvme_ns pdev io_error i

        echo "Running ${TEST_NAME}"

-       pdev=$(_get_pci_dev_from_blkdev)
+       pdev=$(_nvme_get_pci_from_dev_sysfs)
        nvme_ns="$(basename "${TEST_DEV}")"
        echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail

@@ -40,10 +38,13 @@ test_device() {
            --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
            --time_based --runtime=1m >& "$FULL"

-       if grep -q "Input/output error" "$FULL"; then
-               echo "Test complete"
+       io_error=false
+       grep -q "Input/output error" "$FULL" && io_error=true
+
+       if _nvme_dev_support_native_multipath; then
+               $io_error && echo "Test failed" || echo "Test complete"
        else
-               echo "Test failed"
+               $io_error && echo "Test complete" || echo "Test failed"
        fi

        # Remove and rescan the NVME device to ensure that it has come back
diff --git a/tests/nvme/rc b/tests/nvme/rc
index a8f80d8..2ecf618 100644
--- a/tests/nvme/rc
+++ b/tests/nvme/rc
@@ -87,6 +87,27 @@ _require_test_dev_is_not_nvme_multipath() {
        return 0
 }

+_nvme_dev_support_native_multipath() {
+       if [[ "$(readlink -f "$TEST_DEV_SYSFS/device")" =~ /nvme-subsystem/ ]]; then
+               return 0
+       fi
+       return 1
+}
+
+_nvme_get_pci_from_dev_sysfs() {
+       if _nvme_dev_support_native_multipath; then
+               readlink -f /sys/block/$(basename "${TEST_DEV}")/multipath/nvme*c*n*/device | \
+                       grep -Eo '[0-9a-f]{4,5}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | \
+                       tail -1
+       else
+              _get_pci_dev_from_blkdev
+       fi
+}
+
 _require_test_dev_support_sed() {
        if ! nvme sed discover "$TEST_DEV" &> /dev/null; then
                SKIP_REASONS+=("$TEST_DEV doesn't support SED operations")

@igaw
Contributor

igaw commented Jan 13, 2026

Looks reasonable to me. Though a comment above if _nvme_dev_support_native_multipath; then might be a good idea, explaining why it behaves differently.

@yizhanglinux changed the title from "nvme/050: skip test with NVMe multipath device" to "nvme/050: add support for NVMe multipath devices" on Jan 15, 2026
@yizhanglinux
Contributor Author

@kawasaki Updated the patch based on the discussion and also changed the title.

Add two helper functions to tests/nvme/rc:
- _nvme_dev_support_native_multipath(): Check if the test device is an
  NVMe native multipath device by examining the sysfs device path.
- _nvme_get_pci_from_dev_sysfs(): Get the PCI address for an NVMe
  device, handling multipath devices by reading from the multipath
  subdirectory.

Update nvme/050 to handle multipath devices correctly. When testing
I/O timeout on a multipath device, fio will not encounter I/O errors
because the multipath layer provides failover to alternate paths.
Adjust the test pass/fail logic accordingly:
- For multipath devices: pass if no I/O error (expected behavior)
- For non-multipath devices: pass if I/O error occurs (original behavior)

Signed-off-by: Yi Zhang <yi.zhang@redhat.com>
Replace the short option -f with the long option --canonicalize for
better readability and consistency.

Signed-off-by: Yi Zhang <yi.zhang@redhat.com>
@igaw
Contributor

igaw commented Jan 16, 2026

Yeah, the default operation is "echo 1 > /sys/block/nvme0n1/io-timeout-fail"; do you mean to also add "echo 1 > /sys/block/nvme0c0n1/io-timeout-fail" here?

Yes, the failure injection happens at this level, which means you have to set it for all nvme0c*n1 devices. Another thing you need to do then is set the default timeout for the I/Os. The default value is 30s, so the block layer will hold all I/Os in the queue and they won't be reported to the upper layers before they time out.

Something like this:

  • set fail_io_timeout for all nvme0c*n1 devices
  • set /sys/block/nvme0n1/queue/timeout relatively short, say 1 second
  • let fio run
  • add a sleep 2 to give the block layer a chance to return an error
  • check for errors; it should fail now
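The setup part of these steps could be sketched as follows (the helper name, the SYSFS_ROOT override, and the nvme0c*n* glob are illustrative; note that queue/io_timeout is in milliseconds, so 1 second is 1000):

```shell
#!/bin/bash
# Illustrative sketch of the failure-injection setup for a multipath
# namespace such as nvme0n1, whose path devices appear as nvme0c*n1.
# SYSFS_ROOT defaults to /sys but can point at a fake tree for dry runs.
setup_timeout_injection() {
	local ns="$1" root="${SYSFS_ROOT:-/sys}" f

	# 1) enable timeout failure injection on every path device
	for f in "$root"/block/"${ns%n*}"c*n*/io-timeout-fail; do
		[[ -e "$f" ]] && echo 1 > "$f"
	done

	# 2) shorten the per-I/O timeout so errors surface quickly (ms)
	echo 1000 > "$root/block/$ns/queue/io_timeout"
}
```

After this, run fio, sleep briefly, and check its output for I/O errors.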

@yizhanglinux
Contributor Author

I tried the change in [1], but the running fio never finished; there are continuous error outputs in dmesg [2], and from [3] it seems the fio processes are stuck in Sl+ and Ds state.
[1]

diff --git a/tests/nvme/050 b/tests/nvme/050
index 91f3564..01f69dc 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -20,16 +20,22 @@ requires() {
 }

 test_device() {
-       local nvme_ns
-       local pdev
-       local i
+       local nvme_ns pdev io_error i

        echo "Running ${TEST_NAME}"

-       pdev=$(_get_pci_dev_from_blkdev)
+       pdev=$(_nvme_get_pci_from_dev_sysfs)
        nvme_ns="$(basename "${TEST_DEV}")"
+       ctrl_dev=${nvme_ns%n*}
+
        echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail

+       if _nvme_dev_support_native_multipath; then
+               echo 1 > /sys/block/${ctrl_dev}c*n1/io-timeout-fail
+
+               echo 1000 > /sys/block/"${nvme_ns}"/queue/io_timeout
+       fi
+
        echo 100 > /sys/kernel/debug/fail_io_timeout/probability
        echo   1 > /sys/kernel/debug/fail_io_timeout/interval
        echo  -1 > /sys/kernel/debug/fail_io_timeout/times
@@ -39,11 +45,21 @@ test_device() {
        fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
            --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
            --time_based --runtime=1m >& "$FULL"
+       sleep 2
+
+       io_error=false
+       grep -q "Input/output error" "$FULL" && io_error=true

-       if grep -q "Input/output error" "$FULL"; then
-               echo "Test complete"
+       # The timeout failure injection causes an I/O to fail immediately. For
+       # a single-path device, the failed I/O is propagated up the stack and
+       # eventually reported to user space as an error. For multipath devices,
+       # the block layer evaluates whether the I/O is eligible for retry via
+       # failover to an alternate path. Because the I/O fails before the
+       # per-I/O timeout expires, it remains eligible for retry.
+       if _nvme_dev_support_native_multipath; then
+               $io_error && echo "Test failed" || echo "Test complete"
        else
-               echo "Test failed"
+               $io_error && echo "Test complete" || echo "Test failed"
        fi

        # Remove and rescan the NVME device to ensure that it has come back
@@ -58,4 +74,6 @@ test_device() {
        if (( i >= 10 )); then
                echo "Failed to restore ${TEST_DEV}"
        fi
+
+       echo 3000 > /sys/block/"${nvme_ns}"/queue/io_timeout
 }

[2]

[ 2424.311848] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.312085] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.312340] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.312566] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.312606] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.312824] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.313119] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.313382] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.313602] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.313992] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.314629] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.314666] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.314773] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2424.314980] FAULT_INJECTION: forcing a failure.
               name fail_io_timeout, interval 1, probability 100, space 0, times -1
[ 2454.327753] nvme nvme0: I/O tag 640 (3280) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.338048] nvme nvme0: I/O tag 641 (3281) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.348300] nvme nvme0: I/O tag 642 (3282) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.358561] nvme nvme0: I/O tag 678 (22a6) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.368841] nvme nvme0: I/O tag 679 (22a7) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.379093] nvme nvme0: I/O tag 680 (22a8) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.389345] nvme nvme0: I/O tag 681 (22a9) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.399631] nvme nvme0: I/O tag 682 (22aa) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.409902] nvme nvme0: I/O tag 683 (22ab) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.420139] nvme nvme0: I/O tag 684 (22ac) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.430363] nvme nvme0: I/O tag 685 (22ad) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.440600] nvme nvme0: I/O tag 686 (22ae) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.450815] nvme nvme0: I/O tag 687 (22af) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.461048] nvme nvme0: I/O tag 688 (22b0) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.471308] nvme nvme0: I/O tag 689 (22b1) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.481547] nvme nvme0: I/O tag 690 (22b2) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.491764] nvme nvme0: I/O tag 691 (22b3) opcode 0x2 (I/O Cmd) QID 9 timeout, aborting req_op:READ(0) size:4096
[ 2454.539245] nvme nvme0: Abort status: 0x0
[ 2454.550666] nvme nvme0: Abort status: 0x0
[ 2454.562039] nvme nvme0: Abort status: 0x0
[ 2454.571299] nvme nvme0: Abort status: 0x0
[ 2454.580398] nvme nvme0: Abort status: 0x0
[ 2454.591720] nvme nvme0: Abort status: 0x0
[ 2454.600785] nvme nvme0: Abort status: 0x0
[ 2454.612069] nvme nvme0: Abort status: 0x0
[ 2454.623329] nvme nvme0: Abort status: 0x0
[ 2454.632353] nvme nvme0: Abort status: 0x0
[ 2454.641366] nvme nvme0: Abort status: 0x0
[ 2454.652610] nvme nvme0: Abort status: 0x0
[ 2454.661595] nvme nvme0: Abort status: 0x0
[ 2454.676490] nvme nvme0: Abort status: 0x0
[ 2454.685420] nvme nvme0: Abort status: 0x0
[ 2454.694348] nvme nvme0: Abort status: 0x0
[ 2454.705464] nvme nvme0: Abort status: 0x0

[3]

# ps aux | grep fio
root        1979  2.2  0.0 240572  7572 pts/1    Sl+  11:14   0:45 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        1997  0.0  0.0 174976  2564 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        1998  0.0  0.0 174980  2568 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        1999  0.0  0.0 174984  2516 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2000  0.0  0.0 174988  2564 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2001  0.0  0.0 174992  2492 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2002  0.0  0.0 174996  2568 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2003  0.0  0.0 175000  2556 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2004  0.0  0.0 175004  2532 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2005  0.0  0.0 175008  2516 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2006  0.0  0.0 175012  2568 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2007  0.0  0.0 240552  2536 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2008  0.0  0.0 240556  2536 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2009  0.0  0.0 240560  2572 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2010  0.0  0.0 240564  2572 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2011  0.0  0.0 240568  2536 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2012  0.0  0.0 240572  2548 ?        Ds   11:14   0:00 fio --bs=4k --rw=randread --norandommap --numjobs=16 --name=reads --direct=1 --filename=/dev/nvme0n1 --group_reporting --time_based --runtime=1m
root        2365 42.8  0.0   6392  2112 pts/2    S+   11:47   0:00 grep --color=auto fio

# cat /proc/1979/stack
[<0>] hrtimer_nanosleep+0x12a/0x310
[<0>] common_nsleep+0x7a/0xc0
[<0>] __x64_sys_clock_nanosleep+0x283/0x3e0
[<0>] do_syscall_64+0x95/0x520
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

# cat /proc/1997/stack
[<0>] submit_bio_wait+0x143/0x200
[<0>] __blkdev_direct_IO_simple+0x3ac/0x820
[<0>] blkdev_read_iter+0x200/0x3f0
[<0>] vfs_read+0x6cb/0xb70
[<0>] __x64_sys_pread64+0x18a/0x1f0
[<0>] do_syscall_64+0x95/0x520
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
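The manual steps above (find the fio workers stuck in uninterruptible `D` state with `ps`, then read each task's kernel stack from `/proc/<pid>/stack`) can be automated. A minimal sketch, assuming a Linux `/proc` and root privileges for `/proc/<pid>/stack`; the function name `dump_d_state_stacks` is hypothetical, not part of blktests:

```shell
#!/bin/bash
# Dump the kernel stack of every task with the given name that is
# stuck in uninterruptible sleep (state "D"), mirroring the manual
# "ps aux | grep fio" + "cat /proc/<pid>/stack" steps above.
dump_d_state_stacks() {
    local name=$1 pid state
    # pgrep -x matches the process name exactly (e.g. "fio")
    for pid in $(pgrep -x "$name"); do
        # third field of /proc/<pid>/stat is the task state
        state=$(awk '{print $3}' "/proc/$pid/stat" 2>/dev/null)
        if [ "$state" = "D" ]; then
            echo "== $name pid $pid (state D) =="
            cat "/proc/$pid/stack" 2>/dev/null
        fi
    done
}

dump_d_state_stacks fio
```

Tasks blocked in `submit_bio_wait()` like the ones above indicate I/O that was submitted but never completed, which matches the timeout/abort messages in the kernel log.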

@igaw
Contributor

igaw commented Feb 9, 2026

Thanks for testing. Either my mental model of how this is supposed to work is broken, or you just found a bug. Anyway, I am still catching up on all the email/tasks after my vacation, so it will take a bit longer.
