perf(connection): raise MAX_TRANSMIT_SEGMENTS to 40 and MAX_TRANSMIT_DATAGRAMS to 80 by poka-IT · Pull Request #636 · n0-computer/noq

poka-IT · 2026-05-05T02:08:48Z

Description

Raises MAX_TRANSMIT_SEGMENTS from 10 to 40 and MAX_TRANSMIT_DATAGRAMS from 20 to 80 in noq/src/connection.rs.

At MTU 1280 this means 51.2 KB per sendmsg(UDP_SEGMENT) call instead of 12.8 KB. The Linux kernel hard limit UDP_MAX_SEGMENTS is 64, so 40 stays comfortably within bounds.

On uplink-saturated benchmarks I measured a meaningful throughput improvement on a Hetzner CCX23 box (single iperf3 -P 1, MTU 1280). I noticed this while looking at quinn issue 1572 and the ETHZ NSG 2024 thesis section 5.4 ("Patching Quinn"), which mentions that raising similar constants doubles msquic-style throughput in their bench.

Breaking Changes

None. Internal constants only.

Notes & open questions

Memory usage per drive call grows by 4x (40 vs 10 segments pre-allocated), which is an acceptable tradeoff for the throughput gain on saturated workloads. Lower-traffic connections still allocate on demand and should not see a difference.

Happy to add a benchmark to bench/ if useful, but the change is self-contained and the rationale matches the existing comment.

Change checklist

Self-review.
Documentation updates following the style guide, if relevant.
Tests if relevant. (No new behavior, constants only.)
All breaking changes documented.

…DATAGRAMS to 80 At MTU 1280 this means 51.2 KB per sendmsg(UDP_SEGMENT) call instead of 12.8 KB. The Linux kernel hard limit UDP_MAX_SEGMENTS is 64, so 40 stays within bounds. On uplink-saturated benchmarks I measured a meaningful throughput improvement on a Hetzner CCX23 box (single iperf3 -P 1, MTU 1280).

flub · 2026-05-05T12:48:47Z

Thanks for the PR, could you post your benchmarks from before and after, including instruction on how to recreate them? These kind of changes tend to be delicate, so we'll want to try this out in some scenarios and see how comparable the results are.

poka-IT · 2026-05-05T15:19:49Z

Thanks for taking a look. Here's the data, with enough context to reproduce it.

Hardware / network setup

Two Hetzner CCX23 dedicated x86 VMs in fsn1-dc14:

4 vCPU dedicated, 16 GB RAM
iperf3 VPS to VPS direct: ~16-17 Gbps symmetric (REF, no tunnel)

Sysctl on both sides:

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 5000
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

Tunnel MTU 1280, GSO enabled, no other QUIC tuning beyond defaults
Stack on top of noq: a small QUIC VPN that runs N=32 noq::Connections per client, raw datagrams (RFC 9221), no streams. Exit side spawns one pump task per Connection. Client side has a single uplink task that dispatches by 5-tuple hash to one of the N Connections.

iperf3 driver: iperf3 -P 8 -t 30 -c <target> (8 parallel TCP streams), 5 runs uplink + 3 runs downlink.

Numbers

Bench A is vanilla noq 0.18.0. Bench B is the same source tree with the patched constants. Different VM pair, different day, same provisioning script and sysctls. Cross-session, not paired same-session, so see the variance caveat at the bottom.

Uplink (client to exit), 5x30s, P=8:

Run	A (10/20)	B (40/80)
1	1676	1953
2	1956	1898
3	1911	1814
4	1909	1853
5	1869	1932
avg	1864 Mbps	1890 Mbps

Downlink (exit to client), 3x30s, P=8:

Run	A (10/20)	B (40/80)
1	1949	2399
2	1924	2494
3	1948	2368
avg	1940 Mbps	2420 Mbps (+24.7%)

Within-session stddev: B uplink ±46 Mbps (2.5%), B downlink ±53 Mbps (2.2%). Tight enough that the +25% downlink is well above noise.

CPU client: ~210-230% (≈2.2 cores). RSS: 27 MB (A) vs 30 MB (B), so the 4x pre-alloc growth per drive call is small enough that it doesn't show up materially at N=32 connections.

Why it's asymmetric

Uplink shows no real change because our client-side architecture has a single sequential pump task. It's already drive-bound on its own loop, not on sendmsg capacity. Bumping the segment cap doesn't help when only one task is calling poll_transmit at a time.

Downlink benefits because the exit side has 32 parallel pump tasks, each driving its own Connection. Each one's sendmsg(UDP_SEGMENT) payload jumps from 12.8 KB (10 x 1280) to 51.2 KB (40 x 1280). That's where the ~25% comes from.

So the gain visible here is heavily workload-dependent: it shows up when several Connections drive transmit concurrently and the link has headroom. A single Connection doing bulk send on a saturated 25 GbE box (your typical bench setup) would probably show a different shape. Happy to run that variant if you want a more noq-native scenario.

Reproduce

Patch (already in this PR):

-const MAX_TRANSMIT_DATAGRAMS: usize = 20;
+const MAX_TRANSMIT_DATAGRAMS: usize = 80;
-const MAX_TRANSMIT_SEGMENTS: NonZeroUsize = NonZeroUsize::new(10).expect("known");
+const MAX_TRANSMIT_SEGMENTS: NonZeroUsize = NonZeroUsize::new(40).expect("known");

Apply against noq-v0.18.0. We pin via [patch.crates-io] in our workspace Cargo.toml:

[patch.crates-io]
noq       = { path = "vendor/noq-fork/noq" }
noq-proto = { path = "vendor/noq-fork/noq-proto" }
noq-udp   = { path = "vendor/noq-fork/noq-udp" }

(Patching all three is required, otherwise iroh-relay pulls in two copies of noq-proto types and the workspace fails to compile.)

Bench loop (runs on the client VM, exit running iperf3 -s on :49200):

for i in 1 2 3 4 5; do
  iperf3 -J -c <exit-tunnel-ip> -p 49200 -t 30 -P 8 \
    | jq -r '.end.sum_sent.bits_per_second / 1e6'
  sleep 1
done

Repeat with -R for downlink. The full multi-run wrapper we use parses min/max/stddev from iperf3 JSON via jq if that shape is useful.

Caveats I noticed

The two benches above are cross-session (different VM pair, different day). Hetzner inter-session network capacity varies a lot: on a follow-up run, the same provisioning gave a REF that dropped from 16.9 Gbps to 7.4 Gbps downlink, and the tunnel throughput scaled down with it. Anything below ~10% delta needs paired same-session runs to be trustworthy. The +25% downlink here is well above that threshold, but I would not claim a stable absolute Mbps number across hardware. If a paired same-session A/B with a flag-toggled binary would carry more weight for you, I can run that.

Memory: at N=32 Connections, the additional pre-allocated transmit space per drive call (4x growth) added ~3 MB total RSS in our setup. On something with thousands of Connections it would be more visible.

Regressions

We have 11 multi-conn QUIC integration tests covering the pump and accept loops; all pass unchanged after the patch. No new test added for the constants since there's no behavior change to assert, but happy to add a memory-footprint sanity test if you want one.

Let me know if you want flamegraphs, raw iperf3 JSON, or a different shape of bench (single Connection, lossy link, low-bandwidth path). I can also run the same A/B on a kernel-WireGuard pairing for sanity check if that's useful.

flub · 2026-05-06T15:54:42Z

If I followed your description it seems these benches are from using your entire stack and tests a TCP connection established by i3perf which itself is tunnelled into a QUIC connection using the QUIC datagram extension. Unfortunately I have no idea what your entire stack is, you haven't told me yet :).

I think ideally we'd be able to write a small self-contained perf tool in rust that only uses noq with no other components involved to compare the performance of these.

There is already a perf binary in the noq source tree that does some of this, but it may do things differently on your setup and only tests streams I think. Ideally you can demonstrate the per difference already with this, but if you need to make adjustments to how the connections are run and if the QUIC datagrams are important then maybe it is still a good start to modify things.

poka-IT · 2026-05-06T22:00:08Z

Thanks, that pushback is fair. The original numbers were end-to-end through our entire stack (Warren is a userspace QUIC VPN we're building, using QUIC datagrams for the tunnel), which made it impossible to attribute the delta to this patch alone.

I rebuilt the comparison so it's purely noq: HEAD vs HEAD~1 of this branch (so the only diff between the two binaries is the constants change), running noq/perf on a single VM with both server and client on 127.0.0.1, paired same-session. By "paired same-session" I mean alternating vanilla/patched runs back-to-back in the same VM, no reboot or fresh provision between samples. That controls Hetzner's ~5-15% inter-session variance which is otherwise large enough to drown the signal.

Setup: Hetzner cpx32 (4 vCPU AMD shared, 8 GB, fsn1, kernel 6.1, debian-12). The patch governs sendmsg(UDP_SEGMENT) GSO batching, which the Linux kernel applies on loopback exactly as on a NIC, so the syscall-reduction mechanism is exercised the same way.

Single bi-stream, MTU 1280 on both sides, --upload-size 1000G --download-size 0 --duration 30, 8 paired iterations.

variant	n	mean (Mbps)	median	stddev	client %CPU	server %CPU	Mbps/%CPU (client)
vanilla (HEAD~1)	8	4609	4591	197	99.0 (σ 0.0)	90.5 (σ 0.6)	46.6
patched (HEAD)	8	5892	5913	189	98.4 (σ 0.5)	93.1 (σ 1.0)	59.9

Ratio mean throughput 1.278x, ratio median 1.288x. Welch t-stat +13.3 (p < 0.0001). Client efficiency ratio (Mbps per 1% CPU): 1.287x.

The CPU column is the part I think matters most for the merge decision. Client saturates 1 core (~99% of one vCPU on the current_thread tokio runtime) in both variants, with vanilla and patched within a fraction of a percent of each other. So the +28% throughput is not coming from the patch using more CPU. It's the same CPU budget moving more bytes, which is the expected signature of a sendmsg syscall reduction (40 segments per call instead of 10, so roughly 4x fewer kernel transitions in the hot drive loop). Server-side is slightly less saturated (~91-93%) because in this configuration it does less work than the client; the efficiency ratio there is 1.24x, consistent with the same mechanism.

Per-iteration:

iter	vanilla Mbps	vanilla %CPU (c/s)	patched Mbps	patched %CPU (c/s)	ratio
1	4853	99% / 91%	5939	98% / 94%	1.22
2	4668	99% / 91%	6188	98% / 94%	1.33
3	4672	99% / 91%	5887	99% / 94%	1.26
4	4515	99% / 90%	5723	99% / 92%	1.27
5	4398	99% / 90%	5745	98% / 92%	1.31
6	4904	99% / 91%	6067	98% / 94%	1.24
7	4453	99% / 90%	5966	99% / 93%	1.34
8	4406	99% / 90%	5621	98% / 92%	1.28

Caveat: I had to fall back to cpx32 instead of the ccx23 from the original PR description (unrelated dedicated-core quota on my project at the time of test). Absolute throughput is therefore lower than the PR description, but the ratio sits inside the original 1.2-1.5x window. I can rerun on ccx23 if you want the exact same hardware as the claim.

On extending the perf binary with a --datagrams mode: happy to do it if you want it, but the streams numbers above already exercise the patched code path (drive_transmit is shared between streams and datagrams, same sendmsg(UDP_SEGMENT) loop). Let me know if you want me to push that anyway.

Reproduction script + raw CSVs available, can paste as a gist or open a follow-up PR adding a small bench/ helper to this repo if that fits.

(updated to add CPU/efficiency measurements from a follow-up paired run; absolute throughput is from a fresh session, hence marginally different from the first table I posted)

flub · 2026-05-07T16:07:03Z

Great that you think the existing perf tool is representative. I wasn't sure since in your first message you were going on about number of pump tasks and I didn't exactly follow what your setup and use was. But all the easier if the existing perf tool is representative.

On localhost I absolutely expect this to improve the throughput. Localhost is not really lossy and has a huge amount of bandwidth. So you can have huge flow-control and congestion windows. Testing this out on various links that are not localhost is much more interesting. E.g. same datacenter, across the internet to an ISP.

For example between a VPS in a datacenter in the UK and my home in Vienna I get these results:

Running the server using:

cd perf
cargo run --release -- server

Running the client using:

cd perf
cargo run --release -- client --download-size 10M --interval 5 a.b.c.d:4433

With the current noq main:

Overall stats:
RPS: 1.30 (78 requests in 60.02s)

Stream metrics:

      │ Upload Duration │ Download Duration | FBL        | Upload Throughput | Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
 AVG  │         71.32ms │          694.68ms │   629.00µs │       128.14 Mb/s │        129.65 Mb/s
 P0   │         53.18ms │          477.95ms │     1.00µs │        34.54 Mb/s │         58.10 Mb/s
 P10  │         53.82ms │          529.41ms │     2.00µs │        90.90 Mb/s │         81.79 Mb/s
 P50  │         61.38ms │          607.23ms │   552.00µs │       136.84 Mb/s │        136.84 Mb/s
 P90  │         92.35ms │             1.03s │     1.38ms │       155.98 Mb/s │        158.47 Mb/s
 P100 │        242.82ms │             1.44s │     6.11ms │       157.68 Mb/s │        175.51 Mb/s

With the changes from this PR:

Overall stats:
RPS: 1.18 (71 requests in 60.01s)

Stream metrics:

      │ Upload Duration │ Download Duration | FBL        | Upload Throughput | Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
 AVG  │         62.81ms │          774.41ms │     1.05ms │       138.98 Mb/s │        115.16 Mb/s
 P0   │         52.42ms │          478.46ms │     1.00µs │        36.80 Mb/s │         62.59 Mb/s
 P10  │         54.46ms │          551.42ms │    10.00µs │       118.82 Mb/s │         81.40 Mb/s
 P50  │         58.53ms │          731.65ms │   878.00µs │       143.00 Mb/s │        114.69 Mb/s
 P90  │         70.66ms │             1.03s │     2.24ms │       154.14 Mb/s │        152.31 Mb/s
 P100 │        227.97ms │             1.34s │     3.62ms │       160.04 Mb/s │        175.37 Mb/s

Now as you also observe, across different runs I also get different results, the best one I've seen with the increased constants was P90 at 178Mb/s, but then I also had one run with P90 at 147Mb/s.

For noq main the P90 swung between 155 Mb/s and 166Mb/s. Interestingly not as wide a gap.

But this is just some quick checks, not rigorous benchmarking or statistical analysis. Though it seems at least on this real link the difference is not as pronounced. This is why it would be good to get a bit more data on real links.l

poka-IT · 2026-05-07T20:03:12Z

Ran the paired same-session bench on real network as you suggested. Two scenarios, N=16 alternating per variant, both variants built from the same fork tree (vanilla = noq-v0.18.0 tag, patched = same tree + this PR's diff, sha256 confirms only the constants differ).

Setup: 3x Hetzner CCX23 (4 vCPU dedicated AMD EPYC-Milan, 16 GB, kernel 6.1, debian-12), one server fsn1-dc14, two clients (one fsn1-dc14 for intra-DC, one hel1-dc2 for inter-region EU). Sysctl on all three: rmem_max/wmem_max=16M, qdisc=fq, tcp_congestion_control=bbr. MTU 1280.

Bench loop: per variant, 16 iterations of noq-perf client --bi-requests 1 --duration 30 --interval 1 --initial-mtu 1280 with --upload-size 1000G --download-size 0 (upload-only) then the inverse (download-only). Sender CPU sampled with pidstat -p PID 1 30 (mean of 1s intervals). Throughput from the JSON output (sum bytes / duration). Alternated vanilla/patched per iteration to mitigate any time drift.

Scenario A: intra-DC fsn1 (~0.5ms RTT, REF iperf3 ~16-17 Gbps)

direction	variant	n	mean (Mbps)	median	stddev	CV%	P10	P50	P90
upload	vanilla	16	3324.7	3312.0	83.9	2.5	3247.6	3312.0	3448.6
upload	patched	16	3881.9	3873.9	57.8	1.5	3836.5	3873.9	3974.0
download	vanilla	16	3473.3	3489.0	105.0	3.0	3346.8	3489.0	3602.4
download	patched	16	3943.2	3943.9	149.0	3.8	3756.0	3943.9	4105.2

Upload: ratio mean 1.168x (+16.8%), Welch t = +21.88, p ≈ 1e-22.
Download: ratio mean 1.135x (+13.5%), Welch t = +10.31, p ≈ 6e-25.

CPU side, intra-DC:

direction	side	vanilla %CPU	patched %CPU	sender Mbps/%CPU ratio
upload	client (sender)	98.3	74.3 (-24 pts)	1.565x (+56.5%)
upload	server (receiver)	59.6	67.1 (+7.5 pts)	1.035x
download	server (sender)	90.4	74.1 (-16 pts)	1.400x (+40.0%)
download	client (receiver)	67.8	74.3 (+6.5 pts)	1.036x

Sender CPU drops noticeably, receiver CPU rises slightly because it has more bytes to process. Net: more throughput at the same or lower sender cost.

Scenario B: inter-region fsn1 ↔ hel1 (26.3ms RTT measured, Hetzner EU backbone)

direction	variant	n	mean (Mbps)	median	stddev	CV%	P10	P50	P90
upload	vanilla	16	365.6	365.9	1.3	0.4	364.1	365.9	367.1
upload	patched	16	359.4	358.5	2.5	0.7	356.9	358.5	363.1
download	vanilla	16	364.3	365.6	7.9	2.2	363.0	365.6	368.9
download	patched	16	365.9	368.1	7.8	2.1	364.7	368.1	369.0

Upload: ratio mean 0.983x (-1.7%), Welch t = -8.95, p ≈ 4e-10. Statistically significant negative, but small in absolute terms (-6 Mbps on 365).
Download: ratio mean 1.005x (+0.5%), Welch t = +0.60, p ≈ 0.55. Not significant.

So on this real link, the throughput is essentially flat, matching what you observed on UK ↔ Vienna. The plateau at ~365 Mbps is BDP-bound (RTT × default cwnd dominates), not syscall-bound. There's nothing the patch can do here on throughput alone.

CPU side, inter-region:

direction	side	vanilla %CPU	patched %CPU	sender Mbps/%CPU ratio
upload	client (sender)	31.1	25.7 (-5.4 pts)	1.186x (+18.6%)
upload	server (receiver)	9.9	7.1 (-2.8 pts)	1.363x
download	server (sender)	10.9	9.7 (-1.2 pts)	1.125x (+12.5%)
download	client (receiver)	20.2	20.1 (≈ flat)	1.009x

Even at flat throughput, the sender uses meaningfully less CPU. This is the syscall reduction the patch is supposed to deliver, and it shows up cleanly even when bandwidth is BDP-bound.

Takeaway

On a syscall-bound link (intra-DC, gigabit-class): clear +13-17% throughput, with sender Mbps/%CPU jumping +40 to +56%. p < 1e-22.
On a BDP-bound real link (26ms RTT): throughput is flat (-1.7% / +0.5%, the negative one is real but small). However, sender CPU drops 5-24% across all 4 directions tested. The patch is still doing its job, the link is just hiding it on the throughput axis.
Memory: at N=1 connection the 4x pre-allocated transmit space is invisible (all RSS deltas <1.5 MB across 64 sub-runs, mostly in the noise of the allocator).

So the conclusion I'd argue for: this isn't a "free 1.3x throughput everywhere" patch, and your UK ↔ Vienna result was correct. But on links where throughput can scale (LAN, datacenter, intra-region), the patch matters; and on links where it can't, it still reduces sender CPU. Both regimes are non-regressing on CPU efficiency. The only nominally-negative number in the whole bench is -1.7% upload throughput on the inter-region scenario, which is at the edge of what 16 paired runs can reliably measure on a backbone link.

If a synthetic-loss scenario (tc qdisc add netem loss 1% etc.) would help cover the worst-case real-internet path, happy to run that too. Same for adding a --datagrams mode to the perf binary if you want to exercise the patched path through datagrams specifically.

Reproduction script and aggregate Python (Welch t-test, P10/P50/P90, Mbps/%CPU sender/receiver) available on request, or I can paste them as a follow-up PR adding a small bench/network/ helper to this repo if that's the right shape.

n0bot Bot added this to iroh May 5, 2026

github-project-automation Bot moved this to 🚑 Needs Triage in iroh May 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(connection): raise MAX_TRANSMIT_SEGMENTS to 40 and MAX_TRANSMIT_DATAGRAMS to 80#636

perf(connection): raise MAX_TRANSMIT_SEGMENTS to 40 and MAX_TRANSMIT_DATAGRAMS to 80#636
poka-IT wants to merge 1 commit inton0-computer:mainfrom
poka-IT:warren-perf-tier1-clean

poka-IT commented May 5, 2026

Uh oh!

flub commented May 5, 2026

Uh oh!

poka-IT commented May 5, 2026 •

edited

Loading

Uh oh!

flub commented May 6, 2026

Uh oh!

poka-IT commented May 6, 2026 •

edited

Loading

Uh oh!

flub commented May 7, 2026

Uh oh!

poka-IT commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

poka-IT commented May 5, 2026

Description

Breaking Changes

Notes & open questions

Change checklist

Uh oh!

flub commented May 5, 2026

Uh oh!

poka-IT commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Hardware / network setup

Numbers

Why it's asymmetric

Reproduce

Caveats I noticed

Regressions

Uh oh!

flub commented May 6, 2026

Uh oh!

poka-IT commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

flub commented May 7, 2026

Uh oh!

poka-IT commented May 7, 2026

Scenario A: intra-DC fsn1 (~0.5ms RTT, REF iperf3 ~16-17 Gbps)

Scenario B: inter-region fsn1 ↔ hel1 (26.3ms RTT measured, Hetzner EU backbone)

Takeaway

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

poka-IT commented May 5, 2026 •

edited

Loading

poka-IT commented May 6, 2026 •

edited

Loading