device, tsasm/arm: add ChaCha20-Poly1305 assembly for 32-bit GOARCH=arm #58

Open
bradfitz wants to merge 1 commit into tailscale from bradfitz/arm32_asm

bradfitz commented May 7, 2026

golang.org/x/crypto ships no chacha20poly1305 assembly on 32-bit ARM,
so the WireGuard data-path AEAD falls back to a slow pure-Go path.
The previous AF_ALG approach (commit 30595e7) borrowed the kernel's
NEON crypto via a socket; that helped on NEON hardware but cost a
syscall per chunk and was a net loss on ARMv6.

This adds a real in-process assembly path, regenerated from the
upstream CryptoGAMS Perl scripts (chacha-armv4.pl, poly1305-armv4.pl
from Andy Polyakov's CryptoGAMS distribution, dual-licensed
BSD-3-Clause / OpenSSL -- see tsasm/arm/regen/LICENSE.cryptogams).
The .pl files are NOT vendored: tsasm/arm/regen/regen.sh fetches them
on demand from a Tailscale-hosted mirror at a pinned commit SHA, with
SHA256 verification on each file. Updating upstream is a SHA bump in
regen.sh plus re-running it.
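
The per-file SHA256 check can be illustrated with a minimal Go sketch. This is not the contents of regen.sh (which is a shell script not shown here); the function name `verifySHA256` is hypothetical, and the input is the standard "abc" SHA-256 test vector rather than a real upstream file.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifySHA256 returns an error unless data hashes to wantHex.
// wantHex stands in for a digest pinned next to the commit SHA
// (hypothetical; the real pinned values live in regen.sh).
func verifySHA256(name string, data []byte, wantHex string) error {
	sum := sha256.Sum256(data)
	got := hex.EncodeToString(sum[:])
	if got != wantHex {
		return fmt.Errorf("%s: SHA256 mismatch: got %s, want %s", name, got, wantHex)
	}
	return nil
}

func main() {
	// Placeholder input: the well-known SHA-256 test vector for "abc".
	const pinned = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

	fmt.Println(verifySHA256("example.pl", []byte("abc"), pinned)) // matching content
	fmt.Println(verifySHA256("example.pl", []byte("abd"), pinned)) // tampered content
}
```

The first call reports no error; the second reports a mismatch, which is the point at which a fetch script would abort rather than regenerate assembly from unverified input.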

A new plan9-xlate.pl post-processor turns the GAS-syntax output into
Plan 9 ARM assembly the Go assembler accepts; a companion
neon_encode.pl encodes NEON instructions in pure Perl since Go's ARM
assembler rejects every NEON mnemonic. NEON dispatch happens at
runtime via golang.org/x/sys/cpu's HasNEON, so a single binary
covers both NEON and non-NEON ARM.

Performance, 1420-byte payload (typical WireGuard packet), median of
three runs:

    Hardware                     pure-Go  AF_ALG  asm (this CL)
                                 (MB/s)   (MB/s)  (MB/s)
    Pi 1 ARMv6 (no NEON)         6.3      4.7     15.1   (scalar)
    Pi 4 ARMv7+NEON (GOARM=7)    50       73      131.5  (NEON)

The asm path beats AF_ALG by ~1.8x on Pi 4 (no kernel transition per
chunk) and is ~2.4x faster than pure-Go on Pi 1 where there's no
NEON path at all (CryptoGAMS's optimized scalar inner loop).
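
For reference, the quoted ratios follow directly from the table. The short Go sketch below recomputes them; the helper name `speedup` is illustrative, and the two extra ratios (Pi 1 vs AF_ALG, Pi 4 vs pure-Go) are derived from the same figures rather than stated in the text.

```go
package main

import "fmt"

// speedup returns how many times faster newMBps is than oldMBps.
func speedup(newMBps, oldMBps float64) float64 {
	return newMBps / oldMBps
}

func main() {
	// Throughput figures in MB/s, copied from the table above.
	fmt.Printf("Pi 4 asm vs AF_ALG:  %.1fx\n", speedup(131.5, 73))
	fmt.Printf("Pi 4 asm vs pure-Go: %.1fx\n", speedup(131.5, 50))
	fmt.Printf("Pi 1 asm vs pure-Go: %.1fx\n", speedup(15.1, 6.3))
	fmt.Printf("Pi 1 asm vs AF_ALG:  %.1fx\n", speedup(15.1, 4.7))
}
```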

Setting TS_WG_ASM=0 in the environment forces the pure-Go
x/crypto implementation, as an escape hatch for hardware
regressions or asm bugs.
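
The selection logic described above (environment escape hatch first, then runtime NEON detection) might look roughly like the following. This is a sketch under stated assumptions, not the code in this CL: the type and function names are hypothetical, only the value "0" is treated as disabling the asm path because the handling of other values is not described here, and the NEON flag is passed in as a parameter where a real build would read `cpu.ARM.HasNEON` from golang.org/x/sys/cpu and the variable from `os.Getenv("TS_WG_ASM")`.

```go
package main

import "fmt"

// impl names an AEAD implementation. The names are illustrative only.
type impl string

const (
	implPureGo    impl = "pure-Go"
	implASMScalar impl = "asm-scalar"
	implASMNEON   impl = "asm-NEON"
)

// chooseImpl picks an implementation from the TS_WG_ASM value and a
// NEON capability flag. Assumption: only "0" disables the asm path.
func chooseImpl(tsWGASM string, hasNEON bool) impl {
	if tsWGASM == "0" {
		return implPureGo // escape hatch: force the x/crypto pure-Go path
	}
	if hasNEON {
		return implASMNEON // e.g. Pi 4 with GOARM=7
	}
	return implASMScalar // e.g. Pi 1 ARMv6, no NEON
}

func main() {
	fmt.Println(chooseImpl("", true))
	fmt.Println(chooseImpl("", false))
	fmt.Println(chooseImpl("0", true))
}
```

Because the decision is made at runtime rather than by build tags, the same binary can take the NEON path on one board and the scalar path on another.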

See tsasm/arm/README.md for design notes. It documents the
non-trivial parts of plan9-xlate.pl, such as Plan 9's frame
layout (the SP shift for the auto-saved LR), R10=g shadowing,
NEON encoding strategy, the data-label folding trick for
sigma/one/rot8, and the Go-side length trimming that avoids
cross-function branches in chacha NEON.

Updates tailscale/tailscale#7053

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
bradfitz requested a review from raggi May 7, 2026 04:47