device, tsasm/arm: add ChaCha20-Poly1305 assembly for 32-bit GOARCH=arm #58
Open
golang.org/x/crypto ships no chacha20poly1305 assembly for 32-bit ARM,
so the WireGuard data-path AEAD falls back to a slow pure-Go path.
The previous AF_ALG approach (commit 30595e7) borrowed the kernel's
NEON crypto via a socket; that helped on NEON hardware but cost a
syscall per chunk and was a net loss on ARMv6.

This adds a real in-process assembly path, generated from Andy
Polyakov's upstream CryptoGAMS Perl scripts (chacha-armv4.pl and
poly1305-armv4.pl, dual-licensed BSD-3-Clause / OpenSSL; see
tsasm/arm/regen/LICENSE.cryptogams).

The .pl files are NOT vendored: tsasm/arm/regen/regen.sh fetches them
on demand from a Tailscale-hosted mirror at a pinned commit SHA, with
SHA256 verification on each file. Updating to a newer upstream is a
SHA bump in regen.sh plus a re-run.

A new plan9-xlate.pl post-processor turns the GAS-syntax output into
Plan 9 ARM assembly that the Go assembler accepts; a companion
neon_encode.pl encodes NEON instructions in pure Perl, since Go's ARM
assembler rejects every NEON mnemonic. NEON dispatch happens at
runtime via golang.org/x/sys/cpu's HasNEON, so a single binary
covers both NEON and non-NEON ARM.

Performance in MB/s, 1420-byte payload (typical WireGuard packet),
median of three runs:

    Hardware                     pure-Go   AF_ALG   asm (this CL)
    Pi 1 ARMv6 (no NEON)           6.3       4.7     15.1 (scalar)
    Pi 4 ARMv7+NEON (GOARM=7)     50        73      131.5 (NEON)

The asm path beats AF_ALG by ~1.8x on Pi 4 (no kernel transition per
chunk) and is ~2.4x faster than pure-Go on Pi 1, which has no NEON
path at all; the gain there comes from CryptoGAMS's optimized scalar
inner loop.

Setting TS_WG_ASM=0 in the environment forces the pure-Go
x/crypto implementation, as an escape hatch for hardware
regressions or asm bugs.
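
As a rough illustration of the runtime dispatch and the TS_WG_ASM
escape hatch described above, the selection logic might look like
the sketch below. This is not the code in this CL: the chooseImpl
function and the impl constants are hypothetical names, and the
hasNEON argument stands in for cpu.ARM.HasNEON from
golang.org/x/sys/cpu so that the example builds with the standard
library alone. Treating every value other than "0" as "use asm" is
also an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// impl names a ChaCha20-Poly1305 backend (hypothetical labels).
type impl string

const (
	implPureGo impl = "pure-Go"    // x/crypto fallback
	implScalar impl = "asm-scalar" // CryptoGAMS scalar path (e.g. ARMv6)
	implNEON   impl = "asm-NEON"   // CryptoGAMS NEON path
)

// chooseImpl picks a backend from the TS_WG_ASM value and a NEON flag.
// In real code hasNEON would come from cpu.ARM.HasNEON
// (golang.org/x/sys/cpu); it is a parameter here to keep the sketch
// self-contained.
func chooseImpl(tsWGASM string, hasNEON bool) impl {
	if tsWGASM == "0" {
		// Escape hatch: force the pure-Go implementation.
		return implPureGo
	}
	if hasNEON {
		return implNEON
	}
	return implScalar
}

func main() {
	hasNEON := false // placeholder for cpu.ARM.HasNEON
	fmt.Println(chooseImpl(os.Getenv("TS_WG_ASM"), hasNEON))
}
```

Making the decision once, at startup, would keep the per-packet
path free of environment lookups and feature checks.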

tsasm/arm/README.md documents the non-trivial parts of
plan9-xlate.pl, including Plan 9's frame layout (the SP shift for
the auto-saved LR), R10=g shadowing, the NEON encoding strategy, the
data-label folding trick for sigma/one/rot8, and the Go-side length
trimming that avoids cross-function branches in chacha NEON.

Updates tailscale/tailscale#7053

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>