AArch64: NEON omatcopy CT/RT kernels (s/d)#5843

Open
artem-dmitriev wants to merge 1 commit into OpenMathLib:develop from artem-dmitriev:omatcopy
@artem-dmitriev

Contributor

AArch64 has no vectorized transpose copy: all variants currently fall back to the scalar generic kernel. This PR adds NEON CT/RT kernels for single and double precision (s/d), built on an in-register transpose followed by `stnp` stores.
Passes the utest extension suite (1460/1460).
Benchmark on Neoverse-N1 (domatcopy, 1 thread): the scalar path degrades as the matrix size grows while the NEON kernel stays flat, giving roughly 1.2x at 2k and up to ~4.5x at 18k. The gap is larger in single precision.
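
For readers unfamiliar with the register-transpose idea, here is a minimal sketch for the double-precision case. It is not the kernel from this PR. The function names `ref_omatcopy_ct` and `tiled_omatcopy_ct` are hypothetical, "CT" is assumed to mean a column-major transpose copy (`B = alpha * A^T`), the tile is only 2x2, and it uses ordinary `vst1q_f64` stores rather than `stnp`. On non-AArch64 targets the tile falls back to scalar code so the sketch still compiles.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference (hypothetical name): A is rows x cols, column-major with
 * leading dimension lda; B receives alpha * A^T, cols x rows, leading dim ldb. */
void ref_omatcopy_ct(size_t rows, size_t cols, double alpha,
                     const double *a, size_t lda, double *b, size_t ldb)
{
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            b[j + i * ldb] = alpha * a[i + j * lda];
}

/* Same operation, processed in 2x2 tiles that are transposed in registers. */
void tiled_omatcopy_ct(size_t rows, size_t cols, double alpha,
                       const double *a, size_t lda, double *b, size_t ldb)
{
    assert(lda >= rows && ldb >= cols);
    size_t rows2 = rows & ~(size_t)1;   /* largest even count <= rows */
    size_t cols2 = cols & ~(size_t)1;   /* largest even count <= cols */
    size_t i, j;

    for (j = 0; j < cols2; j += 2) {
        for (i = 0; i < rows2; i += 2) {
            const double *src = a + i + j * lda;
            double *dst = b + j + i * ldb;
#if defined(__aarch64__)
            float64x2_t c0 = vld1q_f64(src);        /* A(i,j),   A(i+1,j)   */
            float64x2_t c1 = vld1q_f64(src + lda);  /* A(i,j+1), A(i+1,j+1) */
            float64x2_t r0 = vtrn1q_f64(c0, c1);    /* A(i,j),   A(i,j+1)   */
            float64x2_t r1 = vtrn2q_f64(c0, c1);    /* A(i+1,j), A(i+1,j+1) */
            vst1q_f64(dst,       vmulq_n_f64(r0, alpha));
            vst1q_f64(dst + ldb, vmulq_n_f64(r1, alpha));
#else
            /* Portable fallback: the same 2x2 tile, element by element. */
            dst[0]       = alpha * src[0];
            dst[1]       = alpha * src[lda];
            dst[ldb]     = alpha * src[1];
            dst[ldb + 1] = alpha * src[lda + 1];
#endif
        }
        /* Leftover row when rows is odd. */
        for (; i < rows; i++) {
            b[j + i * ldb]     = alpha * a[i + j * lda];
            b[j + 1 + i * ldb] = alpha * a[i + (j + 1) * lda];
        }
    }
    /* Leftover column when cols is odd. */
    for (; j < cols; j++)
        for (i = 0; i < rows; i++)
            b[j + i * ldb] = alpha * a[i + j * lda];
}
```

The point of the tile is that both the loads and the stores touch contiguous memory, whereas the scalar loop strides through one of the two matrices. A production kernel would presumably use larger tiles and handle single precision with 4x4 blocks.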
