Paper 2025/366
Enabling Microarchitectural Agility: Taking ML-KEM & ML-DSA from Cortex-M4 to M7 with SLOTHY
Abstract
Highly-optimized assembly is commonly used to achieve the best performance for popular cryptographic schemes such as the newly standardized ML-KEM and ML-DSA. The majority of implementations today rely on hand-optimized assembly for the core building blocks to achieve both security and performance. However, recent work by Abdulrahman et al. takes a new approach, writing a readable base assembly implementation first and leaving the bulk of the optimization work to a tool named SLOTHY based on constraint programming. SLOTHY performs instruction scheduling, register allocation, and software pipelining simultaneously using constraints modeling the architectural and microarchitectural details of the target platform. In this work, we extend SLOTHY and investigate how it can be used to migrate already highly hand-optimized assembly to a different microarchitecture, while maximizing performance. As a case study, we optimize state-of-the-art Arm Cortex-M4 implementations of ML-KEM and ML-DSA for the Arm Cortex-M7. Our results suggest that this approach is promising: For the number-theoretic transform (NTT) – the core building block of both ML-DSA and ML-KEM – we achieve speed-ups of $1.97\times$ and $1.69\times$, respectively. For Keccak – the permutation used by SHA-3 and SHAKE and also vastly used in ML-DSA and ML-KEM – we achieve speed-ups of 30% compared to the M4 code and 5% compared to hand-optimized M7 code. For many other building blocks, we achieve similarly significant speed-ups of up to $2.35\times$. Overall, this results in 11 to 33% faster code for the entire cryptosystems.
Metadata
- Available format(s)
-
PDF
- Category
- Implementation
- Publication info
- Preprint.
- Keywords
- Post-Quantum CryptographyArm Cortex-M7Arm Cortex-M4ML-KEMML-DSASuperoptimizationConstraint Solving
- Contact author(s)
-
amin @ abdulrahman de
matthias @ kannwischer eu
han lim @ chelpis com - History
- 2025-03-04: approved
- 2025-02-26: received
- See all versions
- Short URL
- https://ia.cr/2025/366
- License
-
CC BY
BibTeX
@misc{cryptoeprint:2025/366, author = {Amin Abdulrahman and Matthias J. Kannwischer and Thing-Han Lim}, title = {Enabling Microarchitectural Agility: Taking {ML}-{KEM} & {ML}-{DSA} from Cortex-M4 to M7 with {SLOTHY}}, howpublished = {Cryptology {ePrint} Archive, Paper 2025/366}, year = {2025}, url = {https://eprint.iacr.org/2025/366} }