Enabling Microarchitectural Agility: Taking ML-KEM & ML-DSA from Cortex-M4 to M7 with SLOTHY

Amin Abdulrahman; Matthias J. Kannwischer; Thing-Han Lim

Paper 2025/366

Amin Abdulrahman, Max Planck Institute for Security and Privacy (MPI-SP), Bochum, Germany

Matthias J. Kannwischer, Quantum Safe Migration Center, Chelpis Quantum Corp, Taipei, Taiwan

Thing-Han Lim, Quantum Safe Migration Center, Chelpis Quantum Corp, Taipei, Taiwan

Abstract

Highly optimized assembly is commonly used to achieve the best performance for popular cryptographic schemes such as the newly standardized ML-KEM and ML-DSA. Most implementations today rely on hand-optimized assembly for the core building blocks to achieve both security and performance. However, recent work by Abdulrahman et al. takes a new approach: a readable base assembly implementation is written first, and the bulk of the optimization work is left to SLOTHY, a tool based on constraint programming. SLOTHY performs instruction scheduling, register allocation, and software pipelining simultaneously, using constraints that model the architectural and microarchitectural details of the target platform. In this work, we extend SLOTHY and investigate how it can be used to migrate already highly hand-optimized assembly to a different microarchitecture while maximizing performance. As a case study, we optimize state-of-the-art Arm Cortex-M4 implementations of ML-KEM and ML-DSA for the Arm Cortex-M7. Our results suggest that this approach is promising: for the number-theoretic transform (NTT), the core building block of both ML-DSA and ML-KEM, we achieve speed-ups of $1.97\times$ and $1.69\times$, respectively. For Keccak, the permutation used by SHA-3 and SHAKE and used extensively in ML-DSA and ML-KEM, we achieve speed-ups of 30% compared to the M4 code and 5% compared to hand-optimized M7 code. For many other building blocks, we achieve similarly significant speed-ups of up to $2.35\times$. Overall, this results in 11% to 33% faster code for the entire cryptosystems.
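For readers unfamiliar with the NTT named in the abstract, the following is a minimal Python sketch of the butterfly structure that optimized assembly kernels typically implement. It is not the authors' code and not the transform ML-KEM actually specifies: it is a plain cyclic NTT of length 256 modulo q = 3329 using 17 as a primitive 256th root of unity, whereas ML-KEM uses an incomplete negacyclic variant. The function names (`ntt`, `intt`, `cyclic_mul`, `schoolbook_cyclic`) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a plain cyclic NTT over Z_q. This is an assumption-
# laden teaching example, not the paper's implementation and not the incomplete
# negacyclic transform used by ML-KEM.
Q = 3329      # ML-KEM modulus
N = 256       # polynomial length
OMEGA = 17    # primitive 256th root of unity modulo Q


def ntt(a, omega, q=Q):
    """Recursive radix-2 Cooley-Tukey NTT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    omega_sq = omega * omega % q
    even = ntt(a[0::2], omega_sq, q)
    odd = ntt(a[1::2], omega_sq, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        # One butterfly: a modular multiplication, an addition, a subtraction.
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q
        out[k + n // 2] = (even[k] - t) % q
        w = w * omega % q
    return out


def intt(a_hat, omega=OMEGA, q=Q):
    """Inverse transform: forward NTT with omega^-1, then scale by n^-1."""
    n_inv = pow(len(a_hat), -1, q)
    return [x * n_inv % q for x in ntt(a_hat, pow(omega, -1, q), q)]


def cyclic_mul(a, b, omega=OMEGA, q=Q):
    """Multiply modulo (x^n - 1, q) via pointwise products in the NTT domain."""
    a_hat = ntt(a, omega, q)
    b_hat = ntt(b, omega, q)
    return intt([x * y % q for x, y in zip(a_hat, b_hat)], omega, q)


def schoolbook_cyclic(a, b, q=Q):
    """Quadratic-time reference multiplication modulo (x^n - 1, q)."""
    n = len(a)
    out = [0] * n
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[(i + j) % n] = (out[(i + j) % n] + x * y) % q
    return out
```

Calling `cyclic_mul(a, b)` on two length-256 coefficient lists should agree with `schoolbook_cyclic(a, b)`. The inner loop body is the kind of short, data-dependent instruction sequence whose scheduling and register allocation the abstract describes delegating to SLOTHY.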

Metadata

Available format(s): PDF
Category: Implementation
Publication info: Preprint.
Keywords: Post-Quantum Cryptography, Arm Cortex-M7, Arm Cortex-M4, ML-KEM, ML-DSA, Superoptimization, Constraint Solving
Contact author(s): amin @ abdulrahman de
matthias @ kannwischer eu
han lim @ chelpis com
History: 2025-03-04: approved; 2025-02-26: received
Short URL: https://ia.cr/2025/366
License: CC BY

BibTeX

@misc{cryptoeprint:2025/366,
      author = {Amin Abdulrahman and Matthias J. Kannwischer and Thing-Han Lim},
      title = {Enabling Microarchitectural Agility: Taking {ML}-{KEM} \& {ML}-{DSA} from Cortex-M4 to M7 with {SLOTHY}},
      howpublished = {Cryptology {ePrint} Archive, Paper 2025/366},
      year = {2025},
      url = {https://eprint.iacr.org/2025/366}
}