Paper 2025/366

Enabling Microarchitectural Agility: Taking ML-KEM & ML-DSA from Cortex-M4 to M7 with SLOTHY

Amin Abdulrahman, Max Planck Institute for Security and Privacy (MPI-SP), Bochum, Germany
Matthias J. Kannwischer, Quantum Safe Migration Center, Chelpis Quantum Corp, Taipei, Taiwan
Thing-Han Lim, Quantum Safe Migration Center, Chelpis Quantum Corp, Taipei, Taiwan
Abstract

Highly optimized assembly is commonly used to achieve the best performance for popular cryptographic schemes such as the newly standardized ML-KEM and ML-DSA. The majority of implementations today rely on hand-optimized assembly for the core building blocks to achieve both security and performance. However, recent work by Abdulrahman et al. takes a new approach, writing a readable base assembly implementation first and leaving the bulk of the optimization work to a tool named SLOTHY based on constraint programming. SLOTHY performs instruction scheduling, register allocation, and software pipelining simultaneously using constraints modeling the architectural and microarchitectural details of the target platform. In this work, we extend SLOTHY and investigate how it can be used to migrate already highly hand-optimized assembly to a different microarchitecture, while maximizing performance. As a case study, we optimize state-of-the-art Arm Cortex-M4 implementations of ML-KEM and ML-DSA for the Arm Cortex-M7. Our results suggest that this approach is promising: For the number-theoretic transform (NTT) – the core building block of both ML-DSA and ML-KEM – we achieve speed-ups of $1.97\times$ and $1.69\times$, respectively. For Keccak – the permutation used by SHA-3 and SHAKE and also used extensively in ML-DSA and ML-KEM – we achieve speed-ups of 30% compared to the M4 code and 5% compared to hand-optimized M7 code. For many other building blocks, we achieve similarly significant speed-ups of up to $2.35\times$. Overall, this results in 11 to 33% faster code for the entire cryptosystems.

Metadata
Available format(s)
PDF
Category
Implementation
Publication info
Preprint.
Keywords
Post-Quantum Cryptography, Arm Cortex-M7, Arm Cortex-M4, ML-KEM, ML-DSA, Superoptimization, Constraint Solving
Contact author(s)
amin @ abdulrahman de
matthias @ kannwischer eu
han lim @ chelpis com
History
2025-03-04: approved
2025-02-26: received
Short URL
https://ia.cr/2025/366
License
Creative Commons Attribution
CC BY

BibTeX

@misc{cryptoeprint:2025/366,
      author = {Amin Abdulrahman and Matthias J. Kannwischer and Thing-Han Lim},
      title = {Enabling Microarchitectural Agility: Taking {ML}-{KEM} \& {ML}-{DSA} from Cortex-M4 to M7 with {SLOTHY}},
      howpublished = {Cryptology {ePrint} Archive, Paper 2025/366},
      year = {2025},
      url = {https://eprint.iacr.org/2025/366}
}