Paper 2024/689

Automated Creation of Source Code Variants of a Cryptographic Hash Function Implementation Using Generative Pre-Trained Transformer Models

Elijah Pelofske, Sandia National Laboratories
Vincent Urias, Sandia National Laboratories
Lorie M. Liebrock, New Mexico Cybersecurity Center of Excellence, New Mexico Tech, Sandia National Laboratories
Abstract

Generative pre-trained transformers (GPTs) are a type of large language model (LLM) that is unusually adept at producing novel and coherent natural language. Notably, these technologies have also been extended to computer programming languages with great success. However, GPT model outputs are in general stochastic and not always correct. For programming languages, exact syntactic and algorithmic correctness of the code is strictly required to ensure the security of computing systems and applications. Using GPT models to generate computer code therefore poses an important security risk, while at the same time allowing potential innovation in how computer code is generated. This study examines the ability of GPT models to generate novel and correct, and notably very insecure, implementations of the cryptographic hash function SHA-1. The GPT models Llama-2-70b-chat-hf, Mistral-7B-Instruct-v0.1, and zephyr-7b-alpha are used. The models are prompted to re-write each function of a reference implementation using a modified version of the localGPT framework, with langchain providing word-embedding context from the full source code and header files. This produced over $130,000$ function re-write output text blocks (each potentially valid source code), approximately $40,000$ of which could be parsed as C code and subsequently compiled. The generated code is analyzed for compilability, algorithmic correctness, memory leaks, compiler-optimization stability, and character distance to the reference implementation. Remarkably, several generated function variants pose a high implementation security risk: they are correct for some test vectors but incorrect for others. Additionally, many function implementations did not match the reference SHA-1 algorithm, yet produced hashes that exhibit some of the basic characteristics of hash functions. Many of the function re-writes contained serious flaws such as memory leaks, integer overflows, out-of-bounds accesses, use of uninitialized values, and compiler-optimization instability. Compiler optimization settings and SHA-256 checksums of the compiled binaries are used to cluster implementations that are functionally equivalent but not syntactically identical; using this clustering, over $100,000$ distinct, novel, and correct versions of the SHA-1 codebase were generated, in which every component C function differs from the reference implementation.
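
The paper's actual pipeline is not reproduced on this page. The following is only a minimal sketch in Python of the correctness checking and checksum clustering the abstract describes; the variants/ directory, the gcc invocation, and the convention that each compiled variant reads a message on stdin and prints its hex digest are illustrative assumptions, not details taken from the paper.

import hashlib
import subprocess
from collections import defaultdict
from pathlib import Path

# Standard SHA-1 test vectors (message -> expected hex digest).
TEST_VECTORS = {
    b"abc": "a9993e364706816aba3e25717850c26c9cd0d89d",
    b"": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
}

def compile_variant(source: Path, opt: str) -> Path | None:
    # Compile one generated variant at a given optimization level (-O0 .. -O3).
    binary = source.parent / f"{source.stem}{opt}.bin"
    result = subprocess.run(["gcc", opt, "-o", str(binary), str(source)],
                            capture_output=True)
    return binary if result.returncode == 0 else None

def passes_test_vectors(binary: Path) -> bool:
    # A variant that passes some vectors but fails others is exactly the
    # flagged risk, so every vector must match before it counts as correct.
    for message, expected in TEST_VECTORS.items():
        out = subprocess.run([str(binary)], input=message, capture_output=True)
        if out.stdout.decode(errors="replace").strip().lower() != expected:
            return False
    return True

def cluster_by_checksum(binaries: list[Path]) -> dict[str, list[Path]]:
    # Variants with different source syntax that compile to byte-identical
    # binaries land in the same cluster.
    clusters: dict[str, list[Path]] = defaultdict(list)
    for binary in binaries:
        clusters[hashlib.sha256(binary.read_bytes()).hexdigest()].append(binary)
    return dict(clusters)

if __name__ == "__main__":
    correct = []
    for source in Path("variants").glob("*.c"):
        for opt in ("-O0", "-O1", "-O2", "-O3"):
            binary = compile_variant(source, opt)
            if binary is not None and passes_test_vectors(binary):
                correct.append(binary)
    for digest, members in sorted(cluster_by_checksum(correct).items()):
        print(digest[:16], len(members))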

Metadata
Available format(s)
PDF
Category
Implementation
Publication info
Preprint.
Keywords
GPT, SHA-1, Cryptographic Hash Function, C Implementation, Generative Pre-Trained Transformer, Machine Learning, LLM
Contact author(s)
elijah pelofske @ protonmail com
History
2024-05-06: approved
2024-05-06: received
Short URL
https://ia.cr/2024/689
License
Creative Commons Attribution
CC BY

BibTeX

@misc{cryptoeprint:2024/689,
      author = {Elijah Pelofske and Vincent Urias and Lorie M. Liebrock},
      title = {Automated Creation of Source Code Variants of a Cryptographic Hash Function Implementation Using Generative Pre-Trained Transformer Models},
      howpublished = {Cryptology ePrint Archive, Paper 2024/689},
      year = {2024},
      note = {\url{https://eprint.iacr.org/2024/689}},
      url = {https://eprint.iacr.org/2024/689}
}