Inspiration
Compiled C and C++ binaries account for a large share of the software vulnerabilities found in today's server and commercial codebases. Often these vulnerabilities can be detected and prevented by static analysis tools such as the Clang Static Analyzer. However, as exploits involving memory corruption and arbitrary control of program execution grow more complex and frequent, static methods are becoming ineffective at identifying all possible attack surfaces in a given program. We propose machine learning, a powerful data analysis technique for finding patterns in a wide variety of datasets, as a way to identify potential weak points in a program more quickly and effectively, so that they can be patched before deployment.
What it does
CodeHeat (short for automatic Code Heat Map analysis) is a machine-learning-based vulnerability detection tool built specifically for C and C++ programs, though its concepts extend readily to other languages. Instead of analyzing compiled binaries, as disassemblers such as IDA and Ghidra do, CodeHeat analyzes the source program directly: exactly what is visible to the developer. This offers several advantages. First, source-level analysis lets developers check their program as it is being written, without repeatedly waiting for compilation. Second, vulnerabilities reported at the source level are much easier to identify and fix than vulnerabilities in compiled code, which must first be mapped back to the original text.
How we built it
We used Keras, running atop TensorFlow, to generate, train, and evaluate the model. Since Keras is a Python library, all of our analysis programs were written in Python. The classifier consumes a sequence of tokens, so the C/C++ source files first had to be tokenized with a lexer, which we implemented from scratch in Python using the PLY (Python Lex-Yacc) library. The model itself consists of seven types of internal layers: (1) an embedding layer, (2) a reshaping layer, (3) a two-dimensional convolutional layer, (4) a maximum pooling layer, (5) a flattening layer, (6) a dropout layer, and (7) dense layers (there are three). Hyperparameters were selected according to a previous research paper investigating a similar vulnerability detection model.
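Our actual lexer was built with PLY and a much larger rule set, but the core idea of reducing raw C source to a small, flat symbol sequence can be sketched in plain Python. The token categories, keyword list, and regular expressions below are illustrative simplifications, not the rules we actually used:

```python
import re

# Simplified, hypothetical token categories; the real lexer used PLY with
# a larger rule set that reduced C/C++ source to roughly 156 symbol classes.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/|//[^\n]*"),
    ("STRING",  r"\"(?:\\.|[^\"\\])*\""),
    ("CHAR",    r"'(?:\\.|[^'\\])'"),
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("OP",      r"[{}()\[\];,.+\-*/%=<>!&|^~?:#]"),
    ("SKIP",    r"\s+"),
]
KEYWORDS = {"if", "else", "while", "for", "return", "int", "char", "void"}

MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.S)

def tokenize(source):
    """Reduce C source text to a flat symbol sequence.

    Identifiers collapse to a single ID symbol and literals to NUMBER /
    STRING / CHAR; this collapsing is what keeps the token space small
    enough for the model to learn from."""
    tokens = []
    for m in MASTER.finditer(source):
        kind = m.lastgroup
        if kind in ("SKIP", "COMMENT"):
            continue  # whitespace and comments carry no signal here
        if kind == "ID" and m.group() in KEYWORDS:
            tokens.append(m.group())   # keywords keep their own symbol
        elif kind == "OP":
            tokens.append(m.group())   # each operator is its own symbol
        else:
            tokens.append(kind)        # ID / NUMBER / STRING / CHAR collapse
    return tokens
```

For example, `tokenize("int x = 42; // set\nreturn x;")` yields `['int', 'ID', '=', 'NUMBER', ';', 'return', 'ID', ';']`: the identifier and the literal lose their spelling but keep their role, which is exactly the reduction the model needs.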
A convolutional neural network is apposite for this application because, once the tokens are embedded into a higher-dimensional space, a block of program text can be represented as an intensity image. In code, neighboring tokens are known to affect each other's meaning, and the convolution captures this proximity.
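The seven layer types above can be sketched as a Keras `Sequential` model. The specific hyperparameters here (vocabulary size, sequence length, embedding width, filter count, and window size) are placeholder assumptions for illustration, not the values we tuned:

```python
from tensorflow.keras import layers, models

def build_model(vocab_size=156, seq_len=500, embed_dim=13, window=9):
    """Sketch of the CodeHeat classifier: each token sequence is embedded,
    reshaped into a 2-D "intensity image", convolved, pooled, and finally
    classified as vulnerable or not. Hyperparameters are assumed values."""
    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        # (1) embed each token ID into a dense vector
        layers.Embedding(vocab_size, embed_dim),
        # (2) reshape the sequence of embeddings into a 1-channel image
        layers.Reshape((seq_len, embed_dim, 1)),
        # (3) convolve over windows of `window` consecutive tokens
        layers.Conv2D(64, (window, embed_dim), activation="relu"),
        # (4) keep the strongest response of each filter
        layers.MaxPooling2D(pool_size=(seq_len - window + 1, 1)),
        # (5) flatten to a feature vector
        layers.Flatten(),
        # (6) dropout for regularization
        layers.Dropout(0.5),
        # (7) three dense layers ending in a binary prediction
        layers.Dense(64, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Because each convolutional filter spans the full embedding width, it slides only along the token axis; each filter response therefore corresponds to one window of consecutive tokens, which is what makes the later heat-map idea possible.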
Challenges we ran into
Tokenizing the C code became our biggest challenge. The research paper we followed used a custom tokenizer that reduced the token space to 156 symbols, and we had a hard time matching that reduction while still accounting for all the different symbols that could appear.
Accomplishments that we're proud of
We picked an idea we thought was interesting and stuck with it from beginning to end, no matter the challenges. We had to overcome many hurdles, and although we didn't achieve the results we would have liked, we are very happy with the progress we made.
What we learned
We learned how to lex program text into a set of symbols that makes it easiest for a machine learning model to find patterns in the data. We also expanded our thinking about machine learning and its applicability to various problems: even though our datasets were text files (at most a one-dimensional string of characters), embedding the tokens into a higher-dimensional space and applying convolution made patterns clear that would otherwise be difficult to observe.
What's next for ML-Based Software Vulnerability Detection
To improve CodeHeat, the central model must be trained to better identify offending code. This can be accomplished by selecting token rules that more effectively capture the program's structure and meaning. We would also like to visualize which parts of the code are likely the most dangerous; such a heat map can be obtained by carefully mapping the activations at the beginning of the convolutional network back onto the source.
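One way this mapping could work, sketched in plain Python: each convolutional filter application scores one window of consecutive tokens, so a token's "heat" can be taken as the average score of every window that covers it. This is an illustrative sketch of the idea, not an implemented CodeHeat feature:

```python
def token_heat(activations, window=9):
    """Map per-window convolutional activation scores back onto tokens.

    activations: one score per position where a filter of width `window`
    was applied (length = n_tokens - window + 1). Each token's heat is the
    mean score of every window covering it, normalized to [0, 1] so it can
    be rendered as a heat map over the source text.
    """
    n_tokens = len(activations) + window - 1
    totals = [0.0] * n_tokens
    counts = [0] * n_tokens
    for start, score in enumerate(activations):
        for i in range(start, start + window):
            totals[i] += score
            counts[i] += 1
    heat = [t / c for t, c in zip(totals, counts)]
    lo, hi = min(heat), max(heat)
    span = (hi - lo) or 1.0  # avoid dividing by zero on flat input
    return [(h - lo) / span for h in heat]
```

For instance, with two overlapping windows of width 2 scoring 0.0 and 1.0, the middle token is covered by both and receives an intermediate heat, so hot regions blend smoothly into their neighbors rather than ending abruptly at window boundaries.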
Built With
- artificial-intelligence
- c
- c++
- deep-learning
- keras
- machine-learning
- python
- tensorflow