DeepObfusCode: Source Code Obfuscation Through Sequence-to-Sequence Networks

The proposed and implemented system provides allows cloud computing companies to execute obfuscated code (non-plaintext source code), and allows clients to obfucate code through a seq2seq-based obfuscation network. Other than the applications of neural network archictectures in source code obfuscation, the paper released also introduces a quantitative framework to evaluate obfuscation methods.


Technical summary

  • Implemented reversible character-embedded encoder-decoder model that takes plaintext input, recursively generates obfuscated code to ensure the execution program can run the obfiscated code without error, then returns obfuscated code, h5 model files, and char/word-to-index dictionaries

  • Set up experimental pipeline to obfuscate benchmark source code, and compare/plot defined metrics between benchmark obfuscated and seq2seq obfuscated code


The three segments of the model is (i) ciphertext generation, (ii) key generation, (iii) execution. First, we use an encoder-decoder model to generate a ciphertext from the original source code, which automates the obfuscation process (in contrast to the usual obfuscation practice of manually altering the logic of the program, or inserting non-functional code).


Then the code owner trains the key generation network using the source code and obfuscated code as inputs, and retrieves the model weights.


The machine host (the owner of the machine that will execute the obfuscated code) has an execution system that takes the model files and obfuscated code as inputs, and the code can be executed without the host reading the code in plaintext.


To make comparisons to existing obfuscation methods, it was not feasible to use the qualitative methods used in prior work. The paper compared manually-obfuscated code to the proposed obfuscation method using:0 (i) the length of the source code, (ii) the Levenstein distance (as a proxy for difference between the original source code and the obfuscated code), (iii) the character variation in the obfuscated code, (iv) length of the obfuscated code.