Submitted by vintergroena t3_123asbg in MachineLearning
Looking at how GPT can work with source code mixed with natural language, I am thinking that similar techniques could perhaps be used to construct a decent decompiler. Consider a language like C. There are plenty of open-source projects that could be compiled, and the resulting (source code, compiled code) pairs could be used as a dataset to train a generative model to learn the inverse operation from data. Of course, the model would need to fill in the information lost during compilation (variable names, etc.) in a human-understandable way, but given recent language models and how well they handle source code, this now seems rather doable. Is anyone working on this already? I would consider such an application to be extremely useful.
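
To make the dataset idea concrete, here is a minimal sketch of how one might generate such (compiled code, source code) training pairs, assuming `gcc` and `objdump` are on the PATH. The helper name and the JSON record format are made up for illustration, not from any existing pipeline:

```python
import json
import subprocess
import tempfile
from pathlib import Path

def make_training_pair(c_source: str) -> dict:
    """Compile a C snippet and disassemble it, returning a
    (compiled code, source code) pair for seq2seq training.
    Hypothetical helper; assumes gcc and objdump are installed."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input.c"
        obj = Path(tmp) / "input.o"
        src.write_text(c_source)
        # Compile without debug info, so the model has to recover
        # names and structure on its own, as described above.
        subprocess.run(
            ["gcc", "-O1", "-c", str(src), "-o", str(obj)],
            check=True, capture_output=True,
        )
        # Disassemble the object file; the text disassembly is the
        # model's input, the original source is its target.
        asm = subprocess.run(
            ["objdump", "-d", "--no-show-raw-insn", str(obj)],
            check=True, capture_output=True, text=True,
        ).stdout
    return {"input": asm, "target": c_source}

if __name__ == "__main__":
    example = "int add(int a, int b) { return a + b; }\n"
    print(json.dumps(make_training_pair(example), indent=2)[:400])
```

Running this over many functions (and at several optimization levels, since `-O2`/`-O3` output looks very different) would give the paired corpus; the generative model is then trained to map `input` back to `target`.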
mil24havoc t1_jdu05qo wrote
There's already research on this. For example, see "DIRECT: A Transformer Model for Decompiled Variable Name Recovery" by Nitin et al.