This post summarises recent progress in AI-enabled malware detection for the general public.
Why read a post about AI-enabled malware detection? It has large impacts in protecting technology for democratic elections; technology for biosecurity (ex: algorithms to screen dangerous biological products); the trained settings of AI algorithms; and more.
P.S. If you're familiar with applied AI, the technical version of this post may be more interesting to you.
Some early malware detectors saved distinctive parts of malware files (signatures) and later checked new files for them. These techniques mainly worked for re-detecting malware that had already been seen. Ex: an antivirus might save a sequence of bytes (the most basic 'letters' of computer code) from a malicious program. Then, it could detect this sequence later.
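To make the signature idea concrete, here is a minimal sketch in Python. The signature bytes, the name `toy-keylogger`, and the helper `scan_bytes` are all made up for illustration; real antivirus signature formats are more elaborate.

```python
from typing import List

# Hypothetical signature database: a name mapped to a saved byte sequence.
# These bytes are invented for this example, not taken from real malware.
KNOWN_SIGNATURES = {
    "toy-keylogger": bytes.fromhex("deadbeef4b4c"),
}

def scan_bytes(data: bytes) -> List[str]:
    """Return the names of all saved signatures that appear in the data."""
    return [name for name, sig in KNOWN_SIGNATURES.items() if sig in data]

# A file that contains the saved sequence somewhere in the middle.
suspicious_file = b"\x00\x01" + bytes.fromhex("deadbeef4b4c") + b"\x02\x03"
clean_file = b"just some ordinary bytes"

print(scan_bytes(suspicious_file))  # ['toy-keylogger']
print(scan_bytes(clean_file))       # []
```

Note that the scanner only reads the bytes. It never runs the file.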
But what if a hacker updates an old malware file to change that sequence? Well, the update would still have similar behaviour to older files. Ex: a malware file could record the keys you press on your keyboard. Thus, the behaviour (actions) of malware files can be tracked with less variation than raw bytes.
Note how this new technique must run malware files (to record their actions), whereas older techniques looking for a 'signature' only read a malware file's content. This difference is known as static analysis (read-only) vs. dynamic analysis (run).
Static analysis techniques are less risky since malware is not run. However, they must analyse a lot of information, since the entire file has to be read.
Dynamic analysis techniques need more safety precautions when running malware. This could mean running the malware on a test computer with no personal data. Also, dynamic analysis struggles to analyse all actions of a malware file. Ex: a hacker could program their malware to only run after two weeks of waiting. So any dynamic analysis that didn't wait two weeks wouldn't detect the malware. (Source)
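The waiting problem can be shown with a harmless simulation. Nothing is actually run here: the 'program' is just a list of action names, and the names `toy_sample_actions` and `sandbox_flags` are hypothetical helpers I made up for this sketch.

```python
from typing import List

def toy_sample_actions(delay_steps: int, total_steps: int) -> List[str]:
    """Simulate a program that stays idle, then starts logging keys."""
    return ["idle" if step < delay_steps else "log_keys"
            for step in range(total_steps)]

def sandbox_flags(actions: List[str], watch_steps: int) -> bool:
    """Flag the sample only if a suspicious action happens while we watch."""
    return "log_keys" in actions[:watch_steps]

# Treat one step as one day: the sample waits 14 days before acting.
actions = toy_sample_actions(delay_steps=14, total_steps=30)

print(sandbox_flags(actions, watch_steps=7))   # False: stopped watching too early
print(sandbox_flags(actions, watch_steps=21))  # True: watched past the delay
```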
Due to these tradeoffs, both techniques are used in practice.
AI algorithms have mainly been used for static (read-only) analysis. Here are the largest differences in AI algorithms in the field. (Source)
Preprocessing: how to prepare information about malware files before sending it to an AI algorithm.
Common preprocessing techniques include:
Of these, processing raw bytes is an ideal goal. No special steps are needed and the AI algorithm is easy to update. Still, it's hard since a malware file could have millions of bytes to process.
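As a rough sketch of what 'processing raw bytes' can look like, the snippet below turns a file's bytes into a fixed-length list of numbers. The cut-off length and the padding value 256 are my assumptions for illustration, not settings from any particular paper.

```python
from typing import List

def bytes_to_features(data: bytes, max_len: int = 16, pad_value: int = 256) -> List[int]:
    """Turn raw bytes into a fixed-length list of integers.

    Real byte values are 0-255, so pad_value (256, an assumption) can mark
    'no byte here' for files shorter than max_len. Longer files are cut off.
    """
    values = list(data[:max_len])
    return values + [pad_value] * (max_len - len(values))

print(bytes_to_features(b"MZ\x90\x00", max_len=8))
# [77, 90, 144, 0, 256, 256, 256, 256]
```

With a real file of millions of bytes, `max_len` would have to be in the millions too, which is where the difficulty comes from.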
Data Sources: it's rare for researchers to find high-quality data to train algorithms with. Companies may share their data with a few partners, but not everyone.
Aside box: problems with different data sources (Source)
Computational Needs: some AI algorithms train for weeks on expensive hardware specially made for AI algorithms. Other algorithms could train in seconds on a regular laptop. Unfortunately, these methods aren't always compared on the same task. So, it's unclear if more expensive technology brings extra performance. (Source)
Overall, this creates little research for securing simple devices. Examples include smart motion detectors, temperature sensors, medical monitors, etc. Yet, these devices are increasingly common in essential industries. (Source)
Some researchers are trying simple techniques to fix this gap. Here are two case studies that take inspiration from our immune systems and image processing techniques.
For a programmer, a zero-thought way to analyse a malware file might be a neural network. Oversimplified, this algorithm would receive bytes of the file, create settings to process each byte, and use the settings to decide if the file is malware. However, malware files with millions of bytes would need a lot of settings!
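Some quick arithmetic shows the scale of the problem. This sketch assumes a single fully connected layer where every input byte connects to every unit. The file size (2 million bytes) and layer width (128 units) are example numbers I picked, not figures from the research.

```python
def dense_layer_settings(n_inputs: int, n_units: int) -> int:
    """Count settings in one fully connected layer.

    One weight per (input, unit) pair, plus one bias per unit.
    """
    return n_inputs * n_units + n_units

print(dense_layer_settings(n_inputs=2_000_000, n_units=128))  # 256000128
```

That is over 256 million settings for just the first layer of the network.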
A variation of this algorithm can simplify these settings. Convolutional neural networks are algorithms often used for image processing. Their specialty is breaking an image into chunks and reusing the same settings to process every chunk.
This kind of reuse would be great across a file with millions of raw bytes. So what if we could split a large malware file into many smaller chunks, reusing the same settings to analyse every chunk?
This somewhat works, but it has issues. Specifically, files are one-dimensional code sequences, not a two-dimensional 'chunk' of numbers.
Some researchers got around these issues by just using one-dimensional convolutional neural networks. The key idea is still to break up a large sequence of bytes into smaller chunks, but these are one-dimensional chunks in a row instead of two-dimensional chunks in a square.
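Here is a minimal sketch of the chunk-and-reuse idea in plain Python. It is not the researchers' model: the kernel values are fixed by hand for illustration, whereas a real network would learn them during training and would usually transform the bytes first.

```python
from typing import List

def conv1d(values: List[int], kernel: List[int], stride: int) -> List[int]:
    """Slide one shared kernel along a sequence, one chunk at a time."""
    k = len(kernel)
    outputs = []
    for start in range(0, len(values) - k + 1, stride):
        chunk = values[start:start + k]
        # The same kernel (the same settings) is reused for every chunk.
        outputs.append(sum(v * w for v, w in zip(chunk, kernel)))
    return outputs

file_bytes = list(b"\x10\x20\x30\x40\x50\x60")  # [16, 32, 48, 64, 80, 96]
kernel = [1, 0, -1]                             # only 3 settings in total

print(conv1d(file_bytes, kernel, stride=3))     # [-32, -32]
```

However long the file is, the number of settings stays at `len(kernel)`. Only the number of chunks grows.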
All this resulted in 10x fewer settings than even the most efficient AI models, and 30x faster training times than comparable cybersecurity algorithms.
The last algorithm is efficient enough to run on mobile phones. But it still struggles with small devices like smart temperature sensors. The key problem is that neural networks need new computations to analyse every file.
The opposite approach is to run all computations needed for a malware detection algorithm ahead of time and then save the results. Thus, only storage space is used on the device, not processing power.
One algorithm that does this is an artificial immune system. It copies our bodies' immune systems. Specifically, our immune systems store tools called antibodies to spot harmful microorganisms later. But the artificial version stores a specific pattern (signature) from malware files to match against new files. (Source)
Still, the signature can't be like past algorithms that simply stored some bytes. These bytes vary a lot between malware files, making the signatures unhelpful for detecting new malware. So artificial immune systems model the way that antibodies evolve to generate signatures. These signatures match more kinds of malware.
First, here are the steps to set up the algorithm: (Source)
Next, here are the steps repeated while the algorithm is running: (Source)
Repeating those steps eventually makes signatures that resemble malware file data. We can then save these signatures on small devices. They can compare the data from any incoming file against these signatures. If the similarity is high enough, they filter those files as malware.
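The final matching step on the device could look roughly like the sketch below. It only covers the comparison against saved signatures, not how the signatures are evolved. The similarity measure (fraction of matching byte positions), the 0.75 threshold, and the signature bytes are all my assumptions for illustration.

```python
from typing import Iterable

def similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions where two equal-length byte strings agree."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("this sketch assumes non-empty, equal-length inputs")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def is_flagged(file_data: bytes, signatures: Iterable[bytes],
               threshold: float = 0.75) -> bool:
    """Flag the file if it is similar enough to any saved signature."""
    return any(similarity(file_data, sig) >= threshold for sig in signatures)

# Hypothetical signatures, generated ahead of time and saved on the device.
stored_signatures = [b"\xaa\xbb\xcc\xdd", b"\x01\x02\x03\x04"]

print(is_flagged(b"\xaa\xbb\xcc\x00", stored_signatures))  # True (3 of 4 match)
print(is_flagged(b"\x10\x20\x30\x40", stored_signatures))  # False
```

The device only does cheap comparisons here. All the expensive work of creating the signatures happened earlier, somewhere else.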
Using a slightly more complex variation, some researchers detected 99 out of 100 malware samples for small devices. (Source)
Having discussed recent advances in the field, where is improvement needed? I see two categories: technical improvements to algorithms and meta improvements to research.
P.S. This section is largely based on my personal opinions after about 100 hours of researching this topic.
The above case studies showed how malware detection algorithms are getting more efficient. Unfortunately, hackers can still change malware to get around detection systems. In technical terms, this is called creating 'adversarial examples.'
Two factors make adversarial examples in malware detection more challenging than in other AI applications.
AI algorithms can continuously train to detect new malware examples. Still, this is reactive, not proactive. New malware will spread faster and faster, especially if hackers use AI algorithms to generate it. Thus, a proactive solution is preferable.
A potential next step is to proactively modify malware examples to simulate what hackers might do. These simulated examples could train AI algorithms proactively. Though this strategy is already researched, it has risky side effects. Hackers could use these algorithms to modify their own malware.
Overall, more research is needed to keep malware detection algorithms working after hackers change their techniques (ideally, with few side risks).
A secondary problem is ensuring that malware detection algorithms are secured against 'backdoors.' (Source) Backdoors cause an algorithm to behave unexpectedly when given very specific inputs. Still, this is a secondary problem. It's currently much easier for hackers to bypass malware detection algorithms by updating malware than by creating backdoors. (This is because the hackers would have to influence the training process of a malware detection algorithm to create a backdoor, not just send it new input.)
In addition to the technical points noted above, there are also more general research practices that would help this field. The most important practices are noted in this paper.
First, more researchers need to compare state-of-the-art algorithms with simple algorithms. This will help determine if the complex models are 'worth it' due to extra performance. For instance, the complicated neural network in the first case study could be compared to human-made checklists of clues about malware.
Next, research papers need more transparent reports of how data were handled.
Finally, research on malware detection should report a standard set of evaluation metrics: accuracy, precision, recall, ROC curves, AUC, and the count of data points in various classes.
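A small worked example shows why reporting several metrics (and the class counts) matters. The counts below are made up: 1,000 files, of which only 10 are malware, and a useless detector that never flags anything. The helper name `basic_metrics` is mine.

```python
from typing import Dict

def basic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision and recall from confusion-matrix counts.

    tp/fn: malware files flagged / missed.
    fp/tn: clean files flagged / correctly left alone.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Conventionally reported as 0.0 when nothing was flagged at all.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# A detector that flags nothing: all 10 malware files are missed.
print(basic_metrics(tp=0, fp=0, tn=990, fn=10))
# {'accuracy': 0.99, 'precision': 0.0, 'recall': 0.0}
```

On its own, 99% accuracy sounds impressive. The recall of 0 and the class counts reveal that the detector caught no malware at all.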
Overall, AI-enabled malware detection is an impactful problem that could be called "the ultimate robustness challenge." Personally, I expect the techniques developed in this field will help the general AI safety field to progress.
As outlined above, however, the field is still crippled by the challenge of keeping algorithms effective even as hackers actively work against them. A lot of interesting work remains to be done to fix this. So I hope that the explanations and citations above will help more minds to work on this. If you have any questions or feedback on my writing, please feel free to comment and I will happily explain my reasoning in more depth :-)