It’s the year 2018, somewhere around the end of October, on one of the most beautiful islands in the world. The weather is cold and rainy, and I am just about to finish my talk about the research we do at MRG Effitas. I throw a lot of technical words at the audience. Some follow my talk, others can only think about their next coffee. The city is surrounded by boring geysers, glaciers and volcanoes, but luckily the “conference” is about super interesting and exciting standards, procedures and templates discussing how to test antivirus products in a fair way.
After the end of my talk, some rush to their daily dose of coffee, others are still processing what I just said. And suddenly someone familiar greets me. His name is Dr. Hyrum Anderson and he works for Endgame. He is very enthusiastic about what he is going to tell me, I can see that. He shares his idea about organizing a machine learning evasion competition. As we both work with malware on a daily basis, we both know this is not about sticking our heads in the sand so that we can avoid talking about ML in the future. I like the idea. I mean I love the idea. Hyrum’s team can provide ML detection models for the competition, and we can hunt together for samples for the test. MRG Effitas can host the malware samples and I can create a submission platform where contestants can submit their modified malware samples in the hope of bypassing the chosen ML models.
It is easy to modify a sample in a way that it is not detected by ML models. It is a bit more challenging when the sample is a Windows executable file because the modifications can change the behavior of the modified malware sample. Therefore, we have to make sure that this does not happen. Luckily, there are already solutions t
I still remember the second day of Christmas. Instead of playing with my presents, I am already checking the API of VMRay to see how I can use it to achieve our goal. In February, a colleague of mine shows me Flask Admin, which is exactly the framework I am looking for. A simple, clean webserver with templates developed in Python. Flask and Flask Admin are new to me and it is both challenging and sometimes frustrating to work with a new framework. You know, the love and hate relationship. Love it when it works, hate it when it does not. In March, we decide to announce the competition at DEF CON 27 in August. In May, I already have a working site with some of the functionality in place. I try to keep things simple everywhere I can. Like, who wants to deal with user registration, lost passwords, multiple registrations for bypassing limits and stuff like that when you can simply use Google
Time flies, code does not. August approaches fast, so I do what every coder does in these cases: write code faster! Spoiler alert: it does not work. In July, I spend a lot of time finishing all the functionalities. The prod environment is deployed into Amazon, I put NGINX and gunicorn in front of the app. Not because I have to, just because I read this is the best practice. To make TLS easy and the website fast, I put it behind Cloudflare. I perform some tests to make sure that the CDN does not affect t
At DEF CON, Hyrum presents the competition to the people at AI Village. People are excited. Both because this is a unique challenge, and because the prize for the winner is a pretty nice GPU card. Handy when you are into Machine Learning. Or gaming. Or both. The competition starts, it is on.
When I get back to Budapest through Toronto (note to self, never fly Air Canada Rouge again) I am already greeted with valid complaints from the participants on our Slack channel, saying that some things do not work. Around sixty commits and three weeks later, the framework more or less works. During these three weeks, the framework does a lot of things to drive the competitors crazy. Valid samples are marked as invalid, invalid samples are marked as valid, upload limits are reached. Some people think they achieve
Finally, on August 28 at 15:25 UTC, William Fleshman uploads his final piece of the puzzle and achieves the top score of 150 points. But on the same day, just a few hours later, another contestant does the same. Some days later, both Hyrum and
Looking at the solutions, contestants took the following routes:
- appending extra data to the executable, also known as overlay
- adding new sections to the executable, and it is even better if these sections are from known benign files
- packing the samples with a packer
If you have read William’s post, you know that things are never as simple as they seem. Adding sections to certain malware files rendered the executable useless.
Also, if you have read Jakub’s post, you can see that packing files works on some of the samples, but not on all.
So, let’s start the bits and bytes section. How is it possible to detect these samples? Well, turns out it is easy. Because traditional AV signature scanning still detects files where
When it comes to adding
When it comes to appending new sections to the PE, things get complicated. Some samples are detected by fewer AV engines. But why? I suppose certain AV engines take shortcuts for performance reasons and check simple things, like the number of sections, before a signature test is performed. Fun fact: this can even fool certain production ML engines. The original sample is detected by the ML engine, but the variant with the added sections described by William is not. Clearly, the bypass against ML engines works because the sample contains a lot of known benign sections, and not because the malware modification changes the number of sections.
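To make the overlay and section discussion concrete, here is a minimal sketch of how a defender could read the section list and the overlay size straight from the PE headers, using only the Python standard library. The names `pe_layout` and `toy_pe` are my own for this illustration and are not part of the competition framework. The sketch treats everything past the last section's raw data as overlay, which is a simplification: a signed binary keeps its certificate table there as well.

```python
import struct


def pe_layout(data: bytes) -> dict:
    """Return the section names and overlay size of a PE image held in memory."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the 4-byte signature.
    (num_sections,) = struct.unpack_from("<H", data, pe_offset + 6)
    (opt_size,) = struct.unpack_from("<H", data, pe_offset + 20)
    table = pe_offset + 24 + opt_size  # start of the section table
    names = []
    end_of_image = table + 40 * num_sections  # headers end here
    for i in range(num_sections):
        entry = table + 40 * i  # each section header is 40 bytes
        name = data[entry:entry + 8].rstrip(b"\0").decode("ascii", "replace")
        # SizeOfRawData and PointerToRawData sit at offsets 16 and 20.
        raw_size, raw_ptr = struct.unpack_from("<II", data, entry + 16)
        names.append(name)
        if raw_size:
            end_of_image = max(end_of_image, raw_ptr + raw_size)
    return {"sections": names,
            "overlay_size": max(0, len(data) - end_of_image)}


def toy_pe(section_names, overlay=b""):
    """Build a header-only toy PE (hypothetical, not executable) for the demo."""
    pe_offset = 0x40
    dos = b"MZ" + b"\0" * 0x3A + struct.pack("<I", pe_offset)
    n = len(section_names)
    headers_len = pe_offset + 24 + 40 * n  # no optional header in the toy
    table, body = b"", b""
    for i, name in enumerate(section_names):
        raw_ptr = headers_len + 16 * i
        table += struct.pack("<8sIIIIIIHHI", name.encode(), 16,
                             0x1000 * (i + 1), 16, raw_ptr, 0, 0, 0, 0, 0)
        body += b"\xCC" * 16
    coff = struct.pack("<HHIIIHH", 0x14C, n, 0, 0, 0, 0, 0)
    return dos + b"PE\0\0" + coff + table + body + overlay


if __name__ == "__main__":
    print(pe_layout(toy_pe([".text", ".data"])))
    print(pe_layout(toy_pe([".text", ".data", ".extra"], overlay=b"A" * 100)))
```

Comparing the two printed layouts shows the kind of cheap structural features (section count, overlay size) that an engine could look at before running anything more expensive.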
When it comes to packed files, most AV engines have a solid unpacker engine already in place. Nevertheless, packers are still the number one bypass technique against static AV signatures because even slight modifications to the packer algorithm can break the unpacker engine. When it comes to most ML engines, things are a bit different. As most ML engines do not unpack the files, they mostly flag packers as
Pack Windows calc.exe with Themida with a valid Taggant, even
Pack Windows calc.exe with VMProtect and OMG happens:
Moral of the story?
The more techniques are used to detect the samples, the harder it is for attackers to evade them. Combine AV signatures with ML, combine it with behavior and heuristics.
Is it still possible to bypass them? Yes.
Is it more difficult? Yes.
Will it produce more False Positive alerts? Probably yes.
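The trade-off behind these three answers can be put into back-of-the-envelope arithmetic. The sketch below assumes that a file is flagged if any layer fires and that the layers behave independently, which is optimistic for real products. The function name `combined_rates` and all the numbers are hypothetical, chosen only for illustration.

```python
from math import prod


def combined_rates(evasion_rates, fp_rates):
    """Combine per-layer rates for an 'alert if any layer fires' setup.

    evasion_rates[i]: chance a modified sample slips past layer i (assumed).
    fp_rates[i]: chance layer i flags a benign file (assumed).
    Layers are treated as independent, which is a simplification.
    """
    # The attacker must get past every layer at once.
    evade_all = prod(evasion_rates)
    # A benign file is flagged if at least one layer misfires.
    any_false_positive = 1 - prod(1 - fp for fp in fp_rates)
    return evade_all, any_false_positive


if __name__ == "__main__":
    # Hypothetical numbers: signatures, ML and behavior layers.
    evade, fp = combined_rates([0.5, 0.4, 0.3], [0.01, 0.02, 0.01])
    print(f"evades everything: {evade:.3f}, false positive: {fp:.4f}")
```

Under these assumptions, every extra layer multiplies the attacker's chance of success by a number below one, while the combined false positive rate can only go up.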
Footnote on SSDeep hashes
While checking the SSDeep hashes of the submitted files, I found a fun comparison of the original malware SSDeep hashes and the modified ones. Can you spot which sample was the original one and which one was generated just to bypass the ML detection?