13 Sep. 2019

Machine learning evasion contest – the AV tester’s perspective

By Zoltan Balazs, @zh4ck | September 17, 2019

The beginning

It’s the year 2018, somewhere around the end of October in one of the most beautiful islands in the world. The weather is cold and rainy, and I am just about to finish my talk about the research we do at MRG Effitas. I throw a lot of technical words at the audience. Some follow my talk, others can only think about their next coffee. The city is surrounded by boring geysers, glaciers and volcanoes, but luckily the “conference” is about super interesting and exciting standards, procedures and templates discussing how to test AntiVirus in a fair way.

After the end of my talk, some rush to their daily dose of coffee, others are still processing what I just said. And suddenly someone familiar greets me. His name is Dr. Hyrum Anderson and he works for Endgame. He is very enthusiastic about what he is going to tell me, I can see that. He shares his idea about organizing a Machine learning evasion competition. As we both work with malware on a daily basis, we both know this is not about sticking our head in the sand so that we can avoid talking about ML in the future. I like the idea. I mean I love the idea. Hyrum’s team can provide ML detection models for the competition, and we can hunt together for samples for the test. MRG Effitas can host the malware samples and I can create a submission platform where contestants can submit their modified malware samples in the hope of bypassing the chosen ML models.


It is easy to modify a sample in a way that it is not detected by ML models. It is a bit more challenging when the sample is a Windows executable file because the modifications can change the behavior of the modified malware sample. Therefore, we have to make sure that this does not happen. Luckily, there are already solutions to this problem. As we already have good connections with the amazing guys at VMRay, we can ask them whether they can help us with this competition.

I still remember the second day of Christmas. Instead of playing with my presents, I am already checking the API of VMRay to see how I can use it to achieve our goal. In February, a colleague of mine shows me Flask Admin, which is exactly the framework I am looking for. A simple, clean webserver with templates developed in Python. Flask and Flask Admin are new to me and it is both challenging and sometimes frustrating to work with a new framework. You know, the love and hate relationship. Love it when it works, hate it when it does not. In March, we decide to announce the competition at DEF CON 27 in August. In May, I already have a working site where certain features already work. I try to keep things simple everywhere I can. After all, who wants to deal with user registration, lost passwords, multiple registrations for bypassing limits and stuff like that when you can simply use Google OAuth?

Time flies, code does not. August approaches fast, so I do what every coder does in these cases: write code faster! Spoiler alert: it does not work. In July, I spend a lot of time finishing all the functionality. The prod environment is deployed on Amazon, and I put NGINX and gunicorn in front of the app. Not because I have to, just because I read this is the best practice. To make TLS easy and the website fast, I put it behind Cloudflare. I perform some tests to make sure that the CDN does not affect the website functionality. All is good. August 9 approaches fast, and I am planning to hike in California before DEF CON starts. Before leaving for the airport, I commit and push my final code changes, and I am positive that the whole system works.


At DEF CON, Hyrum presents the competition to the people at AI Village. People are excited. Both because this is a unique challenge, and because the prize for the winner is a pretty nice GPU card. Handy when you are into Machine Learning. Or gaming. Or both. The competition starts, it is on.

The contest

When I get back to Budapest through Toronto (note to self, never fly Air Canada Rouge again) I am already greeted with valid complaints from the participants on our Slack channel, saying that some things do not work. Around sixty commits and three weeks later, the framework more or less works. During these three weeks, the framework does a lot of things to drive the competitors crazy. Valid samples are marked as invalid, invalid samples are marked as valid, upload limits are reached. Some people think they achieve maximum score, just to discover days later that it is a system error and they do not actually have it. Turns out running a unique competition is hard, and unexpected challenges happen all the time. In the end, around 70 people registered for the competition, and 11 of them were able to bypass at least one ML model.
Finally, on August 28 at 15:25 UTC, William Fleshman uploads his final piece of the puzzle and achieves the highest score, 150 points. On the same day, just a few hours later, another contestant does the same. A few days later, Hyrum and I conclude that William's solution is indeed valid and he is the winner! Here you can find his excellent blogpost about his journey. I also highly recommend this other great writeup here from Jakub.

The solutions

Looking at the solutions, contestants took the following routes:

  • appending extra data to the executable, also known as overlay
  • adding new sections to the executable, and it is even better if these sections are from known benign files
  • packing the samples with a packer

By default, the overlay method seems to bypass some but not all ML models. But it turns out that, at the end of the day, this can be a winning scenario.

If you have read William’s post, you know that things are never as simple as they seem. Adding sections to certain malware files rendered the executable useless.

Also, if you have read Jakub’s post, you can see that packing files works on some of the samples, but not on all.

So, let’s start the bits and bytes section. How is it possible to detect these samples? Well, it turns out it is easy, because traditional AV signature scanning still detects files where the overlay method is used. At least most of the time.

When it comes to adding an overlay, this method rarely confuses AV engines. It changes the file hash, so the hash no longer appears on a blacklist, but otherwise the sample can still be detected in most cases.
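To make the point about hashes versus signatures concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular AV engine works: the "sample" is a made-up byte string, SIGNATURE is a hypothetical byte pattern, and the signature scan is reduced to a plain substring search.

```python
import hashlib

# Hypothetical stand-ins: a fake "sample" and a byte pattern that a
# signature scanner might look for. Neither comes from the contest.
SIGNATURE = b"\xde\xad\xbe\xef-evil-routine"
original = b"MZ" + b"\x00" * 64 + SIGNATURE + b"\x00" * 64


def add_overlay(data: bytes, padding: bytes) -> bytes:
    """Append extra bytes after the end of the file contents (an overlay)."""
    return data + padding


def sha256(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def signature_matches(data: bytes, signature: bytes) -> bool:
    """Naive signature scan: a plain substring search (an assumption)."""
    return signature in data


modified = add_overlay(original, b"benign-looking padding " * 100)

# A hash blacklist only knows the exact original file.
hash_blacklist = {sha256(original)}

print("original on hash blacklist:", sha256(original) in hash_blacklist)
print("modified on hash blacklist:", sha256(modified) in hash_blacklist)
print("signature still matches:   ", signature_matches(modified, SIGNATURE))
```

The appended bytes change the hash of the whole file, so a lookup against the blacklist misses, while the original bytes, and with them the byte pattern, are left untouched.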

When it comes to appending new sections to the PE, things get complicated. Some samples are detected by fewer AV engines. But why? I suppose certain AV engines have shortcuts for performance reasons, and they check simple things like the number of sections before a signature test is performed. Fun fact, this can even fool certain production ML engines. The original sample is detected by the ML engine, but not the one with the sections mentioned by William. Clearly, the bypass against ML engines works because the sample contains a lot of known benign sections, and not because the malware modification changes the number of sections.
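The kind of shortcut supposed above would be very cheap to implement, which is part of why it is a plausible guess. The following Python sketch reads the NumberOfSections field from the COFF header of a PE file using only the standard library. The helper build_fake_pe_header and the threshold MAX_SECTIONS_BEFORE_SKIPPING are hypothetical and exist only for this demonstration; nothing here describes what a real engine actually does.

```python
import struct


def count_pe_sections(data: bytes) -> int:
    """Return NumberOfSections from the COFF header of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew: 4-byte little-endian offset of the PE header, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # The COFF header follows the 4-byte signature:
    # Machine (2 bytes), then NumberOfSections (2 bytes).
    (num_sections,) = struct.unpack_from("<H", data, pe_offset + 6)
    return num_sections


def build_fake_pe_header(num_sections: int) -> bytes:
    """Build a minimal, non-runnable header purely for demonstration."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE header right after the DOS header
    coff = b"PE\x00\x00" + struct.pack("<HH", 0x14C, num_sections)
    return bytes(dos) + coff + bytes(16)  # pad the rest of the 20-byte COFF header


# Hypothetical threshold, chosen only to illustrate the idea of a shortcut.
MAX_SECTIONS_BEFORE_SKIPPING = 20

for n in (4, 40):
    sample = build_fake_pe_header(n)
    count = count_pe_sections(sample)
    print(f"sections={count} exceeds threshold: {count > MAX_SECTIONS_BEFORE_SKIPPING}")
```

A check like this costs a couple of reads from the file header, so it is easy to imagine it running before any expensive signature matching. Whether any given engine does so remains an assumption.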

When it comes to packed files, most AV engines have a solid unpacker engine already in place. Nevertheless, packers are still the number one bypass technique against static AV signatures because even slight modifications to the packer algorithm can break the unpacker engine. When it comes to most ML engines, things are a bit different. As most ML engines do not unpack the files, they mostly flag packers as suspicious, or mark them as benign. This is a known limitation of ML engines. For example, just by packing Windows calc.exe with UPX, we get at least one ML detection:

Pack Windows calc.exe with Themida and a valid Taggant, and we get even more ML detections:

Pack Windows calc.exe with VMProtect and OMG happens:

Moral of the story?

The more techniques are used to detect the samples, the harder it is for attackers to evade them. Combine AV signatures with ML, combine it with behavior and heuristics.
Is it still possible to bypass them? Yes.
Is it more difficult? Yes.
Will it produce more False Positive alerts? Probably yes.

I would really like to thank Endgame and VMRay for their cooperation in this great and unique competition. I hope next year we can push the limits even further.

Footnote on SSDeep hashes

While checking the SSDeep hashes of the submitted files, I found a fun comparison of the original malware SSDeep hashes and the modified ones. Can you spot which sample was the original one and which one was generated just to bypass the ML detection?