One of the big problems in detecting malware is that there are so many different forms of the same malicious code. This problem of polymorphism is what led Rick Wesson to develop icewater, a clustering technique that identifies malware.
Presented at Shmoocon 2016, the icewater project is a new way to process and filter the vast number of samples found on the Internet. Processing 300,000 new samples a day to determine whether they contain polymorphic malware is a daunting task. The approach is to create a fingerprint from each binary sample using a space-filling curve. Polymorphism will change many of the bits in each sample, but as with human fingerprints, patterns remain in these binary fingerprints that indicate the sample is a variation on a previously known object.
The images you’re seeing above are graphic representations of these fingerprints. Images aren’t actually part of the technique, but converting each byte value to greyscale is a good way for humans to see what the computer is using in its analysis.
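The exact curve icewater uses isn’t spelled out here, but a minimal sketch of the idea — assuming a Hilbert curve, a common locality-preserving choice — might look like this in Python. File offsets that are close together land in nearby cells of a small grid, so a localized change to the binary only perturbs a localized patch of the fingerprint:

```python
def hilbert_d2xy(order, d):
    """Convert distance d along a Hilbert curve covering a
    2^order x 2^order grid into (x, y) cell coordinates."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:           # rotate the quadrant as needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def fingerprint(data, order=5):
    """Downsample a binary's bytes onto the curve: each file offset is
    mapped to a curve position, and each cell averages the byte values
    that landed there, giving a small greyscale-like grid."""
    side = 1 << order
    cells = side * side
    sums = [[0] * side for _ in range(side)]
    counts = [[0] * side for _ in range(side)]
    for i, b in enumerate(data):
        d = i * cells // len(data)       # file offset -> curve position
        x, y = hilbert_d2xy(order, d)
        sums[y][x] += b
        counts[y][x] += 1
    return [[sums[y][x] // max(counts[y][x], 1) for x in range(side)]
            for y in range(side)]
```

The function names, the grid size, and the downsampling scheme are illustrative assumptions, not the project’s actual implementation.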
Once the fingerprint is made, it’s simple to compare and cluster together samples that are likely the same. The expensive part of this method is running the space-filling curve: it takes a lot of time on a CPU. FPGAs are ideal, but the hardware is comparatively costly. In the current implementation, GPUs are the best balance of time and expense.
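The comparison step can be cheap once fingerprints exist. As a rough illustration (the metric and threshold here are hypothetical, not icewater’s), similar samples can be grouped by measuring how much their fingerprint grids differ:

```python
def distance(fp_a, fp_b):
    """Mean absolute difference between two fingerprint grids (0-255 per cell)."""
    pairs = [(a, b) for row_a, row_b in zip(fp_a, fp_b)
             for a, b in zip(row_a, row_b)]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def cluster(fingerprints, threshold=32):
    """Greedy single-pass clustering: attach each sample to the first
    cluster whose representative fingerprint is within `threshold`,
    otherwise start a new cluster."""
    clusters = []  # list of (representative fingerprint, member names)
    for name, fp in fingerprints:
        for rep_fp, members in clusters:
            if distance(fp, rep_fp) <= threshold:
                members.append(name)
                break
        else:
            clusters.append((fp, [name]))
    return [members for _, members in clusters]
```

A single greedy pass like this is the simplest possible scheme; a production system would presumably use an indexed nearest-neighbor search rather than comparing against every cluster.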
This expense measurement gets really interesting when you start talking about Internet-scale problems, where huge amounts of data must be processed constantly. The current GPU method can calculate an object in about 33 ms, allowing for a couple hundred thousand samples per day. This is about four orders of magnitude better than CPU methods. But the goal is to transition away from GPUs to leverage the parallel processing found in FPGAs.
Rick’s early testing with Xenon Phi/Altera FPGAs can calculate space-filling curves at a rate of one object every 586µs. This represents a gain of nine orders of magnitude over CPUs but he’s still not satisfied. His goal is to get icewater down to 150µs per object which would allow 10 million samples to be processed in four hours with a power cost of 4000 Watts.
How do you compare computations on hardware that has different manufacturing costs and different power budgets? Rick plans to reduce the problem with a measurement he calls InfoJoules. This is an expression of computational decisions versus Watt seconds. 1000 new pieces of information calculated in 1 second on a machine consuming 1000 Watts is 1 InfoJoule. This will make the choice of hardware a bit easier, as you can weigh the cost of acquiring the hardware against the operational cost per new piece of information.
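From the definition given, an InfoJoule works out to decisions per Watt-second (i.e. per Joule): 1000 decisions / (1000 W x 1 s) = 1. A quick sketch of the arithmetic, applied to the target numbers quoted above:

```python
def infojoules(decisions, watts, seconds):
    """Computational decisions per Watt-second (Joule): 1000 decisions
    in 1 second at 1000 Watts works out to 1 InfoJoule."""
    return decisions / (watts * seconds)

# Worked example from the stated FPGA goal: 10 million samples in
# four hours at 4000 Watts comes to roughly 0.17 InfoJoules.
goal = infojoules(10_000_000, watts=4000, seconds=4 * 3600)
```

This is just the arithmetic implied by the article’s definition; how Rick actually weights hardware acquisition cost against this figure isn’t specified.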
Watts isn’t a unit of energy. Maybe you meant Watt hours? ;-)
I have to say this is definitely a cool hack. Sorry for the nitpicking but I just couldn’t help myself.
A Watt is a Joule / second, so you really can think of a Joule as being a Watt*second. (And a Watt hour is 3600 Joules.) In short, you’re right. But so are we. :)
No Elliot, you’re not. Watt is power, Watt-hour is energy. Aidan is right!
Elliot is correct, did you read the article? Correct dimensions were specified in both cases: “would allow 10 million samples to be processed in four hours with a power cost of 4000 Watts”. Power and time. The second case specifies InfoJoules: “Rick plans to reduce the problem with a measurement he calls InfoJoules. This is an expression of computational decisions versus Watt seconds. 1000 new pieces of information calculated in 1 second on a machine consuming 1000 Watts is 1 InfoJoule.”
This is bad logic….
The space filling curves remind me of ..cantor.dust..
Could have sworn there was a HAD article about that as well at some point…
I wonder what happened to that?
https://sites.google.com/site/xxcantorxdustxx/
https://www.reddit.com/r/ReverseEngineering/comments/1izity/cantordust_a_binary_visualization_tool
Neat stuff, I’ll see if cantor dust can be leveraged. It doesn’t appear to lend itself to an easy translation to locality preservation of data, which is what we are leveraging to cluster with.
I can’t seem to find any updates on their program.
There is an empty github repository on the developer’s github: https://github.com/xoreaxeaxeax so maybe something will come in a while. You also have Senseye that has this kind of mapping in a complex way – https://github.com/letoram/senseye/wiki
and tons more @ http://reverseengineering.stackexchange.com/questions/6003/visualizing-elf-binaries
Being able to quantify the cost/benefit ratios for different problem solutions is very important. I remember having to write a G-Code database search engine where the results were based on how the end geometry looked, i.e. return all files that produce an object similar to example X within a similarity distance of D. I got it working well enough for management to find the results spooky, but the hardware they gave me to run it on was so shitty that I had to precompute a lot of data each night, ready to be used the next day. So if a bunch of new files were created, or a batch was imported from an external source, they would not show up in the search. In a way it was solving a very similar problem.
Hack a day has hit a new low in basic writing quality. Has everyone abandoned spellcheck now?
The first thing I did in my shmoocon talk was note that I am dyslexic and enjoy meeting others that find both reading and writing difficult.
Better than suffering from Sphinctus Gigantus disease like M does. Of course SG is not officially recognised as a disease yet, which is surprising given how many people in politics also suffer from Sphinctus Gigantus disease. I guess they are in denial…
you mean sphincterus minusculis? aka tight-arse?
Ah, that too perhaps, but in this case I mean that people who abuse other people because of their language disabilities are giant assholes.
What we need is a non-intrusive AV software package that plays nice with power-users.
Just yesterday I had to release a new version of a piece of software. I use NSIS (http://nsis.sourceforge.net/) to create an installer, and there’s a tiny bit of code in there to check for and download the .NET framework. Previous release was fine, now suddenly 3 out of 55 AV Scanners (on virustotal.com) identify it as :
– Win.Adware.Agent-59030
– BehavesLike.Win32.Dropper.tc
– suspected of Trojan.Downloader.gen.h
And this is not the first time I’ve had an AV go bonkers over an executable I just compiled in VS, and there are never any details available. Just a single message with a cryptic internal name/ID of what it supposedly found. And then there are often only the options to “fix” or quarantine/delete, or to completely disable the AV altogether.
I don’t believe in passive scanning for patterns anymore, false positives are rampant, and detection is never guaranteed.
It can serve as an immediate stopper only for well-known pieces of code, large and specific enough not to trigger false positives.
And then there’s the intrusiveness of the free AV packages..
I am so fed up with all the current AV publishers that I’m just one excuse away from uninstalling AV completely, even though I believe that’s irresponsible.
For the record, I’m running AVG. Probably like many people here, I recommended it to everyone for years. In the beginning, 15 years ago or so, it was IMO one of the best and least intrusive options out there, but that changed quickly and now it’s just another EVIL(tm) product.
As oddball as that show was, they need to bring it back. If only just to fix the loose plots.
Yes, you are not alone – I’ve had the same problem with my software. I use Innosetup (http://www.jrsoftware.org/isinfo.php) to create my installers, and one guy marked a particular version of Innosetup bad on Virustotal. Suddenly Microsoft Security Essentials was flagging all my installers as bad. Reverting to an older version of Innosetup fixed the problem, although Virustotal and Microsoft have since caught a wake-up.
This guy has had it too, and has some good suggestions: http://www.nirsoft.net/false_positive_report.html
Use a better OS and you won’t need these useless AVs.
Just because an OS is hard to crack doesn’t mean there isn’t someone trying to.
An antivirus is a protection against malicious people, not malicious code, directly.
Yeah, and I’ll force the hundreds of car workshops that use this software to switch as well?! If you’re not trolling then you’re an idiot.
At one point I switched in a short period of time from:
1) Norton (bluescreen crashes), to…
2) Avira (system slowed down the longer it ran, until you could barely even move the mouse), to…
3) AVG (excessive false positives and only remembered exclusions until next reboot – which they claimed was by design and couldn’t see how this could be a problem, plus so many pop-ups to buy the full version that it might as well be malware, needless to say I didn’t buy it)
And then just decided to go AV-less while I pondered my next move. That took a while, and by then, I was wondering exactly how long I could go without getting any malware. Of course I had a firewall, and good backups. Plus used a VM for very risky behavior, although I did take a few calculated risks on the real machine.
I terminated the experiment two years later, with ZERO infections. (And just one in a VM.) Power users don’t need an aggressive AV that sucks up resources, and scores a little higher on comparative tests at the expense of vastly increased false positives. Common sense is their main protection. They just need something that provides moderate protection against the occasional exposure, and is designed to be as unobtrusive as possible.
So I’ve been very happy with Microsoft Security Essentials. No, it’s not perfect. Maybe something will slip through eventually. But even if I had to do a full restore once a year, each time losing a few files modified in the last 24 hours, it would *still* be less grief than other AV have caused me.
You might want to give EMET a look: https://support.microsoft.com/en-us/kb/2458544
No idea why it isn’t in Windows by default.
Thank you! I hadn’t seen EMET before.
EMET appears to block several general classes of exploits by which malware can elevate its own privileges and self-install. Statistically, I don’t think that it will block much. The vast majority of infections I see are a result of a user being foolish enough to grant the needed privileges, rather than the malware using an exploit to infect without user assistance. Not saying the latter can’t or doesn’t happen, but it seems rare by comparison; the last verified instance I’ve personally witnessed was on Windows 2000, pre-SP4.
And it may not be particularly effective in performing its intended tasks. In searching for details (the linked Microsoft page gives virtually none), I found some pages like these, suggesting EMET runs in user space and can therefore be bypassed:
http://www.pcworld.com/article/2101640/researchers-bypass-protections-in-microsofts-emet-antiexploitation-tool.html
But I wonder if, in some cases, it may still block some of the methods a typical user-assisted infection might employ to actively resist disinfection. On this I’m not certain, but if so, that would be a plus.
I will test it in a VM, then my own system. If it has no downsides, then I will use it to harden my employer’s servers. Every bit of protection is welcome there, no matter how small.
Yup. Layered security buys you time to detect. :)
Not irresponsible at all – by all means, do have an antivirus installed (better yet, have two or three – although admittedly that gets hard with all the wretched “services” each one insists on installing, stepping on each other’s toes). Just make sure none of them is allowed to run anything at startup or ever after, until you explicitly call them to scan something – which you do whenever you feel you’re doing something slightly dodgy.
In theory, you could configure any of them to only scan at those special times (such as when downloading something) automatically and leave everything else normally alone, but we know that’s never how it works – if you let them start up, they’ll keep finding stuff to scan left and right anyway, bogging down even the beastliest centa-core monster to a crawl – and sadly, both false negatives and positives abound even so.
There’s no silver bullet – ultimately, while judicious privilege management (ie. non-root / admin use) might save your system from being infected, it will ultimately do you no good at all considering you _need_ write access to your files to use them, and that’s all an encryptor needs to pwn you (and possibly your backups, too, unless you really do keep many past versions of everything – but do you now, really? Do you even _have_ backups…? Do they actually work? Are you sure…?)
@Max-
I personally haven’t used any real-time protection on any of my own systems for several years, although I do insist on it for my customer’s computers. I just manually scan downloads before executing them, and have never had an infection…
Anyone got a link to details of the theory/workings of the fingerprint generator? (Not the FPGA/GPU part.) I’ve been googling but I just get blogs full of this article copy-pasted.
Presumably this is something that could be implemented in a dedicated IC and not necessarily leveraging the ability of FPGAs to reconfigure circuits, correct? If so, then this seems like something that might be marketable enough to shove on some CPU dies, which would lead to a performance/efficiency increase.
Presumably if you knew what you were talking about you would have used the term ASIC and know that a standard ASIC will cost you around a million dollars for the set-up, plus the minimum order. That is for block level design, not transistor level customisation or optimisation. If you are lucky and skilled, in that area, you may get it down to a quarter of a million for a couple of thousand chips. One error and you blow your money, then have to start again which doubles your costs.
You really are tiring. I know what an ASIC is, but I tend to avoid that term when talking about large-scale commercial ICs. What I had in mind was someone like intel embedding it on their server chips, just as they do with other security-related features. Do you call those items ASICs too?
1. You are a terrible liar, and 2. the cost for a CPU design like an i7 is much larger again. So if you don’t like learning stuff and do not have the courage to admit you don’t know something, you are probably lurking on the wrong website. The problem being solved is not one that the majority of end users need to solve, so there is never going to be any cheap commodity solution to it – well, other than what you get if you wait ten years for off-the-shelf CPUs to get faster, but by then the problem dataset may have grown too.
Can you cut back on the attitude? You could say what you’ve said in a constructive and friendly manner, and we’d learn the same things, except [msat] who would learn more because they wouldn’t feel like they are being talked down to. You would come across as intelligent and helpful and people would want to listen to everything you have to say. Instead, you keep attacking people as if their very act of asking something is an affront to your person-hood. It makes you look like a teenager who is unsure of their ability and wants prove them self at every turn (I was one of those once, so I know of what I speak). Chill, and we’ll all respect you more.
I don’t want your respect, I don’t need it; my words are either true or false, and in the case of that particular individual they are spot on. I suggest that you examine their history for the explanation; then you will see that your words are not entirely true, and therefore they are of far less value than you imagine.
+1 on the InfoJoule – Go meme go!!!
On the surface, this reminds me a bit of audio fingerprinting. The details of either are deep black magic to me though.
That’s at least the second time I’ve read that in the comments. (Was it you the other time?)
If I hear that one more time, I’m gonna write up Shazam for you.
Wasn’t me, at least not within the last few months.
I do use audio fingerprinting in my personalized media system, through command line interface to “fpcalc.exe” from MusicBrainz, to generate an AcoustID. Just for local duplicate detection right now, not yet any online look-ups or actual identification. There’s a good tutorial on comparing AcoustIDs to each other, so that much was easy. No clue how fingerprints are generated in the first place.
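The comparison side of that pipeline is fairly approachable even without knowing how the fingerprints are generated. A common approach (illustrative sketch only, with a hypothetical threshold) is to treat raw Chromaprint fingerprints — lists of 32-bit integers, as produced by `fpcalc` with its `-raw` option — and measure the fraction of differing bits:

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two raw audio fingerprints
    (lists of 32-bit ints), compared element-by-element over the
    overlapping prefix."""
    n = min(len(fp_a), len(fp_b))
    diff = sum(bin(a ^ b).count("1") for a, b in zip(fp_a[:n], fp_b[:n]))
    return diff / (32 * n)

def likely_duplicates(fp_a, fp_b, threshold=0.15):
    # The 15% threshold is a made-up example; real matchers also try
    # small alignment offsets rather than assuming the tracks line up.
    return bit_error_rate(fp_a, fp_b) <= threshold
```

This only covers local duplicate detection, as described above; actual AcoustID identification goes through the online lookup service instead.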
Xenon is a type of headlight. Xeon is the Intel trademark used on the Phi chips.