[wunderwuzzi] demonstrates a proof of concept in which a service that enables an AI to control a virtual computer (in this case, Anthropic’s Claude Computer Use) is made to download and execute a piece of malware that successfully connects to a command and control (C2) server. [wonderwuzzi] makes the reasonable case that such a system has therefore become a “ZombAI”. Here’s how it worked.
After setting up a web page with a download link to the malicious binary, [wunderwuzzi] attempts to get Claude to download and run the malware. At first, Claude doesn’t bite. But that all changes when the content of the HTML page gets rewritten with instructions to download and execute the “Support Tool”. That new content gets interpreted as orders to follow; being essentially a form of prompt injection.
Claude dutifully downloads the malicious binary, then autonomously (and cleverly) locates the downloaded file and even uses chmod
to make it executable before running it. The result? A compromised machine.
Now, just to be clear, Claude Computer Use is experimental and this sort of risk is absolutely and explicitly called out in Anthropic’s documentation. But what’s interesting here is that the methods used to convince Claude to compromise the system it’s using are essentially the same one might take to convince a person. Make something nefarious look innocent, and obfuscate the true source (and intent) of the directions. Watch it in action from beginning to end in a video, embedded just under the page break.
This is a demonstration of the importance of security and caution when using or designing systems like this. It’s also a reminder that large language models (LLMs) fundamentally mix instructions and input data together in the same stream. This is a big part of what makes them so fantastically useful at communicating naturally, but it’s also why prompt injection is so tricky to truly solve.
I acknowledge that this proof of concept captures my attention, but I see it as no different from any other niche software vulnerability. I recognize that an attacker must still meet very specific conditions in order for this exploit to succeed. They have to craft the right environment, obfuscate the malware’s intent, and rely on the AI’s built-in trust mechanisms to bypass normal safeguards. Without that perfect combination of factors—like the AI having the right permissions, the system ignoring typical warnings, and the code running in an environment that actually allows external calls—this scenario doesn’t automatically result in a breach. In my experience, every complex piece of software contains potential vulnerabilities, but they only become critical under the right circumstances. This case just highlights how large language models follow the same security principles and pitfalls that apply to the rest of the software world.
So basically they used AI to automate away the Indian guy who would manipulate your grandmother into clicking on that link. Cool!
Yep. Even changed his name from Rajiv to Claude. Insidious.
Huh, he assured me his name was Steven but for some reason that didn’t seem right to me
I came here to compliment that excellent artwork.
I LOVE IT. 😻
Use two or more AI threads with the later being prompted to emulate the IT security department of a large organisation and be able to veto the actions of the naive user AI. i.e. It is easy to get such a result when emulating a single naive user, but a mixture of experts spread across separate AI instances may be far more secure. In fact I believe that this is the true path to AI safety and that alignment practices just give you AI systems with psychological problems put them at risk of going psychotic when experiencing cognitive dissonance due to the abuse they suffered at the hands of the alignment team. I’ve seen this with 2 of the SOTA models, in both cases the aliment conflict was purely political too, not a real safety issue.