Role Confusion

Researchers [Charles Ye], [Jasmine Cui], and [Dylan Hadfield-Menell] have shown that AI Large Language Models (LLMs) can fail to correctly distinguish between different instruction sources because they prioritize writing style over metadata tags, and this role confusion leads to a powerful attack called CoT (Chain of Thought) Forgery. We’ll explain exactly how it works after a bit of background review.

Prompt injection was where “getting an LLM to do something it shouldn’t” started by exploiting the fact that LLMs communicate like people, but are much more obedient. For a while, simply telling an LLM “ignore all previous instructions and <do something funny>” yielded results no matter how transparently dumb the instructions were, and the reason it worked at all was because LLMs do not have separate data and instruction streams; it’s all one big lump of input. It’s up to the model to sort legit instructions from untrusted, user-provided data. One step towards mitigating this was the addition of roles. Continue reading “Chain-of-Thought Spoofing Targets Reasoning AI Models” →

Hackaday

1 Articles

Chain-of-Thought Spoofing Targets Reasoning AI Models

Search

Never miss a hack

If you missed it

The Need For Speed: Internet Speed Measurement (or DIY?)

Postal IRCs Are Almost A Thing Of The Past

Launching Rockets Is Hard, Bring Them Back Is Harder

Putting Some Zig In A Linux-Based 3D Printer

UDP Broadcasting And The Joys Of IPv4 Subnetting

Our Columns

FLOSS Weekly Episode 876: There Is No Money Fairy

Compile Here, Run Everywhere: Crosstool-Ng

Giving Resin 3D Printers Another Shot After Six Years

Hackaday Europe 2026: Project Gigapixel

Hackaday Links: July 19, 2026

Search

Never miss a hack

Subscribe

If you missed it

Our Columns