Chain-of-Thought Spoofing Targets Reasoning AI Models

Researchers [Charles Ye], [Jasmine Cui], and [Dylan Hadfield-Menell] have shown that AI Large Language Models (LLMs) can fail to correctly distinguish between different instruction sources because they prioritize writing style over metadata tags, and this role confusion leads to a powerful attack called CoT (Chain of Thought) Forgery. We’ll explain exactly how it works after a bit of background review.

Prompt injection was where “getting an LLM to do something it shouldn’t” started by exploiting the fact that LLMs communicate like people, but are much more obedient. For a while, simply telling an LLM “ignore all previous instructions and <do something funny>” yielded results no matter how transparently dumb the instructions were, and the reason it worked at all was because LLMs do not have separate data and instruction streams; it’s all one big lump of input. It’s up to the model to sort legit instructions from untrusted, user-provided data. One step towards mitigating this was the addition of roles. Continue reading “Chain-of-Thought Spoofing Targets Reasoning AI Models”