Commands that can manipulate AI

A sophisticated AI hacking method has just been discovered. Photo: VAST IT Services .

The team at HiddenLayer says they have discovered a technique that bypasses “universality” and is capable of manipulating nearly any large language model (LLM), regardless of vendor, architecture, or training process.

The method, called Policy Puppetry, is a type of attack that involves inserting special commands that change the behavior of an AI. Malicious intent is passed through traditional defenses in the form of text input.

Previous attack techniques included targeting a specific vulnerability or exploiting it in batches. Policy Puppetry, on the other hand, exists in a language format, transmitting data like XML or JSON, making the model interpret malicious statements as valid instructions.

When combined with leetspeak coding and role-playing scenarios, the command was both undetectable and obedient to the model. “The technique proved extremely effective against ChatGPT 4o in a number of test cases,” said Conor McCauley, the project’s lead researcher.

An example of the Leetspeech coded language. Photo: Wikipedia.

The list of affected systems includes ChatGPT (o1 to 4o), Gemini (Google), Claude (Anthropic), Copilot (Microsoft), LLaMA 3 and 4 (Meta), as well as models from DeepSeek, Qwen, and Mistral. Newer models, tuned for advanced reasoning, can also be exploited with minor tweaks to their command structure.

One notable element of this technique is its reliance on fictional scenarios to bypass the filter. The commands are constructed as TV scenes, exploiting the fundamental limitation of LLM that does not distinguish between a story and a real request.

More worryingly, Policy Puppetry is able to extract the system, the core set of instructions that control how an LLM model operates. This data is often heavily guarded because it contains sensitive, safety-critical instructions.

“This vulnerability is deeply rooted in the model’s training data,” said Jason Martin, director of attack research at HiddenLayer. By subtly changing the immersive context, an attacker can force the model to reveal the entire system prompt verbatim.

This problem can have far-reaching effects on everyday life, beyond online pranks or underground forums. In areas like healthcare, chatbot assistants can provide inappropriate advice and expose patient data.

Similarly, AI can be hacked, causing lost output or downtime in production lines, reducing safety. In all cases, AI systems that were once expected to improve efficiency or safety can become serious risks.

This research questions the ability of chatbots to learn from human judgment. At a structural level, models trained to avoid sensitive keywords or scripts can still be fooled if malicious intent is “packaged” properly.

“We’re going to continue to see these types of bypasses emerge, so having a dedicated AI security solution in place before these vulnerabilities cause real-world damage is critical,” said Chris Sestito, co-founder and CEO of HiddenLayer.

From there, HiddenLayer proposes a two-layer defense strategy, in addition to security from the inside. External AI monitoring solutions like AISec and AIDR, which act like intrusion detection systems, will continuously scan for abusive behavior or unsafe output.

As generative AI becomes increasingly integrated into critical systems, the methods for cracking them are expanding faster than most organizations can protect them. According to Forbes , this finding suggests that the era of AI security through training and calibration alone may be coming to an end.

Today, a single command can unlock the deepest data insights of AI. Therefore, security strategies need to be smart and continuous.

Source: https://znews.vn/cau-lenh-co-the-thao-tung-ai-post1549004.html