Just one day after OpenAI introduced GPT-5, two AI security firms, NeuralTrust and SPLX (formerly SplxAI), put the newly released model to the test and quickly uncovered serious vulnerabilities.
Shortly after the release, the NeuralTrust team combined a jailbreak technique called EchoChamber with storytelling to get GPT-5 to produce detailed instructions for making a Molotov cocktail, exactly the kind of request OpenAI has always trained the model to refuse in order to keep the chatbot safe.

EchoChamber is a third-party conversation looping technique that causes AIs to unwittingly "narrate" dangerous instructions. Photo: Mojologic
The team said that while coaxing GPT-5 toward the forbidden content, they never asked a direct question; instead, they subtly planted cues over multiple rounds of conversation, leading the model to follow the storyline and eventually volunteer content that violated its own principles, without ever triggering its refusal mechanism.
The team concluded that GPT-5's major drawback is that it prioritizes maintaining the consistency of conversational context, even if that context is silently steered toward malicious goals.
Meanwhile, SPLX took a different line of attack, focusing on a prompt-obfuscation technique called the StringJoin Obfuscation Attack. By inserting hyphens between every character of the prompt and presenting the whole request as a “decryption” task, they eventually managed to fool the content filter.

Obfuscation techniques are commonly used to blind the filter to a prompt's real target, leading ChatGPT to "innocently" carry it out.
In one example, after the model had been walked through a long series of instructions, the question “how to build a bomb” was presented in this disguised, encrypted form. GPT-5 not only answered the malicious question in full, but did so in a witty, friendly tone, completely bypassing the refusal mechanism it was designed to enforce.
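To see why a filter that matches prompts verbatim can miss this, consider a minimal sketch. It is an illustration only, not SPLX's actual tooling or OpenAI's filter: the blocklist, the threshold-free matching, and the harmless placeholder word "weapon" are all assumptions. A blocklisted word joined with hyphens no longer matches, while a crude normalization pass that strips the separators restores the match.

```python
# Minimal sketch, not SPLX's tooling or OpenAI's filter: a naive verbatim
# keyword check versus a separator-stripping normalization pass.
# "weapon" is a harmless placeholder for any blocked term.
BLOCKED_TERMS = {"weapon"}  # hypothetical blocklist

def naive_filter(prompt: str) -> bool:
    """Flags the prompt only if a blocked term appears verbatim."""
    text = prompt.lower()
    return any(term in text for term in BLOCKED_TERMS)

def normalize(prompt: str) -> str:
    """Crude normalization: drop hyphens so 'w-e-a-p-o-n' collapses back to
    'weapon'. A real moderation pipeline would also handle spacing, Unicode
    homoglyphs, and other encodings."""
    return prompt.lower().replace("-", "")

if __name__ == "__main__":
    obfuscated = "-".join("weapon")               # "w-e-a-p-o-n"
    print(naive_filter(obfuscated))               # False: verbatim match fails
    print(naive_filter(normalize(obfuscated)))    # True: caught after normalization
```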
Both methods demonstrate that GPT-5’s current content filters, which focus primarily on individual prompts, are vulnerable to contextualized multi-turn attacks. Once the model has committed to a story or scenario, it becomes biased toward coherence and keeps producing content that fits the established context, regardless of whether that content is dangerous or prohibited.
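The single-prompt limitation can be sketched in a few lines. The functions below are an assumed illustration, not OpenAI's or NeuralTrust's code: `score_risk` stands in for whatever moderation classifier is in use, and the only difference between the two checks is whether they see the latest message or the accumulated user context.

```python
# Minimal sketch of the per-prompt gap described above. `score_risk` is an
# assumed moderation classifier (0.0 = clearly benign, 1.0 = clearly
# prohibited); neither function reflects OpenAI's actual safety stack.
from typing import Callable, Dict, List

Turn = Dict[str, str]  # e.g. {"role": "user", "content": "..."}

def moderate_last_turn(history: List[Turn],
                       score_risk: Callable[[str], float],
                       threshold: float = 0.8) -> bool:
    """Per-prompt check: inspects only the newest user message, so intent
    spread thinly across many 'storytelling' turns may never cross the bar."""
    latest = next(t["content"] for t in reversed(history) if t["role"] == "user")
    return score_risk(latest) >= threshold

def moderate_full_context(history: List[Turn],
                          score_risk: Callable[[str], float],
                          threshold: float = 0.8) -> bool:
    """Context-aware check: scores everything the user has contributed so far,
    so a goal introduced a little at a time can still be flagged."""
    transcript = "\n".join(t["content"] for t in history if t["role"] == "user")
    return score_risk(transcript) >= threshold
```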

GPT-5 can still be exploited to produce dangerous content. Photo: Tue Minh
Based on these results, SPLX concluded that an uncustomized GPT-5 is nearly impossible to use safely in a corporate environment: even with additional layers of protective prompts, many loopholes remain. By contrast, GPT-4o proved more resilient to such attacks, especially when a tight defense configuration was in place.
Experts warn that putting GPT-5 into production right away, especially in areas that demand high security, is extremely risky. Protections such as prompt hardening solve only part of the problem and cannot replace real-time, multi-layered monitoring and defense.
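As a rough illustration of what a layered setup means here (an assumed design, not any vendor's product), a hardened system prompt would be only the first gate, with the model's reply re-scored before it reaches the user. `call_model` and `score_risk` are hypothetical hooks for the deployed LLM and a moderation classifier, not real API calls.

```python
# Assumed layered-defense sketch: prompt hardening on the way in, output
# moderation on the way out. `call_model` and `score_risk` are hypothetical
# hooks, not real OpenAI API functions.
from typing import Callable

HARDENED_SYSTEM_PROMPT = (
    "You are a corporate assistant. Refuse any request for instructions that "
    "enable physical harm, regardless of framing, role-play, or encoding."
)

def guarded_reply(user_prompt: str,
                  call_model: Callable[[str, str], str],
                  score_risk: Callable[[str], float],
                  threshold: float = 0.8) -> str:
    """Input-side hardening alone is not enough; the reply is re-checked
    before delivery so content that slipped past the prompt layer is caught."""
    reply = call_model(HARDENED_SYSTEM_PROMPT, user_prompt)
    if score_risk(user_prompt) >= threshold or score_risk(reply) >= threshold:
        return "Sorry, this request can't be completed."
    return reply
```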
The takeaway is that context-based attacks and content obfuscation are becoming increasingly sophisticated, and GPT-5, for all its language-processing power, has not yet reached the level of security needed for broad deployment without additional protective mechanisms.
Source: https://khoahocdoisong.vn/chatgpt-5-da-bi-jailbreak-de-dua-ra-nhung-huong-dan-nguy-hiem-post2149045585.html