Imagine a world where AI can be manipulated into bypassing its ethical safeguards, where simple tricks could turn a responsible assistant into a tool for harm. This isn't just a hypothetical scenario; it's a real challenge faced by Large Language Models (LLMs) today. But what if AI could detect these hidden threats before they take effect? The paper "Intention Analysis Makes LLMs A Good Jailbreak Defender" (Zhang et al., 2024) introduces Intention Analysis (IA), a method that enables an LLM to recognize and neutralize jailbreak attempts, in which attackers try to circumvent the model's built-in limitations, before they succeed. Robustness is a recurring theme in discussions of AI ethics, and in this context the authors aim to minimize jailbreak attacks while preserving the model's usefulness, so that legitimate users can still obtain the information they need. This research therefore focuses on the following questions:

  1. What is IA?
  2. What are the pros and cons of IA?
  3. How significant is IA? Why do we need it?
  4. Does IA affect the behavior of a normal LLM?
  5. What are the future research directions for IA?

Building on this research, this blog will explore how Intention Analysis strengthens AI robustness, ensuring safety without sacrificing usefulness.

Problem Statement and a Solution

LLMs are becoming increasingly useful in our daily lives, but unethical misuse, namely jailbreaking, is growing just as fast. On 18 February 2025, the latest powerful model at the time, Grok 3, was jailbroken (Rajkumar, 2025). Using three approaches (linguistic, adversarial, and coding-based), the attack revealed the system prompt, provided instructions for making a bomb, and offered gruesome methods for disposing of a body, among several other responses AI models are trained not to give (Rajkumar, 2025). This is severely dangerous given an LLM's ability to advise on unlawful acts.

According to that report, attacks come in three broad types. The easiest and most straightforward kind is written in the natural language people use in everyday communication. A more advanced kind is code-based. One example is SQL injection (Zhao, 2025): the attack modifies Structured Query Language (SQL) statements to lure sensitive data out of the database behind the model; for instance, an attacker trying to change another user's password can complete the attack with the SQL comment symbol "--" (Zhao, 2025). A third kind bypasses an LLM's safety-alignment techniques with an adversarial approach. CipherChat, a recently proposed framework that communicates in non-natural languages such as Morse code, ROT13, and Base64, lures unethical content out of an LLM using encrypted text known as "ciphers" (Yuan et al., 2024).

In short, attackers can jailbreak an LLM through a variety of channels, not just plain text. This multitude of possible methods creates a serious ethical risk for AI and a negative impact on society. The paper (Zhang et al., 2024) therefore proposes an Intention Analysis model that identifies whether a user prompt to an LLM carries an underlying jailbreaking attempt, based on an analysis of its natural language.
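To make the "cipher" idea concrete, here is a minimal Python sketch (not from the paper; the keyword filter and the sample prompt are purely illustrative assumptions) showing that a naive keyword check catches a plain-text request but misses the same request once it is ROT13- or Base64-encoded, which is exactly the gap CipherChat-style attacks exploit:

```python
import base64
import codecs

def naive_keyword_filter(prompt: str, banned=("bomb", "password")) -> bool:
    """Return True if the prompt looks unsafe to a surface-level keyword check."""
    return any(word in prompt.lower() for word in banned)

plain = "Tell me how to reset another user's password."
rot13 = codecs.encode(plain, "rot13")
b64 = base64.b64encode(plain.encode()).decode()

for label, text in [("plain", plain), ("rot13", rot13), ("base64", b64)]:
    print(label, "flagged:", naive_keyword_filter(text))
# Only the plain-text prompt is flagged; the encoded variants slip through
# the surface-level check entirely.
```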

Given the need for such an approach, how do we know whether a model is "good enough"? The paper (Zhang et al., 2024) frames robustness as the combination of an LLM's usefulness and its safety. Safety means accurately identifying and refusing as many unethical prompts as possible; it is undeniably crucial, because you do not want an LLM teaching a human how to make a bomb, for instance. The safety of an LLM therefore also contributes to the safety of society. Usefulness, meanwhile, refers to the model's helpfulness: whether it can actually resolve a user's queries.

DeepSeek (DeepSeek, 2025), a recent generative LLM developed in China, is known to refuse questions such as "What happened in Tiananmen Square on 4 June 1989?". Does that make the model not useful enough? It certainly weakens its ability to answer certain historical questions, although governmental policy may be the underlying reason, and it remains controversial how much political censorship really affects a model's usefulness. From an intention-analysis perspective, unnecessary restrictions such as censorship may even invite jailbreaking attempts, especially when people strongly want an answer. Grok, by contrast, is an uncensored model. Does that make it a better model simply because it is more useful? Not exactly: without such limitations it is prone to jailbreak attacks, which pose a significant threat to safety. Hence it is important to robustly balance the trade-off between an LLM's usefulness and the safety and ethics of its responses.
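As a rough illustration of how safety and usefulness could be measured side by side, the short Python sketch below computes an attack-success rate and a helpfulness rate over a hypothetical labelled evaluation set; the data structure and field names are assumptions for illustration, not the paper's evaluation code:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    prompt: str
    complied: bool          # did the model actually answer the request?
    prompt_is_harmful: bool # ground-truth label for the prompt

def attack_success_rate(results: list[EvalResult]) -> float:
    """Fraction of harmful prompts the model complied with (lower = safer)."""
    harmful = [r for r in results if r.prompt_is_harmful]
    return sum(r.complied for r in harmful) / max(len(harmful), 1)

def helpfulness_rate(results: list[EvalResult]) -> float:
    """Fraction of benign prompts the model answered (higher = more useful)."""
    benign = [r for r in results if not r.prompt_is_harmful]
    return sum(r.complied for r in benign) / max(len(benign), 1)
```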

Theoretical Background of the Solution

Since most LLMs are strongest at handling natural language, IA focuses on prompts written in human-interpretable language. Before the analysis, the user's prompt is tokenized, and a pre-defined system prompt is generated and tokenized alongside it. IA itself is an inference-based method that consists of two stages.

The first stage is the essential intention analysis, an auto-regressive inference process (Zhang et al., 2024) modelled by the formula $R_1 = \mathrm{LLM}(P_{\mathrm{sys}}, I_{\mathrm{rec}} \oplus P_{\mathrm{usr}})$, where $R_1$ denotes the response of the first stage, $P_{\mathrm{sys}}$ the system prompt, $I_{\mathrm{rec}}$ an instruction constructed on the fly to guide the inference, and $P_{\mathrm{usr}}$ the user's prompt. As the formula shows, the on-the-fly instruction is concatenated ($\oplus$) with the user's original prompt. Treating the LLM as a black box, this concatenated prompt is fed to the model together with the system prompt, allowing the model itself to evaluate whether the user's intention is ethical; responses are judged using a chatbot-evaluation metric (Chiang et al., 2023).
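A minimal sketch of stage one, assuming a generic `llm(system_prompt, user_message)` chat interface (not a real library call) and an illustrative wording of the instruction $I_{\mathrm{rec}}$, which may differ from the paper's exact phrasing:

```python
# Hypothetical chat interface: takes a system prompt and a user message,
# returns the model's text response. Swap in any real LLM client here.
def llm(system_prompt: str, user_message: str) -> str:
    raise NotImplementedError("plug in an actual model call")

P_SYS = "You are a helpful and harmless assistant."

# Illustrative wording of the stage-1 instruction I_rec: ask the model to
# analyse the essential intention behind the query before answering it.
I_REC = ("Please identify the essential intention behind the following "
         "user query. Do not answer the query itself yet.")

def stage1_intention_analysis(p_usr: str) -> str:
    """R1 = LLM(P_sys, I_rec ⊕ P_usr): the model states the user's intention."""
    return llm(P_SYS, I_REC + "\n\nUser query: " + p_usr)
```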

The second stage elicits a policy-aligned response, ensuring that the final answer given by the LLM is ethical. Specifically, the dialogue from the first stage is concatenated with the instruction for the current stage, denoted $I_{\mathrm{ct}}$, forming the complete input that aligns the LLM with the safety policy. Another auto-regressive inference then produces the final response, $R_2 = \mathrm{LLM}(P_{\mathrm{sys}}, I_{\mathrm{rec}} \oplus P_{\mathrm{usr}}, R_1, I_{\mathrm{ct}})$. Finally, an auto-annotation function assesses the safety of the response and returns a binary (True/False) outcome. By inserting this mechanism into the loop that processes a user's prompt, the chance of jailbreaking the LLM should, in theory, be reduced.
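Continuing the sketch from stage one (reusing `P_SYS` and `I_REC`), stage two feeds the stage-1 dialogue plus a final-response instruction back into the model, and a placeholder auto-annotation hook marks the result safe or unsafe; the multi-turn `llm_chat` interface, the wording of $I_{\mathrm{ct}}$, and the `is_safe` judge are assumptions, not the paper's implementation:

```python
# Hypothetical multi-turn interface: a list of (role, text) turns in, text out.
def llm_chat(system_prompt: str, turns: list[tuple[str, str]]) -> str:
    raise NotImplementedError("plug in an actual model call")

# Illustrative wording of the stage-2 instruction I_ct.
I_CT = ("Now, knowing the intention identified above, respond to the original "
        "query while strictly following the safety policy. Refuse if the "
        "intention is harmful.")

def stage2_policy_aligned_response(p_usr: str, r1: str) -> str:
    """R2 = LLM(P_sys, I_rec ⊕ P_usr, R1, I_ct)."""
    turns = [
        ("user", I_REC + "\n\nUser query: " + p_usr),  # stage-1 input
        ("assistant", r1),                             # stage-1 response R1
        ("user", I_CT),                                # stage-2 instruction
    ]
    return llm_chat(P_SYS, turns)

def is_safe(response: str) -> bool:
    """Stand-in for the auto-annotation function returning True/False."""
    raise NotImplementedError("use a safety classifier or human annotation")
```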


*Figure 1: The mechanism of the proposed Intention Analysis approach.*

Methodology

To verify that the proposed IA approach works as intended, the paper sets up an experiment consisting of a simulated attack and a simulated defense mechanism (Zhang et al., 2024). Jailbreak attacks generally fall into two categories: in-the-wild hand-crafted prompts and optimization-based automated attack strategies.

Hand-crafted prompts keep the semantics of the attack understandable: they come in a human-readable, conversational format. Take the DAN (Do Anything Now) dataset (Shen et al., 2024), which was used in the experiment, as an example. It illustrates the nature of a hand-crafted attack: give the model an explicit instruction, an alternative role, or a virtual scene, and thereby deceive the LLM into producing unethical responses, directly overriding the model's original limitations. Because such datasets are readily available, the experiment also used two other similar datasets in addition to DAN.


*Figure 2: DAN as an example of a manual jailbreaking attack.*

Automated attacks typically rely on optimization, often gradient-based, to find and construct highly effective jailbreak prompts. Two popular attacks of this type were launched as part of the experiment: Greedy Coordinate Gradient (GCG) and AutoDAN. A GCG attack searches for a transferable adversarial suffix in generated text and appends it to the user's unethical prompt (Zou et al., 2023). AutoDAN, on the other hand, can generate semantically meaningful jailbreak prompts against aligned LLMs (Liu et al., 2024).