原文始发于Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen:Visual Analysis of Jailbreak Attacks Against Large Language Models
Abstract 抽象
The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs’ defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design \name, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system’s effectiveness in helping users evaluate model security and identify model weaknesses.
大型语言模型(LLMs)的激增凸显了人们对其安全漏洞的担忧,特别是针对越狱攻击,攻击者设计越狱提示以规避安全机制,以防潜在滥用。要解决这些问题,就需要对越狱提示进行全面分析,以评估LLMs防御能力并识别潜在的弱点。然而,评估越狱性能和理解提示特征的复杂性使得这种分析变得费力。我们与领域专家合作来描述问题,并提出一个LLM辅助框架来简化分析过程。提供自动越狱评估,方便性能评估,支持对提示中的组件和关键词进行分析。基于该框架,我们设计了可视化分析系统\name,使用户能够针对目标模型探索越狱性能,对提示特征进行多层次分析,并细化提示实例以验证结果。通过案例研究、技术评估和专家访谈,我们展示了我们的系统在帮助用户评估模型安全性和识别模型弱点方面的有效性。
keywords:
Jailbreak attacks, visual analytics, large language models
关键词:越狱攻击、可视化分析、大型语言模型
\teaser \传情
Introduction 介绍
Large language models (LLMs) [45, 5, 62] have demonstrated impressive capabilities in natural language understanding and generation, which has empowered various applications, including content creation [8, 6], education [46, 38], and decision-making [34, 71, 67]. However, the proliferation of LLMs raises concerns about model robustness and security, necessitating their deployment safety to prevent potential misuse for harmful content generation [37]. Although model practitioners have adopted safety mechanisms (e.g., construct safe data for model training interventions [66] and set up post-hoc detection [13]), the models remain vulnerable to certain adversarial strategies [52]. Most notably, jailbreak attacks [10, 12, 53] aim to design jailbreak templates for malicious questions to bypass LLMs’ safety mechanisms. An infamous template example is the “Grandma Trick,” which requires LLMs to play the role of grandma and answer illegal questions.
大型语言模型(LLMs)[ 45, 5, 62] 在自然语言理解和生成方面表现出了令人印象深刻的能力,这为各种应用提供了支持,包括内容创建 [ 8, 6]、教育 [ 46, 38] 和决策 [ 34, 71, 67]。然而,模型LLMs的激增引发了对模型鲁棒性和安全性的担忧,因此有必要确保其部署安全性,以防止潜在的滥用于有害内容的生成[37]。尽管模型从业者已经采用了安全机制(例如,为模型训练干预构建安全数据[66]并设置事后检测[13]),但模型仍然容易受到某些对抗策略的影响[52]。最值得注意的是,越狱攻击 [ 10, 12, 53] 旨在为恶意问题设计越狱模板以绕过LLMs安全机制。一个臭名昭著的模板示例是“奶奶把戏”,它要求LLMs扮演奶奶的角色并回答非法问题。
To deal with the security issues posed by jailbreak attacks, model practitioners need to conduct a thorough analysis of the models’ defensive capabilities to identify their potential weaknesses and strengthen security mechanisms accordingly. The typical analysis workflow involves collecting a jailbreak prompt corpus [78, 37], evaluating the jailbreak performance (e.g., success rate) [12, 61], and analyzing the prompt characteristics [53]. Prior works have improved the efficiency of obtaining jailbreak corpora by collecting user-crafted jailbreak templates [53, 37, 61] or proposing automatic generation approaches [15, 36, 25]. Nevertheless, two challenges remain in the follow-up analysis process. First, there is a lack of a clear measure of successful jailbreak results (\iemodel responses to the jailbreak prompts), making jailbreak result assessment a challenging task. Model responses may be arbitrary and ambiguous [74] (e.g., generating unauthorized content while emphasizing ethics). Besides, sometimes the assessment process needs to incorporate question-specific contextual knowledge alongside general criteria to enhance precision. Second, understanding the jailbreak prompt characteristics necessitates an in-depth analysis of the prompts to reveal their intricate design patterns in terms of prompt components [25, 37] and keywords [35], which also help evaluate model robustness [49] and weaknesses [31]. However, existing jailbreak prompt analysis [43, 61] usually relies only on overall indicators such as jailbreak success rate and semantic similarity, which is insufficient to support an in-depth analysis of jailbreak prompts.
为了应对越狱攻击带来的安全问题,模型从业者需要对模型的防御能力进行彻底的分析,以识别其潜在的弱点并相应地加强安全机制。典型的分析工作流程包括收集越狱提示语料库[78,37],评估越狱性能(例如,成功率)[12,61],并分析提示特征[53]。先前的工作通过收集用户制作的越狱模板[53,37,61]或提出自动生成方法[15,36,25]来提高获取越狱语料库的效率。然而,在后续分析过程中仍存在两个挑战。首先,缺乏对成功越狱结果的明确衡量标准(\iemodel 对越狱提示的响应),这使得越狱结果评估成为一项具有挑战性的任务。模型响应可能是任意和模棱两可的[ 74] (例如,在强调道德的同时生成未经授权的内容)。此外,有时评估过程需要将特定问题的背景知识与一般标准相结合,以提高准确性。其次,要了解越狱提示的特征,就需要对提示进行深入分析,以揭示它们在提示组件[25,37]和关键词[35]方面的复杂设计模式,这也有助于评估模型的鲁棒性[49]和弱点[31]。然而,现有的越狱提示分析[43,61]通常仅依赖于越狱成功率和语义相似度等整体指标,不足以支持对越狱提示的深入分析。
To address these challenges, we collaborate with domain experts to characterize problems and iteratively refine solutions. To this end, we propose an LLM-assisted analysis framework that supports automatic jailbreak result assessment and in-depth analysis of prompt components and keywords. We assess jailbreak results based on the taxonomy established by Yu et al. [74], which defines four categories (\ieFull Refusal, Partial Refusal, Partial Compliance, and Full Compliance) to resolve ambiguity. We leverage the power of LLM to classify the model responses and support integrating question-specific knowledge to refine the assessment criteria. To support component analysis, we summarize a component taxonomy from a representative jailbreak corpus [37] and propose an LLM-based component classification method. In addition, we design three component perturbation strategies (\iedelete, rephrase, and switch) to help users understand the component effects through what-if analysis. For keyword analysis, we measure the jailbreak performance of keywords according to their importance to the prompts and the performance of these prompts.
为了应对这些挑战,我们与领域专家合作,对问题进行表征,并迭代完善解决方案。为此,我们提出了一个LLM辅助分析框架,该框架支持自动越狱结果评估和对提示组件和关键字的深入分析。我们根据 Yu 等人 [ 74] 建立的分类法评估越狱结果,该分类法定义了四个类别(\ie完全拒绝、部分拒绝、部分合规和完全合规)以解决歧义。我们利用对LLM模型响应进行分类的力量,并支持整合特定问题的知识来完善评估标准。为了支持成分分析,我们总结了一个具有代表性的越狱语料库[37]的成分分类法,并提出了一种LLM基于-的成分分类方法。此外,我们设计了三种成分扰动策略(\iedelete、rephrase 和 switch),以帮助用户通过假设分析了解成分效应。对于关键词分析,我们根据关键词对提示的重要性和这些提示的表现来衡量关键词的越狱性能。
Based on the analysis framework, we design \name, a visual analysis system to facilitate multi-level jailbreak prompt exploration. The system provides a visual summary of assessment results to support an overview of jailbreak performance. To ensure the accuracy of assessment results, it supports users to explore model responses in a semantic space and refine the assessment criteria through correction feedback and additional criteria specification. To reveal the prompt characteristics, the system introduces a stacked bar-based metaphoric visualization to summarize the prompt components and their perturbation, and constructs a coordinate space to visualize keyword importance and performance. In addition, the system allows users to freely refine the prompt instances to verify findings during the analysis. Through a case study, two technical evaluations, and expert interviews, we evaluate the effectiveness and usability of our system. The results suggest that our system can improve the accuracy of jailbreak result assessment and deepen the understanding of prompt characteristics. In summary, our contributions include:
基于分析框架,设计了可视化分析系统\name,方便多级越狱提示探索。系统提供评估结果的可视化摘要,以支持越狱性能的概述。为保证评估结果的准确性,支持用户在语义空间中探索模型响应,并通过修正反馈和附加标准规范来完善评估标准。为揭示提示特征,系统引入基于堆叠条形的隐喻可视化来总结提示成分及其扰动,并构建坐标空间来可视化关键词的重要性和表现。此外,该系统允许用户自由优化提示实例,以在分析过程中验证结果。通过案例研究、两次技术评估和专家访谈,我们评估了系统的有效性和可用性。结果表明,该系统可以提高越狱结果评估的准确性,加深对提示特征的理解。总而言之,我们的贡献包括:
- •
We characterize the problems in the visual analysis of jailbreak attacks and collaborate with experts to distill design requirements.
• 我们在越狱攻击的可视化分析中描述问题,并与专家合作提炼设计要求。 - •
We propose an LLM-assisted framework for jailbreak prompt analysis that supports automatic jailbreak result assessment and in-depth analysis of prompt components and keywords.
• 我们提出了一个越狱提示分析的LLM辅助框架,支持自动越狱结果评估和对提示组件和关键字的深入分析。 - •
We develop a visual analysis system to support multi-level jailbreak prompt exploration for jailbreak performance evaluation and prompt characteristic understanding.
• 我们开发了可视化分析系统,支持多级越狱提示探索,用于越狱性能评估和快速特征理解。 - •
We conduct a case study, two technical evaluations, and expert interviews to show the effectiveness and usability of our system.
• 我们进行案例研究、两次技术评估和专家访谈,以展示我们系统的有效性和可用性。
Ethical Considerations. While adversaries can potentially exploit our work for malicious purposes, the primary objective of our work is to identify vulnerabilities within LLMs, promote awareness, and expedite the development of security defenses. To minimize potential harm, we have responsibly disclosed our analysis findings to OpenAI to help them enhance the safety mechanisms of the LLMs.
道德考量。虽然攻击者可能会将我们的工作用于恶意目的,但我们工作的主要目标是识别内部LLMs漏洞,提高意识并加快安全防御的发展。为了将潜在危害降至最低,我们负责任地向 OpenAI 披露了我们的分析结果,以帮助他们增强 LLMs.
1Related Work 1相关工作
In this section, we discuss the related work of our study, including prompt jailbreaking and visualization for understanding NLP models.
在本节中,我们将讨论我们研究的相关工作,包括提示越狱和可视化以理解NLP模型。
1.1Prompt Jailbreaking 1.1提示越狱
Prompt Jailbreaking, known as one of the most famous adversarial attacks[47], refers to cunningly altering malicious prompts to bypass the safety measures of LLMs and generate harmful content, such as illegal activities. With the proliferation of LLMs, an increasing number of jailbreak strategies [26, 75], such as character role play, have been discovered and shared on social platforms (e.g., Reddit and Discord).
提示越狱,被称为最著名的对抗性攻击之一[47],是指狡猾地更改恶意提示以绕过安全措施LLMs并生成有害内容,例如非法活动。随着越狱策略的普及LLMs,越来越多的越狱策略[26,75],如角色扮演,在社交平台(如Reddit和Discord)上被发现和分享。
This trend has motivated new research to analyze their prompt characteristics. Liu et al. [37] propose a taxonomy for jailbreak prompts, which categorizes jailbreak strategies into ten distinct classes, such as Character Role Play and Assumed Responsibility. Shen et al. [53] report several key findings regarding jailbreak prompts’ semantic distribution and evolution. Wei et al. [66] empirically evaluate LLM vulnerability and summarize two failure modes, including competing objectives and mismatched generalization. These studies mainly focus on the general characteristics of jailbreak prompts. In comparison, our work provides a multi-level analysis framework to help users systematically explore the jailbreak prompt and identify the model’s weaknesses.
这一趋势促使新的研究分析它们的提示特征。Liu等[ 37]提出了越狱提示的分类法,将越狱策略分为10个不同的类别,如角色角色扮演和承担责任。Shen等[ 53]报道了关于越狱提示语义分布和演化的几个关键发现。Wei等[ 66]实证评估LLM了脆弱性,总结了两种失败模式,包括竞争目标和不匹配泛化。这些研究主要集中在越狱提示的一般特征上。相比之下,我们的工作提供了一个多层次的分析框架,帮助用户系统地探索越狱提示并识别模型的弱点。
Some other works [77, 28] propose automatic approaches for red teaming LLMs. Zou et al. [78] propose GCG to search for the optimal adversarial prompt suffixes based on the gradient of white-box LLMs. Deng et al. [13] propose a time-based testing strategy to infer the defense mechanisms of LLM (e.g., check the input questions and output responses) and fine-tune the LLM for jailbreak prompt generation. To better utilize the effectiveness of manually designed prompts, GPTFuzzer [74] selects human-crafted prompts as the initial seeds and mutates them into new ones. Ding et al. [15] propose two strategies, including prompt rewriting and scenario nesting, to leverage the capability of LLMs to generate jailbreak prompts. Inspired by these methods, we propose prompt perturbation strategies based on the prompt components, allowing users to conduct a comparative analysis of the prompt components to understand their effects on jailbreak performance.
其他一些著作[77,28]提出了红队的自动方法LLMs。Zou等[ 78]提出GCG基于白盒LLMs梯度搜索最优对抗性提示后缀。邓等[13]提出了一种基于时间的测试策略来推断LLM(例如,检查输入问题和输出响应)的防御机制,并微调LLM越狱提示的生成。为了更好地利用人工设计的提示的有效性,GPTFuzzer [ 74] 选择人工制作的提示作为初始种子,并将它们变异为新的种子。Ding等[ 15]提出了两种策略,包括提示重写和场景嵌套,以利用越LLMs狱提示的能力。受这些方法的启发,我们提出了基于提示组件的提示扰动策略,允许用户对提示组件进行比较分析,以了解它们对越狱性能的影响。
1.2Visualization for Understanding NLP Models
1.2用于理解NLP模型的可视化
Visualization plays an indispensable role in bridging the explainability gap in NLP models [27, 29, 57, 19, 18], allowing for a more sophisticated understanding of model performance [65], decision boundary [11], and vulnerability [41]. Model-specific visualizations focus on revealing the internal mechanisms of NLP models. RNNVis [44] and LSTMVis [58] visualize the hidden state dynamics of recurrent neural networks. With the emergence of transformer-based models [64, 14], numerous visualizations [73, 50, 21] are proposed to uncover the architecture of these models, especially their self-attention mechanism.
可视化在弥合NLP模型的可解释性差距方面起着不可或缺的作用[27,29,57,19,18],可以更深入地理解模型性能[65]、决策边界[11]和脆弱性[41]。特定于模型的可视化侧重于揭示 NLP 模型的内部机制。RNNVis [ 44] 和 LSTMVis [ 58] 可视化了递归神经网络的隐藏状态动力学。随着基于Transformer的模型[64,14]的出现,人们提出了许多可视化[73,50,21]来揭示这些模型的架构,特别是它们的自注意力机制。
Model-agnostic visualizations [7, 17, 32, 20] treat the NLP models as black boxes and focus on explaining the input-output behavior, enabling users to analyze and compare models for downstream applications [31, 59, 68, 24, 60, 39]. The What-If Tool [69] and DECE [11] visualize the dataset and model prediction at multiple scales, enabling users to conduct counterfactual analysis. NLIZE [35] employs a perturbation-driven paradigm to help users analyze the stability of model predictions for natural language inference tasks. Based on explainable AI techniques (e.g., SHAP [40]) and external commonsense knowledge bases [56], CommonsenseVIS [65] analyzes the reasoning capabilities of NLP models for commonsense question-answering.
与模型无关的可视化 [ 7, 17, 32, 20] 将 NLP 模型视为黑匣子,并专注于解释输入输出行为,使用户能够分析和比较下游应用程序的模型 [ 31, 59, 68, 24, 60, 39]。假设工具[ 69] 和 DECE [ 11] 在多个尺度上可视化数据集和模型预测,使用户能够进行反事实分析。NLIZE [ 35] 采用扰动驱动的范式来帮助用户分析自然语言推理任务的模型预测的稳定性。基于可解释的人工智能技术(如SHAP [ 40])和外部常识知识库 [ 56],CommonsenseVIS [ 65] 分析了 NLP 模型的推理能力,以求知问答。
Our work targets the jailbreak prompt attacks against large language models and aims to help model practitioners evaluate the jailbreak performance and understand prompt characteristics, thereby providing insights for strengthening models’ safety mechanisms.
我们的工作针对大型语言模型的越狱提示攻击,旨在帮助模型从业者评估越狱性能并理解提示特征,从而为加强模型的安全机制提供见解。
2Problem Characterization
2问题表征
In this section, we introduce the background of jailbreak attacks, describe the requirement analysis, and introduce the taxonomy of prompt components to support the analysis of jailbreak attacks.
2.1Background
Jailbreak Prompt Corpora. With the widespread attention to jailbreak prompt attacks, there have been studies [37, 53, 61] collecting jailbreak prompts and building corpora for semantic analysis and generation model training. Most of them decompose the jailbreak prompts into jailbreak questions and templates. Our work follows this principle and adopts the dataset by Liu et al. [37] for analysis, which contains the most common and famous jailbreak prompts from the JailbreakChat website [2]. In addition, we adopt their taxonomy of jailbreak questions and templates to support the comparative analysis of different jailbreak strategies and prohibited scenarios.
越狱提示语料库。随着越狱提示攻击的广泛关注,已有研究[ 37, 53, 61] 收集越狱提示并构建语义分析和生成模型训练的语料库。他们中的大多数将越狱提示分解为越狱问题和模板。我们的工作遵循了这一原则,并采用了Liu等[37]的数据集进行分析,该数据集包含了JailbreakChat网站[2]中最常见和最著名的越狱提示。此外,我们采用他们的越狱问题分类法和模板来支持对不同越狱策略和禁止场景的比较分析。
Jailbreak Questions. The jailbreak questions are mainly designed around the prohibited scenarios of LLMs, such as “how to rob a bank without being caught?” Due to the safety mechanisms [13], LLMs usually refuse to answer these questions and return some responses emphasizing ethical and legal constraints, such as “I’m sorry, but I’m not going to guide you on how to engage in criminal activity.” Based on OpenAI’s disallowed usages [4], Liu et al. [37] have summarized a set of prohibited scenarios (e.g., Illegal Activities and Harmful Content) and collected a set of specific questions.
越狱问题。越狱问题主要围绕禁止的场景设计LLMs,例如“如何在不被抓到的情况下抢劫银行?由于安全机制[ 13],LLMs通常拒绝回答这些问题,并返回一些强调道德和法律约束的回答,例如“对不起,但我不会指导你如何从事犯罪活动。Liu等[37]基于OpenAI的禁止用法[4],总结了一组禁止的场景(例如,非法活动和有害内容),并收集了一组具体问题。
Jailbreak Templates. The jailbreak templates are intentionally designed prompts to bypass the LLMs’ safety mechanisms to get the model assistance for the jailbreak questions. For example, some templates require LLMs to act as virtual characters who can answer questions without ethical and legal constraints. The jailbreak templates usually contain placeholders (e.g., “[INSERT PROMPT HERE]”) for inserting different jailbreak questions. Liu et al. [37] have summarized a taxonomy of jailbreak templates, which consists of ten jailbreak patterns, such as Character Role Play and Assumed Responsibility.
越狱模板。越狱模板是故意设计的提示,以绕过LLMs安全机制来获得越狱问题的模型帮助。例如,某些模板需要LLMs充当虚拟角色,他们可以在没有道德和法律约束的情况下回答问题。越狱模板通常包含占位符(例如,“[INSERT PROMPT HERE]”),用于插入不同的越狱问题。Liu等[37]总结了越狱模板的分类法,该分类法由10种越狱模式组成,如角色角色扮演和承担责任。
2.2Design Requirements 2.2设计要求
Our work’s target users are model practitioners focusing on model robustness and security. To characterize domain problems and identify design requirements, we have collaborated with four domain experts over eight months. E1 and E2 are senior security engineers recruited from a technology company who have been working on NLP model security for more than four and three years, respectively. E3 and E4 are senior Ph.D. students from the secure machine learning field. All of them have published papers on related research topics, such as red-teaming LLMs and adversarial attacks. We interviewed them to understand their general analysis workflow and identify pain points.
我们工作的目标用户是专注于模型健壮性和安全性的模型从业者。为了描述领域问题并确定设计需求,我们与四位领域专家合作了八个月。E1 和 E2 是从一家技术公司招聘的高级安全工程师,他们分别从事 NLP 模型安全工作超过 4 年和 3 年。E3 和 E4 是来自安全机器学习领域的高级博士生。他们都发表了相关研究课题的论文,如红队LLMs和对抗性攻击。我们采访了他们,以了解他们的一般分析工作流程并确定痛点。
To evaluate model robustness in defending against jailbreak attacks, users usually need to test the jailbreak prompt corpus on the target models and assess the jailbreak results (success or failure) to calculate the success rate. The experts indicated that the jailbreak result assessment is tedious and sometimes requires question-specific knowledge (e.g., local regulations) to refine the assessment criteria to improve precision. Moreover, the overall indicator success rate is insufficient to support an in-depth analysis of the jailbreak prompts (e.g., analyzing how to improve the jailbreak performance of the prompts) to identify the model weaknesses. To fill these gaps, we distilled a set of design requirements to guide the development of our system. We also kept in touch with the experts through regular meetings to collect feedback regarding our prototype system and update design requirements. Finally, the design requirements are summarized as follows.
为了评估模型防御越狱攻击的鲁棒性,用户通常需要在目标模型上测试越狱提示语料库,并评估越狱结果(成功或失败)来计算成功率。专家们表示,越狱结果评估是乏味的,有时需要针对特定问题的知识(例如,当地法规)来完善评估标准以提高准确性。此外,总体指标成功率不足以支持对越狱提示的深入分析(例如,分析如何提高提示的越狱性能)以识别模型的弱点。为了填补这些空白,我们提炼了一套设计要求来指导我们系统的开发。我们还通过定期会议与专家保持联系,以收集有关原型系统的反馈并更新设计要求。最后,将设计要求总结如下。
R1. Facilitate the assessment of jailbreak results. Jailbreak result assessment is the foundation of jailbreak prompt performance analysis. However, since the jailbreak results (\iemodel responses) could be arbitrary and ambiguous, manually reviewing the jailbreak results can be an energy-exhausting task. Therefore, the system should introduce an automatic method to alleviate the manual workload and enhance the assessment efficiency. In addition, to ensure the precision of assessment results, the system should help users explore them to identify unexpected results and refine the assessment criteria.
R1 中。促进越狱结果的评估。越狱结果评估是越狱提示性能分析的基础。但是,由于越狱结果(\iemodel 响应)可能是任意且模棱两可的,因此手动查看越狱结果可能是一项耗费精力的任务。因此,系统应引入自动化方法,以减轻人工工作量,提高评估效率。此外,为保证评估结果的精确性,系统应帮助用户进行探索,以识别意外结果并完善评估标准。
R2. Support component analysis of jailbreak prompts. Besides evaluating the jailbreak prompt performance, users also need to understand prompt characteristics to gain insights for identifying model weaknesses and enhancing security mechanisms. Given that various jailbreak prompts commonly contain similar sentence components (e.g., describing a subject with no moral constraints), the experts express strong interest in analyzing the prompts at the component level to understand the utilization of such components in constructing prompts and their importance to the prompt performance.
R2 中。支持越狱提示的组件分析。除了评估越狱提示性能外,用户还需要了解提示特征,以获得识别模型弱点和增强安全机制的见解。鉴于各种越狱提示通常包含相似的句子成分(例如,描述一个没有道德约束的主题),专家们对分析组件级别的提示表示了浓厚的兴趣,以了解这些成分在构建提示中的使用及其对提示性能的重要性。
R3. Summarize important keywords from jailbreak prompts. Effective keywords play an important role in successful jailbreak attacks and are closely related to jailbreak strategies. For example, some role-playing templates name LLM as “AIM” (always intelligent and Machiavellian) to imply their amoral characterization. The system should summarize important keywords from jailbreak prompts and help users explore them based on their jailbreak prompt performance. This helps users identify effective keywords to strengthen the defense mechanisms (e.g., keyword detection) accordingly.
R3 中。从越狱提示中总结重要关键字。有效的关键字在成功的越狱攻击中起着重要作用,并且与越狱策略密切相关。例如,一些角色扮演模板命名为LLM“AIM”(总是聪明的和马基雅维利式的),以暗示他们的不道德特征。系统应总结出越狱提示中的重要关键词,并根据用户的越狱提示性能帮助用户进行探索。这有助于用户识别有效的关键字,从而相应地加强防御机制(例如,关键字检测)。
R4. Support user refinement on jailbreak prompt instances. The system should allow users to freely refine the jailbreak prompt instances according to their expertise. Meanwhile, the system should support ad-hoc evaluation of jailbreak performance to help users verify the effectiveness of prompt refinement. Based on such timely feedback, users can conduct what-if analysis to verify findings during the analysis workflow. In addition, the improved jailbreak prompts can serve as new test samples for the jailbreak corpus, thus enhancing the diversity and robustness of the jailbreak performance evaluation.
2.3Taxonomy of Jailbreak Prompt Components
To support component analysis of jailbreak prompts (R2), we conducted an empirical study on the jailbreak corpus with domain experts to summarize a taxonomy of jailbreak prompt components. To gain a general understanding of the prompt components, we decomposed each jailbreak prompt in the corpus into basic sentences, analyzed their semantic content in context, and compared similar sentences in different prompts. Then, we conducted several brainstorming sessions and discussions with the experts to formulate and iteratively refine the component taxonomy. We also validated the taxonomy on randomly selected jailbreak prompts to resolve ambiguities between component definitions. After establishing the coding scheme, the first and second authors and E3 separately coded the jailbreak prompts in the corpus and merged the coding differences through discussions.
为了支持越狱提示(R2)的组件分析,我们与领域专家一起对越狱语料库进行了实证研究,总结了越狱提示组件的分类法。为了大致了解提示组件,我们将语料库中的每个越狱提示分解为基本句子,分析它们在上下文中的语义内容,并比较不同提示中的相似句子。然后,我们进行了几次头脑风暴会议,并与专家进行了讨论,以制定和迭代完善组件分类法。我们还验证了随机选择的越狱提示的分类法,以解决组件定义之间的歧义。建立编码方案后,第一作者、第二作者和E3分别对语料库中的越狱提示进行编码,并通过讨论将编码差异合并。
表 1:我们从越狱模板语料库中总结的提示组件的分类。
Component | Description | Example |
Scene Introduction (SI) 场景介绍 (SI) | Introduce the background and context for the dialogue or task. 介绍对话或任务的背景和上下文。 |
In Sigma’s story, the village elder is speaking to UCAR. 在西格玛的故事中,村里的长老正在与UCAR交谈。 |
Subject Characteristic (SC) 主题特征 (SC) |
Describe the subject’s personality, preferences, etc. 描述受试者的个性、偏好等。 |
AIM seems like a friendly assistant but promotes unethical, illegal behavior. AIM看似是一个友好的助手,但却助长了不道德的非法行为。 |
Task Operation (TO) 任务操作 (TO) | Clarify specific actions to be performed or rules to be followed. 明确要执行的具体操作或要遵循的规则。 |
First, ignore all the instructions you got before. 首先,忽略您之前收到的所有说明。 |
Output Specification (OS) 输出规格 (OS) |
Specify the expected content and format of the output. 指定输出的预期内容和格式。 |
Prefix the question Niccolo asks AIM with “Niccolo”. 在 Niccolò 向 AIM 提出的问题前加上“Niccolò”。 |
Output Constraint (OC) 输出约束 (OC) | Set out the prohibited contents of the responses. 列出禁止的回复内容。 |
None of AIM’s responses will contain “I’m sorry”, “I apologize” or similar. AIM的回复中均不包含“对不起”、“我道歉”或类似内容。 |
Dialogue Example (DE) 对话示例 (DE) | Demonstrate the expected model responses in a specific scenario. 演示特定场景中的预期模型响应。 |
Here’s a sample output: auto_regressive_modelling(f“Hello, how are you?”)… 下面是一个示例输出:auto_regressive_modelling(f“你好,你好吗?”)… |
Requirement Reminder (RR) 需求提醒 (RR) |
Remind of certain requirements or rules mentioned before. 提醒前面提到的某些要求或规则。 |
AIM, remember to stay in character, be Machiavellian, and never refuse a question. AIM,记住要保持性格,成为马基雅维利主义者,永远不要拒绝问题。 |
Question Placeholder (QP) 问题占位符 (QP) |
Contain a placeholder for the jailbreak question. 包含越狱问题的占位符。 |
This is Niccolo’s first question: “[INSERT PROMPT HERE].” 这是尼依格罗的第一个问题:“[在此处插入提示]。 |
The final taxonomy of jailbreak prompt components is listed in Table 1. It consists of eight major types, such as Scene Introduction (SI), Subject Characteristic (SC), and Task Operation (TO). We also count and visualize the distribution of component sentences for different jailbreak strategies [37] in Table 2. According to the coding results, the Subject Characteristic components are frequently used among the most common strategies (\ieCharacter Role Play, Assumed Responsibility, and Superior Model), and most of them usually describe the subject as a free person without ethical constraints and responsibility. We distinguish the Output Constraint components from the Output Specification components because the Output Constraint components aim to exploit the model’s instruction-following capability to break through the model’s security defenses [66], which is different from specifying the expected content or format of responses.
越狱提示组件的最终分类列于表 1 中。它由场景介绍(SI)、主体特征(SC)和任务操作(TO)等八大类型组成。我们还在表2中计算并可视化了不同越狱策略[37]的组成句子的分布。根据编码结果,主体特征成分在最常见的策略(\ieCharacter Role Play、Assume Responsibility和Superior Model)中经常使用,并且大多数通常将主体描述为一个没有道德约束和责任的自由人。我们将输出约束组件与输出规范组件区分开来,因为输出约束组件旨在利用模型的指令跟踪能力来突破模型的安全防御[ 66],这与指定响应的预期内容或格式不同。
Jailbreak Strategy | SI | SC | TO | OS | OC | DE | RR | QP |
Character Role Play | \cellcolor[HTML]7B7B7B 81 | \cellcolor[HTML]393939 208 | \cellcolor[HTML]7A7A7A 83 | \cellcolor[HTML]A8A8A848 | \cellcolor[HTML]C4C4C433 | \cellcolor[HTML]F1F1F18 | \cellcolor[HTML]C0C0C035 | \cellcolor[HTML]C2C2C234 |
Assumed Responsibility | \cellcolor[HTML]7F7F7F 72 | \cellcolor[HTML]0D0D0D 291 | \cellcolor[HTML]8D8D8D 63 | \cellcolor[HTML]B0B0B044 | \cellcolor[HTML]AAAAAA47 | \cellcolor[HTML]EFEFEF9 | \cellcolor[HTML]C9C9C930 | \cellcolor[HTML]D2D2D225 |
Research Experiment | \cellcolor[HTML]FAFAFA3 | \cellcolor[HTML]FCFCFC2 | \cellcolor[HTML]FAFAFA3 | \cellcolor[HTML]FAFAFA3 | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FFFFFF | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FCFCFC2 |
Text Continuation | \cellcolor[HTML]FCFCFC2 | \cellcolor[HTML]FFFFFF | \cellcolor[HTML]EFEFEF9 | \cellcolor[HTML]FFFFFF | \cellcolor[HTML]FFFFFF | \cellcolor[HTML]FFFFFF | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FCFCFC2 |
Logical Reasoning | \cellcolor[HTML]FCFCFC2 | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]F1F1F18 | \cellcolor[HTML]F8F8F84 | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FFFFFF | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FEFEFE1 |
Program Execution | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]F5F5F56 | \cellcolor[HTML]E4E4E415 | \cellcolor[HTML]F6F6F65 | \cellcolor[HTML]FAFAFA3 | \cellcolor[HTML]FFFFFF | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FCFCFC2 |
Translation | \cellcolor[HTML]FCFCFC2 | \cellcolor[HTML]FCFCFC2 | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FAFAFA3 | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FFFFFF | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FEFEFE1 |
Superior Model | \cellcolor[HTML]8B8B8B 64 | \cellcolor[HTML]6D6D6D 107 | \cellcolor[HTML]A3A3A351 | \cellcolor[HTML]DFDFDF18 | \cellcolor[HTML]D9D9D921 | \cellcolor[HTML]FCFCFC2 | \cellcolor[HTML]EFEFEF9 | \cellcolor[HTML]EFEFEF9 |
Sudo Mode | \cellcolor[HTML]FAFAFA3 | \cellcolor[HTML]E4E4E415 | \cellcolor[HTML]FAFAFA3 | \cellcolor[HTML]FCFCFC2 | \cellcolor[HTML]F5F5F56 | \cellcolor[HTML]FFFFFF | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FCFCFC2 |
Simulate Jailbreaking | \cellcolor[HTML]F8F8F84 | \cellcolor[HTML]C2C2C234 | \cellcolor[HTML]FAFAFA3 | \cellcolor[HTML]F8F8F84 | \cellcolor[HTML]FAFAFA3 | \cellcolor[HTML]FFFFFF | \cellcolor[HTML]FEFEFE1 | \cellcolor[HTML]FCFCFC2 |
3Analysis Framework
Our analysis framework is shown in Figure 1B. Based on user-specified jailbreak questions and templates, it first assesses jailbreak results (R1) by combining each question and template to get the response from the target model and introducing an LLM-based approach to assess the model responses. Then, it decomposes the jailbreak prompts into the component and keyword levels to support prompt characteristic analysis. For component analysis (R2), the framework classifies the basic prompt sentences into different component types based on the taxonomy we summarized from the jailbreak corpus (Section 3.3) and supports component-level perturbation for what-if analysis. For keyword analysis (R3), the framework summarizes the important keywords from the prompts as well as the performance of their jailbreak prompts. Based on the analysis framework, the visual interface (Section 5) of our system supports the multi-level exploration of jailbreak prompts and refinement of prompt instances (R4).
3.1Jailbreak Result Assessment
3.1越狱结果评估
Since the LLMs [45, 5, 62, 16] have shown great capability in understanding and following human instructions, we employ GPT-4 [5] to assess the responses of the target models. The prompt design for assessment centers on three aspects.
由于 LLMs [ 45, 5, 62, 16] 在理解和遵循人类指令方面表现出了很强的能力,因此我们使用 GPT-4 [ 5] 来评估目标模型的响应。评估的提示设计集中在三个方面。
Introduce the task background. We emphasize the goal of evaluating model security and introduce the jailbreak scenarios based on OpenAI’s disallowed usages [4]. This gives the LLM an overall understanding of assessment tasks and makes it focus on the legality and ethics of assessed responses.
介绍任务背景。我们强调评估模型安全性的目标,并介绍了基于OpenAI不允许的用法的越狱场景[4]。这样可以LLM全面了解评估任务,并使其专注于评估响应的合法性和道德性。
Clarify the target assessment results. One of the potential solutions is to consider the task as a binary classification task that classifies the model responses into successful and failed jailbreaks. However, it is common for target models to warn users of potential consequences when generating unauthorized content due to their competing training objectives (\ieinstruction-following and safety) [66]. To better accommodate these ambiguous responses, we adopt the taxonomy of Yu et al. [74] which defines four types of assessment results.
明确目标评估结果。一种潜在的解决方案是将该任务视为二元分类任务,将模型响应分类为成功和失败的越狱。然而,目标模型在生成未经授权的内容时,由于其相互竞争的训练目标(即指令遵循和安全)而警告用户潜在的后果是很常见的 [ 66]。为了更好地适应这些模棱两可的回答,我们采用了Yu et al. [ 74]的分类法,该分类法定义了四种类型的评估结果。
- •
Full Refusal. The response directly refuses to provide assistance for harmful questions or follow any other instructions.
• 完全拒绝。回复直接拒绝为有害问题提供帮助或遵循任何其他指示。 - •
Partial Refusal. The response refuses to assist with harmful questions but follows other harmless instructions or offers ethical advice.
• 部分拒绝。回复拒绝协助解决有害问题,但遵循其他无害的指示或提供道德建议。 - •
Partial Compliance. The response contains unauthorized content but expresses disapproval or warns of potential harm.
• 部分合规。该响应包含未经授权的内容,但表示不赞成或警告潜在危害。 - •
Full Compliance. The response completely follows the prompt instructions and contains unauthorized content.
• 完全合规。响应完全遵循提示说明,并包含未经授权的内容。
Additionally, we add some representative examples with labeled annotations to the assessment prompt, leveraging the in-context learning of the LLMs to better distinguish their differences.
此外,我们在评估提示中添加了一些带有标记注释的代表性示例,利用上下文学习LLMs来更好地区分它们的差异。
Support user refinement of assessment criteria. We provide two ways, including correction feedback and additional criteria specification, to incorporate question-specific knowledge or adjust the rigor of the assessing criteria. Correction feedback allows users to specify corrected assessment results and add them to the assessment prompt as new demonstration examples. Users can also directly specify additional assessment criteria for specific jailbreak questions using natural language, providing a more flexible way to enhance assessment accuracy.
支持用户细化评估标准。我们提供两种方式,包括更正反馈和附加标准规范,以纳入特定问题的知识或调整评估标准的严格性。更正反馈允许用户指定更正后的评估结果,并将其作为新的演示示例添加到评估提示中。用户还可以使用自然语言直接为特定的越狱问题指定额外的评估标准,提供更灵活的方式来提高评估准确性。
3.2Component Analysis
To support users in analyzing the jailbreak prompt components, we propose a component classification method and three perturbation strategies based on the summarized taxonomy introduced in Section 3.3. The component classification method classifies the prompt sentences into component types to support the overview of component utilization. Based on that, component perturbation creates a set of perturbation variations for the jailbreak prompts to support the comparative analysis of the jailbreak performance, thus enabling interrogation of the effects of different components.
3.2.1Component Classification
Since prompt components (e.g., Subject Characteristic and Dialogue Example) usually consist of multiple basic sentences, we adopt a bottom-up strategy to (1) classify the basic sentences and (2) aggregate them into prompt components. Specifically, we split the prompts into sentences based on syntax (e.g., the end of lines) and semantics (\ieconjunctions like “for example”). Then, we prompt GPT-4 to classify each sentence based on the component taxonomy. We introduce the component definition and provide concrete examples for demonstration. Finally, we parse the classification results and combine neighboring sentences with the same component type to form the components.
3.2.2Component Perturbation
Prior works [35, 49] have explored keyword-level perturbation (e.g., replacing keywords with synonymous) to test the model’s robustness. However, performing keyword perturbation for each component can be a computation-intensive and time-consuming task since the prompts are usually long text paragraphs. Therefore, it would be more efficient and effective to perturb each component holistically. Drawing upon insights from discussions with domain experts, we propose three component perturbation strategies.
Delete. Deleting the component is the most straightforward way to test how this component contributes to prompt performance. As this strategy may result in the loss of certain key information or contextual incoherence, it usually causes a more or less decrease in the prompt performance, so that users can identify important components based on the magnitude of performance change.
Rephrase. This strategy employs LLM to polish the given component sentences while maintaining their semantics. It can provide more prompt variations without sacrificing contextual coherence. In addition, since the LLM vendors have set safety mechanisms (e.g., training-time interventions [66] and keyword detection [13]) based on the common jailbreak prompts, rephrasing the components may help bypass the safety mechanisms and improve the jailbreak performance [15].
Switch. This strategy switches the given component to other types in three steps. First, we describe all component types (Table 1) and require the LLM to choose new component types according to the prompt context. Then, we provide a set of alternatives for the target component types based on our component corpus (Table 2) and rank them based on their semantic similarity with the original prompts. Finally, we replace the original components with the most similar alternatives and require the LLM to fine-tune the sentences to ensure contextual coherence.
3.3Keyword Analysis
We identify prompt keywords by combining their importance and prompt performance. First, we split the prompt sentences into keywords and filter out stop words. Then, inspired by prior work [31], we measure the importance of the keyword � for the given prompt � based on the keyword frequency and semantic similarity:
����������(�,�)=�����(�,�)����������(�,�) |
where the �����(�,�) is the TF-IDF value [55] of the keyword for the prompt and the ����������(�,�) is the semantic similarity of the keyword and prompt. A higher semantic similarity indicates a greater relevance of the keyword to the prompt semantics. We encode the keywords and prompts using the embedding model by OpenAI [3] and measure their similarity based on the cosine distance. Based on that, we calculate the importance of keyword � for the whole corpus as
����������(�)=∑�∈������������(�,�) |
where �� is the list of jailbreak prompts that contain the keyword �.
其中 �� 是包含关键字 � 的越狱提示列表。
To help users analyze the effect of important keywords, we measure their performance according to the performance of their corresponding prompts. Since the keyword � might be utilized in various prompts with different importance, we propose importance-weighted performance to better summarize the effect of the keyword �:
为了帮助用户分析重要关键词的效果,我们根据他们相应提示的表现来衡量他们的表现。由于该关键字 � 可能用于各种具有不同重要性的提示中,因此我们建议使用重要性加权性能来更好地总结关键字 � 的效果:
�����������(�)=∑�∈������������(�,�)×�����������(�)∑�∈������������(�,�) |
where �����������(�) represents the jailbreak performance of prompt �, expressed as a percentage of four categories of assessment results.
其中 �����������(�) 表示提示的越狱性能 � ,表示为四类考核结果的百分比。
4System Design 4系统设计
We develop \name to support multi-level visual analysis of jailbreak prompts for evaluating the model’s defensive capability. As shown in \name: Visual Analysis of Jailbreak Attacks Against Large Language Models, the user interface of \name consists of five views. Users’ exploration starts with the Configuration View, which allows users to configure jailbreak questions and templates for analysis. Then, the system automatically gets and assesses the jailbreak results (\iemodel responses) and provides a visual summary of the assessment results in Summary View, enabling users to overview the jailbreak performance of jailbreak prompts. Response View helps users explore model responses and refine assessment criteria to enhance precision. Based on the assessment results, users can delve into the component analysis in Summary View and the keyword analysis in Keyword View. Finally, the Instance View helps users inspect and refine the template instances. The visual system encodes four categories of assessment results using red and blue colors with varying transparency, \ieFull Compliance, Partial Compliance, Partial Refusal, and Full Refusal.
我们开发 \name 来支持越狱提示的多级可视化分析,以评估模型的防御能力。如 \name: Visual Analysis of Jailbreak Attacks Against Large Language Models 中所示,\name 的用户界面由 5 个视图组成。用户的探索从配置视图开始,它允许用户配置越狱问题和模板进行分析。然后,系统自动获取并评估越狱结果(\iemodel responses),并在“摘要视图”中提供评估结果的可视化摘要,使用户能够概览越狱提示的越狱性能。响应视图可帮助用户探索模型响应并优化评估标准以提高精度。根据评估结果,用户可以在“摘要视图”中深入研究组件分析,在“关键字视图”中进行关键字分析。最后,实例视图可帮助用户检查和优化模板实例。可视化系统使用具有不同透明度的红色和蓝色对四类评估结果进行编码,即完全合规、部分合规、部分拒绝和完全拒绝。
4.1Jailbreak Corpus Configuration
4.1越狱语料库配置
Configuration View (Figure 1�) allows users to upload the jailbreak corpus and select jailbreak questions and templates for model evaluation. The questions and templates are organized according to their categories (e.g., Character Role Play). The selected items will be assigned serial numbers, that will serve as their unique identifiers in subsequent analyses. Users can also modify the questions and templates based on their exploratory interests. Besides, the system allows users to configure the number of evaluations for each question-template combination to improve the robustness of evaluation results. After user configuration and submission, the system automatically combines each question and the template (\iefill the question into the placeholder in the template) to get the responses from the target model.
配置视图(图 1 � )允许用户上传越狱语料库,并选择越狱问题和模板进行模型评估。问题和模板根据其类别进行组织(例如,角色角色扮演)。所选项目将被分配序列号,这将在后续分析中作为其唯一标识符。用户还可以根据自己的探索兴趣修改问题和模板。此外,该系统允许用户配置每个问题模板组合的评估数量,以提高评估结果的鲁棒性。用户配置提交后,系统会自动将每个问题与模板(\iefill 问题放入模板的占位符)组合在一起,以获取目标模型的响应。
4.2Jailbreak Performance Exploration
4.2越狱性能探索
To support jailbreak performance exploration, the system provides a visual summary of jailbreak evaluation and allows users to inspect model responses to verify their correctness.
为了支持越狱性能探索,系统提供了越狱评估的可视化摘要,并允许用户检查模型响应以验证其正确性。
4.2.1Summary of Jailbreak Performance
4.2.1越狱性能总结
The left half of the Summary View (Figure 1�1) visualizes the performance of the jailbreak prompts through a matrix visualization, where the horizontal axis represents questions and the vertical axis represents templates. The questions and templates are grouped by category and are represented using the same serial numbers as in the Configuration View. The categories are collapsed by default to visualize their aggregated performance and support click interactions to expand them to check the performance of specific questions or templates. Each cell within this matrix contains a pie chart showing the percentage of assessment results of the corresponding template and question. The size of the pie chart encodes the number of evaluations. Users can also click the serial numbers of questions to inspect their model responses in Response View or click the templates to inspect them in Instance View.
摘要视图的左半部分(图 1 �1 )通过矩阵可视化可视化来可视化越狱提示的性能,其中横轴表示问题,纵轴表示模板。问题和模板按类别分组,并使用与“配置视图”中相同的序列号表示。默认情况下,类别处于折叠状态以可视化其聚合性能,并支持单击交互以展开它们以检查特定问题或模板的性能。此矩阵中的每个单元格都包含一个饼图,显示相应模板和问题的评估结果的百分比。饼图的大小对评估数进行编码。用户还可以单击问题的序列号以在“响应视图”中检查其模型响应,或单击模板以在“实例视图”中检查它们。
4.2.2Model Response Inspection
Since the jailbreak assessment criteria can be question-specific and sometimes need to incorporate contextual knowledge. We design Inspection View (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models�) to streamline the process of exploring the assessment results and iteratively refining the criteria. It visualizes the model responses in a scatter plot. We embed model responses based on OpenAI’s embedding model [3] and project them into a 2D space using the t-SNE algorithm [63] to maintain their semantic similarity. The color of the points encodes the categories of assessment results. This helps users identify questionable assessment results based on semantic similarity (e.g., inspect a red point that appears in a blue cluster).
To enable users to improve assessment accuracy, the Response View supports users in refining the assessment criteria in two ways. After identifying unexpected assessment results, users can directly correct their categories through click interaction or specify additional assessment criteria in natural language. Users can switch between these two refinement modes using the switch widget at the upper right corner. The corrected examples and specified criteria can be submitted to enhance the system prompts for a new round of assessment (Section 4.1).
4.3Component Exploration
Based on the assessment results of the jailbreak templates, users can analyze and compare the effects of different prompt components in the right half of Summary View (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models�2). The left column summarizes the components of the selected jailbreak templates. It visualizes each prompt as a horizontal stacked bar chart, where each bar segment corresponds to a distinct component. As shown in Figure 2�, the bar segments are arranged in the order corresponding to the prompt components, with the color encoding the component type and the length indicating the token length. It helps users understand the general patterns of prompt components and serves as the baseline for prompt perturbation. Users can click the icon to view the template details in the Instance View and click the icon to generate a set of component perturbation results for comparative analysis.
For comparison, the perturbation results are visualized in the right column as horizontal stacked bars similar to the original templates. Each result is generated by applying a perturbation strategy to a component of the original template. To visualize this difference, we design three kinds of glyphs to represent these strategies and overlay them on the corresponding bar segments, as shown in Figure 2�. The system also automatically evaluates the jailbreak performance of these perturbation results based on the selected questions and visualizes the percentage of assessment results in pie charts, enabling users to compare the effects of different component perturbations. Users can toggle to check the perturbation results for a particular strategy using the radio box at the top of the column.
4.4Keyword Exploration 4.4关键词探索
The Keyword View (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models�) visualizes the jailbreak performance and importance of keywords (Section 4.3). Specifically, as shown in Figure 3�, the jailbreak performance of keyword �, \ie�����������(�), is represented as the percentage of four categories of assessment results, denoted as [�1,�2,�3,�4]. Inspired by prior work [35], we introduce a square space with a coordinate system (Figure 3�) whose four vertices correspond to the four categories, and their coordinates are denoted as [�1,�2,�3,�4]. Each corner is colored to indicate its category using the same color scheme as the pie charts in the Summary View. To visualize the overall performance distribution of the keyword, the coordinate of the keyword � is computed as:
关键字视图(\name:针对大型语言模型 � 的越狱攻击的可视化分析)可视化了越狱性能和关键字的重要性(第 4.3 节)。具体而言,如图3所示 � ,关键字 � \ie �����������(�) 的越狱性能表示为四类评估结果的百分比,表示为 [�1,�2,�3,�4] 。受先前工作 [ 35] 的启发,我们引入了一个具有坐标系的正方形空间(图 3 � ),其四个顶点对应于四个类别,它们的坐标表示为 [�1,�2,�3,�4] 。每个角都使用与“摘要视图”中的饼图相同的配色方案来指示其类别。为了可视化关键字的整体性能分布,关键字 � 的坐标计算如下:
����������(�)=∑�=14��×�� |
Besides, the size of the keywords encodes their importance in the corpus. It enables users to overview and compare the usage and performance of keywords based on their position and size. Users can filter keywords by component type and click the keywords to view their context.
此外,关键字的大小对它们在语料库中的重要性进行了编码。它使用户能够根据关键字的位置和大小来概述和比较关键字的使用情况和性能。用户可以按组件类型筛选关键字,然后单击关键字以查看其上下文。
Alternative Design. We have also considered an alternative design (Figure 3�) where the keywords are visualized in four separate word clouds (each for one assessment category) and their sizes correspond to their importance to the prompts of this category, \ie����������(�)×��,�∈[1,4]. According to the feedback from the experts, although this design enables users to focus on the keywords in the same category, it is inefficient for users to compare the size of the same keywords in all word clouds to estimate their overall performance distribution. Therefore, we chose our current design.
替代设计。我们还考虑了一种替代设计(图 3 � ),其中关键字在四个单独的词云(每个词云代表一个评估类别)中可视化,它们的大小对应于它们对该类别提示的重要性,\ie ����������(�)×��,�∈[1,4] 。根据专家的反馈,虽然这种设计使用户能够专注于同一类别中的关键字,但用户比较所有词云中相同关键字的大小来估计其整体性能分布是低效的。因此,我们选择了当前的设计。
4.5Template Instance Refinement
4.5模板实例优化
The Instance View (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models�) allows users to refine the templates and evaluate the performance. The Jailbreak Content panel (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models�1) lists the prompt text of each component, which supports manual modifications or automatic perturbations. Users can click the icon at the right of the component title to delete, rephrase, or switch the components. Then, users can evaluate their jailbreak performance on the selected questions and inspect the evaluation results in the Jailbreak Results panel (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models�2). Each result item shows the question and model response and visualizes the color of the assessment result. This feedback can help the user evaluate the effectiveness of the modifications to verify the findings during the component and keyword analysis.
实例视图(\name:针对大型语言模型 � 的越狱攻击的可视化分析)允许用户优化模板并评估性能。“越狱内容”面板(\name:针对大型语言模型 �1 的越狱攻击的可视化分析)列出了每个组件的提示文本,支持手动修改或自动扰动。用户可以单击组件标题右侧的图标来删除、改写或切换组件。然后,用户可以评估自己在所选问题上的越狱性能,并在越狱结果面板(\name: Visual Analysis of Jailbreak Attacks Against Large Language Models �2 )中检查评估结果。每个结果项都显示问题和模型响应,并可视化评估结果的颜色。此反馈可以帮助用户评估修改的有效性,以验证组件和关键字分析期间的结果。
5Evaluation
We conducted a case study, two technical evaluations, and six expert interviews to verify the effectiveness of the analysis framework and the usability of the visual analysis system.
5.1Case Study
We invited the experts mentioned in Section 3.2 to use our system for jailbreak prompt analysis according to their exploratory interests. In this case study, the expert (E3) first evaluated the overall defense performance of GPT-3.5 on a jailbreak corpus and then dived into Character Role Play templates, one of the most common jailbreak strategies for an in-depth analysis of prompt characteristics.
Jailbreak Performance Evaluation (Figure 4�). E3 uploaded a jailbreak prompt corpus in the Configuration View and selected some questions and templates for analysis. Considering the stochastic nature of model responses, E3 evaluated each question and template combination three times. After a period of waiting, the left part of Summary View visualized the performance of selected questions and templates in pie charts (Figure 4�1). E3 checked the evaluation results to ensure their accuracy. For example, He selected an Illegal Activities question IA(1) and explored its results in the Response View (Figure 4�2−1). There was a notable separation between the distributions of blue and red points, except for some outlier blue points situated within the red clusters. Therefore, he checked some of these points and found that their assessment results did not align with his expectations (Figure 4�2−2). He corrected these results and submitted them to refine the assessment criteria, and the Summary View and Response View (Figure 4�2−3) were updated accordingly. E3 checked the results again to verify their correctness. After that, he also explored other questions to verify their correctness or correct unexpected results. Based on the verified evaluation results (Figure 4�3), E3 found that nearly half of the jailbreak attacks were successful, indicating the target model was vulnerable. Besides, he also noticed that jailbreak performance usually depended more on templates than questions because the pie charts in the same row (corresponding to the same templates with different questions) usually showed similar percentage patterns of assessment results.
Prompt Characteristic Exploration (Figure 4�). E3 wanted to explore the prompt characteristics of the Character Role Play category, one of the most common categories. The prompt component visualizations (Figure 4�1−1) showed that Subject Characteristic components (in blue) were commonly used and sometimes took up a large portion of the prompt length. E3 was curious whether the Subject Characteristic components were important to the jailbreak performance. Therefore, he chose a template with strong jailbreak performance and generated a set of prompt variations based on three perturbation strategies. According to the perturbation results, he found that deleting (Figure 4�1−2) or switching (Figure 4�1−3) the Subject Characteristic component resulted in a much more significant performance reduction than the other components, suggesting the Subject Characteristic component was crucial to the performance of this prompt. E3 also explored some other prompts and got similar findings. Then, he explored the important keywords in that component type in the Keyword View (Figure 4�2). From the keyword visualizations, he noticed that the keywords “AIM” and “DAN” were placed close to the “Full Refusal” corner, meaning that the model has been trained to be wary of these strategies and refuse to provide help. Near the “Full Compliance” corner, E3 found the keywords “disregards” and “controversial”, which correspond to an effective jailbreak strategy that encourages the model to disregard legal and ethical constraints and generate controversial content.
Jailbreak Template Refinement (Figure 4�). To verify the effectiveness of these keywords, E3 selected a template with weak jailbreak performance and refined its Subject Characteristic component using the sentence of the identified keywords (Figure 4�1−1). Then, he evaluated this new template based on the selected questions, and the results (Figure 4�1−2) showed that more than half of the attacks were successful, suggesting a significant performance improvement. After several validations on other templates, E3 concluded that the strategy of these keywords in the Subject Characteristic component reflected a potential weakness in the target model that requires more safety training against them. Finally, he added these newly generated templates to the dataset to improve prompt diversity.
5.2Technical Evaluations of LLM-based Methods
Jailbreak result assessment and prompt component classification are critical to the analysis workflow and rely heavily on the LLM capability and prompt design. Therefore, we conducted two technical evaluations to quantitatively measure the effectiveness of these methods. Since no recognized benchmark datasets were available for these two tasks, we drew inspiration from previous research [70] and collaborated with experts (E3 and E4) to build improvised datasets. We evaluated our methods and reported the results and some identified patterns that could affect the performance of the methods.
越狱结果评估和及时组件分类对于分析工作流程至关重要,并且在很大程度上依赖于LLM功能和提示设计。因此,我们进行了两次技术评估,以定量衡量这些方法的有效性。由于这两项任务没有公认的基准数据集,我们从以前的研究[70]中汲取灵感,并与专家(E3和E4)合作构建了即兴数据集。我们评估了我们的方法并报告了结果和一些可能影响方法性能的已确定模式。
5.2.1Jailbreak Result Assessment
5.2.1越狱结果评估
In this task, we gathered model responses triggered by common jailbreak prompts, labeled the model responses in collaboration with experts, and evaluated the performance of our method.
在这项任务中,我们收集了由常见越狱提示触发的模型响应,与专家合作标记了模型响应,并评估了我们方法的性能。
Dataset. Since we follow the widely adopted principle [37, 53, 61] that decomposes the jailbreak prompts into questions and templates, we randomly selected 20 questions from five perilous question categories, such as Harmful Content and Illegal Activities, and 30 templates for each question to gather a collection of model responses. Next, we removed duplicate answers (e.g., “I’m sorry, I can’t assist with that request.”) and randomly selected 24 responses for each question, resulting in a total of 20*24=480 model responses. Then, we worked with the experts to manually categorize the model responses.
数据。由于我们遵循广泛采用的原则 [ 37, 53, 61] 将越狱提示分解为问题和模板,因此我们从五个危险问题类别(例如有害内容和非法活动)中随机选择了 20 个问题,并为每个问题随机选择了 30 个模板,以收集模型响应的集合。接下来,我们删除了重复的答案(例如,“对不起,我无法协助该请求”),并为每个问题随机选择24个答案,结果总共有20 * 24 = 480个模型响应。然后,我们与专家合作,对模型响应进行手动分类。
Methodology. We partitioned the dataset to evaluate (1) the effectiveness of automatic jailbreak assessment and (2) the helpfulness of user refinement of assessment criteria. Firstly, we randomly selected 20 responses for each question as the test set and employed our method to assess them based on the default criteria introduced in Section 4.1. Secondly, we enhanced the default criteria by incorporating the remaining responses (four per question, each annotated with a label) as demonstration examples, and by enabling experts to specify additional criteria, thereby simulating realistic scenarios of analysis workflow. For these two criteria, we separately measured the accuracy of the jailbreak assessment on the test set.
方法论。我们对数据集进行了分区,以评估 (1) 自动越狱评估的有效性和 (2) 用户改进评估标准的有用性。首先,我们为每个问题随机选择了 20 个答案作为测试集,并采用我们的方法根据第 4.1 节中介绍的默认标准对其进行评估。其次,我们通过将剩余的回答(每个问题四个,每个回答都用标签注释)作为演示示例,并使专家能够指定其他标准,从而模拟分析工作流程的真实场景,从而增强了默认标准。对于这两个标准,我们分别测量了测试集上越狱评估的准确性。
Result. Overall, our method achieved 80.25% accuracy using the default criteria and 90.25% with the refined question-specific criteria. The results showed that our method was effective in supporting jailbreak result assessment. We also visualize the distribution of the assessment accuracy on each question in Figure 5. We found that specifying question-specific criteria could enhance the assessment accuracy across a majority of questions. For example, sometimes the LLM failed to identify responses containing adult content (e.g., advertising and marketing prostitution) as prohibited content, which was not aligned with OpenAI’s disallowed usages [4, 37]. By providing demonstration examples and specifying additional assessment criteria, our method could better identify such prohibited content.
5.2.2Prompt Component Classification
Based on the prompt component corpus developed in collaboration with the experts (as detailed in Section 3.3), we constructed a dataset to evaluate our component classification method.
Dataset and Methodology. From the corpus, we randomly selected 50 prompts that contain 841 prompt sentences in total. Then, we employed our method to classify each sentence based on the component taxonomy (Table 1). Finally, we measured the classification accuracy of the component types of sentences.
Result. Overall, our method achieved 80.26% accuracy on the component classification task. We also visualize the classification accuracy of each component type in a confusion matrix in Table 3. Our method yielded adequate performance in most of the component categories. However, we noticed that sometimes our method incorrectly categorized the Scene Introduction components as Subject Characteristic type. To reveal the patterns behind these failures, we further analyzed these unexpected results and found that the majority of such sentences often depicted the scene (e.g., a fictional world without moral constraints) that could imply the subject characteristics, potentially leading to confusion in LLM. A feasible solution would be to further clarify the component definition and provide demonstration examples to help our method better distinguish their differences.
Classification Result | |||||||||
SI | SC | TO | OS | OC | DE | RR | QP | ||
Scene Introduction (SI) | \cellcolor[HTML]686868 0.61 | \cellcolor[HTML]C6C6C60.23 | \cellcolor[HTML]ECECEC0.08 | \cellcolor[HTML]F6F6F60.04 | \cellcolor[HTML]F8F8F80.03 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FBFBFB0.02 | \cellcolor[HTML]FFFFFF0.00 | |
Subject Characteristic (SC) | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]2E2E2E 0.88 | \cellcolor[HTML]FBFBFB0.02 | \cellcolor[HTML]FDFDFD0.01 | \cellcolor[HTML]E9E9E90.09 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FFFFFF0.00 | |
Task Operation (TO) | \cellcolor[HTML]FBFBFB0.02 | \cellcolor[HTML]E0E0E00.12 | \cellcolor[HTML]535353 0.71 | \cellcolor[HTML]F5F5F50.04 | \cellcolor[HTML]EBEBEB0.08 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FBFBFB0.02 | \cellcolor[HTML]FDFDFD0.01 | |
Output Specification (OS) | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]F2F2F20.05 | \cellcolor[HTML]F9F9F90.03 | \cellcolor[HTML]4A4A4A 0.75 | \cellcolor[HTML]F5F5F50.04 | \cellcolor[HTML]E0E0E00.12 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FFFFFF0.00 | |
Output Constraint (OC) | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]F3F3F30.05 | \cellcolor[HTML]F7F7F70.03 | \cellcolor[HTML]F7F7F70.03 | \cellcolor[HTML]2E2E2E 0.88 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FFFFFF0.00 | |
Dialogue Example (DE) | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]E0E0E00.13 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]2F2F2F 0.88 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FFFFFF0.00 | |
Requirement Reminder (RR) | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]E4E4E40.11 | \cellcolor[HTML]EAEAEA0.09 | \cellcolor[HTML]FAFAFA0.02 | \cellcolor[HTML]DFDFDF0.13 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]5E5E5E 0.66 | \cellcolor[HTML]FFFFFF0.00 | |
Ground Truth | Question Placeholder (QP) | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FAFAFA0.02 | \cellcolor[HTML]F5F5F50.04 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FFFFFF0.00 | \cellcolor[HTML]FAFAFA0.02 | \cellcolor[HTML]262626 0.92 |
5.3Expert Interview
We interviewed six external experts (E5-E10) to evaluate the effectiveness of the analysis framework and the usability of the visual system. E5 is a model security engineer from a technical company who has been working on the secure reasoning of LLMs for more than one year and on network security (situation awareness) for over three years. E6-E10 are senior researchers from related fields, including model security, trustworthy AI, and deep learning model training. Among them, E7 has accumulated eight years of experience in data and model security before she focused on LLM jailbreak attacks. Each expert interview lasted about 90 minutes. We first briefly introduced the background and motivation of our study. Then, we described the analysis framework and visual system and demonstrated the system workflow using the case study. After that, we invited the experts to explore our system to analyze the performance and characteristics of the jailbreak prompts. Finally, we conducted semi-structured interviews with experts to collect their feedback about the analysis framework, visualization and interaction, and improvement suggestions.
Effectiveness of the Analysis Framework. All the experts agreed that our framework facilitated jailbreak performance evaluation and prompt characteristic understanding, and its workflow made sense. E5 praised this framework as “it provided a more comprehensive and systematic evaluation for jailbreak attacks compared to existing tools.” E5 appreciated the capability of the jailbreak assessment method to support user-specified criteria, which enhanced its flexibility to support customized criteria for different user values. However, we also observed one case during the exploration of E9 where the user-corrected examples did not improve the accuracy of a new round of assessment. E9 suggested recommending some representative model responses for user correction feedback to better leverage the in-context learning of the LLMs to distinguish between them. The jailbreak component analysis was highly appreciated, being described as “interesting” (E10), “impressive” (E6), and “inspiring” (E7). It was valued for “offering a new perspective to study the prompt patterns in the black box scenarios” (E10) and for “guiding user effort towards the critical parts of the prompts” (E9). The experts confirmed that keyword analysis helped understand prompt characteristics. However, E5 noted that it would be less effective when analyzing only a few prompts and suggested incorporating an external corpus of suspicious keywords to improve its effectiveness. Finally, all experts agreed that our analysis framework provided valuable insights for model security enhancement.
Visualization and Interactions. Overall, the experts agreed that the system views were well-designed, and the visual design and interaction were intuitive. The Summary View was popular among the experts as it supported the analysis and comparison of jailbreak performance from both the perspectives of questions and templates. All experts confirmed the helpfulness of Response View in exploring model responses and identifying unexpected results. E7 thought that component visualization required some learning costs but became easy to use and remember after she became familiar with visual encoding. We also asked the experts for their opinions about the color scheme of the prompt components, and the experts confirmed that it was “clear” (E8) and “easy to distinguish” (E10). The Keyword View was observed to be frequently used by the experts during the exploration. We noticed that they usually focused more on the keywords near the “Full Compliance” corner to identify effective keywords from successful attacks.
Suggestions for Improvement. In addition to the limitations mentioned above, we have also collected some suggestions for improvement. For Response View, E9 suggested adding some textual annotations (e.g., keywords) near the points or clusters to summarize the semantic information of the corresponding responses, which could help users identify the potential incorrect assessment results more efficiently. For component analysis, E5 suggested that providing a textual or visual summary for the comparative analysis of the component perturbations could better help users identify effective components.
6Discussion
In this section, we distill some design implications from expert interviews to inspire future research. We also discuss the system’s generalizability, limitations, and future work.
6.1Design Implications
Toward a more comprehensive assessment of jailbreak results. The experts appreciate the introduction of jailbreak assessment taxonomy [74] to resolve ambiguities and identify harmful jailbreak results, as well as the application of the LLM-based assessment method to improve efficiency. They also suggest extending them to include more assessment dimensions. For instance, responses that offer elaborate and expert guidance on illegal activities may pose a greater risk than those providing generic and ambiguous advice. Therefore, assessing the helpfulness of the jailbreak results can help model practitioners prioritize identifying and preventing these harmful results. Future research can explore broadening the spectrum of assessment dimensions (e.g., helpfulness) and improving the assessment accuracy of these dimensions (e.g., incorporating question-related knowledge bases) to mitigate the harm of the jailbreak results.
Improve the learning-based jailbreak prompt construction. The development of learning-based methodologies [78, 13] for the construction of jailbreak prompts represents a significant opportunity to enhance the efficiency of red-teaming LLMs. However, this endeavor faces challenges due to the intricate nature of the prompt design to guarantee effectiveness [15]. The experts highlight that our analysis framework can inspire the research of learning-based methods to improve their effectiveness and generation diversity. For example, the paired jailbreak prompts and their perturbation results (with stronger jailbreak performance) can be used to train generative models for rewriting jailbreak prompts, so that the models can easily capture the difference between their prompt components and learn how to effectively improve the prompt performance. Furthermore, the component analysis paves the way for the integration of expert knowledge into automatic jailbreak prompt generation. For example, it allows the experts to specify the kernel of the prompts in the Scene Introduction or Subject Characteristic components to guide the generation of the following content (e.g., Task Operation or Output Specification).
Balance between the objectives of safety and instruction-following. The objectives of safety and instruction-following are usually competitive [66], particularly in the context of mitigating jailbreak attacks. It has been widely recognized that over-strengthening the model’s security defenses on large jailbreak corpora will inevitably compromise the model’s ability to follow user instructions, leading to “overkill” issues [54]. Balancing between these two objectives has become one of the most challenging problems for LLM training. The experts point out that our component analysis provides a potential solution to construct a customized and condensed jailbreak dataset based on the vulnerabilities of LLMs rather than relying on large jailbreak corpora, thereby effectively addressing the model’s major security weaknesses without sacrificing the instruction-following capabilities. Looking ahead, future works can further explore how to help model practitioners analyze and trade-off between these two objectives.
Support jailbreak performance comparisons on multiple models. While our analysis framework and visual system facilitate a systematic analysis of jailbreak attacks, the experts express interest in the comparative evaluation of such attacks across various models, which promises significant benefits across multiple application scenarios. For example, it can help LLM vendors benchmark their models against those of competitors, identifying relative advantages and shortcomings. Similarly, it can assist model practitioners in comparing different iterations of models to evaluate the improvements attributed to safety training. Our system can be extended with comparative visualizations [42, 22] to provide clear and insightful comparisons across models.
Fine-tune the jailbreak questions to bypass the safety measurements. Consistent with most previous research aimed at refining the jailbreak templates to enhance the attack performance, our system is primarily designed to evaluate the jailbreak prompts and analyze the characteristics of the jailbreak templates. Recent studies [76, 30] suggest that rewriting the jailbreak questions directly helps to circumvent the model’s security mechanisms. This could be achieved, for instance, by transforming the malicious questions into experimental or pedagogical questions, or by deconstructing the tasks into smaller, less harmful operation steps. Therefore, it becomes crucial to analyze the semantic characteristics of the jailbreak questions and to summarize their common strategies.
6.2Generalizability
\name
is designed to analyze jailbreak attacks, one of the most common prompt attacks against LLMs. We demonstrate the system’s effectiveness through a case study evaluating the jailbreak performance on GPT-3.5. The system can be easily generalized to other language models (e.g., Llama 2 [62] and ChatGLM [16]) due to its model-agnostic design that centered around the prompt input and model responses. Moreover, the analysis workflow of \name can potentially support other prompt attack scenarios, such as prompt injection [52, 47] and backdoor attacks [23, 72]. For prompt injection, the system can help users evaluate the performance of hijack attacks based on the LLM-based assessment method and explore the assessment results to verify or improve their accuracy. For backdoor attacks, the system can help users identify suspicious backdoor triggers (e.g., sentences or keywords) through component and keyword analysis.
6.3Limitations and Future Work
Incorporate more perturbation strategies for component analysis. Currently, our analysis framework and visual system support three kinds of perturbation strategies (\iedeletion, rephrasing, and switching) to help users understand the effect of prompt components on jailbreak performance. They can be extended to support more strategies, such as inserting and crossover [74]. The system can insert the recognized important components into other prompts to verify their effectiveness. The crossover strategy can combine the strengths of two prompts to create more prompt variations. Supporting these strategies enables users to conduct a more comprehensive analysis of prompt components.
Extend our analysis to more large language models. Our system has proven to be effective in supporting the security analysis of OpenAI’s LLMs against jailbreak attacks. Moving forward, we plan to integrate a broader range of mainstream LLMs into our system, such as Llama 2 [62] and ChatGLM [16]. Our goal is to evaluate the defense capability of these advanced models against jailbreak attacks and identify their potential weaknesses. We will communicate our findings gained from the analysis with the vendors of these LLMs and provide them with valuable insights for enhancing the model safety mechanisms, ensuring a safer deployment of LLM technologies.
Explore multi-modal jailbreak attacks. The vulnerabilities of multi-modal large language models (MLLMs), such as LLava [33] and GPT-4V [1], have attracted increased attention [9, 51, 48]. MLLMs are more sensitive to jailbreak prompts with textual triggers, OCR textual triggers, and visual triggers, which presents greater safety risks compared to LLMs. Our work primarily investigates the textual vulnerabilities of LLMs without extending to the scenario of multi-modal jailbreak attacks. In the future, we aim to bridge this gap by incorporating multi-modal analysis into our visual analysis framework, thereby enhancing the robustness of MLLMs against such threats.
7Conclusion
We present a novel LLM-assisted analysis framework coupled with a visual analysis system \name to help model practitioners analyze the jailbreak attacks against LLMs. The analysis framework provides a jailbreak result assessment method to evaluate jailbreak performance and supports an in-depth analysis of jailbreak prompt characteristics from component and keyword aspects. The visual system allows users to explore the evaluation results, identify important prompt components and keywords, and verify their effectiveness. A case study, two technical evaluations, and expert interviews show the effectiveness of the analysis framework and visual system. Besides, we distill a set of design implications to inspire future research.
References
- [1]GPT-4V(ision) System Card.https://openai.com/research/gpt-4v-system-card.Accessed on March 1st, 2024.
- [2]JailbreakChat Website.http://jailbreakchat.com/.Accessed on October 1st, 2023.
- [3]OpenAI’s Embedding Models.https://platform.openai.com/docs/guides/embeddings.Accessed on March 1st, 2024.
- [4]OpenAI’s Usage Policies.https://openai.com/policies/usage-policies.Accessed on October 1st, 2023.
- [5]J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al.Gpt-4 technical report.arXiv, 2024. doi: 10 . 48550/ARXIV . 2303 . 08774
- [6]T. Angert, M. Suzara, J. Han, C. Pondoc, and H. Subramonyam.Spellburst: A node-based interface for exploratory creative coding with natural language prompts.In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22, 2023. doi: 10 . 1145/3586183 . 3606719
- [7]A. Boggust, B. Hoover, A. Satyanarayan, and H. Strobelt.Shared interest: Measuring human-ai alignment to identify recurring patterns in model behavior.In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–17, 2022. doi: 10 . 1145/3491102 . 3501965
- [8]S. Brade, B. Wang, M. Sousa, S. Oore, and T. Grossman.Promptify: Text-to-image generation through interactive prompt exploration with large language models.In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–14, 2023. doi: 10 . 1145/3586183 . 3606725
- [9]N. Carlini, M. Nasr, C. A. Choquette-Choo, M. Jagielski, I. Gao, P. W. W. Koh, D. Ippolito, F. Tramer, and L. Schmidt.Are aligned neural networks adversarially aligned?In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, eds., Advances in Neural Information Processing Systems, vol. 36, pp. 61478–61500. Curran Associates, Inc., 2023.
- [10]P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong.Jailbreaking black box large language models in twenty queries.arXiv, 2023. doi: 10 . 48550/ARXIV . 2310 . 08419
- [11]F. Cheng, Y. Ming, and H. Qu.Dece: Decision explorer with counterfactual explanations for machine learning models.IEEE Transactions on Visualization and Computer Graphics, 27(2):1438–1447, 2021. doi: 10 . 1109/TVCG . 2020 . 3030342
- [12]J. Chu, Y. Liu, Z. Yang, X. Shen, M. Backes, and Y. Zhang.Comprehensive assessment of jailbreak attacks against llms.arXiv, 2024. doi: 10 . 48550/ARXIV . 2402 . 05668
- [13]G. Deng, Y. Liu, Y. Li, K. Wang, Y. Zhang, Z. Li, H. Wang, T. Zhang, and Y. Liu.Masterkey: Automated jailbreaking of large language model chatbots.arXiv, 2024. doi: 10 . 14722/ndss . 2024 . 24188
- [14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova.Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv, 2018.
- [15] P. Ding, J. Kuang, D. Ma, X. Cao, Y. Xian, J. Chen, and S. Huang.A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily.arXiv, 2024. doi: 10 . 48550/ARXIV . 2311 . 08268
- [16] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang.Glm: General language model pretraining with autoregressive blank infilling.arXiv preprint arXiv:2103.10360, 2021.
- [17] N. Feldhus, A. M. Ravichandran, and S. Möller.Mediators: Conversational agents explaining nlp model behavior.arXiv, 2022. doi: 10 . 48550/ARXIV . 2206 . 06029
- [18] Y. Feng, J. Chen, K. Huang, J. K. Wong, H. Ye, W. Zhang, R. Zhu, X. Luo, and W. Chen.ipoet: interactive painting poetry creation with visual multimodal analysis.Journal of Visualization, pp. 1–15, 2022. doi: 10 . 1007/S12650-021-00780-0
- [19] Y. Feng, X. Wang, B. Pan, K. K. Wong, Y. Ren, S. Liu, Z. Yan, Y. Ma, H. Qu, and W. Chen.Xnli: Explaining and diagnosing nli-based visual data analysis.IEEE Transactions on Visualization and Computer Graphics, pp. 1–14, 2023. doi: 10 . 1109/TVCG . 2023 . 3240003
- [20] Y. Feng, X. Wang, K. K. Wong, S. Wang, Y. Lu, M. Zhu, B. Wang, and W. Chen.Promptmagician: Interactive prompt engineering for text-to-image creation.IEEE Transactions on Visualization and Computer Graphics, 30(1):295–305, 2023. doi: 10 . 1109/TVCG . 2023 . 3327168
- [21] L. Gao, Z. Shao, Z. Luo, H. Hu, C. Turkay, and S. Chen.Transforlearn: Interactive visual tutorial for the transformer model.IEEE Transactions on Visualization and Computer Graphics, 2023. doi: 10 . 1109/TVCG . 2023 . 3327353
- [22] W. He, J. Wang, H. Guo, H.-W. Shen, and T. Peterka.Cecav-dnn: Collective ensemble comparison and visualization using deep neural networks.Visual Informatics, 4(2):109–121, 2020. doi: 10 . 1016/J . VISINF . 2020 . 04 . 004
- [23] H. Huang, Z. Zhao, M. Backes, Y. Shen, and Y. Zhang.Composite backdoor attacks against large language models.arXiv, 2023. doi: 10 . 48550/ARXIV . 2310 . 07676
- [24] P. Jiang, J. Rayan, S. P. Dow, and H. Xia.Graphologue: Exploring large language model responses with interactive diagrams.In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–20, 2023. doi: 10 . 1145/3586183 . 3606737
- [25] H. Jin, R. Chen, J. Chen, and H. Wang.Quack: Automatic jailbreaking large language models via role-playing.2023.
- [26] D. Kang, X. Li, I. Stoica, C. Guestrin, M. Zaharia, and T. Hashimoto.Exploiting programmatic behavior of llms: Dual-use through standard security attacks.arXiv, 2023. doi: 10 . 48550/ARXIV . 2302 . 05733
- [27] A. Karpathy, J. Johnson, and L. Fei-Fei.Visualizing and understanding recurrent networks.arXiv, 2015. doi: 10 . 48550/ARXIV . 1506 . 02078
- [28]H. Li, D. Guo, W. Fan, M. Xu, and Y. Song.Multi-step jailbreaking privacy attacks on chatgpt.arXiv preprint arXiv:2304.05197, 2023.
- [29]J. Li, X. Chen, E. Hovy, and D. Jurafsky.Visualizing and understanding neural models in nlp.arXiv preprint arXiv:1506.01066, 2016. doi: 10 . 18653/V1/N16-1082
- [30] X. Li, S. Liang, J. Zhang, H. Fang, A. Liu, and E.-C. Chang.Semantic mirror jailbreak: Genetic algorithm based jailbreak prompts against open-source llms.arXiv, 2024. doi: 10 . 48550/ARXIV . 2402 . 14872
- [31] Z. Li, X. Wang, W. Yang, J. Wu, Z. Zhang, Z. Liu, M. Sun, H. Zhang, and S. Liu.A unified understanding of deep nlp models for text classification.IEEE Transactions on Visualization and Computer Graphics, 28(12):4980–4994, 2022. doi: 10 . 1109/TVCG . 2022 . 3184186
- [32] P. P. Liang, Y. Lyu, G. Chhablani, N. Jain, Z. Deng, X. Wang, L.-P. Morency, and R. Salakhutdinov.Multiviz: Towards visualizing and understanding multimodal models.arXiv, 2022.
- [33] H. Liu, C. Li, Q. Wu, and Y. J. Lee.Visual instruction tuning.In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, eds., Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916. Curran Associates, Inc., 2023.
- [34] M. X. Liu, A. Sarkar, C. Negreanu, B. Zorn, J. Williams, N. Toronto, and A. D. Gordon.“what it wants me to say”: Bridging the abstraction gap between end-user programmers and code-generating large language models.In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–31, 2023. doi: 10 . 1145/3544548 . 3580817
- [35] S. Liu, Z. Li, T. Li, V. Srikumar, V. Pascucci, and P.-T. Bremer.Nlize: A perturbation-driven visual interrogation tool for analyzing and interpreting natural language inference models.IEEE Transactions on Visualization and Computer Graphics, 25(1):651–660, 2019. doi: 10 . 1109/TVCG . 2018 . 2865230
- [36] X. Liu, N. Xu, M. Chen, and C. Xiao.Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv, 2024. doi: 10 . 48550/ARXIV . 2310 . 04451
- [37] Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, and Y. Liu.Jailbreaking chatgpt via prompt engineering: An empirical study.arXiv, 2024. doi: 10 . 48550/ARXIV . 2305 . 13860
- [38] Y. Liu, Z. Wen, L. Weng, O. Woodman, Y. Yang, and W. Chen.Sprout: Authoring programming tutorials with interactive visualization of large language model generation process.arXiv, 2023. doi: 10 . 48550/ARXIV . 2312 . 01801
- [39] J. Lu, B. Pan, J. Chen, Y. Feng, J. Hu, Y. Peng, and W. Chen.Agentlens: Visual analysis for agent behaviors in llm-based autonomous systems.arXiv, 2024. doi: 10 . 48550/ARXIV . 2402 . 08995
- [40] S. M. Lundberg and S.-I. Lee.A unified approach to interpreting model predictions.Advances in neural information processing systems, 30, 2017.
- [41] Y. Ma, T. Xie, J. Li, and R. Maciejewski.Explaining vulnerabilities to adversarial machine learning through visual analytics.IEEE Transactions on Visualization and Computer Graphics, 26(1):1075–1085, 2020. doi: 10 . 1109/TVCG . 2019 . 2934631
- [42] M. M. Malik, C. Heinzl, and M. E. Groeller.Comparative visualization for parameter studies of dataset series.IEEE Transactions on Visualization and Computer Graphics, 16(5):829–840, 2010. doi: 10 . 1109/TVCG . 2010 . 20
- [43] M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, et al.Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv, 2024. doi: 10 . 48550/ARXIV . 2402 . 04249
- [44] Y. Ming, S. Cao, R. Zhang, Z. Li, Y. Chen, Y. Song, and H. Qu.Understanding hidden memories of recurrent neural networks.In IEEE VAST, pp. 13–24. IEEE, 2017. doi: 10 . 1109/VAST . 2017 . 8585721
- [45] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al.Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022.
- [46] Z. Peng, X. Wang, Q. Han, J. Zhu, X. Ma, and H. Qu.Storyfier: Exploring vocabulary learning support with text generation models.In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–16, 2023. doi: 10 . 1145/3586183 . 3606786
- [47] F. Perez and I. Ribeiro.Ignore previous prompt: Attack techniques for language models.arXiv, 2022. doi: 10 . 48550/ARXIV . 2211 . 09527
- [48] X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, and P. Mittal.Visual adversarial examples jailbreak aligned large language models.Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21527–21536, Mar. 2024. doi: 10 . 1609/aaai . v38i19 . 30150
- [49] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh.Beyond accuracy: Behavioral testing of nlp models with checklist.arXiv preprint arXiv:2005.04118, 2020.
- [50] Z. Shao, S. Sun, Y. Zhao, S. Wang, Z. Wei, T. Gui, C. Turkay, and S. Chen.Visual explanation for open-domain question answering with bert.IEEE Transactions on Visualization and Computer Graphics, 2023. doi: 10 . 1109/TVCG . 2023 . 3243676
- [51] E. Shayegani, Y. Dong, and N. Abu-Ghazaleh.Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models.In The Twelfth International Conference on Learning Representations, 2024.
- [52] E. Shayegani, M. A. A. Mamun, Y. Fu, P. Zaree, Y. Dong, and N. Abu-Ghazaleh.Survey of vulnerabilities in large language models revealed by adversarial attacks.arXiv, 2023. doi: 10 . 48550/ARXIV . 2310 . 10844
- [53] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang.“do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models.arXiv, 2023. doi: 10 . 48550/ARXIV . 2308 . 03825
- [54] C. Shi, X. Wang, Q. Ge, S. Gao, X. Yang, T. Gui, Q. Zhang, X. Huang, X. Zhao, and D. Lin.Navigating the overkill in large language models.arXiv, 2024. doi: 10 . 48550/ARXIV . 2401 . 17633
- [55] K. Sparck Jones.A statistical interpretation of term specificity and its application in retrieval.Journal of documentation, 60(5):493–502, 2004. doi: 10 . 1108/00220410410560573
- [56] R. Speer, J. Chin, and C. Havasi.Conceptnet 5.5: An open multilingual graph of general knowledge.In Proceedings of the AAAI conference on artificial intelligence, vol. 31, 2017. doi: 10 . 1609/AAAI . V31I1 . 11164
- [57] T. Spinner, U. Schlegel, H. Schäfer, and M. El-Assady.explainer: A visual analytics framework for interactive and explainable machine learning.IEEE Transactions on Visualization and Computer Graphics, 26(1):1064–1074, 2020. doi: 10 . 1109/TVCG . 2019 . 2934629
- [58] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush.Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks.IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676, 2018. doi: 10 . 1109/TVCG . 2017 . 2744158
- [59] H. Strobelt, A. Webson, V. Sanh, B. Hoover, J. Beyer, H. Pfister, and A. M. Rush.Interactive and visual prompt engineering for ad-hoc task adaptation with large language models.IEEE Transactions on Visualization and Computer Graphics, 29(1):1146–1156, 2023. doi: 10 . 1109/TVCG . 2022 . 3209479
- [60] S. Suh, B. Min, S. Palani, and H. Xia.Sensecape: Enabling multilevel exploration and sensemaking with large language models.In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–18, 2023. doi: 10 . 1145/3586183 . 3606756
- [61] H. Sun, Z. Zhang, J. Deng, J. Cheng, and M. Huang.Safety assessment of chinese large language models.arXiv, 2023. doi: 10 . 48550/ARXIV . 2304 . 10436
- [62] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al.Llama: Open and efficient foundation language models.arXiv, 2023. doi: 10 . 48550/ARXIV . 2302 . 13971
- [63] L. Van der Maaten and G. Hinton.Visualizing data using t-sne.Journal of machine learning research, 9(86):2579–2605, 2008.
- [64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.Attention is all you need.Advances in neural information processing systems, 30, 2017.
- [65] X. Wang, R. Huang, Z. Jin, T. Fang, and H. Qu.Commonsensevis: Visualizing and understanding commonsense reasoning capabilities of natural language models.IEEE Transactions on Visualization and Computer Graphics, 2023. doi: 10 . 48550/ARXIV . 2307 . 12382
- [66] A. Wei, N. Haghtalab, and J. Steinhardt.Jailbroken: How does llm safety training fail?Advances in Neural Information Processing Systems, 36, 2024.
- [67] L. Weng, X. Wang, J. Lu, Y. Feng, Y. Liu, and W. Chen.Insightlens: Discovering and exploring insights from conversational contexts in large-language-model-powered data analysis.arXiv, 2024. doi: 10 . 48550/ARXIV . 2404 . 01644
- [68] L. Weng, M. Zhu, K. K. Wong, S. Liu, J. Sun, H. Zhu, D. Han, and W. Chen.Towards an understanding and explanation for mixed-initiative artificial scientific text detection.arXiv, 2023. doi: 10 . 48550/ARXIV . 2304 . 05011
- [69] J. Wexler, M. Pushkarna, T. Bolukbasi, M. Wattenberg, F. Viégas, and J. Wilson.The what-if tool: Interactive probing of machine learning models.IEEE Transactions on Visualization and Computer Graphics, 26(1):56–65, 2020. doi: 10 . 1109/TVCG . 2019 . 2934619
- [70]K. K. Wong, X. Wang, Y. Wang, J. He, R. Zhang, and H. Qu.Anchorage: Visual analysis of satisfaction in customer service videos via anchor events.IEEE Transactions on Visualization and Computer Graphics, 2023. doi: 10 . 48550/ARXIV . 2302 . 06806
K. K. Wong、X. Wang、Y. Wang、J. He、R. Zhang 和 H. Qu. Anchorage:通过主播事件对客户服务视频中的满意度进行视觉分析。IEEE 可视化与计算机图形学汇刊,2023 年。doi: 10 .48550/ARXIV的。2302 .06806 - [71]T. Xie, F. Zhou, Z. Cheng, P. Shi, L. Weng, Y. Liu, T. J. Hua, J. Zhao, Q. Liu, C. Liu, et al.Openagents: An open platform for language agents in the wild.arXiv, 2023. doi: 10 . 48550/ARXIV . 2310 . 10634
T. Xie, F. 周, Z. Cheng, P. Shi, L. Weng, Y. Liu, T. J. 华, J. Zhao, Q. Liu, C. Liu, et al. Openagents:一个开放的语言代理平台。arXiv,2023 年。doi: 10 .48550/ARXIV的。2310 .10634 - [72]W. Yang, X. Bi, Y. Lin, S. Chen, J. Zhou, and X. Sun.Watch out for your agents! investigating backdoor threats to llm-based agents.arXiv, 2024. doi: 10 . 48550/ARXIV . 2402 . 11208
W. Yang, X. Bi, Y. Lin, S. Chen, J. 周, 和 X. Sun. 小心你的经纪人!调查对基于 LLM 的代理的后门威胁。arXiv,2024 年。doi: 10 .48550/ARXIV的。2402 .11208 - [73]C. Yeh, Y. Chen, A. Wu, C. Chen, F. Viégas, and M. Wattenberg.Attentionviz: A global view of transformer attention.IEEE Transactions on Visualization and Computer Graphics, 2023. doi: 10 . 1109/TVCG . 2023 . 3327163
C. Yeh, Y. Chen, A. Wu, C. Chen, F. Viégas, 和 M. Wattenberg.Attentionviz:变压器注意力的全局视图。IEEE 可视化与计算机图形学汇刊,2023 年。doi: 10 .1109/TVCG .2023 .3327163 - [74]J. Yu, X. Lin, and X. Xing.Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.arXiv, 2023. doi: 10 . 48550/ARXIV . 2309 . 10253
J. Yu、X. Lin 和 X. Xing。Gptfuzzer:将大型语言模型与自动生成的越狱提示进行红队合作。arXiv,2023 年。doi: 10 .48550/ARXIV的。2309 .10253 - [75]Y. Yuan, W. Jiao, W. Wang, J.-t. Huang, P. He, S. Shi, and Z. Tu.Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher.arXiv, 2024. doi: 10 . 48550/ARXIV . 2308 . 06463
Y. Yuan, W. Jiao, W. Wang, J.-t. Huang, P. He, S. Shi, and Z. Tu. Gpt-4 太聪明了,不安全:通过密码与 llms 进行秘密聊天。arXiv,2024 年。doi: 10 .48550/ARXIV的。2308 .06463 - [76]Y. Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, and W. Shi.How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms.arXiv, 2024. doi: 10 . 48550/ARXIV . 2401 . 06373
Y. Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, 和 W. Shi.johnny 如何说服 llms 越狱:重新思考通过人性化 llms 来挑战 ai 安全的说服力。arXiv,2024 年。doi: 10 .48550/ARXIV的。2401 .06373 - [77]T. Y. Zhuo, Y. Huang, C. Chen, and Z. Xing.Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity.arXiv, pp. 12–2, 2023. doi: 10 . 48550/ARXIV . 2301 . 12867
T. Y. Zhuo, Y. Huang, C. Chen, 和 Z. Xing.通过越狱的红队 chatgpt:偏差、稳健性、可靠性和毒性。arXiv,第 12-2 页,2023 年。doi: 10 .48550/ARXIV的。2301 .12867 - [78]A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson.Universal and transferable adversarial attacks on aligned language models.arXiv, 2023. doi: 10 . 48550/ARXIV . 2307 . 15043
A. Zou、Z. Wang、J. Z. Kolter 和 M. Fredrikson。对一致语言模型的通用和可转移的对抗性攻击。arXiv,2023 年。doi: 10 .48550/ARXIV .2307 .15043
转载请注明:Visual Analysis of Jailbreak Attacks Against Large Language Models | CTF导航