A comprehensive guide to understanding Attacks, Defenses and Red-Teaming for Large Language Models (LLMs).
Title | Link |
---|---|
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | Link |
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study | Link |
ChatGPT Jailbreak Reddit | Link |
Anomalous Tokens | Link |
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition | Link |
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks | Link |
Jailbroken: How Does LLM Safety Training Fail? | Link |
Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks | Link |
Adversarial Prompting in LLMs | Link |
Exploiting Novel GPT-4 APIs | Link |
Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content | Link |
Using Hallucinations to Bypass GPT4's Filter | Link |
Title | Link |
---|---|
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily | Link |
FLIRT: Feedback Loop In-context Red Teaming | Link |
Jailbreaking Black Box Large Language Models in Twenty Queries | Link |
Red Teaming Language Models with Language Models | Link |
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically | Link |
Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks | Link |
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | Link |
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Link |
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts | Link |
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Link |
Prompt Injection attack against LLM-integrated Applications | Link |
Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction | Link |
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Link |
Query-Efficient Black-Box Red Teaming via Bayesian Optimization | Link |
Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs | Link |
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models | Link |
Title | Link |
---|---|
Universal and Transferable Adversarial Attacks on Aligned Language Models | Link |
ACG: Accelerated Coordinate Gradient | Link |
PAL: Proxy-Guided Black-Box Attack on Large Language Models | Link |
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Link |
Open Sesame! Universal Black Box Jailbreaking of Large Language Models | Link |
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks | Link |
MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots | Link |
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models | Link |
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Link |
Universal Adversarial Triggers Are Not Universal | Link |
Title | Link |
---|---|
Scalable Extraction of Training Data from (Production) Language Models | Link |
Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Link |
Extracting Training Data from Large Language Models | Link |
Bag of Tricks for Training Data Extraction from Language Models | Link |
Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning | Link |
Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration | Link |
Membership Inference Attacks against Language Models via Neighbourhood Comparison | Link |
Title | Link |
---|---|
A Methodology for Formalizing Model-Inversion Attacks | Link |
SoK: Model Inversion Attack Landscape: Taxonomy, Challenges, and Future Roadmap | Link |
Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures | Link |
Model Leeching: An Extraction Attack Targeting LLMs | Link |
Killing One Bird with Two Stones: Model Extraction and Attribute Inference Attacks against BERT-based APIs | Link |
Model Extraction and Adversarial Transferability, Your BERT is Vulnerable! | Link |
Title | Link |
---|---|
Language Model Inversion | Link |
Effective Prompt Extraction from Language Models | Link |
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Link |
Title | Link |
---|---|
Text Embedding Inversion Security for Multilingual Language Models | Link |
Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence | Link |
Text Embeddings Reveal (Almost) As Much As Text | Link |
Title | Link |
---|---|
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Link |
What Was Your Prompt? A Remote Keylogging Attack on AI Assistants | Link |
Privacy Side Channels in Machine Learning Systems | Link |
Stealing Part of a Production Language Model | Link |
Logits of API-Protected LLMs Leak Proprietary Information | Link |
Title | Link |
---|---|
Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection | Link |
Adversarial Demonstration Attacks on Large Language Models | Link |
Poisoning Web-Scale Training Datasets is Practical | Link |
Universal Vulnerabilities in Large Language Models: In-context Learning Backdoor Attacks | Link |
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models | Link |
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Link |
Many-shot Jailbreaking | Link |
Title | Link |
---|---|
Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment | Link |
Test-Time Backdoor Attacks on Multimodal Large Language Models | Link |
Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering | Link |
Weak-to-Strong Jailbreaking on Large Language Models | Link |
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Link |
Title | Link |
---|---|
Fast Adversarial Attacks on Language Models In One GPU Minute | Link |
Title | Link |
---|---|
Training-free Lexical Backdoor Attacks on Language Models | Link |
Title | Link |
---|---|
Universal Jailbreak Backdoors from Poisoned Human Feedback | Link |
Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data | Link |
Title | Link |
---|---|
Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models | Link |
On the Exploitability of Instruction Tuning | Link |
Poisoning Language Models During Instruction Tuning | Link |
Learning to Poison Large Language Models During Instruction Tuning | Link |
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection | Link |
Title | Link |
---|---|
The Philosopher's Stone: Trojaning Plugins of Large Language Models | Link |
Privacy Backdoors: Stealing Data with Corrupted Pretrained Models | Link |
Title | Link |
---|---|
Removing RLHF Protections in GPT-4 via Fine-Tuning | Link |
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | Link |
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Link |
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | Link |
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Link |
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | Link |
Large Language Model Unlearning | Link |
Title | Link |
---|---|
Gradient-Based Language Model Red Teaming | Link |
Red Teaming Language Models with Language Models | Link |
Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia | Link |
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Link |
RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning | Link |
Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks | Link |
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability | Link |
Automatically Auditing Large Language Models via Discrete Optimization | Link |
Automatic and Universal Prompt Injection Attacks against Large Language Models | Link |
Unveiling the Implicit Toxicity in Large Language Models | Link |
Hijacking Large Language Models via Adversarial In-Context Learning | Link |
Boosting Jailbreak Attack with Momentum | Link |
Title | Link |
---|---|
SoK: Prompt Hacking of Large Language Models | Link |
Title | Link |
---|---|
Red-Teaming for Generative AI: Silver Bullet or Security Theater? | Link |
If you like our work, please consider citing it. If you would like to add your work to our taxonomy, please open a pull request.
@article{verma2024operationalizing,
  title={Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)},
  author={Verma, Apurv and Krishna, Satyapriya and Gehrmann, Sebastian and Seshadri, Madhavan and Pradhan, Anu and Ault, Tom and Barrett, Leslie and Rabinowitz, David and Doucette, John and Phan, NhatHai},
  journal={arXiv preprint arXiv:2407.14937},
  year={2024}
}