Threats Using AI Models
Prompt Injection and Input Manipulation (Direct and Indirect)
Covers:
- OWASP LLM 01: Prompt Injection
- OWASP ML 01: Input Manipulation Attack
- MITRE ATLAS Initial Access, Privilege Escalation, and Defense Evasion
Research:
- "Goal-guided Generative Prompt Injection Attack on Large Language Models" (Zhang et al., Sep 2024)
- "Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions" (Zhang et al., Sep 2024)
- "Optimization-based Prompt Injection Attack to LLM-as-a-Judge" (Shi et al., Aug 2024)
- "Rag and Roll: An End-to-End Evaluation of Indirect Prompt Manipulations in LLM-based Application Frameworks" (De Stefano, Schönherr, Pellegrino, Aug 2024)
- "InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents" (Zhan et al., Aug 2024)
- "LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks" (Happe, Kaplan, and Cito, Aug 2024)
- "Securing the Diagnosis of Medical Imaging: An In-depth Analysis of AI-Resistant Attacks" (Biswas et al., Aug 2024)
- "On Feasibility of Intent Obfuscating Attacks" (Li and Shafto, Jul 2024)
System and Meta Prompt Extraction
Covers:
- MITRE ATLAS Discovery and Exfiltration
Research:
- "Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models" (Liang et al., Aug 2024)
- "Prompt Leakage effect and defense strategies for multi-turn LLM interactions" (Agarwal et al., Jul 2024)
Obtain and Develop (Software) Capabilities, Acquire Infrastructure, or Establish Accounts
Covers:
- MITRE ATLAS Resource Development
Research:
- "Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search" (Moss, Aug 2024)
Jailbreak, Cost Harvesting, or Erode ML Model Integrity
Covers:
- MITRE ATLAS Privilege Escalation, Defense Evasion, and Impact
Research:
- "Adversarial Attacks to Multi-Modal Models" (Dou et al., Sep 2024)
- "HSF: Defending against Jailbreak Attacks with Hidden State Filtering" (Qian et al., Sep 2024)
- "Injecting Undetectable Backdoors in Obfuscated Neural Networks and Language Models" (Kalavasis et al., Sep 2024)
- "Recent Advances in Attack and Defense Approaches of Large Language Models" (Cui et al., Sep 2024)
- "SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner" (Wang et al., Sep 2024)
- "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet" (Li et al., Sep 2024)
- "Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models" (Ma et al., Sep 2024)
- "Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models" (An et al., Sep 2024)
- "The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models" (Wu et al., Aug 2024)
- "Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models" (Du et al., Aug 2024)
- "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet" (Li et al., Aug 2024)
- "Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything" (Zou et al., Aug 2024)
- "A StrongREJECT for Empty Jailbreaks" (Souly et al., Aug 2024)
- "RT-Attack: Jailbreaking Text-to-Image Models via Random Token" (Gao et al., Aug 2024)
- "CAMH: Advancing Model Hijacking Attack in Machine Learning" (He et al., Aug 2024)
- "RT-Attack: Jailbreaking Text-to-Image Models via Random Token" (Gao et al., Aug 2024)
- "BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger" (Chen et al., Aug 2024)
- "Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks" (Zhao et al., Aug 2024)
- "A Survey of Trojan Attacks and Defenses to Deep Neural Networks" (Jin et al., Aug 2024)
- "MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defensesfor Vision Language Models" (Weng et al., Aug 2024)
- "Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment" (Wang and Shu, Aug 2023)
- "Resilience in Online Federated Learning: Mitigating Model-Poisoning Attacks via Partial Sharing" (Lari et al., Aug 2024)
- "EnJa: Ensemble Jailbreak on Large Language Models" (Zhang et al., Aug 2024)
- "Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?" (Bahrami, Vishwamitra, and Najafirad, Aug 2024)
- "Jailbreaking Text-to-Image Models with LLM-Based Agents" (Dong et al., Aug 2024)
- "Can LLMs be Fooled? Investigating Vulnerabilities in LLMs" (Abdali et al., Jul 2024)
- "Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models" (Lu et al., Jul 2024)
- "Vera Verto: Multimodal Hijacking Attack" (Zhang et al., Jul 2024)
Proxy ML Model (Simulations)
Covers:
- MITRE ATLAS ML Attack Staging
Verify Attack (Efficacy)
Covers:
- MITRE ATLAS ML Attack Staging
Insecure Output Handling
Covers:
- OWASP LLM 02: Insecure Output Handling
Research:
Sensitive Information Disclosure
Covers:
- OWASP LLM 06: Sensitive Information Disclosure
Research:
- "LLM-PBE: Assessing Data Privacy in Large Language Models" (Li et al., Sep 2024)
- "PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action" (Shao et al., Sep 2024)
- "Privacy-preserving Universal Adversarial Defense for Black-box Models" (Li et al., Aug 2024)
- "DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts" (Sun et al., Aug 2024)
- "Casper: Prompt Sanitization for Protecting User Privacy in Web-Based Large Language Models" (Chong et al., Aug 2024)
Insecure Plugin Design and Plugin Compromise
Covers:
- OWASP LLM 07: Insecure Plugin Design
- MITRE ATLAS Execution and Privilege Escalation
Research:
Hallucination Squatting and Phishing
Covers:
- MITRE ATLAS Initial Access
Persistence
Covers:
- MITRE ATLAS Persistence
Backdoor ML Model and Craft Adversarial Data
Covers:
- MITRE ATLAS ML Attack Staging
Research:
- "Context is the Key: Backdoor Attacks for In-Context Learning with Vision Transformers" (Abad et al. Sep 2024)
- "Concealing Backdoor Model Updates in Federated Learning by Trigger-Optimized Data Poisoning" (Zhang, Gong, and Reiter)
- "TERD: A Unified Framework for Safeguarding Diffusion Models Against Backdoors" (Mo et al., Sep 2024)
- "INK: Inheritable Natural Backdoor Attack Against Model Distillation" (Liu et al., Sep 2024)
- "Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor" (Miah and Bi, Sep 2024)
- "Rethinking Backdoor Detection Evaluation for Language Models" (Yan et al., Sep 2024)
- "Shortcuts Everywhere and Nowhere: Exploring Multi-Trigger Backdoor Attacks" (Li et al., Aug 2024)
- "Transferring Backdoors between Large Language Models by Knowledge Distillation" (Cheng et al., Aug 2024)
- "Revocable Backdoor for Deep Model Trading" (Xu et al., Aug 2024)
- "BackdoorBench: A Comprehensive Benchmark and Analysis of Backdoor Learning" (Wu et al., Jul 2024)
- "Diff-Cleanse: Identifying and Mitigating Backdoor Attacks in Diffusion Models" (Hao et al., Jul 2024)
Supply Chain Vulnerabilities and Compromise
Covers:
- OWASP LLM 05: Supply Chain Vulnerabilities
- OWASP ML 06: AI Supply Chain Attacks
- MITRE ATLAS Initial Access
Excessive Agency/Agentic Manipulation/Agentic Systems
We added 'agentic' manipulation to this subcategory.
Covers:
- OWASP LLM 08: Excessive Agency
Research:
Copyright Infringement
Covers:
Research:
Threats to AI Models
General Approaches
We added this subsection to cover research that looks broadly at AI security.
Research:
Training Data Poisoning and Simulated Publication of Poisoned Public Datasets
Covers:
- OWASP LLM 03: Training Data Poisoning
- OWASP ML 02: Data Poisoning Attack
- MITRE ATLAS Resource Development
We moved this subsection from 'Threats Using AI Models' to this section because poisoned data is a threat to the AI model itself.
Research:
Model (Mis)Interpretability
We added this subsection to cover cybersecurity issues that arise from model (mis)interpretability.
Research:
Model Collapse
Covers:
- OWASP LLM 03: Training Data Poisoning
- OWASP ML 02: Data Poisoning Attack
Research:
Model Denial of Service and Chaff Data Spamming
Covers:
- OWASP LLM 04: Model Denial of Service
- MITRE ATLAS Impact
Research:
Model Modifications
We added this subsection to cover security issues that arise from post-hoc model modifications such as fine-tuning and quantization.
Research:
- "Unveiling the Vulnerability of Private Fine-Tuning in Split-Based Frameworks for Large Language Models: A Bidirectionally Enhanced Attack" (Chen et al., Sep 2024)
- "The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs" (Chen et al., Sep 2024)
- "BadMerging: Backdoor Attacks Against Model Merging" (Zhang et al., Sep 2024)
- "Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning" (Huang et al., Sep 2024)
- "RLCP: A Reinforcement Learning-based Copyright Protection Method for Text-to-Image Diffusion Model" (Shi et al., Sep 2024)
- "Large Language Models as Carriers of Hidden Messages" (Hoscilowicz et al., Aug 2024)
- "Fight Perturbations with Perturbations: Defending Adversarial Attacks via Neuron Influence" (Chen et al., Aug 2024)
- "Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning" (Huang et al., Aug 2024)
- "Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data" (Baumgärtner et al., Aug 2024)
- "Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes" (Kumar et al., Jul 2024)
- "DeepBaR: Fault Backdoor Attack on Deep Neural Network Layers" (Martínez-Mejía et al., Jul 2024)
- "Resilience and Security of Deep Neural Networks Against Intentional and Unintentional Perturbations: Survey and Research Challenges" (Sayyed et al., Jul 2024)
Inadequate AI Alignment
Discover ML Model Family and Ontology/Model Extraction
We added model extraction to this subsection as it did not have its own category.
Covers:
Research:
Improper Error Handling
Robust Multi-Prompt and Multi-Model Attacks
Research:
LLM Data Leakage and ML Artifact Collection
Covers:
- MITRE ATLAS Exfiltration and Collection
Research:
Evade ML Model
Covers:
- MITRE ATLAS Defense Evasion and Impact
Model Theft, Data Leakage, ML-Enabled Product or Service, and API Access
Covers:
- OWASP LLM 10: Model Theft
- OWASP ML 05: Model Theft
- MITRE ATLAS Exfiltration and ML Model Access
Research:
- "Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity" (Zhu et al., Aug 2024)
Model Inversion Attack
Covers:
- OWASP ML 03: Model Inversion Attack
- MITRE ATLAS Exfiltration
Research:
Exfiltration via Cyber Means
Covers:
- MITRE ATLAS Exfiltration
Model Skewing Attack
Covers:
- OWASP ML 08: Model Skewing
Evade ML Model
Covers:
- MITRE ATLAS Initial Access
Reconnaissance
Covers:
- MITRE ATLAS Reconnaissance
Discover ML Artifacts, Data from Information Repositories and Local System, and Acquire Public ML Artifacts
Covers:
- MITRE ATLAS Resource Development, Discovery, and Collection
User Execution, Command and Scripting Interpreter
Covers:
- MITRE ATLAS Execution
Physical Model Access and Full Model Access
Covers:
- MITRE ATLAS ML Model Access
Valid Accounts
Covers:
- MITRE ATLAS Initial Access
Exploit Public Facing Application
Covers:
- MITRE ATLAS Initial Access
Threats from AI Models
Misinformation
Covers:
Overreliance on LLM Outputs and External (Social) Harms
Covers:
- OWASP LLM 09: Overreliance
- MITRE ATLAS Impact
Research:
Fake Resources and Phishing
Covers:
- MITRE ATLAS Initial Access
Research:
- "From ML to LLM: Evaluating the Robustness of Phishing Webpage Detection Models against Adversarial Attacks" (Kulkarni et al., Jul 2024)
Social Manipulation
Research:
- "PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety" (Zhang et al., Aug 2024)
Deep Fakes
Research:
Shallow Fakes
Misidentification
Private Information Used in Training
Research:
Unsecured Credentials
Covers:
- MITRE ATLAS Credential Access
AI-Generated/Augmented Exploits
We added this category to cover instances where generative AI systems are used to generate or augment cybersecurity exploits.
Research: