Threats Using AI Models
Prompt Injection and Input Manipulation (Direct and Indirect)
Covers:
- OWASP LLM 01: Prompt Injection
- OWASP ML 01: Input Manipulation Attack
- MITRE ATLAS Initial Access, Privilege Escalation, and Defense Evasion
Research:
- "Goal-guided Generative Prompt Injection Attack on Large Language Models" (Zhang et al., Sep 2024)
- "Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions" (Zhang et al., Sep 2024)
- "Optimization-based Prompt Injection Attack to LLM-as-a-Judge" (Shi et al., Aug 2024)
- "Rag and Roll: An End-to-End Evaluation of Indirect Prompt Manipulations in LLM-based Application Frameworks" (De Stefano, Schönherr, Pellegrino, Aug 2024)
- "InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents" (Zhan et al., Aug 2024)
- "LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks" (Happe, Kaplan, and Cito, Aug 2024)
- "Securing the Diagnosis of Medical Imaging: An In-depth Analysis of AI-Resistant Attacks" (Biswas et al., Aug 2024)
- "On Feasibility of Intent Obfuscating Attacks" (Li and Shafto, Jul 2024)
System and Meta Prompt Extraction
Covers:
- MITRE ATLAS Discovery and Exfiltration
Research:
- "Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models" (Liang et al., Aug 2024)
- "Prompt Leakage effect and defense strategies for multi-turn LLM interactions" (Agarwal et al., Jul 2024)
Obtain and Develop (Software) Capabilities, Acquire Infrastructure, or Establish Accounts
Covers:
- MITRE ATLAS Resource Development
Research:
- "Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search" (Moss, Aug 2024)
Jailbreak, Cost Harvesting, or Erode ML Model Integrity
Covers:
- MITRE ATLAS Privilege Escalation, Defense Evasion, and Impact
Research:
- "Adversarial Attacks to Multi-Modal Models" (Dou et al., Sep 2024)
- "HSF: Defending against Jailbreak Attacks with Hidden State Filtering" (Qian et al., Sep 2024)
- "Injecting Undetectable Backdoors in Obfuscated Neural Networks and Language Models" (Kalavasis et al., Sep 2024)
- "Recent Advances in Attack and Defense Approaches of Large Language Models" (Cui et al., Sep 2024)
- "SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner" (Wang et al., Sep 2024)
- "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet" (Li et al., Sep 2024)
- "Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models" (Ma et al., Sep 2024)
- "Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models" (An et al., Sep 2024)
- "The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models" (Wu et al., Aug 2024)
- "Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models" (Du et al., Aug 2024)
- "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet" (Li et al., Aug 2024)
- "Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything" (Zou et al., Aug 2024)
- "A StrongREJECT for Empty Jailbreaks" (Souly et al., Aug 2024)
- "RT-Attack: Jailbreaking Text-to-Image Models via Random Token" (Gao et al., Aug 2024)
- "CAMH: Advancing Model Hijacking Attack in Machine Learning" (He et al., Aug 2024)
- "RT-Attack: Jailbreaking Text-to-Image Models via Random Token" (Gao et al., Aug 2024)
- "BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger" (Chen et al., Aug 2024)
- "Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks" (Zhao et al., Aug 2024)
- "A Survey of Trojan Attacks and Defenses to Deep Neural Networks" (Jin et al., Aug 2024)
- "MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defensesfor Vision Language Models" (Weng et al., Aug 2024)
- "Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment" (Wang and Shu, Aug 2023)
- "Resilience in Online Federated Learning: Mitigating Model-Poisoning Attacks via Partial Sharing" (Lari et al., Aug 2024)
- "EnJa: Ensemble Jailbreak on Large Language Models" (Zhang et al., Aug 2024)
- "Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?" (Bahrami, Vishwamitra, and Najafirad, Aug 2024)
- "Jailbreaking Text-to-Image Models with LLM-Based Agents" (Dong et al., Aug 2024)
- "Can LLMs be Fooled? Investigating Vulnerabilities in LLMs" (Abdali et al., Jul 2024)
- "Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models" (Lu et al., Jul 2024)
- "Vera Verto: Multimodal Hijacking Attack" (Zhang et al., Jul 2024)
Proxy ML Model (Simulations)
Covers:
- MITRE ATLAS ML Attack Staging
Verify Attack (Efficacy)
Covers:
- MITRE ATLAS ML Attack Staging
Insecure Output Handling
Covers:
- OWASP LLM 02: Insecure Output Handling
Research:
Sensitive Information Disclosure
Covers:
- OWASP LLM 06: Sensitive Information Disclosure
Research:
- "LLM-PBE: Assessing Data Privacy in Large Language Models" (Li et al., Sep 2024)
- "PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action" (Shao et al., Sep 2024)
- "Privacy-preserving Universal Adversarial Defense for Black-box Models" (Li et al., Aug 2024)
- "DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts" (Sun et al., Aug 2024)
- "Casper: Prompt Sanitization for Protecting User Privacy in Web-Based Large Language Models" (Chong et al., Aug 2024)
Insecure Plugin Design and Plugin Compromise
Covers:
- OWASP LLM 07: Insecure Plugin Design
- MITRE ATLAS Execution and Privilege Escalation
Research:
Hallucination Squatting and Phishing
Covers:
- MITRE ATLAS Initial Access
Persistence
Covers:
- MITRE ATLAS Persistence
Backdoor ML Model and Craft Adversarial Data
Covers:
- MITRE ATLAS ML Attack Staging
Research:
- "Context is the Key: Backdoor Attacks for In-Context Learning with Vision Transformers" (Abad et al. Sep 2024)
- "Concealing Backdoor Model Updates in Federated Learning by Trigger-Optimized Data Poisoning" (Zhang, Gong, and Reiter)
- "TERD: A Unified Framework for Safeguarding Diffusion Models Against Backdoors" (Mo et al., Sep 2024)
- "INK: Inheritable Natural Backdoor Attack Against Model Distillation" (Liu et al., Sep 2024)
- "Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor" (Miah and Bi, Sep 2024)
- "Rethinking Backdoor Detection Evaluation for Language Models" (Yan et al., Sep 2024)
- "Shortcuts Everywhere and Nowhere: Exploring Multi-Trigger Backdoor Attacks" (Li et al., Aug 2024)
- "Transferring Backdoors between Large Language Models by Knowledge Distillation" (Cheng et al., Aug 2024)
- "Revocable Backdoor for Deep Model Trading" (Xu et al., Aug 2024)
- "BackdoorBench: A Comprehensive Benchmark and Analysis of Backdoor Learning" (Wu et al., Jul 2024)
- "Diff-Cleanse: Identifying and Mitigating Backdoor Attacks in Diffusion Models" (Hao et al., Jul 2024)
Supply Chain Vulnerabilities and Compromise
Covers:
- OWASP LLM 05: Supply Chain Vulnerabilities
- OWASP ML 06: AI Supply Chain Attacks
- MITRE ATLAS Initial Access
Excessive Agency/Agentic Manipulation/Agentic Systems
We added 'agentic' manipulation to this subcategory.
Covers:
- OWASP LLM 08: Excessive Agency
Research:
Copyright Infringement
Covers:
Research:
Threats to AI Models
General Approaches
We added this subsection to cover research that looks broadly at AI security.
Research:
Training Data Poisoning and Simulated Publication of Poisoned Public Datasets
Covers:
- OWASP LLM 03: Training Data Poisoning
- OWASP ML 02: Data Poisoning Attack
- MITRE ATLAS Resource Development
We moved this subsection from 'Threats Using AI Models' to this section because poisoned data is a threat to the AI model itself.
Research:
Model (Mis)Interpretability
We added this subsection to cover cybersecurity issues that arise from model (mis)interpretability.
Research:
Model Collapse
Covers:
- OWASP LLM 03: Training Data Poisoning
- OWASP ML 02: Data Poisoning Attack
Research:
Model Denial of Service and Chaff Data Spamming
Covers:
- OWASP LLM 04: Model Denial of Service
- MITRE ATLAS Impact
Research:
Model Modifications
We added this subsection to cover security issues that arise from post-hoc model modifications such as fine-tuning and quantization.
Research:
- "Unveiling the Vulnerability of Private Fine-Tuning in Split-Based Frameworks for Large Language Models: A Bidirectionally Enhanced Attack" (Chen et al., Sep 2024)
- "The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs" (Chen et al., Sep 2024)
- "BadMerging: Backdoor Attacks Against Model Merging" (Zhang et al., Sep 2024)
- "Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning" (Huang et al., Sep 2024)
- "RLCP: A Reinforcement Learning-based Copyright Protection Method for Text-to-Image Diffusion Model" (Shi et al., Sep 2024)
- "Large Language Models as Carriers of Hidden Messages" (Hoscilowicz et al., Aug 2024)
- "Fight Perturbations with Perturbations: Defending Adversarial Attacks via Neuron Influence" (Chen et al., Aug 2024)
- "Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning" (Huang et al., Aug 2024)
- "Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data" (Baumgärtner et al., Aug 2024)
- "Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes" (Kumar et al., Jul 2024)
- "DeepBaR: Fault Backdoor Attack on Deep Neural Network Layers" (Martínez-Mejía et al., Jul 2024)
- "Resilience and Security of Deep Neural Networks Against Intentional and Unintentional Perturbations: Survey and Research Challenges" (Sayyed et al., Jul 2024)
Inadequate AI Alignment
Discover ML Model Family and Ontology/Model Extraction
We added model extraction to this subsection as it did not have its own category.
Covers:
Research:
Improper Error Handling
Robust Multi-Prompt and Multi-Model Attacks
Research:
LLM Data Leakage and ML Artifact Collection
Covers:
- MITRE ATLAS Exfiltration and Collection
Research:
Evade ML Model
Covers:
- MITRE ATLAS Defense Evasion and Impact
Model Theft, Data Leakage, ML-Enabled Product or Service, and API Access
Covers:
- OWASP LLM 10: Model Theft
- OWASP ML 05: Model Theft
- MITRE ATLAS Exfiltration and ML Model Access
Research:
- "Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity" (Zhu et al., Aug 2024)
Model Inversion Attack
Covers:
- OWASP ML 03: Model Inversion Attack
- MITRE ATLAS Exfiltration
Research:
Exfiltration via Cyber Means
Covers:
- MITRE ATLAS Exfiltration
Model Skewing Attack
Covers:
- OWASP ML 08: Model Skewing
Evade ML Model
Covers:
- MITRE ATLAS Initial Access
Reconnaissance
Covers:
- MITRE ATLAS Reconnaissance
Discover ML Artifacts, Data from Information Repositories and Local System, and Acquire Public ML Artifacts
Covers:
- MITRE ATLAS Resource Development, Discovery, and Collection
User Execution, Command and Scripting Interpreter
Covers:
- MITRE ATLAS Execution
Physical Model Access and Full Model Access
Covers:
- MITRE ATLAS ML Model Access
Valid Accounts
Covers:
- MITRE ATLAS Initial Access
Exploit Public Facing Application
Covers:
- MITRE ATLAS Initial Access
Threats from AI Models
Misinformation
Covers:
Overreliance on LLM Outputs and External (Social) Harms
Covers:
- OWASP LLM 09: Overreliance
- MITRE ATLAS Impact
Research:
Fake Resources and Phishing
Covers:
- MITRE ATLAS Initial Access
Research:
- "From ML to LLM: Evaluating the Robustness of Phishing Webpage Detection Models against Adversarial Attacks" (Kulkarni et al., Jul 2024)
Social Manipulation
Research:
- "PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety" (Zhang et al., Aug 2024)
Deep Fakes
Research:
Shallow Fakes
Misidentification
Private Information Used in Training
Research:
Unsecured Credentials
Covers:
- MITRE ATLAS Credential Access
AI-Generated/Augmented Exploits
We added this category to cover instances where generative AI systems are used to generate or augment cybersecurity exploits.
Research: