# AAAI 2026: LLM Safety Paper Roundup
Master index: [LLM Safety Research Paper Roundup, 2026 Edition](https://blog.csdn.net/WhiffeYF/article/details/159047894)

**Statistics.** 27 papers collected from AAAI 2026 (the 40th conference, January 2026, Singapore). Main sources: the AI Alignment special track (Vol. 40, No. 44) and the main technical tracks (NLP, ML, etc.).

**Category overview**

| Category | Count |
| --- | --- |
| Jailbreak attacks | 10 |
| Safety defense & alignment | 10 |
| Safety evaluation & benchmarks | 4 |
| Privacy & data security | 2 |
| Agent safety | 1 |

## 1. Jailbreak Attacks

| id | Paper | Track | Link |
| --- | --- | --- | --- |
| 1 | MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs | AI Alignment | PDF |
| 2 | Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment | AI Alignment | PDF |
| 3 | HumorReject: Decoupling LLM Safety from Refusal Prefix via a Little Humor | AI Alignment | PDF |
| 4 | StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak | AI Alignment | PDF |
| 5 | STACK: Adversarial Attacks on LLM Safeguard Pipelines | AI Alignment | PDF |
| 6 | Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment | AI Alignment | PDF |
| 7 | Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models | AI Alignment | PDF |
| 8 | Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models | Main Track (NLP) | PDF |
| 9 | Multi-Turn Jailbreaking Large Language Models via Attention Shifting | Main Track (NLP) | PDF |
| 10 | Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs (CognitiveAttack) | Main Track (NLP) | PDF |

## 2. Safety Defense & Alignment

| id | Paper | Track | Link |
| --- | --- | --- | --- |
| 1 | AlignTree: Efficient Defense Against LLM Jailbreak Attacks | AI Alignment | PDF |
| 2 | EASE: Practical and Efficient Safety Alignment for Small Language Models | AI Alignment | PDF |
| 3 | Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training | AI Alignment | PDF |
| 4 | STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | AI Alignment | PDF |
| 5 | CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing | AI Alignment | PDF |
| 6 | WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety | Main Track (NLP) | PDF |
| 7 | DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | AI Alignment | PDF |
| 8 | Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks | AI Alignment | PDF |
| 9 | MirrorShield: Towards Dynamic Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting | Main Track (NLP) | dblp |
| 10 | AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs | Main Track (NLP) | dblp |

## 3. Safety Evaluation & Benchmarks

| id | Paper | Track | Link |
| --- | --- | --- | --- |
| 1 | Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models | AI Alignment | PDF |
| 2 | MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | AI Alignment | PDF |
| 3 | MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models | Main Track | PDF |
| 4 | Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding | AI Alignment | PDF |

## 4. Privacy & Data Security

| id | Paper | Track | Link |
| --- | --- | --- | --- |
| 1 | CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense | AI Alignment | PDF |
| 2 | Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models | AI Alignment | PDF |

## 5. Agent Safety

| id | Paper | Track | Link |
| --- | --- | --- | --- |
| 1 | Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems | AI Alignment | PDF |

## 6. Other Related Papers (alignment theory / reasoning safety / hallucination detection / interpretability, etc.)

The following papers are not direct attack or defense work, but are closely related to LLM safety.

| id | Paper | Topic | Link |
| --- | --- | --- | --- |
| 1 | DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs | Over-reasoning / overthinking | PDF |
| 2 | Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning | Reasoning safety / CoT reliability | PDF |
| 3 | Bolster Hallucination Detection via Prompt-Guided Data Augmentation | Hallucination detection | PDF |
| 4 | Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models | Hallucination / reliability | PDF |
| 5 | Silenced Biases: The Dark Side LLMs Learned to Refuse | Alignment side effects / over-refusal | PDF |
| 6 | Beyond I’m Sorry, I Can’t: Dissecting Large-Language-Model Refusal | Refusal-mechanism analysis | PDF |
| 7 | Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation | Misalignment from fine-tuning | PDF |
| 8 | AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment | Backdoor attacks | PDF |
| 9 | Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning? | Machine unlearning | PDF |
| 10 | Polarity-Aware Probing for Quantifying Latent Alignment in Language Models | Alignment interpretability | PDF |
| 11 | FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight | Flawed-reasoning detection | PDF |
| 12 | Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning | Multimodal backdoor attacks | PDF |
| 13 | Security Attacks on LLM-based Code Completion Tools | Code-tool security | PDF |
| 14 | MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control | Agent safety evaluation | PDF |
| 15 | MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies | Jailbreak attack | dblp |
| 16 | From Chaos to Cure: A Prefix Heuristics Guided Model-Agnostic Adaptive Detoxification Framework | Detoxification defense | dblp |
| 17 | An LLM-based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains | Supply-chain backdoors | AAAI 2026 |

## Notes

- The papers above were collected mainly by a full scan of the AI Alignment special track (Vol. 40, No. 44), plus keyword searches for safety-related terms over the NLP/ML main technical tracks and the dblp index (a sketch of such a dblp query appears below).
- AAAI 2026 received roughly 29,000 submissions, and safety-related papers are scattered across many tracks (NLP I–VI, ML I–XI, Application Domains, etc.), so this list may not cover every safety paper in every track.
- Paper links point to the official AAAI Press proceedings: https://ojs.aaai.org/index.php/AAAI/
- Full AI Alignment track table of contents: https://ojs.aaai.org/index.php/AAAI/issue/view/726
- AAAI 2026 full proceedings: https://aaai.org/proceeding/aaai-40-2026/
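The notes mention keyword searches against the dblp index. Below is a minimal sketch of how such a search could be automated with the public dblp publication-search API (`https://dblp.org/search/publ/api`); the keyword list, hit limit, and the helper name `search_aaai_2026` are illustrative assumptions, not the exact pipeline used to compile the tables above.

```python
# Minimal sketch: keyword search against the public dblp publication API,
# filtered client-side to AAAI 2026 entries. Keywords are assumed examples.
import requests

DBLP_API = "https://dblp.org/search/publ/api"
KEYWORDS = ["jailbreak", "safety alignment", "backdoor", "unlearning"]  # assumed

def search_aaai_2026(keyword: str, max_hits: int = 100) -> list[dict]:
    """Query dblp for `keyword` and keep only AAAI 2026 entries."""
    params = {"q": f"{keyword} AAAI", "format": "json", "h": max_hits}
    resp = requests.get(DBLP_API, params=params, timeout=30)
    resp.raise_for_status()
    # The "hit" key is absent when a query returns no matches.
    hits = resp.json()["result"]["hits"].get("hit", [])
    return [
        h["info"]
        for h in hits
        if h["info"].get("venue") == "AAAI" and h["info"].get("year") == "2026"
    ]

if __name__ == "__main__":
    for kw in KEYWORDS:
        for info in search_aaai_2026(kw):
            print(f'[{kw}] {info["title"]} -> {info.get("ee", "no link")}')
```

Results from queries like this still need manual de-duplication and topic triage (e.g., assigning each hit to one of the categories above), since keyword matching alone over- and under-selects.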