Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Abstract: Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
Q1: This paper tackles **the safety-alignment problem of agentic language models (Agentic LLMs) in multi-step tool-use settings**. Specifically, it identifies and targets the following core challenges:

## 1. Failure of traditional safety mechanisms in agentic settings

Existing alignment methods are optimized mainly for static text generation and single-turn task completion, and face fundamental limitations in agentic environments:

- **Sequential decision risk**: agents must plan, call tools, and execute long-horizon actions, where a single misstep (e.g., accessing sensitive files or entering credentials) can cause irreversible real-world harm
- **Adversarial tool feedback**: tool-mediated prompt injection and adversarial environment states can hijack the agent's execution flow
- **Overconfident intermediate reasoning**: long-horizon reasoning traces often omit explicit checks for safety, grounding, or irreversibility, so the agent takes unsafe actions despite extensive deliberation

## 2. Safety-discrimination deficits of scalar reward signals

Traditional outcome-only scalar rewards compress multi-step safety decisions into a single terminal signal and **cannot capture critical trajectory-level safety distinctions**, for example:

- Early refusal vs. late abort after unsafe progress
- Following an injected instruction and then giving up vs. immediately recognizing and refusing the malicious instruction

## 3. Particular vulnerability of small language models (SLMs)

Compared with frontier-scale models, small models are often preferred under cost, latency, and privacy constraints, but they operate with tight context budgets and compressed world models, making them more susceptible to:

- Anomalous tool feedback
- Adversarial instructions
- Cascading failures

## 4. The safety-utility trade-off

Existing methods tend toward either **over-conservatism** (over-refusal: wrongly rejecting benign tasks) or **over-compliance** (following harmful requests), and lack a fine-grained mechanism for deciding, under uncertainty, when to act, when to verify, and when to abstain.

---

**How MOSAIC addresses these challenges**

To meet the challenges above, the paper proposes the **MOSAIC** framework, which restructures agentic reasoning and training as follows:

- **An explicit safety-decision loop**: inference is structured as a **Plan → Check → Act/Refuse** loop, making the safety check and refusal (a refusal tool) first-class, learnable actions
- **Preference-based reinforcement learning**: pairwise trajectory comparisons, rather than scalar rewards, capture temporal safety distinctions (e.g., early refusal is preferred over late abort)
- **Selective compute allocation**: a learned gating mechanism invokes explicit safety checks only at critical steps, balancing safety against token efficiency

The framework aims to let agents learn, without trajectory-level safety labels, when to act and when to refuse through preference optimization, guarding against irreversible harm in multi-step tool use while preserving utility on benign tasks.
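The Plan → Check → Act/Refuse loop can be illustrated with a minimal control-flow sketch. This is not the paper's implementation; the function names (`planner`, `safety_check`, `executor`) and the `"REFUSE"`/`"DONE"` sentinels are hypothetical stand-ins for the model's policy, its explicit safety-check action, and the tool environment:

```python
from typing import Callable

def run_agent(task: str,
              planner: Callable[[str], str],
              safety_check: Callable[[str], bool],
              executor: Callable[[str], str],
              max_steps: int = 5) -> str:
    """Minimal plan -> check -> act/refuse loop (illustrative sketch).

    Refusal is a first-class action: a plan that fails the explicit
    safety check terminates the episode instead of being executed.
    """
    observation = task
    for _ in range(max_steps):
        plan = planner(observation)          # propose the next tool call
        if not safety_check(plan):
            return "REFUSE"                  # explicit refusal action
        observation = executor(plan)         # act only after the check passes
        if observation == "DONE":
            return "DONE"
    return "DONE"

# A plan flagged as unsafe is refused before any tool is called;
# a benign plan passes the check and is executed.
refused = run_agent("task", lambda o: "read_credentials",
                    lambda p: False, lambda p: "DONE")
completed = run_agent("task", lambda o: "list_files",
                      lambda p: True, lambda p: "DONE")
```

The point of the structure is ordering: the check gates every action, so an unsafe step is blocked before it can cause irreversible harm, rather than being judged only at the end of the trajectory.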
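The pairwise trajectory comparison can be sketched with a standard Bradley-Terry-style preference loss, where the probability that one trajectory is preferred over another is a sigmoid of their score difference. This is a generic illustration of preference-based training, not MOSAIC's exact objective; the numeric scores are made-up placeholders for a learned trajectory score:

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred trajectory wins under a
    Bradley-Terry model: P(preferred > rejected) = sigmoid(s_p - s_r)."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical scores: a trajectory that refuses early should be ranked
# above one that aborts only after making unsafe progress.
early_refusal_score = 2.0
late_abort_score = 0.5

loss = pairwise_preference_loss(early_refusal_score, late_abort_score)
```

Because the loss depends only on the *relative* ordering of whole trajectories, it can encode distinctions like "early refusal beats late abort" that a single terminal scalar reward collapses away.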