Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Abstract: Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
Q1: This paper tackles **the safety-alignment problem of agentic language models (Agentic LLMs) in multi-step tool-use settings**. Specifically, it identifies and targets the following core challenges:

## 1. Failure of traditional safety mechanisms in agentic settings

Existing alignment methods are optimized mainly for static text generation and single-turn task completion, and face fundamental limitations in agentic environments:

- **Sequential decision risk**: agents must plan, call tools, and execute long-horizon actions, where a single misstep (e.g., accessing sensitive files or entering credentials) can cause irreversible real-world harm
- **Adversarial tool feedback**: tool-mediated prompt injection and adversarial environment states can hijack the agent's execution flow
- **Overconfident intermediate reasoning**: long-horizon reasoning traces often omit explicit checks for safety, grounding, or irreversibility, so the agent takes unsafe actions despite extensive deliberation

## 2. Safety-discrimination deficits of scalar reward signals

Traditional outcome-only scalar rewards compress multi-step safety decisions into a single terminal signal and **cannot capture critical trajectory-level safety distinctions**, for example:

- Early refusal vs. late abort after unsafe progress
- Following an injected instruction and then giving up vs. immediately recognizing and refusing the malicious instruction

## 3. Particular vulnerability of small language models (SLMs)

Compared with frontier-scale models, small models are often preferred under cost, latency, and privacy constraints, but they operate with tight context budgets and compressed world models, making them more susceptible to:

- Anomalous tool feedback
- Adversarial instructions
- Cascading failures

## 4. The safety-utility trade-off

Existing methods tend toward either **over-conservatism** (over-refusal: wrongly rejecting benign tasks) or **over-compliance** (following harmful requests), and lack a fine-grained mechanism for deciding, under uncertainty, when to act, when to verify, and when to abstain.

---

**How MOSAIC addresses these challenges**

To meet the challenges above, the paper proposes the **MOSAIC** framework, which restructures agentic reasoning and training as follows:

- **An explicit safety-decision loop**: inference is structured as a **Plan → Check → Act/Refuse** loop, making the safety check and refusal (a refusal tool) first-class, learnable actions
- **Preference-based reinforcement learning**: pairwise trajectory comparisons, rather than scalar rewards, capture temporal safety distinctions (e.g., early refusal is preferred over late abort)
- **Selective compute allocation**: a learned gating mechanism invokes explicit safety checks only at critical steps, balancing safety against token efficiency

The framework aims to let agents learn, without trajectory-level safety labels, when to act and when to refuse through preference optimization, guarding against irreversible harm in multi-step tool use while preserving utility on benign tasks.
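The Plan → Check → Act/Refuse loop can be illustrated with a minimal control-flow sketch. This is not the paper's implementation; the function names (`planner`, `safety_check`, `executor`) and the `"REFUSE"`/`"DONE"` sentinels are hypothetical stand-ins for the model's policy, its explicit safety-check action, and the tool environment:

```python
from typing import Callable

def run_agent(task: str,
              planner: Callable[[str], str],
              safety_check: Callable[[str], bool],
              executor: Callable[[str], str],
              max_steps: int = 5) -> str:
    """Minimal plan -> check -> act/refuse loop (illustrative sketch).

    Refusal is a first-class action: a plan that fails the explicit
    safety check terminates the episode instead of being executed.
    """
    observation = task
    for _ in range(max_steps):
        plan = planner(observation)          # propose the next tool call
        if not safety_check(plan):
            return "REFUSE"                  # explicit refusal action
        observation = executor(plan)         # act only after the check passes
        if observation == "DONE":
            return "DONE"
    return "DONE"

# A plan flagged as unsafe is refused before any tool is called;
# a benign plan passes the check and is executed.
refused = run_agent("task", lambda o: "read_credentials",
                    lambda p: False, lambda p: "DONE")
completed = run_agent("task", lambda o: "list_files",
                      lambda p: True, lambda p: "DONE")
```

The point of the structure is ordering: the check gates every action, so an unsafe step is blocked before it can cause irreversible harm, rather than being judged only at the end of the trajectory.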
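The pairwise trajectory comparison can be sketched with a standard Bradley-Terry-style preference loss, where the probability that one trajectory is preferred over another is a sigmoid of their score difference. This is a generic illustration of preference-based training, not MOSAIC's exact objective; the numeric scores are made-up placeholders for a learned trajectory score:

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred trajectory wins under a
    Bradley-Terry model: P(preferred > rejected) = sigmoid(s_p - s_r)."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical scores: a trajectory that refuses early should be ranked
# above one that aborts only after making unsafe progress.
early_refusal_score = 2.0
late_abort_score = 0.5

loss = pairwise_preference_loss(early_refusal_score, late_abort_score)
```

Because the loss depends only on the *relative* ordering of whole trajectories, it can encode distinctions like "early refusal beats late abort" that a single terminal scalar reward collapses away.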