中國科大認(rèn)知全重實驗室發(fā)布Agent-R1訓(xùn)練框架,支持自主思考與工具調(diào)用!
什么是 Agent-R1
Agent-R1是由中科大認(rèn)知智能全國重點實驗室開發(fā)的智能體強化學(xué)習(xí)訓(xùn)練框架,致力于推進強化學(xué)習(xí)與智能體技術(shù)的融合發(fā)展??蚣懿捎枚说蕉藦娀瘜W(xué)習(xí)方法,突破了依賴人工設(shè)計工作流的傳統(tǒng)智能體開發(fā)瓶頸,讓AI直接從與環(huán)境的交互中學(xué)習(xí)最優(yōu)策略,實現(xiàn)自主決策與行動。開發(fā)者只需定義特定領(lǐng)域的工具和獎勵函數(shù),即可將Agent-R1擴展到各種應(yīng)用場景,無需編寫復(fù)雜的工作流程。
背景
隨著大型語言模型(LLM)技術(shù)的快速發(fā)展,智能體(Agent)已成為人工智能領(lǐng)域的熱門話題。然而,當(dāng)前市場上大多數(shù)所謂的"智能體"實際上只是工作流系統(tǒng)——通過預(yù)定義的代碼路徑來編排LLM和工具的使用,而非真正的自主智能。正如Anthropic最新定義的那樣,真正的LLM智能體應(yīng)當(dāng)是"系統(tǒng)中的LLM能夠動態(tài)指導(dǎo)自己的處理過程和工具使用,保持對任務(wù)完成方式的控制權(quán)"。
傳統(tǒng)工作流系統(tǒng)在實際應(yīng)用中面臨著嚴(yán)重的局限性:
? 無法規(guī)劃,往往陷入死胡同
? 記憶能力有限,難以維持超過5-10分鐘的任務(wù)
? 長期行動效率低下,行動序列常因累積誤差而失敗
這些痛點揭示了AI發(fā)展的"苦澀教訓(xùn)":通過硬編碼知識和規(guī)則來構(gòu)建智能體系統(tǒng)只能帶來短期收益,長期來看必然會遇到瓶頸。而突破性進展最終將通過基于搜索和學(xué)習(xí)的計算擴展方法到來。
在這一背景下,OpenAI、Anthropic等前沿實驗室已開始研發(fā)基于強化學(xué)習(xí)訓(xùn)練的真正智能體。OpenAI的DeepResearch和Claude Sonnet 3.7在搜索和編程任務(wù)中展現(xiàn)出了令人印象深刻的能力——它們能夠規(guī)劃搜索策略,交叉引用多個來源,并基于中間反饋完成復(fù)雜的多步驟任務(wù)。
這一技術(shù)突破表明,通過將強化學(xué)習(xí)與推理能力結(jié)合,LLM可以從與環(huán)境的直接交互中學(xué)習(xí)最優(yōu)策略,無需依賴人工設(shè)計的提示詞和工作流。這些智能體不僅能夠思考,還能夠自主行動、規(guī)劃和調(diào)整策略,真正實現(xiàn)長期、復(fù)雜任務(wù)的自主完成。
Agent-R1正是在這一關(guān)鍵技術(shù)變革期應(yīng)運而生的開源框架,致力于推進強化學(xué)習(xí)與智能體技術(shù)的融合發(fā)展,讓這一前沿技術(shù)能夠被更廣泛的研究者和開發(fā)者所使用,共同推動AI真正邁向自主智能的新時代。
核心特點
多輪工具調(diào)用能力
Agent-R1采用端到端強化學(xué)習(xí)方法,訓(xùn)練智能體從完整交互軌跡中學(xué)習(xí)。這使模型能夠進行連續(xù)多輪工具調(diào)用,理解長期因果關(guān)系,將當(dāng)前行動與未來結(jié)果關(guān)聯(lián)起來,形成真正的規(guī)劃和迭代能力。
多工具協(xié)調(diào)機制
框架支持智能體學(xué)習(xí)如何協(xié)調(diào)使用多種工具共同解決復(fù)雜任務(wù)。智能體能夠靈活選擇最適合當(dāng)前情境的工具組合,形成有效的工具組合策略,而非被限制在單一工具的使用模式中。
過程獎勵系統(tǒng)
創(chuàng)新的過程獎勵機制允許對每個工具調(diào)用的有效性進行單獨評估,而非僅關(guān)注最終結(jié)果。通過獎勵歸一化技術(shù),該方法平衡了過程獎勵與結(jié)果獎勵,確保智能體在追求最終目標(biāo)的同時也注重解決問題的過程質(zhì)量。
自定義工具和環(huán)境
框架與主流LLM工具調(diào)用格式完全兼容,開發(fā)者可以輕松擴展自己的工具和應(yīng)用場景。只需定義特定領(lǐng)域的工具和獎勵函數(shù),即可將Agent-R1應(yīng)用到任何專業(yè)領(lǐng)域,無需復(fù)雜的工作流設(shè)計。
多種強化學(xué)習(xí)算法
Agent-R1 支持多種先進的強化學(xué)習(xí)算法,包括PPO、GRPO和REINFORCE++,為不同需求和場景提供靈活選擇。開發(fā)者可以根據(jù)具體應(yīng)用選擇最適合的算法,優(yōu)化智能體的學(xué)習(xí)效果。
多模態(tài)支持
最新版本實現(xiàn)了全面的多模態(tài)支持,無縫集成視覺-語言模型(VLMs),使智能體能夠處理和推理文本與視覺輸入的組合。這一功能極大擴展了智能體的應(yīng)用場景,能夠在豐富的多模態(tài)環(huán)境中進行有效交互。
案例展示
在HotpotQA上的實驗驗證顯示,即使是1.5B的小模型,在強化學(xué)習(xí)后也能自主推理并調(diào)用搜索引擎,解決復(fù)雜問題。
User
Question: The album that contains the song "Get Together" by Madonna was mainly recorded in whose home studio?
Assistant
<think>
To answer this question, we need to find out the song "Get Together" by Madonna and then determine in which home studio she recorded the album it belongs to. I will use the search tool to find information about the song "Get Together" by Madonna.
</think>
<tool_call>
{"name": "search", "arguments": {"query": "Get Together by Madonna"}}
</tool_call>
Tool
<tool_response>
{"results": ["Get Together (Madonna song) \"Get Together\" is a song by American singer Madonna from her tenth studio album \"Confessions on a Dance Floor\" (2005). Produced by Madonna and Stuart Price, the song was released as the third single from the album by Warner Bros. Records on June 6, 2006. The decision was spurred by the fact that \"Get Together\" was the third most downloaded song from the album. It was also released to coincide with the start of Madonna's Confessions Tour. Inspired by Stardust's single \"Music Sounds Better with You\", \"Get Together\" portrays an anthem-like picture with its lyrics, about the different possibilities of finding love on the dance floor.", "Keep It Together (song) \"Keep It Together\" is a song by American singer Madonna from her fourth studio album \"Like a Prayer\" (1989). It was released on January 30, 1990 by Sire Records as the sixth and final single from the album in the United States, Canada and Japan. Written and produced by Madonna and Stephen Bray, the main inspiration behind \"Keep It Together\" was Madonna's relationship with her family\u2014whom she dearly missed after her divorce from actor Sean Penn. The song was dedicated to American funk and soul band Sly and the Family Stone. The lyrics deal with the realization of how important Madonna's family has been to her life. A pop and funk song consisting of an upbeat rhythm and groove, \"Keep It Together\" features percussion, banjo and a conga.", ...]}
</tool_response>
Assistant
<think>
Based on the search results, "Get Together" is indeed a song by Madonna from her album "Confessions on a Dance Floor" (2005). This provides the name of the album and its release details.
Next, I need to find out where this album was mainly recorded.
I will now use the search tool to find information about the album "Confessions on a Dance Floor" by Madonna, specifically looking for details on its recording process.
</think>
<tool_call>
{"name": "search", "arguments": {"query": "Confessions on a Dance Floor (album)"} }
</tool_call>
Tool
<tool_response>
{"results": ["Confessions on a Dance Floor Confessions on a Dance Floor is the tenth studio album by American singer and songwriter Madonna. It was released on November 9, 2005 by Warner Bros. Records. A complete departure from her previous studio album \"American Life\" (2003), the album includes influences of 1970s and 1980s disco, as well as modern-day club music. Initially, she began working with Mirwais Ahmadza\u00ef for the album, but later felt that their collaboration was not going in the direction she desired. Madonna took her collaboration with Stuart Price who was overviewing her documentary \"I'm Going to Tell You a Secret\". The album was mainly recorded at Price's home-studio where Madonna spent most of her time during the recordings.", "Confessions (Usher album) Confessions is the fourth studio album by American singer Usher. It was released on March 23, 2004, by Arista Records. Recording sessions for the album took place during 2003 to 2004, with its production handled by his longtime collaborator Jermaine Dupri, alongside with Jimmy Jam and Terry Lewis and Lil Jon, among others. Primarily an R&B album, \"Confessions\" showcases Usher as a crooner through a mixture of ballads and up-tempos, incorporating musical genres of dance-pop, hip hop and crunk. The album's themes generated controversy about Usher's personal relationships; however, the album's primary producer Jermaine Dupri claimed the album reflects his personal story.",...]}
</tool_response>
Assistant
<think>
From the search results, we learn that "Confessions on a Dance Floor" by Madonna was indeed released in 2005, but the primary information we seek is where this album was mainly recorded. The results indicate that it was "mainly recorded at Price's home-studio".
The album "Confessions on a Dance Floor" by Madonna was mainly recorded at Stuart Price's home studio.
Therefore, the answer to the question "The album that contains the song 'Get Together' by Madonna was mainly recorded in whose home studio?" is Stuart Price's home studio.
</think>
<answer>
Stuart Price's home studio
</answer>
訓(xùn)練過程分析
以下三組圖片分別展示了使用三種主流的強化學(xué)習(xí)算法訓(xùn)練時模型性能、工具調(diào)用輪數(shù)以及最終回答長度的變化過程。
PPO
REINFORCE++
GRPO
可以觀察到,模型 (Qwen2.5-1.5B-Instruct) 在面對具有挑戰(zhàn)性的多跳問題時,有效地學(xué)會了先思考再多輪調(diào)用工具,最終提高了精確匹配(EM)分?jǐn)?shù)。雖然不同強化學(xué)習(xí)算法的效果各異,但總體趨勢保持一致。
值得注意的是,實驗揭示了一個顯著的相關(guān)性:模型性能、工具調(diào)用輪數(shù)和最終回答長度在整個訓(xùn)練過程中呈現(xiàn)一致的變化趨勢。這展示了 Scaling Law 的全新維度——一個與智能體-環(huán)境交互頻率相關(guān)的維度。隨著智能體學(xué)會通過多次工具調(diào)用更有效地與環(huán)境交互,性能也相應(yīng)提升,這表明多輪環(huán)境交互能力對智能體性能的重要性可能與傳統(tǒng)的擴展因素同等關(guān)鍵。
結(jié)語
我們正處于AI技術(shù)革命的前沿,Agent-R1不僅體現(xiàn)了一種技術(shù)框架,還彰顯了對AI未來發(fā)展的長遠(yuǎn)愿景——從被動執(zhí)行到主動思考,從固定路徑到自主探索。Agent-R1的推出,旨在讓更多的研究者和開發(fā)者能夠接觸到前沿的強化學(xué)習(xí)與智能體技術(shù),推動技術(shù)的廣泛應(yīng)用。只有通過開放合作,才能真正推動AI從模仿到創(chuàng)造,從受限到自由的飛躍。
項目地址:???https://github.com/0russwest0/Agent-R1??
本文轉(zhuǎn)載自???NLP PaperWeekly???,作者:NLP PaperWeekly
