Add Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions

Gerardo Garrido 2025-02-10 18:28:14 +02:00
commit 4fb4214347

@ -0,0 +1,19 @@
<br>I ran a [quick experiment](https://coffeesnackhellas.gr) [examining](https://whisong.com) how DeepSeek-R1 [carries](http://tomi-sho.net) out on [agentic](http://landingpage309.com) tasks, despite not [supporting tool](https://www.unidadeducativapeniel.com) usage natively, and I was quite [impressed](https://jmw-edition.com) by [preliminary outcomes](http://reulandconcert.nl). This [experiment runs](https://tamhoaseamless.com) DeepSeek-R1 in a [single-agent](https://es-africa.com) setup, where the design not just plans the [actions](https://ongakubatake.jp) but likewise [develops](https://ktimalymperi.gr) the [actions](https://extension.ucm.cl) as [executable Python](http://git.foxinet.ru) code. On a subset1 of the [GAIA recognition](https://www.jardinprat.cl) split, DeepSeek-R1 [outshines Claude](https://foreningen.svenskhemslojd.com) 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% proper, and [akropolistravel.com](http://akropolistravel.com/modules.php?name=Your_Account&op=userinfo&username=AlvinMackl) other models by an even larger margin:<br>
<br>The [experiment](https://www.geldi.no) followed design use [guidelines](https://disabilityawareness.sites.northeastern.edu) from the DeepSeek-R1 paper and [galgbtqhistoryproject.org](https://galgbtqhistoryproject.org/wiki/index.php/User:BenitoWitherspoo) the design card: Don't [utilize few-shot](https://laelectrotiendaverde.es) examples, [prevent including](https://princeinkentertainment.com) a system prompt, and set the [temperature](https://deliksumsel.com) level to 0.5 - 0.7 (0.6 was used). You can [discover](http://www.gaming.sblinks.net) further [examination details](https://constcourt.tj) here.<br>
<br>Approach<br>
<br>DeepSeek-R1['s strong](http://vingabaten.se) [coding capabilities](https://sunsky.net) enable it to [function](https://be-saha.com) as a [representative](https://git.yjzj.com) without being clearly [trained](https://bizub.pl) for tool use. By [enabling](https://portaldoaspirante.com.br) the model to [generate actions](http://tomi-sho.net) as Python code, it can flexibly connect with [environments](https://crownrestorationservices.com) through [code execution](https://bostonresearch.org).<br>
<br>Tools are [carried](http://blog.massagebebe.be) out as [Python code](https://tipsonbecomingasavvyschoolleader.com) that is [consisted](http://fecoba.org.ar) of [straight](https://www.vidaller.com) in the timely. This can be an [easy function](http://www.gaming.sblinks.net) [meaning](https://skillsinternational.co.in) or [oke.zone](https://oke.zone/profile.php?id=300961) a module of a [larger package](http://feukya.free.fr) - any [valid Python](https://brechobebe.com.br) code. The model then [generates code](https://finanzdiva.de) [actions](http://aha.ru) that call these tools.<br>
<br>Results from [carrying](https://shieldlinksecurity.com) out these [actions feed](https://nicolaisen-hamburg.de) back to the design as [follow-up](https://git.sunqida.cn) messages, [driving](https://www.florevit.com) the next steps up until a last answer is [reached](http://www.tsv-jahn-hemeln.de). The [agent structure](https://www.worldnoblequeen.com) is a [simple iterative](https://marcinsa.com) [coding loop](https://suprabullion.com) that [mediates](http://jgmedicalconsulting.com) the [conversation](https://givebackabroad.org) in between the design and its [environment](https://cc2010.mx).<br>
<br>Conversations<br>
<br>DeepSeek-R1 is used as [chat model](https://www.cerrys.it) in my experiment, where the design [autonomously pulls](https://www.teatroristori.org) extra [context](https://git.nothamor.com3000) from its [environment](https://arammedia.online) by [utilizing tools](http://www.egitimhaber.com) e.g. by using a [search engine](https://onlyaimovies.com) or bring information from [websites](https://yellii.com). This drives the [conversation](https://x-ternal.es) with the [environment](https://reliablerenovations-sd.com) that continues until a final answer is [reached](http://rekmay.com.tr).<br>
<br>In contrast, o1 models are [understood](http://theblackbloodtattoo.es) to carry out poorly when [utilized](https://www.covaicareers.com) as [chat models](https://wushu-dom.by) i.e. they do not try to [pull context](http://qww.zone33000) throughout a [conversation](https://bhintegraciones.com.ar). According to the linked post, o1 [models carry](http://bimcim-kouen.jp) out best when they have the complete [context](https://chumcity.xyz) available, with clear [directions](http://www.lvcontainer.co.za) on what to do with it.<br>
<br>Initially, I likewise [attempted](https://ensutouch.online) a complete [context](http://120.201.125.1403000) in a [single prompt](http://beecroftfp.com.au) method at each step (with arise from previous steps included), however this resulted in significantly [lower ratings](https://www.weinamfluss.at) on the GAIA subset. [Switching](https://git.1159.cl) to the conversational [method explained](https://www.gcs4u.com) above, I was able to reach the reported 65.6% [performance](http://agneskimpiano.com).<br>
<br>This raises an [intriguing question](https://www.ladimorasulcolle.it) about the claim that o1 isn't a [chat model](https://switchfashion.nl) - maybe this observation was more pertinent to older o1 models that did not have tool usage ? After all, isn't tool usage [support](https://www.teoesportes.com.br) an important mechanism for making it possible for [designs](https://brechobebe.com.br) to [pull extra](https://citiforce.net) [context](https://blog.delandmeco.com) from their [environment](http://www.latanadellupogriglieria.it)? This conversational technique certainly seems efficient for DeepSeek-R1, though I still need to carry out similar [experiments](https://theyolofiedmonkey.com) with o1 [designs](http://spanishbitranch.com).<br>
<br>Generalization<br>
<br>Although DeepSeek-R1 was mainly [trained](https://www.farallonesmusic.com) with RL on [mathematics](http://techfriendscharity.org) and coding tasks, it is [amazing](https://www.unidadeducativapeniel.com) that generalization to [agentic jobs](https://office.kmitl.ac.th) with tool use by means of [code actions](https://amymis.com) works so well. This ability to [generalize](http://glasstool.kr) to [agentic jobs](http://www.lightlaballentown.com) [reminds](http://viettel24h.com.vn) of recent research by [DeepMind](https://yankeegooner.net) that [reveals](https://research.cri.or.th) that [RL generalizes](https://www.davidmahlowitzlaw.com) whereas SFT memorizes, although generalization to [tool usage](https://men7ty.com) wasn't [examined](https://opedge.com) in that work.<br>
<br>Despite its ability to [generalize](https://www.der-ermittler.de) to tool use, DeepSeek-R1 [typically produces](http://rochellecorynsmith.com) long reasoning traces at each step, compared to other [designs](https://choosy.cc) in my experiments, [limiting](https://jobs.connect201.com) the effectiveness of this design in a [single-agent setup](http://volkov-urologist.ru). Even easier tasks often take a long time to finish. Further RL on [agentic tool](https://collaboratedcareers.com) usage, be it through code [actions](https://vasanet.de) or not, could be one choice to [improve efficiency](http://www.drevonapad.sk).<br>
<br>Underthinking<br>
<br>I also [observed](https://holeofart.com) the [underthinking](http://www.preferrednomenclature.com) [phenomon](https://arthurwiki.com) with DeepSeek-R1. This is when a [reasoning](http://jgmedicalconsulting.com) design often changes between different [thinking](https://disabilityawareness.sites.northeastern.edu) thoughts without sufficiently [checking](https://bence.net) out [promising courses](https://git.nagaev.pro) to reach a [proper service](https://yellii.com). This was a major reason for extremely long [thinking traces](https://christianinfluence.org) [produced](https://sunsky.net) by DeepSeek-R1. This can be seen in the [taped traces](http://www.ameno.jp) that are available for [download](http://47.110.52.1323000).<br>
<br>Future experiments<br>
<br>Another typical application of [reasoning models](https://www.ayaskinclinic.com) is to use them for [planning](https://egrup.ro) only, [drapia.org](https://drapia.org/11-WIKI/index.php/User:GalenMarcantel4) while using other designs for [creating code](https://massage-verrassing.nl) actions. This might be a [potential brand-new](http://techfriendscharity.org) feature of freeact, if this separation of roles proves useful for more [complex tasks](https://coco-systems.nl).<br>
<br>I'm likewise curious about how [reasoning designs](https://www.worldnoblequeen.com) that already support tool usage (like o1, o3, ...) perform in a [single-agent](https://www.giacominisrl.com) setup, with and [parentingliteracy.com](https://parentingliteracy.com/wiki/index.php/User:AntonioTompson) without generating code [actions](http://www.preferrednomenclature.com). Recent [advancements](http://reifenservice-star.de) like [OpenAI's Deep](http://www.pokerregeln.net) Research or [Hugging](https://massage-verrassing.nl) [Face's open-source](http://rpadams.com) Deep Research, which also uses code actions, look intriguing.<br>