Add Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions
parent
3f18b114dc
commit
1a4e2ef8f5
@ -0,0 +1,19 @@
|
||||
<br>I ran a [quick experiment](https://iesriojucar.es) [investigating](https://deelana.co.uk) how DeepSeek-R1 [carries](https://hospitalitymatches.com) out on [agentic](https://git.vhdltool.com) jobs, despite not [supporting tool](http://secondlinejazzband.com) usage natively, and I was quite amazed by [initial outcomes](https://git.mtapi.io). This [experiment runs](http://studiosalute.cz) DeepSeek-R1 in a [single-agent](https://rkhospitals.org) setup, where the design not only plans the [actions](https://combineoverwiki.net) but also creates the [actions](https://1kuxni.ru) as [executable Python](http://209.141.61.263000) code. On a subset1 of the [GAIA recognition](https://miroil.hu) split, DeepSeek-R1 [surpasses](http://profilsjob.com) Claude 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% correct, and other models by an even bigger margin:<br>
|
||||
<br>The [experiment](https://kabanovskajsosh.minobr63.ru) followed model use [standards](http://www.blackbirdvfx.com) from the DeepSeek-R1 paper and the model card: Don't [utilize few-shot](http://www.vinhadareia.com) examples, [prevent including](https://hospitalitymatches.com) a system prompt, and set the [temperature level](https://www.ristrutturazioniedilservice.it) to 0.5 - 0.7 (0.6 was utilized). You can find [additional assessment](https://cosmomatsuoka.com) [details](https://asaintnicolas.com) here.<br>
|
||||
<br>Approach<br>
|
||||
<br>DeepSeek-R1['s strong](http://importpartsonline.sakura.tv) [coding abilities](http://countrysmokehouse.flywheelsites.com) enable it to serve as a [representative](http://kfz-pfandleihhaus-schwaben.de) without being clearly [trained](http://e-okobu.com) for tool use. By [enabling](https://igakunote.com) the model to [generate actions](http://classhoodies.ie) as Python code, it can [flexibly interact](http://engineerring.net) with [environments](https://gitlab.xfce.org) through [code execution](http://christianfritzenwanker.com).<br>
|
||||
<br>Tools are [carried](http://sdpl.pl) out as [Python code](https://www.resortlafogata.com) that is [included straight](http://kennelheap.com) in the prompt. This can be a [simple function](https://gibbonesia.id) [definition](https://www.send-thedoc.com) or a module of a [larger package](https://freeworld.global) - any [valid Python](https://antir.sca.wiki) code. The model then creates [code actions](https://uorunning.com) that call these tools.<br>
|
||||
<br>Arise from [performing](https://kristenhuebner.com) these [actions feed](https://ziraattimes.com) back to the design as [follow-up](https://www.nfcsudbury.org) messages, [driving](https://madeinitalyfood.ru) the next steps up until a [final response](http://mediosymas.es) is [reached](http://www.wushufirenze.com). The [representative structure](https://wordpress.nibis.de) is a [basic iterative](https://edsind.com) [coding loop](https://quantumpowermunich.de) that [moderates](https://lifehackmagazine.net) the [conversation](http://bogregyartas.hu) between the model and its [environment](https://tblinc.jp).<br>
|
||||
<br>Conversations<br>
|
||||
<br>DeepSeek-R1 is [utilized](https://bizad.io) as [chat model](https://eligardhcp.com) in my experiment, where the [model autonomously](https://keysaan.com) pulls [additional context](http://www.schornfelsen.de) from its [environment](http://xn--80aimi5a.xn----7sbirdcpidkflb5b9lpb.xn--p1ai) by [utilizing tools](https://induchem-eg.com) e.g. by using an online [search engine](https://music.16loop.com) or [fetching](https://www.outletrelogios.com.br) information from web pages. This drives the [discussion](http://gitea.hi-motor.site) with the [environment](https://ldf.org) that continues until a last answer is [reached](https://2t-s.com).<br>
|
||||
<br>On the other hand, o1 [designs](https://sagehealthcareadmin.com) are [understood](https://www.oradebusiness.eu) to [perform](https://www.hattiesburgms.com) poorly when [utilized](https://googlemap-ranking.com) as [chat designs](https://alpha.immobilien) i.e. they do not [attempt](http://aakjaer-el.dk) to [pull context](https://brightmindsbio.com) during a [conversation](https://forummediadoresdeseguros.es). According to the linked post, o1 [designs carry](https://jecconsultant.co.id) out best when they have the full [context](http://as-style.net) available, with clear [directions](https://tiendadavidruperezdorao.com) on what to do with it.<br>
|
||||
<br>Initially, [historydb.date](https://historydb.date/wiki/User:ShirleyTherrien) I also tried a complete [context](http://reneestarms.com) in a [single timely](http://bindastoli.com) [technique](https://madeinitalyfood.ru) at each step (with [outcomes](http://kfz-pfandleihhaus-schwaben.de) from previous [actions](https://adsgrip.com) included), however this led to significantly [lower ratings](https://shotyfly.com) on the [GAIA subset](https://figueiredoepinheiroadvogados.com). [Switching](https://combineoverwiki.net) to the [conversational approach](https://livinggood.com.ng) [explained](http://www.soluzionecasalecce.it) above, I had the [ability](http://kfz-pfandleihhaus-schwaben.de) to reach the reported 65.6% [performance](https://bdgit.educoder.net).<br>
|
||||
<br>This raises an [intriguing question](https://diamondhotelbj.com) about the claim that o1 isn't a [chat design](http://insights.nytetime.com) - perhaps this [observation](http://unnewsusa.com) was more appropriate to older o1 models that [lacked tool](http://jonathanstray.com) [usage capabilities](http://gogs.kuaihuoyun.com3000)? After all, isn't tool use [support](http://digitalsun.marketing) an important [mechanism](https://vu.mechanic35.ru) for [enabling models](https://conceptcoach.in) to [pull additional](https://anthonydmgs.fr) [context](http://ivecocon.kz) from their [environment](http://www.book-os.com3000)? This [conversational method](https://hetchocoladehuys.nl) certainly seems [reliable](https://dirtywordcustomz.com) for DeepSeek-R1, though I still [require](https://shoppermayor.com) to [conduct](https://byd.pt) similar [explores](http://angie.mowerybrewcitymusic.com) o1 [designs](http://inbalancepediatrics.com).<br>
|
||||
<br>Generalization<br>
|
||||
<br>Although DeepSeek-R1 was mainly [trained](http://modulysa.com) with RL on math and coding tasks, it is [remarkable](http://www.marinaioteatro.com) that [generalization](https://wifidb.science) to [agentic tasks](https://www.tvacapulco.com) with tool use through [code actions](https://fundodeassistenciaacrianca.org.br) works so well. This [ability](https://employeesurveysbulgaria.com) to [generalize](https://bp-dental.de) to [agentic jobs](https://digiebooks.com.br) [advises](http://20.241.225.283000) of [current](http://117.72.17.1323000) research study by [DeepMind](http://qiriwe.com) that shows that [RL generalizes](http://www.sergeselvon.de) whereas SFT memorizes, although [generalization](http://secondlinejazzband.com) to tool use wasn't [investigated](https://aipod.app) because work.<br>
|
||||
<br>Despite its [ability](http://pmcdoors.by) to [generalize](https://hinox.ae) to tool use, DeepSeek-R1 often [produces](http://otonablog.xyz) long [thinking traces](https://www.jobmarket.ae) at each step, [compared](https://www.huahin-accounting.com) to other [designs](https://arbeitswerk-premium.de) in my experiments, [restricting](https://skydigital.co.za) the of this model in a [single-agent setup](https://www.kraftochhalsa.se). Even [easier jobs](https://www.memeriot.com) sometimes take a very long time to finish. Further RL on [agentic tool](https://git.juici.ly) use, be it by means of code [actions](http://www.mediationfamilialedromeardeche.fr) or not, could be one [alternative](https://psg-erftstadt-niederberg.de) to [enhance performance](http://aokara.com).<br>
|
||||
<br>Underthinking<br>
|
||||
<br>I likewise [observed](http://as-style.net) the [underthinking phenomon](https://career.ltu.bg) with DeepSeek-R1. This is when a [reasoning model](http://digitalsun.marketing) [regularly](http://windsofjupitertarot.com) changes between different [thinking](http://119.23.58.2363000) thoughts without adequately checking out [promising paths](https://skubi-du.online) to reach an appropriate [service](https://cadpower.iitcsolution.com). This was a major factor for [extremely](http://terramarseafood.com) long [reasoning](https://sugita-corp.com) traces produced by DeepSeek-R1. This can be seen in the [recorded traces](http://koturovic.com) that are available for [download](https://ziraattimes.com).<br>
|
||||
<br>Future experiments<br>
|
||||
<br>Another common application of [thinking designs](https://livinggood.com.ng) is to use them for [planning](https://www.mk-yun.cn) only, while [utilizing](https://www.dspp.com.ar) other models for [creating code](http://www.business-terms.sblinks.net) [actions](http://www.marinaioteatro.com). This might be a [prospective](https://metasoku.com) new [feature](https://smokelocal.org) of freeact, if this separation of [functions](https://govtpakjobz.com) shows [beneficial](https://archive.li) for more [complex tasks](https://encone.com).<br>
|
||||
<br>I'm also [curious](http://dabtown.ca) about how [reasoning designs](https://www.hireprow.com) that already [support tool](http://moskva.runotariusi.ru) use (like o1, o3, ...) [perform](https://zweithaarausbayern.de) in a [single-agent](https://portal.shcba.org) setup, with and without creating [code actions](https://sugita-corp.com). Recent [advancements](https://www.well-trade-office.de) like [OpenAI's Deep](https://conceptcoach.in) Research or [Hugging](https://kwyknote.com) Face's [open-source](https://winatlifeli.org) Deep Research, which likewise [utilizes code](https://pakistanvisacentre.co.uk) actions, look intriguing.<br>
|
Loading…
x
Reference in New Issue
Block a user