commit 59363dc83b0b8be84fefcf4ade67bfc861d9491e Author: horaciowalters Date: Mon Feb 10 20:23:07 2025 +0200 Add Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions diff --git a/Exploring DeepSeek-R1%27s Agentic Capabilities Through Code Actions.-.md b/Exploring DeepSeek-R1%27s Agentic Capabilities Through Code Actions.-.md new file mode 100644 index 0000000..8842be8 --- /dev/null +++ b/Exploring DeepSeek-R1%27s Agentic Capabilities Through Code Actions.-.md @@ -0,0 +1,19 @@ +
I ran a fast [experiment examining](http://legalpenguin.sakura.ne.jp) how DeepSeek-R1 [carries](http://hquickonlinenews.com) out on [agentic](https://vcad.hu) tasks, [photorum.eclat-mauve.fr](http://photorum.eclat-mauve.fr/profile.php?id=208325) regardless of not [supporting tool](https://puckerupbabe.com) use natively, and I was quite amazed by [initial](http://ardenneweb.eu) results. This [experiment runs](https://transport-decedati-elvetia.ro) DeepSeek-R1 in a [single-agent](http://sada-color.maki3.net) setup, where the model not just plans the actions but also [formulates](https://jeffschoolheritagecenter.org) the actions as [executable Python](https://www.strandcafe-pahna.de) code. On a subset1 of the [GAIA recognition](https://moviecastic.com) split, DeepSeek-R1 [exceeds Claude](https://husky.biz) 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% appropriate, and other models by an even larger margin:
+
The [experiment](https://timebalkan.com) followed design use [standards](https://www.fei-nha.com) from the DeepSeek-R1 paper and the model card: Don't utilize few-shot examples, prevent including a system timely, and set the temperature level to 0.5 - 0.7 (0.6 was utilized). You can discover more examination details here.
+
Approach
+
DeepSeek-R1's strong coding capabilities allow it to function as a representative without being [explicitly trained](https://careerdevinstitute.com) for [tool usage](https://abileneguntrader.com). By [allowing](https://soundcashmusic.com) the model to create [actions](https://www.telemarketingliste.it) as Python code, it can [flexibly connect](https://www.wy881688.com) with [environments](https://elmantodelavirgendeguadalupe.com) through [code execution](https://debtcareconsulting.it).
+
Tools are [executed](https://www.luckysalesinc.com) as [Python code](https://kisahrumahtanggafans.com) that is [consisted](https://vantorreinterieur.be) of straight in the timely. This can be a [simple function](https://jpabs.org) meaning or a module of a [larger plan](https://chatgay.webcria.com.br) - any [valid Python](https://thai-o-cha.com) code. The model then [produces code](https://arlogjobs.org) [actions](https://www.tareeq-alhaq.com) that call these tools.
+
Arise from [performing](http://real-estate-investment20.com) these [actions feed](http://sandkorn.st) back to the model as [follow-up](http://static.candidatis.eu) messages, [driving](http://bjts.jyzbgl.cn3000) the next steps till a last answer is [reached](https://gitlab.bzzndata.cn). The [agent framework](http://shop-lengorgaz.tmweb.ru) is a basic [iterative coding](https://cuanhuasieuben.com) loop that [mediates](http://hertfordshirewomenshealth.co.uk) the [discussion](http://gitlab.hupp.co.kr) in between the model and its [environment](http://www.cyklo-vanis.cz).
+
Conversations
+
DeepSeek-R1 is used as [chat design](https://marvelnerds.com) in my experiment, where the model autonomously pulls additional context from its environment by utilizing tools e.g. by using a [search engine](https://roovet.com) or [fetching data](http://rftgz.net) from web pages. This drives the discussion with the environment that continues until a final answer is [reached](https://fritzjtrading.co.za).
+
On the other hand, o1 designs are known to carry out poorly when used as [chat designs](https://espresso-service.od.ua) i.e. they don't [attempt](https://gyors-roman-forditas.hu) to [pull context](https://philomati.com) during a [discussion](http://www.michaelnmarsh.com). According to the linked article, o1 [models carry](http://www.moniquemelancon.org) out best when they have the full [context](https://www.mueblesyservicioslima.com) available, [photorum.eclat-mauve.fr](http://photorum.eclat-mauve.fr/profile.php?id=208871) with clear [guidelines](https://ripplehealthcare.com) on what to do with it.
+
Initially, I likewise attempted a full [context](https://www.opklappert.nl) in a [single prompt](https://jeromefrancois.com) [technique](https://www.aquaquickeurope.com) at each step (with arise from previous [actions consisted](https://www.puretexture.com) of), however this led to substantially [lower scores](https://smelyanskylaw.com) on the GAIA subset. Switching to the [conversational technique](https://voilathemes.com) [explained](https://cupnosh.com) above, I was able to reach the reported 65.6% efficiency.
+
This raises an interesting [concern](https://www.irenemulder.nl) about the claim that o1 isn't a [chat design](https://ababtain.com.sa) - perhaps this observation was more pertinent to older o1 models that did not have [tool usage](https://myjobasia.com) [abilities](https://movie.actor)? After all, isn't tool use support a [crucial](https://empleos.contatech.org) system for [allowing models](http://www.neulandschule.com) to [pull additional](https://gitlab.keysmith.bz) [context](http://flamebook.de) from their [environment](https://ripplehealthcare.com)? This [conversational approach](https://www.banlukpongchiangmai.com) certainly [appears](https://www.scheepers.be) [reliable](http://custertownshipantrim.org) for DeepSeek-R1, though I still [require](http://artigianatogaby.altervista.org) to carry out [comparable explores](http://roundboxequity.com) o1 models.
+
Generalization
+
Although DeepSeek-R1 was mainly [trained](https://lifeofthepartynwi.com) with RL on math and coding jobs, it is [remarkable](https://www.administratiekantoor-hengelo.nl) that generalization to [agentic jobs](http://www.iway.lk) with [tool usage](https://harlandbeckfarmcottages.co.uk) via code [actions](https://completemetal.com.au) works so well. This [capability](http://www.hivlingen.se) to [generalize](http://nainital.rackons.com) to agentic jobs advises of current research study by DeepMind that [reveals](https://theneverendingstory.net) that RL generalizes whereas SFT memorizes, although generalization to [tool usage](https://www.thegioixeoto.info) wasn't [investigated](https://rockofagesglorious.live) because work.
+
Despite its [ability](http://narrenverein-langenenslingen.de) to [generalize](https://jsbandpartners.com) to tool use, DeepSeek-R1 typically produces long [thinking](http://sakurannboya.com) traces at each step, [compared](https://wadajir-tv.com) to other models in my experiments, restricting the [effectiveness](https://www.arhitectconstructii.ro) of this model in a [single-agent setup](http://www.agriturismoandalu.it). Even [easier jobs](https://git.poggerer.xyz) sometimes take a long period of time to finish. Further RL on [agentic tool](http://www.robwhitehair.com) usage, be it through code [actions](https://www.uhwchildren.com) or not, could be one option to [improve effectiveness](http://artigianatogaby.altervista.org).
+
Underthinking
+
I also observed the underthinking phenomon with DeepSeek-R1. This is when a [thinking](https://www.simets.fr) [design frequently](https://liveonstageevents.com) switches in between different reasoning thoughts without sufficiently exploring appealing courses to reach a [proper solution](http://cbrianhartinsurance.com). This was a major factor for [extremely](https://ingenierialogistica.com.pe) long [thinking traces](https://harapanmuliapalembang.sch.id) [produced](https://en.ictu.edu.vn) by DeepSeek-R1. This can be seen in the [tape-recorded](https://www.gameenthus.com) traces that are available for [download](https://pt-altraman.com).
+
Future experiments
+
Another [common application](http://yd1gse.com) of reasoning [designs](https://www.hawaiilicensedengineers.com) is to use them for [planning](https://patricktqueenan.com) just, [lovewiki.faith](https://lovewiki.faith/wiki/User:RitaMedford6162) while [utilizing](https://www.lycosa.co.uk) other designs for [producing code](http://121.4.70.43000) actions. This might be a potential new [feature](https://git.we-zone.com) of freeact, if this separation of functions proves helpful for more complex jobs.
+
I'm also curious about how thinking designs that currently support tool use (like o1, o3, ...) carry out in a single-agent setup, with and without [producing code](https://www.agderleague.no) [actions](http://gemanizm.main.jp). Recent [advancements](https://www.onefivesports.com) like [OpenAI's Deep](http://l.iv.eli.ne.s.swxzuHu.feng.ku.angn.i.ub.i.xn--.xn--.u.k37cgi.members.interq.or.jp) Research or [Hugging Face's](https://unc-uffhausen.de) [open-source](https://handhpi.com) Deep Research, which also uses code actions, look .
\ No newline at end of file