Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions


I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite impressed by initial results. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the actions but also formulates them as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even larger margin.

The experiment followed model usage recommendations from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5-0.7 (0.6 was used). You can find further evaluation details here.
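
For illustration, here is a minimal sketch of these sampling settings against an OpenAI-compatible endpoint; the model name, base URL, and prompt content are assumptions, not the exact configuration used in the experiment:

```python
# Minimal sketch of the recommended sampling setup; endpoint and model
# name are assumptions (any OpenAI-compatible DeepSeek-R1 deployment works).
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="sk-...")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    # No system message and no few-shot examples, per the usage
    # recommendations; everything goes into a single user message.
    messages=[{"role": "user", "content": "Plan the first step for the task ..."}],
    temperature=0.6,  # within the recommended 0.5-0.7 range
)
print(response.choices[0].message.content)
```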

Approach

DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By allowing the model to generate actions as Python code, it can flexibly interact with environments through code execution.

Tools are implemented as Python code that is included directly in the prompt. This can be a simple function definition or a module of a larger package - any valid Python code. The model then generates code actions that call these tools.
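
As an illustration, here is what such a prompt-embedded tool might look like; the function `fetch_page` and its implementation are hypothetical, not taken from the experiment:

```python
# Hypothetical tool whose source code is pasted directly into the prompt.
# The model simply sees this definition and can emit code actions that
# call fetch_page(...).
import urllib.request

def fetch_page(url: str, max_chars: int = 2000) -> str:
    """Fetch a web page and return its (truncated) raw content."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")[:max_chars]
```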

Results from executing these actions are fed back to the model as follow-up messages, driving the next steps until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment.
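
A minimal sketch of such a loop, assuming a `generate` callable that wraps the chat model; the helper names and the fenced-code convention for extracting code actions are assumptions for illustration:

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built this way to avoid literal fences inside this snippet
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def extract_code(reply: str) -> str | None:
    """Pull the first fenced Python block out of a model reply, if any."""
    match = CODE_RE.search(reply)
    return match.group(1) if match else None

def execute(code: str) -> str:
    """Run a code action and capture what it prints. No sandboxing here;
    a real setup should isolate execution."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as exc:
        buf.write(f"Error: {exc!r}")
    return buf.getvalue()

def agent_loop(task: str, generate, max_steps: int = 10) -> str:
    """Iterative coding loop: the model emits code actions, and execution
    results are fed back as follow-up messages."""
    messages = [{"role": "user", "content": task}]
    reply = ""
    for _ in range(max_steps):
        reply = generate(messages)  # hypothetical wrapper around the chat model
        messages.append({"role": "assistant", "content": reply})
        code = extract_code(reply)
        if code is None:  # no code action -> treat the reply as the final answer
            return reply
        messages.append(
            {"role": "user", "content": f"Execution result:\n{execute(code)}"}
        )
    return reply
```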

Conversations

DeepSeek-R1 is used as a chat model in my experiment, where the model autonomously pulls additional context from its environment by using tools, e.g. by querying a search engine or fetching data from web pages. This drives the conversation with the environment that continues until a final answer is reached.
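
For example, a code action pulling extra context might look like the following; `web_search` is a hypothetical tool assumed to be defined in the prompt, not a real API:

```python
# Illustrative code action the model might emit to pull extra context.
# web_search is a hypothetical prompt-defined tool.
results = web_search("GAIA benchmark validation split")
for hit in results[:3]:
    print(hit["title"], hit["url"])
```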

In contrast, o1 models are known to perform poorly when used as chat models, i.e. they don't attempt to pull context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.

Initially, I also tried a full-context-in-a-single-prompt approach at each step (with results from previous steps included), but this led to significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% performance.
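
A rough sketch of the difference between the two strategies, with `generate` again a hypothetical wrapper around the chat model:

```python
# 1) Full context in a single prompt: rebuild one big prompt every step.
def single_prompt_step(generate, task: str, history: list[str]) -> str:
    prompt = task + "\n\nResults so far:\n" + "\n".join(history)
    return generate([{"role": "user", "content": prompt}])

# 2) Conversational: keep the growing message list and append each
#    execution result as a follow-up user message (the approach that
#    reached 65.6% on the GAIA subset).
def conversational_step(generate, messages: list[dict], result: str) -> str:
    messages.append({"role": "user", "content": f"Execution result:\n{result}"})
    return generate(messages)
```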

This raises an interesting question about the claim that o1 isn't a chat model - perhaps this observation was more relevant to older o1 models that lacked tool use capabilities? After all, isn't tool use support a key mechanism for enabling models to pull additional context from their environment? This conversational approach certainly appears effective for DeepSeek-R1, though I still need to run similar experiments with o1 models.

Generalization

Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.

Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, be it via code actions or not, could be one option to improve efficiency.

Underthinking

I also observed the underthinking phenomenon with DeepSeek-R1. This is when a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This was a major cause of the overly long reasoning traces produced by DeepSeek-R1. This can be seen in the recorded traces that are available for download.

Future experiments

Another common application of reasoning models is to use them for planning only, while using other models for generating code actions. This could be a potential new feature of freeact, if this separation of roles proves useful for more complex tasks.
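
A hypothetical sketch of that separation of roles; this is not freeact's actual API, just an illustration of the idea, with `planner` and `coder` as hypothetical model wrappers:

```python
# Hypothetical planning/execution split: a reasoning model produces a
# plan, and a second model turns the plan into a code action.
def plan_then_act(task: str, planner, coder) -> str:
    plan = planner(
        [{"role": "user", "content": f"Plan the next step for this task:\n{task}"}]
    )
    return coder(
        [{"role": "user", "content": f"Write a Python code action for this step:\n{plan}"}]
    )
```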

I'm also curious about how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look promising.