Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions - Goolink_issue_system

I ran a fast experiment investigating how DeepSeek-R1 carries out on agentic tasks, regardless of not supporting tool usage natively, and I was rather impressed by preliminary results. This experiment runs DeepSeek-R1 in a single-agent setup, where the design not only prepares the actions however also develops the actions as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 surpasses Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% proper, and other designs by an even bigger margin:

The experiment followed model usage standards from the DeepSeek-R1 paper and the design card: Don't use few-shot examples, prevent adding a system timely, clashofcryptos.trade and set the temperature to 0.5 - 0.7 (0.6 was used). You can find more examination details here.

Approach

DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By enabling the design to create actions as Python code, it can flexibly communicate with environments through code execution.

Tools are executed as Python code that is consisted of straight in the prompt. This can be a basic function meaning or a module of a larger plan - any valid Python code. The design then generates code actions that call these tools.

Arise from executing these actions feed back to the model as follow-up messages, driving the next steps till a final answer is reached. The representative structure is a basic iterative coding loop that mediates the conversation in between the design and its environment.

Conversations

DeepSeek-R1 is used as chat model in my experiment, where the model autonomously pulls extra context from its environment by utilizing tools e.g. by utilizing an online search engine or bring information from websites. This drives the conversation with the environment that continues till a final response is reached.

On the other hand, o1 designs are known to perform inadequately when used as chat models i.e. they don't try to pull context throughout a conversation. According to the linked post, o1 models perform best when they have the full context available, with clear instructions on what to do with it.

Initially, valetinowiki.racing I likewise attempted a full context in a single prompt approach at each step (with results from previous steps consisted of), however this caused considerably lower scores on the GAIA subset. Switching to the conversational approach explained above, setiathome.berkeley.edu I was able to reach the reported 65.6% efficiency.

This raises an intriguing concern about the claim that o1 isn't a chat design - possibly this observation was more pertinent to older o1 models that did not have tool use capabilities? After all, utahsyardsale.com isn't tool usage support an important system for allowing designs to pull additional context from their environment? This conversational method certainly appears effective for DeepSeek-R1, though I still require to carry out similar explores o1 designs.

Generalization

Although DeepSeek-R1 was mainly trained with RL on mathematics and coding jobs, it is remarkable that generalization to agentic tasks with tool usage via code actions works so well. This ability to generalize to agentic jobs reminds of recent research study by DeepMind that shows that RL generalizes whereas SFT remembers, although generalization to tool use wasn't examined because work.

Despite its ability to generalize to tool usage, dokuwiki.stream DeepSeek-R1 frequently produces long thinking traces at each action, compared to other designs in my experiments, restricting the usefulness of this model in a single-agent setup. Even simpler jobs often take a very long time to complete. Further RL on agentic tool use, be it through code actions or not, could be one option to enhance effectiveness.

Underthinking

I also observed the underthinking phenomon with DeepSeek-R1. This is when a reasoning model often switches between different thinking thoughts without sufficiently checking out promising courses to reach a right service. This was a major reason for excessively long reasoning traces produced by DeepSeek-R1. This can be seen in the taped traces that are available for download.

Future experiments

Another typical application of reasoning models is to utilize them for preparing only, while utilizing other models for creating code actions. This could be a prospective new feature of freeact, if this separation of functions proves helpful for more complex tasks.

I'm also curious about how reasoning designs that already support tool use (like o1, o3, ...) carry out in a single-agent setup, with and without producing code actions. Recent developments like OpenAI's Deep Research or Hugging Deep Research, which likewise uses code actions, look interesting.