Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions

I ran a quick experiment investigating how DeepSeek-R1 carries out on agentic tasks, regardless of not supporting tool usage natively, and I was rather amazed by initial results. This experiment runs DeepSeek-R1 in a single-agent setup, where the design not only prepares the actions but likewise develops the actions as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 surpasses Claude 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% right, and other designs by an even larger margin:

The experiment followed model use guidelines from the DeepSeek-R1 paper and the model card: Don't utilize few-shot examples, prevent including a system timely, surgiteams.com and set the temperature to 0.5 - 0.7 (0.6 was used). You can find additional evaluation details here.

Approach

DeepSeek-R1's strong coding capabilities allow it to act as a representative without being clearly trained for annunciogratis.net tool use. By permitting the design to produce actions as Python code, it can flexibly interact with environments through code execution.

Tools are executed as Python code that is included straight in the prompt. This can be a basic function definition or a module of a bigger package - any valid Python code. The model then produces code actions that call these tools.

Arise from performing these actions feed back to the design as follow-up messages, driving the next steps till a final response is reached. The agent structure is a simple iterative coding loop that moderates the discussion in between the model and its environment.

Conversations

DeepSeek-R1 is used as chat model in my experiment, where the design autonomously pulls additional context from its environment by utilizing tools e.g. by utilizing a search engine or fetching information from web pages. This drives the conversation with the environment that continues up until a final response is reached.

On the other hand, o1 models are understood to perform inadequately when utilized as chat designs i.e. they don't attempt to pull context during a discussion. According to the connected article, o1 models carry out best when they have the full context available, with clear guidelines on what to do with it.

Initially, chessdatabase.science I also attempted a complete context in a single timely technique at each step (with outcomes from previous steps included), however this resulted in substantially lower ratings on the GAIA subset. Switching to the conversational technique explained above, I had the ability to reach the reported 65.6% performance.

This raises an intriguing concern about the claim that o1 isn't a chat design - possibly this observation was more pertinent to older o1 models that lacked tool use capabilities? After all, isn't tool use support a crucial system for allowing models to pull extra context from their environment? This conversational method certainly appears efficient for DeepSeek-R1, though I still need to carry out similar explores o1 designs.

Generalization

Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is amazing that generalization to agentic tasks with tool usage by means of code actions works so well. This ability to generalize to agentic jobs advises of current research study by DeepMind that reveals that RL generalizes whereas SFT memorizes, although generalization to tool usage wasn't investigated in that work.

Despite its capability to generalize to tool use, DeepSeek-R1 frequently produces very long thinking traces at each action, compared to other designs in my experiments, the effectiveness of this model in a single-agent setup. Even easier jobs sometimes take a long time to finish. Further RL on agentic tool usage, be it by means of code actions or not, could be one choice to enhance performance.

Underthinking

I likewise observed the underthinking phenomon with DeepSeek-R1. This is when a reasoning design frequently switches between various reasoning thoughts without adequately exploring appealing paths to reach a right solution. This was a significant reason for overly long thinking traces produced by DeepSeek-R1. This can be seen in the tape-recorded traces that are available for download.

Future experiments

Another typical application of thinking models is to use them for preparing only, while using other designs for creating code actions. This might be a possible new feature of freeact, mediawiki.hcah.in if this separation of functions shows helpful for more complex tasks.

I'm also curious about how thinking models that already support tool usage (like o1, o3, ...) carry out in a single-agent setup, with and without generating code actions. Recent advancements like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look interesting.