Tuning AI Agents to Call the Right Tools

Small AI agents can be tuned to call the right tools more often. An AWS blog post details a practical recipe that combines Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to improve tool-calling accuracy for small language models on Amazon SageMaker AI. Because the work runs as SageMaker AI training jobs, teams can concentrate on training code, data, and objectives instead of managing infrastructure.

The core idea is straightforward: teach the agent when and how to invoke the correct tool, and how to format the call so downstream workflows stay intact. According to the post, SFT relies on a high-quality dataset that mirrors the model’s intended function, which teaches the model the nuances of tool-specific language, commands, and constraints. In other words, the model learns not just what a tool is, but how to call it correctly within a workflow, including the expected parameters and sequencing. DPO complements this by bringing human preferences into training: each example pairs a preferred response with a rejected one, nudging the model toward behavior that matches the target outcome. The blog notes that DPO’s “like this, not like that” style of feedback helps the model converge on responses that reliably fulfill the goal rather than ones that are merely technically plausible.
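To make the two training signals concrete, here is a minimal sketch of what an SFT record and a DPO preference pair might look like, followed by the per-pair DPO loss. The field names, the `get_weather` and `web_search` tools, and the log-probability values in the usage note are illustrative assumptions, not the schema or numbers from the AWS post. The loss function follows the standard DPO formulation, which compares the tuned policy against a frozen reference model.

```python
import json
import math

# Hypothetical SFT record: a prompt paired with the exact tool call we want.
sft_example = {
    "prompt": "What's the weather in Paris tomorrow?",
    "completion": json.dumps(
        {"tool": "get_weather", "arguments": {"city": "Paris", "day": "tomorrow"}}
    ),
}

# Hypothetical DPO pair: same prompt, one preferred and one rejected response.
dpo_pair = {
    "prompt": sft_example["prompt"],
    "chosen": sft_example["completion"],  # "like this"
    "rejected": json.dumps(               # "not like that"
        {"tool": "web_search", "arguments": {"query": "Paris weather"}}
    ),
}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2. The loss falls below that as the policy shifts probability toward the chosen response, for example `dpo_loss(-2.0, -9.0, -5.0, -5.0)`, and rises above it when the policy prefers the rejected one.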

The methodology is practical: the training data is curated to reflect real tool interactions, and the evaluation measures tool-calling accuracy for the base model and for several fine-tuned variants. The team reports that the fine-tuned variants show measurable improvements over the base model in both tool selection and the correctness of tool invocation. The post indicates that combining SFT and DPO can reduce calls to the wrong tool and formatting errors, which in turn can shorten task completion times and lower support burdens in production environments. The accompanying example, which runs as SageMaker AI training jobs, shows how teams can move from pilot experiments to production-grade tool usage without building custom training pipelines from scratch.
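One way to turn "tool-calling accuracy" into numbers is to score each model output on three separate checks: whether it parses, whether it names the right tool, and whether the arguments match. The sketch below assumes tool calls are emitted as JSON objects with `tool` and `arguments` keys; that schema, the tool names, and the sample outputs are hypothetical and are not taken from the AWS post.

```python
import json

CHECKS = ("valid_format", "right_tool", "right_args")


def score_tool_call(raw_output, expected):
    """Return which of the three checks a single model output passes."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return dict.fromkeys(CHECKS, False)
    if not isinstance(call, dict):
        return dict.fromkeys(CHECKS, False)
    right_tool = call.get("tool") == expected["tool"]
    # Arguments only count when the tool itself is correct.
    right_args = right_tool and call.get("arguments") == expected["arguments"]
    return {"valid_format": True, "right_tool": right_tool, "right_args": right_args}


def tool_calling_accuracy(outputs, expectations):
    """Fraction of examples passing each check, over a shared eval set."""
    if not outputs or len(outputs) != len(expectations):
        raise ValueError("outputs and expectations must be non-empty and equal length")
    scores = [score_tool_call(o, e) for o, e in zip(outputs, expectations)]
    return {k: sum(s[k] for s in scores) / len(scores) for k in CHECKS}


# Hypothetical eval set and model outputs, for illustration only.
expected = [
    {"tool": "get_weather", "arguments": {"city": "Paris"}},
    {"tool": "create_ticket", "arguments": {"priority": "high"}},
]
base_outputs = [
    '{"tool": "web_search", "arguments": {"query": "Paris weather"}}',  # wrong tool
    "create_ticket(priority=high)",                                     # not JSON
]
tuned_outputs = [
    '{"tool": "get_weather", "arguments": {"city": "Paris"}}',          # fully correct
    '{"tool": "create_ticket", "arguments": {"priority": "low"}}',      # wrong argument
]

base_report = tool_calling_accuracy(base_outputs, expected)
tuned_report = tool_calling_accuracy(tuned_outputs, expected)
```

Reporting the three rates separately shows whether a variant is failing on formatting, on tool selection, or on parameters, which a single pass/fail score would hide.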

For practitioners, the article offers several concrete takeaways. First, data quality and alignment to tool usage are the limiting factors: the better the curated examples, the better the model learns when to call a tool and which one to call. Second, explicit preference design matters: if the preferences encoded for DPO don’t reflect real user workflows, the agent can drift toward acceptable but suboptimal tool interactions. Third, observability and evaluation matter: defining clear tool-calling accuracy metrics and running controlled comparisons between a base model and fine-tuned variants is essential to justify the investment. Fourth, production readiness benefits from running on SageMaker AI: the approach lets teams focus on training code rather than managing training infrastructure, which helps organizations scale automation as tool inventories grow and workflows become more complex.
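For the controlled comparison in the third takeaway, one common option is a paired bootstrap: score the base and fine-tuned models on the same evaluation examples, then resample those examples to see how stable the accuracy gain is. This is a generic statistical sketch, not the evaluation procedure from the AWS post; the function name, the resample count, and the 0/1 correctness lists are assumptions.

```python
import random


def paired_bootstrap(base_correct, tuned_correct, n_resamples=2000, seed=0):
    """Accuracy gain of tuned over base, with an approximate 95% interval.

    Both inputs are 0/1 lists scored on the SAME examples in the same order.
    """
    if not base_correct or len(base_correct) != len(tuned_correct):
        raise ValueError("inputs must be non-empty and equal length")
    n = len(base_correct)
    # Per-example difference: +1 tuned fixed it, -1 tuned broke it, 0 no change.
    diffs = [t - b for b, t in zip(base_correct, tuned_correct)]
    observed = sum(diffs) / n
    rng = random.Random(seed)  # seeded so the interval is reproducible
    samples = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lower = samples[int(0.025 * n_resamples)]
    upper = samples[int(0.975 * n_resamples) - 1]
    return observed, (lower, upper)


# Hypothetical per-example results on a ten-item eval set.
base = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
tuned = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
gain, (low, high) = paired_bootstrap(base, tuned)
```

If the interval sits clearly above zero, the gain is unlikely to be an artifact of which examples happened to land in the evaluation set. With only ten examples, as here, the interval will be wide, which is itself a signal to collect more evaluation data.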

The broader takeaway for the field is that improving tool-calling accuracy is now a tractable, repeatable line of work, not a one-off pilot. By pairing SFT with DPO, teams can turn carefully curated training data and explicit preferences into tangible gains in reliability and speed for autonomous agents. As agentic applications shift from experimental demos to dependable production components, attention to data quality, objective alignment, and clear evaluation will likely become a standard part of the engineering stack for automated decision-making and action.

The Robotics Briefing