SATURDAY, MAY 30, 2026
AI & Machine Learning | 2 min read

LangSmith on AWS Elevates Deep Agent Testing

By Alexander Cole

Deep agents finally get a production-grade testing regime. LangSmith on AWS provides a framework to catch hard-to-find errors early, track them in production, and iteratively improve an agent's reliability throughout its lifecycle. The AWS post, co-authored with Karan Singh, shows how to apply five evaluation patterns for deep agents, build offline evaluations using pytest and LangSmith, and configure online monitoring for production. The walkthrough follows a text-to-SQL deep agent built on Amazon Bedrock through the full development-to-production lifecycle. Amazon's Nova 2 Lite is a fast, cost-effective reasoning model available in Bedrock, with extended thinking and a 1 million-token context window, and it supports configurable budget levels (low, medium, high). The team reports that Nova 2 Lite handles instruction following, function calling, and code generation well, making it a good fit for agentic workloads like the text-to-SQL agent in the post.

The article frames an evaluation as a test for AI systems: give an input, apply grading logic to the output, and measure success. For agents, each of those pieces becomes harder to pin down, because a single bad tool call can cascade through an entire workflow. In practice, engineers can start with offline evaluations in pytest and LangSmith that reproduce failures deterministically, then layer in online monitoring to watch for drift and failures as real users interact with the system. This pairing creates a reliability-centric lifecycle that catches many issues before users are exposed to them and provides a structured path to improving agent behavior with each iteration.
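The input, grading logic, and success measure described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the AWS post: it uses only the standard library, swaps the Bedrock-backed agent for a hypothetical `stub_agent`, uses an invented `orders` table, and leaves out the LangSmith integration entirely. The grader compares query results rather than SQL text, so two differently worded queries that return the same rows both pass.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
        INSERT INTO orders (customer, total)
        VALUES ('ada', 10.0), ('ada', 15.5), ('bob', 7.0);
        """
    )
    return conn


def stub_agent(question):
    """Hypothetical stand-in for the text-to-SQL agent; a real agent would call a model."""
    canned = {
        "How many orders are there?": "SELECT COUNT(*) FROM orders",
        "What is the total spend per customer?":
            "SELECT customer, SUM(total) FROM orders GROUP BY customer",
    }
    return canned[question]


def grade(conn, generated_sql, reference_sql):
    """Grading logic: pass when both queries return the same rows, ignoring row order."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to execute counts as a failed case
    expected = conn.execute(reference_sql).fetchall()
    return sorted(got) == sorted(expected)


# Each case pairs an input question with a reference query (both invented here).
CASES = [
    ("How many orders are there?", "SELECT COUNT(id) FROM orders"),
    ("What is the total spend per customer?",
     "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"),
]


def run_eval(agent, cases):
    """Measure success as the fraction of cases the agent passes."""
    conn = make_db()
    try:
        results = [grade(conn, agent(question), ref) for question, ref in cases]
    finally:
        conn.close()
    return sum(results) / len(results)


def test_agent_meets_pass_rate():
    """pytest collects this function; the 1.0 bar is an arbitrary example threshold."""
    assert run_eval(stub_agent, CASES) == 1.0
```

Saved in a file whose name starts with `test_`, this runs under plain `pytest`. Because the database and the cases are fixed, a failure reproduces on every run, which is the property that makes offline evaluations useful for catching regressions.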

From a practitioner’s perspective, the article offers concrete, actionable guidance. Start with offline evaluations using pytest and LangSmith to capture regressions in a repeatable way, ensuring early failures don’t slip into production. Pair those tests with robust online monitoring to detect anomalies that only emerge under production load, and to trace how early steps in a multi-step plan influence downstream results. The choice of Bedrock’s Nova 2 Lite, with its extended thinking and 1M-token context window, is a pragmatic compromise: it enables deeper reasoning without exploding cost or latency, and the budget controls (low, medium, high) give teams a lever to tune performance to their workflow needs. Finally, the article reiterates a fundamental governance point for teams building agent-powered products: non-determinism and cascading tool calls demand disciplined evaluation and continuous improvement, not a one-off QA pass.
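One simple way to picture the online half of that pairing is a rolling failure-rate check over recent production runs. The sketch below is an assumption-laden illustration rather than how LangSmith implements monitoring: the class name `RollingFailureMonitor`, the window size, the threshold, and the minimum sample count are all hypothetical, and each run is reduced to a pass/fail flag produced by whatever grader a team chooses.

```python
from collections import deque


class RollingFailureMonitor:
    """Track pass/fail outcomes for the most recent runs and flag a high failure rate.

    Hypothetical helper for illustration; not part of LangSmith or AWS tooling.
    """

    def __init__(self, window=100, threshold=0.2, min_samples=10):
        self.outcomes = deque(maxlen=window)  # oldest outcomes drop off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def failure_rate(self):
        """Fraction of failed runs in the current window (0.0 when empty)."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self):
        """Alert only after enough samples, and only when the rate exceeds the threshold."""
        return (
            len(self.outcomes) >= self.min_samples
            and self.failure_rate() > self.threshold
        )

    def record(self, success):
        """Store one run's outcome and report whether an alert is warranted."""
        self.outcomes.append(bool(success))
        return self.should_alert()


if __name__ == "__main__":
    monitor = RollingFailureMonitor(window=5, threshold=0.4, min_samples=5)
    for outcome in [True, True, True, True, True, False, False, False]:
        alert = monitor.record(outcome)
        print(f"success={outcome} rate={monitor.failure_rate():.2f} alert={alert}")
```

The minimum sample count keeps one early failure from triggering an alert, and the fixed-size window lets the rate recover once the agent's behavior improves.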

The practical takeaway is clear: treat agent evaluation as an ongoing discipline rather than a single milestone. With LangSmith on AWS, teams now have a structured way to test, monitor, and refine deep agents from development through deployment, using concrete patterns, a reproducible offline test harness, and production-grade monitoring that aligns with the realities of multi-step reasoning and tool use.

Sources
  1. Comprehensive observability for Amazon SageMaker AI LLM inference: From GPU utilization to LLM quality
    AWS Machine Learning / Primary / Published MAY 29, 2026 / Accessed MAY 30, 2026
  2. Evaluating Deep Agents using LangSmith on AWS
    AWS Machine Learning / Primary / Published MAY 28, 2026 / Accessed MAY 30, 2026
