Implement structured error responses for MCP tools

In production, MCP tools fail constantly: gateways time out, inputs are malformed, policies block actions, and permissions are missing. If every failure returns the same generic "Operation failed" string, the agent cannot tell a recoverable blip from a permanent block, so it wastes iterations retrying doomed calls or abandons cases it could have resolved. Structured error responses, built on the MCP isError flag plus categorized, retryability-tagged metadata, let the agent make correct recovery decisions and communicate appropriately with users.

[Figure: "Structured MCP error drives informed recovery": set isError: true and return categorized, retryability-tagged metadata.
Left panel: a process_refund failure result with isError: true whose text content carries errorCategory "business", isRetryable false, message "Refund exceeds limit", and customerMessage "Needs approval". Notes: 0 matches = success (isError: false); could not run (timeout) = isError: true, transient; a subagent recovers transient errors locally and propagates only unresolved errors, with partial results and what was attempted, to the coordinator.
Right panel (category -> retryable -> action): transient, retryable (timeout, 503, 429) = retry with backoff; validation, not retryable (bad input) = fix arguments, re-call; business, not retryable (policy block) = explain, escalate; permission, not retryable (unauthorized) = escalate, re-auth.]
A structured MCP failure result (left) and how each error category maps to a retryability flag and a distinct agent recovery action (right).

The MCP isError flag: report failures the agent can see

MCP tools report problems through two different channels, and choosing the wrong one hides the failure from Claude. Protocol errors are JSON-RPC level errors (unknown tool, malformed arguments, a server that crashed) and signal that the request itself could not be processed. Tool execution errors are different: the tool ran, but the operation failed (a payment gateway timed out, a refund violated policy). For execution errors, the MCP pattern is to return a normal CallToolResult with isError: true and the details inside content, so the model sees the failure and can reason about what to do next.

If you instead let the handler throw, many SDKs surface a terse protocol error or a generic message and you lose control of the payload. Catch the error yourself and return structured content:

return {
  isError: true,
  content: [{ type: "text", text: JSON.stringify({
    errorCategory: "transient",
    isRetryable: true,
    message: "Payment gateway timed out after 30s"
  }) }]
};

The isError flag is what tells the agent this result is a failure rather than data. The JSON inside content is what tells it how to react.
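The catch-and-return pattern can be sketched as follows. This is a minimal illustration, not any SDK's API: ToolResult is a local stand-in for the CallToolResult shape, and isTimeout, processRefund, and chargeGateway are hypothetical names. It also assumes the gateway client rejects with an error whose code is "ETIMEDOUT".

```typescript
// Local stand-in for the MCP CallToolResult shape used in this lesson;
// declared here so the sketch runs without an SDK.
interface ToolResult {
  isError: boolean;
  content: { type: "text"; text: string }[];
}

// Assumption: the gateway client rejects with an error whose `code` is "ETIMEDOUT".
function isTimeout(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { code?: unknown }).code === "ETIMEDOUT";
}

// Hypothetical process_refund handler. `chargeGateway` is injected so the
// example stays self-contained.
async function processRefund(
  orderId: string,
  chargeGateway: (orderId: string) => Promise<string>,
): Promise<ToolResult> {
  try {
    const confirmation = await chargeGateway(orderId);
    return { isError: false, content: [{ type: "text", text: confirmation }] };
  } catch (err) {
    if (isTimeout(err)) {
      // Execution failure: return a normal result the model can read.
      const payload = {
        errorCategory: "transient",
        isRetryable: true,
        message: "Payment gateway timed out",
      };
      return { isError: true, content: [{ type: "text", text: JSON.stringify(payload) }] };
    }
    // Anything unrecognized is rethrown and left to surface as a protocol-level error.
    throw err;
  }
}
```

Rethrowing unrecognized errors is one possible choice here; a tool could equally map them to a structured payload of its own.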

A taxonomy of failure: transient, validation, business, permission

Not all failures deserve the same response, so categorize them. Four categories cover almost every MCP tool:

  • Transient: timeouts, service unavailable (503), rate limits (429). The same call may succeed if tried again. Retryable.
  • Validation: malformed or out-of-range input, such as a badly formatted order ID or a missing required field. Retrying the identical call fails again; the agent must correct the input first.
  • Business: the operation is well-formed and authorized but violates a policy, such as a refund above the auto-approval limit or an item outside the return window. Not retryable; the agent must explain or escalate.
  • Permission: the caller is not authorized or lacks the required scope. Not retryable by the agent; it needs escalation or different credentials.

Putting the category in the response as errorCategory lets the agent branch correctly instead of guessing from a prose string. It also lets you log and alert on each class separately: a spike in transient errors is an infrastructure problem, while a spike in validation errors usually means a prompt or schema issue.
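One way a tool might assign a category is from the status code of an upstream HTTP call. The sketch below is illustrative only: 503 and 429 come from the list above, while the remaining status codes and the names STATUS_TO_CATEGORY, classifyStatus, and isRetryable are assumptions about a hypothetical upstream API.

```typescript
type ErrorCategory = "transient" | "validation" | "business" | "permission";

// Assumed mapping for a hypothetical upstream HTTP API. 503 and 429 are the
// transient examples named in the lesson; the other codes are illustrative.
const STATUS_TO_CATEGORY: Record<number, ErrorCategory> = {
  429: "transient",
  503: "transient",
  504: "transient",
  400: "validation",
  422: "validation",
  401: "permission",
  403: "permission",
};

// Returns undefined for statuses the table does not cover, so the caller
// must decide explicitly instead of inheriting a silent default.
function classifyStatus(status: number): ErrorCategory | undefined {
  return STATUS_TO_CATEGORY[status];
}

// Only transient failures are worth repeating unchanged.
function isRetryable(category: ErrorCategory): boolean {
  return category === "transient";
}
```

Business violations rarely arrive as a status code. In this sketch the tool's own policy check would set errorCategory to "business" directly.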

Retryable vs non-retryable, and why metadata prevents wasted work

The agentic loop keeps calling tools until Claude decides it is done. If a failed call looks recoverable, the model may retry it. That is exactly right for a transient timeout and exactly wrong for a business rule violation, where every retry burns another iteration, more tokens, and more latency while producing the identical failure.

A boolean isRetryable (some teams spell it retriable) removes the guesswork. Transient errors set it true, often with a retryAfterMs hint; validation, business, and permission errors set it false. When the model sees isRetryable: false it stops retrying and moves to the appropriate fallback: fix the input, explain the policy, or escalate. Returning this metadata is what converts a blind retry loop into a deliberate decision, and it is the main reason structured errors save real money at scale.
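If part of the retry policy lives in orchestration code rather than in the model's reasoning, the same metadata can drive it. The following is a sketch under assumptions: nextDelayMs, the attempt cap of 3, and the doubling backoff starting at 500 ms are illustrative choices, not values from MCP or this lesson.

```typescript
interface ToolError {
  errorCategory: string;
  isRetryable: boolean;
  message: string;
  retryAfterMs?: number; // optional hint, usually set only on transient errors
}

// Returns how long to wait before the next attempt, or null to stop retrying.
// `attempt` is the number of calls already made (1 after the first failure).
function nextDelayMs(
  error: ToolError,
  attempt: number,
  maxAttempts = 3,
  baseMs = 500,
): number | null {
  if (!error.isRetryable) return null; // validation, business, permission
  if (attempt >= maxAttempts) return null; // retry budget exhausted
  const backoff = baseMs * 2 ** (attempt - 1); // 500, 1000, 2000, ...
  // Never wait less than the server's own hint.
  return Math.max(error.retryAfterMs ?? 0, backoff);
}
```

A null return is the signal to move to the fallback (fix the input, explain the policy, or escalate) instead of calling the tool again.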

Anatomy of a good structured error payload

A good error payload carries everything the agent needs in one object: a machine-readable category, a retryability flag, a developer-facing message, and, for business rules, a customer-facing explanation the agent can relay verbatim.

// business rule violation
return {
  isError: true,
  content: [{ type: "text", text: JSON.stringify({
    errorCategory: "business",
    isRetryable: false,
    message: "Refund of $650 exceeds the $500 auto-approval limit",
    customerMessage: "This refund needs a supervisor to approve it. I can connect you with someone who can help right now."
  }) }]
};

Separating message from customerMessage matters: the first drives the agent's internal logic and your logs, the second gives the agent safe, on-brand wording so it does not leak an internal threshold or invent an explanation. Keep the field names identical across every tool so the agent learns one consistent shape rather than a different schema per tool.
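A shared type can help keep that shape identical across tools. The sketch below is one possible encoding, with ToolErrorPayload and errorResult as hypothetical names. It uses a discriminated union so the compiler rejects a business error that lacks customerMessage, or a non-transient error marked retryable.

```typescript
// Local stand-in for the CallToolResult shape used in this lesson.
interface ToolResult {
  isError: boolean;
  content: { type: "text"; text: string }[];
}

// One payload shape for every tool. The union ties retryability and the
// optional fields to the category.
type ToolErrorPayload =
  | { errorCategory: "transient"; isRetryable: true; message: string; retryAfterMs?: number }
  | { errorCategory: "validation" | "permission"; isRetryable: false; message: string }
  | { errorCategory: "business"; isRetryable: false; message: string; customerMessage: string };

// Wraps a payload in an isError result so every tool serializes it the same way.
function errorResult(payload: ToolErrorPayload): ToolResult {
  return { isError: true, content: [{ type: "text", text: JSON.stringify(payload) }] };
}

const overLimit = errorResult({
  errorCategory: "business",
  isRetryable: false,
  message: "Refund of $650 exceeds the $500 auto-approval limit",
  customerMessage: "This refund needs a supervisor to approve it.",
});
```

Omitting customerMessage from the business case above would fail type checking, which catches the mistake before the tool ships.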

Access failure vs valid empty result

A query that runs successfully and finds nothing is not an error. lookup_order returning zero orders for a valid customer is a successful result: isError is false and the content is an empty list. A tool that could not run at all, for example because the order service timed out, is a failure: isError is true with a transient category.

Conflating the two is a common and costly bug. If you mark empty results as errors, the agent retries queries that were already answered, or worse, concludes the data is unavailable and escalates unnecessarily. If you return empty success on a real failure, the agent believes no data exists and produces confidently wrong output, for example a research report that silently omits a whole subtopic because one search timed out. Always distinguish an access failure, which needs a retry decision, from a valid empty result, which is a definitive answer of none.
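The distinction can be made explicit in the tool's return path. In this sketch OrderServiceOutcome, toContent, and lookupOrderResult are hypothetical names, and the order service is reduced to an outcome value so the example runs on its own.

```typescript
// Local stand-in for the CallToolResult shape used in this lesson.
interface ToolResult {
  isError: boolean;
  content: { type: "text"; text: string }[];
}

// Hypothetical outcome of calling the order service.
type OrderServiceOutcome =
  | { ok: true; orders: string[] } // the query ran; the list may be empty
  | { ok: false; reason: "timeout" }; // the query could not run

function toContent(value: unknown): ToolResult["content"] {
  return [{ type: "text", text: JSON.stringify(value) }];
}

function lookupOrderResult(outcome: OrderServiceOutcome): ToolResult {
  if (outcome.ok) {
    // Zero orders is a definitive answer, so it is still a success.
    return { isError: false, content: toContent({ orders: outcome.orders }) };
  }
  // Access failure: the agent needs a retry decision, not an empty list.
  return {
    isError: true,
    content: toContent({
      errorCategory: "transient",
      isRetryable: true,
      message: "Order service timed out",
    }),
  };
}
```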

Local recovery in subagents, structured propagation to the coordinator

In multi-agent systems, error handling has two layers. A subagent should recover locally from transient failures it can resolve, for example retrying a timed-out search once or twice with backoff, so the coordinator never sees noise it cannot act on. Only errors the subagent cannot resolve should propagate upward.

When a subagent does propagate, it should not send a bare "failed" status. It returns structured context: the failure category, what it attempted (the exact query or operation), any partial results it did gather, and possible alternatives. That lets the coordinator make an intelligent recovery decision, whether to retry with a narrower query, route to a different source, or proceed with partial coverage and annotate the gap, rather than terminating the entire workflow on a single failure. This is the tool-level foundation of the error-propagation strategies used across a coordinator-subagent research system.
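The two layers might look like this in code. It is a simplified sketch: SearchOutcome, SubagentReport, and runSubagent are hypothetical names, the search function is synchronous, and the backoff delay between attempts is omitted for brevity.

```typescript
type SearchOutcome =
  | { ok: true; results: string[] }
  | { ok: false; errorCategory: "transient" | "permission"; message: string };

// What the subagent hands to the coordinator instead of a bare "failed".
interface SubagentReport {
  status: "complete" | "partial";
  results: string[]; // partial results gathered so far
  failures: {
    errorCategory: string;
    attempted: string; // the exact query that failed
    attempts: number;
    message: string;
    alternatives: string[];
  }[];
}

function runSubagent(
  queries: string[],
  search: (query: string) => SearchOutcome,
  maxAttempts = 3,
): SubagentReport {
  const report: SubagentReport = { status: "complete", results: [], failures: [] };
  for (const query of queries) {
    let outcome = search(query);
    let attempts = 1;
    // Local recovery: repeat only transient failures, up to maxAttempts.
    while (!outcome.ok && outcome.errorCategory === "transient" && attempts < maxAttempts) {
      outcome = search(query);
      attempts++;
    }
    if (outcome.ok) {
      report.results.push(...outcome.results);
    } else {
      // Unresolved: propagate structured context, keep what was gathered.
      report.status = "partial";
      report.failures.push({
        errorCategory: outcome.errorCategory,
        attempted: query,
        attempts,
        message: outcome.message,
        alternatives: ["retry with a narrower query", "route to a different source"],
      });
    }
  }
  return report;
}
```

With a report like this, the coordinator can choose between retrying, rerouting, or proceeding with partial coverage and annotating the gap.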

Anti-patterns to avoid

Avoid: Returning one generic error like "Operation failed" (or a bare non-zero status) for every failure.

Why it fails: The agent cannot tell a retryable timeout from a permanent policy block, so it either retries doomed calls until it hits an iteration cap or gives up on cases it could have resolved. A uniform error destroys the information the agent needs to recover.

Instead: Return `isError: true` with a categorized payload (errorCategory, isRetryable, message) so each failure maps to a distinct, correct recovery path.

Avoid: Letting the tool handler throw so the SDK surfaces a raw protocol/JSON-RPC error for an execution failure.

Why it fails: Protocol errors signal that the request could not be processed at all and give you no control over the payload; the model often sees only a terse string it cannot categorize, so it cannot reason about recovery.

Instead: Catch execution errors and return a normal result with `isError: true` and structured content. Reserve protocol errors for genuinely un-invocable requests such as an unknown tool or malformed arguments.

Avoid: Marking a valid empty result as isError:true, or returning empty success when the tool actually failed.

Why it fails: Zero matches is a definitive answer, not a failure. Flagging it as an error triggers pointless retries or false escalation, while hiding a real failure as empty success makes the agent omit data it should have found.

Instead: Return empty results as a successful `CallToolResult` (isError:false, empty array). Reserve `isError: true` for access failures that genuinely could not complete.

Avoid: Encoding retryability only inside the human-readable message and expecting the agent to parse it.

Why it fails: Relying on the model to infer "this is retryable" from prose is unreliable and wastes retries on business and permission errors that will never succeed no matter how many times they are attempted.

Instead: Expose an explicit `isRetryable` boolean, plus a `retryAfterMs` hint for transient errors, so retry behavior is deterministic rather than inferred.

Worked example: Hardening a customer support agent's MCP tools with structured errors

Your support agent (Scenario 1) resolves returns and billing disputes through custom MCP tools: get_customer, lookup_order, process_refund, and escalate_to_human. In production, three failure modes dominate, and a uniform "Operation failed" response makes the agent misbehave in all of them: it retries a policy-blocked refund five times, then escalates a case it could have resolved. Redesign each tool to return categorized errors.

1) Transient: the payment gateway behind process_refund times out.

return { isError: true, content: [{ type: "text", text: JSON.stringify({
  errorCategory: "transient", isRetryable: true, retryAfterMs: 2000,
  message: "Payment gateway timed out" }) }] };

The agent waits, retries, and usually succeeds on the second attempt.

2) Business: the refund is $650 but the auto-approval limit is $500.

return { isError: true, content: [{ type: "text", text: JSON.stringify({
  errorCategory: "business", isRetryable: false,
  message: "Refund $650 exceeds $500 auto-approval limit",
  customerMessage: "This amount needs supervisor approval; I will connect you with a human agent." }) }] };

Because isRetryable is false, the agent does not retry. It relays the customerMessage and calls escalate_to_human.

3) Empty vs failure in lookup_order. For a customer who genuinely has no orders, the tool returns a successful, empty result (isError: false, orders: []), and the agent tells the customer no orders were found instead of retrying. But if the order service is down, lookup_order returns isError: true with errorCategory: "transient", and the agent retries or escalates.

The net effect: fewer wasted retries, correct escalation on policy limits, and customer-safe wording, which is exactly what moves first-contact resolution toward the 80% target.
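The recovery behavior in this example can be written down as a small decision function, for instance as a test oracle when evaluating the agent. This is a sketch under assumptions: in practice the model makes this decision from the payload, and Action and decideAction are hypothetical names.

```typescript
// Local stand-in for the CallToolResult shape used in this lesson.
interface ToolResult {
  isError: boolean;
  content: { type: "text"; text: string }[];
}

type Action = "use_result" | "retry" | "fix_input" | "relay_and_escalate" | "escalate";

// Maps a tool result to the recovery action described in the worked example.
function decideAction(result: ToolResult): Action {
  // Includes valid empty results such as orders: [].
  if (!result.isError) return "use_result";

  let payload: { errorCategory?: string; isRetryable?: boolean };
  try {
    payload = JSON.parse(result.content.map((c) => c.text).join(""));
  } catch {
    return "escalate"; // unstructured error text: nothing safe to act on
  }

  if (payload.isRetryable === true) return "retry"; // transient
  switch (payload.errorCategory) {
    case "validation":
      return "fix_input"; // correct the arguments, then call again
    case "business":
      return "relay_and_escalate"; // relay customerMessage, call escalate_to_human
    default:
      return "escalate"; // permission, or a category this sketch does not know
  }
}
```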

Exam tips

  • isError:true inside a CallToolResult reports an execution failure the agent should reason about; JSON-RPC protocol errors are for requests that cannot be invoked at all (unknown tool, bad arguments).
  • Memorize the four categories and their retryability: transient is retryable; validation, business, and permission are not retryable by the agent as-is.
  • Structured metadata (errorCategory + isRetryable) is what prevents wasted retries; without it the agent burns iterations retrying business and permission errors.
  • Business rule violations should return isRetryable: false (written retriable: false in the exam objectives) plus a customer-friendly message the agent can relay, so it does not leak internal thresholds or invent an explanation.
  • A query with zero matches is a successful, empty result (isError:false), not an error; only a query that could not run is isError:true.
  • In multi-agent systems, subagents recover transient failures locally and propagate only unresolved errors, including partial results and what was attempted, to the coordinator.
Official exam objectives for 2.2
Knowledge of
  • The MCP isError flag pattern for communicating tool failures back to the agent
  • The distinction between transient errors (timeouts, service unavailability), validation errors (invalid input), business errors (policy violations), and permission errors
  • Why uniform error responses (generic "Operation failed") prevent the agent from making appropriate recovery decisions
  • The difference between retryable and non-retryable errors, and how returning structured metadata prevents wasted retry attempts
Skills in
  • Returning structured error metadata including errorCategory (transient/validation/permission), isRetryable boolean, and human-readable descriptions
  • Including retriable: false flags and customer-friendly explanations for business rule violations so the agent can communicate appropriately
  • Implementing local error recovery within subagents for transient failures, propagating to the coordinator only errors that cannot be resolved locally along with partial results and what was attempted
  • Distinguishing between access failures (needing retry decisions) and valid empty results (representing successful queries with no matches)

Flashcards from this lesson

Which MCP field marks a tool result as a failure the agent should reason about, and where does it live?

The isError:true field on the CallToolResult. Set it and put structured details in the content, rather than throwing a protocol error.

How does an isRetryable boolean save cost in the agentic loop?

It stops the model from retrying business, validation, and permission failures that will never succeed, avoiding wasted iterations, tokens, and latency.

How should a tool represent a query that ran successfully but found nothing?

As a successful result: isError:false with an empty array. It is a valid empty result, not an access failure, so the agent should not retry or escalate.

When should a subagent propagate an error to the coordinator, and what should it include?

Only errors it cannot resolve locally (transient failures should be retried locally first). Include the failure category, what was attempted, partial results, and possible alternatives.

For a business rule violation like an over-limit refund, what two message fields help the agent?

message (developer/internal detail for logic and logs) and customerMessage (safe, on-brand wording the agent relays), with isRetryable:false.

Name the four MCP error categories and which is retryable.

Transient is retryable; validation, business, and permission are not retryable without a change. Transient errors may include a retryAfterMs hint.

Difference between an MCP protocol error and a tool execution error?

Protocol (JSON-RPC) errors mean the request could not be processed at all (unknown tool, bad args). Execution errors are failures of a tool that ran, reported via isError:true so the model can see and handle them.
