4.5 (12 min)

Design efficient batch processing strategies

The Message Batches API trades latency for a 50% discount: you submit many independent requests at once and collect results later, with processing taking up to 24 hours and no guaranteed latency SLA. This lesson covers when batch is the right tool (non-blocking, latency-tolerant work) versus when it is disqualifying (blocking, real-time checks), how custom_id correlates results, why the batch API cannot run agentic tool loops mid-request, and how to size submission windows to an SLA and recover cleanly from partial failures.

Figure: The Message Batches API lifecycle. Submit independent requests keyed by custom_id, wait through an asynchronous window of up to 24 hours (50% cheaper, no latency SLA), then correlate results by custom_id and resubmit only the failures. The bottom strip contrasts synchronous (blocking) versus batch (non-blocking) selection.

What the Message Batches API is

The Message Batches API lets you submit a large collection of independent Message requests in a single asynchronous job instead of making one blocking API call at a time. You POST a batch to /v1/messages/batches and the platform processes the requests in the background. You then poll the batch's status until it finishes and download the results file.

The economics and the constraints are the whole point. Batched requests cost about 50% less than the equivalent synchronous requests, for both input and output tokens. In exchange, the batch processes asynchronously with a window of up to 24 hours. Most batches finish well before that, but there is no guaranteed latency SLA, so you must design as if any batch could take the full 24 hours. Capacity is large (up to 100,000 requests or 256 MB per batch, whichever limit is reached first), and results remain retrievable for about four weeks (29 days after the batch is created).

POST /v1/messages/batches
{
  "requests": [
    { "custom_id": "invoice-001",
      "params": { "model": "claude-...", "max_tokens": 1024,
                  "messages": [ { "role": "user", "content": "..." } ] } },
    { "custom_id": "invoice-002", "params": { "...": "..." } }
  ]
}

Each entry in requests[] is a completely standalone Message call with its own params. The batch is just a container that runs them cheaply and asynchronously; it does not chain them together or share state between them.
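
As a minimal sketch of that container idea, the helper below builds the requests[] array from a mapping of custom_id to document text. The function name build_batch_requests and the MODEL_ID_PLACEHOLDER string are illustrative assumptions, not part of the API; only the custom_id/params shape comes from the example above.

```python
from typing import Any


def build_batch_requests(
    documents: dict[str, str], model: str, max_tokens: int = 1024
) -> list[dict[str, Any]]:
    """Turn {custom_id: text} into standalone batch request entries.

    Each entry carries its own params; nothing is shared between entries.
    """
    requests = []
    for custom_id, text in documents.items():
        requests.append(
            {
                "custom_id": custom_id,
                "params": {
                    "model": model,
                    "max_tokens": max_tokens,
                    "messages": [{"role": "user", "content": text}],
                },
            }
        )
    return requests


# Hypothetical inputs; substitute a real model id before submitting.
payload = {
    "requests": build_batch_requests(
        {"invoice-001": "Invoice text A", "invoice-002": "Invoice text B"},
        model="MODEL_ID_PLACEHOLDER",
    )
}
```

The resulting payload is what would be sent as the body of the POST; the submission call itself is left out because it depends on your HTTP client or SDK.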

Match the API to the workload's latency tolerance

The core exam decision is choosing batch versus the synchronous (real-time) /v1/messages API based on whether a human or a downstream step is blocked waiting for the answer. Batch is appropriate for non-blocking, latency-tolerant workloads: overnight technical-debt reports, weekly compliance audits, nightly test generation, and bulk document extraction that feeds a report someone reads the next morning. Nothing is stalled while the batch runs, so the multi-hour turnaround costs nothing and the 50% discount is pure savings.

Batch is inappropriate for blocking workflows. The canonical disqualifier is a pre-merge check that a developer waits on before they can merge a pull request. If someone is sitting there waiting, or a synchronous pipeline stage cannot proceed until the answer arrives, the up-to-24-hour window makes batch unusable regardless of the cost savings. The right move is to split the workloads: keep the synchronous API for the blocking pre-merge check, and move only the latency-tolerant overnight report to batch.

A classic distractor is "switch both to batch and poll for completion" or "switch both with a real-time fallback if the batch is slow." Both are wrong. Batch often finishes fast, but "often" is not acceptable for a blocking gate, and bolting a real-time fallback onto a batch job just adds complexity to hide the fact that the workload never belonged in batch. Match each workflow to the API that fits its latency requirement.
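
The decision rule above can be written down as a tiny helper. This is a sketch under stated assumptions: choose_api, its parameters, and the string labels are hypothetical names used only to make the rule explicit.

```python
from typing import Optional


def choose_api(
    blocking: bool,
    deadline_hours: Optional[float] = None,
    max_batch_hours: float = 24.0,
) -> str:
    """Pick an API per workflow based on its latency requirement."""
    # Someone or something is waiting on the answer: batch is disqualified.
    if blocking:
        return "synchronous"
    # A deadline at or below the batch window leaves no room for the
    # up-to-24-hour processing time plus any submission wait.
    if deadline_hours is not None and deadline_hours <= max_batch_hours:
        return "synchronous"
    return "batch"
```

Applied to the two workloads in this section, choose_api(blocking=True) returns "synchronous" for the pre-merge check and choose_api(blocking=False) returns "batch" for the overnight report, which is the split the text recommends.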

custom_id: correlating requests and results

Every request in a batch carries a custom_id that you assign. It is the key that ties each result back to the input it came from. This matters because batch results are not guaranteed to come back in the order you submitted them, and each request succeeds or fails independently. You must correlate by custom_id, never by array position or submission order.

Each result reports its own outcome, typically one of succeeded, errored, canceled, or expired. That per-request status is what lets you treat a batch as a set of independent jobs rather than an all-or-nothing unit. A common exam misconception is that batch results are unusable because ordering is lost; the correct rebuttal is that custom_id fully solves correlation, so ordering is a non-issue.

In practice you keep a map from custom_id to your domain object (an invoice number, a file path, a PR file, a ticket id). When results come back, you look up each one by its custom_id, write the succeeded outputs to your store, and collect the custom_ids of the errored ones for targeted resubmission. Make custom_id values stable and meaningful (for example the document id) so a resubmission batch reuses the same identifiers and stays traceable end to end.
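
A minimal sketch of that lookup follows. It assumes the results file is JSON Lines with one object per request containing custom_id and a result object whose type field holds the status; verify that shape against the current API reference. The function name and the sample file paths are hypothetical.

```python
import json
from typing import Any, Iterable


def correlate_results(
    jsonl_lines: Iterable[str], inputs_by_id: dict[str, Any]
) -> dict[str, dict[str, Any]]:
    """Join each result to its input by custom_id, never by position."""
    correlated = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        custom_id = entry["custom_id"]
        if custom_id not in inputs_by_id:
            raise KeyError(f"result for unknown custom_id: {custom_id}")
        correlated[custom_id] = {
            "input": inputs_by_id[custom_id],
            "status": entry["result"]["type"],
            "result": entry["result"],
        }
    return correlated


# Results deliberately arrive in a different order than the inputs.
inputs_by_id = {
    "invoice-001": "/data/inv/001.pdf",
    "invoice-002": "/data/inv/002.pdf",
}
results_jsonl = [
    '{"custom_id": "invoice-002", "result": {"type": "errored"}}',
    '{"custom_id": "invoice-001", "result": {"type": "succeeded"}}',
]
correlated = correlate_results(results_jsonl, inputs_by_id)
retry_ids = sorted(
    cid for cid, item in correlated.items() if item["status"] != "succeeded"
)
```

Because the join key is the custom_id, the out-of-order results still land on the right inputs, and retry_ids holds exactly the identifiers to resubmit.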

The tool-use limitation: no agentic loop inside a batch request

A single batch request is one independent completion, not an agentic loop. The batch API does not support multi-turn tool calling within a request: Claude cannot call a tool, have your code execute it, feed the result back, and continue reasoning, all inside one batched request. There is no mechanism for your infrastructure to run a tool and return its output mid-request during batch processing.

You can still include tool definitions in a batch request to get a single, schema-validated tool_use output (this is a common structured-extraction pattern, and it pairs naturally with domain 4.3). What you cannot do is orchestrate the full agentic loop inside the batch itself: send a request, inspect stop_reason, execute the tool, append the result, and iterate.

The design implication is a clean split. If a task needs Claude to actually invoke tools and act on their results (browsing, querying a database, running code), that agentic loop belongs in the synchronous API or the Agent SDK, where your loop can execute tools between turns. Batch is for large volumes of self-contained, single-shot completions: classify this document, extract these fields, summarize this file. If you find yourself wanting tool execution mid-request in a batch, that is a signal the workload is agentic and should not be batched as one request.
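
To make the contrast concrete, here is a sketch of the loop that has to live in your own code on the synchronous side. The send callable is a stand-in for whatever function performs one synchronous Messages call and returns the response as a dict; run_agent_loop, send, and max_turns are hypothetical names, and the block shapes (tool_use with id/name/input, tool_result with tool_use_id) reflect my understanding of the Messages API tool-use format rather than anything stated in this lesson.

```python
from typing import Any, Callable

Message = dict[str, Any]


def run_agent_loop(
    send: Callable[[list[Message]], Message],
    tools: dict[str, Callable[..., Any]],
    user_prompt: str,
    max_turns: int = 5,
) -> Message:
    """Send, inspect stop_reason, execute tools, append results, iterate."""
    messages: list[Message] = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        response = send(messages)
        # Any stop_reason other than tool_use ends the loop.
        if response["stop_reason"] != "tool_use":
            return response
        messages.append({"role": "assistant", "content": response["content"]})
        tool_results = []
        for block in response["content"]:
            if block["type"] == "tool_use":
                # This step runs on YOUR infrastructure between turns,
                # which is exactly what a batch request has no hook for.
                output = tools[block["name"]](**block["input"])
                tool_results.append(
                    {
                        "type": "tool_result",
                        "tool_use_id": block["id"],
                        "content": str(output),
                    }
                )
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError("agent loop exceeded max_turns")
```

Every iteration needs a fresh request that depends on the previous response, so the loop cannot be flattened into one self-contained batch entry.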

Sizing submission windows to hit an SLA

When latency-tolerant work still has a deadline, you size how often you submit batches so the worst case fits the SLA. The reasoning: a document that arrives just after a submission waits until the next window (up to the full window length W), then the batch itself can take up to 24 hours. So worst-case latency is approximately W + 24 hours.

To guarantee a 30-hour SLA with a 24-hour batch window, you need W + 24 <= 30, so W <= 6 hours. Choosing a 4-hour submission window satisfies this with margin: worst case 4 + 24 = 28 hours, leaving a 2-hour safety buffer for retrieval, downstream processing, and jitter. This is exactly the blueprint's example: 4-hour windows guarantee a 30-hour SLA under 24-hour batch processing.

worst_case_latency = submission_window + max_batch_processing
28h                = 4h                 + 24h        <= 30h SLA  (OK, 2h buffer)

Note what this budget does not include: a full resubmission cycle. If a batch's failures need to be resubmitted, that is another window plus up to 24 hours, which would blow a tight SLA. That is why maximizing first-pass success (next concept) is part of hitting an SLA, not a separate nicety. Shorter windows reduce worst-case latency but submit more, smaller batches; longer windows batch more efficiently but eat into the SLA.
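
The arithmetic above is small enough to encode directly. In this sketch the function names are hypothetical, and the resubmission_cycles parameter models the lesson's "another window plus up to 24 hours" per retry round, which is a simplifying assumption.

```python
def max_submission_window(
    sla_hours: float, max_processing_hours: float = 24.0
) -> float:
    """Largest window W such that W + processing <= SLA."""
    window = sla_hours - max_processing_hours
    if window <= 0:
        raise ValueError("SLA leaves no room for a submission window")
    return window


def worst_case_latency(
    window_hours: float,
    max_processing_hours: float = 24.0,
    resubmission_cycles: int = 0,
) -> float:
    """Each pass costs one window of waiting plus full processing time."""
    passes = 1 + resubmission_cycles
    return passes * (window_hours + max_processing_hours)


def sla_buffer(
    sla_hours: float, window_hours: float, max_processing_hours: float = 24.0
) -> float:
    """Slack left after the worst-case first pass."""
    return sla_hours - worst_case_latency(window_hours, max_processing_hours)
```

With these definitions, max_submission_window(30) is 6 hours, worst_case_latency(4) is 28 hours, and sla_buffer(30, 4) is 2 hours. Adding a single retry round, worst_case_latency(4, resubmission_cycles=1), gives 56 hours, which shows why a resubmission does not fit inside a 30-hour SLA.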

Maximize first-pass success, then resubmit only failures

Because a failed batch item costs you another window plus processing time, the cheapest strategy is to get as many requests right on the first pass as possible. Before batching a large volume, refine the prompt on a small representative sample using the synchronous API where iteration is fast. Fix the format, schema, and edge-case handling on the sample, then run the full batch once the sample passes reliably. This trades a little up-front sync cost for far fewer expensive resubmission rounds.
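
One way to make "passes reliably" operational is a simple gate on the sample's pass rate. This is a sketch only: the helper names, the fixed random seed, and the 0.95 threshold are assumptions for illustration, and how you judge a single sample output as passing is left to your own validation.

```python
import random
from typing import Sequence


def pick_sample(doc_ids: Sequence[str], k: int, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample of document ids."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), min(k, len(doc_ids)))


def ready_for_full_batch(
    sample_outcomes: dict[str, bool], min_pass_rate: float = 0.95
) -> bool:
    """True once enough sampled documents passed validation."""
    if not sample_outcomes:
        return False
    pass_rate = sum(sample_outcomes.values()) / len(sample_outcomes)
    return pass_rate >= min_pass_rate
```

The sample is processed through the synchronous API, each output is marked pass or fail, and the full batch is submitted only when ready_for_full_batch returns True.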

When a batch does come back with failures, resubmit only the failed items, identified by custom_id, not the whole batch. Reprocessing the succeeded requests would waste the discount you just captured. Apply targeted modifications to the failures based on why they failed: chunk documents that exceeded the context limit, trim or preprocess inputs that were malformed, or adjust the prompt for a systematic error. Then submit those as a new, smaller batch reusing the same custom_ids. When a document is split into chunks, derive each chunk's id from the original (for example by adding a suffix), because every custom_id in a batch must be unique.

results -> partition by status
  succeeded[]  -> write to store, done
  errored[]    -> for each custom_id: fix cause (e.g. chunk oversized doc)
               -> resubmit ONLY these as a new batch
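
The partition-and-resubmit flow above might look like the following in Python. It is a sketch that assumes each parsed result is a dict with custom_id and result.type; partition_by_status, build_resubmission, and trim_input are hypothetical helpers, and the character-based trim stands in for whatever fix the failure cause calls for.

```python
from typing import Any, Callable


def partition_by_status(results: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Group custom_ids by their per-request result type."""
    buckets: dict[str, list[str]] = {}
    for entry in results:
        buckets.setdefault(entry["result"]["type"], []).append(entry["custom_id"])
    return buckets


def build_resubmission(
    failed_ids: list[str],
    original_params: dict[str, dict[str, Any]],
    fix: Callable[[dict[str, Any]], dict[str, Any]],
) -> list[dict[str, Any]]:
    """Rebuild requests for the failed ids only, applying a targeted fix."""
    return [
        {"custom_id": cid, "params": fix(original_params[cid])}
        for cid in failed_ids
    ]


def trim_input(params: dict[str, Any], max_chars: int = 20) -> dict[str, Any]:
    """Example fix: shorten each message without mutating the original."""
    fixed = dict(params)
    fixed["messages"] = [
        {**m, "content": m["content"][:max_chars]} for m in params["messages"]
    ]
    return fixed
```

Only the ids in the errored bucket are passed to build_resubmission, so succeeded requests are never paid for twice.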

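For extra savings, batch composes with prompt caching: a system prompt or context block that is identical across many requests can be cached, so the cache-read discount stacks with the batch discount. Because batched requests are processed concurrently and in no fixed order, cache hits are best-effort rather than guaranteed. The combination is common in high-volume extraction pipelines.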
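
A sketch of how a shared, cacheable prefix could be attached to every request is shown below. The cache_control block with type "ephemeral" on a system text block is my understanding of the prompt caching request format and should be checked against the current documentation; the helper name and the placeholder model id are hypothetical.

```python
from typing import Any


def build_cached_requests(
    shared_system_prompt: str,
    documents: dict[str, str],
    model: str,
    max_tokens: int = 1024,
) -> list[dict[str, Any]]:
    """Give every request the same system block, marked as cacheable."""
    requests = []
    for custom_id, text in documents.items():
        requests.append(
            {
                "custom_id": custom_id,
                "params": {
                    "model": model,
                    "max_tokens": max_tokens,
                    # Identical prefix in every request so it can be reused.
                    "system": [
                        {
                            "type": "text",
                            "text": shared_system_prompt,
                            "cache_control": {"type": "ephemeral"},
                        }
                    ],
                    "messages": [{"role": "user", "content": text}],
                },
            }
        )
    return requests
```

The per-document text stays in messages, after the shared block, so the part that varies does not break the common prefix.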

Anti-patterns to avoid

Avoid: Moving a blocking pre-merge check to the Message Batches API to capture the 50% discount.

Why it fails: Batch processing can take up to 24 hours with no latency SLA, so a developer or pipeline stage waiting on the result would be blocked for an unacceptable and unpredictable time.

Instead: Keep the synchronous API for blocking, real-time checks and reserve batch for non-blocking work like overnight reports and nightly analysis.

Avoid: Switching a time-sensitive workload to batch with status polling because batches 'usually finish within an hour', or adding a real-time fallback if the batch runs long.

Why it fails: 'Usually fast' is not a guarantee; there is no latency SLA, and a real-time fallback just adds complexity to disguise a workload that never fit batch.

Instead: Decide by the SLA. If the deadline cannot tolerate up to 24 hours plus the submission wait, use the synchronous API instead of batch.

Avoid: Re-running the entire batch when only some requests failed, or correlating results by their position in the output.

Why it fails: Reprocessing succeeded requests throws away the discount, and results are not returned in submission order, so position-based matching mismatches outputs to inputs.

Instead: Correlate every result by its custom_id and resubmit only the failed custom_ids, with fixes such as chunking oversized documents.

Avoid: Trying to run a multi-turn, tool-calling agent inside a single batch request (call a tool, get results, keep reasoning).

Why it fails: The batch API does not support executing tools mid-request and feeding results back; each batch request is a single independent completion.

Instead: Run agentic tool loops on the synchronous API or Agent SDK; use batch only for self-contained single-shot completions (extraction, classification, summarization).

Worked example: A nightly invoice-extraction pipeline with a 30-hour SLA (Scenario 6)

You run a structured data extraction system on Claude. Invoices arrive continuously throughout the day, and finance needs each one's extracted fields available within 30 hours of receipt. Volume is high (tens of thousands per day) and nothing is blocked in real time waiting on any single invoice, so cost is the priority.

Choose the API. Because the work is non-blocking and latency-tolerant with a comfortable 30-hour deadline, this is a textbook Message Batches API workload: 50% cheaper, and the up-to-24-hour window fits inside the SLA. A synchronous per-invoice call would be correct only if a person or downstream step were blocked waiting, which they are not.

Size the submission window. Worst case, an invoice arrives just after a batch is submitted, waits a full window W, then its batch takes up to 24 hours: W + 24 <= 30, so W <= 6h. You choose a 4-hour window, giving 4 + 24 = 28h worst case and a 2-hour buffer for result retrieval and posting to the finance system. So you submit a batch every 4 hours with all invoices received since the last submission.
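
To check the window choice against actual arrival times, the sketch below computes when a given invoice would be submitted and the latest time its result could arrive. It assumes submission windows are aligned to midnight and that the window length divides 24 evenly; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def next_submission(arrival: datetime, window_hours: int = 4) -> datetime:
    """First window boundary at or after the arrival time."""
    day_start = arrival.replace(hour=0, minute=0, second=0, microsecond=0)
    window = timedelta(hours=window_hours)
    # Ceiling division: number of whole windows needed to cover the elapsed time.
    windows_needed = -(-(arrival - day_start) // window)
    return day_start + windows_needed * window


def latest_result_time(
    arrival: datetime, window_hours: int = 4, max_processing_hours: int = 24
) -> datetime:
    """Submission time plus the full processing window."""
    return next_submission(arrival, window_hours) + timedelta(
        hours=max_processing_hours
    )


def meets_sla(
    arrival: datetime,
    sla_hours: int = 30,
    window_hours: int = 4,
    max_processing_hours: int = 24,
) -> bool:
    latest = latest_result_time(arrival, window_hours, max_processing_hours)
    return latest - arrival <= timedelta(hours=sla_hours)
```

An invoice arriving at 08:01 just misses the 08:00 submission, goes out at 12:00, and is due by 12:00 the next day at the latest, which is inside 30 hours. With 8-hour windows the same invoice would wait until 16:00 and could miss the SLA.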

Build requests with stable custom_ids. Each request uses the invoice id as its custom_id and defines a single extraction tool so the output is schema-validated (a tool_use output is supported in batch; a full tool-execution loop is not). One tool_use turn per invoice is exactly right here; no agentic loop is needed.

{ "requests": [
  { "custom_id": "INV-48213",
    "params": { "model": "claude-...", "max_tokens": 1024,
      "tools": [ { "name": "extract_invoice", "input_schema": { "...": "..." } } ],
      "tool_choice": { "type": "tool", "name": "extract_invoice" },
      "messages": [ { "role": "user", "content": "<invoice text>" } ] } }
] }

Refine on a sample first. Before the first full run you test the extraction prompt and schema on ~50 representative invoices via the synchronous API, fixing field-mapping and edge cases (missing totals, multi-page scans). This maximizes first-pass success so you are not paying for resubmission cycles that a tight SLA cannot absorb.

Handle partial failures. When a batch completes, you partition results by status using custom_id. Succeeded extractions go to finance. For the errored ones, you inspect the cause: several oversized multi-page invoices exceeded the context limit. You chunk those specific documents and resubmit only their custom_ids as a smaller batch, leaving the tens of thousands of successes untouched so you keep the discount you already earned.
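
A minimal sketch of the chunking fix follows. It splits by character count as a crude stand-in for a token limit and derives one id per chunk from the original invoice id so the pieces stay traceable; chunk_document, the -partN suffix scheme, and max_chars are all assumptions for illustration. A real pipeline would more likely split on page or section boundaries.

```python
def chunk_document(
    custom_id: str, text: str, max_chars: int
) -> list[tuple[str, str]]:
    """Split an oversized document into (derived_custom_id, chunk) pairs."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Documents that already fit keep their original custom_id.
    if len(text) <= max_chars:
        return [(custom_id, text)]
    chunks = [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
    return [
        (f"{custom_id}-part{n}", chunk) for n, chunk in enumerate(chunks, start=1)
    ]
```

Each pair becomes one request in the smaller resubmission batch, and the shared prefix of the derived ids lets you merge the partial extractions back under the original invoice.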

Contrast with a blocking case. If finance later added a real-time gate, say an invoice cannot be approved for payment until Claude flags fraud risk while a clerk waits, that gate must move to the synchronous API. You would run the nightly bulk extraction on batch and the interactive fraud check on sync, matching each to its latency requirement rather than forcing both onto one API.

Exam tips

✓ Memorize the batch profile: ~50% cost savings, up to a 24-hour processing window, and NO guaranteed latency SLA. These three facts drive almost every batch question.
✓ The decision rule: synchronous API for blocking/real-time work (pre-merge checks), Message Batches API for non-blocking, latency-tolerant work (overnight reports, weekly audits, nightly test generation). When both types exist, split them rather than forcing one API on both.
✓ custom_id correlates each result to its request. Results can return out of order and succeed/fail independently, so 'ordering is lost' is a false objection to batch, and matching by position is wrong.
✓ The batch API cannot do multi-turn tool calling inside a request (no execute-tool-then-continue). A single tool_use output for structured extraction is fine; a full agentic loop belongs in the sync API or Agent SDK.
✓ SLA math: worst-case latency = submission window + up to 24h processing. For a 30h SLA, window must be <= 6h; a 4h window (4 + 24 = 28h) leaves a 2h buffer.
✓ On partial failure, resubmit only the failed custom_ids with targeted fixes (e.g. chunk oversized docs); refine the prompt on a small sample first to maximize first-pass success and avoid costly resubmission rounds. Note: batch is an API feature; there is no '--batch' flag on the Claude Code CLI.

Official exam objectives for 4.5

Knowledge of

The Message Batches API: 50% cost savings, up to 24-hour processing window, no guaranteed latency SLA
Batch processing is appropriate for non-blocking, latency-tolerant workloads (overnight reports, weekly audits, nightly test generation) and inappropriate for blocking workflows (pre-merge checks)
The batch API does not support multi-turn tool calling within a single request (cannot execute tools mid-request and return results)
custom_id fields for correlating batch request/response pairs

Skills in

Matching API approach to workflow latency requirements: synchronous API for blocking pre-merge checks, batch API for overnight/weekly analysis
Calculating batch submission frequency based on SLA constraints (e.g., 4-hour windows to guarantee 30-hour SLA with 24-hour batch processing)
Handling batch failures: resubmitting only failed documents (identified by custom_id) with appropriate modifications (e.g., chunking documents that exceeded context limits)
Using prompt refinement on a sample set before batch-processing large volumes to maximize first-pass success rates and reduce iterative resubmission costs

Flashcards from this lesson

What are the three defining properties of the Message Batches API?

About 50% lower token cost, asynchronous processing with a window of up to 24 hours, and no guaranteed latency SLA.

A team wants to move both a blocking pre-merge check and an overnight tech-debt report to batch for the discount. What should they do?

Move only the overnight report to batch; keep the pre-merge check on the synchronous API because it is a blocking, latency-sensitive workload.

Why is 'batch loses result ordering' not a valid objection?

Every request has a custom_id, so you correlate each result to its input by custom_id regardless of return order.

Can a single batch request run an agentic tool loop (call tool, get result, continue)?

No. The batch API does not support multi-turn tool calling mid-request; each request is one independent completion. A single tool_use output for structured extraction is fine, but agentic loops belong on the sync API or Agent SDK.

With a 24-hour batch window, how frequently must you submit to guarantee a 30-hour SLA?

At least every 6 hours (window + 24 <= 30). A 4-hour window is a safe choice: 4 + 24 = 28h, a 2-hour buffer.

A batch returns with some errored requests due to context-length limits. What is the correct recovery?

Resubmit only the failed custom_ids as a new batch with fixes (e.g. chunk the oversized documents), leaving successful requests untouched.

How do you minimize expensive resubmission rounds before a large batch run?

Refine the prompt and schema on a small representative sample using the synchronous API until it passes reliably, then run the full batch.
