The complete curriculum
One lesson per exam task statement, grouped by domain and ordered for a first pass. Each lesson expands the official objectives with concepts, anti-patterns, a worked example, exam tips and flashcards.
Domain 1: Agentic Architecture & Orchestration
27%An agentic loop is the control flow that lets Claude use tools to complete a task without a human driving each step. In production you send a request, inspect the structured stop_reason, run any tools Claude asked for, feed the results back, and repeat until Claude signals it is done. Getting this loop right (driven by stop_reason, not by parsing text) is the difference between a reliable autonomous agent and one that stops early, loops forever, or misroutes work.
Multi-agent systems get their reliability from structure: one coordinator acts as the single hub that decomposes the request, delegates to specialized subagents, and combines their results. Getting the coordinator's decomposition, dynamic subagent selection, and context passing right is what separates a system that produces complete, cited reports from one where every subagent succeeds yet the final answer is silently incomplete. In production, most multi-agent failures trace back to the coordinator, not the subagents.
In multi-agent systems built on the Claude Agent SDK, a coordinator spawns subagents with the Task tool, and every subagent starts with a blank context window. Production reliability depends on three decisions: giving the coordinator the Task tool, passing all required findings explicitly into each subagent prompt with structured source metadata, and choosing correctly between parallel spawning, goal-oriented delegation, and session forking. Getting these wrong causes lost attribution, silent context gaps, and needless serial latency that a correct design avoids.
Production agents that touch money, identity, or compliance cannot depend on the model choosing to follow the right order. This task teaches when to move workflow ordering out of the prompt and into programmatic gates that deterministically block a tool until its prerequisites are met, how to break a multi-concern request into items you resolve together, and how to package a self-contained handoff when a human must take over.
Agent SDK hooks are deterministic callbacks that run inside the agentic loop, giving you code-level control that prompts cannot provide. A PreToolUse hook intercepts an outgoing tool call before it executes, so it can block policy-violating actions such as an over-limit refund and steer the model to escalate. A PostToolUse hook intercepts the result after a tool returns, so it can normalize heterogeneous formats (Unix timestamps, ISO 8601, numeric status codes) into one consistent shape before the model reasons over it. Reach for hooks whenever a business rule must be guaranteed rather than merely encouraged.
Complex work rarely fits reliably into one prompt. Choosing how to break a task into pieces, a fixed sequential chain versus a plan that adapts to discoveries, is what separates a system that produces consistent, complete output from one that gives contradictory, superficial results. In production this decision drives review quality, test coverage, and whether large investigations actually finish the job.
Long-running agentic work rarely fits in one sitting: engineers pause an investigation and pick it up the next day, or want to try two competing approaches from the same starting analysis. Claude Code and the Agent SDK persist conversations as resumable sessions and let you branch them with fork_session. Choosing correctly between resuming, forking, and starting fresh with a summary is what keeps an agent working from accurate state instead of stale tool results.
Domain 2: Tool Design & MCP Integration
18%In production, the tool description is the main thing Claude uses to decide which tool to call. When several tools look alike, thin or overlapping descriptions cause misrouting, wrong data, and failed tasks. This lesson covers how to write differentiated descriptions, when to rename or split tools, and how system prompt wording can quietly override even a well-built interface.
In production, MCP tools fail constantly: gateways time out, inputs are malformed, policies block actions, and permissions are missing. If every failure returns the same generic "Operation failed" string, the agent cannot tell a recoverable blip from a permanent block, so it wastes iterations retrying doomed calls or abandons cases it could have resolved. Structured error responses, built on the MCP isError flag plus categorized, retryability-tagged metadata, let the agent make correct recovery decisions and communicate appropriately with users.
How you spread tools across agents and how you constrain the model's tool decision are two of the highest-leverage reliability knobs in an agentic system. Give a single agent too many tools, or tools outside its job, and selection quality collapses. Scope each agent to a tight set that matches its role, add only the narrow cross-role tools it needs for high-frequency cases, and use tool_choice to guarantee the model calls a tool (or a specific tool) when the workflow demands it.
MCP servers are how you give Claude Code and Agent SDK workflows access to your real systems (GitHub, Jira, databases, internal APIs) as first-class tools. The architect's job is knowing where to configure a server so the right people get it, how to inject credentials without committing secrets, how tools and resources become available at connect time, and how to write descriptions good enough that the agent actually prefers your MCP tools over built-ins like Grep. Getting scope, secrets, and descriptions right is what makes the difference between a server that quietly works for the whole team and one that leaks tokens or never gets used.
The built-in file and shell tools (Read, Write, Edit, Bash, Grep, Glob) are available in every Claude Code and Claude Agent SDK session before any MCP server is added, and choosing the right one per step is a genuine tool-selection skill. Searching file contents is Grep's job, matching filenames is Glob's, surgical edits are Edit's, and whole-file rewrites are Write's, with Bash reserved for running commands. Picking the wrong tool wastes context and turns, and the exam probes the specific edge cases: Grep versus Glob, Edit's unique-anchor requirement, the Read+Write fallback, and incremental exploration instead of reading everything upfront.
Domain 3: Claude Code Configuration & Workflows
20%CLAUDE.md files are the always-loaded memory that shapes how Claude Code behaves in a repository. Getting the hierarchy and scope right determines whether your whole team and CI actually receive the same standards, and organizing memory modularly keeps context lean instead of drowning every request in irrelevant rules. Most production problems here are scoping bugs (personal config that never reached teammates) or bloat (one giant file that dilutes attention), both of which are diagnosable and fixable.
Slash commands and skills are how you package repeatable Claude Code workflows so a team runs them the same way every time. The two decisions that matter most in production are scope (project-level and committed so everyone gets it, versus user-level and personal) and isolation and safety (using context: fork to keep verbose skill output out of the main conversation, and allowed-tools to fence off destructive actions). Getting these wrong shows up as teammates missing a command, context windows bloated by exploratory output, or a skill with more power than it should have.
Large codebases carry many conventions, and loading all of them on every request wastes context and dilutes the model's attention. Path-specific rules in `.claude/rules/` attach conventions to glob patterns so a rule only enters the context window when Claude edits a matching file. This lesson covers the `paths` frontmatter, how conditional loading saves tokens, and why glob rules beat directory-level CLAUDE.md for conventions that span scattered files like tests.
Choosing plan mode versus direct execution is a routing decision that protects production codebases from costly rework. Plan mode gives Claude Code a read-only exploration-and-design phase with an approval gate before any change lands, while direct execution moves straight to editing for changes whose scope is already understood. Picking the wrong mode either wastes turns on trivial fixes or lets large architectural work make premature, hard-to-unwind edits.
Iterative refinement is how you get production-quality output from Claude Code across multiple turns instead of in one shot. The exam tests which feedback mechanism fits which situation: concrete input/output examples when prose is interpreted inconsistently, test-driven iteration that guides fixes with real failing tests, the interview pattern to surface hidden design decisions, and knowing when to batch interacting fixes versus iterate on independent ones. Picking the right technique converges faster and avoids regression whack-a-mole.
Running Claude Code inside a CI/CD pipeline turns it into an automated reviewer and test generator that runs on every pull request, with no human at the keyboard. The architect's job is to make it non-interactive so the job never hangs, force machine-parseable output so findings can post as inline PR comments, feed project context through CLAUDE.md, and design re-runs so they add only new information instead of spamming duplicate comments. Getting the flags and the context right is the difference between a review bot developers trust and one they mute.
Domain 4: Prompt Engineering & Structured Output
20%In production code review and extraction pipelines, the line between a tool people rely on and one they mute is precision. Vague instructions like "be accurate" or "be conservative" produce noisy false positives that erode trust, while explicit categorical criteria that name the exact condition for a finding produce consistent, actionable output. This lesson covers writing report/skip criteria, calibrating severity with concrete code examples, and temporarily disabling noisy categories to protect trust while you iterate on their prompts.
Few-shot prompting means showing Claude 2-4 worked examples of the task instead of only describing it, and it is the highest-leverage fix when detailed instructions still produce inconsistent format or judgment. In production it is what makes review output parse the same way every time, calibrates borderline decisions like escalation and severity, and stops extraction from hallucinating values for missing fields. The exam tests both when few-shot is the right lever and when a different fix (better tool descriptions, a JSON schema, explicit criteria) is more appropriate.
Downstream systems need machine-parseable output every single time, but asking Claude to "reply with only JSON" fails intermittently at scale. Defining a tool whose input_schema is a JSON Schema, then reading the tool_use block, gives you output whose shape is validated by the API. This lesson covers tool_choice modes, schema design that prevents fabrication, and the critical limit that schemas stop syntax errors but never semantic ones.
Structured output tooling guarantees valid JSON but not correct JSON. This task is about the layer that runs after extraction: validating values against business rules, retrying with specific error feedback when the model got a transformation wrong, and recognizing when retrying cannot help because the information was never in the source. Done well, it turns silent extraction errors into structured signals you can reconcile, escalate, or feed back into prompt improvement.
The Message Batches API trades latency for a 50% discount: you submit many independent requests at once and collect results later, with processing taking up to 24 hours and no guaranteed latency SLA. This lesson covers when batch is the right tool (non-blocking, latency-tolerant work) versus when it is disqualifying (blocking, real-time checks), how custom_id correlates results, why the batch API cannot run agentic tool loops mid-request, and how to size submission windows to an SLA and recover cleanly from partial failures.
When Claude reviews work it just produced, it carries the reasoning it used to generate that work and tends to rationalize its own choices instead of questioning them. In production review pipelines (automated code review, extraction QA), the reliable fixes are architectural: hand the artifact to a fresh independent instance with no generation context, and split large multi-file reviews into focused per-file passes plus a separate cross-file integration pass. These techniques catch subtle bugs and eliminate the inconsistent, contradictory findings that a single self-review pass produces.
Domain 5: Context Management & Reliability
15%Long sessions are where agents quietly go wrong: exact amounts, dates, and order numbers get flattened into vague summaries, verbose tool outputs crowd out the signal, and findings buried in the middle of a huge input simply get ignored. Because the Claude API is stateless, you own the transcript and its budget, so context management is the core engineering task of any long-running agent. The reliable pattern is to preserve full history for coherence while compressing the expensive parts (verbose tool results), protecting the fragile parts (exact facts) in a persistent case-facts layer, and laying out long inputs so the important material sits where the model actually attends to it.
A production support agent has to know two things it cannot fake: when to hand a case to a human, and when a request is too ambiguous to act on safely. Escalating easy cases wastes human capacity and tanks first-contact resolution, while grinding on cases that need a human erodes trust and risks wrong actions. This task is about encoding explicit, testable escalation triggers and ambiguity-resolution rules in the prompt, and rejecting the tempting but unreliable proxies (customer sentiment, the model's own confidence score) that the exam uses as distractors.
In coordinator-subagent systems, subagents fail routinely, so system reliability depends on how failure information travels back to the coordinator, the only component that can decide whether to retry, switch approaches, or degrade gracefully. Returning structured error context (failure type, attempted operation, partial results, alternatives) lets the coordinator recover intelligently, while generic statuses, silent suppression, and whole-run termination all break recovery. This lesson covers how to propagate errors so multi-agent workflows fail gracefully instead of silently or catastrophically.
Exploring a large or unfamiliar codebase floods the context window with verbose Read and Grep output, and over a long session the model starts giving inconsistent answers and reasoning from generic 'typical patterns' instead of the specific classes it actually discovered. The reliable fix is to stop treating the context window as memory: externalize findings to scratchpad files, delegate verbose investigation to subagents that return only distilled summaries, summarize between phases, and compact history when it fills. For long-running multi-agent explorations you also design crash recovery by exporting structured agent state to disk and reloading it from a manifest on resume.
Before an extraction pipeline is allowed to auto-accept results without a human in the loop, you have to prove it is safe, and a single aggregate accuracy number is not proof. A headline like 97% can hide a document type or a field that is failing badly, so you validate accuracy by segment (document type and field), have the model emit field-level confidence, and calibrate routing thresholds against a labeled validation set. Then you keep watching the automated lane with stratified random sampling to catch novel error patterns before they cause harm, routing low-confidence and ambiguous cases to your limited pool of human reviewers.
When a multi-agent system gathers facts from many sources and boils them down into a single report, the risk is not that the facts are wrong but that they arrive stripped of their origin, their date, and their disagreements. Provenance (which source, which excerpt, when) dies at every summarization boundary unless you carry it as structured data. This lesson covers how to make subagents emit claim-source mappings, how to annotate conflicts instead of silently picking a winner, why publication dates prevent fake contradictions, and how to structure a synthesis so a reader can tell settled facts from contested ones.