4.213 min

Apply few-shot prompting to improve output consistency and quality

Few-shot prompting means showing Claude 2-4 worked examples of the task instead of only describing it, and it is the highest-leverage fix when detailed instructions still produce inconsistent format or judgment. In production it is what makes review output parse the same way every time, calibrates borderline decisions like escalation and severity, and stops extraction from hallucinating values for missing fields. The exam tests both when few-shot is the right lever and when a different fix (better tool descriptions, a JSON schema, explicit criteria) is more appropriate.

A few-shot prompt (instructions plus 2-4 examples, including a contrasting acceptable-vs-issue pair with reasoning) produces consistently shaped output that generalizes to unseen code, versus zero-shot drift.

Why and when few-shot beats more instructions

Few-shot prompting means pairing your instructions with a small set of worked examples, each an input alongside the exact output you want. Zero-shot is instructions only; few-shot adds a handful of demonstrations. The blueprint's core claim is precise: few-shot examples are the most effective technique for achieving consistently formatted, actionable output when detailed instructions alone produce inconsistent results. Prose leaves too many degrees of freedom open, and a single example collapses them all at once.

It helps to place few-shot next to the other Domain 4 levers so you pick the right one on the exam. Explicit criteria (task 4.1) tell the model which categories to decide. A tool_use JSON schema (task 4.3) guarantees the syntactic shape. Few-shot is unique in that it demonstrates the decision and the shape together, and it is the only one of the three that reliably teaches judgment on ambiguous cases.

So the trigger to reach for few-shot is specific: your instructions are already reasonably detailed, yet the output still varies in format or in how borderline cases are judged. That is the signal that the model needs to be shown, not told. Moving from zero-shot to few-shot is usually the lowest-effort, highest-leverage improvement at that point.

Anatomy of good examples: 2-4, targeted, with reasoning

Keep the set small: 2 to 4 targeted examples. More is rarely better. Extra examples inflate token cost on every call, and worse, they can push the model to overfit to the specific cases you listed instead of generalizing. Pick examples that sit on the hard edges of the task, the ambiguous cases where a reasonable model could plausibly go either way.

For ambiguous scenarios the example must show the reasoning for why one action was chosen over plausible alternatives, not just input then output. If you show only "input X produced output Y," the model can copy the surface form and still misjudge the next borderline case, because it never saw the decision boundary. Spelling out the why (why escalate here but resolve there, why this line is a bug but that near-identical one is fine) is exactly what transfers the judgment.

Examples can live in the system prompt, usually wrapped in delimiters so they are visibly separate from the instructions, or be supplied as prior user and assistant message turns. Both are valid; what matters is that examples are clearly distinct from the live input.

<example>
input: <a borderline case>
reasoning: chose to FLAG because the claimed behavior contradicts the code,
           not merely because the comment is old
output: { location, issue, severity, suggested_fix }
</example>

Demonstrating output format for consistency

When downstream code parses every response, format drift is expensive. Few-shot examples that show the exact desired shape make every response come out the same way. For a code review, the canonical fields to demonstrate are location, issue, severity, and suggested fix. One or two fully worked examples do more for format consistency than several paragraphs describing the format, because the example is a template the model fills in rather than a spec it must interpret.

This is the fastest cure for the classic symptom of "detailed feedback for some items, one-liners for others, occasional freeform prose." Show the shape once or twice and the model conforms.

Mind the boundary with structured output (task 4.3). If you need machine-parseable JSON with zero syntax errors, tool_use with a strict JSON schema is the stronger guarantee, because it enforces the shape rather than suggesting it. Few-shot format examples are the right tool when you want a specific human-readable structure, or when you want to complement a schema by showing how to populate its fields well.

Generalization, not memorization: cutting false positives

A defining property of good few-shot examples is that they let the model generalize judgment to novel patterns rather than matching only the exact cases you listed. This is why you include contrasting examples: at least one genuine issue and at least one acceptable pattern that superficially resembles a problem but is actually fine. Showing both draws the decision boundary, so the model can classify code it has never seen.

This is the direct fix for a reviewer that keeps flagging a legitimate local pattern. Add an example showing that pattern explicitly accepted, with the reasoning, right next to an example of the genuinely problematic version. The model learns to separate the two and stops flagging the benign one, without you enumerating every acceptable variant by hand.

Contrast this with the failed approach from task 4.1: telling the model to "be conservative" or "only report high-confidence findings" does not move precision, because those are vague confidence knobs, not decision boundaries. Concrete contrasting examples move precision; general confidence instructions do not.

Few-shot for extraction: varied structures and null fields

Extraction from unstructured documents is where few-shot pays off against hallucination. Real documents vary in structure: inline citations versus a bibliography section, methodology in a dedicated section versus embedded in prose, informal measurements versus clean tables. Examples that show correct extraction from each structural variant teach the model to find the same information regardless of layout, so it generalizes across the document shapes you actually receive.

The second and bigger win is the empty or null problem. When a required field is simply absent from a source document, a model pressured to produce a value will often fabricate one. Include an example where the field genuinely is missing and the correct output is null (or an explicit "not present"). Demonstrating that returning null is the desired behavior suppresses fabricated values far more reliably than an instruction alone.

A compact, high-value few-shot set for extraction is therefore one example per structurally distinct document type you expect, plus one "field absent, return null" example. Pair these examples with nullable schema fields from task 4.3 so the schema permits the null the examples demonstrate.

Choosing the right lever: when few-shot is NOT the answer

Few-shot is powerful but not universal, and the exam checks whether you pick the proportionate fix rather than reflexively adding examples.

First trap: tool misrouting caused by minimal tool descriptions. If two similar tools have thin descriptions and the model picks the wrong one, the root-cause fix is to expand the tool descriptions (task 2.1), because descriptions are the primary mechanism the model uses to select tools. Adding few-shot routing examples piles token overhead onto every call without fixing the underlying ambiguity; it is a weaker, more expensive patch. Few-shot for tool selection is appropriate for genuinely ambiguous requests, not as a substitute for a clear description.

Second trap: guaranteed structure. If you must have syntactically valid JSON every time, use tool_use with a JSON schema (task 4.3), not few-shot format examples, which reduce but do not eliminate syntax drift.

Where few-shot IS the right first move: calibrating judgment on ambiguous or borderline cases (escalation decisions, severity classification, branch-level coverage-gap detection), locking an output format, and reducing false positives through contrasting examples. A recurring correct-answer pattern on the exam is "add explicit escalation criteria plus 2-4 few-shot examples" as the proportionate first response, ahead of building a separate classifier, a self-reported confidence threshold, or a sentiment model.

Anti-patterns to avoid

avoid

Adding few-shot examples of correct tool routing to fix misrouting between two tools that have minimal descriptions.

Why it fails: Descriptions are the primary mechanism the model uses to select tools, so thin descriptions are the root cause; example turns add token overhead on every call without resolving the ambiguity.

instead Expand each tool's description first (task 2.1) with input formats, example queries, and when-to-use-versus boundaries. Reserve few-shot for genuinely ambiguous requests, not as a description substitute.

avoid

Piling in 10-20 examples, or a set of near-duplicate examples, to be thorough.

Why it fails: Large or redundant example sets inflate token cost and cause overfitting, so the model matches the specific cases shown instead of generalizing to novel inputs.

instead Use 2-4 targeted examples chosen for the hard, ambiguous edges of the task, including at least one contrasting acceptable-versus-genuine-issue pair.

avoid

Writing examples as bare input to output with no reasoning, for ambiguous decisions.

Why it fails: Without the reasoning, the model can only copy surface form; it never learns the decision boundary and still misjudges the next borderline case.

instead For ambiguous cases, include the reasoning for why one action was chosen over plausible alternatives, so the judgment transfers, not just the format.

avoid

Relying on few-shot format examples alone to guarantee valid, machine-parseable JSON output.

Why it fails: Examples reduce syntax drift but do not enforce it, so occasional malformed JSON still breaks downstream parsing.

instead Use tool_use with a strict JSON schema (task 4.3) to guarantee syntax, and use few-shot to govern the judgment and field quality that fills the schema.

Worked example: Scenario 5: making CI code-review output consistent and trustworthy with few-shot

Your Claude Code review job in CI draws two complaints from developers. First, findings are formatted inconsistently: some detailed, some one-liners, occasionally a prose paragraph the bot cannot post cleanly as an inline comment. Second, it raises false positives on a legitimate in-house logging pattern, which is eroding trust in the whole bot. Your review instructions are already detailed. That combination, detailed instructions but inconsistent format and judgment, is the textbook signal to add few-shot examples.

Step 1, lock the format. Add an example that shows the canonical finding shape so every result parses the same way:

<example>
diff:  + const total = items.reduce((a, i) => a + i.price)
finding:
  location: cart.ts:42
  issue: reduce() without an initial value throws on an empty array
  severity: high
  suggested_fix: pass 0 as the second arg -> reduce((a,i)=>a+i.price, 0)
</example>

Step 2, kill the false positive with a contrasting pair. Developers keep getting flagged for approved logger.debug(...) calls. Show one accepted-pattern example and one genuine-issue example, each with reasoning, so the model learns the boundary rather than a blanket rule:

<example>   // acceptable: do NOT flag
  + logger.debug(`request id ${reqId}`)
  reasoning: debug logging of non-sensitive ids is an approved local pattern
  finding: none
</example>
<example>   // genuine issue: DO flag
  + logger.info(`user token ${token}`)
  reasoning: writing a secret to logs is a real security bug at any log level
  finding: { location, issue: secret logged, severity: high, suggested_fix: redact }
</example>

Because the examples carry reasoning, the model generalizes: it stops flagging benign debug logs but still catches a secret written to logs, even in files and code shapes it never saw. That is generalization, not memorization.

Step 3, pair with the right companion levers. For a comment the bot posts automatically, harden the format further by combining these examples with --output-format json and a JSON schema (tasks 3.6 and 4.3), so the payload is guaranteed parseable; the few-shot examples then govern the quality and judgment of what fills the schema. If one category (say style nits) is still noisy, temporarily disable that category (task 4.1) while you refine its examples.

What not to do. Do not respond to the false positives by appending "be conservative, only report high-confidence issues"; vague confidence instructions do not improve precision, the contrasting examples do. And do not reach for a larger-context model hoping consistency emerges; the fix is demonstrating the boundary, not adding capacity.

Exam tips

✓Few-shot is the most effective fix when detailed instructions ALONE still produce inconsistent format or judgment; use 2-4 targeted examples, not dozens.
✓For ambiguous cases, examples must show the reasoning for why one action was chosen over plausible alternatives, not just input to output, or the judgment will not transfer.
✓Demonstrate the exact output shape (location, issue, severity, suggested fix) so every response comes out identically formatted.
✓Include a contrasting pair (an acceptable pattern AND a genuine issue) so the model generalizes the decision boundary and cuts false positives, instead of memorizing listed cases.
✓For extraction, show varied document structures (inline citations vs bibliography) and one field-absent example that correctly returns null, to reduce hallucination and empty-field fabrication.
✓Exam trap: tool misrouting from MINIMAL DESCRIPTIONS is fixed by expanding descriptions (2.1), not few-shot; and guaranteed JSON syntax comes from a schema (4.3), not format examples. Few-shot's home turf is judgment, format, and false-positive reduction.

Official exam objectives for 4.2

Knowledge of

Few-shot examples as the most effective technique for achieving consistently formatted, actionable output when detailed instructions alone produce inconsistent results
The role of few-shot examples in demonstrating ambiguous-case handling (e.g., tool selection for ambiguous requests, branch-level test coverage gaps)
How few-shot examples enable the model to generalize judgment to novel patterns rather than matching only pre-specified cases
The effectiveness of few-shot examples for reducing hallucination in extraction tasks (e.g., handling informal measurements, varied document structures)

Skills in

Creating 2-4 targeted few-shot examples for ambiguous scenarios that show reasoning for why one action was chosen over plausible alternatives
Including few-shot examples that demonstrate specific desired output format (location, issue, severity, suggested fix) to achieve consistency
Providing few-shot examples distinguishing acceptable code patterns from genuine issues to reduce false positives while enabling generalization
Using few-shot examples to demonstrate correct handling of varied document structures (inline citations vs bibliographies, methodology sections vs embedded details)
Adding few-shot examples showing correct extraction from documents with varied formats to address empty/null extraction of required fields

Flashcards from this lesson

When is few-shot prompting the most effective technique to reach for?

When you have already written detailed instructions but the output is still inconsistent in format or judgment. Examples pin down what prose cannot, especially for ambiguous cases and consistent formatting.

How many examples should a few-shot set have, and what must ambiguous-case examples include?

2 to 4 targeted examples. For ambiguous cases each must show the reasoning for why one action was chosen over plausible alternatives, not just input to output, so the model learns the decision boundary.

Why include an acceptable-pattern example alongside a genuine-issue example in a review prompt?

The contrasting pair draws the decision boundary, so the model generalizes to unseen code and reduces false positives, rather than memorizing only the specific listed cases.

How does few-shot reduce hallucination and empty-field errors in extraction?

Examples of varied document structures teach the model to find the same information regardless of layout, and a 'field absent, return null' example demonstrates not fabricating a value to satisfy a required field.

Escalation calibration is off. What is the proportionate first fix, and what should you avoid?

Add explicit escalation criteria plus 2-4 few-shot examples showing when to escalate versus resolve. Avoid self-reported confidence thresholds, sentiment analysis, or a separately trained classifier as the first move.

Few-shot format examples versus tool_use with a JSON schema for structured output?

A schema guarantees syntactically valid JSON (task 4.3). Few-shot governs judgment, value quality, and human-readable structure. They are complementary; use the schema for guaranteed syntax and few-shot for how fields are populated.

Two similar tools are misrouted because their descriptions are minimal. Is few-shot the best first fix?

No. Expand the tool descriptions first (task 2.1), since descriptions are the primary tool-selection mechanism. Few-shot routing examples add token overhead every call without fixing the root ambiguity.

Study all flashcards with spaced repetition

Mark this lesson complete when you are confident.

← Previous

4.1 Design prompts with explicit criteria to improve precision and reduce false positives

4.3 Enforce structured output using tool use and JSON schemas