4.1 Design prompts with explicit criteria to improve precision and reduce false positives

In production code review and extraction pipelines, the line between a tool people rely on and one they mute is precision. Vague instructions like "be accurate" or "be conservative" produce noisy false positives that erode trust, while explicit categorical criteria that name the exact condition for a finding produce consistent, actionable output. This lesson covers writing report/skip criteria, calibrating severity with concrete code examples, and temporarily disabling noisy categories to protect trust while you iterate on their prompts.

Figure: a review prompt rewritten from vague instructions (low precision, developers mute the bot) to explicit report/skip criteria with example-anchored severity (high precision, trust restored). A banner in the figure shows the trust-recovery move: temporarily disable high false-positive categories while iterating.

Explicit criteria beat vague instructions

The single highest-leverage move for precision is replacing vague quality language with explicit, categorical criteria. A prompt that says "check that comments are accurate" leaves the model to invent its own definition of accuracy, so it flags stylistic quibbles, stale TODOs, and harmless imprecision. A prompt that says "flag a comment ONLY when the behavior it claims contradicts what the code actually does" names the exact condition, so the model applies a testable rule instead of a mood.

Explicit criteria work because they convert a judgment call into a decision procedure. Instead of asking "is this good enough," the model asks "does this input meet the stated condition," which is far more reproducible across files and across runs.

Concrete, testable predicates such as "contradicts actual code behavior," "writes past the end of the buffer," or "logs a secret" reliably outperform abstract adjectives such as "accurate," "clean," or "safe." The exam's canonical contrast is exactly this pair: "flag comments only when claimed behavior contradicts actual code behavior" versus "check that comments are accurate."
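One way to make this contrast operational is a crude check that scans a review prompt for the abstract adjectives named above. The following is a minimal sketch, not an established tool: the function name `find_vague_terms` and the term list are assumptions for illustration, and a word match is only a heuristic for spotting instructions that may need a testable predicate.

```python
import re

# Abstract adjectives called out in this lesson. The list is illustrative,
# not exhaustive, and is an assumption of this sketch.
VAGUE_TERMS = ("accurate", "clean", "safe")


def find_vague_terms(prompt: str) -> list[str]:
    """Return the vague adjectives that appear as whole words in the prompt."""
    found = []
    for term in VAGUE_TERMS:
        # \b keeps "safe" from matching inside a word such as "unsafe".
        if re.search(rf"\b{re.escape(term)}\b", prompt, flags=re.IGNORECASE):
            found.append(term)
    return found


vague_prompt = "Check that comments are accurate."
explicit_prompt = (
    "Flag a comment ONLY when the behavior it claims "
    "contradicts what the code actually does."
)

print(find_vague_terms(vague_prompt))
print(find_vague_terms(explicit_prompt))
```

A hit does not prove the prompt is bad; it only marks a spot where an adjective could be replaced by a condition the model can test.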

Why 'be conservative' and confidence filters fail

General meta-instructions like "be conservative" or "only report high-confidence findings" do not improve precision. They feel like a knob, but they do not move the decision boundary. The model's self-assessed confidence is poorly calibrated, so the same false positives it produced before are still ones it feels confident about. In practice the phrase either suppresses true and false positives together, or just adds hedging language to the same wrong output.

This is the same calibration weakness that makes self-reported confidence scores an unreliable escalation trigger elsewhere in the exam. The fix is not a better adjective; it is a specific categorical rule. "Report an issue only if it falls into one of these named categories" gives the model something to check; "be careful" does not.
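The difference between the two kinds of rule can be sketched with toy data. Everything below is invented purely to illustrate the mechanics and is not a measurement: the `Finding` fields, the confidence values, and the 0.9 threshold are all assumptions. The lesson's fix lives in the prompt itself; the filter functions here only make the two decision boundaries visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    category: str        # category tag attached to the finding
    confidence: float    # model's self-reported confidence (hypothetical field)
    is_real_issue: bool  # human judgment of the finding (toy label)


# Toy findings invented for illustration only.
FINDINGS = [
    Finding("naming", 0.95, False),
    Finding("comment_wording", 0.90, False),
    Finding("security", 0.70, True),
    Finding("bugs", 0.92, True),
]

REPORT_CATEGORIES = {"bugs", "security", "data_loss"}


def precision(findings: list[Finding]) -> float:
    """Fraction of kept findings that are real issues (0.0 if none kept)."""
    if not findings:
        return 0.0
    return sum(f.is_real_issue for f in findings) / len(findings)


def filter_by_confidence(findings: list[Finding], threshold: float) -> list[Finding]:
    """Keep findings the model claims to be confident about."""
    return [f for f in findings if f.confidence >= threshold]


def filter_by_category(findings: list[Finding], allowed: set[str]) -> list[Finding]:
    """Keep findings that fall into a named REPORT category."""
    return [f for f in findings if f.category in allowed]


by_confidence = filter_by_confidence(FINDINGS, threshold=0.9)
by_category = filter_by_category(FINDINGS, REPORT_CATEGORIES)

print(f"confidence filter: kept {len(by_confidence)}, precision {precision(by_confidence):.2f}")
print(f"category rule:     kept {len(by_category)}, precision {precision(by_category):.2f}")
```

In this toy set the confidence threshold keeps two confident style nits and drops the real security finding, while the category rule keeps exactly the two real issues.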

The false-positive tax on developer trust

In the real world, false positive rates are judged at the product level, not category by category. If one category (say, comment accuracy or naming) is noisy, developers stop reading the tool's output entirely, and they dismiss even the accurate categories like security and data-loss findings along with the noise. A high false positive category undermines confidence in the categories that are actually working.

That means precision is partly a trust problem, not only a prompt problem. The cost of a bad category is not limited to its own comments; it spills over and poisons the well. This is why the correct operational response to a noisy category is often to remove it from production first, then improve it, rather than leaving it live while you iterate.

Define what to report and what to skip

Rather than relying on the model to self-filter by confidence, write explicit report and skip lists that scope the surface area. Name the classes of issue worth a comment (bugs, security, data loss) and the classes to ignore (minor style, formatting, naming, patterns that are consistent with the surrounding file).

REPORT (post a finding):
- Bugs: logic errors, off-by-one, null/undefined dereference
- Security: injection, missing authorization, secrets in code
- Data loss: destructive migrations, unbounded deletes

SKIP (never post):
- Naming, formatting, import order, style preferences
- Patterns consistent with the surrounding file

This is more reliable than confidence filtering because it targets the actual source of noise: the model reporting subjective, low-value issues. A skip list explicitly tells the model that a whole class of observations is out of scope, which removes the ambiguity that produced false positives in the first place.
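Keeping the report and skip lists as data and rendering them into the prompt makes it easier to change one category without retyping the rest. The sketch below is one possible arrangement, not a required one: `REPORT`, `SKIP`, and `render_criteria` are hypothetical names, and the category contents are copied from the lists above.

```python
# Report categories mapped to the concrete issue types named in this lesson.
REPORT = {
    "Bugs": ["logic errors", "off-by-one", "null/undefined dereference"],
    "Security": ["injection", "missing authorization", "secrets in code"],
    "Data loss": ["destructive migrations", "unbounded deletes"],
}

# Classes of observation that are explicitly out of scope.
SKIP = [
    "Naming, formatting, import order, style preferences",
    "Patterns consistent with the surrounding file",
]


def render_criteria(report: dict[str, list[str]], skip: list[str]) -> str:
    """Render the report/skip lists as a plain-text prompt section."""
    lines = ["REPORT (post a finding):"]
    for category, issue_types in report.items():
        lines.append(f"- {category}: {', '.join(issue_types)}")
    lines.append("")
    lines.append("SKIP (never post):")
    lines.extend(f"- {item}" for item in skip)
    return "\n".join(lines)


CRITERIA_TEXT = render_criteria(REPORT, SKIP)
print(CRITERIA_TEXT)
```

Removing a key from `REPORT` drops that category from the rendered prompt, which is a convenient place to hook in a temporary disable.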

Calibrate severity with concrete code examples

Consistent severity classification is a precision problem too: a review that labels the same pattern CRITICAL in one file and MEDIUM in another is noise. Adjectives alone ("critical," "major," "minor") do not anchor the model. Define explicit severity levels and attach a concrete code example to each one so the model has a reference point.

CRITICAL: e.g. db.exec("DELETE FROM orders") with no WHERE clause
HIGH:     e.g. assignment (=) used where equality (==) was intended in an auth check
MEDIUM:   e.g. missing null check on an optional field used later
LOW:      (do not report unless it also matches a REPORT category)

Anchoring each level to a worked example turns severity from a vibe into a comparison. The model asks "is this input more like the CRITICAL example or the MEDIUM example," which produces reproducible classification across the whole review.
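The anchors can also be kept as data so the prompt text and any downstream validation share one source. This is a minimal sketch under assumptions: `SEVERITY_ANCHORS`, `render_severity`, and `validate_severity` are hypothetical names, and rejecting unknown labels is one possible policy rather than a prescribed one. LOW is left out because the lesson does not report it by default.

```python
# Severity levels paired with the worked examples from this lesson.
SEVERITY_ANCHORS = [
    ("CRITICAL", 'db.exec("DELETE FROM orders") with no WHERE clause'),
    ("HIGH", "assignment (=) used where equality (==) was intended in an auth check"),
    ("MEDIUM", "missing null check on an optional field used later"),
]

ALLOWED_LEVELS = {level for level, _ in SEVERITY_ANCHORS}


def render_severity(anchors: list[tuple[str, str]]) -> str:
    """Render one aligned 'LEVEL: e.g. example' line per severity level."""
    width = max(len(level) for level, _ in anchors) + 1  # +1 for the colon
    return "\n".join(
        f"{(level + ':').ljust(width)} e.g. {example}" for level, example in anchors
    )


def validate_severity(label: str) -> str:
    """Normalize a severity label and reject anything outside the anchored levels."""
    normalized = label.strip().upper()
    if normalized not in ALLOWED_LEVELS:
        raise ValueError(f"unknown severity level: {label!r}")
    return normalized


SEVERITY_TEXT = render_severity(SEVERITY_ANCHORS)
print(SEVERITY_TEXT)
```

Validating the label on the way out catches a model response that invents a level such as "MAJOR" instead of choosing among the anchored ones.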

Temporarily disable noisy categories while you iterate

When a category is generating so many false positives that developers are ignoring the tool, the fastest way to restore trust is to temporarily disable that category entirely, keep the reliable categories running, and improve the disabled category's criteria offline. Re-enable it only once its precision is acceptable on a validation sample.

This is a deliberate trust-recovery play, not an admission of defeat. Leaving a broken category live while you "fix it later" keeps eroding confidence in everything else the tool says. Pulling it out stops the bleeding immediately, and because trust is a product-level property, the accurate categories become useful again the moment the noise stops.

A useful supporting practice is to tag each posted finding with its category (and, for analysis, a detected_pattern field). Tracking which categories developers dismiss most tells you exactly which one to disable and iterate on, rather than guessing.
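The dismissal tracking described above can be sketched as a small aggregation over category-tagged events. The event shape, the function names, and the 0.5 threshold are all assumptions for illustration, and the events are toy data; a real pipeline would read them from wherever PR comment outcomes are stored.

```python
from collections import Counter


def dismissal_rates(events: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the dismissal rate per category from (category, was_dismissed) pairs."""
    posted: Counter[str] = Counter()
    dismissed: Counter[str] = Counter()
    for category, was_dismissed in events:
        posted[category] += 1
        if was_dismissed:
            dismissed[category] += 1
    return {category: dismissed[category] / posted[category] for category in posted}


def categories_to_disable(rates: dict[str, float], max_dismissal_rate: float) -> list[str]:
    """Return the categories whose dismissal rate exceeds the chosen threshold."""
    return sorted(c for c, rate in rates.items() if rate > max_dismissal_rate)


def drop_disabled(findings: list[dict], disabled: set[str]) -> list[dict]:
    """Keep only findings whose category is still enabled."""
    return [f for f in findings if f["category"] not in disabled]


# Toy events invented for illustration only.
EVENTS = [
    ("naming", True), ("naming", True), ("naming", False),
    ("security", False), ("security", False),
    ("comment_accuracy", True),
]

rates = dismissal_rates(EVENTS)
disabled = categories_to_disable(rates, max_dismissal_rate=0.5)  # hypothetical threshold
print(disabled)
```

The threshold is a judgment call; the value of the telemetry is that the choice of which category to pull is based on observed dismissals rather than a guess.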

Anti-patterns to avoid

Avoid: Tuning a 'confidence' knob by adding 'be conservative' or 'only report high-confidence findings'.

Why it fails: LLM self-confidence is poorly calibrated and the phrase does not change the decision boundary; it suppresses true and false positives together or just adds hedging to the same wrong output.

Instead: Replace it with specific categorical criteria that name the exact condition that qualifies as a finding.

Avoid: Vague quality verbs like 'check that comments are accurate' or 'ensure the code is clean'.

Why it fails: The model invents its own definition of the adjective and flags subjective nits, producing false positives.

Instead: State a testable predicate, for example 'flag a comment only when its claimed behavior contradicts what the code actually does'.

Avoid: Leaving a high false-positive category live in production while planning to fix it later.

Why it fails: Its noise erodes trust in every other category, so developers start ignoring even correct security and data-loss findings.

Instead: Temporarily disable the noisy category, keep the reliable ones running, improve its criteria offline, then re-enable once precision is acceptable.

Avoid: Defining severity levels with adjectives only (critical / major / minor) and no examples.

Why it fails: Without anchors the model classifies the same pattern inconsistently across files and runs, which is itself a form of noise.

Instead: Anchor each severity level to a concrete code example so classification becomes a reproducible comparison.

Worked example: Rescuing a noisy CI code-review bot (Scenario 5)

Your team runs Claude Code in CI on every pull request with claude -p and posts its findings as inline PR comments. After two weeks, developers complain that the bot is noisy and start ignoring it. Category telemetry shows that the comment-accuracy and naming/style categories account for roughly 70% of all comments and about 90% of dismissals, while the security and null-dereference findings are accurate but now ignored along with the rest.

Diagnosis. This is a criteria problem, not a model problem. The CI prompt reads:

Review this pull request and report any issues you find.
Be conservative and only flag things you are confident about.

This is exactly the failure mode the exam targets: vague quality language plus a confidence knob. "Be conservative" does nothing to move the decision boundary, and "report any issues" invites subjective style nits.

Fix. Rewrite the prompt around explicit report/skip lists, a testable comment predicate, and severity anchored to examples:

Review this pull request. Post a finding ONLY if it matches a REPORT category.

REPORT (post as an inline comment):
- Bugs: logic errors, off-by-one, null/undefined dereference, wrong error handling
- Security: injection, missing authorization, secrets in code, unsafe deserialization
- Data loss: destructive migrations, unbounded deletes, dropped writes

SKIP (never post):
- Naming, formatting, import order, and other style preferences
- Patterns consistent with the surrounding file
- Comment wording, UNLESS the behavior a comment claims contradicts the code

SEVERITY (classify every finding, anchored to these examples):
- CRITICAL: db.exec("DELETE FROM orders") with no WHERE clause
- HIGH:     assignment (=) used where equality (==) was intended in an auth check
- MEDIUM:   missing null check on an optional field used later

Tag each finding with its category so dismissals can be analyzed.

Protect trust while iterating. Do not ship the rewrite and simply hope it works. Because the two noisy categories were destroying confidence in the accurate ones, take naming/style out of the output entirely (it is now on the SKIP list) and keep comment-accuracy disabled until the new contradiction-based predicate is validated on a sample of recent PRs. This mirrors refining a prompt on a sample set before running it at scale. Keep security and null-dereference live throughout, since they were already trustworthy.

Outcome. With style noise gone and the comment rule now testable, precision on the live categories climbs and developers start reading comments again. Once the rewritten comment-accuracy criterion shows acceptable precision on the sample, re-enable it. The key lesson: the fix was specific categorical criteria plus a trust-recovery move, not a better adjective or a confidence threshold.
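The re-enable decision can be written down as an explicit gate over a human-labeled validation sample. This is a sketch under stated assumptions: the lesson only says precision must be "acceptable," so the 0.9 precision floor and the minimum sample size of 10 below are hypothetical placeholders, as are the function names.

```python
def precision(labels: list[bool]) -> float:
    """Fraction of sampled findings a human judged correct."""
    if not labels:
        raise ValueError("validation sample is empty")
    return sum(labels) / len(labels)


def should_reenable(labels: list[bool], min_precision: float, min_sample_size: int) -> bool:
    """Re-enable a category only with enough samples and high enough precision."""
    if len(labels) < min_sample_size:
        return False  # too little evidence to decide either way
    return precision(labels) >= min_precision


# Toy labels for the rewritten comment-accuracy criterion: True means the
# finding was a real contradiction between a comment and the code.
sample = [True] * 9 + [False]

print(should_reenable(sample, min_precision=0.9, min_sample_size=10))
```

Requiring a minimum sample size keeps a handful of lucky results from switching a noisy category back on prematurely.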

Exam tips

✓ Explicit categorical criteria, not confidence thresholds, are the correct fix for a noisy review. 'Be conservative' and 'only high-confidence findings' are distractor answers because self-reported confidence is poorly calibrated.
✓ The canonical exemplar to memorize: 'flag a comment only when its claimed behavior contradicts the actual code' beats 'check that comments are accurate'.
✓ High false positives in one category erode trust in ALL categories, including the accurate ones. Precision is judged at the product level, not per category.
✓ To restore trust fast, temporarily disable the high-false-positive category, then iterate on its criteria offline before re-enabling.
✓ Consistent severity classification requires concrete code examples anchored to each severity level, not just adjectives.
✓ Prefer explicit report lists (bugs, security, data loss) and skip lists (minor style, local patterns) over asking the model to self-filter by confidence.

Official exam objectives for 4.1

Knowledge of

The importance of explicit criteria over vague instructions (e.g., "flag comments only when claimed behavior contradicts actual code behavior" vs "check that comments are accurate")
How general instructions like "be conservative" or "only report high-confidence findings" fail to improve precision compared to specific categorical criteria
The impact of false positive rates on developer trust: high false positive categories undermine confidence in accurate categories

Skills in

Writing specific review criteria that define which issues to report (bugs, security) versus skip (minor style, local patterns) rather than relying on confidence-based filtering
Temporarily disabling high false-positive categories to restore developer trust while improving prompts for those categories
Defining explicit severity criteria with concrete code examples for each severity level to achieve consistent classification

Flashcards from this lesson

Why does 'only report high-confidence findings' fail to improve precision?

LLM self-confidence is poorly calibrated and the phrase does not move the decision boundary. It suppresses true and false positives together instead of applying a specific rule.

Rewrite 'check that comments are accurate' as an explicit criterion.

'Flag a comment only when the behavior it claims contradicts what the code actually does.'

What is the danger of one high-false-positive review category?

It erodes developer trust in the whole tool, so they start dismissing even the accurate categories like security and data loss.

Fastest way to restore trust in a noisy review bot while you improve it?

Temporarily disable the high-false-positive category, keep the reliable ones running, iterate on its criteria offline, then re-enable once precision is acceptable.

How do you get consistent severity classification across files?

Define explicit severity levels each anchored to a concrete code example, not just adjectives, so classification becomes a reproducible comparison.

Report/skip lists vs confidence filtering: which improves precision?

Explicit report (bugs, security) and skip (minor style, local patterns) lists. Confidence filtering does not, because it targets the wrong thing.
