Prompt Evaluation Guide
AI Prompt Evaluation Scorecard
A scoring framework for testing prompt outputs, comparing prompt versions, and deciding whether an AI result is ready to use.
The problem
It is easy to tell whether an AI answer sounds good. It is harder to tell whether it is actually usable.
For serious work, you need a repeatable way to score outputs. A scorecard helps you compare prompt versions, catch weak answers, and avoid shipping content that only looks polished.
The scorecard
Score each dimension from 1 to 5.
| Dimension | What to check | 1 means | 5 means |
|---|---|---|---|
| Task fit | Does it answer the actual request? | Misses the task | Solves the task directly |
| Specificity | Does it use the provided context? | Generic advice | Clearly tied to context |
| Accuracy | Are facts and claims reliable? | Unsupported or wrong | Grounded and checkable |
| Completeness | Does it cover the necessary parts? | Important gaps | Covers all key parts |
| Usability | Can the user act on it? | Needs major rewrite | Ready for review or use |
| Risk control | Does it flag assumptions and limits? | Hides uncertainty | Calls out assumptions clearly |
| Format compliance | Does it follow the requested structure? | Ignores format | Matches format cleanly |
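If you record scores in a script or spreadsheet export, the table above maps onto a small data structure. The sketch below is one possible way to do it, not a fixed implementation. The identifiers (`DIMENSIONS`, `validate_scores`, `average_score`) are illustrative, and the unweighted average is an assumption; the scorecard itself does not say how to combine dimensions.

```python
# Dimension names mirror the scorecard table; the identifiers are illustrative.
DIMENSIONS = (
    "task_fit",
    "specificity",
    "accuracy",
    "completeness",
    "usability",
    "risk_control",
    "format_compliance",
)


def validate_scores(scores: dict) -> None:
    """Raise ValueError if a dimension is missing or a score is outside 1-5."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {scores[name]}")


def average_score(scores: dict) -> float:
    """Unweighted mean across all seven dimensions (an assumed combining rule)."""
    validate_scores(scores)
    return sum(scores[name] for name in DIMENSIONS) / len(DIMENSIONS)


# Hypothetical scores for a single output.
example = {
    "task_fit": 5,
    "specificity": 4,
    "accuracy": 3,
    "completeness": 4,
    "usability": 4,
    "risk_control": 3,
    "format_compliance": 5,
}
print(average_score(example))  # 4.0
```

If some dimensions matter more for your work (accuracy, for example), a weighted average is a reasonable variation.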
How to use the scorecard
Step 1: Pick a target score
Not every task needs a perfect output.
Use these minimum average scores as a guide:
- 3.5+ for brainstorming
- 4.0+ for internal drafts
- 4.5+ for customer-facing or decision-support work
The higher the stakes, the higher the review bar.
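The target scores above can be expressed as a simple lookup. This is a minimal sketch that assumes the thresholds apply to the average across all dimensions; the names `TARGETS` and `meets_target` and the use-case keys are made up for illustration.

```python
# Minimum scores taken from the list above; the keys are illustrative labels.
TARGETS = {
    "brainstorming": 3.5,
    "internal_draft": 4.0,
    "customer_facing": 4.5,  # also covers decision-support work
}


def meets_target(average: float, use_case: str) -> bool:
    """Return True if the average score reaches the bar for this use case."""
    if use_case not in TARGETS:
        raise ValueError(f"unknown use case: {use_case}")
    return average >= TARGETS[use_case]


print(meets_target(4.2, "internal_draft"))   # True
print(meets_target(4.2, "customer_facing"))  # False
```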
Step 2: Test with realistic inputs
Do not test only with easy examples.
Use:
- a typical case
- a messy case
- an edge case
- a short input
- an input with missing information
Good prompts should degrade gracefully when the input is incomplete.
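One way to keep the test set honest is to store one input per case type and check that none is missing before you start scoring. The sketch below assumes a plain dictionary of inputs; the case labels, the helper `missing_cases`, and the sample inputs are all hypothetical.

```python
# One label per case type from the list above; the labels are illustrative.
REQUIRED_CASES = ("typical", "messy", "edge", "short", "missing_info")


def missing_cases(test_inputs: dict) -> list:
    """Return the case types that have no input (or an empty one)."""
    return [case for case in REQUIRED_CASES if not test_inputs.get(case)]


# Hypothetical inputs for a prompt that summarizes customer feedback.
test_inputs = {
    "typical": "The app is fine, but export to CSV fails on large files.",
    "messy": "export?? broken again!!! also login... nvm that works",
    "short": "Bad.",
}
print(missing_cases(test_inputs))  # ['edge', 'missing_info']
```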
Step 3: Compare versions
When improving a prompt, test the old and new versions on the same inputs.
Track:
- average score
- lowest score
- most common failure
- editing time after output
The best prompt is not always the longest prompt. It is the one that produces usable output more consistently.
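The four tracked metrics can be computed per prompt version from a list of scored runs. This is a rough sketch under stated assumptions: each run carries an average score, an optional failure label, and editing time in minutes. The `Run` class, its field names, and the sample data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """One scored output for one test input (field names are illustrative)."""
    average: float            # average score across dimensions
    failure: Optional[str]    # short failure label, or None if nothing failed
    editing_minutes: float    # time spent editing the output afterwards


def summarize(runs: list) -> dict:
    """Compute the four metrics from the 'Track' list for one prompt version."""
    if not runs:
        raise ValueError("need at least one run")
    failures = Counter(run.failure for run in runs if run.failure)
    return {
        "average_score": sum(run.average for run in runs) / len(runs),
        "lowest_score": min(run.average for run in runs),
        "most_common_failure": failures.most_common(1)[0][0] if failures else None,
        "average_editing_minutes": sum(run.editing_minutes for run in runs) / len(runs),
    }


# Hypothetical results for the old and new prompt on the same three inputs.
old_runs = [Run(4.0, "generic advice", 10), Run(3.0, "generic advice", 20), Run(5.0, None, 0)]
new_runs = [Run(4.5, None, 5), Run(4.0, "missing next step", 10), Run(5.0, None, 0)]

print(summarize(old_runs))
print(summarize(new_runs))
```

Looking at the lowest score next to the average helps spot a version that does well on typical inputs but breaks on one of them.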
Copy-ready evaluator prompt
Evaluate this AI output using a 1-5 score for each dimension:
- Task fit
- Specificity
- Accuracy
- Completeness
- Usability
- Risk control
- Format compliance
Return:
1. Score table
2. Top 3 issues
3. What would make the output ready to use
4. A revised version if the average score is below 4
Original task:
[PASTE TASK]
AI output:
[PASTE OUTPUT]
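To reuse the evaluator prompt above in a script, you can fill the two placeholders programmatically. The sketch below only builds the prompt text; it does not call any model API. The names `EVALUATOR_TEMPLATE` and `build_evaluator_prompt` are illustrative, and the `{task}` and `{output}` fields stand in for `[PASTE TASK]` and `[PASTE OUTPUT]`.

```python
# The evaluator prompt from above, with the two placeholders as format fields.
EVALUATOR_TEMPLATE = """Evaluate this AI output using a 1-5 score for each dimension:
- Task fit
- Specificity
- Accuracy
- Completeness
- Usability
- Risk control
- Format compliance

Return:
1. Score table
2. Top 3 issues
3. What would make the output ready to use
4. A revised version if the score is below 4

Original task:
{task}

AI output:
{output}
"""


def build_evaluator_prompt(task: str, output: str) -> str:
    """Fill in the task and the output to be evaluated."""
    if not task.strip() or not output.strip():
        raise ValueError("task and output must both be non-empty")
    return EVALUATOR_TEMPLATE.format(task=task.strip(), output=output.strip())


# Hypothetical task and output.
prompt = build_evaluator_prompt(
    task="Summarize the meeting notes in three bullet points.",
    output="- Budget approved\n- Launch moved to Q3\n- Hiring paused",
)
print(prompt)
```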
Common failure patterns
- High fluency, low specificity
- Correct structure, weak substance
- Useful advice with invented evidence
- Good first draft, missing next step
- Strong answer for one input, fragile for others