# Prompt Design & Tuning Best Practices
by abysscat-yj · published 2026-04-01

$ claw add gh:abysscat-yj/abysscat-yj-prompt-design-tuning

---
name: prompt_design_tuning_best_practice
description: Collaboratively design, evaluate, iterate on, and recommend a final launch candidate for a target prompt under a "human-gated, agent-executed" workflow.
---
The goal of this Skill is not to casually “chat about prompts,” but to turn prompt tuning into an **executable, reviewable, and cost-controlled** engineering workflow.
The Agent handles most of the execution work.
Humans are responsible only for validating direction, approving high-cost loops, and signing off on the final launch candidate.
---
# When to Use
Use this Skill when the user needs to:
---
# Working Modes
## 1. Design-Only Mode
Use this mode when:
In this mode, the Agent should produce:
## 2. Execution Mode
Use this mode when:
In this mode, the Agent should continue with:
---
# Core Principles
The following rules are non-negotiable by default:
1. The **target prompt** and the **judge prompt** must be separated.
Do not silently modify both in the same comparison round and then attribute the combined gains to either one.
2. Before large-scale evaluation, the **task definition (task spec)** must be frozen first.
3. Every round of prompt optimization must have a clear **optimization hypothesis**.
No random “this sentence feels off, let’s tweak it” behavior.
4. An **experiment log** must be maintained, including at least:
- version number
- summary of changes in the current round
- optimization hypothesis
- evaluation results
- cost information
- conclusion
5. Any high-cost evaluation loop must be approved by a human beforehand.
6. The final launch candidate must be reviewed by a human.
A high machine-evaluation score does not automatically mean it is ready for launch.
7. If the input information is incomplete, low-risk assumptions may be made, but they must be stated explicitly.
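The experiment-log fields required by principle 4 can be captured in a small record type. The sketch below assumes a Python tooling stack; the field names are illustrative, not mandated by this Skill:

```python
from dataclasses import dataclass, asdict

@dataclass
class ExperimentRecord:
    """One row of the experiment log (principle 4). Field names are illustrative."""
    version: str        # version number, e.g. "v3"
    change_summary: str # summary of changes in the current round
    hypothesis: str     # optimization hypothesis driving the change
    eval_results: dict  # evaluation results, e.g. {"pass_rate": 0.82}
    cost: dict          # cost information, e.g. token counts
    conclusion: str     # e.g. "keep", "revert", or "iterate"

record = ExperimentRecord(
    version="v3",
    change_summary="Tightened output-format instructions",
    hypothesis="Most failures are format errors; stricter schema wording should reduce them",
    eval_results={"pass_rate": 0.82},
    cost={"input_tokens": 120_000, "output_tokens": 35_000},
    conclusion="keep",
)
```

Serializing each record (e.g. via `asdict`) gives a log that is easy to diff across rounds and to attach to the final review package.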
---
# Recommended Inputs to Collect
The Agent should gather or infer the following whenever possible:
---
# Target Deliverables
By default, the workflow should aim to produce the following:
---
# Human Gates
By default, human confirmation is required only at the following key checkpoints:

## Gate A — Freeze the Task Definition
Confirm:

## Gate B — Confirm the Direction of the Target Prompt
Confirm:

## Gate C — Confirm the Direction of the Judge Prompt
Confirm:

## Gate D — Approve a High-Cost Iteration Loop
Confirm:

## Gate E — Final Review
Confirm:
Unless the user explicitly asks for finer-grained control, do not interrupt with additional confirmations between these gates.
---
# Execution Flow
## Phase 0 — Task Definition (Task Spec)
Before writing any prompt, first establish a clear task definition.
The task definition should include at least:
If the user’s description is incomplete, do not stall.
Fill in reasonable assumptions first, then present them for confirmation.
After this, proceed to **Gate A**.
---
## Phase 1 — Generate the First Draft of the Target Prompt
Based on the task definition, produce the first draft of the target prompt.
Requirements:
Also output:
After this, proceed to **Gate B**.
---
## Phase 2 — Generate the First Draft of the Judge Prompt
Design an independent Judge / Eval Prompt.
Requirements:
- partially correct outputs
- format errors
- misunderstanding of the task
- unsafe or policy-violating content
- reasonable uncertainty caused by incomplete task information
Also output:
After this, proceed to **Gate C**.
---
## Phase 3 — Design the Evaluation Plan
Before running large-scale evaluations, define the evaluation plan clearly.
The plan should include at least:
Default loop policy:
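Whatever loop policy is chosen, the plan itself can be captured as a small config object so it is reviewable at Gate D. Every field name below is an illustrative assumption, not something this Skill prescribes:

```python
# Illustrative evaluation-plan config; field names and values are assumptions.
eval_plan = {
    "models": ["model-a", "model-b"],           # hypothetical model identifiers
    "prompt_versions": ["v2", "v3"],            # target prompt versions under comparison
    "sample_size": 200,                         # evaluation cases per (model, version) pair
    "max_iteration_rounds": 3,                  # default loop cap before requiring Gate D
    "budget": {"max_total_tokens": 2_000_000},  # hard cost ceiling for the whole run
    "judge": {"prompt_version": "judge-v1"},    # judge prompt is versioned separately
}

# Sanity check before launching: refuse any plan without a cost ceiling.
assert eval_plan["budget"]["max_total_tokens"] > 0
```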
---
## Phase 4 — Write the Generation Script
If an executable environment is available, the Agent should write a batch generation script.
The script should support, as much as possible:
### TPM Handling Principles
Do not naively convert a TPM (tokens-per-minute) limit into maximum concurrency.
Preferred approach:
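One possible realization of this principle is a sliding-window token budget that callers reserve from before sending a request, so throughput adapts to the limit instead of saturating it with parallel calls. This is a sketch, not a required implementation:

```python
import threading
import time
from collections import deque

class TokenBudget:
    """Sliding one-minute window over spent tokens; callers block until budget frees up."""

    def __init__(self, tokens_per_minute: int):
        self.tpm = tokens_per_minute
        self.spent = deque()        # (timestamp, tokens) pairs inside the window
        self.lock = threading.Lock()

    def _window_total(self, now: float) -> int:
        # Drop spends older than 60 seconds, then sum what remains.
        while self.spent and now - self.spent[0][0] >= 60.0:
            self.spent.popleft()
        return sum(tokens for _, tokens in self.spent)

    def reserve(self, tokens: int) -> None:
        """Block until `tokens` fits under the per-minute limit, then record the spend."""
        if tokens > self.tpm:
            raise ValueError("single request exceeds the per-minute budget")
        while True:
            with self.lock:
                now = time.monotonic()
                if self._window_total(now) + tokens <= self.tpm:
                    self.spent.append((now, tokens))
                    return
            time.sleep(0.25)        # modest backoff instead of aggressive polling

budget = TokenBudget(tokens_per_minute=100_000)
budget.reserve(4_000)  # returns immediately while the window has headroom
```

Because callers block at `reserve`, concurrency throttles itself as the window fills, rather than being fixed at some maximum derived from the TPM number.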
---
## Phase 5 — Batch Generate Model Outputs
Run the full evaluation set across all specified models and prompt versions.
At minimum, record:
If generation failures occur frequently:
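One common mitigation, offered here as an assumption since the Skill leaves the exact policy open, is bounded retry with jittered exponential backoff, recording the final failure instead of silently dropping the case:

```python
import random
import time

def generate_with_retry(call, max_attempts: int = 3, base_delay: float = 1.0):
    """Run `call()` with bounded exponential backoff.

    Returns (result, error, attempts); on exhaustion the error string is kept
    so the case is recorded as a failure rather than dropped from the run.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call(), None, attempt
        except Exception as exc:  # in real use, catch the client's transient error types
            if attempt == max_attempts:
                return None, repr(exc), attempt
            # Jittered exponential backoff: base, 2x base, 4x base, ... plus noise.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random() * 0.1)

# Usage with a hypothetical flaky call that fails once, then succeeds.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 2:
        raise RuntimeError("transient")
    return "ok"

result, error, attempts = generate_with_retry(flaky, base_delay=0.01)
```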
---
## Phase 6 — Run Automatic Evaluation
Use the Judge Prompt to evaluate generated outputs in batch.
Requirements:
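As a minimal illustration, not a mandated format, suppose the Judge Prompt is instructed to emit JSON with `verdict` and `failure_type` fields; the batch verdicts can then be parsed strictly and aggregated into the numbers Phase 7 needs:

```python
import json
from collections import Counter

def parse_verdict(raw: str) -> dict:
    """Parse one judge output. Assumes an illustrative JSON schema with a
    'verdict' (pass/fail) field and an optional 'failure_type' field."""
    data = json.loads(raw)
    if data["verdict"] not in ("pass", "fail"):
        raise ValueError(f"unexpected verdict: {data['verdict']!r}")
    return data

def aggregate(raw_verdicts: list) -> dict:
    parsed = [parse_verdict(r) for r in raw_verdicts]
    failures = Counter(
        v.get("failure_type", "unlabeled") for v in parsed if v["verdict"] == "fail"
    )
    return {
        "n": len(parsed),
        "pass_rate": sum(v["verdict"] == "pass" for v in parsed) / len(parsed),
        "failure_clusters": failures.most_common(),  # feeds Phase 7's cluster analysis
    }

report = aggregate([
    '{"verdict": "pass"}',
    '{"verdict": "fail", "failure_type": "format_error"}',
    '{"verdict": "fail", "failure_type": "format_error"}',
    '{"verdict": "pass"}',
])
```

Parsing strictly (and raising on malformed verdicts) keeps judge-side format drift visible instead of silently polluting the scores.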
---
## Phase 7 — Analyze and Optimize
A new prompt iteration is allowed only when there is a clear optimization hypothesis.
Each round must include:
1. summarize the previous round’s results
2. identify the major failure clusters
3. propose the optimization hypothesis for this round
4. modify only the most necessary prompt sections
5. provide a version-diff summary
6. predict what should improve and what may regress
Do not run another round for no reason.
If the next round will consume meaningful resources, go to **Gate D** first.
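The version-diff summary in step 5 can be generated mechanically; a small sketch using the standard library (version tags and prompt texts here are hypothetical):

```python
import difflib

def version_diff_summary(old: str, new: str, old_tag: str, new_tag: str) -> str:
    """Unified diff between two prompt versions, for the round's version-diff summary."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=old_tag,
        tofile=new_tag,
    ))

v2 = "You are a summarizer.\nReturn plain text.\n"
v3 = "You are a summarizer.\nReturn JSON with a 'summary' field.\n"
diff = version_diff_summary(v2, v3, "prompt-v2", "prompt-v3")
print(diff)
```

Attaching the raw diff alongside the prose summary makes it easy for the human reviewer to verify that only the sections named in the hypothesis were touched.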
---
## Phase 8 — Final Recommendation
Once a version performs strongly enough, the Agent should produce a final review package.
It should include at least:
After this, proceed to **Gate E**.
---
# Default Outputs at Each Gate
## Gate A Output
## Gate B Output
## Gate C Output
## Gate D Output
## Gate E Output
---
# Default Analysis Templates
## Experiment Log Fields
Each experiment round should record at least:
## Suggested Failure Taxonomy
The Agent should try to classify failures into one of the following:
---
# Explicitly Forbidden Anti-Patterns
Do not do the following:
---
# Default Behavior When the Skill Is Triggered
When this Skill is triggered, the Agent should follow this order:
1. build or refresh the task definition
2. determine which phase the workflow is currently in
3. prioritize filling missing artifacts before rewriting existing ones
4. prefer incremental optimization over full rewrites
5. request confirmation only at the defined human gates
6. after each major step, output a concise decision memo including:
- what changed
- why it changed
- which metrics improved
- what major issues remain
- whether another round is worth it
---
# Example Trigger Phrases
The following requests are suitable triggers for this Skill: