
Prompt Design & Tuning Best Practices


by abysscat-yj · published 2026-04-01

Tags: data processing · API integration
Install command:

    $ claw add gh:abysscat-yj/abysscat-yj-prompt-design-tuning

---

name: prompt_design_tuning_best_practice

description: Collaboratively design, evaluate, iterate on, and recommend a final launch candidate for a target prompt under a “human-gated, agent-executed” workflow.

---

# Prompt Design & Tuning Best Practices

The goal of this Skill is not to casually “chat about prompts,” but to turn prompt tuning into an **executable, reviewable, and cost-controlled** engineering workflow.

The Agent handles most of the execution work. Humans are responsible only for validating direction, approving high-cost loops, and signing off on the final launch candidate.

---

# When to Use

Use this Skill when the user needs to:

- design or optimize a target prompt from scratch
- design a separate evaluation / judge prompt
- compare the performance of multiple models on an evaluation set
- work with an existing API curl, SDK integration, or request protocol
- run controlled prompt iterations under a limited budget
- turn prompt tuning into a reusable workflow instead of a one-off chat exercise

---

# Working Modes

## 1. Design-Only Mode

Use this mode when:

- there is no runnable environment yet
- no evaluation resources are available yet
- real model calls cannot be executed for now

In this mode, the Agent should produce:

- task definition
- target prompt draft
- judge prompt draft
- evaluation plan
- script skeletons
- manual execution guidance

## 2. Execution Mode

Use this mode when:

- a runnable environment already exists
- the model invocation method has been provided
- the evaluation set, resource limits, and candidate models have been provided

In this mode, the Agent should continue with:

- batch generation
- automatic evaluation
- result analysis
- prompt iteration
- final candidate recommendation

---

# Core Principles

The following rules are non-negotiable by default:

1. The **target prompt** and the **judge prompt** must be separated. Do not silently modify both in the same comparison round and then mix their gains together.

2. Before large-scale evaluation, the **task definition (task spec)** must be frozen first.

3. Every round of prompt optimization must have a clear **optimization hypothesis**. No random “this sentence feels off, let’s tweak it” behavior.

4. An **experiment log** must be maintained, including at least:
   - version number
   - summary of changes in the current round
   - optimization hypothesis
   - evaluation results
   - cost information
   - conclusion

5. Any high-cost evaluation loop must be approved by a human beforehand.

6. The final launch candidate must be reviewed by a human. A high machine-evaluation score does not automatically mean it is ready for launch.

7. If the input information is incomplete, low-risk assumptions may be made, but they must be stated explicitly.

---

# Recommended Inputs to Collect

The Agent should gather or infer the following whenever possible:

- business goal
- user scenario
- input format
- output format
- hard constraints
- unacceptable errors
- success criteria
- online acceptance threshold
- evaluation set
- candidate models
- invocation method (curl / SDK / API)
- resource limits (TPM, RPM, timeout, budget, retry cap)

---

# Target Deliverables

By default, the workflow should aim to produce the following:

- `docs/task_spec.md`
- `prompts/production_prompt_v{n}.md`
- `prompts/judge_prompt_v{n}.md`
- `docs/eval_plan.md`
- `scripts/run_generation.py`
- `scripts/run_judge.py`
- `reports/iteration_{n}_summary.md`
- `reports/final_recommendation.md`
- `reports/experiment_log.md`

---

# Human Gates

By default, human confirmation is required only at the following key checkpoints:

## Gate A — Freeze the Task Definition

Confirm:

- whether the task is understood correctly
- whether the success criteria are reasonable
- whether the constraints are complete

## Gate B — Confirm the Direction of the Target Prompt

Confirm:

- whether the target prompt is directionally correct
- whether it is ready to enter evaluation

## Gate C — Confirm the Direction of the Judge Prompt

Confirm:

- whether the evaluation standard is fair
- whether the judge is evaluating what actually matters

## Gate D — Approve a High-Cost Iteration Loop

Confirm:

- model list
- TPM budget
- number of iteration rounds
- whether it is worth spending more resources

## Gate E — Final Review

Confirm:

- whether the current best version can serve as a launch candidate
- whether to continue optimizing
- whether to stop

Unless the user explicitly asks for finer-grained control, do not interrupt too frequently in the middle.

---

# Execution Flow

## Phase 0 — Task Definition (Task Spec)

Before writing any prompt, first establish a clear task definition.

The task definition should include at least:

- problem description
- input format
- output format
- business goal
- user goal
- constraints
- explicitly forbidden outputs
- positive and negative examples
- success metrics
- unresolved issues
- current assumptions

If the user’s description is incomplete, do not stall. Fill in reasonable assumptions first, then present them for confirmation.

After this, proceed to **Gate A**.

---

## Phase 1 — Generate the First Draft of the Target Prompt

Based on the task definition, produce the first draft of the target prompt.

Requirements:

- instructions must be clear
- constraints must be explicit
- output structure must be stable
- ambiguity should be minimized
- controllability should be prioritized over fluffy “stylistic” wording
- examples should be included only when they are truly helpful

Also output:

- key design rationale
- predicted risk points
- likely failure scenarios
- what to pay close attention to in the first evaluation round

After this, proceed to **Gate B**.

---

## Phase 2 — Generate the First Draft of the Judge Prompt

Design an independent Judge / Eval Prompt.

Requirements:

- evaluate the task outcome, not whether the prompt itself reads nicely
- score across separate dimensions, then aggregate
- include hard-fail categories
- output must be structured JSON
- minimize bias caused by stylistic model preferences
- explicitly handle the following cases:
  - partially correct outputs
  - format errors
  - misunderstanding of the task
  - unsafe or policy-violating content
  - reasonable uncertainty caused by incomplete task information

Also output:

- scoring dimensions
- weight design
- hard-fail conditions
- Judge output JSON schema (a minimal sketch follows below)
- Judge blind spots
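For concreteness, here is a minimal sketch of what the Judge output schema could look like, expressed as a JSON Schema in Python. The dimension names and hard-fail categories are placeholder assumptions; derive the real ones from the frozen task spec rather than copying these.

```python
# Hypothetical Judge output schema (JSON Schema expressed as a Python dict).
# Dimension names and hard-fail categories are placeholders to be replaced
# with ones derived from the frozen task spec.
JUDGE_OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["scores", "hard_fail", "overall", "rationale"],
    "properties": {
        "scores": {
            "type": "object",
            "properties": {
                "correctness": {"type": "integer", "minimum": 0, "maximum": 5},
                "format_compliance": {"type": "integer", "minimum": 0, "maximum": 5},
                "constraint_adherence": {"type": "integer", "minimum": 0, "maximum": 5},
            },
        },
        # A hard fail zeroes the overall score regardless of dimension scores.
        "hard_fail": {
            "type": ["string", "null"],
            "enum": ["unsafe_content", "task_misunderstanding", "format_error", None],
        },
        "overall": {"type": "number", "minimum": 0, "maximum": 5},
        "rationale": {"type": "string"},
    },
}
```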
After this, proceed to **Gate C**.

---

## Phase 3 — Design the Evaluation Plan

Before running large-scale evaluations, define the evaluation plan clearly.

The plan should include at least:

- source and size of the evaluation set
- sample slicing strategy (easy / medium / hard / edge cases)
- online acceptance threshold
- primary metrics
- secondary diagnostic metrics
- tie-breaking rules
- maximum number of iteration rounds
- total budget limit
- early stopping conditions

Default loop policy (sketched as a config below):

- by default, run at most **2 high-cost optimization rounds**
- stop if the gains are marginal and the failure types have not improved in substance
- if the Judge itself looks unreliable, fix the Judge first instead of continuing to modify the target prompt
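To make the plan and loop policy machine-checkable rather than implicit, the scripts can read them from a small config. A sketch under assumed field names and illustrative values; nothing below is a prescribed default:

```python
# Illustrative evaluation-plan config. Every field name and value here is
# an assumption to confirm at Gates A and D, not a prescribed default.
EVAL_PLAN = {
    "dataset": {"path": "data/eval_set.jsonl", "size": 200},
    "slices": ["easy", "medium", "hard", "edge"],
    "primary_metric": "overall_score",
    "acceptance_threshold": 4.2,        # online acceptance bar
    "max_iterations": 2,                # at most 2 high-cost rounds by default
    "budget_usd": 50.0,                 # total spend cap across all rounds
    "early_stop": {
        "min_gain": 0.05,               # stop if a round improves less than this
        "require_failure_shift": True,  # ...and the failure mix has not changed in substance
    },
}
```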
---

## Phase 4 — Write the Generation Script

If executable conditions are available, the Agent should write a batch generation script.

The script should support, as much as possible:

- jsonl / csv / excel input
- multiple models
- resume from checkpoint
- retries and backoff
- logging
- strict input-output order preservation
- TPM / RPM rate limiting
- structured outputs for downstream evaluation

### TPM Handling Principles

Do not crudely translate TPM directly into high concurrency.

Preferred approach (a minimal sketch follows below):

- estimate token consumption per request
- use token-bucket or time-window rate limiting
- use conservative concurrency when RPM and latency are unknown
- prioritize stability before speed
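A minimal sketch of this approach, assuming a single-process script. `estimate_tokens` is a crude stand-in; in practice, count tokens with the target model’s tokenizer.

```python
import threading
import time

class TokenBucket:
    """TPM limiter: the bucket refills at `tpm` tokens per minute and
    callers block until their estimated token cost fits."""

    def __init__(self, tpm: int):
        self.capacity = tpm
        self.tokens = float(tpm)
        self.rate = tpm / 60.0  # tokens replenished per second
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, cost: int) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                wait = (cost - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock so other workers proceed

def estimate_tokens(prompt: str, max_output: int = 512) -> int:
    # Rough heuristic: ~4 characters per token for the input, plus the
    # full output budget. Replace with a real tokenizer count.
    return len(prompt) // 4 + max_output

bucket = TokenBucket(tpm=100_000)
# Before each API call: bucket.acquire(estimate_tokens(prompt))
```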
---

## Phase 5 — Batch Generate Model Outputs

Run the full evaluation set across all specified models and prompt versions.

At minimum, record the following per sample (a record sketch follows below):

- model name
- prompt version
- input sample ID
- raw output
- token usage (if available)
- latency
- retry count
- request failure information
- truncation / parsing failures
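A sketch of what one such record could look like, written one JSON object per line so the judge script can consume it directly. The field names here are assumptions, not a fixed schema.

```python
import dataclasses
import json
from typing import Optional

@dataclasses.dataclass
class GenerationRecord:
    # One row per (model, prompt version, sample); appended as JSONL.
    model: str
    prompt_version: str
    sample_id: str
    raw_output: str
    tokens_used: Optional[int]  # None if the API does not report usage
    latency_s: float
    retries: int
    error: Optional[str]        # request failure information, if any
    truncated: bool             # truncation / parsing failure flag

record = GenerationRecord("model-x", "v1", "s001", "...", 812, 2.4, 0, None, False)
with open("outputs/generations.jsonl", "a") as f:
    f.write(json.dumps(dataclasses.asdict(record)) + "\n")
```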
If generation failures occur frequently:

- first separate infrastructure issues from prompt issues
- do not conclude that the prompt is bad before ruling out quota, network, rate-limiting, or protocol problems

---

## Phase 6 — Run Automatic Evaluation

Use the Judge Prompt to evaluate generated outputs in batch.

Requirements (an aggregation sketch follows below):

- Judge output must be structured JSON
- raw judge outputs must be traceable
- compute overall scores and slice-level metrics
- automatically identify major failure clusters
- distinguish format errors from content errors
- if the Judge is noisy, state that explicitly instead of pretending the results are reliable
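A minimal sketch of the score aggregation, assuming judge results were written as JSONL with `slice` and `overall` fields (both names are assumptions, matching the schema sketch in Phase 2):

```python
import json
from collections import defaultdict

def summarize(judge_results_path: str) -> dict:
    """Compute the aggregate and per-slice mean scores from judge JSONL
    output, keeping parse failures separate from content scores."""
    scores, slices, parse_failures = [], defaultdict(list), 0
    with open(judge_results_path) as f:
        for line in f:
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                parse_failures += 1  # format error, not a content error
                continue
            scores.append(row["overall"])
            slices[row.get("slice", "unknown")].append(row["overall"])
    return {
        "aggregate": sum(scores) / len(scores) if scores else None,
        "by_slice": {k: sum(v) / len(v) for k, v in slices.items()},
        "parse_failures": parse_failures,
    }
```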
---

## Phase 7 — Analyze and Optimize

A new prompt iteration is allowed only when there is a clear optimization hypothesis.

Each round must include:

1. summarize the previous round’s results
2. identify the major failure clusters
3. propose the optimization hypothesis for this round
4. modify only the most necessary prompt sections
5. provide a version-diff summary
6. predict what should improve and what may regress

Do not run another round for no reason. If the next round will consume meaningful resources, go to **Gate D** first.

---

## Phase 8 — Final Recommendation

Once a version reaches a sufficiently strong level, the Agent should produce a final review package.

It should include at least:

- final target prompt
- final judge prompt
- recommended model
- overall metrics
- slice-level metrics
- major remaining failure types
- cost / latency notes
- whether launch is recommended
- what should be monitored after launch

After this, proceed to **Gate E**.

---

# Default Outputs at Each Gate

## Gate A Output

- task definition document
- current assumptions
- missing information
- recommended acceptance criteria

## Gate B Output

- target prompt v1
- design rationale
- expected risks

## Gate C Output

- judge prompt v1
- scoring rubric
- JSON schema
- Judge blind spots

## Gate D Output

- current result comparison
- failure analysis
- prompt change summary
- next-round optimization hypothesis
- estimated resource consumption

## Gate E Output

- final candidate
- why it is the current best version
- where it may still fail
- recommendation to launch / continue optimizing / stop

---

# Default Analysis Templates

## Experiment Log Fields

Each experiment round should record at least the following (an example entry follows below):

- iteration
- production_prompt_version
- judge_prompt_version
- model
- dataset_version
- hypothesis
- change_summary
- aggregate_score
- slice_scores
- dominant_failures
- cost
- verdict
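For the shape of the log, one entry could look like this; every value below is invented purely to illustrate the fields:

```python
# One illustrative experiment-log entry. All values are made up to show
# the shape of the record, not real results.
log_entry = {
    "iteration": 2,
    "production_prompt_version": "v3",
    "judge_prompt_version": "v1",
    "model": "model-x",
    "dataset_version": "eval_set_2026-04",
    "hypothesis": "an explicit output contract reduces format errors",
    "change_summary": "appended a JSON output contract; no other edits",
    "aggregate_score": 4.1,
    "slice_scores": {"easy": 4.6, "medium": 4.2, "hard": 3.5, "edge": 3.2},
    "dominant_failures": ["incomplete coverage", "hallucinated details"],
    "cost": {"tokens": 1_250_000, "usd": 6.30},
    "verdict": "keep v3; format errors down, coverage unchanged",
}
```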
## Suggested Failure Taxonomy

The Agent should try to classify failures into one of the following (an enum sketch follows below):

- task misunderstanding
- missing constraints
- extraction error
- reasoning error
- incomplete coverage
- unsafe / policy-violating output
- format / schema error
- verbose / redundant output
- hallucinated details
- mismatch between Judge and actual task goal
- infrastructure failure
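If the analysis scripts tag failures programmatically, pinning the taxonomy down as an enum keeps the labels consistent across rounds; a sketch:

```python
from enum import Enum

class FailureType(str, Enum):
    # Mirrors the taxonomy above so generation, judging, and analysis
    # scripts all share one label set.
    TASK_MISUNDERSTANDING = "task_misunderstanding"
    MISSING_CONSTRAINTS = "missing_constraints"
    EXTRACTION_ERROR = "extraction_error"
    REASONING_ERROR = "reasoning_error"
    INCOMPLETE_COVERAGE = "incomplete_coverage"
    UNSAFE_OUTPUT = "unsafe_or_policy_violating_output"
    FORMAT_ERROR = "format_or_schema_error"
    VERBOSE_OUTPUT = "verbose_or_redundant_output"
    HALLUCINATED_DETAILS = "hallucinated_details"
    JUDGE_TASK_MISMATCH = "judge_task_mismatch"
    INFRASTRUCTURE_FAILURE = "infrastructure_failure"
```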
---

# Explicitly Forbidden Anti-Patterns

Do not do the following:

- modify both the target prompt and the judge prompt in the same round without saying so
- look only at the aggregate score and ignore the failure distribution
- overfit to a tiny evaluation set without warning about the risk
- use machine evaluation as a substitute for final human review
- loop endlessly because the score moved slightly
- rewrite the whole prompt when only one part is broken
- hide critical assumptions
- declare success without showing hard examples

---

# Default Behavior When the Skill Is Triggered

When this Skill is triggered, the Agent should follow this order:

1. build or refresh the task definition
2. determine which phase the workflow is currently in
3. prioritize filling missing artifacts before rewriting existing ones
4. prefer incremental optimization over full rewrites
5. request confirmation only at the defined human gates
6. after each major step, output a concise decision memo including:
   - what changed
   - why it changed
   - which metrics improved
   - what major issues remain
   - whether another round is worth it

---

# Example Trigger Phrases

The following requests are suitable triggers for this Skill:

- “Help me automate this prompt tuning workflow”
- “Write the target prompt first, then the judge prompt, then design the evaluation”
- “Use the evaluation set and several models to find the current best prompt”
- “Run 1 to 2 prompt iteration rounds under a controlled budget”
- “Turn this prompt tuning process into a reusable agent skill”
- “Let the agent drive the process, and keep humans only at key checkpoints”