// Skill profile

Agentic Evaluation Patterns

name: agentic-eval

by boleyn · published 2026-03-22

Tags: data processing · automated tasks · cryptocurrency
Total installs: 0 · Stars: 0 · Last updated: 2026-03
// Install command
$ claw add gh:boleyn/boleyn-agentic-eval
View on GitHub
// Full documentation

---

name: agentic-eval

description: |
  Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:
  - Implementing self-critique and reflection loops
  - Building evaluator-optimizer pipelines for quality-critical generation
  - Creating test-driven code refinement workflows
  - Designing rubric-based or LLM-as-judge evaluation systems
  - Adding iterative improvement to agent outputs (code, reports, analysis)
  - Measuring and improving agent response quality

---

# Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

## Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

```
Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘
```

## When to Use

- **Quality-critical generation**: Code, reports, analysis requiring high accuracy
- **Tasks with clear evaluation criteria**: Defined success metrics exist
- **Content requiring specific standards**: Style guides, compliance, formatting

---

## Pattern 1: Basic Reflection

The agent evaluates and improves its own output through self-critique.

```python
import json

# llm(prompt) -> str is assumed to be your model-call helper
def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with a reflection loop."""
    output = llm(f"Complete this task:\n{task}")

    for i in range(max_iterations):
        # Self-critique
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)

        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    return output
```

**Key insight**: Use structured JSON output for reliable parsing of critique results.
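A common failure mode is the model returning malformed JSON, or wrapping it in a Markdown fence. A small defensive parser, sketched below with hypothetical names, lets the loop treat an unparseable critique as inconclusive instead of crashing:

```python
import json

def parse_critique(raw: str) -> dict | None:
    """Best-effort parse of a critique response; returns None if it isn't valid JSON."""
    # Strip a ```json ... ``` wrapper if the model added one
    cleaned = raw.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    # Expect a mapping of criterion -> {"status": ..., "feedback": ...}
    return data if isinstance(data, dict) else None
```

The caller can then retry the critique prompt or skip the iteration when `parse_critique` returns `None`.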

---

## Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

```python
import json

# llm(prompt) -> str is assumed to be your model-call helper
class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output
```
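A typical call, assuming the same `llm()` helper is in scope and an illustrative task string, looks like this; keeping `evaluate()` separate also makes it easy to back the judge with a different (often stronger) model than the generator.

```python
optimizer = EvaluatorOptimizer(score_threshold=0.85)
final_report = optimizer.run("Summarize the incident postmortem in 200 words", max_iterations=4)
```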

---

## Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

```python
# llm(prompt) -> str is assumed to be your model-call helper;
# run_tests(code, tests) is assumed to return {"success": bool, "error": str} (see sketch below)
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code
```
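`run_tests` is left undefined above. One possible implementation (a sketch, assuming pytest is available, the generated tests import from `solution`, and you accept executing model-generated code, ideally in a sandbox) writes both files to a temporary directory and shells out to pytest:

```python
import subprocess
import tempfile
from pathlib import Path

def run_tests(code: str, tests: str) -> dict:
    """Write code and tests to a temp dir, run pytest, and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "test_solution.py").write_text(tests)
        result = subprocess.run(
            ["python", "-m", "pytest", "test_solution.py", "-x", "-q"],
            cwd=tmp, capture_output=True, text=True, timeout=60,
        )
    return {"success": result.returncode == 0, "error": result.stdout + result.stderr}
```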

---

## Evaluation Strategies

### Outcome-Based

Evaluate whether the output achieves the expected result.

```python
def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does the output achieve the expected outcome? Task: {task}, Expected: {expected}, Output: {output}")
```

### LLM-as-Judge

Use an LLM to compare and rank outputs.

```python
def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    return llm(f"Compare outputs A and B on {criteria}.\nA: {output_a}\nB: {output_b}\nWhich is better and why?")
```

### Rubric-Based

Score outputs against weighted dimensions.

```python
import json

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}"))
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5
```
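For example, if the model rates accuracy 4, clarity 5, and completeness 3, the weighted sum is 0.4 × 4 + 0.3 × 5 + 0.3 × 3 = 4.0, which normalizes to 4.0 / 5 = 0.8.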

---

## Best Practices

| Practice | Rationale |
|----------|-----------|
| **Clear criteria** | Define specific, measurable evaluation criteria upfront |
| **Iteration limits** | Set max iterations (3-5) to prevent infinite loops |
| **Convergence check** | Stop if output score isn't improving between iterations |
| **Log history** | Keep full trajectory for debugging and analysis |
| **Structured output** | Use JSON for reliable parsing of evaluation results |
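The convergence check and history logging can be wrapped around any of the loops above. The sketch below is illustrative rather than part of the published skill, and assumes the evaluator returns an `overall_score` as in Pattern 2:

```python
def run_with_convergence(generate, evaluate, optimize, task: str,
                         max_iterations: int = 5, min_gain: float = 0.01):
    """Refinement loop that stops early when the score stops improving.

    Returns the best output seen plus the full iteration history for debugging.
    """
    output = generate(task)
    history = []
    best_score, best_output = float("-inf"), output
    for i in range(max_iterations):
        evaluation = evaluate(output, task)
        score = evaluation["overall_score"]
        history.append({"iteration": i, "score": score, "output": output})
        improved = score - best_score
        if score > best_score:
            best_score, best_output = score, output
        if i > 0 and improved < min_gain:
            break  # score is no longer improving -> stop and keep the best so far
        output = optimize(output, evaluation)
    return best_output, history
```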

---

## Quick Start Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully