autoresearch
name: autoresearch
by alannjaf · published 2026-04-01
$ claw add gh:alannjaf/alannjaf-karpathy-autoresearch

---
name: autoresearch
description: "Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology. Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on."
---
# autoresearch
Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology.
## Triggers
Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.
## Description
Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern. Mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.
## How It Works
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌──────────────┐
│ 1. BASELINE │────▶│  2. MUTATE  │────▶│ 3. EVALUATE │────▶│  4. DECIDE   │
│  Score the  │     │ Change one  │     │ Run scoring │     │   Better?    │
│   current   │     │    thing    │     │  function   │     │ Keep : Revert│
│   version   │     │             │     │             │     │              │
└─────────────┘     └─────────────┘     └─────────────┘     └──────┬───────┘
                                                                   │
                                Loop back to 2 ◀───────────────────┘
```

## Instructions
### Step 1: Identify the Mutable File
The **mutable file** is the thing you're optimizing: a skill's SKILL.md, a prompt template, a strategy config, or any other file whose contents you can change and then score.
Create or identify this file. Example:

```
my-skill/
├── SKILL.md          ← this is your mutable file
├── eval/
│   ├── test_cases.json
│   └── score.py
```

### Step 2: Create an Evaluation Function
Your eval function must:
1. **Take the current mutable file as input**
2. **Run it against test cases**
3. **Return a numeric score** (higher = better)
The eval can be anything that returns a number: an LLM judge, a backtest metric like Sharpe ratio, or a simple exact-match check against expected outputs.
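For instance, here is a minimal binary scorer, sketched under two assumptions that are not part of the skill itself: each test case carries `input` and `expected` fields (a hypothetical schema), and you supply a `run_skill` callable that executes the mutable file against an input (an LLM call, a subprocess, whatever fits your domain):

```python
# Hypothetical exact-match scorer. The test-case schema ({"input", "expected"})
# and the run_skill callable are illustrative assumptions.

def run_and_score(current_version: str, case: dict, run_skill) -> float:
    """Return 1.0 if the skill's output exactly matches the expected value."""
    output = run_skill(current_version, case["input"])
    return 1.0 if str(output).strip() == str(case["expected"]).strip() else 0.0


if __name__ == "__main__":
    # Trivial stand-in runner to show the call shape:
    fake_runner = lambda version, text: text.upper()
    case = {"input": "hello", "expected": "HELLO"}
    print(run_and_score("PROMPT TEXT", case, fake_runner))  # prints 1.0
```

Averaging such binary scores across many test cases turns pass/fail into a smooth enough signal for the loop to climb.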
Template eval function (customize for your domain):

```python
# eval/score.py
import json
import sys


def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
    """
    Score the current version of the mutable file.
    Returns a float — higher is better.
    """
    with open(mutable_file_path) as f:
        current_version = f.read()
    with open(test_cases_path) as f:
        test_cases = json.load(f)

    scores = []
    for case in test_cases:
        # YOUR SCORING LOGIC HERE
        # Example: run the prompt, compare output to expected
        score = run_and_score(current_version, case)
        scores.append(score)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    score = evaluate(sys.argv[1], sys.argv[2])
    print(f"SCORE: {score}")
```

### Step 3: Run the Autoresearch Loop
The loop follows this exact pattern:

```
1. Git init (if not already) — every experiment is a commit
2. Run eval on current version → get BASELINE score
3. For each experiment (1..N):
   a. Read the current mutable file
   b. Generate a MUTATION (change one thing — a threshold, a phrase, a rule)
   c. Write the mutated version
   d. Run eval → get NEW score
   e. If NEW > BASELINE:
      - Git commit with message: "exp-{N}: {description} | score: {baseline} → {new}"
      - Update BASELINE = NEW
      - Log: "✅ KEPT — improvement"
   f. If NEW <= BASELINE:
      - Git checkout the mutable file (revert)
      - Log: "❌ REVERTED — no improvement"
4. Print final summary: experiments run, improvements found, final score
```

#### Agent Instructions for Running the Loop
When the user says "run autoresearch on X", follow this procedure:
1. **Locate the mutable file** — ask the user or infer from context
2. **Locate or create the eval function** — the user must have a way to score
3. **Initialize git tracking** in the project directory
4. **Run baseline eval** — record the starting score
5. **Begin experiment loop:**
- Read the mutable file
- Think about what single change might improve the score
- Make the change (be specific — change ONE thing per experiment)
- Run eval
- Keep or revert based on score
- Log the result
6. **Continue for N experiments** (default: 20, or until user stops)
7. **Report results:**
- Starting score → Final score
- Number of experiments run
- Number of improvements kept
- Summary of what changes worked
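The keep-or-revert procedure above can be sketched in Python. This is a minimal sketch, not the skill's own implementation: every side effect (reading and writing the mutable file, running the eval, committing, reverting) is injected as a callable so the skeleton stays generic, and `parse_score` assumes the `SCORE: <float>` output contract from Step 2.

```python
def parse_score(output: str) -> float:
    """Extract the number from an eval's 'SCORE: <float>' line."""
    for line in output.splitlines():
        if line.startswith("SCORE:"):
            return float(line.split(":", 1)[1])
    raise RuntimeError("eval produced no SCORE line")


def autoresearch(read, write, evaluate, mutate, commit, revert, n=20):
    """Run n experiments; keep a mutation only if it beats the baseline."""
    baseline = evaluate()
    kept = 0
    for i in range(1, n + 1):
        new_text, description = mutate(read())  # change ONE thing
        write(new_text)
        score = evaluate()
        if score > baseline:
            commit(i, description, baseline, score)  # e.g. git add + commit
            baseline = score
            kept += 1
        else:
            revert()  # e.g. git checkout -- <mutable file>
    return baseline, kept
```

In practice `evaluate` would shell out to `python eval/score.py <mutable> <cases>` and feed stdout to `parse_score`, `commit` and `revert` would wrap the git commands from Step 4, and `mutate` is the agent proposing a single change.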
#### Mutation Strategy
Good mutations change ONE thing at a time: a single threshold, one phrase, one rule added or removed.
Bad mutations change everything at once — you can't learn what worked.
### Step 4: Git Tracking
Every experiment MUST be tracked in git:
```bash
# Before starting
git init
git add -A
git commit -m "baseline: score {X}"

# After each successful mutation
git add -A
git commit -m "exp-{N}: {what changed} | {old_score} → {new_score}"

# After each failed mutation
git checkout -- {mutable_file}
```

This gives you a complete, revertible history: every kept improvement is a commit, and any experiment can be inspected or rolled back later.
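If the loop is driven from Python rather than the shell, the same git bookkeeping can be wrapped with `subprocess` (a sketch; it assumes `git` is on PATH and the process is running inside the repository):

```python
import subprocess


def commit_experiment(n: int, description: str, old: float, new: float) -> None:
    """Record a kept mutation using the commit-message format above."""
    subprocess.run(["git", "add", "-A"], check=True)
    message = f"exp-{n}: {description} | {old} -> {new}"
    subprocess.run(["git", "commit", "-m", message], check=True)


def revert_mutation(mutable_file: str) -> None:
    """Discard a failed mutation by restoring the last committed version."""
    subprocess.run(["git", "checkout", "--", mutable_file], check=True)
```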
## Proven Results

### Case Study 1: Gold Trading Strategy

### Case Study 2: YouTube Shorts Scripts
## Example Usage
**User**: "Run autoresearch on my email subject line skill"
**Agent workflow**:
1. Read the skill's SKILL.md (mutable file)
2. Create eval: generate 20 test emails → score subject lines with LLM judge (1-100 on open-rate prediction)
3. Baseline: 72.4/100
4. Experiment 1: Add "use numbers in subject lines" → 74.1 ✅ KEPT
5. Experiment 2: Add "max 6 words" → 71.8 ❌ REVERTED
6. Experiment 3: Add "start with a verb" → 75.3 ✅ KEPT
7. ... continue for 20 experiments
8. Final: 79.2/100 (+9.4%)
**User**: "Optimize my trading strategy config"
**Agent workflow**:
1. Read strategy.json (mutable file)
2. Eval: run backtest script → Sharpe ratio
3. Baseline: Sharpe 2.1
4. Experiment 1: Lower stop-loss from 2% to 1.5% → Sharpe 2.3 ✅
5. Experiment 2: Increase EMA fast period 12→15 → Sharpe 1.9 ❌
6. ... continue
7. Final: Sharpe 3.8 (+81%)