🔬 Awesome Autoresearch
by adisinghstudent · published 2026-04-01
$ claw add gh:adisinghstudent/adisinghstudent-awesome-autoresearch
---
name: awesome-autoresearch
description: Curated index of autonomous improvement loops, research agents, and autoresearch-style systems inspired by Karpathy's autoresearch.
triggers:
- set up an autoresearch loop
- build a self-improving agent
- implement autonomous research workflow
- create an experiment optimization loop
- add autoresearch skill to my project
- build a keep-or-revert improvement loop
- set up a research agent pipeline
- automate ml experimentation with agents
---
# 🔬 Awesome Autoresearch

> Skill by [ara.so](https://ara.so), Daily 2026 Skills collection.

A curated index of autonomous improvement loops, research agents, and autoresearch-style systems. The core pattern: an LLM agent proposes changes, runs experiments, measures a metric, and keeps or reverts each change, looping until a budget is exhausted or a threshold is met.
---
## What Is Autoresearch?
Autoresearch (originated by [karpathy/autoresearch](https://github.com/karpathy/autoresearch)) is an **autonomous experiment loop** where:
1. An LLM agent reads a codebase and a goal metric
2. It proposes a targeted change (hypothesis)
3. The change is applied and the metric is measured
4. If the metric improves, keep the change; otherwise, revert it
5. Repeat within a fixed compute/time budget
The pattern generalizes to any measurable objective: model loss, Sharpe ratio, test pass rate, API latency, prompt quality, etc.
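
Because the objectives listed above differ in direction (loss should fall, Sharpe ratio should rise), the keep-or-revert test in step 4 needs to know which way is better. A minimal sketch, assuming a hypothetical `is_improvement` helper with `direction` and `min_delta` parameters that do not come from any of the listed projects:

```python
def is_improvement(new: float, best: float, direction: str = "maximize",
                   min_delta: float = 0.0) -> bool:
    """Return True if `new` beats `best` by more than `min_delta`."""
    if direction == "maximize":
        return new > best + min_delta
    if direction == "minimize":
        return new < best - min_delta
    raise ValueError(f"unknown direction: {direction!r}")


# Sharpe ratio: higher is better. Validation loss: lower is better.
print(is_improvement(1.30, 1.25, "maximize"))                    # True
print(is_improvement(0.92, 0.95, "minimize"))                    # True
print(is_improvement(0.951, 0.95, "maximize", min_delta=0.01))   # False
```

The optional `min_delta` is one way to avoid keeping changes whose gain is within measurement noise.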
---
## Core Loop Pattern
```python
# Canonical keep-or-revert autoresearch loop
import json
import subprocess

METRIC_CMD = ["python", "eval.py"]  # prints JSON {"score": float} to stdout
BUDGET = 20                         # number of iterations
GOAL = "maximize score"


def measure() -> float:
    """Run the eval command and parse its score."""
    result = subprocess.run(METRIC_CMD, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)["score"]


def run_loop(agent_propose_fn):
    best_score = measure()
    print(f"Baseline: {best_score:.4f}")
    for step in range(BUDGET):
        # Agent proposes a diff/edit and applies it to the working tree
        agent_propose_fn(goal=GOAL, step=step, best=best_score)
        score = measure()
        if score > best_score:
            best_score = score
            print(f"[{step}] Improved -> {score:.4f}")
            # Commit the change; `git add -A` also stages newly created files
            subprocess.run(["git", "add", "-A"], check=True)
            subprocess.run(["git", "commit", "-m", f"step {step}: {score:.4f}"], check=True)
        else:
            print(f"[{step}] Reverted ({score:.4f} <= {best_score:.4f})")
            # Revert tracked files to the last good state
            # (untracked files created by the agent are left in place)
            subprocess.run(["git", "checkout", "--", "."], check=True)
    print(f"Final best: {best_score:.4f}")
```
---
## Installation Patterns by Platform
### Claude Code Skill (SKILL.md / CLAUDE.md)
Create `CLAUDE.md` or `.claude/skills/autoresearch.md` in your repo:
```markdown
# Autoresearch Loop

You are running an autonomous improvement loop. Each iteration:
1. Read `GOAL.md` for the objective and metric command
2. Propose ONE focused change to the codebase
3. Apply the change
4. Run `python eval.py` and parse `{"score": float}`
5. If the score improves over the current best: `git commit -am "step N: <score>"`
6. Else: `git checkout -- .`
7. Log to `experiments.jsonl`
8. Repeat until BUDGET iterations or target score reached
```
### GOAL.md Pattern ([jmilinovich/goal-md](https://github.com/jmilinovich/goal-md))
```markdown
# GOAL.md

## Objective
Minimize validation bits-per-byte on the Shakespeare dataset.

## Metric Command
`python eval.py --split val`
Returns: `{"val_bpb": float}` (lower is better).

## Budget

## Constraints
```
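
A loop driver can read the metric command and constraints straight out of this file. The following is a minimal sketch, assuming sections are marked with `## ` headings; `parse_goal_md` is a hypothetical helper and is not taken from the goal-md repository:

```python
import re


def parse_goal_md(text: str) -> dict[str, str]:
    """Split a GOAL.md document into {section title: body} by '## ' headings."""
    sections: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return {title: body.strip() for title, body in sections.items()}


example = """# GOAL.md
## Objective
Minimize validation bits-per-byte on the Shakespeare dataset.
## Metric Command
python eval.py --split val
"""
goal = parse_goal_md(example)
print(goal["Metric Command"])  # python eval.py --split val
```

Text before the first `## ` heading (here the `# GOAL.md` title) is ignored, and empty sections map to an empty string.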
### Codex / OpenAI CLI
```bash
# Install Codex CLI
npm install -g @openai/codex

# Run autoresearch loop via Codex
codex "Read GOAL.md. Run the autoresearch loop: propose a change, measure eval.py output, keep if improved else revert. Repeat 20 times. Log each step to experiments.jsonl."
```
---
## Experiment Logging
```python
# experiments.jsonl writer: append one record per step
import datetime
import json


def log_step(step: int, score: float, baseline: float, diff: str, kept: bool):
    record = {
        "step": step,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "score": score,
        "baseline": baseline,
        "delta": score - baseline,
        "kept": kept,
        "diff_summary": diff[:200],  # first 200 chars of unified diff
    }
    with open("experiments.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```
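
Reading the log back is the other half of the pattern. As a rough sketch, assuming records with the `score` and `kept` fields written above (`summarize_log` is a hypothetical helper name), a run can be summarized like this:

```python
import json
from pathlib import Path


def summarize_log(path: str = "experiments.jsonl") -> dict:
    """Aggregate an append-only JSONL experiment log."""
    records = [
        json.loads(line)
        for line in Path(path).read_text().splitlines()
        if line.strip()  # tolerate blank lines
    ]
    kept = [r for r in records if r["kept"]]
    return {
        "steps": len(records),
        "kept": len(kept),
        "keep_rate": len(kept) / len(records) if records else 0.0,
        # Best score among kept steps, or None if nothing was kept
        "best_score": max((r["score"] for r in kept), default=None),
    }
```

Calling `summarize_log()` after a run gives the number of steps, how many were kept, and the best kept score. It assumes higher scores are better, matching the `{"score": float}` convention used in this document.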
---
## Domain-Specific Configurations
### ML Training Loss (original pattern)
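```python
# eval.py for language model val_bpb
import json

import torch

# load_checkpoint and evaluate_bpb stand in for your project's own helpers;
# they are not PyTorch functions.
model = load_checkpoint("ckpt_latest.pt")
val_bpb = evaluate_bpb(model, "data/val.bin")
print(json.dumps({"score": -val_bpb}))  # negate so higher = better
```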
### API / Prompt Optimization
```python
# eval.py for prompt quality
import json
import os
from pathlib import Path

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def score_prompt(prompt_file="system_prompt.txt") -> float:
    # load_test_cases and judge are project-specific helpers you must supply
    prompt = Path(prompt_file).read_text()
    scores = []
    for test_case in load_test_cases("test_cases.json"):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": test_case["input"]},
            ],
        )
        scores.append(judge(resp.choices[0].message.content, test_case["expected"]))
    return sum(scores) / len(scores)


print(json.dumps({"score": score_prompt()}))
```
### GPU Kernel Optimization ([RightNow-AI/autokernel](https://github.com/RightNow-AI/autokernel))
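```python
# eval.py for kernel throughput
import json
import re
import subprocess

result = subprocess.run(
    ["python", "benchmark_kernel.py", "--kernel", "attn_fwd"],
    capture_output=True, text=True,
)
match = re.search(r"TFLOP/s: ([\d.]+)", result.stdout)
if match is None:
    # Fail loudly instead of raising AttributeError on a missing match
    raise SystemExit("benchmark output did not contain a 'TFLOP/s: <number>' line")
print(json.dumps({"score": float(match.group(1))}))
```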
### Trading Strategy ([chrisworsey55/atlas-gic](https://github.com/chrisworsey55/atlas-gic))
```python
# eval.py for Sharpe ratio
import json

from backtest import run_backtest

sharpe = run_backtest(
    strategy_file="strategy.py",
    data_path="data/ohlcv_2023.parquet",
    initial_capital=100_000,
)
print(json.dumps({"score": sharpe}))
```
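
The snippet above treats `run_backtest` as a black box that returns a Sharpe ratio. For reference, here is a sketch of one common textbook definition (annualized mean excess return over its sample standard deviation). The `sharpe_ratio` function, its parameters, and the 252-period default are assumptions for illustration and are not taken from atlas-gic:

```python
import math
import statistics


def sharpe_ratio(returns: list[float], risk_free_rate: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio from per-period simple returns."""
    if len(returns) < 2:
        raise ValueError("need at least two returns to estimate volatility")
    # Convert the annual risk-free rate to a per-period rate
    excess = [r - risk_free_rate / periods_per_year for r in returns]
    stdev = statistics.stdev(excess)  # sample standard deviation
    if stdev == 0:
        return 0.0  # no volatility: ratio is undefined, report 0
    return statistics.mean(excess) / stdev * math.sqrt(periods_per_year)


daily_returns = [0.01, -0.01, 0.02, 0.0]
print(f"{sharpe_ratio(daily_returns):.2f}")
```

A score computed on so few periods is very noisy; a real backtest would use the full return series.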
---
## Multi-GPU / Parallel Loops ([iii-hq/n-autoresearch](https://github.com/iii-hq/n-autoresearch))
```python
# parallel_loop.py: run N agents on different hypotheses simultaneously
import asyncio
import json
import os


async def run_agent(agent_id: int, gpu_id: int, hypothesis: dict) -> dict:
    # Pin this evaluation to a single GPU
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
    proc = await asyncio.create_subprocess_exec(
        "python", "eval.py",
        env=env,
        stdout=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    score = json.loads(stdout)["score"]
    return {"agent_id": agent_id, "hypothesis": hypothesis, "score": score}


async def parallel_search(hypotheses: list, gpus: list):
    # Assign hypotheses to GPUs round-robin
    tasks = [
        run_agent(i, gpus[i % len(gpus)], h)
        for i, h in enumerate(hypotheses)
    ]
    results = await asyncio.gather(*tasks)
    return max(results, key=lambda r: r["score"])
```
---
## Persistent Memory Across Sessions
```python
# memory.py: frequency-weighted cross-session knowledge retrieval
import json
import time
from pathlib import Path

MEMORY_FILE = Path(".autoresearch_memory.json")


def load_memory() -> dict:
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return {"lessons": [], "best_score": None, "total_steps": 0}


def save_lesson(lesson: str, score_delta: float):
    mem = load_memory()
    mem["lessons"].append({
        "text": lesson,
        "delta": score_delta,
        "timestamp": time.time(),
        "weight": 1.0,
    })
    # Boost weight for high-impact lessons
    if score_delta > 0.01:
        mem["lessons"][-1]["weight"] = 3.0
    MEMORY_FILE.write_text(json.dumps(mem, indent=2))


def get_top_lessons(n: int = 5) -> list[str]:
    mem = load_memory()
    sorted_lessons = sorted(
        mem["lessons"],
        key=lambda l: l["weight"] * l["delta"],
        reverse=True,
    )
    return [l["text"] for l in sorted_lessons[:n]]
```
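
Stored lessons only help if they reach the agent's context. One possible way to inject them, sketched here with a hypothetical `build_prompt` function and prompt wording that are purely illustrative:

```python
def build_prompt(goal: str, lessons: list[str], best_score: float) -> str:
    """Assemble an agent prompt that includes lessons from earlier sessions."""
    lines = [f"Goal: {goal}", f"Current best score: {best_score:.4f}"]
    if lessons:
        lines.append("Lessons from earlier sessions:")
        lines.extend(f"- {lesson}" for lesson in lessons)
    lines.append("Propose ONE focused change.")
    return "\n".join(lines)


print(build_prompt(
    goal="maximize score",
    lessons=["Larger batch size hurt val loss", "Warmup of 100 steps helped"],
    best_score=0.8123,
))
```

In a full loop, the `lessons` argument would be the output of `get_top_lessons()` from the memory module above.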
---
## Swarm Coordination ([mutable-state-inc/autoresearch-at-home](https://github.com/mutable-state-inc/autoresearch-at-home))
```python
# swarm.py: share best configs and hypotheses across agents
import json
import os

import requests

SWARM_API = os.environ.get("SWARM_API_URL", "http://localhost:8080")


def claim_experiment(agent_id: str, hypothesis: str) -> bool:
    """Claim a hypothesis so other agents don't duplicate work."""
    resp = requests.post(f"{SWARM_API}/claim", json={
        "agent_id": agent_id,
        "hypothesis": hypothesis,
    })
    return resp.json()["claimed"]


def push_best_config(score: float, config: dict):
    """Broadcast a new best config to the swarm leaderboard."""
    requests.post(f"{SWARM_API}/best", json={
        "score": score,
        "config": config,
        "agent_id": os.environ.get("AGENT_ID", "local"),
    })


def pull_best_config() -> dict | None:
    """Fetch current global best config from swarm."""
    resp = requests.get(f"{SWARM_API}/best")
    return resp.json() if resp.ok else None
```
---
## Apple Silicon / MLX Port ([trevin-creator/autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx))
```python
# eval_mlx.py: drop-in eval for Apple Silicon (no CUDA required)
import json
import math

import mlx.core as mx
import mlx.nn as nn


def evaluate_mlx(model_path: str, val_data: str) -> float:
    # build_model and load_tokens stand in for your project's own helpers;
    # they are not part of MLX.
    model = build_model()
    model.load_weights(model_path)  # MLX checkpoint
    tokens = mx.array(load_tokens(val_data))
    logits = model(tokens[:-1])
    loss = nn.losses.cross_entropy(logits, tokens[1:]).mean().item()
    # nats -> bits; this equals bits per byte only if each token is one byte
    return loss / math.log(2)


bpb = evaluate_mlx("ckpt.npz", "data/val.bin")
print(json.dumps({"score": -bpb}))  # negate: higher score = lower bpb
```
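
The division by 0.6931 (approximately ln 2) converts a cross-entropy loss from nats to bits per token. With a byte-level tokenizer that is already bits per byte; with a subword tokenizer the total must be divided by the byte count instead. A small sketch of the conversion, with hypothetical helper names:

```python
import math


def nats_to_bits(loss_nats: float) -> float:
    """Convert a loss measured in nats to bits (1 bit = ln 2 nats)."""
    return loss_nats / math.log(2)


def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Bits per byte given the mean per-token loss and the raw byte count."""
    total_bits = nats_to_bits(loss_nats_per_token) * n_tokens
    return total_bits / n_bytes


# A mean loss of ln 2 nats per token is exactly 1 bit per token.
print(nats_to_bits(math.log(2)))  # 1.0
```

For example, if 100 tokens cover 400 bytes of text, 1 bit per token corresponds to 0.25 bits per byte.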
---
## End-to-End Research Agent ([SakanaAI/AI-Scientist](https://github.com/SakanaAI/AI-Scientist))
```bash
# Clone and install AI-Scientist
git clone https://github.com/SakanaAI/AI-Scientist
cd AI-Scientist
pip install -r requirements.txt

# Set API keys
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."

# Run full pipeline: idea -> experiments -> paper
python launch_scientist.py \
  --model "gpt-4o" \
  --experiment nanoGPT \
  --num-ideas 5
```
---
## Key Environment Variables
```bash
# LLM providers
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...

# Swarm coordination
SWARM_API_URL=http://swarm.internal:8080
AGENT_ID=agent_gpu0

# Hardware
CUDA_VISIBLE_DEVICES=0
PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0  # Apple Silicon

# Loop config
AUTORESEARCH_BUDGET=20
AUTORESEARCH_TARGET_SCORE=0.95
AUTORESEARCH_LOG_FILE=experiments.jsonl
```
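
The loop-config variables can be read once at startup and used for the stopping rule. A minimal sketch, assuming the three `AUTORESEARCH_*` names above; `LoopConfig`, `load_config`, `should_stop`, and the fallback defaults are hypothetical:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LoopConfig:
    budget: int
    target_score: Optional[float]
    log_file: str


def load_config(env: Optional[Mapping[str, str]] = None) -> LoopConfig:
    """Read loop settings from environment variables, with fallbacks."""
    env = os.environ if env is None else env
    target = env.get("AUTORESEARCH_TARGET_SCORE")
    return LoopConfig(
        budget=int(env.get("AUTORESEARCH_BUDGET", "20")),
        target_score=float(target) if target else None,
        log_file=env.get("AUTORESEARCH_LOG_FILE", "experiments.jsonl"),
    )


def should_stop(step: int, best_score: float, cfg: LoopConfig) -> bool:
    """Stop when the budget is spent or the target score is reached."""
    if step >= cfg.budget:
        return True
    return cfg.target_score is not None and best_score >= cfg.target_score


cfg = load_config({"AUTORESEARCH_BUDGET": "20", "AUTORESEARCH_TARGET_SCORE": "0.95"})
print(should_stop(step=3, best_score=0.96, cfg=cfg))  # True
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.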
---
## Minimal Repo Structure
```
my-autoresearch-project/
├── GOAL.md                     # Objective, metric command, budget
├── CLAUDE.md                   # Agent skill/instructions
├── eval.py                     # Prints {"score": float} to stdout
├── train.py                    # Or whatever is being optimized
├── experiments.jsonl           # Append-only experiment log
├── .autoresearch_memory.json   # Cross-session lessons (optional)
└── results/
    └── best_config.json        # Current best configuration
```
---
## Troubleshooting
| Problem | Fix |
|---|---|
| Agent makes too-large changes | Constrain in GOAL.md: "Edit only one function per step" |
| Eval crashes, so every step reverts | Wrap eval.py in try/except, return `{"score": -999}` on error |
| No improvement after 10 steps | Lower learning rate, restrict search space, or seed with known-good config |
| GPU OOM during eval | Add `torch.cuda.empty_cache()` before eval; reduce batch size |
| Agent forgets past lessons | Use persistent memory (`.autoresearch_memory.json`) and inject top lessons into context |
| Metric is noisy | Average over 3 runs: `score = mean([measure() for _ in range(3)])` |
| macOS / no CUDA | Use MLX port or set `device = "mps"` in PyTorch |
| Free Colab T4 | Replace Flash Attention 3 with `torch.nn.functional.scaled_dot_product_attention` |
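
Two of the fixes above (a sentinel score on eval failure, and averaging a noisy metric) can be combined in the measuring code. This sketch applies the try/except on the caller side rather than inside `eval.py`, which is a variation on the table's suggestion; `measure_once`, `measure_mean`, and the timeout value are hypothetical:

```python
import json
import statistics
import subprocess

FAILURE_SCORE = -999.0  # sentinel from the troubleshooting table


def measure_once(cmd: list[str], timeout: float = 600.0) -> float:
    """Run the eval command; return FAILURE_SCORE if it crashes or misbehaves."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout, check=True)
        return float(json.loads(result.stdout)["score"])
    except (subprocess.SubprocessError, OSError, ValueError, KeyError, TypeError):
        # Covers non-zero exit, timeout, missing executable, and bad JSON
        return FAILURE_SCORE


def measure_mean(cmd: list[str], runs: int = 3) -> float:
    """Average several runs to damp metric noise; any failure fails the step."""
    scores = [measure_once(cmd) for _ in range(runs)]
    if FAILURE_SCORE in scores:
        return FAILURE_SCORE
    return statistics.mean(scores)
```

Usage would look like `score = measure_mean(["python", "eval.py"])`. Treating any failed run as a failed step avoids averaging the sentinel into an otherwise valid score.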
---
## Resources
- [karpathy/autoresearch](https://github.com/karpathy/autoresearch): original
- [ShengranHu/ADAS](https://github.com/ShengranHu/ADAS): meta-agent architecture design (ICLR 2025)
- [SakanaAI/AI-Scientist-v2](https://github.com/SakanaAI/AI-Scientist-v2): template-free scientific discovery
- [HKUDS/AI-Researcher](https://github.com/HKUDS/AI-Researcher): NeurIPS 2025 end-to-end research automation
- [gepa-ai/gepa](https://github.com/gepa-ai/gepa): genetic-pareto prompt evolution (ICLR 2026 Oral)
- [snap-stanford/MLAgentBench](https://github.com/snap-stanford/MLAgentBench): benchmark suite for research agents