Rag Evaluator
version: "2.0.0"
by bytesagain1 · published 2026-03-22
$ claw add gh:bytesagain1/bytesagain1-rag-evaluator

---
version: "2.0.0"
name: Ragaai Catalyst
description: "Python SDK for agent AI observability, monitoring, and evaluation. Tags: ragaai-catalyst, python, agentic-ai."
---
# Rag Evaluator
AI-powered RAG (Retrieval-Augmented Generation) evaluation toolkit. Configure, benchmark, compare, and optimize your RAG pipelines from the command line. Track prompts, evaluations, fine-tuning experiments, costs, and usage — all with persistent local logging and full export capabilities.
## Commands
Run `rag-evaluator <command> [args]` to use.
| Command | Description |
|---------|-------------|
| `configure` | Configure RAG evaluation settings and parameters |
| `benchmark` | Run benchmarks against your RAG pipeline |
| `compare` | Compare results across different RAG configurations |
| `prompt` | Log and manage prompt templates and variations |
| `evaluate` | Evaluate RAG output quality and relevance |
| `fine-tune` | Track fine-tuning experiments and parameters |
| `analyze` | Analyze evaluation results and identify patterns |
| `cost` | Track and log API/inference costs |
| `usage` | Monitor token usage and API call volumes |
| `optimize` | Log optimization strategies and results |
| `test` | Run test cases against RAG configurations |
| `report` | Generate evaluation reports |
| `stats` | Show summary statistics across all categories |
| `export <fmt>` | Export data in json, csv, or txt format |
| `search <term>` | Search across all logged entries |
| `recent` | Show recent activity from history log |
| `status` | Health check — version, data dir, disk usage |
| `help` | Show help and available commands |
| `version` | Show version (v2.0.0) |
Each domain command (configure, benchmark, compare, etc.) works in two modes: pass a quoted string to log a new entry, or run the command with no arguments to view the entries already logged for that category.
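The two-mode pattern can be sketched with a small shell function. This is a hypothetical illustration of the log-then-view behavior, not the tool's actual implementation:

```shell
# Demo log file (the real tool writes under ~/.local/share/rag-evaluator/)
DEMO_LOG="$(mktemp)"

rag_demo() {
  if [ "$#" -gt 0 ]; then
    # log mode: timestamp the entry and append it
    printf '%s %s\n' "$(date +%F)" "$*" >> "$DEMO_LOG"
  else
    # view mode: print everything logged so far
    cat "$DEMO_LOG"
  fi
}

rag_demo "faithfulness=0.91"   # logs an entry
rag_demo                       # lists logged entries
```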
## Data Storage
All data is stored locally in `~/.local/share/rag-evaluator/`.
## Requirements
## When to Use
1. **Evaluating RAG pipeline quality** — log evaluation scores, compare retrieval strategies, and track improvements over time
2. **Benchmarking different configurations** — run benchmarks across embedding models, chunk sizes, or retrieval methods and compare results side by side
3. **Tracking costs and usage** — monitor API costs and token usage across experiments to stay within budget
4. **Managing prompt engineering** — log prompt variations, test them against your pipeline, and analyze which templates perform best
5. **Generating reports for stakeholders** — export evaluation data as JSON/CSV for dashboards, or generate text reports summarizing RAG performance
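An exported CSV can be post-processed with standard shell tools for quick summaries. A minimal sketch, assuming a hypothetical `category,date,entry` column layout (the real export columns may differ, so inspect your own export file first):

```shell
# Hypothetical sample in an assumed layout: category,date,entry
demo_csv="$(mktemp)"
cat > "$demo_csv" <<'EOF'
evaluate,2026-03-20,faithfulness=0.91
evaluate,2026-03-21,faithfulness=0.88
cost,2026-03-21,run-042: $0.23
EOF

# Count logged entries per category with awk
awk -F, '{n[$1]++} END {for (c in n) printf "%s %d\n", c, n[c]}' "$demo_csv" | sort
# → cost 1
# → evaluate 2
```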
## Examples
```bash
# Configure a new evaluation run
rag-evaluator configure "model=gpt-4 chunks=512 overlap=50 top_k=5"

# Run a benchmark and log results
rag-evaluator benchmark "latency=230ms recall@5=0.82 precision@5=0.71"

# Compare two retrieval strategies
rag-evaluator compare "bm25 vs dense: bm25 recall=0.78, dense recall=0.85"

# Track evaluation scores
rag-evaluator evaluate "faithfulness=0.91 relevance=0.87 coherence=0.93"

# Log API cost for a run (single quotes keep the shell from expanding "$0")
rag-evaluator cost 'run-042: $0.23 (1.2k tokens input, 800 tokens output)'

# View summary statistics
rag-evaluator stats

# Export all data as CSV
rag-evaluator export csv

# Search for specific entries
rag-evaluator search "gpt-4"

# Check recent activity
rag-evaluator recent

# Health check
rag-evaluator status
```
## Output
All commands output to stdout. Redirect to a file if needed:
```bash
rag-evaluator report "weekly summary" > report.txt
rag-evaluator export json   # saves to ~/.local/share/rag-evaluator/export.json
```
## Configuration
Set `DATA_DIR` by modifying the script, or use the default: `~/.local/share/rag-evaluator/`
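If you edit the script, an XDG-aware default is a common choice. A minimal sketch of deriving the path rather than hard-coding it (the shipped script may simply hard-code `~/.local/share/rag-evaluator/`):

```shell
# Honor XDG_DATA_HOME if set, otherwise fall back to ~/.local/share
DATA_DIR="${XDG_DATA_HOME:-$HOME/.local/share}/rag-evaluator"
mkdir -p "$DATA_DIR"
echo "$DATA_DIR"
```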
---
Powered by BytesAgain | bytesagain.com | hello@bytesagain.com