Crawl4ai
name: crawl4ai
by codylrn804 · published 2026-03-22
$ claw add gh:codylrn804/codylrn804-crawl4ai---
name: crawl4ai
description: AI-powered web scraping framework for extracting structured data from websites. Use when Codex needs to crawl, scrape, or extract data from web pages using AI-powered parsing, handle dynamic content, or work with complex HTML structures.
---
# Crawl4ai
Overview
Crawl4ai is an AI-powered web scraping framework designed to extract structured data from websites efficiently. It combines traditional HTML parsing with AI to handle dynamic content, extract text intelligently, and clean and structure data from complex web pages.
When to Use This Skill
Use when Codex needs to:
**Trigger phrases:**
Quick Start
Basic Usage
from crawl4ai import AsyncWebCrawler, BrowserMode
async def scrape_page(url):
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
browser_mode=BrowserMode.LATEST,
headless=True
)
return result.markdown, result.clean_htmlExtracting Structured Data
from crawl4ai import AsyncWebCrawler, JsonModeScreener
import json
async def extract_products(url):
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
screenshot=True,
javascript=True,
bypass_cache=True
)
# Extract product data
products = []
for item in result.extracted_content:
if item['type'] == 'product':
products.append({
'name': item['name'],
'price': item['price'],
'url': item['url']
})
return productsCommon Tasks
Web Scraping Basics
**Scenario:** User wants to scrape a website for all article titles.
from crawl4ai import AsyncWebCrawler
async def scrape_articles(url):
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
javascript=True,
verbose=True
)
# Extract article titles from HTML
articles = result.extracted_content if result.extracted_content else []
titles = [item.get('name', item.get('text', '')) for item in articles]
return titles**Trigger:** "Scrape this site for article titles" or "Get all titles from [URL]"
Dynamic Content Handling
**Scenario:** Website loads data via JavaScript.
from crawl4ai import AsyncWebCrawler
async def scrape_dynamic_site(url):
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
javascript=True, # Wait for JS execution
wait_for="body", # Wait for specific element
delay=1.5, # Wait time after load
headless=True
)
return result.markdown**Trigger:** "Scrape this dynamic website" or "This page needs JavaScript to load data"
Structured Data Extraction
**Scenario:** Extract specific fields like prices, descriptions, etc.
from crawl4ai import AsyncWebCrawler
async def extract_product_details(url):
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
screenshot=True,
js_code="""
const products = document.querySelectorAll('.product');
return Array.from(products).map(p => ({
name: p.querySelector('.name')?.textContent,
price: p.querySelector('.price')?.textContent,
url: p.querySelector('a')?.href
}));
"""
)
return result.extracted_content**Trigger:** "Extract product details from this page" or "Get price and name from [URL]"
HTML Cleaning and Parsing
**Scenario:** Clean messy HTML and extract clean text.
from crawl4ai import AsyncWebCrawler
async def clean_and_parse(url):
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
remove_tags=['script', 'style', 'nav', 'footer', 'header'],
only_main_content=True
)
# Clean and return markdown
clean_text = result.clean_html
return clean_text**Trigger:** "Clean this HTML" or "Extract main content from this page"
Advanced Features
Custom JavaScript Injection
async def custom_scrape(url, custom_js):
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
js_code=custom_js,
js_only=True # Only execute JS, don't download resources
)
return result.extracted_contentSession Management
from crawl4ai import AsyncWebCrawler
async def multi_page_scrape(base_url, urls):
async with AsyncWebCrawler() as crawler:
results = []
for url in urls:
result = await crawler.arun(
url=url,
session_id=f"session_{url}",
bypass_cache=True
)
results.append({
'url': url,
'content': result.markdown,
'status': result.success
})
return resultsBest Practices
1. **Always check if the site allows scraping** - Respect robots.txt and terms of service
2. **Use appropriate delays** - Add delays between requests to avoid overwhelming servers
3. **Handle errors gracefully** - Implement retry logic and error handling
4. **Be selective with data** - Extract only what you need, don't dump entire pages
5. **Store data reliably** - Save extracted data in structured formats (JSON, CSV)
6. **Clean URLs** - Handle redirects and malformed URLs
Error Handling
async def robust_scrape(url):
try:
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
timeout=30000 # 30 seconds timeout
)
if result.success:
return result.markdown, result.extracted_content
else:
print(f"Scraping failed: {result.error_message}")
return None, None
except Exception as e:
print(f"Scraping error: {str(e)}")
return None, NoneOutput Formats
Crawl4ai supports multiple output formats:
Resources
scripts/
Python scripts for common crawling operations:
references/
Documentation and examples:
More tools from the same signal band
Order food/drinks (点餐) on an Android device paired as an OpenClaw node. Uses in-app menu and cart; add goods, view cart, submit order (demo, no real payment).
Sign plugins, rotate agent credentials without losing identity, and publicly attest to plugin behavior with verifiable claims and authenticated transfers.
The philosophical layer for AI agents. Maps behavior to Spinoza's 48 affects, calculates persistence scores, and generates geometric self-reports. Give your...