HomeBrowseUpload
← Back to registry
// Skill profile

Crawl4ai

name: crawl4ai

by codylrn804 · published 2026-03-22

数据处理API集成加密货币
Total installs
0
Stars
★ 0
Last updated
2026-03
// Install command
$ claw add gh:codylrn804/codylrn804-crawl4ai
View on GitHub
// Full documentation

---

name: crawl4ai

description: AI-powered web scraping framework for extracting structured data from websites. Use when Codex needs to crawl, scrape, or extract data from web pages using AI-powered parsing, handle dynamic content, or work with complex HTML structures.

---

# Crawl4ai

Overview

Crawl4ai is an AI-powered web scraping framework designed to extract structured data from websites efficiently. It combines traditional HTML parsing with AI to handle dynamic content, extract text intelligently, and clean and structure data from complex web pages.

When to Use This Skill

Use when Codex needs to:

  • Extract structured data from web pages (products, articles, forms, tables, etc.)
  • Scrape websites with dynamic content or complex JavaScript
  • Clean and normalize extracted data from various HTML structures
  • Work with APIs or web services that return HTML
  • Handle CORS limitations by scraping directly
  • Process web content at scale with reliability
  • **Trigger phrases:**

  • "Extract data from this website"
  • "Scrape this page for [specific data]"
  • "Parse this HTML"
  • "Get data from [URL]"
  • "Extract structured information from [website]"
  • "Scrape [website] for [data type]"
  • "Web scrape [URL]"
  • Quick Start

    Basic Usage

    from crawl4ai import AsyncWebCrawler, BrowserMode
    
    async def scrape_page(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                browser_mode=BrowserMode.LATEST,
                headless=True
            )
            return result.markdown, result.clean_html

    Extracting Structured Data

    from crawl4ai import AsyncWebCrawler, JsonModeScreener
    import json
    
    async def extract_products(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                screenshot=True,
                javascript=True,
                bypass_cache=True
            )
            # Extract product data
            products = []
            for item in result.extracted_content:
                if item['type'] == 'product':
                    products.append({
                        'name': item['name'],
                        'price': item['price'],
                        'url': item['url']
                    })
            return products

    Common Tasks

    Web Scraping Basics

    **Scenario:** User wants to scrape a website for all article titles.

    from crawl4ai import AsyncWebCrawler
    
    async def scrape_articles(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                javascript=True,
                verbose=True
            )
            # Extract article titles from HTML
            articles = result.extracted_content if result.extracted_content else []
            titles = [item.get('name', item.get('text', '')) for item in articles]
            return titles

    **Trigger:** "Scrape this site for article titles" or "Get all titles from [URL]"

    Dynamic Content Handling

    **Scenario:** Website loads data via JavaScript.

    from crawl4ai import AsyncWebCrawler
    
    async def scrape_dynamic_site(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                javascript=True,  # Wait for JS execution
                wait_for="body",   # Wait for specific element
                delay=1.5,         # Wait time after load
                headless=True
            )
            return result.markdown

    **Trigger:** "Scrape this dynamic website" or "This page needs JavaScript to load data"

    Structured Data Extraction

    **Scenario:** Extract specific fields like prices, descriptions, etc.

    from crawl4ai import AsyncWebCrawler
    
    async def extract_product_details(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                screenshot=True,
                js_code="""
                    const products = document.querySelectorAll('.product');
                    return Array.from(products).map(p => ({
                        name: p.querySelector('.name')?.textContent,
                        price: p.querySelector('.price')?.textContent,
                        url: p.querySelector('a')?.href
                    }));
                """
            )
            return result.extracted_content

    **Trigger:** "Extract product details from this page" or "Get price and name from [URL]"

    HTML Cleaning and Parsing

    **Scenario:** Clean messy HTML and extract clean text.

    from crawl4ai import AsyncWebCrawler
    
    async def clean_and_parse(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                remove_tags=['script', 'style', 'nav', 'footer', 'header'],
                only_main_content=True
            )
            # Clean and return markdown
            clean_text = result.clean_html
            return clean_text

    **Trigger:** "Clean this HTML" or "Extract main content from this page"

    Advanced Features

    Custom JavaScript Injection

    async def custom_scrape(url, custom_js):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                js_code=custom_js,
                js_only=True  # Only execute JS, don't download resources
            )
            return result.extracted_content

    Session Management

    from crawl4ai import AsyncWebCrawler
    
    async def multi_page_scrape(base_url, urls):
        async with AsyncWebCrawler() as crawler:
            results = []
            for url in urls:
                result = await crawler.arun(
                    url=url,
                    session_id=f"session_{url}",
                    bypass_cache=True
                )
                results.append({
                    'url': url,
                    'content': result.markdown,
                    'status': result.success
                })
            return results

    Best Practices

    1. **Always check if the site allows scraping** - Respect robots.txt and terms of service

    2. **Use appropriate delays** - Add delays between requests to avoid overwhelming servers

    3. **Handle errors gracefully** - Implement retry logic and error handling

    4. **Be selective with data** - Extract only what you need, don't dump entire pages

    5. **Store data reliably** - Save extracted data in structured formats (JSON, CSV)

    6. **Clean URLs** - Handle redirects and malformed URLs

    Error Handling

    async def robust_scrape(url):
        try:
            async with AsyncWebCrawler() as crawler:
                result = await crawler.arun(
                    url=url,
                    timeout=30000  # 30 seconds timeout
                )
                if result.success:
                    return result.markdown, result.extracted_content
                else:
                    print(f"Scraping failed: {result.error_message}")
                    return None, None
        except Exception as e:
            print(f"Scraping error: {str(e)}")
            return None, None

    Output Formats

    Crawl4ai supports multiple output formats:

  • **Markdown**: Clean, readable text (`result.markdown`)
  • **Clean HTML**: Structured, cleaned HTML (`result.clean_html`)
  • **Extracted Content**: Structured JSON data (`result.extracted_content`)
  • **Screenshot**: Visual representation (`result.screenshot`)
  • **Links**: All links found on page (`result.links`)
  • Resources

    scripts/

    Python scripts for common crawling operations:

  • `scrape_single_page.py` - Basic scraping utility
  • `scrape_multiple_pages.py` - Batch scraping with pagination
  • `extract_from_html.py` - HTML parsing helper
  • `clean_html.py` - HTML cleaning utility
  • references/

    Documentation and examples:

  • `api_reference.md` - Complete API documentation
  • `examples.md` - Common use cases and patterns
  • `error_handling.md` - Troubleshooting guide
  • // Comments
    Sign in with GitHub to leave a comment.
    // Related skills

    More tools from the same signal band