AI Web Scraper

A smart web scraper that uses Large Language Models (LLMs) to understand page structure and extract data without brittle CSS selectors. Query websites using natural language and get structured data back.

Features

  • Natural Language Queries: Describe what you want to extract in plain English
  • Intelligent Extraction: Uses LLMs (OpenAI GPT or Anthropic Claude) to understand page structure
  • No Brittle Selectors: No need to maintain CSS selectors or XPath expressions
  • JavaScript Rendering: Built on Playwright for full dynamic content support
  • Structured Output: Returns data in structured Pydantic models with confidence scores
  • Flexible Configuration: Support for multiple LLM providers and customizable scraping behavior
  • Type-Safe: Full type hints and Pydantic validation

Installation

Prerequisites

  • Python 3.9+
  • Playwright browsers

Setup

  1. Clone the repository:
     git clone https://github.com/gabubu-dev/ai-web-scraper.git
     cd ai-web-scraper
  2. Install dependencies:
     pip install -r requirements.txt
  3. Install Playwright browsers:
     playwright install chromium
  4. Configure API keys:
     cp .env.example .env
     # Edit .env and add your API keys

Configuration

Environment Variables

Create a .env file with your LLM API keys:

OPENAI_API_KEY=your-openai-key-here
ANTHROPIC_API_KEY=your-anthropic-key-here
LLM_PROVIDER=openai
LLM_MODEL=gpt-4-turbo-preview
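As a rough sketch of how these variables might be resolved (the actual precedence logic lives in src/config.py and may differ; `llm_settings_from_env` is a hypothetical helper, not part of the package):

```python
import os

def llm_settings_from_env(env=os.environ):
    """Resolve LLM settings from environment variables, with defaults.

    Hypothetical helper mirroring the variables above: the provider
    chooses which API-key variable is consulted.
    """
    provider = env.get("LLM_PROVIDER", "openai")
    key_var = "OPENAI_API_KEY" if provider == "openai" else "ANTHROPIC_API_KEY"
    return {
        "provider": provider,
        "model": env.get("LLM_MODEL", "gpt-4-turbo-preview"),
        "api_key": env.get(key_var),
    }

# Pass an explicit mapping instead of the process environment:
settings = llm_settings_from_env({"LLM_PROVIDER": "anthropic",
                                  "ANTHROPIC_API_KEY": "sk-test"})
```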

Configuration File

Alternatively, use a config.json file (see config.example.json):

{
  "llm": {
    "provider": "openai",
    "model": "gpt-4-turbo-preview",
    "api_key": "your-api-key",
    "temperature": 0.1,
    "max_tokens": 4096
  },
  "scraper": {
    "timeout": 30000,
    "headless": true,
    "javascript_enabled": true
  }
}
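A minimal sketch of how such a file could be merged over built-in defaults, using only the standard library (assumed behavior; the real Config class in src/config.py may merge differently or pull in environment variables as well):

```python
import json
from pathlib import Path

# Defaults mirroring config.example.json above.
DEFAULTS = {
    "llm": {"provider": "openai", "model": "gpt-4-turbo-preview",
            "temperature": 0.1, "max_tokens": 4096},
    "scraper": {"timeout": 30000, "headless": True,
                "javascript_enabled": True},
}

def load_config(path):
    """Merge a config.json file over the defaults, one section at a time."""
    merged = {k: dict(v) for k, v in DEFAULTS.items()}
    data = json.loads(Path(path).read_text())
    for section, values in data.items():
        merged.setdefault(section, {}).update(values)
    return merged
```

With this scheme a user-supplied file only needs to list the keys it overrides; everything else keeps its default.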

Usage

Command Line

# Basic usage
python -m src.cli "https://example.com" "Extract all product names and prices"

# With output file
python -m src.cli "https://example.com" "Get article title and author" -o output.json --pretty

# Specify provider and model
python -m src.cli "https://example.com" "Extract data" --provider anthropic --model claude-3-sonnet-20240229

# Wait for specific element
python -m src.cli "https://example.com" "Extract products" --wait-for ".product-list"

Python API

Basic Usage

from src.extractor import DataExtractor

# Initialize extractor
extractor = DataExtractor()

# Extract data with natural language query
result = extractor.extract_from_url(
    url="https://example.com/products",
    query="Extract all product names, prices, and descriptions"
)

if result.success:
    for field in result.data:
        print(f"{field.name}: {field.value}")
else:
    print(f"Error: {result.error}")

Structured Extraction with Schema

from src.extractor import DataExtractor
from src.models import ExtractionRequest

# Define expected schema
schema = {
    "products": [
        {
            "name": "string",
            "price": "number",
            "in_stock": "boolean"
        }
    ]
}

extractor = DataExtractor()

request = ExtractionRequest(
    url="https://example.com/shop",
    query="Extract all products with name, price, and stock status",
    schema=schema
)

result = extractor.extract(request)
print(result.raw_data)

Custom Configuration

from pathlib import Path
from src.config import Config
from src.extractor import DataExtractor

# Load custom configuration
config = Config(config_path=Path("config.json"))

# Create extractor with custom config
extractor = DataExtractor(config)

result = extractor.extract_from_url(
    url="https://example.com",
    query="Extract article metadata"
)

Architecture

Core Components

  • WebScraper (src/scraper.py): Playwright-based web scraper with JavaScript rendering
  • DataExtractor (src/extractor.py): Orchestrates scraping and LLM-based extraction
  • LLMProvider (src/llm.py): Abstraction for different LLM providers (OpenAI, Anthropic)
  • Models (src/models.py): Pydantic models for type-safe data handling
  • Config (src/config.py): Configuration management with environment variable support

How It Works

  1. Fetch Page: Playwright loads the page with full JavaScript execution
  2. Clean HTML: Remove scripts, styles, and unnecessary elements
  3. LLM Analysis: Send cleaned HTML to LLM with natural language query
  4. Parse Response: Convert LLM response to structured Pydantic models
  5. Return Results: Return typed results with confidence scores and metadata
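Step 2 can be illustrated with a stripped-down cleaner built on the standard library's html.parser; the project may well use a dedicated HTML library instead, so treat this as a sketch of the idea, not the actual implementation:

```python
from html.parser import HTMLParser

class _Cleaner(HTMLParser):
    """Collect visible text, skipping <script>/<style> blocks (step 2)."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean_html(html: str) -> str:
    parser = _Cleaner()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Shrinking the page to visible text like this keeps the LLM prompt (step 3) well under the model's context limit and reduces token cost.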

Examples

See the examples/ directory for more usage examples:

  • basic_usage.py: Simple extraction examples
  • structured_extraction.py: Using schemas for structured data

Testing

Run tests with pytest:

pip install pytest
pytest tests/

API Reference

DataExtractor

Main class for data extraction.

Methods:

  • extract(request: ExtractionRequest) -> ExtractionResult: Extract data using request object
  • extract_from_url(url: str, query: str, wait_for_selector: Optional[str] = None) -> ExtractionResult: Convenience method for URL extraction
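A plausible reading of how the two methods relate: extract_from_url wraps its arguments in a request object and delegates to extract. The types below are stand-ins for illustration, not the real classes in src/models.py:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionRequest:  # stand-in for src.models.ExtractionRequest
    url: str
    query: str
    wait_for_selector: Optional[str] = None

class DataExtractor:
    def extract(self, request: ExtractionRequest):
        # The real version scrapes the page and calls the LLM;
        # here we just echo the request for illustration.
        return {"url": request.url, "query": request.query}

    def extract_from_url(self, url, query, wait_for_selector=None):
        """Convenience wrapper: build a request, then delegate."""
        return self.extract(ExtractionRequest(url, query, wait_for_selector))
```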

ExtractionRequest

Request model for data extraction.

Fields:

  • url: URL to scrape
  • query: Natural language description of what to extract
  • schema: Optional JSON schema for structured output
  • wait_for_selector: Optional CSS selector to wait for

ExtractionResult

Result model containing extracted data.

Fields:

  • success: Whether extraction succeeded
  • data: List of ExtractedField objects
  • metadata: PageMetadata with URL, title, etc.
  • raw_data: Raw dictionary of extracted data
  • error: Error message if failed
  • tokens_used: Number of LLM tokens used
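To make the field lists above concrete, here is a simplified stdlib sketch of the result models; the real ones are Pydantic models in src/models.py with validation and additional fields (ExtractedField's shape, including the confidence score, is assumed from the feature list):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExtractedField:
    name: str
    value: object
    confidence: float = 1.0   # assumed range 0.0-1.0

@dataclass
class ExtractionResult:
    success: bool
    data: list = field(default_factory=list)
    raw_data: dict = field(default_factory=dict)
    error: Optional[str] = None
    tokens_used: int = 0

result = ExtractionResult(success=True,
                          data=[ExtractedField("title", "Example", 0.9)])
```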

Supported LLM Providers

  • OpenAI: GPT-4 Turbo, GPT-3.5
  • Anthropic: Claude 3 (Opus, Sonnet, Haiku)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feat/amazing-feature)
  3. Commit your changes (git commit -m 'feat: add amazing feature')
  4. Push to the branch (git push origin feat/amazing-feature)
  5. Open a Pull Request

License

MIT License - see LICENSE file for details
