AI Web Scraper

A smart web scraper that uses Large Language Models (LLMs) to understand page structure and extract data without brittle CSS selectors. Query websites using natural language and get structured data back.

Features

  • Natural Language Queries: Describe what you want to extract in plain English
  • Intelligent Extraction: Uses LLMs (OpenAI GPT or Anthropic Claude) to understand page structure
  • No Brittle Selectors: No need to maintain CSS selectors or XPath expressions
  • JavaScript Rendering: Built on Playwright for full dynamic content support
  • Structured Output: Returns data in structured Pydantic models with confidence scores
  • Flexible Configuration: Support for multiple LLM providers and customizable scraping behavior
  • Type-Safe: Full type hints and Pydantic validation

Installation

Prerequisites

  • Python 3.9+
  • Playwright browsers

Setup

  1. Clone the repository:
     git clone https://github.com/gabubu-dev/ai-web-scraper.git
     cd ai-web-scraper
  2. Install dependencies:
     pip install -r requirements.txt
  3. Install Playwright browsers:
     playwright install chromium
  4. Configure API keys:
     cp .env.example .env
     # Edit .env and add your API keys

Configuration

Environment Variables

Create a .env file with your LLM API keys:

OPENAI_API_KEY=your-openai-key-here
ANTHROPIC_API_KEY=your-anthropic-key-here
LLM_PROVIDER=openai
LLM_MODEL=gpt-4-turbo-preview
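As a rough sketch of how these variables might be resolved (the actual precedence logic lives in src/config.py and may differ; `llm_settings_from_env` is a hypothetical helper, not part of the package):

```python
import os

def llm_settings_from_env(env=os.environ):
    """Resolve LLM settings from environment variables, with defaults.

    Hypothetical helper mirroring the variables above: the provider
    chooses which API-key variable is consulted.
    """
    provider = env.get("LLM_PROVIDER", "openai")
    key_var = "OPENAI_API_KEY" if provider == "openai" else "ANTHROPIC_API_KEY"
    return {
        "provider": provider,
        "model": env.get("LLM_MODEL", "gpt-4-turbo-preview"),
        "api_key": env.get(key_var),
    }

# Pass an explicit mapping instead of the process environment:
settings = llm_settings_from_env({"LLM_PROVIDER": "anthropic",
                                  "ANTHROPIC_API_KEY": "sk-test"})
```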

Configuration File

Alternatively, use a config.json file (see config.example.json):

{
  "llm": {
    "provider": "openai",
    "model": "gpt-4-turbo-preview",
    "api_key": "your-api-key",
    "temperature": 0.1,
    "max_tokens": 4096
  },
  "scraper": {
    "timeout": 30000,
    "headless": true,
    "javascript_enabled": true
  }
}
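A minimal sketch of how such a file could be merged over built-in defaults, using only the standard library (assumed behavior; the real Config class in src/config.py may merge differently or pull in environment variables as well):

```python
import json
from pathlib import Path

# Defaults mirroring config.example.json above.
DEFAULTS = {
    "llm": {"provider": "openai", "model": "gpt-4-turbo-preview",
            "temperature": 0.1, "max_tokens": 4096},
    "scraper": {"timeout": 30000, "headless": True,
                "javascript_enabled": True},
}

def load_config(path):
    """Merge a config.json file over the defaults, one section at a time."""
    merged = {k: dict(v) for k, v in DEFAULTS.items()}
    data = json.loads(Path(path).read_text())
    for section, values in data.items():
        merged.setdefault(section, {}).update(values)
    return merged
```

With this scheme a user-supplied file only needs to list the keys it overrides; everything else keeps its default.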

Usage

Command Line

# Basic usage
python -m src.cli "https://example.com" "Extract all product names and prices"

# With output file
python -m src.cli "https://example.com" "Get article title and author" -o output.json --pretty

# Specify provider and model
python -m src.cli "https://example.com" "Extract data" --provider anthropic --model claude-3-sonnet-20240229

# Wait for specific element
python -m src.cli "https://example.com" "Extract products" --wait-for ".product-list"

Python API

Basic Usage

from src.extractor import DataExtractor

# Initialize extractor
extractor = DataExtractor()

# Extract data with natural language query
result = extractor.extract_from_url(
    url="https://example.com/products",
    query="Extract all product names, prices, and descriptions"
)

if result.success:
    for field in result.data:
        print(f"{field.name}: {field.value}")
else:
    print(f"Error: {result.error}")

Structured Extraction with Schema

from src.extractor import DataExtractor
from src.models import ExtractionRequest

# Define expected schema
schema = {
    "products": [
        {
            "name": "string",
            "price": "number",
            "in_stock": "boolean"
        }
    ]
}

extractor = DataExtractor()

request = ExtractionRequest(
    url="https://example.com/shop",
    query="Extract all products with name, price, and stock status",
    schema=schema
)

result = extractor.extract(request)
print(result.raw_data)

Custom Configuration

from pathlib import Path
from src.config import Config
from src.extractor import DataExtractor

# Load custom configuration
config = Config(config_path=Path("config.json"))

# Create extractor with custom config
extractor = DataExtractor(config)

result = extractor.extract_from_url(
    url="https://example.com",
    query="Extract article metadata"
)

Architecture

Core Components

  • WebScraper (src/scraper.py): Playwright-based web scraper with JavaScript rendering
  • DataExtractor (src/extractor.py): Orchestrates scraping and LLM-based extraction
  • LLMProvider (src/llm.py): Abstraction for different LLM providers (OpenAI, Anthropic)
  • Models (src/models.py): Pydantic models for type-safe data handling
  • Config (src/config.py): Configuration management with environment variable support

How It Works

  1. Fetch Page: Playwright loads the page with full JavaScript execution
  2. Clean HTML: Remove scripts, styles, and unnecessary elements
  3. LLM Analysis: Send cleaned HTML to LLM with natural language query
  4. Parse Response: Convert LLM response to structured Pydantic models
  5. Return Results: Return typed results with confidence scores and metadata
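Step 2 can be illustrated with a stripped-down cleaner built on the standard library's html.parser; the project may well use a dedicated HTML library instead, so treat this as a sketch of the idea, not the actual implementation:

```python
from html.parser import HTMLParser

class _Cleaner(HTMLParser):
    """Collect visible text, skipping <script>/<style> blocks (step 2)."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean_html(html: str) -> str:
    parser = _Cleaner()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Shrinking the page to visible text like this keeps the LLM prompt (step 3) well under the model's context limit and reduces token cost.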

Examples

See the examples/ directory for more usage examples:

  • basic_usage.py: Simple extraction examples
  • structured_extraction.py: Using schemas for structured data

Testing

Run tests with pytest:

pip install pytest
pytest tests/

API Reference

DataExtractor

Main class for data extraction.

Methods:

  • extract(request: ExtractionRequest) -> ExtractionResult: Extract data using request object
  • extract_from_url(url: str, query: str, wait_for_selector: Optional[str] = None) -> ExtractionResult: Convenience method for URL extraction
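A plausible reading of how the two methods relate: extract_from_url wraps its arguments in a request object and delegates to extract. The types below are stand-ins for illustration, not the real classes in src/models.py:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionRequest:  # stand-in for src.models.ExtractionRequest
    url: str
    query: str
    wait_for_selector: Optional[str] = None

class DataExtractor:
    def extract(self, request: ExtractionRequest):
        # The real version scrapes the page and calls the LLM;
        # here we just echo the request for illustration.
        return {"url": request.url, "query": request.query}

    def extract_from_url(self, url, query, wait_for_selector=None):
        """Convenience wrapper: build a request, then delegate."""
        return self.extract(ExtractionRequest(url, query, wait_for_selector))
```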

ExtractionRequest

Request model for data extraction.

Fields:

  • url: URL to scrape
  • query: Natural language description of what to extract
  • schema: Optional JSON schema for structured output
  • wait_for_selector: Optional CSS selector to wait for

ExtractionResult

Result model containing extracted data.

Fields:

  • success: Whether extraction succeeded
  • data: List of ExtractedField objects
  • metadata: PageMetadata with URL, title, etc.
  • raw_data: Raw dictionary of extracted data
  • error: Error message if failed
  • tokens_used: Number of LLM tokens used
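To make the field lists above concrete, here is a simplified stdlib sketch of the result models; the real ones are Pydantic models in src/models.py with validation and additional fields (ExtractedField's shape, including the confidence score, is assumed from the feature list):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExtractedField:
    name: str
    value: object
    confidence: float = 1.0   # assumed range 0.0-1.0

@dataclass
class ExtractionResult:
    success: bool
    data: list = field(default_factory=list)
    raw_data: dict = field(default_factory=dict)
    error: Optional[str] = None
    tokens_used: int = 0

result = ExtractionResult(success=True,
                          data=[ExtractedField("title", "Example", 0.9)])
```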

Supported LLM Providers

  • OpenAI: GPT-4 Turbo, GPT-3.5
  • Anthropic: Claude 3 (Opus, Sonnet, Haiku)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feat/amazing-feature)
  3. Commit your changes (git commit -m 'feat: add amazing feature')
  4. Push to the branch (git push origin feat/amazing-feature)
  5. Open a Pull Request

License

MIT License - see LICENSE file for details
