A smart web scraper that uses Large Language Models (LLMs) to understand page structure and extract data without brittle CSS selectors. Query websites using natural language and get structured data back.
- Natural Language Queries: Describe what you want to extract in plain English
- Intelligent Extraction: Uses LLMs (OpenAI GPT or Anthropic Claude) to understand page structure
- No Brittle Selectors: No need to maintain CSS selectors or XPath expressions
- JavaScript Rendering: Built on Playwright for full dynamic content support
- Structured Output: Returns data in structured Pydantic models with confidence scores
- Flexible Configuration: Support for multiple LLM providers and customizable scraping behavior
- Type-Safe: Full type hints and Pydantic validation
- Python 3.9+
- Playwright browsers
- Clone the repository:

```bash
git clone https://github.com/gabubu-dev/ai-web-scraper.git
cd ai-web-scraper
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install Playwright browsers:

```bash
playwright install chromium
```

- Configure API keys:

```bash
cp .env.example .env
# Edit .env and add your API keys
```

Create a `.env` file with your LLM API keys:

```bash
OPENAI_API_KEY=your-openai-key-here
ANTHROPIC_API_KEY=your-anthropic-key-here
LLM_PROVIDER=openai
LLM_MODEL=gpt-4-turbo-preview
```

Alternatively, use a `config.json` file (see `config.example.json`):
```json
{
  "llm": {
    "provider": "openai",
    "model": "gpt-4-turbo-preview",
    "api_key": "your-api-key",
    "temperature": 0.1,
    "max_tokens": 4096
  },
  "scraper": {
    "timeout": 30000,
    "headless": true,
    "javascript_enabled": true
  }
}
```

```bash
# Basic usage
python -m src.cli "https://example.com" "Extract all product names and prices"

# With output file
python -m src.cli "https://example.com" "Get article title and author" -o output.json --pretty

# Specify provider and model
python -m src.cli "https://example.com" "Extract data" --provider anthropic --model claude-3-sonnet-20240229

# Wait for a specific element
python -m src.cli "https://example.com" "Extract products" --wait-for ".product-list"
```

```python
from src.extractor import DataExtractor

# Initialize extractor
extractor = DataExtractor()

# Extract data with a natural language query
result = extractor.extract_from_url(
    url="https://example.com/products",
    query="Extract all product names, prices, and descriptions"
)

if result.success:
    for field in result.data:
        print(f"{field.name}: {field.value}")
else:
    print(f"Error: {result.error}")
```

```python
from src.extractor import DataExtractor
from src.models import ExtractionRequest

# Define the expected schema
schema = {
    "products": [
        {
            "name": "string",
            "price": "number",
            "in_stock": "boolean"
        }
    ]
}

extractor = DataExtractor()
request = ExtractionRequest(
    url="https://example.com/shop",
    query="Extract all products with name, price, and stock status",
    schema=schema
)
result = extractor.extract(request)
print(result.raw_data)
```

```python
from pathlib import Path

from src.config import Config
from src.extractor import DataExtractor

# Load custom configuration
config = Config(config_path=Path("config.json"))

# Create an extractor with the custom config
extractor = DataExtractor(config)
result = extractor.extract_from_url(
    url="https://example.com",
    query="Extract article metadata"
)
```

- WebScraper (`src/scraper.py`): Playwright-based web scraper with JavaScript rendering
- DataExtractor (`src/extractor.py`): Orchestrates scraping and LLM-based extraction
- LLMProvider (`src/llm.py`): Abstraction for different LLM providers (OpenAI, Anthropic)
- Models (`src/models.py`): Pydantic models for type-safe data handling
- Config (`src/config.py`): Configuration management with environment variable support
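As a rough sketch of the environment-variable support mentioned for `Config`, settings might be resolved along these lines. This is illustrative only: the function name, defaults, and return shape here are assumptions, and the project's actual logic lives in `src/config.py`.

```python
import os
from typing import Optional

# Illustrative resolver mirroring the .env keys documented above;
# the real Config class (src/config.py) may structure this differently.
def resolve_llm_settings(env: Optional[dict] = None) -> dict:
    env = env if env is not None else dict(os.environ)
    return {
        "provider": env.get("LLM_PROVIDER", "openai"),
        "model": env.get("LLM_MODEL", "gpt-4-turbo-preview"),
        "api_key": env.get("OPENAI_API_KEY", ""),
    }

settings = resolve_llm_settings({"LLM_PROVIDER": "anthropic"})
```

Passing the environment in as a plain dict keeps the resolver easy to unit-test without mutating `os.environ`.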
- Fetch Page: Playwright loads the page with full JavaScript execution
- Clean HTML: Remove scripts, styles, and unnecessary elements
- LLM Analysis: Send cleaned HTML to LLM with natural language query
- Parse Response: Convert LLM response to structured Pydantic models
- Return Results: Return typed results with confidence scores and metadata
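The "Clean HTML" step above can be sketched with the standard library alone. The project itself uses BeautifulSoup for this; the regex-based version below is a simplified stand-in, not the actual implementation.

```python
import re

def clean_html(html: str) -> str:
    # Drop <script> and <style> blocks entirely, including their contents.
    html = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    # Strip HTML comments.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    # Collapse runs of whitespace so the LLM prompt stays compact.
    return re.sub(r"\s+", " ", html).strip()

page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><script>alert(1)</script><p>Hello</p></body></html>"
)
cleaned = clean_html(page)
```

Shrinking the page before the LLM call matters because token usage (surfaced as `tokens_used` in the result) scales with the amount of HTML sent.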
See the `examples/` directory for more usage examples:

- `basic_usage.py`: Simple extraction examples
- `structured_extraction.py`: Using schemas for structured data

Run tests with pytest:

```bash
pip install pytest
pytest tests/
```

DataExtractor: Main class for data extraction.
Methods:

- `extract(request: ExtractionRequest) -> ExtractionResult`: Extract data using a request object
- `extract_from_url(url: str, query: str, wait_for_selector: Optional[str] = None) -> ExtractionResult`: Convenience method for URL extraction
ExtractionRequest: Request model for data extraction.

Fields:

- `url`: URL to scrape
- `query`: Natural language description of what to extract
- `schema`: Optional JSON schema for structured output
- `wait_for_selector`: Optional CSS selector to wait for
ExtractionResult: Result model containing extracted data.

Fields:

- `success`: Whether extraction succeeded
- `data`: List of `ExtractedField` objects
- `metadata`: `PageMetadata` with URL, title, etc.
- `raw_data`: Raw dictionary of extracted data
- `error`: Error message if extraction failed
- `tokens_used`: Number of LLM tokens used
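As an illustration of how per-field confidence scores might be consumed, here is a stdlib-only sketch. It approximates `ExtractedField` with a dataclass and assumes each field carries a 0.0-1.0 `confidence` attribute; the real Pydantic model lives in `src/models.py` and its exact fields may differ.

```python
from dataclasses import dataclass
from typing import Any

# Stdlib stand-in for the project's ExtractedField Pydantic model
# (src/models.py); field names here are assumptions for illustration.
@dataclass
class ExtractedField:
    name: str
    value: Any
    confidence: float  # assumed per-field confidence, 0.0-1.0

fields = [
    ExtractedField("title", "Mechanical Keyboard", 0.98),
    ExtractedField("price", 79.99, 0.91),
    ExtractedField("sku", "unknown", 0.35),
]

# Keep only values the LLM was reasonably confident about.
trusted = {f.name: f.value for f in fields if f.confidence >= 0.8}
```

Thresholding like this is a common way to decide which LLM-extracted values to trust automatically and which to route to manual review.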
- OpenAI: GPT-4 Turbo, GPT-3.5
- Anthropic: Claude 3 (Opus, Sonnet, Haiku)
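Supporting multiple providers behind one interface can be pictured roughly as follows. This is a sketch, not the actual interface in `src/llm.py`: the names `LLMProvider`, `complete`, and `EchoProvider` are illustrative.

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Illustrative provider interface; the real one lives in src/llm.py."""

    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Send a prompt to the backing LLM and return its text response."""

class EchoProvider(LLMProvider):
    """Stand-in provider for tests: canned response, no network calls."""

    def complete(self, prompt: str) -> str:
        return '{"fields": []}'

def extract_with(provider: LLMProvider, html: str, query: str) -> str:
    # The extractor depends only on the abstract interface, so providers
    # (OpenAI, Anthropic, or a test stub) are interchangeable.
    prompt = f"Query: {query}\nHTML:\n{html}"
    return provider.complete(prompt)

response = extract_with(EchoProvider(), "<html></html>", "Extract nothing")
```

A stub provider like `EchoProvider` is also how the test suite can exercise extraction logic without spending API tokens.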
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feat/amazing-feature`)
- Commit your changes (`git commit -m 'feat: add amazing feature'`)
- Push to the branch (`git push origin feat/amazing-feature`)
- Open a Pull Request
MIT License - see LICENSE file for details
- Built with Playwright for web scraping
- Uses BeautifulSoup for HTML parsing
- Powered by OpenAI and Anthropic LLMs
- Type safety with Pydantic