Create a registry of site-specific "recipes" or schemas that encapsulate the best way to download PDFs from different publishers/websites. This allows fetcharoo to intelligently handle different site structures without users needing to figure out the right parameters each time.
## Motivation

Different sites have different PDF structures:

- **Springer**: chapters named `978-3-xxx_N.pdf`, need numeric sorting
- **arXiv**: usually a single PDF, straightforward
- **IEEE/ACM**: may require authentication, have supplements
- **University sites**: wildly varying structures

Currently, users must manually specify `sort_by`, include/exclude patterns, etc. A schema registry would provide sensible defaults that "just work."
## Proposed Design

### 1. `SiteSchema` Base Class

```python
from dataclasses import dataclass
from typing import Optional, List, Callable
import re


@dataclass
class SiteSchema:
    """Base class for site-specific download configurations."""
    name: str
    url_pattern: str                 # Regex to match URLs

    # PDF discovery
    include_patterns: List[str] = None
    exclude_patterns: List[str] = None
    pdf_selector: str = None         # CSS selector for PDF links

    # Ordering
    sort_by: str = None              # 'numeric', 'alpha', 'alpha_desc', 'none'
    sort_key: Callable = None

    # Output
    default_output_name: str = None  # e.g., "{title}.pdf"

    # Behavior
    recommended_depth: int = 1
    request_delay: float = 1.0

    def matches(self, url: str) -> bool:
        return bool(re.match(self.url_pattern, url))

    def extract_metadata(self, url: str, html: str) -> dict:
        """Override to extract title, author, etc. from the page."""
        return {}
```
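Since `fetcharoo` itself is still a proposal, the matching behavior can be sanity-checked with a trimmed-down stand-in for the dataclass above (only `name`, `url_pattern`, and `matches` are reproduced here):

```python
import re
from dataclasses import dataclass


@dataclass
class SiteSchema:
    """Trimmed stand-in for the proposed dataclass (illustration only)."""
    name: str
    url_pattern: str

    def matches(self, url: str) -> bool:
        # re.match anchors at the start of the string, so the pattern
        # only needs to describe the URL prefix plus a trailing .*
        return bool(re.match(self.url_pattern, url))


springer = SiteSchema(
    name="springer_book",
    url_pattern=r"https?://link\.springer\.com/book/.*",
)
print(springer.matches("https://link.springer.com/book/10.1007/978-3-031-41026-0"))  # True
print(springer.matches("https://arxiv.org/abs/2301.00001"))  # False
```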
### 2. Built-in Schemas

```python
# fetcharoo/schemas/springer.py
import re

from bs4 import BeautifulSoup


class SpringerBook(SiteSchema):
    name = "springer_book"
    url_pattern = r"https?://link\.springer\.com/book/.*"
    include_patterns = ["*_*.pdf"]        # Chapter PDFs have an underscore
    exclude_patterns = ["*bbm*", "*fm*"]  # Exclude front/back matter
    sort_by = "numeric"
    recommended_depth = 1
    request_delay = 1.0

    def sort_key(self, url):
        # Extract the chapter number from 978-3-xxx_N.pdf
        match = re.search(r'_(\d+)\.pdf$', url)
        return int(match.group(1)) if match else float('inf')

    def extract_metadata(self, url, html):
        # Extract the book title from the page
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.select_one('h1.c-article-title')
        return {'title': title.text if title else 'book'}


class SpringerChaptersOnly(SpringerBook):
    """Download only individual chapters, not the full-book PDF."""
    name = "springer_chapters"
    exclude_patterns = ["*bbm*", "*fm*", "*978-3-*-*-?.pdf"]  # Exclude the full book
```

```python
# fetcharoo/schemas/arxiv.py
class ArxivPaper(SiteSchema):
    name = "arxiv"
    url_pattern = r"https?://arxiv\.org/(abs|pdf)/.*"
    sort_by = "none"
    recommended_depth = 0

    def extract_metadata(self, url, html):
        # Extract the paper ID and title
        ...
```
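The `sort_by = "numeric"` default matters because a plain lexicographic sort puts chapter 10 before chapter 2. A standalone sketch of the `sort_key` logic above (file names are made up for illustration):

```python
import re


def chapter_sort_key(url: str):
    # Same idea as SpringerBook.sort_key: pull the trailing chapter number
    match = re.search(r'_(\d+)\.pdf$', url)
    return int(match.group(1)) if match else float('inf')


urls = ["978-3-example_10.pdf", "978-3-example_2.pdf", "978-3-example_1.pdf"]

print(sorted(urls))                        # lexicographic: _1, _10, _2
print(sorted(urls, key=chapter_sort_key))  # numeric: _1, _2, _10
```

Unmatched names sort last via `float('inf')`, so a stray PDF without a chapter number does not break the ordering.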
### 3. Schema Registry

```python
# fetcharoo/schemas/registry.py
from typing import Dict, List, Optional

_SCHEMAS: Dict[str, SiteSchema] = {}


def register_schema(schema: SiteSchema):
    """Register a schema in the global registry."""
    _SCHEMAS[schema.name] = schema


def get_schema(name: str) -> Optional[SiteSchema]:
    """Get a schema by name."""
    return _SCHEMAS.get(name)


def detect_schema(url: str) -> Optional[SiteSchema]:
    """Auto-detect a schema from a URL."""
    for schema in _SCHEMAS.values():
        if schema.matches(url):
            return schema
    return None


def list_schemas() -> List[str]:
    """List all registered schema names."""
    return list(_SCHEMAS.keys())


# Decorator for easy registration
def schema(cls):
    register_schema(cls())
    return cls
```
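End to end, the registry pieces compose like this. The snippet is self-contained: it re-declares minimal versions of the registry functions rather than importing the not-yet-existing `fetcharoo.schemas`:

```python
import re

_SCHEMAS = {}


def register_schema(s):
    _SCHEMAS[s.name] = s


def schema(cls):
    # The decorator instantiates the class and registers the instance
    register_schema(cls())
    return cls


def detect_schema(url):
    for s in _SCHEMAS.values():
        if re.match(s.url_pattern, url):
            return s
    return None


@schema
class ArxivPaper:
    name = "arxiv"
    url_pattern = r"https?://arxiv\.org/(abs|pdf)/.*"


print(detect_schema("https://arxiv.org/abs/2301.00001").name)  # arxiv
print(detect_schema("https://example.com/paper"))              # None
```

One caveat the decorator implies: registration order decides which schema wins when two `url_pattern`s both match a URL, which may belong in the Open Questions list.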
### 4. Integration with Main API

```python
from fetcharoo import download_pdfs_from_webpage

# Auto-detect the schema from the URL
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/10.1007/978-3-031-41026-0',
    schema='auto'  # Detects SpringerBook
)

# Explicitly use a schema
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/...',
    schema='springer_chapters'
)

# Use a schema instance with overrides
from fetcharoo.schemas import SpringerBook
result = download_pdfs_from_webpage(
    url='...',
    schema=SpringerBook(request_delay=2.0)
)

# Schema parameters are defaults; explicit params override them
result = download_pdfs_from_webpage(
    url='...',
    schema='springer_book',
    sort_by='alpha'  # Overrides the schema's 'numeric'
)
```
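The "schema provides defaults, explicit params win" rule could be implemented by flattening the schema's fields into a dict and overlaying the caller's keyword arguments. A sketch only, with a two-field `Schema` standing in for `SiteSchema` (`resolve_options` is a hypothetical helper, not part of any current API):

```python
from dataclasses import asdict, dataclass


@dataclass
class Schema:
    """Two-field stand-in for SiteSchema (illustration only)."""
    sort_by: str = "numeric"
    request_delay: float = 1.0


def resolve_options(schema, **overrides):
    # Start from the schema's field values, then let explicit
    # (non-None) caller arguments take precedence
    options = asdict(schema)
    options.update({k: v for k, v in overrides.items() if v is not None})
    return options


print(resolve_options(Schema(), sort_by="alpha"))
# {'sort_by': 'alpha', 'request_delay': 1.0}
```

Filtering out `None` lets the real API keep `None` as its "not specified" sentinel for every keyword argument.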
### 5. CLI Integration

```shell
# Auto-detect
fetcharoo https://link.springer.com/book/... --schema auto

# Explicit schema
fetcharoo https://link.springer.com/book/... --schema springer_chapters

# List available schemas
fetcharoo --list-schemas
```
### 6. User-Defined Schemas

Users can add custom schemas in a config file or programmatically.
## Implementation Plan

- `SiteSchema` base dataclass
- `schema` parameter in `download_pdfs_from_webpage`
- `--schema` flag and `--list-schemas` in the CLI

## Open Questions