Create a registry of site-specific "recipes" or schemas that encapsulate the best way to download PDFs from different publishers/websites. This allows fetcharoo to intelligently handle different site structures without users needing to figure out the right parameters each time.
## Motivation

Different sites have different PDF structures:

- **Springer**: chapters named `978-3-xxx_N.pdf`, need numeric sorting
- **arXiv**: usually a single PDF, straightforward
- **IEEE/ACM**: may require authentication, have supplements
- **University sites**: wildly varying structures

Currently, users must manually specify `sort_by`, include/exclude patterns, etc. A schema registry would provide sensible defaults that "just work."
## Proposed Design

### 1. `SiteSchema` Base Class

```python
from dataclasses import dataclass
from typing import Optional, List, Callable
import re


@dataclass
class SiteSchema:
    """Base class for site-specific download configurations."""
    name: str
    url_pattern: str                 # Regex to match URLs

    # PDF discovery
    include_patterns: List[str] = None
    exclude_patterns: List[str] = None
    pdf_selector: str = None         # CSS selector for PDF links

    # Ordering
    sort_by: str = None              # 'numeric', 'alpha', 'alpha_desc', 'none'
    sort_key: Callable = None

    # Output
    default_output_name: str = None  # e.g., "{title}.pdf"

    # Behavior
    recommended_depth: int = 1
    request_delay: float = 1.0

    def matches(self, url: str) -> bool:
        return bool(re.match(self.url_pattern, url))

    def extract_metadata(self, url: str, html: str) -> dict:
        """Override to extract title, author, etc. from the page."""
        return {}
```
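Since `fetcharoo` itself is still a proposal, the matching behavior can be sanity-checked with a trimmed-down stand-in for the dataclass above (only `name`, `url_pattern`, and `matches` are reproduced here):

```python
import re
from dataclasses import dataclass


@dataclass
class SiteSchema:
    """Trimmed stand-in for the proposed dataclass (illustration only)."""
    name: str
    url_pattern: str

    def matches(self, url: str) -> bool:
        # re.match anchors at the start of the string, so the pattern
        # only needs to describe the URL prefix plus a trailing .*
        return bool(re.match(self.url_pattern, url))


springer = SiteSchema(
    name="springer_book",
    url_pattern=r"https?://link\.springer\.com/book/.*",
)
print(springer.matches("https://link.springer.com/book/10.1007/978-3-031-41026-0"))  # True
print(springer.matches("https://arxiv.org/abs/2301.00001"))  # False
```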
### 2. Built-in Schemas

```python
# fetcharoo/schemas/springer.py
import re

from bs4 import BeautifulSoup


class SpringerBook(SiteSchema):
    name = "springer_book"
    url_pattern = r"https?://link\.springer\.com/book/.*"
    include_patterns = ["*_*.pdf"]        # Chapter PDFs have an underscore
    exclude_patterns = ["*bbm*", "*fm*"]  # Exclude front/back matter
    sort_by = "numeric"
    recommended_depth = 1
    request_delay = 1.0

    def sort_key(self, url):
        # Extract the chapter number from 978-3-xxx_N.pdf
        match = re.search(r'_(\d+)\.pdf$', url)
        return int(match.group(1)) if match else float('inf')

    def extract_metadata(self, url, html):
        # Extract the book title from the page
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.select_one('h1.c-article-title')
        return {'title': title.text if title else 'book'}


class SpringerChaptersOnly(SpringerBook):
    """Download only individual chapters, not the full-book PDF."""
    name = "springer_chapters"
    exclude_patterns = ["*bbm*", "*fm*", "*978-3-*-*-?.pdf"]  # Exclude the full book
```

```python
# fetcharoo/schemas/arxiv.py
class ArxivPaper(SiteSchema):
    name = "arxiv"
    url_pattern = r"https?://arxiv\.org/(abs|pdf)/.*"
    sort_by = "none"
    recommended_depth = 0

    def extract_metadata(self, url, html):
        # Extract the paper ID and title
        ...
```
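The `sort_by = "numeric"` default matters because a plain lexicographic sort puts chapter 10 before chapter 2. A standalone sketch of the `sort_key` logic above (file names are made up for illustration):

```python
import re


def chapter_sort_key(url: str):
    # Same idea as SpringerBook.sort_key: pull the trailing chapter number
    match = re.search(r'_(\d+)\.pdf$', url)
    return int(match.group(1)) if match else float('inf')


urls = ["978-3-example_10.pdf", "978-3-example_2.pdf", "978-3-example_1.pdf"]

print(sorted(urls))                        # lexicographic: _1, _10, _2
print(sorted(urls, key=chapter_sort_key))  # numeric: _1, _2, _10
```

Unmatched names sort last via `float('inf')`, so a stray PDF without a chapter number does not break the ordering.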
### 3. Schema Registry

```python
# fetcharoo/schemas/registry.py
from typing import Dict, List, Optional

_SCHEMAS: Dict[str, SiteSchema] = {}


def register_schema(schema: SiteSchema):
    """Register a schema in the global registry."""
    _SCHEMAS[schema.name] = schema


def get_schema(name: str) -> Optional[SiteSchema]:
    """Get a schema by name."""
    return _SCHEMAS.get(name)


def detect_schema(url: str) -> Optional[SiteSchema]:
    """Auto-detect a schema from a URL."""
    for schema in _SCHEMAS.values():
        if schema.matches(url):
            return schema
    return None


def list_schemas() -> List[str]:
    """List all registered schema names."""
    return list(_SCHEMAS.keys())


# Decorator for easy registration
def schema(cls):
    register_schema(cls())
    return cls
```
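End to end, the registry pieces compose like this. The snippet is self-contained: it re-declares minimal versions of the registry functions rather than importing the not-yet-existing `fetcharoo.schemas`:

```python
import re

_SCHEMAS = {}


def register_schema(s):
    _SCHEMAS[s.name] = s


def schema(cls):
    # The decorator instantiates the class and registers the instance
    register_schema(cls())
    return cls


def detect_schema(url):
    for s in _SCHEMAS.values():
        if re.match(s.url_pattern, url):
            return s
    return None


@schema
class ArxivPaper:
    name = "arxiv"
    url_pattern = r"https?://arxiv\.org/(abs|pdf)/.*"


print(detect_schema("https://arxiv.org/abs/2301.00001").name)  # arxiv
print(detect_schema("https://example.com/paper"))              # None
```

One caveat the decorator implies: registration order decides which schema wins when two `url_pattern`s both match a URL, which may belong in the Open Questions list.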
### 4. Integration with Main API

```python
from fetcharoo import download_pdfs_from_webpage

# Auto-detect the schema from the URL
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/10.1007/978-3-031-41026-0',
    schema='auto'  # Detects SpringerBook
)

# Explicitly use a schema
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/...',
    schema='springer_chapters'
)

# Use a schema instance with overrides
from fetcharoo.schemas import SpringerBook
result = download_pdfs_from_webpage(
    url='...',
    schema=SpringerBook(request_delay=2.0)
)

# Schema parameters are defaults; explicit params override them
result = download_pdfs_from_webpage(
    url='...',
    schema='springer_book',
    sort_by='alpha'  # Overrides the schema's 'numeric'
)
```
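The "schema provides defaults, explicit params win" rule could be implemented by flattening the schema's fields into a dict and overlaying the caller's keyword arguments. A sketch only, with a two-field `Schema` standing in for `SiteSchema` (`resolve_options` is a hypothetical helper, not part of any current API):

```python
from dataclasses import asdict, dataclass


@dataclass
class Schema:
    """Two-field stand-in for SiteSchema (illustration only)."""
    sort_by: str = "numeric"
    request_delay: float = 1.0


def resolve_options(schema, **overrides):
    # Start from the schema's field values, then let explicit
    # (non-None) caller arguments take precedence
    options = asdict(schema)
    options.update({k: v for k, v in overrides.items() if v is not None})
    return options


print(resolve_options(Schema(), sort_by="alpha"))
# {'sort_by': 'alpha', 'request_delay': 1.0}
```

Filtering out `None` lets the real API keep `None` as its "not specified" sentinel for every keyword argument.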
### 5. CLI Integration

```shell
# Auto-detect
fetcharoo https://link.springer.com/book/... --schema auto

# Explicit schema
fetcharoo https://link.springer.com/book/... --schema springer_chapters

# List available schemas
fetcharoo --list-schemas
```
### 6. User-Defined Schemas

Users can add custom schemas in a config file or programmatically.
## Implementation Plan

- `SiteSchema` base dataclass
- `schema` parameter in `download_pdfs_from_webpage`
- `--schema` flag and `--list-schemas` in the CLI

## Open Questions