A searchable catalog of all Apache Airflow providers and their modules (operators,
hooks, sensors, triggers, transfers, and more). Built with Eleventy
and deployed to airflow.apache.org/registry/.
- Python 3.10+ (for metadata extraction)
- Node.js 20+ and pnpm 9+
- `pyyaml` Python package (`uv pip install pyyaml`)
```bash
# 1. Extract metadata from provider.yaml files into JSON
uv run python dev/registry/extract_metadata.py

# 2. Install Node.js dependencies
cd registry
pnpm install

# 3. Start dev server (http://localhost:8080)
pnpm dev
```

The full registry data extraction (metadata, parameters, and connections) is available as a breeze subcommand:
```bash
breeze registry extract-data                    # Extract all registry data
breeze registry extract-data --python 3.12      # With a specific Python version
breeze registry extract-data --provider amazon  # Extract only one provider (incremental)
```

This runs inside the breeze CI container, where all providers are installed. It is
the same command used by CI in `registry-build.yml`.
The dev script sets `REGISTRY_PATH_PREFIX=/` so links work at the root during local
development. In production the prefix defaults to `/registry/`.
```text
provider.yaml files (providers/*/provider.yaml)
        │
        ▼
extract_metadata.py      ← Parses YAML, fetches PyPI stats, resolves logos
        │ → providers.json
        │
extract_parameters.py    ← Runtime class discovery + parameter extraction (breeze)
        │ → modules.json + parameters.json
        ▼
registry/src/_data/
├── providers.json       ← Provider metadata (name, versions, downloads, lifecycle, ...)
├── modules.json         ← Individual modules (operators, hooks, sensors, ...)
└── versions/{id}/{ver}/ ← Per-version metadata, parameters, connections
        │
        ▼
Eleventy build (pnpm build)  ← Generates static HTML + Pagefind search index
        │
        ▼
registry/_site/          ← Deployable static site
```
The root-level JSON files (`providers.json`, `modules.json`)
are generated artifacts and are listed in `.gitignore`. The `versions/` directory is also
gitignored. Only `exploreCategories.js`, `statsData.js`, `latestVersionData.js`, and
`providerVersions.js` are checked in because they contain hand-authored or computed logic.
`extract_metadata.py` (runs on host) walks every `providers/*/provider.yaml` and:

- Parses provider metadata — name, description, versions, dependencies
- Fetches PyPI download stats — calls `pypistats.org/api/packages/{name}/recent`
- Resolves logos — checks `public/logos/` for matching images
- Determines AIP-95 lifecycle stage — reads `lifecycle` from `provider.yaml`
- Writes `providers.json`
`extract_parameters.py` (runs inside breeze) handles module discovery and parameter
extraction:

- Discovers modules at runtime — imports each module listed in `provider.yaml`, iterates over classes with `inspect.getmembers()`, and uses `issubclass()` to classify them (operator, hook, sensor, trigger, etc.)
- Resolves documentation URLs via Sphinx inventory — downloads `objects.inv` files from S3 for each provider (cached locally with a 12-hour TTL) and parses them to get canonical documentation URLs for each class. Falls back to manual URL construction for providers not yet published. See Documentation URL Resolution below.
- Produces `modules.json` — the full module catalog with all 11 fields (id, name, type, import_path, module_path, short_description, docs_url, source_url, category, provider_id, provider_name)
- Extracts `__init__` parameters — walks the MRO and extracts parameter names, types, defaults, and docstrings. Writes `versions/{provider_id}/{version}/parameters.json`.
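The MRO walk for parameter extraction can be sketched as follows — a simplified illustration only (the helper name `extract_init_parameters` is hypothetical, and the real script also merges docstrings and handles many more edge cases):

```python
import inspect

def extract_init_parameters(cls):
    """Collect __init__ parameters for a class, walking the MRO so that
    parameters inherited from base classes are included."""
    params = {}
    # Walk base classes first so subclass definitions override inherited ones.
    for klass in reversed(cls.__mro__):
        init = klass.__dict__.get("__init__")
        if init is None:
            continue
        try:
            sig = inspect.signature(init)
        except (TypeError, ValueError):
            continue  # e.g., C-level slots without introspectable signatures
        for name, p in sig.parameters.items():
            if name == "self" or p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
                continue
            params[name] = {
                "type": None if p.annotation is p.empty else str(p.annotation),
                "default": None if p.default is p.empty else repr(p.default),
            }
    return params

class Base:
    def __init__(self, conn_id: str = "default"): ...

class MyOperator(Base):
    def __init__(self, bucket: str, retries: int = 3, **kwargs): ...

print(sorted(extract_init_parameters(MyOperator)))  # → ['bucket', 'conn_id', 'retries']
```

Walking the MRO in reverse means a subclass that redefines an inherited parameter (e.g., a narrower default) wins.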
`extract_connections.py` (runs inside breeze) reads `connection-types` from
`provider.yaml`, falling back to runtime inspection of hook
`get_connection_form_widgets()` and `get_ui_field_behaviour()`. Writes
`versions/{provider_id}/{version}/connections.json`.
The `docs_url` field in `modules.json` links each class to its Sphinx-generated API
reference page. Rather than constructing these URLs manually (which breaks if the Sphinx
output structure changes), the extractor uses Sphinx inventory files (`objects.inv`).
How it works:

1. Every Sphinx build produces an `objects.inv` file that maps every documented symbol to its URL. Apache publishes these for all released providers at `http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/{package_name}/stable/objects.inv`.
2. Before discovering modules, `extract_parameters.py` fetches inventories for all providers in parallel using a thread pool.
3. For each discovered class, it looks up the fully qualified name (e.g., `airflow.providers.amazon.hooks.s3.S3Hook`) in the inventory. If found, the inventory-sourced URL is used. If not (e.g., a brand-new class not yet in a published docs build), it falls back to manual URL construction.
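The `objects.inv` format itself is small: a four-line plain-text header followed by a zlib-compressed body of `name domain:role priority uri dispname` records. A hedged sketch of a minimal parser — illustrative only, not the project's actual code:

```python
import zlib

def parse_inventory(data: bytes, base_url: str) -> dict[str, str]:
    """Parse a Sphinx objects.inv (format version 2) into {symbol: url}.

    Layout: four '#'-prefixed header lines, then zlib-compressed lines of
        name domain:role priority uri dispname
    A trailing '$' in the uri is shorthand for the symbol name itself.
    (Simplified: assumes names contain no spaces, true for Python symbols.)
    """
    first_line, _, rest = data.partition(b"\n")
    if b"version 2" not in first_line:
        raise ValueError("unsupported inventory version")
    for _ in range(3):  # skip the remaining header lines
        _, _, rest = rest.partition(b"\n")
    mapping = {}
    for line in zlib.decompress(rest).decode("utf-8").splitlines():
        fields = line.split(None, 4)
        if len(fields) < 4:
            continue
        name, _role, _priority, uri = fields[:4]
        if uri.endswith("$"):
            uri = uri[:-1] + name
        mapping[name] = base_url.rstrip("/") + "/" + uri
    return mapping

# Demo with a hand-built inventory:
header = (b"# Sphinx inventory version 2\n"
          b"# Project: demo\n# Version: 1.0\n"
          b"# The remainder of this file is compressed using zlib.\n")
entry = b"pkg.mod.Klass py:class 1 _api/pkg/mod/index.html#$ -\n"
inv = parse_inventory(header + zlib.compress(entry), "https://docs.example/stable")
print(inv["pkg.mod.Klass"])
# → https://docs.example/stable/_api/pkg/mod/index.html#pkg.mod.Klass
```

The `#$` suffix in the demo uri is why inventory-sourced URLs carry the full class anchor without any manual string assembly.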
Caching:
Inventory files are cached locally in `dev/registry/.inventory_cache/` with a 12-hour
TTL. This matches the caching strategy used by
`devel-common/src/sphinx_exts/docs_build/fetch_inventories.py`. The cache directory is
gitignored.
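The TTL check itself is simple mtime arithmetic. A sketch under stated assumptions (the function names here are illustrative, not the extractor's actual helpers):

```python
import time
from pathlib import Path

CACHE_TTL_SECONDS = 12 * 60 * 60  # 12 hours

def is_cache_fresh(path: Path, ttl: int = CACHE_TTL_SECONDS) -> bool:
    """True if the cached file exists and is younger than the TTL."""
    return path.exists() and (time.time() - path.stat().st_mtime) < ttl

def fetch_inventory(url: str, cache_file: Path, download) -> bytes:
    """Serve from cache when fresh; otherwise call `download(url)` (any
    callable returning bytes) and refresh the cache file."""
    if is_cache_fresh(cache_file):
        return cache_file.read_bytes()
    data = download(url)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(data)
    return data
```

Passing the downloader in as a callable keeps the cache logic trivially testable without network access.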
Why not just construct URLs manually?
The previous approach assembled URLs like
`{base_docs_url}/_api/{module/path}/index.html#{full.class.path}`. This is fragile:
Sphinx can change its output layout, and some classes end up in different URL structures
than expected. The inventory file is the canonical source of truth — it's produced by
the same Sphinx build that generates the docs.
| File | Type | Purpose |
|---|---|---|
| `providers.json` | Generated | All providers with metadata, sorted alphabetically |
| `modules.json` | Generated | All extracted modules (operators, hooks, etc.) |
| `versions/` | Generated | Per-provider, per-version metadata/parameters/connections |
| `exploreCategories.js` | Checked-in | Category definitions with keyword lists for the Explore page |
| `statsData.js` | Checked-in | Computed statistics (lifecycle counts, top providers, etc.) |
| `providerVersions.js` | Checked-in | Builds the provider × version page collection |
| `latestVersionData.js` | Checked-in | Latest version parameters/connections lookup |
| Page | Template | URL |
|---|---|---|
| Home | `src/index.njk` | `/` |
| All Providers | `src/providers.njk` | `/providers/` |
| Explore by Category | `src/explore.njk` | `/explore/` |
| Statistics | `src/stats.njk` | `/stats/` |
| Provider Detail | `src/provider-detail.njk` | `/providers/{id}/` (redirects to latest) |
| Provider Version | `src/provider-version.njk` | `/providers/{id}/{version}/` |
| Script | Purpose |
|---|---|
| `js/provider-filters.js` | Search, lifecycle filter, category filter, sort on `/providers/` |
| `js/search.js` | Global search modal (Cmd+K) powered by Pagefind |
| `js/provider-detail.js` | Module tabs, copy-to-clipboard on provider version pages |
| `js/connection-builder.js` | Interactive connection form builder on provider detail pages |
| `js/copy-button.js` | Generic copy button utility |
| `js/theme.js` | Dark/light mode toggle |
| `js/mobile-menu.js` | Responsive navigation |
The site is deployed under `/registry/` on airflow.apache.org. Eleventy's `pathPrefix`
is configured via the `REGISTRY_PATH_PREFIX` environment variable:

- Production: `REGISTRY_PATH_PREFIX=/registry/` (the default)
- Local dev: `REGISTRY_PATH_PREFIX=/` (set automatically by `pnpm dev`)
All internal links in templates use the `| url` Nunjucks filter, which prepends the
prefix. Client-side JavaScript accesses the base path via `window.__REGISTRY_BASE__`
(injected in `base.njk`).
Full-text search is powered by Pagefind. During postbuild:

- `scripts/build-pagefind-index.mjs` creates a Pagefind index with custom records from `providers.json` and `modules.json`
- URLs in the index are prefixed with `REGISTRY_PATH_PREFIX`
- The client loads Pagefind lazily on first search interaction (`js/search.js`)
Providers follow the AIP-95 lifecycle stages:

| Stage | Meaning |
|---|---|
| `incubation` | New provider, API may change |
| `production` / `stable` | Stable, recommended for use |
| `mature` | Well-established, widely adopted |
| `deprecated` | No longer maintained, consider alternatives |

The UI displays `stable` for both `production` and `mature` stages.
The Explore page and provider filtering use categories defined in
`src/_data/exploreCategories.js`. Each category has:

- `id` — URL-safe identifier (e.g., `cloud`, `databases`, `ai-ml`)
- `name` — Display name
- `keywords` — List of substrings matched against `provider.id`
- `icon`, `color`, `description` — Visual properties
Providers are assigned to categories at build time by checking if any keyword in a category matches (substring) the provider's ID. A provider can belong to multiple categories.
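The assignment rule can be illustrated with a small sketch (the category data here is a hypothetical subset; the real definitions live in `exploreCategories.js`):

```python
# Hypothetical subset of the category definitions in exploreCategories.js.
CATEGORIES = {
    "cloud": ["amazon", "google", "microsoft"],
    "databases": ["postgres", "mysql", "mongo"],
}

def categories_for(provider_id: str) -> list[str]:
    """A provider belongs to every category that has a keyword occurring
    as a substring of its ID; multiple matches are allowed."""
    return [cat for cat, keywords in CATEGORIES.items()
            if any(kw in provider_id for kw in keywords)]

print(categories_for("apache-airflow-providers-postgres"))  # → ['databases']
```

Substring matching keeps the category file short: one keyword such as `amazon` covers every module page of that provider without enumerating them.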
The homepage has a "New Providers" section powered by dates fetched from the PyPI JSON
API during extraction. It shows providers sorted by `first_released` (the upload date
of their earliest PyPI release) descending, highlighting providers that are new to the
ecosystem.
The `src/api/` directory contains Eleventy templates that generate JSON API endpoints,
providing programmatic access to provider and module data:

- `/api/providers.json` — All providers
- `/api/modules.json` — All modules
- `/api/providers/{id}/modules.json` — Modules for a specific provider
- `/api/providers/{id}/parameters.json` — Parameters for a provider
- `/api/providers/{id}/connections.json` — Connection types
- `/api/providers/{id}/versions.json` — Deployed versions (generated by `publish_versions.py` from S3)
- `/api/providers/{id}/{version}/modules.json` — Version-specific modules
- `/api/providers/{id}/{version}/parameters.json` — Version-specific parameters
- `/api/providers/{id}/{version}/connections.json` — Version-specific connections
The registry supports two build modes: full builds (all providers) and incremental builds (a single provider).
Each full CI run builds pages for only the latest version of each provider. Old version pages persist in S3 from previous deploys.
This follows the same pattern as Airflow docs (see `publish_docs_to_s3.py` and
`packages-metadata.json`): the source of truth for which versions exist is the S3
bucket itself, not git or a stored manifest.
1. CI extracts latest data — `extract_metadata.py` writes `providers.json` and `extract_parameters.py` writes `modules.json` with all known versions (from `provider.yaml`), but only the latest version gets a full page built by Eleventy.
2. S3 sync without `--delete` — new pages are uploaded; old version pages already in S3 are left untouched.
3. `publish_versions.py` — after sync, this script lists S3 directories under `providers/{id}/` to discover every deployed version, then writes `api/providers/{id}/versions.json` with the full version list.
4. Client-side dropdown — `provider-detail.js` fetches `versions.json` on page load and replaces the static `<select>` options, so even old pages get an up-to-date dropdown. The statically rendered dropdown is the fallback if the fetch fails.
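Listing "directories" in S3 means calling `list_objects_v2` with `Delimiter='/'` and reading the returned `CommonPrefixes`. A sketch of the prefix-to-version transformation that `publish_versions.py` performs, using a hand-built listing in place of a live S3 call (the key layout is illustrative):

```python
def deployed_versions(common_prefixes: list[dict], base_prefix: str) -> list[str]:
    """Turn CommonPrefixes entries (as returned by S3 list_objects_v2 with
    Delimiter='/') into bare version strings, newest first."""
    versions = []
    for entry in common_prefixes:
        prefix = entry["Prefix"]  # e.g. 'registry/providers/amazon/9.22.0/'
        if prefix.startswith(base_prefix):
            versions.append(prefix[len(base_prefix):].rstrip("/"))
    # Numeric sort so '9.22.0' outranks '9.9.0' (a string sort would not).
    versions.sort(key=lambda v: [int(p) for p in v.split(".") if p.isdigit()],
                  reverse=True)
    return versions

listing = [{"Prefix": "registry/providers/amazon/9.9.0/"},
           {"Prefix": "registry/providers/amazon/9.22.0/"}]
print(deployed_versions(listing, "registry/providers/amazon/"))
# → ['9.22.0', '9.9.0']
```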
When `publish-docs-to-s3.yml` publishes provider docs (e.g., `providers-amazon/9.22.0`),
it triggers `registry-build.yml` with the provider ID. The incremental flow:
1. Download existing data — `providers.json` and `modules.json` are fetched from the current S3 bucket (`/api/providers.json`, `/api/modules.json`).
2. Extract single provider — `extract_metadata.py --provider amazon` extracts metadata and PyPI stats; `extract_parameters.py` discovers modules for only the specified provider.
3. Merge — `merge_registry_data.py` replaces the updated provider's entries in the downloaded JSON while keeping all other providers intact.
4. Build site — Eleventy builds all pages from the merged data; Pagefind indexes all records.
5. S3 sync — only changed pages are uploaded (S3 sync diffs).
6. Publish versions — `publish_versions.py` updates `api/providers/{id}/versions.json`.
The merge script (`dev/registry/merge_registry_data.py`) handles edge cases:

- First deploy (no existing data on S3): uses the single-provider output as-is.
- Missing modules file: treated as empty.
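The core replace-and-keep operation can be sketched as follows — a simplification of what `merge_registry_data.py` does (the `provider_id` key name is an assumption):

```python
def merge(existing: list[dict], new: list[dict], key: str = "provider_id") -> list[dict]:
    """Replace the updated provider's entries while keeping everyone else.

    `existing` may be empty (first deploy), in which case the new
    single-provider output is used as-is.
    """
    updated_ids = {item[key] for item in new}
    kept = [item for item in existing if item[key] not in updated_ids]
    return sorted(kept + new, key=lambda item: item[key])

old = [{"provider_id": "amazon", "version": "9.9.0"},
       {"provider_id": "google", "version": "12.0.0"}]
fresh = [{"provider_id": "amazon", "version": "9.22.0"}]
print(merge(old, fresh))
# amazon's entry is replaced; google's is kept unchanged
```

Keying the merge on provider ID rather than list position is what makes the operation safe regardless of how the downloaded JSON is ordered.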
To run an incremental build locally:
```bash
# Extract only amazon
breeze registry extract-data --python 3.12 --provider amazon

# If you have existing full JSON from a previous build or S3 download:
uv run dev/registry/merge_registry_data.py \
  --existing-providers /tmp/existing/providers.json \
  --existing-modules /tmp/existing/modules.json \
  --new-providers dev/registry/providers.json \
  --new-modules dev/registry/modules.json \
  --output dev/registry/

# Build site from merged data
cd registry && pnpm build
```

To populate S3 with historical version pages (e.g., when setting up a new bucket),
temporarily restore the older-versions loop in `providerVersions.js` so Eleventy
builds all version pages, then:
```bash
# Extract metadata (includes all versions from provider.yaml)
uv run python dev/registry/extract_metadata.py

# Build the full site (with older-versions loop enabled)
cd registry && pnpm build

# Sync everything, then generate versions.json
aws s3 sync registry/_site/ s3://bucket/registry/ --cache-control "public, max-age=300"
breeze registry publish-versions --s3-bucket s3://bucket/registry/
```

- `.github/workflows/registry-build.yml` — Reusable workflow that extracts metadata (host), builds a breeze CI image to run parameter/connection extraction, builds the Eleventy site, syncs to S3 (without `--delete`), and runs `publish_versions.py` to update version metadata. Supports `staging` and `live` destinations. Accepts an optional `provider` input for incremental builds.
- `.github/workflows/registry-tests.yml` — Runs extraction script unit tests on PRs that touch `dev/registry/`, `registry/`, or `providers/*/provider.yaml`.
- `.github/workflows/publish-docs-to-s3.yml` — Main docs workflow. When publishing provider docs, the `update-registry` job automatically triggers `registry-build.yml` with the provider ID for an incremental registry update.
The registry can be rebuilt independently via `workflow_dispatch` on `registry-build.yml`.
Only designated committers can trigger manual builds. The `provider` input can be set
to run an incremental build for a specific provider (e.g., `amazon`).

The built site is synced to:

- Staging: `s3://staging-docs-airflow-apache-org/registry/`
- Live: `s3://live-docs-airflow-apache-org/registry/`
Module discovery (`modules.json`) uses runtime inspection inside Breeze, where all
providers are installed. `extract_parameters.py` imports each module listed in
`provider.yaml`, iterates over its classes with `inspect.getmembers()`, and uses
`issubclass()` checks against base classes like `BaseOperator` and `BaseHook` to
classify them.
Runtime discovery is more accurate than AST-based alternatives: it resolves dynamic class definitions, runtime-computed attributes, and complex inheritance chains that static analysis misses. Validation showed runtime discovery found 9 classes that AST missed (triggers and a hook) with 0 type mismatches across 1600+ modules.
Since extract_parameters.py already runs inside Breeze for parameter inspection, module
discovery adds no extra infrastructure cost — the same Breeze session handles both.
`scripts/in_container/run_provider_yaml_files_check.py` (run by the
`check-provider-yaml-valid` pre-commit hook inside Breeze) validates that `provider.yaml`
is correct and complete: modules exist, classes are importable, and every Python file in
the `operators/`, `hooks/`, `sensors/`, and `triggers/` directories is listed. This is a
correctness guarantee that `extract_parameters.py` builds on.
The distinction: `provider.yaml` lists operators/hooks/sensors/triggers/transfers/bundles
at the module level (e.g., `airflow.providers.amazon.operators.s3`), while the
registry needs individual class names within each module. Runtime discovery fills
that gap by importing each module and inspecting its members. For class-level entries
(notifications, secrets-backends, logging, executors, task-decorators), `provider.yaml`
already has the full class path and `extract_parameters.py` uses it directly.
`extract_parameters.py` and `extract_connections.py` need runtime access to provider
classes (to discover modules via `issubclass()`, inspect `__init__` signatures, and call
`get_connection_form_widgets()`). They run inside Breeze, where all providers are installed.
`extract_parameters.py` produces both `modules.json` (the module catalog) and per-provider
`parameters.json` files. `extract_metadata.py` and `extract_versions.py` only need
filesystem access and run on the host. This split means the CI workflow can run the fast
scripts (metadata) without spinning up Breeze, while module discovery, parameter
extraction, and connection extraction run as a separate step inside Breeze.
Static site generators produce zero-JS pages by default. The registry works without JavaScript — filtering and search are layered on top progressively. Eleventy has no opinion on frontend frameworks, which keeps the dependency surface small (~30 packages in the lockfile).
The site deploys at `/registry/` on airflow.apache.org but runs at `/` during local dev.
Eleventy's `pathPrefix` config handles this via the `REGISTRY_PATH_PREFIX` env var.
Templates use the `| url` filter, and client-side JS reads `window.__REGISTRY_BASE__`
(injected in `base.njk`).
Classes are discovered by runtime `issubclass()` checks against type-specific base
classes — e.g., `BaseOperator` for operators, `BaseHook` for hooks. Since
`extract_parameters.py` runs inside Breeze where all providers are installed, Python's
MRO handles transitive inheritance natively: chains like
`S3ListOperator → AwsBaseOperator → BaseOperator` are resolved without needing to build
a cross-file inheritance map. After inheritance filtering, a post-filter skips private,
`Base*`, `Abstract*`, and `*Mixin` classes. There is no suffix-based matching.
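A minimal sketch of this classification approach (the base classes here are stubs and the filter lists are illustrative — not the extractor's actual code):

```python
import fnmatch
import inspect
import types

class BaseOperator: ...   # stand-ins for the real Airflow base classes
class BaseHook: ...

# Checked in order; the first matching base class wins.
TYPE_BASES = [("operator", BaseOperator), ("hook", BaseHook)]
SKIP_PATTERNS = ["Base*", "Abstract*", "*Mixin"]

def classify(module) -> dict[str, str]:
    """Map public classes defined in `module` to a registry type."""
    result = {}
    for name, cls in inspect.getmembers(module, inspect.isclass):
        if cls.__module__ != module.__name__:  # skip re-exported classes
            continue
        if name.startswith("_") or any(fnmatch.fnmatch(name, p) for p in SKIP_PATTERNS):
            continue
        for type_name, base in TYPE_BASES:
            if issubclass(cls, base) and cls is not base:
                result[name] = type_name
                break
    return result

# Usage with a synthetic module:
mod = types.ModuleType("fake_provider.operators.s3")
class S3ListOperator(BaseOperator): ...
S3ListOperator.__module__ = mod.__name__
mod.S3ListOperator = S3ListOperator
print(classify(mod))  # → {'S3ListOperator': 'operator'}
```

Note the `__module__` check: it distinguishes classes defined in a module from classes merely imported into it, which keeps re-exports out of the catalog.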
No registry-specific changes are needed. When `extract_metadata.py` runs during CI, it
automatically discovers all providers under `providers/*/provider.yaml`. To ensure your
provider appears well in the registry:

- Complete `provider.yaml` — include description, integrations with `how-to-guide` links, and logo references
- Add a logo — place a PNG/SVG in `registry/public/logos/{provider-id}-{Name}.png`
- Write docstrings — the extraction script uses runtime inspection to pull class-level docstrings for module descriptions
- Publish to PyPI — download stats are fetched automatically
- Run `uv run python dev/registry/extract_metadata.py` whenever provider metadata changes
- The `pnpm dev` command runs both the Eleventy build and a live-reload dev server
- CSS uses custom properties defined in `src/css/tokens.css` for theming
- The site works without JavaScript (progressive enhancement); filters and search are layered on top via `js/` scripts