Table of contents
- Documentation configuration
- Architecture of documentation for Airflow
- Diagrams of the documentation architecture
- Staging site
- Typical workflows
- Fixing historical documentation
This directory used to contain all the documentation files for the project. The documentation has since been split into separate folders: each set of documentation now lives in the sub-project it describes.
If you are looking for the documentation, it is stored as follows:
Documentation in separate distributions:
- airflow-core/docs - documentation for Airflow Core
- providers/**/docs - documentation for Providers
- chart/docs - documentation for Helm Chart
- task-sdk/docs - documentation for Task SDK (new format not yet published)
- airflow-ctl/docs - documentation for Airflow CLI (future)
Documentation for general overview and summaries not connected with any specific distribution:
- docker-stack-docs - documentation for Docker Stack
- providers-summary-docs - documentation for provider summary page
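The layout above can be captured as a small lookup. This is only a sketch: the function name `docs_source_dir` is hypothetical, and the pairing of package names with folders is an assumption inferred from the list above rather than taken from the Airflow build tooling. Provider folders are given only as the glob `providers/**/docs` in this document, so the glob is returned as-is instead of guessing a concrete path.

```shell
# Hypothetical helper: map a documentation package name to the folder that
# holds its sources. The name-to-folder pairing is an assumption based on the
# list above.
docs_source_dir() (
  case "$1" in
    apache-airflow)    echo "airflow-core/docs" ;;
    helm-chart)        echo "chart/docs" ;;
    task-sdk)          echo "task-sdk/docs" ;;
    airflow-ctl)       echo "airflow-ctl/docs" ;;
    docker-stack)      echo "docker-stack-docs" ;;
    providers-summary) echo "providers-summary-docs" ;;
    # Anything else is treated as a provider id; the exact sub-folder is not
    # spelled out in this document, so only the glob is returned.
    *)                 echo "providers/**/docs" ;;
  esac
)
```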
Building documentation for Airflow is optimized for speed and for the convenience of the release managers and committers who publish and fix the documentation. That is why it is a little complex: multiple repositories and multiple sources of documentation are involved.
There are a few repositories under the apache organization which are used to build the documentation for Airflow:
- apache-airflow - the repository with the code and the documentation sources for Airflow distributions, provider distributions, providers summary and docker summary. From here we publish the documentation to the S3 bucket where the documentation is hosted.
- airflow-site - the repository with the website theme and content, where we keep the sources of the website structure, navigation and theme. From here we publish the website to the ASF servers so that it is published as the official website.
- airflow-site-archive - here we keep the archived historical versions of the generated documentation of all the documentation packages that we keep on S3. This repository is automatically synchronized from the S3 buckets and is only used in case we need to perform a bulk update of historical documentation. Only generated html, css, js and images files are kept here; no documentation sources are kept here.
We have two S3 buckets where we can publish the documentation generated from apache-airflow repository:
- s3://live-docs-airflow-apache-org/docs/ - live, official documentation
- s3://staging-docs-airflow-apache-org/docs/ - staging documentation (TODO: make it work)
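When scripting against these buckets it can help to refer to them by name instead of pasting the URLs. A minimal sketch, assuming a hypothetical helper called `docs_bucket`; only the two bucket locations come from this document.

```shell
# Hypothetical helper: resolve the S3 location for a destination name.
docs_bucket() (
  case "$1" in
    live)    echo "s3://live-docs-airflow-apache-org/docs/" ;;
    staging) echo "s3://staging-docs-airflow-apache-org/docs/" ;;
    *)       echo "unknown destination: $1" >&2; exit 1 ;;
  esac
)
```

With AWS credentials configured, something like `aws s3 ls "$(docs_bucket staging)"` could then be used to look at what is currently published to staging.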
This is the diagram of live documentation architecture:
Staging documentation architecture is similar, but uses staging bucket and staging Apache Website. The main differences are:
- The staging bucket is s3://staging-docs-airflow-apache-org/docs/
- The staging website is https://airflow.staged.apache.org/docs/
- The staging site is deployed by merging a PR into, or pushing, the staging branch in the airflow-site repository rather than main. The staging branch should be periodically rebased to the main branch, but while some changes are developed in staging it can diverge from the main branch.
- Merging into the staging branch of the airflow-site repository or pushing the staging branch will automatically trigger the build of the website and publish it to the publish-staging branch and effectively to the staging site.
Documentation of pre-release versions of Airflow distributions should be published to the staging S3
bucket so that we can test the documentation before we publish it to the live bucket.
There are a few typical workflows that we support:
The release manager publishes the documentation using GitHub Actions workflow Publish Docs to S3. The same workflow can be used to publish Airflow, Helm chart and providers documentation.
This workflow is used twice:
- when pre-release distributions are prepared (alpha/beta/rc) - the documentation should be published to the staging bucket and the staging site should be built and published.
- when final releases of distributions are prepared - the documentation should be published to the live bucket and the live website should be built and published.
When the release manager publishes the documentation, they choose the auto destination by default. Depending on
the tag they use, staging will be used to publish from a pre-release tag and live will be used to publish
from a release tag.
You can also specify whether live or staging documentation should be published manually - overriding
the auto-detection.
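The destination choice described above can be sketched as a small function. This is not the logic of the actual workflow: the function name `resolve_destination` is hypothetical, and the tag patterns (a digit followed by `a`, `b` or `rc` and another digit, as in `3.0.1rc1`) are an assumption about how alpha/beta/rc tags are spelled.

```shell
# Hypothetical helper mirroring the "auto" choice described above.
# $1 = tag name, $2 = destination override: auto (default), live or staging.
resolve_destination() (
  tag="$1"
  override="${2:-auto}"
  # A manual choice overrides the auto-detection.
  if [ "$override" != "auto" ]; then
    echo "$override"
    exit 0
  fi
  case "$tag" in
    # Assumed pre-release spelling: <digit>a<digit>, <digit>b<digit>, <digit>rc<digit>.
    *[0-9]a[0-9]* | *[0-9]b[0-9]* | *[0-9]rc[0-9]*) echo "staging" ;;
    *) echo "live" ;;
  esac
)
```

For example, `resolve_destination 3.0.1rc1` selects staging, while `resolve_destination 3.0.1rc1 live` honours the manual override.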
The person who triggers the build (release manager) should specify the tag name of the docs to be published and the list of documentation packages to be published. Usually it is:
- Airflow: apache-airflow docker-stack (later we will add airflow-ctl and task-sdk)
- Helm chart: helm-chart
- Providers: provider_id1 provider_id2 or all providers if all providers should be published.
Optionally - specifically if we run all-providers and release manager wants to exclude some providers,
they can specify documentation packages to exclude. Leaving "no-docs-excluded" will publish all packages
specified to be published without exclusions.
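The effect of the exclusion list can be illustrated with a short sketch. The function name `packages_to_publish` is hypothetical and the code only models the include/exclude semantics described above; it does not expand `all-providers` into individual provider ids, which the real workflow would have to do.

```shell
# Hypothetical helper: apply the exclusion list to the requested packages.
# $1 = space-separated packages to publish,
# $2 = space-separated packages to exclude, or the literal
#      "no-docs-excluded" (the default) to exclude nothing.
packages_to_publish() (
  set -f  # do not glob-expand package names while word-splitting
  include="$1"
  exclude="${2:-no-docs-excluded}"
  if [ "$exclude" = "no-docs-excluded" ]; then
    exclude=""
  fi
  result=""
  for pkg in $include; do
    keep=1
    for ex in $exclude; do
      if [ "$pkg" = "$ex" ]; then
        keep=0
      fi
    done
    if [ "$keep" -eq 1 ]; then
      result="${result:+$result }$pkg"
    fi
  done
  printf '%s\n' "$result"
)
```

Calling `packages_to_publish "amazon google" "google"` leaves only `amazon`, while passing `no-docs-excluded` returns the list unchanged.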
Example screenshot of the workflow triggered from the GitHub UI:
Note that this just publishes the documentation but does not update the "site" with version numbers or
stable links to providers and airflow. If you release a new documentation version, it will be available
at its direct URL (say https://airflow.apache.org/docs/apache-airflow/3.0.1/), but the main site will still
point to the previous version of the documentation as stable and the version drop-downs will not be updated.
In order to do it, you need to run the Build docs
workflow in airflow-site repository.
For the live site you should run the workflow in the main branch. For the staging site it should be the staging branch.
This will build the website and publish it to the publish branch of the airflow-site repository (for the live
site) or to the publish-staging branch (for the staging site). The workflow will also update the website,
including refreshing the version numbers in the drop-downs and the stable links.
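The branch pairing for the two sites is easy to mix up, so here is a small lookup as a sketch. The function name `site_branches` is hypothetical; the branch names themselves (main, staging, publish, publish-staging) are the ones named in this document.

```shell
# Hypothetical helper: print "<source branch> <publish branch>" for a site.
# The source branch is the one to run the Build docs workflow on; the publish
# branch is where the built website ends up.
site_branches() (
  case "$1" in
    live)    echo "main publish" ;;
    staging) echo "staging publish-staging" ;;
    *)       echo "unknown site: $1" >&2; exit 1 ;;
  esac
)
```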
The staging documentation is produced automatically with staging watermark added.
This workflow also invalidates cache in Fastly that Apache Software Foundation uses to serve the website, so you should always run it after you modify the documentation for the website. Other than that Fastly is configured with 3600 seconds TTL - which means that changes will propagate to the website in ~1 hour.
Shortly after the workflow succeeds and the documentation is published to the live bucket, the airflow-site-archive
repository is automatically synchronized with the live S3 bucket. TODO: implement this. For now it has
to be synchronized manually via the Sync s3 to GitHub
workflow in the airflow-site-archive repository. The airflow-site-archive essentially keeps the history of
snapshots of the live documentation.
The workflows in apache-airflow only update the documentation for the packages (Airflow, Helm chart,
Providers, Docker Stack) that we publish from airflow sources. If we want to publish changes to the website
itself or to the theme (css, javascript) we need to do it in airflow-site repository.
Publishing of airflow-site happens automatically when a PR from airflow-site is merged to main or when
the Build docs workflow is triggered
manually in the main branch of airflow-site repository. The workflow builds the website and publishes it to
publish branch of airflow-site repository, which in turn gets picked up by the ASF servers and is
published as the official website. This includes any changes to .htaccess of the website.
Such a main build also publishes the latest "sphinx-airflow-theme" package to GitHub so that the next build
of documentation can automatically pick it up from there. This means that if you want to make changes to
javascript or css that are part of the theme, you need to do it in the airflow-site repository and
merge it to the main branch in order to be able to run the documentation build in the apache-airflow repository
and pick up the latest version of the theme.
The version of sphinx theme is fixed in both repositories:
- https://github.com/apache/airflow-site/blob/main/sphinx_airflow_theme/sphinx_airflow_theme/__init__.py#L21
- https://github.com/apache/airflow/blob/main/devel-common/pyproject.toml#L77 in "docs" section
In case of bigger changes to the theme, we can first iterate on the website and merge a new theme version, and only after that we can switch to the new version of the theme.
Sometimes we need to update historical documentation (modify generated html) - for example when we find
bad links or when we change some of the structure in the documentation. This can be done via the
airflow-site-archive repository. The workflow is as follows:
1. Get the latest version of the documentation from S3 to the airflow-site-archive repository using the Sync s3 to GitHub workflow. This will download the latest version of the documentation from S3 to the airflow-site-archive repository (this should normally not be needed if automated synchronization works).
2. Make the changes to the documentation in the airflow-site-archive repository. This can be done using any text editor, scripts etc. Those files are generated as html files and are not meant to be regenerated; they should be modified as html files in place.
3. Commit the changes to the airflow-site-archive repository and push them to some branch of the repository.
4. Run the Sync GitHub to S3 workflow in the airflow-site-archive repository. This will upload the modified documentation to the S3 bucket.
   - You can choose whether to sync the changes to the live or staging bucket. The default is live.
   - By default the workflow will synchronize all documentation modified in a single commit: the last commit pushed to the branch you specified. You can also specify "full_sync" to synchronize all files in the repository.
   - In case you specify "full_sync", you can also synchronize all docs or only selected documentation packages (for example apache-airflow or docker-stack or amazon or helm-chart) - you can specify more than one package separated by spaces.
5. After you synchronize the changes to S3, the Sync S3 to GitHub workflow will be triggered automatically and the changes will be synchronized to the airflow-site-archive main branch, so there is no need to merge your changes to the main branch of the airflow-site-archive repository. You can safely delete the branch you created in step 3.
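For the "make the changes" step, a bulk fix such as replacing a bad link is often easiest to script. The following is a minimal sketch, not a tool that exists in any of the repositories: the function name `replace_in_html` is hypothetical, and the search string is interpreted by `grep` and `sed` as a basic regular expression, so neither argument may contain `|` and characters such as `.` or `&` keep their usual sed meaning.

```shell
# Hypothetical helper: replace OLD with NEW in every generated .html file
# under a directory, editing the files in place.
# $1 = directory to scan, $2 = OLD (sed basic regex), $3 = NEW (sed replacement)
replace_in_html() (
  root="$1"
  old="$2"
  new="$3"
  find "$root" -type f -name '*.html' | while IFS= read -r file; do
    if grep -q -- "$old" "$file"; then
      tmp="$(mktemp)"
      sed "s|$old|$new|g" "$file" > "$tmp"
      cat "$tmp" > "$file"   # overwrite in place, keeping file permissions
      rm -f "$tmp"
      echo "updated: $file"
    fi
  done
)
```

Run it against the checked-out archive, for instance `replace_in_html <path-to-checkout> "http://old.example.com" "https://new.example.com"` (both URLs are placeholders), then review the result with `git diff` before committing.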
The regular publishing workflows involve running GitHub Actions workflows and they cover the majority of cases. However, some manual updates and cherry-picks are sometimes needed when we discover problems with the publishing and doc building code - for example when we find that we need to fix extensions to sphinx.
In such a case, the release manager or a committer can build and publish documentation locally, provided that they configure AWS credentials to be able to upload files to S3. You can ask in the #internal-airflow-ci-cd channel on Airflow Slack to get your AWS credentials configured.
You can check out locally the version of the airflow repo that you need and apply any cherry-picks you need before running publishing.
This is done using breeze. You also need to have aws CLI installed and configured credentials to be able
to upload files to S3. You can get credentials from one of the admins of Airflow's AWS account. The
region to set for AWS is us-east-2.
Note that it is advised to add --dry-run if you just want to see what would happen. You can also use
the s3://staging-docs-airflow-apache-org/docs/ bucket to test the publishing using the staging site.
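The dry-run advice above can be baked into a small wrapper so that a real upload needs an explicit opt-in. This is a sketch under assumptions: the function name `publish_command` and the `really-publish` keyword are hypothetical, and the function only prints a command built from the breeze flags that appear in this section; it does not run breeze or touch S3.

```shell
# Hypothetical wrapper: print the publish-docs-to-s3 command for a destination,
# adding --dry-run unless the third argument is "really-publish".
# $1 = source directory, $2 = destination bucket URL, $3 = mode (optional)
publish_command() (
  source_dir="$1"
  destination="$2"
  mode="${3:-dry-run}"
  cmd="breeze release-management publish-docs-to-s3"
  cmd="$cmd --source-dir-path $source_dir"
  cmd="$cmd --destination-location $destination --stable-versions"
  if [ "$mode" != "really-publish" ]; then
    cmd="$cmd --dry-run"
  fi
  printf '%s\n' "$cmd"
)
```

Printing the command first, for example with `publish_command /tmp/airflow-site/docs-archive s3://staging-docs-airflow-apache-org/docs/`, gives a chance to review the destination before anything is uploaded.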
breeze build-docs "<package_id1>" "<package_id2>" --docs-only
mkdir -p /tmp/airflow-site
breeze release-management publish-docs --override-versioned --airflow-site-directory /tmp/airflow-site
breeze release-management publish-docs-to-s3 --source-dir-path /tmp/airflow-site/docs-archive \
--destination-location s3://live-docs-airflow-apache-org/docs/ --stable-versions \
--exclude-docs "<package_id1_to_exclude> <package_id2_to_exclude>" [--dry-run]
If you do not have S3 credentials and want to be careful about publishing the documentation, you can also
publish via the airflow-site-archive repository. This is a little more complex, but it allows
you to publish documentation without having S3 credentials.
The process is as follows:
1. Run the Sync s3 to GitHub workflow in the airflow-site-archive repository. This will download the latest version of the documentation from S3 to the airflow-site-archive repository (this should normally not be needed if automated synchronization works).
2. Check out the airflow-site-archive repository and create a branch for your changes.
3. Build documentation locally in the apache-airflow repo with any cherry-picks and modifications you need and publish the docs to the checked out airflow-site-archive branch:
   breeze build-docs "<package_id1>" "<package_id2>" --docs-only
   breeze release-management publish-docs --override-versioned --airflow-site-directory <PATH_TO_THE_ARCHIVE_REPO>
4. Commit the changes to the airflow-site-archive repository and push them to some branch of the repository.
5. Run the Sync GitHub to S3 workflow in the airflow-site-archive repository. This will upload the modified documentation to the S3 bucket. You can choose whether to sync the changes to the live or staging bucket. The default is live. You can also specify which folders to sync - by default all modified folders are synced.
6. After you synchronize the changes to S3, the Sync S3 to GitHub workflow will be triggered automatically and the changes will be synchronized to the airflow-site-archive main branch, so there is no need to merge your changes to the main branch of the airflow-site-archive repository. You can safely delete the branch you created in step 2.
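The local part of the process above (create a branch, build, publish into the checkout, commit, push) can be written down as a reviewable plan. This is only a sketch: the function name `archive_publish_plan`, the commit message and the use of `origin` as the remote name are assumptions, and the function prints the commands instead of running them. The breeze commands are meant to be run from the apache-airflow checkout.

```shell
# Hypothetical helper: print (not run) the local commands for the archive route.
# $1 = path to the airflow-site-archive checkout, $2 = branch name,
# remaining arguments = documentation package ids to build.
archive_publish_plan() (
  archive_dir="$1"
  branch="$2"
  shift 2
  packages=""
  for pkg in "$@"; do
    packages="${packages:+$packages }\"$pkg\""
  done
  printf '%s\n' "git -C $archive_dir checkout -b $branch"
  printf '%s\n' "breeze build-docs $packages --docs-only"
  printf '%s\n' "breeze release-management publish-docs --override-versioned --airflow-site-directory $archive_dir"
  printf '%s\n' "git -C $archive_dir add -A"
  printf '%s\n' "git -C $archive_dir commit -m \"Update documentation\""
  printf '%s\n' "git -C $archive_dir push origin $branch"
)
```

After the printed commands have been reviewed and run, the remaining steps (running Sync GitHub to S3 and deleting the branch) happen in the GitHub UI.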



