ci: Migrate CI to hosted Cirrus Runners by willcl-ark · Pull Request #32989 · bitcoin/bitcoin

willcl-ark · 2025-07-16T08:39:41Z

This changeset migrates all current self-hosted CI jobs over to hosted Cirrus Runners.

These runners cost a flat rate of $150/month, and we qualify for an open source discount of 50%. Therefore they are $75/month/runner.

One "runner" should more accurately be thought of in terms of the number of vCPU you are purchasing: https://cirrus-runners.app/pricing/ or in terms of "concurrency", where 1 runner gets you 1.0 concurrency.
E.g. a Linux x86 Runner gets you 16 vCPU (1.0 concurrency) and 64GB RAM to be provisioned as you choose, amongst one or more jobs.

Cirrus Runners currently only support Linux (x86 and Arm64) and macOS (Arm64).
This changeset does not move the existing GitHub Actions native macOS runners away from being run on GitHub's infrastructure. This could be a follow-up optimisation.

Runs from this changeset using Cirrus Runners can be found at: https://github.com/testing-cirrus-runners/bitcoin2/actions which shows an uncached run on master (CI#1), an outside pull request (CI#3) and an updated push to master (CI#4).

These workflows were run on 10 runners, and we would recommend purchasing a similar number for our CI in this repo to achieve the speed and concurrency we expect.

We include some optional performance commits, but these could be split out and made into followups or dropped entirely.

Benefits

Maintenance

As we are not self-hosting, nobody needs to maintain servers, disks etc.

Bus factor

Currently we have a very small number of people with the know-how working on server setup and maintenance. This setup fixes that so that "anyone" familiar with GitHub-style CI systems can work on it.

Scaling

These do not "auto-scale" or have "unlimited concurrency" like some solutions, but if we want more workers/CPU to increase parallelism, or want to increase the runner size of certain jobs for a speed-up, we can simply buy more concurrency using the web interface.

Speed

Runtimes approximate current runtimes pretty well, with some jobs being faster.
Caching improvements on pull request (re-runs) are left as future optimisations from the current changeset (see below).

GitHub workflow syntax

With a migration to the more-commonly-used GitHub workflow syntax, migration to other providers in the future is often as simple as a one-line change (and installing a new GitHub app to the repo).

If we decide to self-host again, then we can also self-host GitHub runners (using https://github.com/actions/runner) and maintain new GH-style CI syntax.

Reporting

GitHub workflows provide nicer built-in reporting directly on the "Checks" page of a PR. This includes more-detailed action reporting, and a host of pretty nice integrated features, such as Workflow Commands for creating annotations that can print messages during runs. See for example at the bottom of this window, where we report the ccache hit rate if it was below 90%: https://github.com/testing-cirrus-runners/bitcoin/actions/runs/16163449125?pr=1

These could be added conditionally into our CI scripts to report interesting or other information.
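As a rough sketch of what such a conditional report could look like: GitHub workflow commands are plain lines written to a job's stdout in the form `::notice title=<title>::<message>`. The helper below is hypothetical (the function name, the threshold parameter and the message wording are assumptions, not taken from this changeset); it only produces a notice when the hit rate falls below the 90% mentioned above.

```python
from typing import Optional


def ccache_annotation(hit_rate_percent: float, threshold: float = 90.0) -> Optional[str]:
    """Return a GitHub workflow command line, or None if nothing should be reported.

    Hypothetical helper: name, signature and wording are illustrative only.
    """
    if hit_rate_percent >= threshold:
        return None
    message = f"ccache hit rate was {hit_rate_percent:.1f}% (below {threshold:.0f}%)"
    # Workflow command data must have '%' escaped as '%25'.
    message = message.replace("%", "%25")
    return f"::notice title=ccache hit rate::{message}"


if __name__ == "__main__":
    line = ccache_annotation(72.5)
    if line is not None:
        # GitHub picks this line up from the job's stdout and shows it as an annotation.
        print(line)
```

A CI script could call something like this after the build step and stay silent on well-cached runs.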

Costs

Financial

Relative to competitors Cirrus runners are cheap for the hosted CI-world. However these are likely more expensive than our current setup, or a well-configured (new) self-hosted setup.

If we started with 10 runners to be shared amongst all migrated jobs, this would total $750/mo = $9000/yr.
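The arithmetic behind these figures can be checked with a few lines. This is only a sketch using the prices quoted in this description; the constant and function names are made up for illustration, and the vCPU total assumes every runner is a Linux x86 runner with 16 vCPU.

```python
LIST_PRICE_PER_RUNNER_USD = 150   # flat monthly rate per runner, as quoted above
OPEN_SOURCE_DISCOUNT = 0.5        # 50% open source discount, as quoted above
VCPU_PER_LINUX_X86_RUNNER = 16    # one Linux x86 runner = 16 vCPU = 1.0 concurrency


def monthly_cost_usd(runners: int) -> float:
    """Discounted monthly cost for a given number of runners."""
    return runners * LIST_PRICE_PER_RUNNER_USD * (1 - OPEN_SOURCE_DISCOUNT)


def yearly_cost_usd(runners: int) -> float:
    """Discounted yearly cost (12 monthly payments)."""
    return 12 * monthly_cost_usd(runners)


def total_vcpu(runners: int) -> int:
    """Total vCPU pool, assuming all runners are Linux x86 runners."""
    return runners * VCPU_PER_LINUX_X86_RUNNER


if __name__ == "__main__":
    print(monthly_cost_usd(10), yearly_cost_usd(10), total_vcpu(10))
```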

Note that we are not trying to compete here on cost directly.

Dependencies

We would be dependent on Cirrus infra.

Forks

Forks should be able to run CI without paid Cirrus runners. This behaviour is achieved through a rather verbose runs-on: directive.
- This directive hardcodes the main repo (unfortunately you cannot use the env github context in this field in particular, for some reason).
- This directive also allows for a fork to patch the runs-on: field in the ci.yml file if they want to use Cirrus Runners too.
- The workflow otherwise will fall back to the GitHub free runners on forks.

This Cirrus cache action transparently falls back to the GitHub Actions cache when not running on Cirrus, so forks will get some free GitHub caching (10GB per repo).

All jobs work on forks, but will run (slowly) on GitHub native free hosted runners instead of Cirrus runners. They will also suffer from poor cache hit-rates, but there's nothing that can be done about that, and the situation is an improvement on today.
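The decision encoded in the runs-on: directive can be modelled as below. This is a sketch of the logic only, not the expression used in ci.yml: the function name and both runner labels are placeholders, and in the real workflow the check is written in GitHub's expression syntax against the repository name.

```python
MAIN_REPO = "bitcoin/bitcoin"  # hardcoded, mirroring the directive described above


def pick_runner(repository: str, cirrus_label: str, github_label: str) -> str:
    """Use the paid Cirrus runner only in the main repo, otherwise a free GitHub runner.

    Both labels are passed in because the real label strings are not reproduced here.
    """
    if repository == MAIN_REPO:
        return cirrus_label
    return github_label


if __name__ == "__main__":
    # Placeholder labels, for illustration only.
    print(pick_runner("bitcoin/bitcoin", "cirrus-runner-label", "github-free-label"))
    print(pick_runner("someone/bitcoin", "cirrus-runner-label", "github-free-label"))
```

A fork that wants to pay for Cirrus Runners would patch the hardcoded repository name, which corresponds to changing `MAIN_REPO` in this model.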

Migration process

The main org should also, in addition to pulling code changes:

Permit the actions docker/setup-buildx-action@v3 and docker/login-action@v3 to be run in this repo.

Caching

For the number of CI jobs we have, cache usage on GitHub would be an issue, as GitHub only provides 10GB of cache space per repo. However, Cirrus provides 10GB per runner, which scales better with the number of runners.

The cirruslabs/action/[restore|save] action we use here redirects this to Cirrus' own cache and is both faster and larger.

If a user is running CI on a fork, the Cirrus cache falls back transparently to the GitHub default cache without error.

ccache, depends-sources, built-depends

Cached as blobs via cirruslabs/actions/cache action.
Current implementation:
- On push: restores and saves caches.
- On pull_request: restores but does not save caches.

This means a new pull request should hit a pretty relevant cache.
Old pull requests which are not being rebased on master may suffer from lower cache hit-rate.

If we save caches on all pull request runs we run the risk of evicting recent (and more relevant) cache blobs.
It may be possible in a future optimisation to widen this to save on pull request runs too, but it will also depend on how many runners we provision and what cache churn rates are like in the main repo.
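The current save/restore policy can be summarised in a small model. This is a sketch of the rule stated above, not code from the changeset; the function name and the returned step names are illustrative.

```python
from typing import List


def cache_steps(event_name: str) -> List[str]:
    """Which cache steps run for a given GitHub event under the current policy."""
    steps: List[str] = []
    if event_name in ("push", "pull_request"):
        steps.append("restore")  # both event types start from an existing cache
    if event_name == "push":
        steps.append("save")     # only pushes write back, to limit cache churn
    return steps


if __name__ == "__main__":
    for event in ("push", "pull_request"):
        print(event, cache_steps(event))
```

Widening the policy to save on pull requests as well would amount to relaxing the second condition, at the cost of the eviction risk described above.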

Docker build layer caching

- Cached using the gha cache backend.
- These cache blobs compete for space with the ccache, depends-sources and depends-built caches.
- The gha cache allows --cache-from to be used from pull requests, which does not work using a registry cache type (technically we could use a public read-only token to get this working, but that feels wrong).

This backend does network I/O and so is marginally slower than our current disk I/O cache.
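To illustrate how the gha backend is typically wired up, the sketch below assembles `docker buildx build` cache arguments, reading from the cache on every run and writing to it only on pushes. The `--cache-from`/`--cache-to` flags and the `type=gha`, `mode=max` and `scope` options exist in buildx; the function name and the use of a per-image scope are assumptions for illustration, not necessarily what this changeset does.

```python
from typing import List


def buildx_cache_args(event_name: str, scope: str) -> List[str]:
    """Build cache flags for `docker buildx build` using the gha cache backend.

    `scope` is a hypothetical per-image key so that images do not overwrite
    each other's layers.
    """
    # Every run may read previously cached layers.
    args = ["--cache-from", f"type=gha,scope={scope}"]
    if event_name == "push":
        # Only pushes export layers back to the cache.
        args += ["--cache-to", f"type=gha,mode=max,scope={scope}"]
    return args


if __name__ == "__main__":
    print(buildx_cache_args("push", "asan"))
    print(buildx_cache_args("pull_request", "asan"))
```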

But what about... `x`?

We have tested many other providers, including Runs-On, Buildjet, WarpBuild, and GitHub hosted runners (and investigated even more), but they all fall short in one way or another.

Runs-On and Buildjet (and others) require installing GH apps with much too-liberal permissions (e.g. Administration: Read|Write) for our use-case.
GitHub hosted runners suffer from all of high costs, lower speed, small cache, and the requirement for a GitHub Teams subscription.
WarpBuild seems to be simply too expensive.

TODO:

To complete migration from self-hosted to hosted for this repo, the backport branches 27.x, 28.x and 29.x would also need their CI ported, but these are left for followups to this change (and pending review/changes here first).

Work and experimentation undertaken with m3dwards

DrahtBot · 2025-07-16T08:39:45Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32989.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	janb84, maflcko, m3dwards, achow101
Concept ACK	fanquake, hebasto, 0xB10C, stickies-v

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#33145 (CI: silent merge check by m3dwards)
#33051 (Don't fix Python patch version by Sjors)
#31425 (RFC: Riscv bare metal CI job by TheCharlatan)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

ci/test_imagefile

ci/test/02_run_container.sh

fanquake · 2025-07-16T09:43:53Z

Concept ACK. This will also need to go back to 27.x.

.github/workflows/ci.yml

willcl-ark · 2025-07-16T10:29:49Z

> Concept ACK. This will also need to go back to 27.x.

Testing a backport to 29.x here: https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16320536543

I think the best course of action could be to look for a little more conceptual review here, and after that squash the "ci: port x" commits in this changeset down to a single one, to make backporting to the multiple supported branches easier.

achow101 · 2025-07-16T18:07:12Z

Concept ACK

ci/README.md

maflcko · 2025-07-17T06:51:15Z

> This means a new pull request should hit a pretty relevant cache.
> Old pull requests which are not being rebased on master may suffer from lower cache hit-rate.

I don't think this is true. A pull request that modifies a core header (like serialize.h) will now always start from a cold cache. The current persistent workers have a high ccache hit rate for pulls that are (force) pushed for minor fixups (https://0xb10c.github.io/bitcoin-core-ci-stats/graph/ccache/). Also, before CI runs, pull requests are rebased/merged with master, so the age of a pull request alone shouldn't affect cache hit rate.

However, the trade-offs here are probably worth to go forward and try to optimize the ccache hit rate later. Concept ACK.

> Give the robot account read/write access to this repo

This seems a bit scary. Are you saying that a proprietary third party outside of our control can now push directly to the repo? My assumption was that the tokens would be added to CI in this repo and CI had write access to the registry, not the other way round. Why would the registry need write access here?

Edit: We may reconsider #31850 and drop container image caching, and just accept the intermittent network IO errors or network speed issues.

willcl-ark · 2025-07-17T07:02:49Z

> > This means a new pull request should hit a pretty relevant cache.
> > Old pull requests which are not being rebased on master may suffer from lower cache hit-rate.
>
> I don't think this is true. A pull request that modifies a core header (like serialize.h) will now always start from a cold cache. The current persistent workers have a high ccache hit rate for pulls that are (force) pushed for minor fixups (0xb10c.github.io/bitcoin-core-ci-stats/graph/ccache). Also, before CI runs, pull requests are rebased/merged with master, so the age of a pull request alone shouldn't affect cache hit rate.
>
> However, the trade-offs here are probably worth to go forward and try to optimize the ccache hit rate later. Concept ACK.

Yes, force pushes for minor fixups is the tradeoff we have in the current implementation. As you say, we can set the ccache to save on pull requests too (in the future) if necessary.

> > Give the robot account read/write access to this repo
>
> This seems a bit scary. Are you saying that a proprietary third party outside of our control can now push directly to the repo? My assumption was that the tokens would be added to CI in this repo and CI had write access to the registry, not the other way round. Why would the registry need write access here?

Sorry for not being clearer! The robot account gets read/write access to the Quay.io (docker) repo, not this code repository!

> Edit: We may reconsider #31850 and drop container image caching, and just accept the intermittent network IO errors or network speed issues.

Yes, I would love to get #31850 working in any case, as it would simply avoid long rebuilds in the worst cases; most docker images are rebuilding in < 2 minutes, except MSAN...

maflcko

I guess you want review here and then address it, as it comes in? Once review is finished, the app will be installed and reviewers can also look at a "real" run in this repo?

looked at c0ad2b6~23 🎬

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: looked at c0ad2b6aa8e8c31c9f9c9ea2b35ca86f7985c490~23 🎬
gylJtm++jv0E+65SRoLFPC+ef+fwpVJftiMQ+ziB1uRZAF2MwE7TW3JaEv8iJIpZsRmForPR0jik8/6QvUs+BQ==

.github/actions/restore-caches/action.yml

.github/actions/save-caches/action.yml

.github/actions/restore-caches/action.yml

.github/actions/configure-docker/action.yml

.github/workflows/ci.yml

hebasto

Concept ACK.

> we qualify for an open source discount of 50%.

> We would be dependent on Cirrus infra...

We shouldn't be surprised when Cirrus suddenly changes its modus operandi, including its advertised open-source discount or general availability.

willcl-ark · 2025-07-17T12:35:22Z

> Concept ACK.
>
> > we qualify for an open source discount of 50%.
>
> > We would be dependent on Cirrus infra...
>
> We shouldn't be surprised when Cirrus suddenly changes its modus operandi, including its advertised open-source discount or general availability.

Certainly, it is good to be wary of that. I think this is equally true for all cloud providers though.

It's my belief that if we complete this migration we are reasonably well protected against this risk for the following reasons:

1. As we use the common GitHub workflow yaml format, changing to another provider can be as simple as amending the runs-on: line in the yaml (and installing a different provider's GH app, and changing the cache action to that provider's own).
2. We could always revert back to a self-hosted solution utilizing the new workflow yaml via https://github.com/actions/runner, which for example can be trivially configured on one or more self-hosted servers using services.github.runner with Nix (I have a demo of such a configuration here).

We seem to have a good working relationship so far with Cirrus; @m3dwards has a responsive and helpful contact there. Of course, we are in the tendering stage so there is perhaps extra impetus to be helpful to us, but I don't see any reason that a historical precedent of limiting free runners (which are allegedly being abused for crypto mining) should appear any more risky to paid/premium customers (than any other provider).

maflcko · 2025-07-17T12:42:09Z

> 2. We could always revert back to a self-hosted solution

Agree that this may happen with any third party (including GitHub itself). If we want to switch back to the self-hosted runners, it should be as trivial as git revert $the_merge_commit_of_this_pull. Alternatively, switch to GHA-based self-hosted runners. Though, it would be good to create a proof-of-concept pull request to switch to self-hosted runners in GHA, or a different hosted alternative (e.g. warp-build), after this pull is merged. This makes it easier to see that (1) it works and (2) how easily it is possible.

willcl-ark · 2025-07-18T10:34:45Z

Pushed c126475 with a CI run on master branch at https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16368410249

willcl-ark · 2025-07-18T10:40:33Z

Thanks for the review @maflcko & @fanquake , I hope I addressed all your current review comments.

0xB10C · 2025-07-18T11:27:22Z

Concept ACK!

I think finishing the migration would close #31965

maflcko

looked at c126475~20 💇

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: looked at c126475ed7a17ec9030066056e31846c7124dcf~20 💇
kep8ZK4UJEamaLijXtMFwgjSmf1fhSJuF49dbZ/NHDe/5jmIZR2EzJa0ewjjGov4n3xWZMN5f3LKRekrMZ6HAw==

ci/test/02_run_container.sh

.github/workflows/ci.yml

willcl-ark · 2025-07-25T21:45:25Z

Thanks for the review on this so far.

Whilst we had tested the docker registry caching on PRs successfully, because I was opening them myself (and was owner of the parent repo) repo-level variables were available to me which were not available to 3rd party pull requests.

The short of this is that this meant the docker registry cache setup could not pull from the registry on pull requests and we didn't think it was therefore suitable for our purposes.

We have switched to the gha cache backend in the latest force-push with the good news that we finally have it working for all images correctly; caching to/from on pushes, and caching from on pulls.

A push to master can be found here: https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16529929752

And a pull request (from a 3rd party account) here: https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16531270443?pr=3

A new commit has been added, 0bd758e, to fix a (new?) issue we experienced with the asan job where the runner host appeared to update its image and kernel. The cached docker image for the ASAN job then had the incorrect linux-headers-<version> package installed, causing the job to fail. There are likely multiple ways to fix this (we could input uname as a docker build arg for example), but simply forcing a reinstall of the correct headers in 03_*.sh seemed easiest. Open to suggestions here however...

Marking as ready for review now, as I think this is conceptually ready.

These jobs can use reduced runner size to avoid wasting CPU, as much of the long-running part of the job is single-threaded. Suggested in: bitcoin#32989 (comment) Co-authored-by: MarcoFalke <*~=`'#}+{/-|&$^[email protected]>

5eeb2fa ci: reduce runner sizes on various jobs (will)

Pull request description:

These jobs can likely use reduced runner sizes to avoid wasting our CPU quota, as much of the long-running part of the job is single-threaded. This will also give us more (job) parallelism from the same number of CPU that we are using.

Suggested in: #32989 (comment)

ACKs for top commit:
kevkevinpal: ACK 5eeb2fa
m3dwards: ACK 5eeb2fa
janb84: ACK 5eeb2fa

Tree-SHA512: 6fb0352bc40623dd63b9bd6169d753d1ec9667c272445fda7a2db8bbedfa35350a51d08c1adf3fa5e070e84855c3f491668726d3c7ded07a39f2f9c63edacefc

fa8f081 ci: Checkout latest merged pulls (MarcoFalke)

Pull request description:

Currently, the `actions/checkout@v5` checks out pull requests merged against master, which is what we want. However, it checks out ancient/stale merge commits on a re-run. This is documented (https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs):

> Re-run workflows [...] will also use the same GITHUB_SHA (commit SHA) and GITHUB_REF (git ref) of the original event that triggered the workflow run.

For example:

* https://github.com/bitcoin/bitcoin/actions/runs/17458152407/job/49579638898?pr=29641#step:9:914 compiles with IPC=ON, even though latest master is at ed2ff3c
* #32989 (comment) (example explained in comment)

This is problematic, because:

* Unrelated CI failures and intermittent issues, which are fixed or worked around in latest master can not be cleaned by re-running the task. The author has to actively go out and (force-)push the branch, invalidating review.
* It is odd to have a recent CI run, but it uses code and config from the past.
* Detecting silent merge conflicts by re-running the CI task is impossible.

Fix all issues by checking out the latest merged state of the pull request. The behavior is unchanged for non-pull-request actions.

This patch changes the "re-run" default behaviour. Forcing it to use the new state instead of running the old state again.

ACKs for top commit:
janb84: re ACK fa8f081
hebasto: ACK fa8f081.

Tree-SHA512: c22c6f837402f61ec46be46817473e1946424b5312e36ed0e246cadb1ca89c04163bb471f71c309765a3d327f198a83cd83679d231f03828a99a97562a622fdd

fanquake reviewed Jul 16, 2025: ci/test_imagefile (resolved)

fanquake reviewed Jul 16, 2025: ci/test/02_run_container.sh (outdated, resolved)

fanquake reviewed Jul 16, 2025: .github/workflows/ci.yml (resolved)

This was referenced Jul 16, 2025:
- Add bitcoin-{node,gui} to release binaries for IPC #31802 (Merged)
- RFC: Riscv bare metal CI job #31425 (Open)

DrahtBot reviewed Jul 17, 2025: ci/README.md (outdated, resolved)

maflcko approved these changes Jul 17, 2025

Sjors mentioned this pull request Jul 17, 2025: Test against accidentally running hosted cirrus runner Sjors/bitcoin#97 (Closed)

hebasto reviewed Jul 17, 2025

DrahtBot mentioned this pull request Jul 17, 2025: ci: Run unit tests parallel with functional tests #33000 (Closed)

willcl-ark force-pushed the cirrus-runners branch from c0ad2b6 to c126475, July 18, 2025 10:33

maflcko reviewed Jul 18, 2025

maflcko mentioned this pull request Jul 18, 2025: Hide CI failed comment to avoid bloat? maflcko/DrahtBot#42 (Open)

fanquake mentioned this pull request Jul 21, 2025: ci: Use APT_LLVM_V in msan task #32999 (Merged)

fanquake reviewed Jul 21, 2025: .github/workflows/ci.yml (outdated, resolved)

m3dwards force-pushed the cirrus-runners branch from c126475 to fe0906f, July 21, 2025 13:21

willcl-ark force-pushed the cirrus-runners branch from fe0906f to b4e85f5, July 25, 2025 21:28

willcl-ark marked this pull request as ready for review July 25, 2025 21:45

willcl-ark mentioned this pull request Sep 5, 2025: ci: reduce runner sizes on various jobs #33319 (Merged)

Sjors mentioned this pull request Sep 9, 2025: test: automatically pick bitcoind or bitcoin-node Sjors/bitcoin#104 (Closed)

willcl-ark mentioned this pull request Sep 16, 2025: Backport Cirrus runners to 29.x #33403 (Merged)

willcl-ark mentioned this pull request Sep 16, 2025: Backport Cirrus runners to 28.x #33406 (Merged)

glozow mentioned this pull request Oct 3, 2025: [29.x] Finalise 29.2rc2 #33534 (Merged)

Sjors mentioned this pull request Oct 24, 2025: TSAN/MSAN fails with vm.mmap_rnd_bits=32 even with llvm 18.1.3 #30674 (Closed)
