ci: Migrate CI to hosted Cirrus Runners #32989
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Code Coverage & Benchmarks
For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32989.
Reviews
See the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Conflicts
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
|
Concept ACK. This will also need to go back to |
Testing a backport to 29.x here: https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16320536543 I think the best course of action could be to look for a little more conceptual review here, and after that squash the "ci: port x" commits in this changeset down to a single one, to make backporting to the multiple supported branches easier. |
|
Concept ACK |
I don't think this is true. A pull request that modifies a core header (like serialize.h) will now always start from a cold cache. The current persistent workers have a high ccache hit rate for pulls that are (force) pushed for minor fixups (https://0xb10c.github.io/bitcoin-core-ci-stats/graph/ccache/). Also, before CI runs, pull requests are rebased/merged with master, so the age of a pull request alone shouldn't affect cache hit rate. However, the trade-offs here are probably worth it to go forward now and try to optimize the ccache hit rate later. Concept ACK.
This seems a bit scary. Are you saying that a proprietary third party outside of our control can now push directly to the repo? My assumption was that the tokens would be added to CI in this repo and CI had write access to the registry, not the other way round. Why would the registry need write access here? Edit: We may reconsider #31850 and drop container image caching, and just accept the intermittent network IO errors or network speed issues. |
Yes, force pushes for minor fixups is the tradeoff we have in the current implementation. As you say, we can set the ccache to save on pull requests too (in the future) if necessary.
Sorry for not being clearer! The robot account gets read/write access to the Quay.io (docker) repo, not this code repository!
Yes, I would love to get #31850 working in any case, as it would simply avoid long rebuilds in the worst cases; most docker images are rebuilding in < 2 minutes, except MSAN... |
maflcko
left a comment
I guess you want review here and then address it, as it comes in? Once review is finished, the app will be installed and reviewers can also look at a "real" run in this repo?
looked at c0ad2b6~23 🎬
Signature:
untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: looked at c0ad2b6aa8e8c31c9f9c9ea2b35ca86f7985c490~23 🎬
gylJtm++jv0E+65SRoLFPC+ef+fwpVJftiMQ+ziB1uRZAF2MwE7TW3JaEv8iJIpZsRmForPR0jik8/6QvUs+BQ==
hebasto
left a comment
Concept ACK.
we qualify for an open source discount of 50%.
We would be dependent on Cirrus infra...
We shouldn't be surprised when Cirrus suddenly changes its modus operandi, including its advertised open-source discount or general availability.
Certainly, it is good to be wary of that. I think this is equally true for all cloud providers though. It's my belief that if we complete this migration we are reasonably well protected against this risk for the following reasons:
We seem to have a good working relationship so far with Cirrus; @m3dwards has a responsive and helpful contact there. Of course, we are in the tendering stage so there is perhaps extra impetus to be helpful to us, but I don't see any reason that a historical precedent of limiting free runners (which are allegedly being abused for crypto mining) should make things appear any more risky to paid/premium customers (than with any other provider). |
Agree that this may happen with any third party (including GitHub itself). If we want to switch back to the self-hosted runners, it should be as trivial as |
c0ad2b6 to c126475
|
Pushed c126475 with a CI run on master branch at https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16368410249 |
|
Concept ACK! I think finishing the migration would close #31965 |
maflcko
left a comment
looked at c126475~20 💇
Signature:
untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: looked at c126475ed7a17ec9030066056e31846c7124dcf~20 💇
kep8ZK4UJEamaLijXtMFwgjSmf1fhSJuF49dbZ/NHDe/5jmIZR2EzJa0ewjjGov4n3xWZMN5f3LKRekrMZ6HAw==
fe0906f to b4e85f5
|
Thanks for the review on this so far.
Whilst we had tested the docker registry caching on PRs successfully, I was opening them myself (and was owner of the parent repo), so repo-level variables were available to me which are not available to 3rd party pull requests. In short, the docker registry cache setup could not pull from the registry on pull requests, and we therefore didn't think it was suitable for our purposes. We have switched to the
A push to master can be found here: https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16529929752
And a pull request (from a 3rd party account) here: https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16531270443?pr=3
A new commit has been added, 0bd758e, to fix a (new?) issue we experienced with the asan job where the runner host appeared to update its image and kernel. The cached docker image for the ASAN job then had the incorrect
Marking as ready for review now, as I think this is conceptually ready. |
These jobs can use reduced runner size to avoid wasting CPU, as much of the long-running part of the job is single-threaded. Suggested in: bitcoin#32989 (comment) Co-authored-by: MarcoFalke <*~=`'#}+{/-|&$^[email protected]>
5eeb2fa ci: reduce runner sizes on various jobs (will)
Pull request description:
These jobs can likely use reduced runner sizes to avoid wasting our CPU quota, as much of the long-running part of the job is single-threaded.
This will also give us more (job) parallelism from the same number of CPUs that we are using.
Suggested in: #32989 (comment)
ACKs for top commit:
kevkevinpal: ACK 5eeb2fa
m3dwards: ACK 5eeb2fa
janb84: ACK 5eeb2fa
Tree-SHA512: 6fb0352bc40623dd63b9bd6169d753d1ec9667c272445fda7a2db8bbedfa35350a51d08c1adf3fa5e070e84855c3f491668726d3c7ded07a39f2f9c63edacefc
fa8f081 ci: Checkout latest merged pulls (MarcoFalke)
Pull request description:
Currently, the `actions/checkout@v5` checks out pull requests merged against master, which is what we want. However, it checks out ancient/stale merge commits on a re-run. This is documented (https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs):
> Re-run workflows [...] will also use the same GITHUB_SHA (commit SHA) and GITHUB_REF (git ref) of the original event that triggered the workflow run.
For example:
* https://github.com/bitcoin/bitcoin/actions/runs/17458152407/job/49579638898?pr=29641#step:9:914 compiles with IPC=ON, even though latest master is at ed2ff3c
* #32989 (comment) (example explained in comment)
This is problematic, because:
* Unrelated CI failures and intermittent issues, which are fixed or worked around in latest master can not be cleaned by re-running the task. The author has to actively go out and (force-)push the branch, invalidating review.
* It is odd to have a recent CI run, but it uses code and config from the past.
* Detecting silent merge conflicts by re-running the CI task is impossible.
Fix all issues by checking out the latest merged state of the pull request. The behavior is unchanged for non-pull-request actions.
This patch changes the "re-run" default behaviour. Forcing it to use the new state instead of running the old state again.
ACKs for top commit:
janb84: re ACK fa8f081
hebasto: ACK fa8f081.
Tree-SHA512: c22c6f837402f61ec46be46817473e1946424b5312e36ed0e246cadb1ca89c04163bb471f71c309765a3d327f198a83cd83679d231f03828a99a97562a622fdd
This changeset migrates all current self-hosted CI jobs over to hosted Cirrus Runners.
These runners cost a flat rate of $150/month, and we qualify for an open source discount of 50%. Therefore they are $75/month/runner.
One "runner" should more accurately be thought of in terms of the number of vCPU you are purchasing (https://cirrus-runners.app/pricing/), or in terms of "concurrency", where 1 runner gets you 1.0 concurrency.
e.g. a Linux x86 Runner gets you 16 vCPU (1.0 concurrency) and 64GB RAM to be provisioned as you choose, amongst one or more jobs.
Cirrus Runners currently only support Linux (x86 and Arm64) and macOS (Arm64).
This changeset does not move the existing GitHub Actions native macOS runners away from being run on GitHub's infrastructure. This could be a follow-up optimisation.
Runs from this changeset using Cirrus Runners can be found at: https://github.com/testing-cirrus-runners/bitcoin2/actions which shows an uncached run on master (CI#1), an outside pull request (CI#3) and an updated push to master (CI#4).
These workflows were run on 10 runners, and we would recommend purchasing a similar number for our CI in this repo to achieve the speed and concurrency we expect.
We include some optional performance commits, but these could be split out and made into followups or dropped entirely.
Benefits
Maintenance
As we are not self-hosting, nobody needs to maintain servers, disks etc.
Bus factor
Currently we have a very small number of people with the know-how working on server setup and maintenance. This setup fixes that so that "anyone" familiar with GitHub-style CI systems can work on it.
Scaling
These do not "auto-scale"/have "unlimited concurrency" like some solutions, but if we want more workers/CPU to increase parallelism, or want to increase the runner size of certain jobs for a speed-up, we can simply buy more concurrency using the web interface.
Speed
Runtimes approximate current runtimes pretty well, with some jobs being faster.
Caching improvements on pull request (re-runs) are left as future optimisations from the current changeset (see below).
GitHub workflow syntax
With a migration to the more-commonly-used GitHub workflow syntax, migration to other providers in the future is often as simple as a one-line change (and installing a new GitHub app to the repo).
If we decide to self-host again, then we can also self-host GitHub runners (using https://github.com/actions/runner) and maintain new GH-style CI syntax.
Reporting
GitHub workflows provide nicer built-in reporting directly on the "Checks" page of a PR. This includes more-detailed action reporting, and a host of pretty nice integrated features, such as Workflow Commands for creating annotations that can print messages during runs. See for example the bottom of this window, where we report the ccache hit rate if it was below 90%: https://github.com/testing-cirrus-runners/bitcoin/actions/runs/16163449125?pr=1
These could be added conditionally into our CI scripts to report interesting or other information.
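A minimal sketch of how such an annotation could be emitted from a CI script. GitHub Actions interprets lines of the form `::notice title=...::message` printed to stdout as workflow commands and shows them as annotations. The helper name `ccache_annotation`, its hit/miss inputs, and the message wording are illustrative assumptions and not the code used in this changeset; only the 90% threshold is taken from the description above.

```python
from typing import Optional


def ccache_annotation(hits: int, misses: int, threshold: float = 90.0) -> Optional[str]:
    """Return a GitHub workflow command when the hit rate is below threshold.

    Hypothetical helper: the inputs would come from parsing ccache
    statistics in whatever way the CI script already does.
    """
    total = hits + misses
    if total == 0:
        return None  # nothing went through ccache, so there is nothing to report
    rate = 100.0 * hits / total
    if rate >= threshold:
        return None  # good hit rate, stay quiet
    return (
        "::notice title=ccache hit rate::"
        f"ccache hit rate was {rate:.1f}% (below {threshold:.0f}%)"
    )


if __name__ == "__main__":
    command = ccache_annotation(hits=850, misses=150)
    if command is not None:
        # Printing the command to stdout is what creates the annotation.
        print(command)
```

Returning `None` above the threshold keeps the "Checks" page free of noise on runs where the cache behaved well.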
Costs
Financial
Relative to competitors, Cirrus runners are cheap for the hosted CI world. However, they are likely more expensive than our current setup, or a well-configured (new) self-hosted setup.
If we started with 10 runners to be shared amongst all migrated jobs, this would total $750/mo = $9000/yr.
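The arithmetic behind these figures can be sketched as follows. The price, discount, and vCPU-per-runner values are the ones quoted in this description; the function names are illustrative, and the vCPU total assumes every runner is a Linux x86 runner, which this description does not state.

```python
LIST_PRICE_USD_PER_MONTH = 150  # flat rate per runner, from the description
OPEN_SOURCE_DISCOUNT = 0.50     # 50% discount, from the description
VCPU_PER_LINUX_X86_RUNNER = 16  # Linux x86 figure, from the description


def monthly_cost_usd(runners: int) -> float:
    """Discounted monthly cost for a number of runners."""
    return runners * LIST_PRICE_USD_PER_MONTH * (1 - OPEN_SOURCE_DISCOUNT)


def yearly_cost_usd(runners: int) -> float:
    """Discounted yearly cost, assuming the price stays constant for 12 months."""
    return 12 * monthly_cost_usd(runners)


def total_vcpu(runners: int) -> int:
    """Total vCPU, assuming all runners are Linux x86 (an assumption)."""
    return runners * VCPU_PER_LINUX_X86_RUNNER


if __name__ == "__main__":
    n = 10
    print(f"{n} runners: ${monthly_cost_usd(n):.0f}/mo, "
          f"${yearly_cost_usd(n):.0f}/yr, {total_vcpu(n)} vCPU")
```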
Note that we are not trying to compete here on cost directly.
Dependencies
We would be dependent on Cirrus infra.
Forks
`runs-on:` directive.
`env` github context in this field in particular, for some reason).
`runs-on:` field in the ci.yml file if they want to use Cirrus Runners too.
All jobs work on forks, but will run (slowly) on GitHub native free hosted runners, instead of Cirrus runners. They will also suffer from poor cache hit-rates, but there's nothing that can be done about that, and the situation is an improvement on today.
Migration process
The main org should also, in addition to pulling code changes:
`docker/setup-buildx-action@v3` and `docker/login-action@v3` to be run in this repo.
Caching
For the number of CI jobs we have, cache usage on GitHub would be an issue, as GH only provides 10 GB of cache space per repo. However, Cirrus provides 10 GB per runner, which scales better with the number of runners.
The `cirruslabs/action/[restore|save]` action we use here redirects this to Cirrus' own cache and is both faster and larger.
In the case that a user is running CI on a fork, the Cirrus cache falls back transparently to the GitHub default cache without error.
ccache, depends-sources, built-depends
`cirruslabs/actions/cache` action.
`push`: restores and saves caches.
`pull_request`: restores but does not save caches.
This means a new pull request should hit a pretty relevant cache.
Old pull requests which are not being rebased on master may suffer from lower cache hit-rate.
If we save caches on all pull request runs we run the risk of evicting recent (and more relevant) cache blobs.
It may be possible in a future optimisation to widen this to save on pull request runs too, but it will also depend on how many runners we provision and what cache churn rates are like in the main repo.
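The eviction concern can be illustrated with a toy simulation. Everything here is an assumption made for illustration: the oldest-first eviction policy, the 10 GB capacity for a single cache, the 4 GB blob sizes, and all names. How Cirrus actually evicts cache entries is not described in this changeset.

```python
from collections import OrderedDict


class OldestFirstCache:
    """Toy size-limited cache that evicts the oldest saved blob first."""

    def __init__(self, capacity_gb: int) -> None:
        self.capacity_gb = capacity_gb
        self.blobs = OrderedDict()  # key -> size in GB, oldest entry first

    def save(self, key: str, size_gb: int) -> None:
        self.blobs.pop(key, None)  # re-saving a key makes it the newest entry
        self.blobs[key] = size_gb
        while sum(self.blobs.values()) > self.capacity_gb:
            self.blobs.popitem(last=False)  # drop the oldest blob


def should_save(event_name: str) -> bool:
    """Policy from the description: only `push` runs save caches."""
    return event_name == "push"


def master_cache_survives(save_on_pull_request: bool) -> bool:
    """Save one master blob, then run three pull requests against the cache."""
    cache = OldestFirstCache(capacity_gb=10)
    cache.save("master-ccache", 4)
    for pr in range(3):
        if save_on_pull_request or should_save("pull_request"):
            cache.save(f"pr-{pr}-ccache", 4)
    return "master-ccache" in cache.blobs


if __name__ == "__main__":
    print("restore-only pull requests:", master_cache_survives(False))
    print("saving on pull requests:   ", master_cache_survives(True))
```

Under these assumptions the master blob is kept when pull requests only restore, and is evicted once pull requests start saving their own blobs.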
Docker build layer caching
`gha` cache backend
`ccache`, `depends-sources` and `depends-built` caches
`gha` cache allows `--cache-from` to be used from pull requests, which does not work using a registry cache type (technically we could use a public read-only token to get this working, but that feels wrong)
This backend does network i/o and so is marginally slower than our current disk i/o cache.
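As a rough sketch of what a build using the `gha` cache backend could look like, the helper below assembles a `docker buildx build` command line. The `--cache-from`/`--cache-to` options with `type=gha`, `scope` and `mode=max` are standard buildx options; the function name, the `save_cache` switch, the scope naming, and the example paths and tags are hypothetical and not taken from this changeset. The `gha` backend also only works inside a GitHub Actions job where the cache service credentials are exposed to buildx.

```python
import shlex
from typing import List


def buildx_command(image_tag: str, scope: str, containerfile: str,
                   save_cache: bool) -> List[str]:
    """Build an argv list for `docker buildx build` using the gha cache backend."""
    cmd = [
        "docker", "buildx", "build",
        "--file", containerfile,
        "--tag", image_tag,
        # Reading from the cache, which the description says works from pull requests.
        "--cache-from", f"type=gha,scope={scope}",
    ]
    if save_cache:
        # mode=max also exports intermediate layers, not only the final image layers.
        cmd += ["--cache-to", f"type=gha,mode=max,scope={scope}"]
    cmd += ["--load", "."]  # load the result into the local image store
    return cmd


if __name__ == "__main__":
    argv = buildx_command(
        image_tag="ci-example",                 # hypothetical tag
        scope="example-job",                    # hypothetical per-job scope
        containerfile="ci/example.Dockerfile",  # hypothetical path
        save_cache=False,
    )
    print(shlex.join(argv))
```

Using one scope per job is a design choice in this sketch so that jobs with different base images do not overwrite each other's layers.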
But what about...
x?
We have tested many other providers, including Runs-on, Buildjet, WarpBuild, and GitHub hosted runners (and investigated even more). But they all fall short in one way or another.
Administration: Read|Write) for our use-case.
TODO:
To complete migration from self-hosted to hosted for this repo, the backport branches `27.x`, `28.x` and `29.x` would also need their CI ported, but these are left for followups to this change (and pending review/changes here first).
Work and experimentation undertaken with m3dwards