RISE RISC-V Runners is a GitHub App that provides ephemeral RISC-V runners for GitHub Actions workflows. It listens for `workflow_job` webhooks and provisions runners on Kubernetes using a demand-matching model.
- Install the app on your organization from https://github.com/apps/rise-risc-v-runners.
- Contact the app administrators to have your organization added to the allowlist.
Use `runs-on: ubuntu-24.04-riscv` in your workflow:

```yaml
jobs:
  build:
    runs-on: ubuntu-24.04-riscv
    steps:
      - uses: actions/checkout@v4
      - run: uname -m  # riscv64
```

Available platform labels:
| Labels | Board | Description |
|---|---|---|
| `ubuntu-24.04-riscv` | scw-em-rv1 | Scaleway EM-RV1 RISC-V |
| `ubuntu-24.04-riscv-2xlarge` | cloudv10x-pioneer | Cloud-V-provided hardware with larger number of cores (MILK-V Pioneer) |
| `ubuntu-24.04-riscv-rvv` | cloudv10x-rvv | Cloud-V-provided hardware with RVV support |
| `ubuntu-26.04-riscv` | scw-em-rv1 | Scaleway EM-RV1 RISC-V (Ubuntu 26.04) |
| `ubuntu-26.04-riscv-2xlarge` | cloudv10x-pioneer | Cloud-V-provided hardware with larger number of cores (MILK-V Pioneer) (Ubuntu 26.04) |
| `ubuntu-26.04-riscv-rvv` | cloudv10x-rvv | Cloud-V-provided hardware with RVV support (Ubuntu 26.04) |
- Install the GitHub App on your organization or personal account.
- Runners are ephemeral -- each runner handles exactly one job and then terminates.
The app uses a demand-matching model: on one side, `workflow_job` events create demand for runners; on the other, k8s workers provide supply. The scheduler scales supply to match demand per `(entity_id, job_labels)` pool, with configurable limits.
Two GitHub Apps are used: one for organizations (org-scoped runners with runner groups) and one for personal accounts (repo-scoped runners). The `entity_id` abstracts over both: it is the org ID for organizations or the repo ID for personal accounts.
Jobs and workers are not directly linked -- the only relationship is through the entity. GitHub provides no direct job-to-runner link; a runner is attached to an org or repo, and the job runs inside that context.
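The org-vs-personal split can be sketched as a small resolver over the webhook payload. This is an illustrative sketch, not the actual ghfe.py code; the field names follow GitHub's `workflow_job` payload, the function name is made up:

```python
def resolve_entity(payload: dict) -> tuple[str, int]:
    """Return (entity_type, entity_id) for a workflow_job webhook payload.

    Organizations use the org ID; personal accounts fall back to the repo ID,
    since their runners are repo-scoped.
    """
    org = payload.get("organization")
    if org is not None:
        return "Organization", org["id"]
    return "User", payload["repository"]["id"]


org_event = {"organization": {"id": 42}, "repository": {"id": 7}}
user_event = {"repository": {"id": 7}}
print(resolve_entity(org_event))   # ('Organization', 42)
print(resolve_entity(user_event))  # ('User', 7)
```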
The system is split into two containers:
- ghfe receives GitHub webhooks, validates them, and writes job state to PostgreSQL. It makes no GitHub API or k8s calls.
- scheduler reads job state from PostgreSQL, provisions runner pods on k8s, reconciles with GitHub, and cleans up completed pods.
```
GitHub (workflow_job webhook)
  |
  v
ghfe (ghfe.py)
  | - Proxies webhooks to staging for staging entities (prod only)
  | - Verifies webhook signature
  | - Validates labels, determines entity type (org or personal)
  | - Resolves (entity_id, job_labels) -> (k8s_pool, k8s_image)
  | - Writes job to PostgreSQL
  | - Serves /usage, /history
  | - NO GitHub API calls, NO k8s calls
  |
  v
PostgreSQL (state store)
  | - jobs table: all job metadata with status_enum, sorted JSONB labels
  | - workers table: never deleted, status tracked (pending/running/completed/failed)
  | - failure_info: exhaustive diagnostics for failed pods (including stuck ones)
  | - LISTEN/NOTIFY: wakes scheduler on new jobs
  |
  v
Scheduler (scheduler.py)
  | - gh_reconcile_jobs: sync job status with GitHub
  | - gh_reconcile_runners: detect stuck runners (registration timeout,
  |   pending timeout); clean up orphan/terminal runners from GitHub
  | - cleanup_pods: sync worker status from k8s, delete terminal pods
  |   after a 6h grace period (so logs stay inspectable)
  | - demand_match: provision runners where demand > supply
  | - Woken by PostgreSQL LISTEN/NOTIFY or 15s timeout
  |
  v
Kubernetes (runner pods)
```
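The signature check in ghfe follows GitHub's standard HMAC-SHA256 webhook scheme (`X-Hub-Signature-256` header). A minimal sketch, assuming that scheme; the helper name is illustrative, not the actual ghfe.py code:

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a raw webhook body against the X-Hub-Signature-256 header."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, signature_header)


secret = b"webhook-secret"
body = b'{"action": "queued"}'
good = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
print(verify_signature(secret, body, good))          # True
print(verify_signature(secret, body, "sha256=bad"))  # False
```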
```
GitHub -> ghfe: workflow_job (action=queued)
  ghfe: validate signature, labels, entity type
  ghfe: match_labels_to_k8s(labels) -> (k8s_pool, k8s_image)
  ghfe -> PostgreSQL: add_job() -> INSERT + NOTIFY queue_event
  ghfe -> GitHub: 200 OK

Scheduler: woken by LISTEN/NOTIFY (or 15s timeout)
Scheduler: get_pending_jobs() -> SELECT job_id FROM jobs WHERE status='pending' ORDER BY created_at
Scheduler: for each pending job:
  - get_pool_demand(entity_id, job_labels) -> (jobs, workers)
  - if jobs <= workers: skip (demand met)
  - if entity total workers >= max_workers: skip
  - has_available_slot(node_selector): skip if no capacity
  - add_worker(entity_id, k8s_pool, name, labels, image) -> reserve name in DB
  - authenticate_app(installation_id, entity_type) -> token
  - [org] ensure_runner_group(entity_name, token) -> group_id
  - [org] create_jit_runner_config_org(token, group_id, labels, entity_name, name) -> jit_config
  - [personal] create_jit_runner_config_repo(token, labels, repo_full_name, name) -> jit_config
  - provision_runner(jit_config, name, image, pool, entity_id, entity_name) -> pod

GitHub -> ghfe: workflow_job (action=in_progress)
  ghfe -> PostgreSQL: update_job_running(job_id)
    - UPDATE jobs SET status='running' WHERE status='pending'
  ghfe -> GitHub: 200 OK

GitHub -> ghfe: workflow_job (action=completed)
  ghfe -> PostgreSQL: update_job_completed(job_id)
    - UPDATE jobs SET status='completed' WHERE status IN ('pending', 'running')
  ghfe -> GitHub: 200 OK
```
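The three webhook actions map onto forward-only status updates. A toy sketch with an in-memory dict standing in for the jobs table; the guard conditions mirror the WHERE clauses above, but the function is illustrative:

```python
jobs: dict[int, str] = {}  # job_id -> status, stand-in for the jobs table


def handle_workflow_job(job_id: int, action: str) -> None:
    """Apply a workflow_job webhook to the job store, forward-only."""
    if action == "queued":
        jobs.setdefault(job_id, "pending")
    elif action == "in_progress":
        if jobs.get(job_id) == "pending":          # WHERE status='pending'
            jobs[job_id] = "running"
    elif action == "completed":
        if jobs.get(job_id) in ("pending", "running"):
            jobs[job_id] = "completed"


handle_workflow_job(1, "queued")
handle_workflow_job(1, "in_progress")
handle_workflow_job(1, "completed")
handle_workflow_job(1, "in_progress")  # late/out-of-order webhook: ignored
print(jobs[1])  # completed
```

The explicit guards make out-of-order webhook delivery harmless: a late `in_progress` can never move a completed job backwards.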
Cancellation is passive. When a job is cancelled on GitHub:
- The `completed` webhook fires and marks the job completed in PostgreSQL
- If a worker was already provisioned, it picks up another job or times out
- GH reconciliation detects stale jobs within ~15s and cleans them up
```
queued webhook       in_progress webhook      completed webhook
      |                       |                       |
      v                       v                       v
   PENDING -----------> RUNNING -----------> COMPLETED
      |                                          ^
      +------------------------------------------+
           completed webhook (before provision)
```
```
add_worker() reserves name in DB (status=pending, running_at=NULL, completed_at=NULL)
  -> k8s pod created
  -> k8s pod Running   -> status=running, running_at set from container start time
  -> k8s pod Succeeded -> status=completed, completed_at set from container finish time
  -> k8s pod Failed    -> status=failed, completed_at set, failure_info populated
```

cleanup_pods() keeps the pod around for 6 hours (`POD_DELETE_GRACE_SECONDS`) after termination so logs/events remain accessible via kubectl, then deletes it. The worker row is updated immediately on phase transition (not after delete).
Additional health checks run in `gh_reconcile_runners` (not `cleanup_pods`):
- `RUNNER_NEVER_REGISTERED`: the pod has been Running for more than `RUNNER_REGISTRATION_TIMEOUT_SECONDS` (120s) but the runner never appeared in the GitHub API. The worker is marked `failed`, the pod is deleted immediately (no grace period -- the slot is needed for retries), and any stale GH runner entry is removed. `failure_info` captures logs/events before deletion.
- `POD_STUCK_PENDING`: the pod has been Pending for more than `POD_PENDING_TIMEOUT_SECONDS` (600s), likely due to missing capacity or image pull failures. Same remediation.
Workers are never deleted from PostgreSQL. The `status` field tracks the lifecycle: `pending -> running -> completed|failed`. Historical workers with `failure_info` are available for post-mortem debugging.
`gh_reconcile_runners` also performs GitHub-side cleanup each cycle:
- Runners registered in GitHub whose worker row is `completed` or `failed` are deleted from GitHub.
- Runners registered in GitHub with no matching worker row (orphans from a previous scheduler, crashed provisioning, etc.) are deleted.
- For org-scoped runners the listing is scoped to the `RISE RISC-V Runners` runner group; for repo-scoped runners (personal accounts), runners are filtered by the `rise-riscv-runner{-staging}-` name prefix.
Tables live in a `prod` or `staging` schema (same database, isolated by `SET search_path`).
```sql
CREATE TYPE status_enum AS ENUM ('pending', 'running', 'completed', 'failed');

CREATE TABLE jobs (
    job_id          BIGINT PRIMARY KEY,
    status          status_enum NOT NULL DEFAULT 'pending',
    entity_id       BIGINT NOT NULL,
    entity_name     TEXT NOT NULL,
    entity_type     TEXT NOT NULL,              -- 'Organization' or 'User'
    repo_full_name  TEXT NOT NULL,
    installation_id BIGINT NOT NULL,
    job_labels      JSONB NOT NULL DEFAULT '[]', -- sorted at write time
    k8s_pool        TEXT NOT NULL,
    k8s_image       TEXT NOT NULL,
    html_url        TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE workers (
    pod_name        TEXT PRIMARY KEY,
    entity_id       BIGINT NOT NULL,
    entity_name     TEXT NOT NULL,
    entity_type     TEXT NOT NULL,   -- 'Organization' or 'User'
    installation_id BIGINT NOT NULL, -- GitHub App installation, needed for reconcile calls
    repo_full_name  TEXT,            -- only set for User entities (repo-scoped runners); NULL for Organization
    job_labels      JSONB NOT NULL DEFAULT '[]',
    k8s_pool        TEXT NOT NULL,
    k8s_image       TEXT NOT NULL,
    k8s_node        TEXT,
    status          status_enum NOT NULL DEFAULT 'pending',
    failure_info    JSONB,           -- exhaustive diagnostics for Failed and stuck pods (version=2)
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    running_at      TIMESTAMPTZ,     -- set when k8s pod first reaches running
    completed_at    TIMESTAMPTZ,     -- set when status transitions to completed|failed
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

Status transitions are forward-only: `pending -> running -> (completed | failed)`. All UPDATE queries enforce this with explicit WHERE clauses. A failed worker does not count toward supply in `get_pool_demand`, so `demand_match` automatically re-provisions a runner for the same pending job on the next loop iteration.
```
demand  = COUNT(jobs    WHERE entity_id = ? AND job_labels = ? AND status IN (pending, running))
supply  = COUNT(workers WHERE entity_id = ? AND job_labels = ? AND status IN (pending, running))
deficit = demand - supply
```
Demand and supply are matched by `(entity_id, job_labels)`. This prevents the bug where different label sets mapping to the same pool cause stuck workers (e.g., PyTorch's `linux.riscv64.xlarge` and `ubuntu-24.04-riscv` both map to `scw-em-rv1` but need separate runners with matching labels).
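The per-pool accounting can be sketched as a pure function over the two tables, with plain dicts standing in for DB rows (a hedged sketch; the real queries live in db.py):

```python
import json


def pool_key(entity_id: int, labels: list[str]) -> tuple[int, str]:
    # Labels are sorted at write time, so equal label sets compare equal as JSONB.
    return entity_id, json.dumps(sorted(labels))


def deficit(jobs: list[dict], workers: list[dict], entity_id: int,
            labels: list[str]) -> int:
    """demand - supply for one (entity_id, job_labels) pool."""
    key = pool_key(entity_id, labels)
    live = ("pending", "running")
    demand = sum(1 for j in jobs
                 if pool_key(j["entity_id"], j["labels"]) == key and j["status"] in live)
    supply = sum(1 for w in workers
                 if pool_key(w["entity_id"], w["labels"]) == key and w["status"] in live)
    return demand - supply


jobs = [{"entity_id": 1, "labels": ["ubuntu-24.04-riscv"], "status": "pending"}]
workers = [{"entity_id": 1, "labels": ["linux.riscv64.xlarge"], "status": "running"}]
# Same k8s pool, different labels: the running worker cannot take this job,
# so the pool still shows a deficit of 1 and a new runner is provisioned.
print(deficit(jobs, workers, 1, ["ubuntu-24.04-riscv"]))  # 1
```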
The scheduler iterates pending jobs in FIFO order. For each job:
- If `demand <= supply` for its `(entity_id, job_labels)`: skip (demand already met)
- If the entity's total workers across all pools >= `max_workers`: skip
- If there is no k8s node capacity for the pool's node selector: skip
- Otherwise: provision a new runner
Per-entity configuration is defined in `ENTITY_CONFIG` in `constants.py`, keyed by entity ID (org ID or user ID):

| Field | Type | Description |
|---|---|---|
| `max_workers` | `int` or `None` | Maximum concurrent workers across all pools. `None` = unlimited |
| `staging` | `bool` | If true, webhooks are proxied from prod to staging |
ghfe:

| Route | Method | Description |
|---|---|---|
| `/` | POST | Webhook endpoint for `workflow_job` events |
| `/health` | GET | Health check (returns `ok`) |
| `/usage` | GET | Human-readable view of per-pool jobs and workers |
| `/history` | GET | Job history sorted by status (pending, running, completed) then creation time |

scheduler:

| Route | Method | Description |
|---|---|---|
| `/health` | GET | Health check (returns `ok`) |
| File | Purpose |
|---|---|
| `container/constants.py` | Environment configuration, entity config, image tags |
| `container/ghfe.py` | Flask webhook handler -- validates requests, writes to PostgreSQL |
| `container/scheduler.py` | Scheduler -- GH reconciliation, demand matching, cleanup, worker status sync |
| `container/k8s.py` | Kubernetes pod provisioning, deletion, capacity checks, failure info collection |
| `container/db.py` | PostgreSQL database operations |
| `container/Dockerfile` | Docker image for the ghfe and scheduler containers |
| Service | Product | Purpose |
|---|---|---|
| ghfe | Scaleway Container | Receives webhooks, writes job state to PostgreSQL |
| scheduler | Scaleway Container | Demand matching, pod provisioning, cleanup, worker status sync |
| State store | Scaleway Managed Database | PostgreSQL: jobs + workers tables |
| Runner pods | Self-hosted k8s clusters | Ephemeral RISC-V runner pods |
Production and staging each have their own k8s cluster, provisioned via the scripts/ tooling. Four containers are deployed total:
- `ghfe` + `scheduler` (production, `main` branch)
- `ghfe` + `scheduler` (staging, `staging` branch)
Create a python venv and install dev dependencies:
```shell
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements-dev.txt
```

Run tests:

```shell
source .venv/bin/activate && PYTHONPATH=container python3 -m pytest
```

Tests mock PostgreSQL and Kubernetes -- no live services are required.
Deployment is handled automatically by GitHub Actions (.github/workflows/release.yml).
- Push to `main` automatically deploys to production: runs tests, builds the `ghfe` and `scheduler` Docker images, pushes them to Scaleway Container Registry, and deploys via `serverless deploy`.
- Push to `staging` automatically deploys to staging: same pipeline but builds `:staging` tags. After deploy, it triggers a sample workflow to verify end-to-end.
- Manual deploy via the Actions tab: click "Run workflow", select "staging" or "production".
- The CI pipeline runs tests first. If tests fail, deploy is skipped.
- Docker image build and push takes ~1 minute.
- `serverless deploy` takes ~1 minute to update the containers on Scaleway.
- Total pipeline time is ~2-3 minutes.
The following secrets must be configured in the repository settings (Settings > Secrets and variables > Actions):
| Secret | Description |
|---|---|
| `SCW_SECRET_KEY` | Scaleway API secret key (used for container registry login and serverless deploy) |
| `GHAPP_WEBHOOK_SECRET` | GitHub webhook HMAC secret (shared by both apps) |
| `GHAPP_ORG_PRIVATE_KEY` | GitHub App RSA private key for organizations (PEM format) |
| `GHAPP_PERSONAL_PRIVATE_KEY` | GitHub App RSA private key for personal accounts (PEM format) |
| `K8S_KUBECONFIG` | Kubeconfig for the Kubernetes cluster |
| `POSTGRES_URL` | PostgreSQL connection string (e.g. `postgresql://user:pass@<host>:5432/db?sslmode=require`) |
| `RISCV_RUNNER_SAMPLE_ACCESS_TOKEN` | PAT for triggering the sample workflow on staging deploy |
Production and staging each have their own k8s cluster on Scaleway, managed via scripts in scripts/.
| Script | Purpose |
|---|---|
| `scripts/scw-provision-control-plane.py` | Create a k8s control plane instance (Scaleway POP2-2C-8G) with containerd, kubeadm, Flannel CNI, RBAC, and device plugins |
| `scripts/scw-provision-runner.py` | Create, reinstall, list, or delete bare metal RISC-V runner nodes (Scaleway EM-RV1) |
| `scripts/constants.py` | Scaleway project ID, zone, private network ID, SSH key IDs |
| `scripts/utils.py` | Scaleway SDK clients, SSH helpers, BareMetal/Instance wrappers |
```shell
cd scripts
python3 -m venv .venv-scripts
source .venv-scripts/bin/activate
pip3 install -r requirements.txt

# 1. Create the control plane
## Pass --staging for a staging control-plane
python scw-provision-control-plane.py create [--staging]

# 2. Add runner nodes (creates 3 bare metal RISC-V servers)
python scw-provision-runner.py --control-plane <control-plane-name> create 3

# 3. Update GitHub Secrets:
## Note the `--env main` for the prod environment; use `--env staging` for the staging environment
ssh root@$(scw instance server list zone=fr-par-2 project-id=03a2e06e-e7c1-45a6-9f05-775d813c2e28 -o json | jq -r '.[] | select(.name == "<control-plane-name>") | .public_ip.address') cat /etc/kubernetes/kubeconfig-gh-app.conf | gh secret set K8S_KUBECONFIG --repo riseproject-dev/riscv-runner-app --env main
ssh root@$(scw instance server list zone=fr-par-2 project-id=03a2e06e-e7c1-45a6-9f05-775d813c2e28 -o json | jq -r '.[] | select(.name == "<control-plane-name>") | .public_ip.address') cat /etc/kubernetes/kubeconfig-gh-deploy.conf | gh secret set K8S_KUBECONFIG --repo riseproject-dev/riscv-runner-images --env main
ssh root@$(scw instance server list zone=fr-par-2 project-id=03a2e06e-e7c1-45a6-9f05-775d813c2e28 -o json | jq -r '.[] | select(.name == "<control-plane-name>") | .public_ip.address') cat /etc/kubernetes/kubeconfig-gh-deploy.conf | gh secret set K8S_KUBECONFIG --repo riseproject-dev/riscv-runner-device-plugin --env main
```

```shell
# List runners tagged to a control plane
python scw-provision-runner.py --control-plane <control-plane-name> list

# Reinstall OS on a runner (wipes and re-joins the cluster)
python scw-provision-runner.py --control-plane <control-plane-name> reinstall <runner-name>

# Reinstall OS on many runners (4 in parallel)
parallel --tag --line-buffer --halt never --delay 3 -j 4 --tagstring '[{}]' \
  python3 -u scw-provision-runner.py reinstall {} \
  ::: riscv-runner-{6,25,27,30,33,34}

# Delete runners
python scw-provision-runner.py --control-plane <control-plane-name> delete <runner-name>
```

RBAC is configured automatically by the control plane provisioning script. The key users:
- `gh-app` -- used by the scheduler container. Has edit access and node list permission for capacity checks.
- `gh-deploy` -- used by CI for the kubeconfig stored in GitHub Secrets. Has cluster-admin access.
Runner pods stay alive for 6 hours after reaching Succeeded/Failed so their logs and events can still be inspected via kubectl. The worker row in PostgreSQL is updated to completed/failed immediately on phase transition (not after the pod is deleted), so pool supply accounting is accurate throughout the grace period.
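The grace-period decision boils down to a timestamp comparison against the pod's terminal phase. A minimal sketch, assuming that shape; the constant mirrors the documented 6h value, the function name is illustrative:

```python
from datetime import datetime, timedelta, timezone

POD_DELETE_GRACE_SECONDS = 6 * 3600


def should_delete(phase: str, finished_at: datetime, now: datetime) -> bool:
    """Delete terminal pods only after the log-inspection grace period."""
    if phase not in ("Succeeded", "Failed"):
        return False  # live pods are never garbage-collected here
    return (now - finished_at).total_seconds() > POD_DELETE_GRACE_SECONDS


now = datetime(2025, 1, 1, 12, tzinfo=timezone.utc)
print(should_delete("Succeeded", now - timedelta(hours=7), now))  # True
print(should_delete("Succeeded", now - timedelta(hours=1), now))  # False
```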
To manually clean up finished pods ahead of the grace period:

```shell
kubectl delete pods -l app=rise-riscv-runner --field-selector=status.phase!=Running,status.phase!=Pending,status.phase!=Unknown
```

```shell
# Connect to PostgreSQL (use connection string from POSTGRES_URL secret)
psql $POSTGRES_URL
```
```sql
-- Check demand for a label set
SELECT COUNT(*) FROM staging.jobs WHERE entity_id = {entity_id} AND job_labels = '["ubuntu-24.04-riscv"]' AND (status = 'pending' OR status = 'running');

-- Check supply for a label set
SELECT COUNT(*) FROM staging.workers WHERE entity_id = {entity_id} AND job_labels = '["ubuntu-24.04-riscv"]' AND (status = 'pending' OR status = 'running');

-- View a job
SELECT * FROM staging.jobs WHERE job_id = {job_id};

-- View recent failed workers with diagnostics
SELECT pod_name, entity_id, k8s_pool, failure_info FROM staging.workers WHERE status = 'failed' ORDER BY completed_at DESC LIMIT 10;

-- Filter by failure reason (e.g. runners that never registered with GitHub)
SELECT pod_name, entity_name, completed_at FROM staging.workers WHERE status = 'failed' AND failure_info->>'reason' = 'runner_never_registered' ORDER BY completed_at DESC LIMIT 20;
```