
RISC-V Runner App

A GitHub App that listens for workflow_job webhooks and provisions ephemeral RISC-V GitHub Actions runners on Kubernetes using a demand-matching model.

Usage

RISE RISC-V Runners is a GitHub App that provides ephemeral RISC-V runners for GitHub Actions workflows.

Installation

  1. Install the app on your organization from https://github.com/apps/rise-risc-v-runners.
  2. Contact the app administrators to have your organization added to the allowlist.

Running workflows on RISC-V

Use `runs-on: ubuntu-24.04-riscv` in your workflow:

```yaml
jobs:
  build:
    runs-on: ubuntu-24.04-riscv
    steps:
      - uses: actions/checkout@v4
      - run: uname -m  # riscv64
```

Available platform labels:

| Label | Board | Description |
|---|---|---|
| ubuntu-24.04-riscv | scw-em-rv1 | Scaleway EM-RV1 RISC-V |
| ubuntu-24.04-riscv-2xlarge | cloudv10x-pioneer | Cloud-V-provided hardware with a larger number of cores (MILK-V Pioneer) |
| ubuntu-24.04-riscv-rvv | cloudv10x-rvv | Cloud-V-provided hardware with RVV support |
| ubuntu-26.04-riscv | scw-em-rv1 | Scaleway EM-RV1 RISC-V (Ubuntu 26.04) |
| ubuntu-26.04-riscv-2xlarge | cloudv10x-pioneer | Cloud-V-provided hardware with a larger number of cores (MILK-V Pioneer) (Ubuntu 26.04) |
| ubuntu-26.04-riscv-rvv | cloudv10x-rvv | Cloud-V-provided hardware with RVV support (Ubuntu 26.04) |

Requirements

  • Install the GitHub App on your organization or personal account.
  • Runners are ephemeral -- each runner handles exactly one job and then terminates.

Architecture

The app uses a demand matching model: on one side, workflow_jobs create demand for runners; on the other, k8s workers provide supply. The scheduler scales supply to match demand per (entity, job_labels) pool, with configurable limits.

Two GitHub Apps are used: one for organizations (org-scoped runners with runner groups) and one for personal accounts (repo-scoped runners). The entity_id abstracts over both: it is org_id for organizations or repo_id for personal accounts.

Jobs and workers are not directly linked -- the only relationship is through the entity. GitHub makes no direct job-to-runner link; a runner is attached to an org or repo, and the job runs inside that context.

The system is split into two containers:

  • ghfe receives GitHub webhooks, validates them, and writes job state to PostgreSQL. It makes no GitHub API or k8s calls.
  • scheduler reads job state from PostgreSQL, provisions runner pods on k8s, reconciles with GitHub, and cleans up completed pods.

```
GitHub (workflow_job webhook)
  |
  v
ghfe (ghfe.py)
  |  - Proxies webhooks to staging for staging entities (prod only)
  |  - Verifies webhook signature
  |  - Validates labels, determines entity type (org or personal)
  |  - Resolves (entity_id, job_labels) -> (k8s_pool, k8s_image)
  |  - Writes job to PostgreSQL
  |  - Serves /usage, /history
  |  - NO GitHub API calls, NO k8s calls
  |
  v
PostgreSQL (state store)
  |  - jobs table: all job metadata with status_enum, sorted JSONB labels
  |  - workers table: never deleted, status tracked (pending/running/completed/failed)
  |  - failure_info: exhaustive diagnostics for failed pods (including stuck ones)
  |  - LISTEN/NOTIFY: wakes scheduler on new jobs
  |
  v
Scheduler (scheduler.py)
  |  - gh_reconcile_jobs:    sync job status with GitHub
  |  - gh_reconcile_runners: detect stuck runners (registration timeout,
  |                          pending timeout); clean up orphan/terminal runners
  |                          from GitHub
  |  - cleanup_pods:         sync worker status from k8s, delete terminal pods
  |                          after a 6h grace period (so logs stay inspectable)
  |  - demand_match:         provision runners where demand > supply
  |  - Woken by PostgreSQL LISTEN/NOTIFY or 15s timeout
  |
  v
Kubernetes (runner pods)
```

Sequence: Queued webhook

```
GitHub -> ghfe: workflow_job (action=queued)
ghfe: validate signature, labels, entity type
ghfe: match_labels_to_k8s(labels) -> (k8s_pool, k8s_image)
ghfe -> PostgreSQL: add_job() -> INSERT + NOTIFY queue_event
ghfe -> GitHub: 200 OK
```
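The signature check in the first step can be sketched with Python's standard `hmac` module. GitHub signs the raw request body with the shared webhook secret and sends the digest in the `X-Hub-Signature-256` header as `sha256=<hex>`; the function name here is illustrative, not ghfe's actual code:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Validate a GitHub webhook's X-Hub-Signature-256 header."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(expected, signature_header)

secret = b"webhook-secret"
body = b'{"action": "queued"}'
good = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
print(verify_signature(secret, body, good))        # True
print(verify_signature(secret, body, "sha256=0"))  # False
```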

Sequence: Scheduler provisioning

```
Scheduler: woken by LISTEN/NOTIFY (or 15s timeout)
Scheduler: get_pending_jobs() -> SELECT job_id FROM jobs WHERE status='pending' ORDER BY created_at
Scheduler: for each pending job:
  - get_pool_demand(entity_id, job_labels) -> (jobs, workers)
  - if jobs <= workers: skip (demand met)
  - if entity total workers >= max_workers: skip
  - has_available_slot(node_selector): skip if no capacity
  - add_worker(entity_id, k8s_pool, name, labels, image) -> reserve name in DB
  - authenticate_app(installation_id, entity_type) -> token
  - [org] ensure_runner_group(entity_name, token) -> group_id
  - [org] create_jit_runner_config_org(token, group_id, labels, entity_name, name) -> jit_config
  - [personal] create_jit_runner_config_repo(token, labels, repo_full_name, name) -> jit_config
  - provision_runner(jit_config, name, image, pool, entity_id, entity_name) -> pod
```

Sequence: In-progress webhook

```
GitHub -> ghfe: workflow_job (action=in_progress)
ghfe -> PostgreSQL: update_job_running(job_id)
  - UPDATE jobs SET status='running' WHERE status='pending'
ghfe -> GitHub: 200 OK
```

Sequence: Completed webhook

```
GitHub -> ghfe: workflow_job (action=completed)
ghfe -> PostgreSQL: update_job_completed(job_id)
  - UPDATE jobs SET status='completed' WHERE status IN ('pending', 'running')
ghfe -> GitHub: 200 OK
```

Sequence: Cancellation

Cancellation is passive. When a job is cancelled on GitHub:

  1. The completed webhook fires and marks the job completed in PostgreSQL
  2. If a worker was already provisioned, it picks up another job or times out
  3. GH reconciliation detects stale jobs within ~15s and cleans them up

Job lifecycle state machine

```
queued webhook      in_progress webhook     completed webhook
    |                       |                       |
    v                       v                       v
 PENDING  ----------->  RUNNING  ----------->  COMPLETED
    |                                               ^
    +-----------------------------------------------+
              completed webhook (before provision)
```
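The state machine above can be sketched as a table of allowed transitions (a hedged sketch; the names are hypothetical, and the real code enforces the same guard through WHERE clauses on its UPDATE statements):

```python
# Forward-only transitions; anything not listed is silently ignored,
# like an UPDATE whose WHERE clause matches zero rows.
ALLOWED = {
    ("pending", "running"),    # in_progress webhook
    ("pending", "completed"),  # completed webhook before provision
    ("running", "completed"),  # completed webhook
}

def transition(current: str, new: str) -> str:
    if (current, new) not in ALLOWED:
        return current  # rejected: state unchanged
    return new

print(transition("pending", "running"))    # running
print(transition("completed", "running"))  # completed (rejected)
```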

Worker lifecycle

```
add_worker() reserves name in DB (status=pending, running_at=NULL, completed_at=NULL)
  -> k8s pod created
  -> K8s pod Running   -> status=running, running_at set from container start time
  -> K8s pod Succeeded -> status=completed, completed_at set from container finish time
  -> K8s pod Failed    -> status=failed,    completed_at set, failure_info populated
       |
       cleanup_pods() keeps the pod around for 6 hours (POD_DELETE_GRACE_SECONDS)
       after termination so logs/events remain accessible via kubectl, then deletes it.
       The worker row is updated immediately on phase transition (not after delete).
```

Additional health checks run in gh_reconcile_runners (not cleanup_pods):

  • RUNNER_NEVER_REGISTERED: pod has been Running for more than RUNNER_REGISTRATION_TIMEOUT_SECONDS (120s) but the runner never appeared in the GitHub API. Worker is marked failed, pod is deleted immediately (no grace period — the slot is needed for retries), and any stale GH runner entry is removed. failure_info captures logs/events before deletion.
  • POD_STUCK_PENDING: pod has been Pending for more than POD_PENDING_TIMEOUT_SECONDS (600s), likely due to missing capacity or image pull failures. Same remediation.

Workers are never deleted from PostgreSQL. The status field tracks the lifecycle: pending -> running -> completed|failed. Historical workers with failure_info are available for post-mortem debugging.

GitHub / Kubernetes / DB reconciliation

gh_reconcile_runners also performs GitHub-side cleanup each cycle:

  • Runners registered in GitHub whose worker row is completed or failed are deleted from GitHub.
  • Runners registered in GitHub with no matching worker row (orphans from a previous scheduler, crashed provisioning, etc.) are deleted.
  • For org-scoped runners the listing is scoped to the RISE RISC-V Runners runner group; for repo-scoped runners (personal accounts), the listing is filtered by the rise-riscv-runner{-staging}- name prefix.

Database schema

Tables live in a prod or staging schema (same database, isolated by SET search_path).

```sql
CREATE TYPE status_enum AS ENUM ('pending', 'running', 'completed', 'failed');

CREATE TABLE jobs (
    job_id          BIGINT PRIMARY KEY,
    status          status_enum NOT NULL DEFAULT 'pending',
    entity_id       BIGINT NOT NULL,
    entity_name     TEXT NOT NULL,
    entity_type     TEXT NOT NULL,        -- 'Organization' or 'User'
    repo_full_name  TEXT NOT NULL,
    installation_id BIGINT NOT NULL,
    job_labels      JSONB NOT NULL DEFAULT '[]',  -- sorted at write time
    k8s_pool        TEXT NOT NULL,
    k8s_image       TEXT NOT NULL,
    html_url        TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE workers (
    pod_name        TEXT PRIMARY KEY,
    entity_id       BIGINT NOT NULL,
    entity_name     TEXT NOT NULL,
    entity_type     TEXT NOT NULL,        -- 'Organization' or 'User'
    installation_id BIGINT NOT NULL,      -- GitHub App installation, needed for reconcile calls
    repo_full_name  TEXT,                 -- only set for User entities (repo-scoped runners); NULL for Organization
    job_labels      JSONB NOT NULL DEFAULT '[]',
    k8s_pool        TEXT NOT NULL,
    k8s_image       TEXT NOT NULL,
    k8s_node        TEXT,
    status          status_enum NOT NULL DEFAULT 'pending',
    failure_info    JSONB,                -- exhaustive diagnostics for Failed and stuck pods (version=2)
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    running_at      TIMESTAMPTZ,          -- set when k8s pod first reaches running
    completed_at    TIMESTAMPTZ,          -- set when status transitions to completed|failed
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

Status transitions are forward-only: pending -> running -> (completed | failed). All UPDATE queries enforce this with explicit WHERE clauses. A failed worker does not count toward supply in get_pool_demand, so demand_match automatically re-provisions a runner for the same pending job on the next loop iteration.

Demand matching algorithm

```
demand  = COUNT(jobs WHERE entity_id = ? AND job_labels = ? AND status IN (pending, running))
supply  = COUNT(workers WHERE entity_id = ? AND job_labels = ? AND status IN (pending, running))
deficit = demand - supply
```

Demand and supply are matched by (entity_id, job_labels). This prevents a class of bug where different label sets map to the same pool but need separate runners with matching labels, leaving workers stuck (e.g., PyTorch's linux.riscv64.xlarge and ubuntu-24.04-riscv both map to scw-em-rv1 but cannot share runners).
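This also explains why job_labels is sorted at write time: two webhooks carrying the same labels in a different order must land in the same pool, while genuinely different label sets stay separate even when they map to the same k8s pool. A minimal sketch (the `pool_key` helper is hypothetical):

```python
import json

def pool_key(entity_id: int, labels: list[str]) -> tuple[int, str]:
    """Canonical (entity_id, job_labels) pool key: labels sorted, then serialized."""
    return (entity_id, json.dumps(sorted(labels)))

a = pool_key(7, ["self-hosted", "ubuntu-24.04-riscv"])
b = pool_key(7, ["ubuntu-24.04-riscv", "self-hosted"])
c = pool_key(7, ["linux.riscv64.xlarge"])
print(a == b)  # True  (same pool: order does not matter)
print(a == c)  # False (separate pools, even if both map to scw-em-rv1)
```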

The scheduler iterates pending jobs in FIFO order. For each job:

  1. If demand <= supply for its (entity_id, job_labels): skip (demand already met)
  2. If entity's total workers across all pools >= max_workers: skip
  3. If no k8s node capacity for the pool's node selector: skip
  4. Otherwise: provision a new runner
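The four-step decision above can be sketched as follows. This is a hedged sketch, not the scheduler's actual code: the injected helpers (`get_pool_demand`, `total_workers`, `has_available_slot`, `provision`) stand in for the real DB and k8s calls, and only the decision order matches the description:

```python
def demand_match(pending_jobs, get_pool_demand, total_workers, max_workers,
                 has_available_slot, provision):
    for job in pending_jobs:                      # FIFO by created_at
        demand, supply = get_pool_demand(job["entity_id"], job["labels"])
        if demand <= supply:
            continue                              # 1. demand already met
        if max_workers is not None and total_workers(job["entity_id"]) >= max_workers:
            continue                              # 2. entity at its worker cap
        if not has_available_slot(job["pool"]):
            continue                              # 3. no node capacity
        provision(job)                            # 4. provision a new runner

provisioned = []
demand_match(
    [{"entity_id": 1, "labels": ["ubuntu-24.04-riscv"], "pool": "scw-em-rv1"}],
    get_pool_demand=lambda e, l: (1, 0),          # one job queued, no workers
    total_workers=lambda e: 0,
    max_workers=4,
    has_available_slot=lambda p: True,
    provision=provisioned.append,
)
print(len(provisioned))  # 1
```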

Configuration

Per-entity configuration is defined in ENTITY_CONFIG in constants.py, keyed by entity ID (org ID or user ID):

| Field | Type | Description |
|---|---|---|
| max_workers | int or None | Maximum concurrent workers across all pools. None = unlimited |
| staging | bool | If true, webhooks are proxied from prod to staging |
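For illustration, the map might look like this (a hypothetical example of the shape only; the IDs are made up and the real map lives in container/constants.py):

```python
# Hypothetical ENTITY_CONFIG entries, keyed by org ID or user ID.
ENTITY_CONFIG = {
    123456: {"max_workers": 4, "staging": False},    # an org capped at 4 workers
    789012: {"max_workers": None, "staging": True},  # unlimited, proxied to staging
}

print(ENTITY_CONFIG[123456]["max_workers"])  # 4
```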

HTTP routes

ghfe:

| Route | Method | Description |
|---|---|---|
| / | POST | Webhook endpoint for workflow_job events |
| /health | GET | Health check (returns ok) |
| /usage | GET | Human-readable view of per-pool jobs and workers |
| /history | GET | Job history sorted by status (pending, running, completed) then creation time |

scheduler:

| Route | Method | Description |
|---|---|---|
| /health | GET | Health check (returns ok) |

Key files

| File | Purpose |
|---|---|
| container/constants.py | Environment configuration, entity config, image tags |
| container/ghfe.py | Flask webhook handler -- validates requests, writes to PostgreSQL |
| container/scheduler.py | Scheduler -- GH reconciliation, demand matching, cleanup, worker status sync |
| container/k8s.py | Kubernetes pod provisioning, deletion, capacity checks, failure info collection |
| container/db.py | PostgreSQL database operations |
| container/github.py | GitHub API functions (auth, runner groups, JIT config, job status) |
| container/Dockerfile | Docker image for the ghfe and scheduler containers |

Infrastructure

| Service | Product | Purpose |
|---|---|---|
| ghfe | Scaleway Container | Receives webhooks, writes job state to PostgreSQL |
| scheduler | Scaleway Container | Demand matching, pod provisioning, cleanup, worker status sync |
| State store | Scaleway Managed Database | PostgreSQL: jobs + workers tables |
| Runner pods | Self-hosted k8s clusters | Ephemeral RISC-V runner pods |

Production and staging each have their own k8s cluster, provisioned via the scripts/ tooling. Four containers are deployed total:

  • ghfe + scheduler (production, main branch)
  • ghfe + scheduler (staging, staging branch)

Development

Create a python venv and install dev dependencies:

```sh
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements-dev.txt
```

Run tests:

```sh
source .venv/bin/activate && PYTHONPATH=container python3 -m pytest
```

Tests mock PostgreSQL and Kubernetes -- no live services are required.

Deployment

Deployment is handled automatically by GitHub Actions (.github/workflows/release.yml).

How it works

  1. Push to main automatically deploys to production: runs tests, builds the ghfe and scheduler Docker images, pushes them to Scaleway Container Registry, and deploys via serverless deploy.
  2. Push to staging automatically deploys to staging: same pipeline but builds :staging tags. After deploy, it triggers a sample workflow to verify end-to-end.
  3. Manual deploy via the Actions tab: click "Run workflow", select "staging" or "production".

What to expect

  • The CI pipeline runs tests first. If tests fail, deploy is skipped.
  • Docker image build and push takes ~1 minute.
  • serverless deploy takes ~1 minute to update the containers on Scaleway.
  • Total pipeline time is ~2-3 minutes.

GitHub Secrets

The following secrets must be configured in the repository settings (Settings > Secrets and variables > Actions):

| Secret | Description |
|---|---|
| SCW_SECRET_KEY | Scaleway API secret key (used for container registry login and serverless deploy) |
| GHAPP_WEBHOOK_SECRET | GitHub webhook HMAC secret (shared by both apps) |
| GHAPP_ORG_PRIVATE_KEY | GitHub App RSA private key for organizations (PEM format) |
| GHAPP_PERSONAL_PRIVATE_KEY | GitHub App RSA private key for personal accounts (PEM format) |
| K8S_KUBECONFIG | Kubeconfig for the Kubernetes cluster |
| POSTGRES_URL | PostgreSQL connection string (e.g. postgresql://user:pass@<host>:5432/db?sslmode=require) |
| RISCV_RUNNER_SAMPLE_ACCESS_TOKEN | PAT for triggering the sample workflow on staging deploy |

Kubernetes cluster provisioning

Production and staging each have their own k8s cluster on Scaleway, managed via scripts in scripts/.

Provisioning scripts

| Script | Purpose |
|---|---|
| scripts/scw-provision-control-plane.py | Create a k8s control plane instance (Scaleway POP2-2C-8G) with containerd, kubeadm, Flannel CNI, RBAC, and device plugins |
| scripts/scw-provision-runner.py | Create, reinstall, list, or delete bare metal RISC-V runner nodes (Scaleway EM-RV1) |
| scripts/constants.py | Scaleway project ID, zone, private network ID, SSH key IDs |
| scripts/utils.py | Scaleway SDK clients, SSH helpers, BareMetal/Instance wrappers |

Creating a new cluster from scratch

```sh
cd scripts
python3 -m venv .venv-scripts
source .venv-scripts/bin/activate
pip3 install -r requirements.txt

# 1. Create the control plane
## Pass --staging for a staging control plane
python scw-provision-control-plane.py create [--staging]

# 2. Add runner nodes (creates 3 bare metal RISC-V servers)
python scw-provision-runner.py --control-plane <control-plane-name> create 3

# 3. Update GitHub Secrets:
## Note: --env main targets the prod environment; use --env staging for staging
ssh root@$(scw instance server list zone=fr-par-2 project-id=03a2e06e-e7c1-45a6-9f05-775d813c2e28 -o json | jq -r '.[] | select(.name == "<control-plane-name>") | .public_ip.address') cat /etc/kubernetes/kubeconfig-gh-app.conf | gh secret set K8S_KUBECONFIG --repo riseproject-dev/riscv-runner-app --env main
ssh root@$(scw instance server list zone=fr-par-2 project-id=03a2e06e-e7c1-45a6-9f05-775d813c2e28 -o json | jq -r '.[] | select(.name == "<control-plane-name>") | .public_ip.address') cat /etc/kubernetes/kubeconfig-gh-deploy.conf | gh secret set K8S_KUBECONFIG --repo riseproject-dev/riscv-runner-images --env main
ssh root@$(scw instance server list zone=fr-par-2 project-id=03a2e06e-e7c1-45a6-9f05-775d813c2e28 -o json | jq -r '.[] | select(.name == "<control-plane-name>") | .public_ip.address') cat /etc/kubernetes/kubeconfig-gh-deploy.conf | gh secret set K8S_KUBECONFIG --repo riseproject-dev/riscv-runner-device-plugin --env main
```

Managing runners

```sh
# List runners tagged to a control plane
python scw-provision-runner.py --control-plane <control-plane-name> list

# Reinstall OS on a runner (wipes and re-joins the cluster)
python scw-provision-runner.py --control-plane <control-plane-name> reinstall <runner-name>

# Reinstall OS on many runners (4 in parallel)
parallel --tag --line-buffer --halt never --delay 3 -j 4 --tagstring '[{}]' \
  python3 -u scw-provision-runner.py reinstall {} \
  ::: riscv-runner-{6,25,27,30,33,34}

# Delete runners
python scw-provision-runner.py --control-plane <control-plane-name> delete <runner-name>
```

Kubernetes RBAC

RBAC is configured automatically by the control plane provisioning script. The key users:

  • gh-app -- used by the scheduler container. Has edit access and node list permission for capacity checks.
  • gh-deploy -- used by CI for kubeconfig stored in GitHub Secrets. Has cluster-admin access.

Operations

Cleanup terminated runner pods

Runner pods stay alive for 6 hours after reaching Succeeded/Failed so their logs and events can still be inspected via kubectl. The worker row in PostgreSQL is updated to completed/failed immediately on phase transition (not after the pod is deleted), so pool supply accounting is accurate throughout the grace period.

To manually clean up finished pods before the grace period expires:

```sh
kubectl delete pods -l app=rise-riscv-runner --field-selector=status.phase!=Running,status.phase!=Pending,status.phase!=Unknown
```

Inspect database state

Connect to PostgreSQL using the connection string from the POSTGRES_URL secret:

```sh
psql $POSTGRES_URL
```

Useful queries:

```sql
-- Check demand for a label set
SELECT COUNT(*) FROM staging.jobs WHERE entity_id = {entity_id} AND job_labels = '["ubuntu-24.04-riscv"]' AND (status = 'pending' OR status = 'running');

-- Check supply for a label set
SELECT COUNT(*) FROM staging.workers WHERE entity_id = {entity_id} AND job_labels = '["ubuntu-24.04-riscv"]' AND (status = 'pending' OR status = 'running');

-- View a job
SELECT * FROM staging.jobs WHERE job_id = {job_id};

-- View recent failed workers with diagnostics
SELECT pod_name, entity_id, k8s_pool, failure_info FROM staging.workers WHERE status = 'failed' ORDER BY completed_at DESC LIMIT 10;

-- Filter by failure reason (e.g. runners that never registered with GitHub)
SELECT pod_name, entity_name, completed_at FROM staging.workers WHERE status = 'failed' AND failure_info->>'reason' = 'runner_never_registered' ORDER BY completed_at DESC LIMIT 20;
```
