Artemis KG Embeddings

Containerized Nextflow pipeline for generating knowledge graph embeddings and link predictions using PyKEEN for multiple biomedical KGs (Hetionet, BioKG, OpenBioLink, PrimeKG).

Repository Structure

Workflow: main.nf (process embedding)
Global config: nextflow.config
Per-dataset profiles: conf/hetionet.config, conf/biokg.config, conf/openbiolink.config, conf/primekg.config
Container build: Dockerfile, requirements.txt
Deployment pipeline: .github/workflows/docker-deploy.yml
Terraform (public ECR): terraform/main.tf, terraform/providers.tf, terraform/backend.hcl
Ignore rules: .gitignore
License: LICENSE

Pipeline Overview

The Nextflow process named embedding loads the selected dataset via PyKEEN, merges its training, validation, and testing triples, and runs PyKEEN's pipeline() with the hyperparameters read from a YAML config (S3 or local path). Results are saved to config["save"]["path"].
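As a rough illustration of how the YAML hyperparameters could reach pipeline(), the sketch below translates an already-parsed config dict into keyword arguments that pykeen.pipeline.pipeline() accepts (model, model_kwargs, optimizer, optimizer_kwargs, training_kwargs, negative_sampler_kwargs, random_seed). The function name build_pipeline_kwargs and the exact key-to-argument mapping are assumptions made for this example; the actual script in main.nf may wire things differently.

```python
from typing import Any, Dict

# Example config mirroring the keys listed under "Required YAML Config Keys".
EXAMPLE_CONFIG: Dict[str, Any] = {
    "save": {"path": "/output/dir"},
    "model": {"name": "TransE", "embedding_dim": 256},
    "seed": 42,
    "train": {"num_epoch": 50, "num_negative": 32},
    "optimizer": {"class": "Adam", "lr": 0.0005},
}


def build_pipeline_kwargs(config: Dict[str, Any]) -> Dict[str, Any]:
    """Map the parsed YAML config onto pykeen.pipeline.pipeline() arguments.

    Hypothetical helper: the mapping below is an assumption, not the
    repository's confirmed implementation.
    """
    return {
        "model": config["model"]["name"],
        "model_kwargs": {"embedding_dim": config["model"]["embedding_dim"]},
        "optimizer": config["optimizer"]["class"],
        "optimizer_kwargs": {"lr": config["optimizer"]["lr"]},
        # PyKEEN's training loop takes the epoch count as "num_epochs".
        "training_kwargs": {"num_epochs": config["train"]["num_epoch"]},
        # Negative samplers take the per-positive sample count as "num_negs_per_pos".
        "negative_sampler_kwargs": {"num_negs_per_pos": config["train"]["num_negative"]},
        "random_seed": config["seed"],
    }


if __name__ == "__main__":
    for name, value in build_pipeline_kwargs(EXAMPLE_CONFIG).items():
        print(f"{name} = {value!r}")
```

The resulting dict could then be unpacked into a pipeline() call together with the merged triples, and the returned result saved with save_to_directory(config["save"]["path"]).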

Required YAML Config Keys

save:
  path: /output/dir
model:
  name: TransE
  embedding_dim: 256
seed: 42
train:
  num_epoch: 50
  num_negative: 32
optimizer:
  class: Adam
  lr: 0.0005
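
Because a missing key typically surfaces only once the container is already running, it can help to check the config beforehand. The following is a minimal sketch, assuming the YAML has already been parsed into a dict (for example with a YAML library); the names REQUIRED_KEYS and missing_config_keys are hypothetical and not part of the repository.

```python
from typing import Any, Dict, List

# Nested keys taken from the example config above; "seed" is top-level.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "save": ["path"],
    "model": ["name", "embedding_dim"],
    "train": ["num_epoch", "num_negative"],
    "optimizer": ["class", "lr"],
}


def missing_config_keys(config: Dict[str, Any]) -> List[str]:
    """Return dotted names of required keys that are absent from config."""
    missing: List[str] = []
    if "seed" not in config:
        missing.append("seed")
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            # The whole section is absent or is not a mapping.
            missing.append(section)
            continue
        for key in keys:
            if key not in block:
                missing.append(f"{section}.{key}")
    return missing


if __name__ == "__main__":
    incomplete = {"save": {"path": "/output/dir"}, "seed": 42}
    print(missing_config_keys(incomplete))
```

An empty list means every key shown above is present; the check does not validate value types or ranges.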

Parameters (Nextflow)

params.dataset (one of: hetionet, biokg, openbiolink, primekg)
params.config (S3 or local path to YAML)
params.outdir (publish directory / S3 prefix)
params.max_time (wall-time hint)

Profiles supply dataset + config path (see conf/*.config files).

Running the Workflow

Use a profile (recommended):

nextflow run main.nf -profile hetionet

Override output dir:

nextflow run main.nf -profile openbiolink --outdir s3://bucket/path/

Direct parameter usage (without profile):

nextflow run main.nf --dataset hetionet --config s3://bucket/configs/hetionet.yaml
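
When launching runs from a script, the three invocation styles above can be assembled programmatically. The sketch below only builds the argument list; the helper name nextflow_command is hypothetical, and the rule that a run without a profile needs both --dataset and --config is an assumption, since no defaults for these parameters are documented here.

```python
from typing import List, Optional

# Dataset names accepted by params.dataset.
DATASETS = ("hetionet", "biokg", "openbiolink", "primekg")


def nextflow_command(
    profile: Optional[str] = None,
    dataset: Optional[str] = None,
    config: Optional[str] = None,
    outdir: Optional[str] = None,
) -> List[str]:
    """Build the argv list for a `nextflow run main.nf` invocation."""
    if profile is None and (dataset is None or config is None):
        raise ValueError("pass a profile, or both dataset and config")
    if dataset is not None and dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")

    cmd = ["nextflow", "run", "main.nf"]
    if profile is not None:
        cmd += ["-profile", profile]      # single dash: Nextflow option
    if dataset is not None:
        cmd += ["--dataset", dataset]     # double dash: pipeline parameter
    if config is not None:
        cmd += ["--config", config]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    return cmd


if __name__ == "__main__":
    print(" ".join(nextflow_command(profile="hetionet")))
```

The returned list can be passed directly to subprocess.run() without shell quoting.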

The GPU container is defined in nextflow.config and uses the image pushed to public ECR.

Docker Image

Build locally:

docker build -t artemis-kgs-embeddings:local -f Dockerfile .

The CI workflow .github/workflows/docker-deploy.yml auto-tags images with either:

Git tag (without leading v)
Commit short SHA
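
The tagging rule can be expressed in a few lines. This is a sketch of the logic only, not the contents of docker-deploy.yml: it assumes the Git ref arrives in the refs/tags/<name> form that GitHub Actions uses, and that the short SHA is the first 7 characters of the commit hash. The function name image_tag is hypothetical.

```python
from typing import Optional

TAG_PREFIX = "refs/tags/"


def image_tag(git_ref: Optional[str], commit_sha: str, sha_length: int = 7) -> str:
    """Pick the image tag: the Git tag without a leading "v", else the short SHA."""
    if git_ref and git_ref.startswith(TAG_PREFIX):
        tag = git_ref[len(TAG_PREFIX):]
        # Strip one leading "v" so that v1.2.0 becomes 1.2.0.
        return tag[1:] if tag.startswith("v") else tag
    # Not a tag build: fall back to the abbreviated commit hash.
    return commit_sha[:sha_length]


if __name__ == "__main__":
    print(image_tag("refs/tags/v1.2.0", "abcdef1234567"))
    print(image_tag("refs/heads/main", "abcdef1234567"))
```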

The public ECR repository that hosts the image is created via Terraform.

Terraform (Public ECR)

Initialize (adjust bucket/table in terraform/backend.hcl):

cd terraform
terraform init -reconfigure -backend-config=backend.hcl
terraform apply

Outputs:

public_image_uri_latest

Resources:

Repository: terraform/main.tf
Provider setup: terraform/providers.tf

Configuration Files

Each profile file (e.g. conf/openbiolink.config) sets:

params.dataset
params.config (S3 path to YAML)
Optional resource overrides (cpus, memory)

Global defaults in nextflow.config:

process.container points to public.ecr.aws/alethiotx/artemis-kgs-embeddings:latest
process.containerOptions enables --gpus all

CUDA Check

The script writes cuda_version.txt after allocating a CUDA tensor to assert GPU availability.
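
A check of this kind might look like the sketch below. It assumes PyTorch is installed in the container (PyKEEN depends on it) and that the file holds the CUDA version string reported by torch.version.cuda; the function name write_cuda_version and the exact file contents are assumptions, not taken from main.nf.

```python
from pathlib import Path


def write_cuda_version(out_path: str = "cuda_version.txt") -> str:
    """Fail fast if no GPU is usable, then record the CUDA version."""
    # Imported inside the function so this module loads even without PyTorch.
    import torch

    # Allocating on the "cuda" device raises if no usable GPU is visible,
    # which stops the task before any training starts.
    torch.zeros(1, device="cuda")

    version = str(torch.version.cuda)
    Path(out_path).write_text(version + "\n")
    return version


if __name__ == "__main__":
    print(write_cuda_version())
```

If the allocation fails, the usual cause is that the container was started without --gpus all.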

Outputs

pipeline_result.save_to_directory(config["save"]["path"]) produces:

Model artifacts
Embeddings
Evaluation metrics

Directory path is controlled by save.path in YAML.

Troubleshooting

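Wrong dataset name: ensure params.dataset is one of hetionet, biokg, openbiolink, primekg and matches the selected profile.
Missing GPU: the container must run with --gpus all (set through process.containerOptions in nextflow.config).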
Config path issues: verify S3 permissions and YAML keys.

License

Released under the MIT License; see LICENSE.

Minimal End-to-End Example

nextflow run main.nf -profile hetionet

Referenced Files

Dockerfile
requirements.txt
main.nf
nextflow.config
conf/hetionet.config
conf/biokg.config
conf/openbiolink.config
conf/primekg.config
.github/workflows/docker-deploy.yml
terraform/main.tf
terraform/providers.tf
terraform/backend.hcl
.gitignore
LICENSE

Acknowledgements

Public knowledge graph providers (Hetionet, BioKG, OpenBioLink, PrimeKG)
PyKEEN, scikit-learn, and Nextflow communities
Portions of this codebase were assisted using GitHub Copilot (Claude Sonnet 4.5) for code generation, refactoring, cleaning and documentation. The authors reviewed, modified, and validated all AI-assisted code. Responsibility for the correctness, performance, and reproducibility of the code rests entirely with the authors. No AI tools were used to generate scientific conclusions or interpretations in this study.
