Author: Shaun Chen
First created: 2020/10/25
Last updated: 2021/06/29
Slide: 2021_summer_intern_boot_camp_HPC.pdf
A basic introduction to high-performance computing (HPC) at Scripps Research. This repository served as class material and was tested on Garibaldi only. The shared large datasets are deprecated.
Shell commands
- man: format and display the on-line manual pages.
- ssh: OpenSSH SSH client (remote login program).
- scp: secure copy (remote file copy program).
- watch: execute a program periodically, showing output fullscreen.
- vim, nano, emacs: programmers' text editors.
- history: list command history.
- exit: exit the current session.
- Ctrl+C: interrupt a running command or script when something looks wrong.
- Ctrl+R: search the command history for the most recent command matching the characters you type.
Git commands
Prerequisite on Garibaldi: module load git
- git init: Create an empty git repository or reinitialize an existing one.
- git clone: Clone a repository into a new directory.
- git status: Show the working tree status.
- git add: Add file contents to the index.
- git commit -m: Record changes to the repository.
- git push: Update remote refs along with associated objects.
- git pull: Fetch from and merge with another repository or a local branch.
- Note: the suggested file size is under 50MB; the hard limit is 100MB.
- Use .gitignore to specify intentionally untracked files that Git should ignore.
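The size guidance above can be checked before committing. The sketch below lists files in a working tree that exceed a given size. The function name `list_large_files` and its arguments are hypothetical (not part of Git), and it assumes a `find` that accepts the `k` and `M` size suffixes, as GNU findutils does.

```shell
#!/bin/bash
# List files larger than a size limit, skipping Git's own metadata.
# list_large_files is a hypothetical helper, not a Git command.
list_large_files() {
    local dir="${1:-.}"      # directory to scan (default: current directory)
    local limit="${2:-50M}"  # find(1) size spec, e.g. 50M or 100M
    find "$dir" -type f ! -path '*/.git/*' -size +"$limit" | sort
}
```

For example, `list_large_files . 50M` prints every file above the suggested threshold; anything it reports is a candidate for .gitignore.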
Simple Linux Utility for Resource Management (SLURM) Workload Manager
- showuserjobs: display the current usage of the entire cluster.
- sinfo: view information about Slurm nodes and partitions.
- squeue -u [username]: list the jobs or job steps of a comma-separated list of users.
- srun --pty bash -i: request a job in interactive terminal mode.
- sbatch: submit a Slurm batch job.
  - --export=<environment variables [ALL] | NONE>: Identify which environment variables from the submission environment are propagated to the launched application.
  - --mem=<size[units]>: Specify the real memory required per node.
  - -t, --time=<time>: Set a limit on the total run time of the job allocation.
  - -J, --job-name=<jobname>: Specify a name for the job allocation.
  - -p, --partition=<partition_names>: Request a specific partition for the resource allocation.
- scancel [job_id]: signal or cancel jobs, job arrays or job steps.
- module: command interface to the Modules package.
  - av: display all available modules.
  - load: Load modulefile(s) into the shell environment.
  - unload: Remove modulefile(s) from the shell environment.
  - purge: Unload all loaded modulefiles.
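The sbatch options listed above are usually combined on one command line. The following is a minimal dry-run sketch that only assembles and prints such a command instead of submitting anything. The helper name `build_sbatch_cmd`, its argument order, and the example values are assumptions for illustration. It prefixes the `--export` list with `ALL`, which, according to the sbatch manual page, keeps the rest of the submission environment alongside the named variables.

```shell
#!/bin/bash
# Dry-run helper: assemble an sbatch command line and print it.
# build_sbatch_cmd is a hypothetical name; nothing is submitted.
build_sbatch_cmd() {
    local job_name="$1" mem="$2" time_limit="$3" script="$4"
    shift 4
    # Remaining arguments are NAME=VALUE pairs appended to --export.
    local exports="ALL"
    local pair
    for pair in "$@"; do
        exports="${exports},${pair}"
    done
    printf 'sbatch --job-name=%s --mem=%s --time=%s --export=%s %s\n' \
        "$job_name" "$mem" "$time_limit" "$exports" "$script"
}

# Example values are placeholders, not recommended settings.
build_sbatch_cmd hello_hpc 16G 24:00:00 run_slurm.qsub month=June day=29
```

Printing the command first makes it easy to review the options before pasting it into a login shell.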
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --mem=16G
## SBATCH --nodes=1 ### Node count required for the job
## SBATCH --ntasks=1 ### Number of tasks to be launched per Node
## SBATCH --cpus-per-task=16
# sbatch --export=month=,day= run_slurm.qsub
# By default, Slurm uses the directory where sbatch was invoked as the working directory.
module purge
module load R
# always print the path of your working directory
pwd
echo "Hello HPC!! Today is ${month} ${day}"
Modules and shell tools
- R (module load R) with ggplot2
- python >= 3.6.3 (module load python/3.6.3)
- PLINK v2 (required_tools/plink2)
- bcftools >= 1.9 (module load samtools)
- vcftools >= 0.1.14 (module load vcftools)
- ADMIXTURE (module load admixture)
  Alexander, David H., and Kenneth Lange. "Enhancements to the ADMIXTURE algorithm for individual ancestry estimation." BMC bioinformatics 12.1 (2011): 246.
- RFMix2 (required_tools/rfmix2)
  Maples, Brian K., et al. "RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference." The American Journal of Human Genetics 93.2 (2013): 278-288.
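Before submitting a job, it can help to confirm that the tools listed above are actually on PATH after the modules are loaded. The sketch below uses the shell built-in `command -v` for that. The helper name `missing_tools` is hypothetical, and the executable names in the example call are assumptions about what the modules provide.

```shell
#!/bin/bash
# Report which of the given commands are not on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Assumed executable names; adjust to whatever the loaded modules provide.
missing_tools Rscript python3 bcftools vcftools admixture
```

An empty result means every listed command was found; any printed name points to a module that still needs to be loaded.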
Datasets
- Phased 1000 Genomes Phase III as VCF.gz files (post-QCed: /gpfs/sfchen/work/1000G) ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
- 1000 Genomes sample data (required_tools/1000G.tsv) http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.xlsx
- Reference genome build - GRCh37 (uncompressed: /gpfs/sfchen/work/GRCh37) http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
Test case
- 1000 Genomes Project http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
Run:
sbatch --job-name=PCA_TGP 0_TGP_PCA.slurm
Expected output: figure_tgp_pca.png
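Once the job has finished, a quick check that the expected figure exists and is non-empty can save a round trip. This is a minimal sketch; the helper name `outputs_ready` is hypothetical, and it assumes the figure is written to the current directory.

```shell
#!/bin/bash
# Succeed only if every listed file exists and is non-empty.
# outputs_ready is a hypothetical helper name.
outputs_ready() {
    local f rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

if outputs_ready figure_tgp_pca.png; then
    echo "PCA figure found"
else
    echo "PCA figure not ready yet"
fi
```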
0. (Optional) Convert personal 23andMe genetic data to VCF
An interactive session (srun --pty bash -i) is sufficient for this step.
bcftools convert --tsv2vcf ${input_txt_path} -f ${ref_fasta_path} -s ${subject_ID} -Oz -o ${output_filename}.vcf.gz
tabix -p vcf ${output_filename}.vcf.gz
GRCh37 on Garibaldi: /gpfs/work/sfchen/human_g1k_v37.fasta
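A missing input file is the most common reason this conversion fails, so it may be worth checking paths first. The sketch below verifies that the 23andMe text file and the reference FASTA are readable, then prints the two commands rather than running them. The helper name `print_convert_cmds` and its argument order are hypothetical, and the printed commands are unquoted, so paths containing spaces are assumed not to occur.

```shell
#!/bin/bash
# Check inputs, then print the conversion commands (dry run).
# print_convert_cmds is a hypothetical helper; it runs neither tool.
print_convert_cmds() {
    local input_txt="$1" ref_fasta="$2" subject_id="$3" out_prefix="$4"
    local f
    for f in "$input_txt" "$ref_fasta"; do
        if [ ! -r "$f" ]; then
            echo "cannot read: $f" >&2
            return 1
        fi
    done
    echo "bcftools convert --tsv2vcf $input_txt -f $ref_fasta -s $subject_id -Oz -o $out_prefix.vcf.gz"
    echo "tabix -p vcf $out_prefix.vcf.gz"
}
```

The printed lines can be pasted into the interactive session once they look right.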
1. Local ancestry inference with RFMix2
sample input: /gpfs/work/sfchen/HGDP
sbatch --export=input=[vcf_gz_path] 1_ancestry_inference.slurm
sample metadata:
Sample_ID sub_pop sup_pop
HGDP00001 Brahui CENTRAL_SOUTH_ASIA
HGDP01041 Pima AMERICA
HGDP00977 Han EAST_ASIA
HGDP01062 Sardinian EUROPE
HGDP00580 Druze MIDDLE_EAST
HGDP01419 BantuKenya AFRICA
HGDP00664 Bougainville OCEANIA
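The metadata table above is easy to query with awk. The sketch below assumes a file with exactly the three whitespace-separated columns shown (Sample_ID, sub_pop, sup_pop) and a header row, with no spaces inside population names. The helper names `samples_in_sup_pop` and `count_by_sup_pop` are hypothetical.

```shell
#!/bin/bash
# Print the sample IDs that belong to one super-population.
# Assumes columns: Sample_ID sub_pop sup_pop, plus a header row.
samples_in_sup_pop() {
    local metadata="$1" sup_pop="$2"
    awk -v pop="$sup_pop" 'NR > 1 && $3 == pop { print $1 }' "$metadata"
}

# Count samples per super-population, sorted by population name.
count_by_sup_pop() {
    awk 'NR > 1 { n[$3]++ } END { for (p in n) print p, n[p] }' "$1" | sort
}
```

Calling `samples_in_sup_pop metadata.tsv EUROPE` on the rows shown would list only HGDP01062, where `metadata.tsv` is a placeholder file name.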
- Scripps Research HPC (in Intranet) https://intranet.scripps.edu/its/highperformancecomputing/index.html
- HPC Challenges — A Perspective for General Data Analysis and Visualization http://web.eecs.utk.edu/~huangj/hpc/hpc_intro.php
- Tutorial: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2. Biostars https://www.biostars.org/p/335605/
- Converting from 23andMe to VCF https://samtools.github.io/bcftools/howtos/convert.html
- Bergström, Anders, et al. "Insights into human genetic variation and population history from 929 diverse genomes." Science 367.6484 (2020).
- Skoglund, Pontus, and Iain Mathieson. "Ancient genomics of modern humans: the first decade." Annual review of genomics and human genetics 19 (2018): 381-404.
- Martin, Alicia R., et al. "Clinical use of current polygenic risk scores may exacerbate health disparities." Nature genetics 51.4 (2019): 584-591.
- Tian, Rui, Malay K. Basu, and Emidio Capriotti. "Computational methods and resources for the interpretation of genomic variants in cancer." BMC genomics 16.S8 (2015): S7.
- Martin, Alicia R., "Ancestry pipeline". GitHub, https://github.com/armartin/ancestry_pipeline
- Bernie Pope, "pbs2slurm". GitHub, https://github.com/bjpop/pbs2slurm

