This repository is a companion to the paper "Private Information Leakage from Polygenic Risk Scores". We provide the code and the data for running a genotype-recovery demo, for plotting the figures from the paper, and for end-to-end reproducibility of the experiments.
The experiments require Go version 1.23.1 or later. You can install the latest version for your operating system manually and then run go get -u ./... to download the dependencies. Alternatively, run the command below to install Go automatically. The installation script currently supports only macOS and Linux.
make install-go
Plotting the figures requires Python and the packages in requirements.txt. To automatically configure your setup, run
make install-python
We have extracted the genotypes, allele frequencies, and genotype frequencies from the 1000 Genomes Project samples for all the variants used in our experiments, and have uploaded them, along with the relevant PGS metadata, to a separate repository. All the 1000 Genomes experiments can be run with this pre-processed data, without access to the full dataset. To download it, run the command below.
make download-data
The archive is ~19 MiB. The script unpacks the data and places the following files into the inputs folder:
- PGS00XXXX_hmPOS_GRCh37.txt are the PRS metadata files from the PGS Catalog.
- PGS00XXXX.json stores the genotypes for each corresponding variant and the PRS for each individual in the dataset. This is loaded in the code as cohort.
- PGS00XXXX.efal stores whether the effect allele is the reference allele (0) or the alternative allele (1).
- PGS00XXXX.stat stores the allele and genotype frequencies of each dataset ancestry for each SNP in the PRS.
- PGS00XXXX.scores stores just the corresponding PRS for each individual in the dataset.
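To illustrate how the effect-allele flags relate genotypes to scores, here is a minimal Python sketch. It assumes that genotypes are encoded as counts of the alternative allele (0, 1 or 2) and that a PRS is the weighted sum of effect-allele dosages. The function names and toy values are hypothetical; they do not reflect the actual file layouts or the repository's code.

```python
def effect_allele_dosage(alt_count, efal):
    """Return the number of effect-allele copies in a diploid genotype.

    alt_count: copies of the alternative allele, 0, 1 or 2 (assumed encoding).
    efal: 0 if the effect allele is the reference allele, 1 if it is the
          alternative allele (the convention described for the .efal files).
    """
    if alt_count not in (0, 1, 2):
        raise ValueError("alt_count must be 0, 1 or 2")
    if efal not in (0, 1):
        raise ValueError("efal must be 0 or 1")
    return alt_count if efal == 1 else 2 - alt_count


def polygenic_score(alt_counts, efals, weights):
    """Sum effect weights multiplied by effect-allele dosages over all SNPs."""
    if not (len(alt_counts) == len(efals) == len(weights)):
        raise ValueError("all inputs must have one entry per SNP")
    return sum(
        w * effect_allele_dosage(g, e)
        for g, e, w in zip(alt_counts, efals, weights)
    )


# Toy example with three SNPs (all values are made up for illustration).
alt_counts = [0, 1, 2]
efals = [1, 0, 1]
weights = [0.5, -0.25, 0.25]
score = polygenic_score(alt_counts, efals, weights)
print(score)  # 0.25
```

In the toy example the second SNP has the reference allele as its effect allele, so a heterozygous genotype still contributes one effect-allele copy.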
We do not upload any UK Biobank data, including any statistics, because access to it requires a license. To reproduce the experiments that involve UK Biobank data, you need to provide a link to your own copy of the dataset.
The demo shows the process of genotype recovery from a set of PRSs for a random individual from the 1000 Genomes dataset. It uses the same PGS selection as the "Patient genotypes can be recovered from publicly available PRSs" section of the paper, but considers only PGSs with at most 30 SNPs, for efficiency. The demo takes approximately a minute to run. It prints the recovered and true genotypes for each PRS and reports the overall recovery accuracy at the end. Run it with:
make demo
We have uploaded the measurements from the paper to the same secondary repository. To download and uncompress the results, run the command below. The archive is ~200 MiB.
make download-results
The results folder will contain the following:
- validated_pgs.json contains the full selection of the PGSs from our experiments with the corresponding size (the number of SNPs) for each.
- validated_loci.json contains all the covered loci with a list of PGSs where each locus appears.
- recoveryOutput/ contains genotype predictions and recoveryAccuracy/ contains the accuracies of these predictions for the 2535 individuals of the 1000 Genomes dataset. See the "Patient genotypes can be recovered from publicly available PRSs" section and Figure 3 in the paper.
- linking/ contains the PLINK output from running the KING-robust and the GCTA algorithms for matching the predicted genotypes against the true values. See the "Patients and their relatives can be retrieved from genealogy databases using PRSs" section, Figure 4 and Supplementary Figure 2 in the paper.
- uniqueness/ contains the analytical estimates and dataset measurements for the uniqueness and anonymity guarantees of PRS values. Only the results for the 1000 Genomes dataset are included due to the restrictions of sharing the UK Biobank data. See the "Anonymized genotype databases can be de-anonymized by using a single PRS from known patients without the need for genotype prediction" section and Figures 5B-C in the paper.
To reproduce the figures from the paper, run the relevant command:
- make plot-recovery plots the genotype recovery accuracy results.
- make plot-linking plots the linking results.
- make plot-uniqueness plots the score uniqueness and anonymity results. It partially requires access to the UK Biobank data.
- make plot-rounding plots the PRS distribution when rounding the effect weights to different decimal-places precision. It requires access to the UK Biobank data.
- make plot-all plots all the above.
The figures will be saved in analysis/figures.
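As a toy illustration of why the precision of effect weights matters for the PRS distribution, the Python sketch below counts how many distinct score values remain after rounding the weights. The weights are hypothetical and dosages are enumerated exhaustively; this is not the measurement performed in the paper, only a small example of the effect.

```python
from itertools import product


def distinct_scores(weights, decimals):
    """Count distinct toy PRS values after rounding weights to `decimals` places."""
    rounded = [round(w, decimals) for w in weights]
    scores = set()
    # Enumerate every combination of effect-allele dosages (0, 1 or 2 per SNP).
    for dosages in product((0, 1, 2), repeat=len(rounded)):
        score = sum(w * d for w, d in zip(rounded, dosages))
        # Round the score itself only to absorb floating-point noise.
        scores.add(round(score, 9))
    return len(scores)


weights = [0.123, 0.127]  # hypothetical effect weights
for decimals in (3, 1):
    print(decimals, distinct_scores(weights, decimals))
```

With these two toy weights, the nine possible genotype combinations yield nine distinct scores at three decimal places but only five at one decimal place, because both weights round to 0.1.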
To reproduce the experiments end-to-end:
- Add the paths to the downloaded 1000 Genomes and UK Biobank data to datasets.yaml.
- Run make pgs-selection to retrieve suitable PGSs from the PGS Catalog and form the selection for validated_pgs.json and validated_loci.json.
- Perform genotype recovery for all the samples in the 1000 Genomes dataset by running go run eval/run.go -e=solve CHUNK_NUM CHUNK_SIZE, where CHUNK_NUM is the sequential ID of the chunk and CHUNK_SIZE is the number of samples in the chunk. The program sorts all the samples alphabetically and starts solving at position CHUNK_NUM * CHUNK_SIZE. Resolving 298 PRSs for up to 50 SNPs requires 2h40m of computing time and 127 GB of RAM on average per sample. The program relies on bcftools to retrieve the genotype data.
- Run go run eval/run.go -e=genfreq to save genotype frequencies for all the predicted SNPs.
- Calculate the genotype-recovery accuracy by running go run eval/run.go -e=accuracy.
- Calculate the relatedness results by running go run eval/run.go -e=linking. You will need PLINK installed.
- Calculate the uniqueness and anonymity results by running go run eval/run.go -e=uniqueness_gg for the 1000 Genomes data and go run eval/run.go -e=uniqueness_uk for the UK Biobank data. The analytical estimation is computationally intensive.
- Finally, run go run eval/run.go -e=rounding to measure the impact of rounding effect weights on the PRS distribution.
- Proceed with producing figures as detailed in Re-plotting the figures.
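When planning the solve step, it can help to work out which samples each chunk covers. The Python sketch below applies the CHUNK_NUM * CHUNK_SIZE rule to the 2535 individuals mentioned earlier and prints one command per chunk. It assumes zero-based chunk numbering and that the last chunk is simply cut off at the end of the sample list; the README does not state either, and the helper names are hypothetical.

```python
def num_chunks(chunk_size, total_samples):
    """Number of chunks needed to cover all samples (integer ceiling division)."""
    if chunk_size <= 0 or total_samples < 0:
        raise ValueError("chunk_size must be positive, total_samples non-negative")
    return (total_samples + chunk_size - 1) // chunk_size


def chunk_bounds(chunk_num, chunk_size, total_samples):
    """Return the half-open [start, end) range of sorted-sample indices for a chunk."""
    start = chunk_num * chunk_size  # rule stated for the solve step
    if chunk_num < 0 or start >= total_samples:
        raise ValueError("chunk_num is out of range")
    # Assumption: the final chunk is truncated at the end of the sample list.
    end = min(start + chunk_size, total_samples)
    return start, end


TOTAL_SAMPLES = 2535  # individuals in the 1000 Genomes dataset, per this README
CHUNK_SIZE = 100      # arbitrary example value

for chunk_num in range(num_chunks(CHUNK_SIZE, TOTAL_SAMPLES)):
    start, end = chunk_bounds(chunk_num, CHUNK_SIZE, TOTAL_SAMPLES)
    print(f"go run eval/run.go -e=solve {chunk_num} {CHUNK_SIZE}  # samples {start}..{end - 1}")
```

Given the reported per-sample time and memory cost, smaller chunks spread across several machines may be more practical than one large run; the chunk size of 100 is only an example.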