seqchromloader is composed of two types of functions: writers and loaders. You can use a writer to dump a dataset into webdataset-format files for future use, or call a loader directly to get tensors immediately.
Generally, seqchromloader produces four kinds of tensors: [seq, chrom, target, label]
- ``seq`` is the one-hot encoded DNA sequence tensor of shape [batch_size, 4, len], using the DNA mapping order "ACGT" (that is, A = [1,0,0,0], C = [0,1,0,0], ...)
- ``chrom`` is the chromatin track tensor of shape [batch_size, # tracks, len]; the chromatin track bigwig files are usually provided by the ``bigwig_filelist`` parameter
- ``target`` is the tensor representing the number of sequencing reads in the region, taken from the BAM file given by the ``target_bam`` parameter
- ``label`` is the integer label of each sample. Given a BED file input, it is read from the fourth column; given a pandas DataFrame, it is read from a column named ``label``
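
To make the ``seq`` layout concrete, here is a minimal pure-Python sketch of one-hot encoding with the "ACGT" row order described above. The helper name ``one_hot_dna`` is hypothetical and is not part of seqchromloader (the library ships its own ``dna2OneHot``). Mapping bases outside ACGT, such as N, to an all-zero column is an assumption of this sketch, not documented library behavior.

```python
# Row order follows the "ACGT" mapping described above.
BASES = "ACGT"


def one_hot_dna(sequence):
    """Return a 4 x len nested list; row i marks positions holding BASES[i].

    Hypothetical helper for illustration only. Bases outside ACGT (e.g. N)
    become all-zero columns, which is an assumption of this sketch.
    """
    sequence = sequence.upper()
    return [[1 if base == row_base else 0 for base in sequence]
            for row_base in BASES]


encoded = one_hot_dna("ACGTN")
for row_base, row in zip(BASES, encoded):
    print(row_base, row)
```

Each column of the result corresponds to one position in the sequence, so a batch of such matrices has the [batch_size, 4, len] shape listed above.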
Currently only the webdataset format is supported. You can write tensors into webdataset files like this:

.. code-block:: python

    import pandas as pd
    from seqchromloader import dump_data_webdataset

    # Genomic regions to extract; every region here is 200 bp long
    coords = pd.DataFrame({
        "chrom": ["chr1", "chr10"],
        "start": [1000, 5000],
        "end": [1200, 5200],
        "label": [0, 1]
    })
    wds_file_lists = dump_data_webdataset(coords,
        genome_fasta="mm10.fa",
        bigwig_filelist=["h3k4me3.bw", "atacseq.bw"],
        outdir="dataset/",
        outprefix="test",
        compress=True,
        numProcessors=4,
        transforms={"chrom": lambda x: x+1})  # add 1 to every chrom tensor

.. note::

    Each region should be of the same length! As in this example, every region is 200bp long.

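
Because mixed region lengths are easy to introduce by accident, it can help to check the coordinates before writing. The following is a minimal sketch of such a check; ``region_lengths`` is a hypothetical helper and not a seqchromloader function. It takes plain sequences of start and end positions, so with a DataFrame like ``coords`` you could presumably pass ``coords["start"]`` and ``coords["end"]``.

```python
def region_lengths(starts, ends):
    """Return the set of distinct region lengths (end - start)."""
    return {end - start for start, end in zip(starts, ends)}


# Same coordinates as the example above: both regions are 200 bp long.
starts = [1000, 5000]
ends = [1200, 5200]

lengths = region_lengths(starts, ends)
if len(lengths) != 1:
    raise ValueError(f"regions have mixed lengths: {sorted(lengths)}")
print(f"all regions are {next(iter(lengths))} bp long")
```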
The returned ``wds_file_lists`` contains the output file paths; every file holds ~7000 samples.
One thing worth noting is the ``transforms`` parameter. It accepts a dictionary of functions, and each function is called on the output that its key refers to. In this example, the add-one lambda is called on each chrom tensor. You can apply more complicated transformations the same way, e.g., standardizing the tensor.
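
The key-based dispatch described above can be sketched in plain Python. This is only an illustration of the idea, not seqchromloader's actual implementation: ``apply_transforms`` and ``standardize`` are hypothetical names, and the sample is represented as a dictionary of plain lists rather than tensors.

```python
def apply_transforms(sample, transforms=None):
    """Apply transforms[key] to sample[key]; leave other keys untouched."""
    if not transforms:
        return dict(sample)
    return {key: transforms[key](value) if key in transforms else value
            for key, value in sample.items()}


def standardize(values, eps=1e-8):
    """Scale a flat list of numbers to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var ** 0.5 + eps) for v in values]


sample = {"chrom": [1.0, 2.0, 3.0], "label": 1}
out = apply_transforms(sample, {"chrom": standardize})
print(out["chrom"], out["label"])
```

With array or tensor inputs, the same standardization is usually written as ``(x - x.mean()) / (x.std() + eps)`` and passed under the ``"chrom"`` key.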
.. autofunction:: seqchromloader.dump_data_webdataset
You can easily load the webdataset files generated by seqchromloader.dump_data_webdataset above by:

.. code-block:: python

    from seqchromloader import SeqChromDatasetByWds

    dataloader = SeqChromDatasetByWds(wds_file_lists, transforms=None, rank=0, world_size=1)
    seq, chrom, target, label = next(iter(dataloader))

If you are using multiple GPUs, you can use ``rank`` and ``world_size`` to shard the dataset so that each GPU gets a non-overlapping piece of it.
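
One common way to obtain non-overlapping pieces is to give each process every ``world_size``-th file, starting at its ``rank``. The sketch below shows that scheme under the assumption that sharding happens at the file level; it is not taken from seqchromloader's source, and ``shard_for_rank`` and the file names are hypothetical.

```python
def shard_for_rank(files, rank, world_size):
    """Return every world_size-th file, starting at index rank."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return files[rank::world_size]


files = [f"test_{i}.tar.gz" for i in range(5)]  # hypothetical shard names
for rank in range(2):
    print(rank, shard_for_rank(files, rank, world_size=2))
```

With this scheme the shards are disjoint and together cover every file, although their sizes can differ by one file when the file count is not a multiple of ``world_size``.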
A more straightforward way is to use ``seqchromloader.SeqChromDatasetByBed``, which outputs tensors given a BED file and the other required files.

.. code-block:: python

    from seqchromloader import SeqChromDatasetByBed

    dataloader = SeqChromDatasetByBed(bed="regions.bed",
        genome_fasta="mm10.fa",
        bigwig_filelist=["h3k4me3.bw", "atacseq.bw"],
        target_bam="foxa1.bam",
        transforms={"label": lambda x: x-1},  # shift every label down by 1
        dataloader_kws={"num_workers": 4})    # forwarded to torch.utils.data.DataLoader
    seq, chrom, target, label = next(iter(dataloader))

Here ``dataloader_kws`` is a dictionary of keyword arguments that is passed on to ``torch.utils.data.DataLoader``, in this case to increase the number of workers (default is 1). Refer to the PyTorch DataLoader documentation to explore more controls on DataLoader behavior.
.. autofunction:: seqchromloader.SeqChromDatasetByBed
.. autofunction:: seqchromloader.SeqChromDatasetByWds
Utility functions for manipulating genomic coordinates to generate training datasets:
.. autofunction:: seqchromloader.filter_chromosomes
.. autofunction:: seqchromloader.make_random_shift
.. autofunction:: seqchromloader.make_flank
.. autofunction:: seqchromloader.chop_genome
.. autofunction:: seqchromloader.dna2OneHot
.. autofunction:: seqchromloader.rev_comp