Usage

seqchromloader is composed of two types of functions: writers and loaders. Use a writer to dump a dataset into webdataset format files for later use, or call a loader directly to get tensors immediately.

Generally, seqchromloader produces four kinds of tensors: [seq, chrom, target, label]

  • seq is the one-hot encoded DNA sequence tensor of shape [batch_size, 4, len], using the DNA mapping order "ACGT" (that is, A = [1,0,0,0], C = [0,1,0,0], ...)
  • chrom is the chromatin track tensor of shape [batch_size, # tracks, len]; the chromatin track bigwig files are usually provided through the bigwig_filelist parameter
  • target is the tensor representing the number of sequencing reads in each region, taken from the bam file given by the target_bam parameter
  • label is the integer label of each sample. Given a bed file as input, it is read from the fourth column; given a pandas DataFrame, it is read from a column named label
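To make the "ACGT" mapping order concrete, here is a minimal pure-Python sketch of channel-first one-hot encoding for a single sequence. The helper name one_hot_encode is hypothetical and is not part of seqchromloader (the library ships its own dna2OneHot utility); mapping any base outside "ACGT" to an all-zero column is an assumption made for illustration only.

```python
# Hypothetical illustration of the "ACGT" channel order; not the library's code.
DNA_ORDER = "ACGT"

def one_hot_encode(seq):
    """Return a 4 x len(seq) nested list; row i marks positions holding DNA_ORDER[i].

    Assumption: bases outside "ACGT" (e.g. "N") become all-zero columns.
    """
    seq = seq.upper()
    return [[1 if base == letter else 0 for base in seq] for letter in DNA_ORDER]

encoded = one_hot_encode("ACGT")
# Reading column 0 top to bottom gives the encoding of "A".
first_column = [row[0] for row in encoded]
```

Stacking such 4 x len matrices along a leading batch axis gives the [batch_size, 4, len] layout described above.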

Writer

Currently, only the webdataset format is supported. You can write tensors into webdataset files like this:

import pandas as pd
from seqchromloader import dump_data_webdataset

coords = pd.DataFrame({
            "chrom": ["chr1", "chr10"],
            "start": [1000, 5000],
            "end": [1200, 5200],
            "label": [0, 1]
        })
# Write the regions in coords to webdataset files under outdir
wds_file_lists = dump_data_webdataset(coords,
                                 genome_fasta="mm10.fa",
                                 bigwig_filelist=["h3k4me3.bw", "atacseq.bw"],
                                 outdir="dataset/",
                                 outprefix="test",
                                 compress=True,
                                 numProcessors=4,
                                 # add 1 to every chrom tensor before writing
                                 transforms={"chrom": lambda x: x+1})

Note

All regions must have the same length! In this example, every region is 200 bp long.
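A quick sanity check before writing can catch regions of unequal length early. The sketch below is a hypothetical helper, not a seqchromloader function; it takes the start and end values as plain lists, which could for example be obtained from the start and end columns of the coords DataFrame.

```python
# Hypothetical pre-flight check; not part of seqchromloader.
def check_equal_lengths(starts, ends):
    """Return the shared region length, or raise if regions differ in length."""
    lengths = {end - start for start, end in zip(starts, ends)}
    if len(lengths) != 1:
        raise ValueError(f"expected one region length, found: {sorted(lengths)}")
    return lengths.pop()

# Same coordinates as in the writer example above.
region_len = check_equal_lengths([1000, 5000], [1200, 5200])
```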

The returned wds_file_lists contains the output file paths; each file holds ~7000 samples.

Note the transforms parameter here. It accepts a dictionary of functions, and each function is called on the output that its key refers to. In this example, the lambda function adds 1 to each chrom tensor. More complicated transformations, such as standardizing the tensor, can be applied in the same way.
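The key-to-function behaviour of transforms can be pictured with a small pure-Python sketch. This only illustrates the semantics described above under simplifying assumptions (a sample is a plain dict and tensors are stood in for by lists); apply_transforms is a hypothetical name and not how seqchromloader is implemented internally.

```python
# Hypothetical illustration of per-key transforms; not the library's internals.
def apply_transforms(sample, transforms=None):
    """Apply transforms[key] to sample[key] where a function is given; copy the rest."""
    if not transforms:
        return dict(sample)
    return {
        key: transforms[key](value) if key in transforms else value
        for key, value in sample.items()
    }

# Lists stand in for tensors here.
sample = {"chrom": [0.0, 2.5], "label": 1}
out = apply_transforms(sample, {"chrom": lambda x: [v + 1 for v in x]})
```

Only the chrom entry is changed; outputs without a matching key pass through untouched.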

.. autofunction:: seqchromloader.dump_data_webdataset

Loader

You can load the webdataset files generated by seqchromloader.dump_data_webdataset above as follows:

from seqchromloader import SeqChromDatasetByWds

dataloader = SeqChromDatasetByWds(wds_file_lists, transforms=None, rank=0, world_size=1)
seq, chrom, target, label = next(iter(dataloader))

If you are using multiple GPUs, you can use rank and world_size to shard the dataset so that each GPU gets a non-overlapping piece of it.

A more straightforward way is to use seqchromloader.SeqChromDatasetByBed, which outputs tensors given a bed file and the other required files.
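One common way to obtain non-overlapping pieces is to give each rank every world_size-th file. The sketch below shows that idea only; it is an assumption for illustration and may differ from how SeqChromDatasetByWds actually splits its shards. The function name shard_files and the file names are hypothetical.

```python
# Hypothetical round-robin sharding; the library's actual strategy may differ.
def shard_files(files, rank, world_size):
    """Return the files assigned to this rank: every world_size-th file, offset by rank."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return files[rank::world_size]

files = ["test_0.tar.gz", "test_1.tar.gz", "test_2.tar.gz", "test_3.tar.gz"]
shards = [shard_files(files, rank, world_size=2) for rank in range(2)]
```

Each file lands in exactly one shard, so no two GPUs see the same samples.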

from seqchromloader import SeqChromDatasetByBed

dataloader = SeqChromDatasetByBed(bed="regions.bed",
                                  genome_fasta="mm10.fa",
                                  bigwig_filelist=["h3k4me3.bw", "atacseq.bw"],
                                  target_bam="foxa1.bam",
                                  # subtract 1 from every label
                                  transforms={"label": lambda x: x-1},
                                  # forwarded to torch.utils.data.DataLoader
                                  dataloader_kws={"num_workers": 4})
seq, chrom, target, label = next(iter(dataloader))

Here, dataloader_kws is a dictionary of keyword arguments that is passed on to torch.utils.data.DataLoader, in this case to increase the number of workers (the default is 1). Refer to the PyTorch DataLoader documentation for more ways to control DataLoader behavior.

.. autofunction:: seqchromloader.SeqChromDatasetByBed

.. autofunction:: seqchromloader.SeqChromDatasetByWds

Utilities

Utility functions for manipulating genomic coordinates to generate training datasets:

.. autofunction:: seqchromloader.filter_chromosomes
.. autofunction:: seqchromloader.make_random_shift
.. autofunction:: seqchromloader.make_flank
.. autofunction:: seqchromloader.chop_genome
.. autofunction:: seqchromloader.dna2OneHot
.. autofunction:: seqchromloader.rev_comp