Create genomic labels to help identify candidate cis-regulatory regions.
genomeLabel is a command line tool for the fully automated generation of genomic label data that can be summarized into statistically relevant information and visualized in a genome browser, tab-separated tables, and plots. The labels annotate the genome both structurally and functionally, providing insight into the potentially regulatory role of certain regions via colocalization analysis methods.
The genomeLabel tool is implemented in Perl and Python and automatically executes commands provided by bedtools and bedops to annotate the Genome Reference Consortium Human Build 38 (hg38) assembly. Integration with GIGGLE further expands the functionality of genomeLabel to identifying and ranking the significance of genomic loci shared between query features and thousands of genome interval files present in the giggle_files directory.
For display by the UCSC Genome Browser, a script 'makeHubs.pl' can be used. Hoewever, assembly hubs need to be hosted on a publicly accessible web server provided by the user.
Annotating the genome can be broken down into three main steps: (1) identifying portions of the genome that do not code for proteins, (2) identifying elements on the genome, a process called gene prediction, and (3) attaching biological information to these elements. This yields two types of annotation:
1) Structural Annotation - Identifying elements in the genome (i.e. exons, introns, UTRs, CDSs, etc.)
2) Functional Annotation - Attaching biological information to the elements in the genome (e.g. biochemical product, regulatory role, expression, etc.)
Thus, the output of genomeLabel tool combines structural (exon, intron, intergenic, etc.) and putative functional (promoter, enhancer, TF binding site, repetitive sequence, etc.) annotation.
This program takes in annotation data from genomic databases as input to generate custom label tracks.
| INPUT | OUTPUT |
|---|---|
| NCBI RefSeq genes, curated subset | exon, intron, coding, noncoding, intergenic |
| UCSC hg38.chrom.sizes | intergenic |
| FANTOM5 TSS (CAGE) peaks | promoter |
| RepeatMasker, soft-masked | repetitive element |
| Segway genomic "states" | functional element |
| UCSC hg19ToHg38.over.chain.gz | functional element |
| ReMap ChIP-seq peaks | cis-regulatory module |
genomeLabel - A tool to annotate the genomic features within a given region and work with the contents thereof to produce genome browser tracks and compute summary statistics.
SYNOPSIS
USAGE: ./run.pl [chr:start-stop] --biotype [BIOTYPE] --path [/PATH/] <filter(s)>
SAMPLE CALLS:
./run.pl chrX:15200000-15800000 --biotype gm12878 --path /home/bkirsh/
./run.pl chrX:15200000-15800000 --biotype GM12878 --path /home/bkirsh/ --repeat LINE --regulator GATA2
./run.pl chrX:15200000-15800000 --biotype huvec --path /home/bkirsh/ --repeat LINE,L2,SINE --regulator GATA2
The program accepts any of the following six human cell, tissue, or DNA sample used by the ENCODE Consortium:
GM12878, H1-hESC, K562, HeLa-S3, HepG2, and HUVEC.
A description of these samples is available at http://genome.ucsc.edu/cgi-bin/hgEncodeVocab?term=GM12878,K562,H1-hESC,HeLa-S3,HepG2,HUVEC.
This option is case- and delimiter-insensitive, and inputs such as H1HESC, helas3, and HEp-g2 are all valid.
The minimum requirements to run the script are the region, biotype, and path to liftOver.
Additional filter arguments can be passed alongside different commands for added specificity, including "--repeat," "--regulator," and "--felement."
They must be comma-separated and either enclosed within double-quotes, where whitespace is allowed, or listed subsequently without whitespace in between commas.
ACCEPTABLE: --regulator "GATA1,CTCF, SOX2"
--regulator GATA1,CTCF,SOX2
UNACCEPTABLE: --regulator GATA1, CTCF, SOX2
If <filter(s)> is omitted, the default behaviour of the program is to generate a set of "raw files" and the "giggle_files" directory for a given
region and biotype. The "raw files" include genome.bed (coding/noncoding exons, introns, intergenic regions), promoter.bed, repeat.bed, cr-module.bed, f-element.bed, and cat.bed, while the "giggle_files" is a directory tree of their breakdown into different categories and subcategories. These can be useful for customized manipulation
to extract additional statistics or create more complex tracks via concatenation or application of external tools according to user needs.
OPTIONS
<chr:start-stop>
Input region in the format chr:start-end.
<--biotype>
One of the six human cell types used by the ENCODE Consortium: GM12878, H1-hESC, K562, HeLa-S3, HepG2, HUVEC. A description of
these cell types is available at http://genome.ucsc.edu/cgi-bin/hgEncodeVocab?term=GM12878,K562,H1-hESC,HeLa-S3,HepG2,HUVEC.
<--path>
Directory path to liftOver binary file in the format /path/to/utility/ (i.e. enclosed by "/" and excluding the name of the utility itself).
<--regulator>
A list of one or more of the 960 DNA-binding proteins, including transcription factors (TFs), transcription co-activators (TCFs) and
chromatin-remodeling factors (CRFs) used by the ReMap Atlas. Access to ReMap documentation is available at http://remap.univ-amu.fr.
<--repeat>
A list of one or more of the families, classes, or names of repeats used by RepeatMasker. Access to RepeatMasker documentation is available
at http://www.repeatmasker.org/webrepeatmaskerhelp.html.
<--felement>
A list of one or more of the genomic states used by Segway. Access to Segway documentation is available under "Segway Segmentations" at
http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=913156841_uydtGuw88KR9Xqpgn3fXaMtXmsVQ&c=chr15&g=hub_4607_genomeSegmentation.
<--help>
Prints this message and exits successfully.SYNOPSIS
USAGE: ./stats_giggle.py -i [input] <option(s)>
| Label | Definition |
|---|---|
| Intragenic stretch of DNA sequence, including non-coding untranslated regions, that can synthesize a functional RNA molecule, including mRNAs and ncRNAs. | |
| Intragenic stretch of non-coding DNA sequence located between two successive exons. | |
| Stretch of non-coding DNA sequence located between the two successive genes. | |
| Exonic stretch of DNA sequence that can synthesize a functional protein. | |
| Exonic stretch of DNA sequence that appears twofold: (1) as regulatory untranslated regions of mRNAs or (2) as translated regions of ncRNAs. | |
| Regulatory DNA sequences identified by FANTOM5, which reflect CAGE-defined core promoter regions. | |
| Regulatory DNA sequences identified by RepeatMasker, which reflect interspersed repeats and low complexity DNA sequences. | |
| Functional Element | Putative transcriptional and regulatory DNA sequences identified by Segway, which reflect DNA binding and accessibility across different cell types (e.g. quiescent, enhancer, promoter, bivalent, etc.).Coloured according to Segway schema. |
| Cis-Regulatory Module (CRM) | Putative binding sites for transcriptional regulators identified by ReMap, which reflect the intersection of overlapping ChIP-seq peaks across different datasets of TRs and biotypes. Coloured according to ReMap schema. |
genomeLabel contains different Perl and Python scripts that require bedtools, bedops, and liftOver. The latter is made available by UCSC utilities.
git clone https://github.com/bkirsh99/genomeLabel.gitcd ${genomeLabel_DIR}sudo apt-get install perl
sudo cpan Module::Namesudo apt-get install python-pip
sudo pip install packagegit clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make
cd ..
export PATH=$PATH:`which bedtools`git clone https://github.com/bedops/bedops.git
cd bedops
make all
make install_all
cp bin/* /usr/local/binfind / -type f -name "liftOver" 2>/dev/nullgit clone https://github.com/ryanlayer/giggle.git
cd giggle
make
export GIGGLE_ROOT=`pwd`
cd lib/htslib
export PATH=$PATH:`pwd`
cd ../../..
wget -O gargs https://github.com/brentp/gargs/releases/download/v0.3.6/gargs_linux
chmod +x gargsThis script makes calls and utilizes all other scripts and packages to generate labels, tracks, and summary statistics, depending on the user input.
This script produces a directory named after the input region in the format myHub_chr:start-stop. It follows UCSC's Public Hub Guidelines by generating the following:
├── myHub_chrX:15200000-15800000 - directory containing track hub files
├── hg38 - directory containing hg38 (GRCh38) human assembly data
| ├── [...].bb - tracks in this directory
| └── trackDb.txt - display properties of tracks in this directory
├── genomes.txt - list of genome assemblies included in the hub (i.e. hg38 only)
└── hub.txt - short description of hub propertiesTo use the output of makeHubs.pl in the UCSC Genome Browser, copy the complete hub folder (e.g. myHub_chrX:15200000-15800000) to a publicly accessible web server. Then, go to https://genome.ucsc.edu/index.html, click on My Data -> Track Hubs -> My Hubs, and add the link to your publicly available hub.txt file into the URL window.
Alternatively, you can load individual tracks from './out/chr:start-stop@biotype/tracks/<filter(s)>' as custom tracks on UCSC or IGV.
This is an example of the priority levels behind the labelling of the raw_genome.bed track, which is only one of the many tracks generated:
- raw_genome.bed - Coding/noncoding exonic, intronic, and intergenic elements
- raw_promoter.bed - Promoter elements
- raw_repeat.bed - Repetitive elements
- filtered_BIOTYPE_f-element.bed - Putative cell-type-agnostic functional elements
- filtered_BIOTYPE_cr-module.bed - Putative cell-type and transcriptional regulator-agnostic binding sites
These are examples of the difference between raw and filtered tracks. The commands used were ./run.pl chrX:15200000-15800000 --biotype gm12878 --path /home/bkirsh/ and ./run.pl chrX:15200000-15800000 --biotype gm12878 --path /home/bkirsh/ --repeat LINE,L2,SINE --regulator GATA2 --felement PromF,Tss. The second call adds two files to the chrX:15200000-15800000 directory, namely filtered_GM12878_LINE_L2_SINE_repeat.bed and filtered_GM12878_PromF_Tss_f-element.bed. A filtered track was not created for cr-module.bed because there is no data for GATA2 in this biotype, which is verifiable by:
cat chrX:15200000-15800000/filtered_GM12878_cr-module.bed | grep GATA2In its default behaviour, genomeLabel produces the "giggle_files" directory, as exemplified below:
chrX:15200000-15800000/giggle_files
├── cr-module
│ └── factor
├── exon
│ ├── gene
│ └── transcript
├── f-element
│ └── state
├── intron
│ ├── gene
│ └── transcript
├── promoter
│ └── promoter
├── repeat
│ ├── class
│ ├── family
│ └── name
└── transcript
├── gene
└── transcriptEach directory within the "giggle_files" directory tree corresponds to the different levels (e.g. repeat class, family, or name) or preferred labels (e.g. gene name or transcript name) available within the annotated bed files.
As such, countless permutations of index and query files can be generated and applied by the GIGGLE tool. Detailed instructions and examples of how to run GIGGLE independently are available at https://github.com/ryanlayer/giggle.
Additional scripts were created to rename the genomic elements in a file before running GIGGLE, resulting in tabular outputs and plots labelled with more meaningful identifiers.
-
Download and rename ENCODE Transcription Factor ChIP-seq Data: download.py
Sample call:
mkdir data python ./download.py download.txt data mkdir split_sort bash $GIGGLE_ROOT/scripts/sort_bed "data/*.gz" split_sort/ > /dev/null
Sample .txt file (available under test/ENCODE.txt):
ATF2_(SC-81188) wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsHaibGm12878Atf2sc81188V0422111UniPk.narrowPeak.gz
ATF3 wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsHaibGm12878Atf3Pcr1xUniPk.narrowPeak.gz
BATF wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsHaibGm12878BatfPcr1xUniPk.narrowPeak.gz -
Create .bed file from FANTOM5 Robust CAGE Peaks Data: rename.py
Sample call:
mkdir split python ./rename.py names.txt split -a all.bed mkdir split_sort bash $GIGGLE_ROOT/scripts/sort_bed "split/*.bed" split_sort/ > /dev/null $GIGGLE_ROOT/bin/giggle index -i "split_sort/*gz" -o split_sort_b -f -s 2> /dev/null mkdir all_split cp all.bed all_split/all.bed mkdir all_split_sort bash $GIGGLE_ROOT/scripts/sort_bed "all_split/*.bed" all_split_sort/ > /dev/null $GIGGLE_ROOT/bin/giggle index -i "all_split_sort/*gz" -o all_split_sort_b -f -s 2> /dev/null
Sample .txt file (available under test/FANTOM5.txt):
chrX:73071838..73071849,- XIST
chrX:47053208..47053241,+ UBA1
chrX:16737699..16737708,+ SYAP1
chrX:73040494..73040497,+ TSIX
mkdir split_sort
bash $GIGGLE_ROOT/scripts/sort_bed "chrX:15200000-15800000/giggle_files/cr-module/factor/*" split_sort/ > /dev/null
$GIGGLE_ROOT/bin/giggle index -i "split_sort/*" -o split_sort_b -f -s 2> /dev/null$GIGGLE_ROOT/bin/giggle search -i split_sort_b -r X:15200000-15800000 > cr-module.result
bgzip chrX:15200000-15800000/giggle_files/transcript/gene/ACE2.bed
$GIGGLE_ROOT/bin/giggle search -i split_sort_b -q chrX:15200000-15800000/giggle_files/transcript/gene/ACE2.bed.gz -s > cr-module_ACE2.resultfor filename in chrX:15200000-15800000/giggle_files/transcript/gene/*; do bgzip $filename; done
mkdir cr-module_transcript_results
ls chrX:15200000-15800000/giggle_files/transcript/gene | ./gargs -p 10 '$GIGGLE_ROOT/bin/giggle search -i split_sort_b -q chrX:15200000-15800000/giggle_files/transcript/gene/{} -s > cr-module_transcript_results/{}.results'The GIGGLE results consist of a single file or multiple files located within the same directory. In either case, genomeLabel's stats_giggle.py script can be used to summmarize and visualize the output when the reults are derived from a GIGGLE call that uses at least one query file with the option "-s."
1) Show top 5 highest scoring index files (e.g. transcription factors) for each query file (e.g. transcript), ranked by combo_score (default):
./stats_giggle.py -i "cr-module_transcript_results/"2) Show top 20 highest scoring index files (e.g. transcription factors) for each query file (e.g. transcript), ranked by number of overlapping intervals:
./stats_giggle.py -i "cr-module_transcript_results/" --by overlaps -n 203) Show top 5 highest scoring query files (e.g. transcript) for each index file (e.g. transcription factors), ranked by combo_score:
./stats_giggle.py -i "cr-module_transcript_results/" --by filenameAn important application of genomeLabel is that it allows users to visualize interesting overlapping elements on the genome browser, and revert back to using the tool in order to obtain coverage statistics.
The 'history.txt' file is verified upon every call as a means to avoid requesting redundant data to the UCSC REST API. It consists of an ARRAY created using the Storable package, and stores two sets of HASHES - one made up of previous inputs and another one made up of previous bedFeature objects.
$VAR1 = [ { biotype => {chr => [ [ start , end ] ] } } , { source1 => [bedFeatures] , ... , sourceN => [bedFeatures] } ]Should you encounter any problems with the generation of files that contain the correct information for your desired coordinates and filters, the issue most likely lies in "history.txt." Please delete the file and run the command again. While this will take longer, it will ensure that the output is correct.