
Commit 4b24041

committed
Improved documentation
1 parent 944cf0b commit 4b24041

2 files changed

Lines changed: 81 additions & 54 deletions

File tree

docs/Description.pdf

16.4 KB
Binary file not shown.

docs/Description.tex

Lines changed: 81 additions & 54 deletions
@@ -5,28 +5,53 @@
55
\usepackage[table,xcdraw]{xcolor}
66
\usepackage{hyperref}
77
\usepackage{svg}
8+
\usepackage{geometry}
9+
\geometry{a4paper, margin=1.1in}
810

911

12+
13+
\date{}
1014
\begin{document}
11-
\section*{Computing transcription factor scores using TEPIC}
12-
\subsection*{Motivation}
15+
\title{TEPIC 2 - An extended framework for transcription factor binding prediction and integrative epigenomic analysis}
16+
\maketitle
17+
18+
TEPIC is a versatile framework for the analysis of transcription factor (TF) binding and offers several machine learning approaches for integrative analysis of predicted transcription factor binding sites (TFBS) and gene-expression data.
19+
Briefly, TEPIC offers:
20+
\begin{itemize}
21+
\item Annotation of user defined regions with TF affinities using TRAP and a variety of provided TF-motifs,
22+
\item Aggregation of TF affinities to TF-gene scores,
23+
\item Computation of statistical scores such as peak-length, peak-count or peak-signal per gene,
24+
\item Discretisation of continuous TF affinities using a background distribution into a binary measure for TF-binding,
25+
\item Linear regression analysis to infer key transcriptional regulators within one sample,
26+
\item Logistic regression classifier to suggest key transcriptional regulators between samples,
27+
\item Generation of input for DREM to infer important TFs from temporal epigenomic and gene expression data.
28+
\end{itemize}
29+
30+
This document provides a brief introduction to the functionality of TEPIC and the machine learning approaches.
31+
32+
\newpage
33+
\section{Introduction to TEPIC}
34+
\subsection{Motivation}
35+
TFs are essential players in transcriptional regulation. To understand their function, it is essential to know their binding sites genome-wide.
36+
Although TFBS can be inferred from ChIP-seq experiments, several in-silico approaches have been developed as well to overcome the burden and complexity of wet-lab experiments.
37+
In particular, computational methods that consider epigenetics data in the prediction have been used successfully to predict TFBS.
1338
The main advantage of considering epigenetics data for the task of TF binding prediction is that the number of false positive predictions can be reduced \cite{pmid21106904}.
1439
One way of incorporating epigenetics data is to reduce the genomic search space to a few candidate regions of TF binding.
1540
As shown before, genome-wide candidate sites for TF binding can be determined by open-chromatin experiments \cite{pmid25294828,pmid25086003,pmid22072382,pmid23424114}, e.g. peaks or footprints in DNase1-seq data,
1641
and/or by considering Histone marks \cite{pmid25489339,pmid25086003}, e.g. H3K4me3.
1742

18-
Here, we compute TF affinities for a species specific set of \textit{Position Specific Energy Matrices (PSEM)} using \textit{TRAP} \cite{pmid17098775} which is based on a biophysical model of TF binding \cite{von1986specificity}.
43+
Here, we compute TF affinities for curated sets of \textit{Position Specific Energy Matrices (PSEMs)} using \textit{TRAP} \cite{pmid17098775}, which is based on a biophysical model of TF binding \cite{von1986specificity}.
1944
A major advantage of affinity based predictions compared to hit-based methods like Fimo \cite{Grant16022011} is that
2045
low-affinity binding sites can be included \cite{pmid27899623,pmid17098775}. Using the \textit{TEPIC} method, we compute TF gene scores by aggregating TF predictions calculated for a user defined set of candidate regions.
2146
The scores, either per peak/region or gene, can be interpreted as a quantitative measurement of TF binding.
2247

23-
\subsection*{Preprocessing of Position Count Matrices (PCM)}
48+
\subsection{Collection of TF-motifs}
2449
We obtained \textit{Position Count Matrices (PCMs)} from JASPAR \cite{pmid26531826}, which also includes data from Uniprobe \cite{pmid25378322}, HOCOMOCO \cite{pmid23175603} and the Kellis Lab ENCODE Motif database \cite{pmid24335146}.
2550

2651
There are three folders containing Position Specific Energy Matrices (PSEMs): our current collection of PSEMs is stored in \textit{PWMs/2.1}, and the previously used motifs are provided in the folders \textit{PWMs/2.0} and \textit{PWMs/1.0}.
27-
The position weight matrices used in the TEPIC manuscript are stored in the file \\\textit{PWMs/1.0/pwm\_vertebrates\_jaspar\_uniprobe\_original.PSEM}.
52+
TF motifs used in the original TEPIC manuscript are stored in the file \\\textit{PWMs/1.0/pwm\_vertebrates\_jaspar\_uniprobe\_original.PSEM}.
2853

29-
In detail, the current collection contains from the JASPAR 2018 Core database:
54+
In detail, the current collection contains from the \textit{JASPAR 2018 Core} database:
3055
\begin{itemize}
3156
\item 579 PSEMs for vertebrates
3257
\item 176 PSEMs for fungi
@@ -93,6 +118,8 @@ \subsection*{Preprocessing of Position Count Matrices (PCM)}
93118

94119
Files holding the length of the PSEMs are provided too.
95120

121+
122+
\subsection{Converting position count matrices to position specific energy matrices}
96123
As mentioned above, \textit{TRAP} computes TF affinities that are based on a biophysical model of TF binding.
97124
Therefore \textit{PCMs} have to be converted to \textit{Position Specific Energy Matrices (PSEMs)} such that they can be used in \textit{TRAP}.
98125
Intuitively, \textit{PSEMs} represent the mismatch energy of a given motif. For a detailed explanation and motivation of the energy based score, please check \cite{pmid17098775}.
@@ -130,14 +157,11 @@ \subsection*{Preprocessing of Position Count Matrices (PCM)}
130157
\end{itemize}
131158
In all other cases, a default GC-content of $0.42$ is used.
132159

133-
\newpage
134-
\subsection*{Computing TF gene scores}
135-
Currently, we offer the annotation of five different species, including the most common model organisms:
136-
\textit{homo sapiens, mus musculus, rattus norvegicus, drosophila melanogaster,} and \textit{caenorhabditis elegans}.
137-
Using our collections of species specific \textit{PSEMs}, \textit{TRAP} computes TF binding affinities in all user provided regions
160+
\subsection{Computing TF gene scores}
161+
Using our collections of \textit{PSEMs}, \textit{TRAP} computes TF binding affinities in all user provided regions
138162
that could be found in the reference genomes of the respective species and
139163
overlap with a window of user defined size $w$ that is centered at the most $5'$ TSS of all annotated genes in the considered organism.
140-
Then, TF gene scores are computed by incorporating all candidate binding sites within the window centered around the $5'$ TSS of genes in the final score.
164+
Then, TF-gene scores are computed by incorporating all candidate binding sites within the window centered around the $5'$ TSS of genes in the final score.
141165
The contribution of the individual sites is weighted by their distance to the selected TSS with an exponential decay function \cite{pmid19995984}.
142166
Formally, the TF gene score $a_{g,i}$ for gene $g$ and TF $i$ is computed as
143167
\begin{align}
@@ -158,9 +182,18 @@ \subsection*{Computing TF gene scores}
158182
where $s_p$ is the per base signal in peak $p$. This computation can be done with and without length normalisation of the affinities.
159183
The workflow of TEPIC is depicted in Figure \ref{workflowFig}.
160184

161-
In addition to the TF gene scores, TEPIC can compute features for peak length, peak count, and peak signal following the same scoring formulation as for
162-
TF affinities. These features can be used for example to assess the influence of chromatin accessiblity on gene expression without considering TF binding
163-
predictions.
185+
In addition to the TF gene scores, TEPIC can compute features for peak length ($pl_g$), peak count ($pc_g$), and peak signal ($ps_g$) following the same scoring formulation as for
186+
TF affinities:
187+
\begin{align}
188+
pl_g&=\sum_{p \in P_{g,w}}|p|e^{-\frac{d_{p,g}}{d_0}}, \\
189+
pc_g&=\sum_{p \in P_{g,w}}e^{-\frac{d_{p,g}}{d_0}}, \\
190+
ps_g&=\sum_{p \in P_{g,w}}s_{p}e^{-\frac{d_{p,g}}{d_0}},
191+
\end{align}
192+
where $|p|$ is the length of $p$. These features can be used, for example, to assess the influence of chromatin accessibility on gene expression without considering TF binding predictions.
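
A minimal sketch of the three peak features defined above, written in Python for illustration only. The function name `peak_features`, the tuple layout `(length, signal, distance)`, and the default decay constant `d0=5000.0` are assumptions made for this example and are not taken from the TEPIC code.

```python
import math


def peak_features(peaks, d0=5000.0):
    """Return (pl_g, pc_g, ps_g) for one gene.

    peaks: iterable of (length, signal, distance) tuples, one per peak p
           in the window around the gene's TSS (hypothetical layout).
    d0:    decay constant of the exponential weighting (assumed value).
    """
    pl = pc = ps = 0.0
    for length, signal, distance in peaks:
        weight = math.exp(-distance / d0)  # e^{-d_{p,g} / d_0}
        pl += length * weight              # peak length feature
        pc += weight                       # peak count feature
        ps += signal * weight              # peak signal feature
    return pl, pc, ps
```

With all peaks located directly at the TSS (distance 0), every weight equals 1, so the features reduce to the plain sum of peak lengths, the number of peaks, and the sum of peak signals.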
193+
194+
Furthermore, TEPIC can compute a TF-specific affinity cut-off, derived from either user-defined or randomly generated sequences, to distinguish likely bound sites from unbound sites. These scores
195+
can be used to derive a binary TF-gene assignment. Further details on this mode are provided in Section \ref{EPIC-DREM}.
196+
164197
\begin{figure}[h!]
165198
\begin{center}
166199
\includegraphics[width=\textwidth]{Workflow.png}
@@ -173,30 +206,30 @@ \subsection*{Computing TF gene scores}
173206
\label{workflowFig}
174207
\end{figure}
175208

176-
\newpage
177-
\subsection*{Required input}
209+
\subsection{Required input}
178210
To compute TF gene scores a user needs to specify:
179211
\begin{itemize}
180-
\item a reference genome,
181-
\item a set of \textit{PSEMs},
182-
\item a set of genomic regions in BED format.
212+
\item a reference genome (-g option),
213+
\item a set of \textit{PSEMs} (-p option),
214+
\item a set of genomic regions in BED format (-b option).
215+
\item a gtf file containing the genome annotation (-a option).
183216
\end{itemize}
184-
Note that the chromosome identifiers in the BED file must match the identifiers used in the reference genomes, neglecting the \textit{chr} prefix.
185-
Otherwise they can not be considered.
217+
Note that the chromosome identifiers in the BED file must match the identifiers used in the reference genomes. Otherwise, the affected regions cannot be considered.
186218
Special care should be taken for \textit{caenorhabditis elegans}, as Roman digits are used for enumeration of chromosomes.
187219

188-
\subsection*{Output}
189-
This step generates the following output:
220+
\subsection{Output}
221+
TEPIC outputs:
190222
\begin{enumerate}
191-
\item TF affinities for all selected \textit{PSEMs} in the regions provided by the user that passed the filtering step.
192-
\item (Length normalised) TF gene scores for all selected \textit{PSEMs} calculated as described above (optionally including peak features).
193-
\item A meta data file listing all used parameters.
194-
\item Optionally a seperate file containing the signal information in peaks.
223+
\item TF affinities for all selected \textit{PSEMs} in the regions provided by the user that passed the filtering step (\textit{\*\_Affinity.txt}).
224+
\item (Length normalised) TF gene scores for all selected \textit{PSEMs} calculated as described above (optionally including peak features) (\textit{\*\_Affinity\_Gene\_View.txt}).
225+
\item A meta data file listing all used parameters (\textit{amd.tsv}).
226+
\item Optionally a separate file containing the signal information in peaks (\textit{\*\_Peak\_Coverage.txt}).
227+
\item TF affinities with all values below an inferred threshold set to zero (\textit{\*\_Thresholded\_Affinity.txt})
228+
\item A sparse representation linking TF to genes (\textit{\*\_Sparse\_Affinity\_Gene\_View.txt})
195229
\end{enumerate}
196230

197231
\newpage
198-
199-
\section*{Identification of key transcriptional regulators using epigenetics data (INVOKE)}
232+
\section{Identification of key transcriptional regulators using epigenetics data (INVOKE)}
200233
Epigenetics data contains a wealth of information on gene regulation. It was shown that especially
201234
data on open-chromatin is well suited to build predictive models of gene-expression \cite{pmid27899623,pmid22955983,pmid25231769,pmid22954627}.
202235
Interpreting these models allows the inference of regulators that may play a key role in gene-expression regulation.
@@ -213,8 +246,8 @@ \section*{Identification of key transcriptional regulators using epigenetics dat
213246
\item Learning a linear regression model to predict gene expression from TF gene scores computed in (1).
214247
\end{enumerate}
215248

216-
\subsection*{Linear regression to predict gene expression}
217-
\subsubsection*{Motivation}
249+
\subsection{Linear regression to predict gene expression}
250+
\subsubsection{Motivation}
218251
In order to learn about potentially important regulators, we build a linear, interpretable regression model,
219252
comparable to methods proposed in \cite{pmid27899623,pmid22955983,pmid25231769,pmid22954627}.
220253
Here, we use TF gene scores computed with \textit{TEPIC} as features in a linear regression setup to predict gene expression.
@@ -224,7 +257,7 @@ \subsubsection*{Motivation}
224257

225258
Details on the learning setup and on the available regularization methods are provided in the next section.
226259

227-
\subsubsection*{Available regularization methods}
260+
\subsubsection{Available regularization methods}
228261
We offer three different regularization techniques:
229262
\begin{itemize}
230263
\item Lasso:
@@ -252,9 +285,9 @@ \subsubsection*{Available regularization methods}
252285
It resolves the correlation between features by distributing the feature weights among them, and simultaneously leads to sparse and stable models \cite{Zou05regularizationand}.
253286
However, learning a model using elastic net penalty is slower than using either only Lasso or Ridge regularization.
254287

255-
\subsubsection*{Details on the learning setup}
288+
\subsubsection{Details on the learning setup}
256289
The data matrix $X$, containing TF gene scores, and the response vector $y$, containing gene expression values, are log-transformed,
257-
with a pseudo-count of $1$, centered and scaled to fit them as.
290+
with a pseudo-count of $1$, centered and scaled.
258291
Regression coefficients are computed in an inner cross validation,
259292
the $\alpha$ parameter of elastic net regularization is optimized with a default step size of $0.1$.
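
The preprocessing described above (log-transform with a pseudo-count of $1$, then centering and scaling) can be sketched for a single feature column as follows. This is an illustrative Python sketch, not the TEPIC implementation; the function name `log_center_scale` and the use of the sample standard deviation for scaling are assumptions.

```python
import math
import statistics


def log_center_scale(values, pseudo_count=1.0):
    """Log-transform one column with a pseudo-count, then center and scale it.

    Scaling uses the sample standard deviation (an assumption for this sketch).
    """
    logged = [math.log(v + pseudo_count) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)
    if sd == 0:
        raise ValueError("a constant column cannot be scaled")
    return [(x - mean) / sd for x in logged]
```

After this transformation each column has mean 0 and unit sample standard deviation, so the regression coefficients of different TFs are comparable in magnitude.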
260293

@@ -272,13 +305,11 @@ \subsubsection*{Details on the learning setup}
272305
\end{enumerate}
273306
All parameters mentioned in this section can be changed by the user. The learning process is sketched in Figure \ref{learningFig}.
274307

275-
\subsubsection*{Required input}
308+
\subsubsection{Required input}
276309
In addition to the input required for the computation of TF gene scores in TEPIC, a file containing gene expression data must be provided.
277310
This file should be structured such that column $1$ contains the gene identifiers and column $2$ holds expression values.
278-
Besides, we support the upload of a matrix containing gene expression data for several samples. In that case, the user has to select the column/sample that should
279-
be used for model construction.
280311

281-
\subsubsection*{Output and hints for interpretation}
312+
\subsubsection{Output and hints for interpretation}
282313
The user is always provided with the following files:
283314
\begin{itemize}
284315
\item a list of regression coefficients computed on the entire data set,
@@ -308,22 +339,20 @@ \subsubsection*{Output and hints for interpretation}
308339
\end{figure}
309340

310341
\newpage
311-
\mbox{}
312-
\newpage
313-
\section*{Differential analysis to identify novel transcriptional regulators for differentially expressed genes (DYNAMITE)}
342+
\section{Differential analysis to identify novel transcriptional regulators for differentially expressed genes (DYNAMITE)}
314343
Although a variety of methods have been proposed to generate genome-wide TF binding predictions [\cite{pmid27899623,pmid22072382,pmid23424114,pmid25086003}] and to establish \textit{TF to tissue}
315344
associations [\cite{pmid22955983,pmid19995984,pmid27899623}], systematic, feasible, and easy to use ways of linking TFs to distinct genes are rare.
316345

317346
In addition to the \textit{INVOKE} analysis, we propose a method to infer the most likely transcriptional regulators for a set of differentially expressed genes.
318347
We use TF scores, computed using \textit{TEPIC}, and logistic regression to identify TFs that have explanatory power to distinguish between up- and down-regulated genes.
319348

320-
\section*{Input}
349+
\subsection{Input}
321350
To run \textit{DYNAMITE}, a user must provide candidate regions of TF binding for two groups of samples, $A$ and $B$, e.g. control and disease.
322351
These can be derived, for example, by open chromatin experiments such as DNase-seq.
323352
It is essential that the candidate regions reflect the characteristics of chromatin organization in the analysed tissues.
324353
In addition, a list of differentially expressed genes between the two groups as well as the log2 fold changes of their expression are needed.
325354

326-
\section*{Method}
355+
\subsection{Method}
327356
Our method consists of two parts: (1) gene-TF score computation, and (2) identification of key TFs.
328357
\subsection*{Step 1: Computing Gene-TF Scores}
329358
Using TEPIC, we compute gene-TF scores $g_{ij}$ for all differentially expressed genes $i$ and distinct TFs $j$ considering the provided candidate regions for all replicates $a$ of group $A$ and for all replicates $b$ of group $B$.
@@ -344,7 +373,7 @@ \subsection*{Step 1: Computing Gene-TF Scores}
344373
\caption{Computation of differential TF features between two groups.}
345374
\label{TF-Gene-Score_Computation}
346375
\end{figure}
347-
\subsection*{Step 2: Identification of Key Transcription Factors}
376+
\subsection{Step 2: Identification of Key Transcription Factors}
348377
To identify those TFs that can explain the differential expression state of as many genes as possible, we build a logistic regression classifier.
349378
We use matrix $R_{AB}$ computed in Step $1$ as the feature matrix $X$, and a binary vector of gene expression changes as response $y$.
350379
An example is shown in Figure \ref{Log-Reg-Example}.
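
As a small illustration of the response vector $y$, the following Python sketch maps log2 fold changes to a binary label. The encoding (1 for up-regulated, 0 for down-regulated) and the function name `binary_response` are assumptions made for this example; the text only states that the response is binary.

```python
def binary_response(log2_fold_changes):
    """Map log2 fold changes to 1 (up-regulated) or 0 (down-regulated).

    The 1/0 encoding is an assumption made for this sketch.
    """
    return [1 if lfc > 0 else 0 for lfc in log2_fold_changes]
```

Each entry of the returned list corresponds to one differentially expressed gene, i.e. to one row of the feature matrix $X$.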
@@ -361,7 +390,7 @@ \subsection*{Step 2: Identification of Key Transcription Factors}
361390
same learning paradigm that is described for the \textit{INVOKE} analysis (Figure \ref{learningFig}). We use the entire dataset for model training and to interpret the regression coefficients.
362391
TFs that correspond to features with a non-zero regression coefficient can be seen as being essential to explain the observed expression differences and should be further investigated.
363392

364-
\section*{Output}
393+
\section{Output}
365394
Model performance is reported in a \textit{txt} file and visually in a bar plot using mean test and training accuracy as well as the F1 measure.
366395
A heatmap shows the regression coefficients in the outer cross validation folds.
367396
Additionally, we report confusion matrices for the outer cross validation folds.
@@ -380,10 +409,8 @@ \section*{Output}
380409
\end{figure}
381410

382411
\newpage
383-
\mbox{}
384-
\newpage
385-
386-
\subsection*{Determine important transcriptional regulators from time serires data (EPIC-DREM)}
412+
\section{Determine important transcriptional regulators from time series data (EPIC-DREM)}
413+
\label{EPIC-DREM}
387414
\textit{EPIC-DREM} is a combination of \textit{TEPIC} and the \textit{Dynamic Regulatory Events Miner (DREM)} \cite{pmid17224918}.
388415
Instead of using static ChIP-seq data, which is provided in \textit{DREM 2.0}, we suggest using time-point specific TF binding predictions
389416
based on time-dependent epigenomic profiles. Thereby, \textit{DREM} can infer regulators that can be linked to expression changes at distinct points in time.
@@ -406,19 +433,19 @@ \subsection*{Determine important transcriptional regulators from time serires da
406433
\label{epicdrem}
407434
\end{figure}
408435

409-
\subsubsection*{Thresholded TF affinities}
436+
\subsection{Thresholded TF affinities}
410437
In some applications it is required to make a binary decision whether a factor is binding or not.
411438
To infer this information from TF affinities, \textit{TEPIC} allows the computation of a TF specific affinity threshold by calculating TF affinities on a randomly selected set of genomic regions. When selected by TEPIC, these regions show similar characteristics compared to the provided regions (GC content and length).
412439
Alternatively a set of background regions can be provided by the user.
413440
By applying a user defined p-value on the distribution of affinities computed on the random regions, a threshold is chosen.
414441
Per TF, all affinities that are smaller than the selected threshold are set to zero; thus, a sparse matrix with TF-gene interactions can be generated.
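
The thresholding idea can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the TEPIC implementation: the function names `affinity_threshold` and `apply_threshold` are hypothetical, and the threshold is taken as an empirical upper quantile of the background affinities, which is one plausible reading of "applying a p-value on the distribution".

```python
import math


def affinity_threshold(background, p_value):
    """Pick a per-TF threshold from background affinities.

    Returns the k-th largest background value, where k = floor(p_value * n)
    (at least 1), so that at most a fraction p_value of the background
    affinities reach the threshold (assuming no ties).
    """
    ranked = sorted(background, reverse=True)
    k = max(1, math.floor(p_value * len(ranked)))
    return ranked[k - 1]


def apply_threshold(affinities, threshold):
    """Set all affinities smaller than the threshold to zero."""
    return [a if a >= threshold else 0.0 for a in affinities]
```

For example, with eight background affinities `1.0` to `8.0` and a p-value of `0.25`, the threshold is the second-largest background value, and only affinities at or above it are kept.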
415442

416-
\subsubsection*{Required input}
443+
\subsubsection{Required input}
417444
In addition to the input mentioned above, a reference genome in 2bit format is required.
418445
Optionally, the user can provide a bed file containing background regions.
419446
These replace the automated generation of background sequences.
420447

421-
\subsubsection*{Output}
448+
\subsubsection{Output}
422449
The following output files are generated in addition:
423450
\begin{enumerate}
424451
\item TF affinities for all selected \textit{PSEMs} in the regions provided by the user that passed the filtering step, where all affinities below the TF specific thresholds are set to 0.
@@ -427,7 +454,7 @@ \subsubsection*{Output}
427454
\end{enumerate}
428455

429456
Either (2) or (3) can be combined with RNA-seq data and used as input for \textit{DREM}.
430-
457+
\newpage
431458
\bibliographystyle{plain}
432459
\bibliography{Description}
433460
\end{document}
