docs/Description.tex (81 additions, 54 deletions)
@@ -5,28 +5,53 @@
5
5
\usepackage[table,xcdraw]{xcolor}
6
6
\usepackage{hyperref}
7
7
\usepackage{svg}
8
+
\usepackage{geometry}
9
+
\geometry{a4paper, margin=1.1in}
8
10
9
11
12
+
13
+
\date{}
10
14
\begin{document}
11
-
\section*{Computing transcription factor scores using TEPIC}
12
-
\subsection*{Motivation}
15
+
\title{TEPIC 2 - An extended framework for transcription factor binding prediction and integrative epigenomic analysis}
16
+
\maketitle
17
+
18
+
TEPIC is a versatile framework for the analysis of transcription factor (TF) binding and offers several machine learning approaches for integrative analysis of predicted transcription factor binding sites (TFBS) and gene-expression data.
19
+
Briefly, TEPIC offers:
20
+
\begin{itemize}
21
+
\item Annotation of user defined regions with TF affinities using TRAP and a variety of provided TF-motifs,
22
+
\item Aggregation of TF affinities to TF-gene scores,
23
+
\item Computation of statistical scores such as peak-length, peak-count or peak-signal per gene,
24
+
\item Discretisation of continuous TF affinities using a background distribution into a binary measure for TF-binding,
25
+
\item Linear regression analysis to infer key transcriptional regulators within one sample,
26
+
\item Logistic regression classifier to suggest key transcriptional regulators between samples,
27
+
\item Generation of input for DREM to infer important TFs from temporal epigenomic and gene expression data.
28
+
\end{itemize}
29
+
30
+
This document provides a brief introduction to the functionality of TEPIC and the machine learning approaches.
31
+
32
+
\newpage
33
+
\section{Introduction to TEPIC}
34
+
\subsection{Motivation}
35
+
TFs are essential players in transcriptional regulation. Understanding their function requires knowing their binding sites genome-wide.
36
+
Although TFBS can be inferred from ChIP-seq experiments, several in-silico approaches have been developed as well to overcome the burden and complexity of wet-lab experiments.
37
+
In particular, computational methods that incorporate epigenetics data have been used successfully to predict TFBS.
13
38
The main advantage of considering epigenetics data for the task of TF binding prediction is that the number of false positive predictions can be reduced \cite{pmid21106904}.
14
39
One way of incorporating epigenetics data is to reduce the genomic search space to a few candidate regions of TF binding.
15
40
As shown before, genome-wide candidate sites for TF binding can be determined by open-chromatin experiments \cite{pmid25294828,pmid25086003,pmid22072382,pmid23424114}, e.g. peaks or footprints in DNase1-seq data,
16
41
and/or by considering Histone marks \cite{pmid25489339,pmid25086003}, e.g. H3K4me3.
17
42
18
-
Here, we compute TF affinities for a species specific set of \textit{Position Specific Energy Matrices (PSEM)} using \textit{TRAP} \cite{pmid17098775} which is based on a biophysical model of TF binding \cite{von1986specificity}.
43
+
Here, we compute TF affinities for curated sets of \textit{Position Specific Energy Matrices (PSEMs)} using \textit{TRAP} \cite{pmid17098775}, which is based on a biophysical model of TF binding \cite{von1986specificity}.
19
44
A major advantage of affinity based predictions compared to hit-based methods like Fimo \cite{Grant16022011} is that
20
45
low-affinity binding sites can be included \cite{pmid27899623,pmid17098775}. Using the \textit{TEPIC} method, we compute TF gene scores by aggregating TF predictions calculated for a user defined set of candidate regions.
21
46
The scores, either per peak/region or gene, can be interpreted as a quantitative measurement of TF binding.
22
47
23
-
\subsection*{Preprocessing of Position Count Matrices (PCM)}
48
+
\subsection{Collection of TF-motifs}
24
49
We obtained \textit{Position Count Matrices (PCMs)} from JASPAR \cite{pmid26531826}, which also includes data from Uniprobe \cite{pmid25378322}, HOCOMOCO \cite{pmid23175603} and the Kellis Lab ENCODE Motif database \cite{pmid24335146}.
25
50
26
51
There are three folders containing Position Specific Energy Matrices (PSEMs). Our current collection of PSEMs is stored in \textit{PWMs/2.1}. The previously used motifs are provided in the folders \textit{PWMs/2.0} and \textit{PWMs/1.0}.
27
-
The position weight matrices used in the TEPIC manuscript are stored in the file \\\textit{PWMs/1.0/pwm\_vertebrates\_jaspar\_uniprobe\_original.PSEM}.
52
+
TF motifs used in the original TEPIC manuscript are stored in the file \\\textit{PWMs/1.0/pwm\_vertebrates\_jaspar\_uniprobe\_original.PSEM}.
28
53
29
-
In detail, the current collection contains from the JASPAR 2018 Core database:
54
+
In detail, the current collection contains from the \textit{JASPAR 2018 Core} database:
30
55
\begin{itemize}
31
56
\item 579 PSEMs for vertebrates
32
57
\item 176 PSEMs for fungi
@@ -93,6 +118,8 @@ \subsection*{Preprocessing of Position Count Matrices (PCM)}
93
118
94
119
Files holding the length of the PSEMs are provided too.
95
120
121
+
122
+
\subsection{Converting position count matrices to position specific energy matrices}
96
123
As mentioned above, \textit{TRAP} computes TF affinities that are based on a biophysical model of TF binding.
97
124
Therefore \textit{PCMs} have to be converted to \textit{Position Specific Energy Matrices (PSEMs)} such that they can be used in \textit{TRAP}.
98
125
Intuitively, \textit{PSEMs} represent the mismatch energy of a given motif. For a detailed explanation and motivation of the energy based score, please check \cite{pmid17098775}.
@@ -130,14 +157,11 @@ \subsection*{Preprocessing of Position Count Matrices (PCM)}
130
157
\end{itemize}
131
158
In all other cases, a default GC-content of $0.42$ is used.
132
159
133
-
\newpage
134
-
\subsection*{Computing TF gene scores}
135
-
Currently, we offer the annotation of five different species, including the most common model organisms:
Using our collections of species specific \textit{PSEMs}, \textit{TRAP} computes TF binding affinities in all user provided regions
160
+
\subsection{Computing TF gene scores}
161
+
Using our collections of \textit{PSEMs}, \textit{TRAP} computes TF binding affinities in all user provided regions
138
162
that could be found in the reference genomes of the respective species and
139
163
overlap with a window of user defined size $w$ that is centered at the most $5'$ TSS of all annotated genes in the considered organism.
140
-
Then, TFgene scores are computed by incorporating all candidate binding sites within the window centered around the $5'$ TSS of genes in the final score.
164
+
Then, TF-gene scores are computed by incorporating all candidate binding sites within the window centered around the $5'$ TSS of genes in the final score.
141
165
The contribution of the individual sites is weighted by their distance to the selected TSS with an exponential decay function \cite{pmid19995984}.
142
166
Formally, the TF gene score $a_{g,i}$ for gene $g$ and TF $i$ is computed as
where $s_p$ is the per base signal in peak $p$. This computation can be done with and without length normalisation of the affinities.
159
183
The workflow of TEPIC is depicted in Figure \ref{workflowFig}.
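As a rough illustration of the distance-weighted aggregation described above, the following Python sketch sums the per-peak affinities of one TF for one gene and weights each peak by an exponential decay in its distance to the TSS. The function name, the tuple layout of the peaks, the use of the peak centre for the distance, and the decay constant of 5000 bp are assumptions made for this example only; they are not taken from this document.

```python
import math

def tf_gene_score(peaks, tss, window, d0=5000.0):
    """Aggregate per-peak TF affinities into one TF-gene score.

    peaks  : list of (start, end, affinity) tuples for one TF (hypothetical layout)
    tss    : position of the 5' TSS of the gene
    window : total window size w, centred at the TSS
    d0     : decay constant in bp (assumed value, not from the text)
    """
    half = window / 2.0
    score = 0.0
    for start, end, affinity in peaks:
        # Keep only peaks that overlap the window [tss - half, tss + half].
        if end < tss - half or start > tss + half:
            continue
        centre = (start + end) / 2.0
        distance = abs(centre - tss)
        # Exponential decay: distant peaks contribute less to the score.
        score += affinity * math.exp(-distance / d0)
    return score

# Toy data: two peaks inside a 50 kb window, one far outside of it.
peaks = [(900, 1100, 2.0), (5900, 6100, 1.0), (90000, 90200, 5.0)]
score = tf_gene_score(peaks, tss=1000, window=50000)
```

In this toy example the peak centred on the TSS contributes its full affinity, the peak 5000 bp away is down-weighted by a factor of $e^{-1}$, and the third peak is ignored because it does not overlap the window.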
160
184
161
-
In addition to the TF gene scores, TEPIC can compute features for peak length, peak count, and peak signal following the same scoring formulation as for
162
-
TF affinities. These features can be used for example to assess the influence of chromatin accessiblity on gene expression without considering TF binding
163
-
predictions.
185
+
In addition to the TF gene scores, TEPIC can compute features for peak length ($pl_g$), peak count ($pc_g$), and peak signal ($ps_g$) following the same scoring formulation as for
where $|p|$ is the length of $p$. These features can be used, for example, to assess the influence of chromatin accessibility on gene expression without considering TF binding predictions.
193
+
194
+
Furthermore, TEPIC can compute a TF-specific affinity cut-off, derived from either user-defined or randomly generated sequences, to distinguish likely bound sites from unbound sites. These scores
195
+
can be used to derive a binary TF-gene assignment. Further details on this mode are provided in Section \ref{EPIC-DREM}.
To compute TF gene scores a user needs to specify:
179
211
\begin{itemize}
180
-
\item a reference genome,
181
-
\item a set of \textit{PSEMs},
182
-
\item a set of genomic regions in BED format.
212
+
\item a reference genome (-g option),
213
+
\item a set of \textit{PSEMs} (-p option),
214
+
\item a set of genomic regions in BED format (-b option),
215
+
\item a GTF file containing the genome annotation (-a option).
183
216
\end{itemize}
184
-
Note that the chromosome identifiers in the BED file must match the identifiers used in the reference genomes, neglecting the \textit{chr} prefix.
185
-
Otherwise they can not be considered.
217
+
Note that the chromosome identifiers in the BED file must match the identifiers used in the reference genomes. Regions with non-matching identifiers cannot be considered.
186
218
Special care should be taken for \textit{Caenorhabditis elegans}, as Roman numerals are used for the enumeration of chromosomes.
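A quick consistency check before running the annotation can catch mismatching identifiers early. The following Python sketch is only an illustration under stated assumptions: the function name is hypothetical, the reference identifiers are passed in as a plain list (e.g. collected from the FASTA headers), and identifiers are compared by exact string match.

```python
def unmatched_chromosomes(bed_lines, reference_names):
    """Return BED chromosome identifiers that are absent from the reference."""
    reference = set(reference_names)
    missing = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the optional comment/track/browser lines of BED.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split()[0]
        if chrom not in reference and chrom not in missing:
            missing.append(chrom)
    return missing

# Toy example with Roman numerals, as used for C. elegans chromosomes.
bed = ["chr1\t100\t200", "chrI\t50\t80", "II\t10\t40"]
reference = ["I", "II", "III", "IV", "V", "X"]
missing = unmatched_chromosomes(bed, reference)
```

Here both \textit{chr1} and \textit{chrI} are reported as unmatched, whereas \textit{II} is found in the reference.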
187
219
188
-
\subsection*{Output}
189
-
This step generates the following output:
220
+
\subsection{Output}
221
+
TEPIC outputs:
190
222
\begin{enumerate}
191
-
\item TF affinities for all selected \textit{PSEMs} in the regions provided by the user that passed the filtering step.
192
-
\item (Length normalised) TF gene scores for all selected \textit{PSEMs} calculated as described above (optionally including peak features).
193
-
\item A meta data file listing all used parameters.
194
-
\item Optionally a seperate file containing the signal information in peaks.
223
+
\item TF affinities for all selected \textit{PSEMs} in the regions provided by the user that passed the filtering step (\textit{*\_Affinity.txt}).
224
+
\item (Length normalised) TF gene scores for all selected \textit{PSEMs} calculated as described above, optionally including peak features (\textit{*\_Affinity\_Gene\_View.txt}).
225
+
\item A meta data file listing all used parameters (\textit{amd.tsv}).
226
+
\item Optionally, a separate file containing the signal information in peaks (\textit{*\_Peak\_Coverage.txt}).
227
+
\item TF affinities with all values below an inferred threshold set to zero (\textit{*\_Thresholded\_Affinity.txt}).
228
+
\item A sparse representation linking TFs to genes (\textit{*\_Sparse\_Affinity\_Gene\_View.txt}).
195
229
\end{enumerate}
196
230
197
231
\newpage
198
-
199
-
\section*{Identification of key transcriptional regulators using epigenetics data (INVOKE)}
232
+
\section{Identification of key transcriptional regulators using epigenetics data (INVOKE)}
200
233
Epigenetics data contains a wealth of information on gene regulation. It was shown that especially
201
234
data on open-chromatin is well suited to build predictive models of gene-expression \cite{pmid27899623,pmid22955983,pmid25231769,pmid22954627}.
202
235
Interpreting these models allows the inference of regulators that may play a key role in gene-expression regulation.
@@ -213,8 +246,8 @@ \section*{Identification of key transcriptional regulators using epigenetics dat
213
246
\item Learning a linear regression model to predict gene expression from TF gene scores computed in (1).
214
247
\end{enumerate}
215
248
216
-
\subsection*{Linear regression to predict gene expression}
217
-
\subsubsection*{Motivation}
249
+
\subsection{Linear regression to predict gene expression}
250
+
\subsubsection{Motivation}
218
251
In order to learn about potentially important regulators, we build a linear, interpretable regression model,
219
252
comparable to methods proposed in \cite{pmid27899623,pmid22955983,pmid25231769,pmid22954627}.
220
253
Here, we use TF gene scores computed with \textit{TEPIC} as features in a linear regression setup to predict gene expression.
@@ -224,7 +257,7 @@ \subsubsection*{Motivation}
224
257
225
258
Details on the learning setup and on the available regularization methods are provided in the next section.
226
259
227
-
\subsubsection*{Available regularization methods}
260
+
\subsubsection{Available regularization methods}
228
261
We offer three different regularization techniques:
It resolves the correlation between features by distributing the feature weights among them, and simultaneously leads to sparse and stable models \cite{Zou05regularizationand}.
253
286
However, learning a model using elastic net penalty is slower than using either only Lasso or Ridge regularization.
254
287
255
-
\subsubsection*{Details on the learning setup}
288
+
\subsubsection{Details on the learning setup}
256
289
The data matrix $X$, containing TF gene scores, and the response vector $y$, containing gene expression values, are log-transformed,
257
-
with a pseudo-count of $1$, centered and scaled to fit them as.
290
+
with a pseudo-count of $1$, centered and scaled.
258
291
Regression coefficients are computed in an inner cross validation,
259
292
the $\alpha$ parameter of elastic net regularization is optimized with a default step size of $0.1$.
260
293
@@ -272,13 +305,11 @@ \subsubsection*{Details on the learning setup}
272
305
\end{enumerate}
273
306
All parameters mentioned in this section can be changed by the user. The learning process is sketched in Figure \ref{learningFig}.
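The preprocessing described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by the framework: the function names are hypothetical, the natural logarithm and the sample standard deviation are assumptions (the text fixes neither), and the grid of $\alpha$ values from $0$ to $1$ simply follows from the default step size of $0.1$.

```python
import math
import statistics

def log_center_scale(values, pseudo_count=1.0):
    """Log-transform with a pseudo-count, then centre and scale to unit variance."""
    # Natural logarithm is an assumption; the base is not fixed in the text.
    logged = [math.log(v + pseudo_count) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (assumption)
    if sd == 0.0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in logged]
    return [(x - mean) / sd for x in logged]

def alpha_grid(step=0.1):
    """Candidate elastic net mixing parameters from 0 to 1 in steps of `step`."""
    n = round(1.0 / step)
    # Rounding removes floating point noise such as 0.30000000000000004.
    return [round(i * step, 10) for i in range(n + 1)]

expression = [0.0, 3.0, 15.0, 63.0]
scaled = log_center_scale(expression)
alphas = alpha_grid()
```

After the transformation, each column has mean zero and unit standard deviation, so that regression coefficients of different features are comparable in magnitude.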
274
307
275
-
\subsubsection*{Required input}
308
+
\subsubsection{Required input}
276
309
In addition to the input required for the computation of TF gene scores in TEPIC, a file containing gene expression data must be provided.
277
310
This file should be structured such that column $1$ contains the gene identifiers and column $2$ holds expression values.
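As an illustration of the expected layout, the following Python sketch reads such a two-column file into a dictionary. The tab delimiter, the presence of a header row, the function name, and the example gene identifiers are assumptions made for this sketch; only the column order is taken from the description above.

```python
import csv
import io

def read_expression(handle, has_header=True):
    """Read gene identifiers (column 1) and expression values (column 2)."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row
    expression = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete lines
        expression[row[0]] = float(row[1])
    return expression

# In-memory stand-in for a gene expression file with a header row.
example = io.StringIO(
    "GeneID\tExpression\n"
    "ENSG00000000003\t12.5\n"
    "ENSG00000000005\t0.0\n"
)
expression = read_expression(example)
```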
278
-
Besides, we support the upload of a matrix containing gene expression data for several samples. In that case, the user has to select the column/sample that should
279
-
be used for model construction.
280
311
281
-
\subsubsection*{Output and hints for interpretation}
312
+
\subsubsection{Output and hints for interpretation}
282
313
The user is always provided with the following files:
283
314
\begin{itemize}
284
315
\item a list of regression coefficients computed on the entire data set,
@@ -308,22 +339,20 @@ \subsubsection*{Output and hints for interpretation}
308
339
\end{figure}
309
340
310
341
\newpage
311
-
\mbox{}
312
-
\newpage
313
-
\section*{Differential analysis to identify novel transcriptional regulators for differentially expressed genes (DYNAMITE)}
342
+
\section{Differential analysis to identify novel transcriptional regulators for differentially expressed genes (DYNAMITE)}
314
343
Although a variety of methods have been proposed to generate genome-wide TF binding predictions \cite{pmid27899623,pmid22072382,pmid23424114,pmid25086003} and to establish \textit{TF to tissue}
315
344
associations \cite{pmid22955983,pmid19995984,pmid27899623}, systematic, feasible, and easy-to-use ways of linking TFs to distinct genes are rare.
316
345
317
346
In addition to the \textit{INVOKE} analysis, we propose a method to infer the most likely transcriptional regulators for a set of differentially expressed genes.
318
347
We use TF scores, computed using \textit{TEPIC}, and logistic regression to identify TFs that have explanatory power to distinguish between up- and down-regulated genes.
319
348
320
-
\section*{Input}
349
+
\subsection{Input}
321
350
To run \textit{DYNAMITE}, a user must provide candidate regions of TF binding for two groups of samples, $A$ and $B$, e.g. control and disease.
322
351
These can be derived, for example, by open chromatin experiments such as DNase-seq.
323
352
It is essential that the candidate regions reflect the characteristics of chromatin organization in the analysed tissues.
324
353
In addition, a list of differentially expressed genes between the two groups as well as log2 fold changes of the expression are needed.
325
354
326
-
\section*{Method}
355
+
\subsection{Method}
327
356
Our method consists of two parts: (1) gene-TF score computation, and (2) identification of key TFs.
328
357
\subsection*{Step 1: Computing Gene-TF Scores}
329
358
Using TEPIC, we compute gene-TF scores $g_{ij}$ for all differentially expressed genes $i$ and distinct TFs $j$ considering the provided candidate regions for all replicates $a$ of group $A$ and for all replicates $b$ of group $B$.
same learning paradigm that is described for the \textit{INVOKE} analysis (Figure \ref{learningFig}). We use the entire dataset for model training and to interpret the regression coefficients.
362
391
TFs that correspond to features with a non-zero regression coefficient can be seen as being essential to explain the observed expression differences and should be further investigated.
363
392
364
-
\section*{Output}
393
+
\subsection{Output}
365
394
Model performance is reported in a \textit{txt} file and visually in a bar plot using mean test and training accuracy as well as the F1 measure.
366
395
A heatmap shows the regression coefficients in the outer cross validation folds.
367
396
Additionally, we report confusion matrices for the outer cross validation folds.
@@ -380,10 +409,8 @@ \section*{Output}
380
409
\end{figure}
381
410
382
411
\newpage
383
-
\mbox{}
384
-
\newpage
385
-
386
-
\subsection*{Determine important transcriptional regulators from time serires data (EPIC-DREM)}
412
+
\section{Determine important transcriptional regulators from time series data (EPIC-DREM)}
413
+
\label{EPIC-DREM}
387
414
\textit{EPIC-DREM} is a combination of \textit{TEPIC} and the \textit{Dynamic Regulatory Events Miner (DREM)} \cite{pmid17224918}.
388
415
Instead of using static ChIP-seq data, which is provided in \textit{DREM 2.0}, we suggest using time-point specific TF binding predictions
389
416
based on time-dependent epigenomic profiles. Thereby, \textit{DREM} can infer regulators that can be linked to expression changes at distinct points in time.
@@ -406,19 +433,19 @@ \subsection*{Determine important transcriptional regulators from time serires da
406
433
\label{epicdrem}
407
434
\end{figure}
408
435
409
-
\subsubsection*{Thresholded TF affinities}
436
+
\subsection{Thresholded TF affinities}
410
437
In some applications it is required to make a binary decision whether a factor is binding or not.
411
438
To infer this information from TF affinities, \textit{TEPIC} allows the computation of a TF specific affinity threshold by calculating TF affinities on a randomly selected set of genomic regions. When selected by TEPIC, these regions show similar characteristics compared to the provided regions (GC content and length).
412
439
Alternatively a set of background regions can be provided by the user.
413
440
By applying a user defined p-value on the distribution of affinities computed on the random regions, a threshold is chosen.
414
441
Per TF, all affinities that are smaller than the selected threshold are set to zero; thus, a sparse matrix with TF-gene interactions can be generated.
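The thresholding step can be sketched as follows for a single TF. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the threshold is taken as the empirical $(1-p)$ quantile of the background affinities, which is one plausible reading of applying a p-value to the background distribution rather than a description of the actual implementation.

```python
import math

def affinity_threshold(background, p_value):
    """Empirical (1 - p_value) quantile of the background affinities."""
    if not background:
        raise ValueError("background affinities must not be empty")
    ordered = sorted(background)
    n = len(ordered)
    # Index of the quantile, clamped to the valid range of the list.
    index = min(n - 1, max(0, math.ceil((1.0 - p_value) * n) - 1))
    return ordered[index]

def apply_threshold(affinities, threshold):
    """Set every affinity smaller than the threshold to zero."""
    return [a if a >= threshold else 0.0 for a in affinities]

# Toy background of ten affinities; a large p-value keeps the example small.
background = [float(i) for i in range(1, 11)]
threshold = affinity_threshold(background, p_value=0.25)
sparse = apply_threshold([3.0, 8.0, 9.5], threshold)
```

In this toy example the threshold is $8.0$, so that two of the ten background affinities exceed it; of the three tested affinities, only the value $3.0$ is set to zero.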
415
442
416
-
\subsubsection*{Required input}
443
+
\subsubsection{Required input}
417
444
In addition to the input mentioned above, a reference genome in 2bit format is required.
418
445
Optionally, the user can provide a BED file containing background regions.
419
446
These replace the automated generation of background sequences.
420
447
421
-
\subsubsection*{Output}
448
+
\subsubsection{Output}
422
449
The following output files are generated in addition:
423
450
\begin{enumerate}
424
451
\item TF affinities for all selected \textit{PSEMs} in the regions provided by the user that passed the filtering step, where all affinities below the TF specific thresholds are set to 0.
@@ -427,7 +454,7 @@ \subsubsection*{Output}
427
454
\end{enumerate}
428
455
429
456
Either (2) or (3) can be combined with RNA-seq data and used as input for \textit{DREM}.