PhyloFrame

Scripts to replicate phylogenomic analyses in:

Kuhnhäuser et al. (2021), A robust phylogenomic framework for the calamoid palms, Molecular Phylogenetics and Evolution. https://doi.org/10.1016/j.ympev.2020.107067. Please cite this work where appropriate.

Raw sequence data are deposited in the European Nucleotide Archive of the European Bioinformatics Institute (https://www.ebi.ac.uk/ena) under project number PRJEB40689. The PhyloPalm target file and alignments, gene trees, and species trees resulting from the analyses detailed here are deposited on Zenodo (https://doi.org/10.5281/zenodo.4359280).

Notes:

Exemplary workflow for taxon "Calamoid1" and locus "Gene1"
Identical parameters used for exons and supercontigs
Version numbers of software used can be found in the publication

1. Quality control of sequence data

fastqc Calamoid1_R*_001.fastq --extract

Conduct before and after trimming

2. Remove adapters and low-quality reads

java -jar trimmomatic-0.38.jar PE -threads 16 -phred33 -basein Calamoid1_R1_001.fastq Calamoid1_R1_001_Tpaired.fastq Calamoid1_R1_001_Tunpaired.fastq Calamoid1_R2_001_Tpaired.fastq Calamoid1_R2_001_Tunpaired.fastq ILLUMINACLIP:Trimmomatic-0.38/adapters/TruSeq3-PE-2.fa:2:30:10:1:true LEADING:3 TRAILING:3 MAXINFO:40:0.8 MINLEN:36

ILLUMINACLIP is used for removing adapters specified in the file "TruSeq3-PE-2.fa"
LEADING:3 and TRAILING:3 removes trimming arefacts, which are indicated by very low Phred scores
MAXINFO balances benefits of retaining longer reads against costs of retaining bases with errors
MINLEN sets minimum allowed length for reads

3. Retrieve targeted exons

Retrieve exons per taxon
python HybPiper/reads_first.py -b PhyloPalm.fasta -r Calamoid1_R*Tpaired.fastq --prefix $name --bwa --cov_cutoff 3

-b PhyloPalms.fasta specifies target file
--bwa uses BWA to map reads to target file
--cov_cutoff specifes minimum coverage of SPADES contigs

Compile gene files containing homologous sequences across all taxa
python HybPiper/retrieve_sequences.py PhyloPalm.fasta . dna
Produce exon list
for f in *.FNA; do (echo ${f/.FNA} >> genenames.txt); done

4. Retrieve supercontigs

Retrieve introns
python HybPiper/intronerate.py --prefix Calamoid1
python HybPiper/retrieve_sequences.py PhyloPalm.fasta . intron
Retrieve supercontigs (combining exons and introns)
python HybPiper/retrieve_sequences.py PhyloPalm.fasta . supercontig

5. Exclude paralogs from analysis

Identify paralogs

Write paralogs to individual file for the taxon
python HybPiper/paralog_investigator.py Calamoid1 2> Calamoid1_paralogs.txt
Combine paralog files of different sequences
cat *paralogs.txt > paralog_summary.txt
Return each occurrence of genes starting with HEY or EGU (as all genes in the PhyloPalms.fasta do)
grep -ow 'HEY\w*\|EGU\w*' paralog_summary.txt > paralog_temp.txt
Count number of occurrences per paralog
tr -c '[:alnum:]' '[\n*]' < paralogs_temp.txt | sort | uniq -c | sort -nr > paralog_count.txt
Produce final list of paralogs
grep -ow 'HEY\w*\|EGU\w*' paralog_count.txt > paralogs.txt

Make ortholog list (= genes - paralogs)

grep -Fv -f paralogs.txt genenames.txt > orthologs.txt creates new ortholog list excluding paralogs

6. Multiple sequence alignment

Align genes individually

mafft --thread 4 --localpair --adjustdirectionaccurately --maxiterate 1000 Gene1.FNA > Gene1_aligned.fasta

Trim alignments

Use automated1 algorithm in trimAl, which is optimized for ML phylogeny reconstruction
trimal -in Gene1_aligned.fasta -out Gene1_aligned_trimmed_temp1.fasta -automated1
Remove sites with >=80% gaps from automated1-trimmed alignment
trimal -in Gene1_aligned_trimmed_temp1.fasta -out Gene1_aligned_trimmed_temp2.fasta -gt 0.2
Remove sequences covering less than 10 % of the alignment length
trimal -in Gene1_aligned_trimmed_temp2.fasta -out Gene1_aligned_trimmed.fasta -resoverlap 0.50 -seqoverlap 10

7. Gene trees

Without model testing

raxmlHPC-PTHREADS -T 3 -m GTRGAMMA -f a -p 12345 -x 12345 -# 1000 -k -s Gene1_aligned_trimmed.fasta -n Gene1.tree

With model testing

iqtree -s Gene1_aligned_trimmed.fasta -m MFP -T 4 -B 1000

-m MFP performs model testing followed by tree search

8. Species trees

Coalescence

Concatenate trees
cat RAxML_bipartitions.Gene*.tree > allTrees.tree
Remove outlier branches
run_treeshrink.py -t allTrees.tree -q 0.05
Collapse branches with bootstrap support below 10%
nw_ed allTrees_ts.tree 'i & b<=10' o > allTrees_ts_bs10.tree
Build coalescence species tree
java -jar astral.5.6.3.jar -i allTrees_ts_bs10.tree -o astral.tree
Root the tree
pxrr -t astral.tree -g Nypa-fructicans-MSL30-S32,Kerriodoxa-elegans-MSL76,Asterogyne-martiana-SBL226,Ceroxylon-quindiuense-MSL17 -s > astral_rooted.tree

Concatenation

Without partitioning

Concatenate individual gene alignments
AMAS concat -i Gene*_aligned_trimmed.fasta -f fasta -d dna -c 1
Build concatenated species tree
raxmlHPC-PTHREADS -T 3 -m GTRGAMMA -f a -p 12345 -x 12345 -# autoMRE -k -s concatenated.out -n raxml_concatenated.tree

autoMRE argument for extended majority-rule criterion for stopping bootstrapping after a sufficient number of replicates had been sampled

Root the tree
pxrr -t raxml_concatenated.tree -g Nypa-fructicans-MSL30-S32,Kerriodoxa-elegans-MSL76,Asterogyne-martiana-SBL226,Ceroxylon-quindiuense-MSL17 -s > raxml_concatenated_rooted.tree

With partitioning

Concatenate individual gene alignments and convert to phylip format
AMAS concat -i Gene*_aligned_trimmed.fasta -f fasta -d dna -c 1
AMAS convert -d dna -f fasta -i concatenated.out -u phylip
mv concatenated.out-out.phy concatenated.phy
Perform PartitionFinder analysis
partitionfinder-2.1.1/PartitionFinder.py partitioned_analysis --raxml
Directory partitioned_analysis contains:

concatenated alignment
configuration file partition_finder.cfg with partitions file from AMAS concat output in DATA BLOCKS:

## ALIGNMENT FILE ##
alignment = concatenated.phy;

## BRANCHLENGTHS: linked | unlinked ##
branchlengths = linked;

## MODELS OF EVOLUTION: all | allx | mrbayes | beast | gamma | gammai | <list> ##
models = GTR+G;

# MODEL SELECCTION: AIC | AICc | BIC #
model_selection = aicc;

## DATA BLOCKS: see manual for how to define ##
[data_blocks]
p1_EGU105032175_supercontig_aligned_trimmed_sites_seq = 1-675;
...
p945_HEY989_supercontig_aligned_trimmed_sites_seq = 1942327-1943739;

## SCHEMES, search: all | user | greedy | rcluster | rclusterf | kmeans ##
[schemes]
search = rcluster;

Copy RAxML partitions from output file "best_scheme.txt" to file "partitions_partitionfinder.txt" as input for RAxML analyses

Partitioned RAxML analysis raxmlHPC-PTHREADS -T 22 -m GTRGAMMA -f a -p 12345 -x 12345 -# autoMRE -k -s concatenated.out -n raxml_supercontigs_concatenated_partitioned_partitionfinder_autoMRE.tree -q partitions_partitionfinder.txt

9. Comparative analyses

Tree comparisons

Load R libraries:

library(ape)
library(phangorn)

Read in trees:

coal_e <- read.tree("Calamoideae_coalescence_exons_rooted.tree")
coal_em <- read.tree("Calamoideae_coalescence_exons_models_rooted.tree")
coal_s <- read.tree("Calamoideae_coalescence_supercontigs_rooted.tree")
coal_sm <- read.tree("Calamoideae_coalescence_supercontigs_models_rooted.tree")
concat_e <- read.tree("Calamoideae_concatenation_exons_rooted.tree")
concat_ep <- read.tree("Calamoideae_concatenation_exons_partitioned_rooted.tree")
concat_s <- read.tree("Calamoideae_concatenation_supercontigs_rooted.tree")
concat_sp <- read.tree("Calamoideae_concatenation_supercontigs_partitioned_rooted.tree")

Strict consensus of species trees

cons <- consensus(coal_e, coal_em, coal_s, coal_sm, concat_e, concat_ep, concat_s, concat_sp)
plot.phylo(cons, cex = 0.7)
write.tree(cons, file = "consensus_all.tree")

Normalized Robinson Foulds distances

all_trees <- c(coal_e, coal_em, coal_s, coal_sm, concat_e, concat_ep, concat_s, concat_sp)
RF.dist(all_trees, check.labels = TRUE, normalize = TRUE, rooted = TRUE)

Gene tree conflict

Exemplary command for Calameae:
docker run -v DiscoVista-master/:/data esayyari/discovista discoVista.py -p calamoideae/ -m 5 -a calamoideae/annotation_calameae.txt -o calamoideae/results/calameae -g Outgroup

10. Statistics

Targeted sequencing

python HybPiper/get_seq_lengths.py PhyloPalm.fasta namelist.txt dna > seqlengths_genes.txt
python HybPiper/hybpiper_stats.py seqlengths_genes.txt namelist.txt > seqlengths_genes_stats.txt

"namelist.txt" contains taxon names

Alignments

AMAS summary -f fasta -d dna -i alignment_genes_concatenated.out -o summary_alignment_genes_concatenated.txt

For ingroup only: first remove outgroup from concatenated alignment
AMAS remove -x Asterogyne-martiana-SBL226 Kerriodoxa-elegans-MSL76 Nypa-fructicans-MSL30-S32 Ceroxylon-quindiuense-MSL17 -f fasta -d dna -i alignment_genes_concatenated.out
For outgroup only: first remove ingroup (by listing all ingroup taxa; analogous to "ingroup only" analysis)

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PhyloFrame

1. Quality control of sequence data

2. Remove adapters and low-quality reads

3. Retrieve targeted exons

4. Retrieve supercontigs

5. Exclude paralogs from analysis

Identify paralogs

Make ortholog list (= genes - paralogs)

6. Multiple sequence alignment

Align genes individually

Trim alignments

7. Gene trees

Without model testing

With model testing

8. Species trees

Coalescence

Concatenation

Without partitioning

With partitioning

9. Comparative analyses

Tree comparisons

Load R libraries:

Read in trees:

Strict consensus of species trees

Normalized Robinson Foulds distances

Gene tree conflict

10. Statistics

Targeted sequencing

Alignments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

PhyloFrame

1. Quality control of sequence data

2. Remove adapters and low-quality reads

3. Retrieve targeted exons

4. Retrieve supercontigs

5. Exclude paralogs from analysis

Identify paralogs

Make ortholog list (= genes - paralogs)

6. Multiple sequence alignment

Align genes individually

Trim alignments

7. Gene trees

Without model testing

With model testing

8. Species trees

Coalescence

Concatenation

Without partitioning

With partitioning

9. Comparative analyses

Tree comparisons

Load R libraries:

Read in trees:

Strict consensus of species trees

Normalized Robinson Foulds distances

Gene tree conflict

10. Statistics

Targeted sequencing

Alignments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages