Scripts to replicate phylogenomic analyses in:
Kuhnhäuser et al. (2021), A robust phylogenomic framework for the calamoid palms, Molecular Phylogenetics and Evolution. https://doi.org/10.1016/j.ympev.2020.107067. Please cite this work where appropriate.
Raw sequence data are deposited in the European Nucleotide Archive of the European Bioinformatics Institute (https://www.ebi.ac.uk/ena) under project number PRJEB40689. The PhyloPalm target file and alignments, gene trees, and species trees resulting from the analyses detailed here are deposited on Zenodo (https://doi.org/10.5281/zenodo.4359280).
Notes:
- Exemplary workflow for taxon "Calamoid1" and locus "Gene1"
- Identical parameters used for exons and supercontigs
- Version numbers of software used can be found in the publication
fastqc Calamoid1_R*_001.fastq --extract
- Conduct before and after trimming
java -jar trimmomatic-0.38.jar PE -threads 16 -phred33 -basein Calamoid1_R1_001.fastq Calamoid1_R1_001_Tpaired.fastq Calamoid1_R1_001_Tunpaired.fastq Calamoid1_R2_001_Tpaired.fastq Calamoid1_R2_001_Tunpaired.fastq ILLUMINACLIP:Trimmomatic-0.38/adapters/TruSeq3-PE-2.fa:2:30:10:1:true LEADING:3 TRAILING:3 MAXINFO:40:0.8 MINLEN:36
ILLUMINACLIPis used for removing adapters specified in the file "TruSeq3-PE-2.fa"LEADING:3andTRAILING:3removes trimming arefacts, which are indicated by very low Phred scoresMAXINFObalances benefits of retaining longer reads against costs of retaining bases with errorsMINLENsets minimum allowed length for reads
- Retrieve exons per taxon
python HybPiper/reads_first.py -b PhyloPalm.fasta -r Calamoid1_R*Tpaired.fastq --prefix $name --bwa --cov_cutoff 3
-b PhyloPalms.fastaspecifies target file--bwauses BWA to map reads to target file--cov_cutoffspecifes minimum coverage of SPADES contigs
- Compile gene files containing homologous sequences across all taxa
python HybPiper/retrieve_sequences.py PhyloPalm.fasta . dna - Produce exon list
for f in *.FNA; do (echo ${f/.FNA} >> genenames.txt); done
-
Retrieve introns
python HybPiper/intronerate.py --prefix Calamoid1
python HybPiper/retrieve_sequences.py PhyloPalm.fasta . intron -
Retrieve supercontigs (combining exons and introns)
python HybPiper/retrieve_sequences.py PhyloPalm.fasta . supercontig
- Write paralogs to individual file for the taxon
python HybPiper/paralog_investigator.py Calamoid1 2> Calamoid1_paralogs.txt - Combine paralog files of different sequences
cat *paralogs.txt > paralog_summary.txt - Return each occurrence of genes starting with HEY or EGU (as all genes in the PhyloPalms.fasta do)
grep -ow 'HEY\w*\|EGU\w*' paralog_summary.txt > paralog_temp.txt - Count number of occurrences per paralog
tr -c '[:alnum:]' '[\n*]' < paralogs_temp.txt | sort | uniq -c | sort -nr > paralog_count.txt - Produce final list of paralogs
grep -ow 'HEY\w*\|EGU\w*' paralog_count.txt > paralogs.txt
grep -Fv -f paralogs.txt genenames.txt > orthologs.txt creates new ortholog list excluding paralogs
mafft --thread 4 --localpair --adjustdirectionaccurately --maxiterate 1000 Gene1.FNA > Gene1_aligned.fasta
- Use automated1 algorithm in trimAl, which is optimized for ML phylogeny reconstruction
trimal -in Gene1_aligned.fasta -out Gene1_aligned_trimmed_temp1.fasta -automated1 - Remove sites with >=80% gaps from automated1-trimmed alignment
trimal -in Gene1_aligned_trimmed_temp1.fasta -out Gene1_aligned_trimmed_temp2.fasta -gt 0.2 - Remove sequences covering less than 10 % of the alignment length
trimal -in Gene1_aligned_trimmed_temp2.fasta -out Gene1_aligned_trimmed.fasta -resoverlap 0.50 -seqoverlap 10
raxmlHPC-PTHREADS -T 3 -m GTRGAMMA -f a -p 12345 -x 12345 -# 1000 -k -s Gene1_aligned_trimmed.fasta -n Gene1.tree
iqtree -s Gene1_aligned_trimmed.fasta -m MFP -T 4 -B 1000
-m MFPperforms model testing followed by tree search
- Concatenate trees
cat RAxML_bipartitions.Gene*.tree > allTrees.tree - Remove outlier branches
run_treeshrink.py -t allTrees.tree -q 0.05 - Collapse branches with bootstrap support below 10%
nw_ed allTrees_ts.tree 'i & b<=10' o > allTrees_ts_bs10.tree - Build coalescence species tree
java -jar astral.5.6.3.jar -i allTrees_ts_bs10.tree -o astral.tree - Root the tree
pxrr -t astral.tree -g Nypa-fructicans-MSL30-S32,Kerriodoxa-elegans-MSL76,Asterogyne-martiana-SBL226,Ceroxylon-quindiuense-MSL17 -s > astral_rooted.tree
- Concatenate individual gene alignments
AMAS concat -i Gene*_aligned_trimmed.fasta -f fasta -d dna -c 1 - Build concatenated species tree
raxmlHPC-PTHREADS -T 3 -m GTRGAMMA -f a -p 12345 -x 12345 -# autoMRE -k -s concatenated.out -n raxml_concatenated.tree
autoMREargument for extended majority-rule criterion for stopping bootstrapping after a sufficient number of replicates had been sampled
- Root the tree
pxrr -t raxml_concatenated.tree -g Nypa-fructicans-MSL30-S32,Kerriodoxa-elegans-MSL76,Asterogyne-martiana-SBL226,Ceroxylon-quindiuense-MSL17 -s > raxml_concatenated_rooted.tree
- Concatenate individual gene alignments and convert to phylip format
AMAS concat -i Gene*_aligned_trimmed.fasta -f fasta -d dna -c 1
AMAS convert -d dna -f fasta -i concatenated.out -u phylip
mv concatenated.out-out.phy concatenated.phy - Perform PartitionFinder analysis
partitionfinder-2.1.1/PartitionFinder.py partitioned_analysis --raxml
Directorypartitioned_analysiscontains:
- concatenated alignment
- configuration file
partition_finder.cfgwith partitions file fromAMAS concatoutput in DATA BLOCKS:
## ALIGNMENT FILE ##
alignment = concatenated.phy;
## BRANCHLENGTHS: linked | unlinked ##
branchlengths = linked;
## MODELS OF EVOLUTION: all | allx | mrbayes | beast | gamma | gammai | <list> ##
models = GTR+G;
# MODEL SELECCTION: AIC | AICc | BIC #
model_selection = aicc;
## DATA BLOCKS: see manual for how to define ##
[data_blocks]
p1_EGU105032175_supercontig_aligned_trimmed_sites_seq = 1-675;
...
p945_HEY989_supercontig_aligned_trimmed_sites_seq = 1942327-1943739;
## SCHEMES, search: all | user | greedy | rcluster | rclusterf | kmeans ##
[schemes]
search = rcluster;
- Copy RAxML partitions from output file "best_scheme.txt" to file "partitions_partitionfinder.txt" as input for RAxML analyses
- Partitioned RAxML analysis
raxmlHPC-PTHREADS -T 22 -m GTRGAMMA -f a -p 12345 -x 12345 -# autoMRE -k -s concatenated.out -n raxml_supercontigs_concatenated_partitioned_partitionfinder_autoMRE.tree -q partitions_partitionfinder.txt
library(ape)
library(phangorn)
coal_e <- read.tree("Calamoideae_coalescence_exons_rooted.tree")
coal_em <- read.tree("Calamoideae_coalescence_exons_models_rooted.tree")
coal_s <- read.tree("Calamoideae_coalescence_supercontigs_rooted.tree")
coal_sm <- read.tree("Calamoideae_coalescence_supercontigs_models_rooted.tree")
concat_e <- read.tree("Calamoideae_concatenation_exons_rooted.tree")
concat_ep <- read.tree("Calamoideae_concatenation_exons_partitioned_rooted.tree")
concat_s <- read.tree("Calamoideae_concatenation_supercontigs_rooted.tree")
concat_sp <- read.tree("Calamoideae_concatenation_supercontigs_partitioned_rooted.tree")
cons <- consensus(coal_e, coal_em, coal_s, coal_sm, concat_e, concat_ep, concat_s, concat_sp)
plot.phylo(cons, cex = 0.7)
write.tree(cons, file = "consensus_all.tree")
all_trees <- c(coal_e, coal_em, coal_s, coal_sm, concat_e, concat_ep, concat_s, concat_sp)
RF.dist(all_trees, check.labels = TRUE, normalize = TRUE, rooted = TRUE)
Exemplary command for Calameae:
docker run -v DiscoVista-master/:/data esayyari/discovista discoVista.py -p calamoideae/ -m 5 -a calamoideae/annotation_calameae.txt -o calamoideae/results/calameae -g Outgroup
python HybPiper/get_seq_lengths.py PhyloPalm.fasta namelist.txt dna > seqlengths_genes.txt
python HybPiper/hybpiper_stats.py seqlengths_genes.txt namelist.txt > seqlengths_genes_stats.txt
- "namelist.txt" contains taxon names
AMAS summary -f fasta -d dna -i alignment_genes_concatenated.out -o summary_alignment_genes_concatenated.txt
- For ingroup only: first remove outgroup from concatenated alignment
AMAS remove -x Asterogyne-martiana-SBL226 Kerriodoxa-elegans-MSL76 Nypa-fructicans-MSL30-S32 Ceroxylon-quindiuense-MSL17 -f fasta -d dna -i alignment_genes_concatenated.out - For outgroup only: first remove ingroup (by listing all ingroup taxa; analogous to "ingroup only" analysis)