Feature Frequency Profile (FFP) is an alignment-free sequence comparison method based on natural language analysis in information theory (e.g., k-mers). It was primarily developed to compare whole-genomic sequences such as genomes, proteomes and transcriptomes, but it can also compare sequences over the English alphabet (books or scripts) and over custom alphabets other than nucleotides (four letters) or amino acids (20 letters). It constructs a representative profile for each sequence and computes the distances between profiles for further visualization and interpretation.
An early FFP application was published in 2009 (Sims GE, Jun S-R, Wu GA, Kim S-H; https://doi.org/10.1073/pnas.0909377106 ); it laid out the conceptual background and demonstrated the method's potential using nucleotide and amino acid sequences.
I redeveloped the programs from scratch to improve overall performance, to make them generic (e.g., able to compare non-genomic sequences), and to harness multicore or parallel computing environments; they have therefore evolved independently of Sims GE's version of the FFP applications.
Please cite one of the publications below if you use the programs provided here in your publication.
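The profile-and-distance idea described above can be sketched in a few lines of Python. This is only an illustration, not the code shipped in this repository: the function names `kmer_profile` and `js_divergence` are hypothetical, and the Jensen-Shannon divergence is used here as one common choice of profile distance, which is an assumption about what a reader might want to try rather than a statement of what these programs compute.

```python
from collections import Counter
from math import log2


def kmer_profile(seq, k):
    """Return the relative frequency of every k-mer (feature) in seq.

    Hypothetical helper for illustration; works on any alphabet
    (nucleotides, amino acids, English letters, ...).
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency profiles.

    Assumed distance measure for this sketch: 0 for identical profiles,
    1 for profiles that share no feature.
    """
    div = 0.0
    for kmer in set(p) | set(q):
        pi = p.get(kmer, 0.0)
        qi = q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture of the two profiles
        if pi > 0:
            div += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * log2(qi / mi)
    return div


if __name__ == "__main__":
    a = kmer_profile("ACGTACGTACGT", 3)
    b = kmer_profile("ACGTTCGTACGA", 3)
    print(js_divergence(a, b))
```

A pairwise matrix of such distances is what would then feed tree building or other visualization.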
- JJ. Choi, “Whole-genomic sequence comparison for evolutionary studies based on a natural-language analysis of information theory,” UC Berkeley. (2024).
- "Whole-proteome tree of life suggests a deep burst of organism diversity", JaeJin Choi and Sung-Hou Kim (2019), PNAS.
- "A genome Tree of Life for the Fungi kingdom", JaeJin Choi and Sung-Hou Kim (2017), PNAS.
The programs were mainly developed and tested in a Linux environment (Ubuntu 20.04+), and the command blocks provided below assume the same.
Please contact JaeJin Choi ([email protected]) if you have questions or comments regarding the program code, bugs, or usage.
- GCC (g++) version 4.7.1+
  Any recent g++ version that supports C++11
- Google sparse hash library
  https://github.com/sparsehash/sparsehash

      sudo apt update
      sudo apt install libsparsehash-dev

- zlib version 1.2.8+
  https://github.com/madler/zlib or http://www.zlib.net/

      sudo apt update
      sudo apt install zlib1g-dev

- A tutorial you can walk through here:
- Additional fungi study supplement files (e.g., tree newick and divergence matrix) are here:
- The first and second numbers of a version indicate compatibility between the FF Profile and FFP distance calculation programs. For instance, all 2v.3.x versions are compatible with each other, but not with any 2v.4.x version.
- Incompatibilities arise from output file format changes made during improvements, or from added functions that were difficult to unify. Be cautious when mixing different versions.
- The latest
- For usage and compile options, check the individual version folders.
- Old text based
, list all versions
Typically, longer feature lengths (l) consume more memory. In the fungi whole-proteome study, the largest proteome had 35,274 proteins containing 10,866,611 amino acids, and the version used worked for feature lengths of up to 24 amino acids (l = 24).
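To get a feel for the numbers above, the following sketch gives a rough upper bound on how many distinct features a profile can hold. It is a back-of-the-envelope estimate, not a description of how the programs allocate memory: the function name `max_feature_count` is hypothetical, and the bound assumes every sequence is at least l residues long and that features never span two sequences.

```python
def max_feature_count(total_residues, n_sequences, l, alphabet_size=20):
    """Rough upper bound on the number of distinct features of length l.

    Assumptions (illustrative only): each sequence has at least l residues,
    so it contributes (length - l + 1) sliding windows, and the count cannot
    exceed the number of possible words, alphabet_size ** l.
    """
    windows = total_residues - n_sequences * (l - 1)
    return max(0, min(windows, alphabet_size ** l))


if __name__ == "__main__":
    # Largest fungal proteome quoted above: 35,274 proteins, 10,866,611 amino acids.
    for l in (3, 5, 6, 12, 24):
        print(l, max_feature_count(10_866_611, 35_274, l))
```

Under these assumptions the bound for l = 24 is 10,055,309 features. For short features the count is capped by the number of possible words (20^l), while for longer features it saturates near the number of sliding windows and each stored feature itself gets longer, which is consistent with memory use growing with l. Actual memory use also depends on the hash table implementation and is not estimated here.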