Cladekit is a lightweight, composable CLI phylogenetics toolkit. A single binary, Cladekit provides many subcommands that replace common chains of bash commands or a collection of individual programs in phylogenetic pipelines. Examples include header extraction, concatenation, and alignment quality control.
Note: Cladekit is under active development. Subcommands may change or be added as the project matures.
Requires Rust.
cargo install --git https://github.com/andrewbudge/cladekit
This builds the binary and installs it into Cargo's bin directory (~/.cargo/bin by default), which needs to be on your PATH.
To update to the latest version:
cargo install --force --git https://github.com/andrewbudge/cladekit
The getheaders subcommand extracts headers from FASTA files. In the example below, -u prints each distinct header only once.
Example:
$ cladekit getheaders testdata/test_good.fasta
Sequence1
Sequence2
Sequence1
$ cladekit getheaders -u testdata/test_good.fasta
Sequence1
Sequence2
The concat subcommand concatenates multiple gene alignments into a supermatrix. Unlike other tools, input files can live anywhere and globs are accepted.
Concat runs in two modes:
- Exact match (default): headers must match exactly across files, like FASconCAT and AMAS.
- Smart match (-a alias.txt): pass an alias list, a file of clean output names (one per line, e.g. Mus_musculus) that are matched to messy input headers via case-insensitive substring search. Underscores in aliases match spaces in headers, so Mus_musculus finds AB123.1 Mus musculus COX1 gene, partial cds. Longer aliases are matched first to prevent partial collisions. The alias list doubles as a rename map: input headers stay messy, while the output gets clean names. Requires -l for a provenance TSV that records exactly which original header matched each alias.
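The smart-match rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not Cladekit's actual implementation: the function name match_alias is hypothetical, and treating underscores and spaces as interchangeable on both sides is an assumption.

```python
def match_alias(header, aliases):
    """Return the alias that matches header, or None if nothing matches.

    Assumptions (not taken from Cladekit's source): underscores and spaces
    are treated as the same character, and the first hit wins.
    """
    normalized = header.lower().replace("_", " ")
    # Try longer aliases first so "Mus" cannot shadow "Mus_musculus".
    for alias in sorted(aliases, key=len, reverse=True):
        needle = alias.lower().replace("_", " ")
        if needle in normalized:
            return alias
    return None


aliases = ["Mus", "Mus_musculus", "Rattus_rattus"]
header = "AB123.1 Mus musculus COX1 gene, partial cds"
print(match_alias(header, aliases))  # Mus_musculus
```

Sorting by length before searching is what keeps a short alias from claiming a header that a longer, more specific alias should own.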
Concat auto-detects DNA vs amino acid data per gene and adjusts missing characters and partition labels accordingly. FASTA output goes to stdout, partition boundaries to stderr in RAxML/IQ-TREE format by default. NEXUS bundles everything into one file.
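As a rough sketch of what the paragraph above describes, the following Python builds a supermatrix, pads absent taxa with a missing character, and emits RAxML-style partition lines. The function names and the data-type heuristic (DNA if every character is an IUPAC nucleotide code) are assumptions for illustration; how Cladekit actually detects the data type is not documented here.

```python
# IUPAC nucleotide codes plus gap and unknown symbols.
DNA_CHARS = set("ACGTUNRYSWKMBDHV-?")


def detect_type(seqs):
    """Heuristic (an assumption): DNA if only nucleotide codes appear."""
    chars = set("".join(seqs).upper())
    return "DNA" if chars <= DNA_CHARS else "AA"


def concat(genes):
    """genes: list of (name, {taxon: aligned sequence}) pairs.

    Returns the supermatrix as a dict plus a list of partition lines.
    """
    taxa = sorted({taxon for _, aln in genes for taxon in aln})
    matrix = {taxon: "" for taxon in taxa}
    partitions = []
    start = 1
    for name, aln in genes:
        length = len(next(iter(aln.values())))
        dtype = detect_type(aln.values())
        missing = "N" if dtype == "DNA" else "X"
        for taxon in taxa:
            # Taxa absent from this gene get a block of missing characters.
            matrix[taxon] += aln.get(taxon, missing * length)
        partitions.append(f"{dtype}, {name} = {start}-{start + length - 1}")
        start += length
    return matrix, partitions


genes = [
    ("gene1.fasta", {"Mus_musculus": "ATCG", "Rattus_rattus": "ATCG"}),
    ("gene2.fasta", {"Mus_musculus": "ATCG", "Xenopus_laevis": "ATCG"}),
]
matrix, partitions = concat(genes)
print(matrix["Rattus_rattus"])  # ATCGNNNN
print(partitions[1])            # DNA, gene2.fasta = 5-8
```

A character-set heuristic like this can misclassify very short protein alignments, so a real implementation may need a stricter rule.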
Exact match (clean headers):
$ cladekit concat gene1.fasta gene2.fasta > supermatrix.fasta
DNA, gene1.fasta = 1-4
DNA, gene2.fasta = 5-8
Smart match (messy headers with an alias list):
$ cat alias.txt
Mus_musculus
Rattus_rattus
Xenopus_laevis
$ cladekit concat -a alias.txt -l prov.tsv gene1.fasta gene2.fasta > supermatrix.fasta
DNA, gene1.fasta = 1-4
DNA, gene2.fasta = 5-8
$ cat supermatrix.fasta
>Mus_musculus
ATCGATCG
>Rattus_rattus
ATCGNNNN
>Xenopus_laevis
NNNNATCG
$ cat prov.tsv
alias.txt       gene1.fasta                      gene2.fasta
Mus_musculus    AB123.1 Mus musculus gene1 cds   XM456.1 Mus musculus gene2 cds
Rattus_rattus   AB124.1 Rattus rattus gene1 cds  MISSING
Xenopus_laevis  MISSING                          XM789.1 Xenopus laevis gene2 cds
NEXUS output:
$ cladekit concat -a alias.txt -l prov.tsv -f nexus gene1.fasta gene2.fasta
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=3 NCHAR=8;
FORMAT DATATYPE=DNA MISSING=N GAP=-;
MATRIX
Mus_musculus ATCGATCG
Rattus_rattus ATCGNNNN
Xenopus_laevis NNNNATCG
;
END;
BEGIN SETS;
CHARSET gene1.fasta = 1-4;
CHARSET gene2.fasta = 5-8;
END;
Flags:
- -a, --alias: alias list for smart matching (clean output names that map to messy input headers)
- -l, --log: provenance TSV output file (required with -a)
- -f, --format: output format, either fasta (default) or nexus (also accepts n or nex)
- -m, --missing: override the missing-data character (default: chosen per data type, N for DNA, X for amino acid, ? for mixed)
- -p, --partitions: partition format, either raxml (default, also used by IQ-TREE) or nexus
The stats subcommand reports basic alignment statistics for FASTA files. It accepts multiple files via globs and automatically detects DNA vs amino acid sequences.
Columns:
- file: filename (path stripped)
- sequences: number of sequences
- length: alignment length (NA if unaligned)
- type: DNA or AA (auto-detected, supports IUPAC ambiguity codes)
- gc_pct: GC content as a percentage of real bases (NA for amino acid data)
- missing_pct: percentage of gaps and unknown characters
- variable: sites with at least 2 different residues (excluding gaps/unknowns)
- variable_pct: variable sites as a percentage of alignment length
- informative: parsimony-informative sites (at least 2 residues each appearing 2+ times)
- informative_pct: informative sites as a percentage of alignment length
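To make the column definitions concrete, here is a minimal Python sketch that computes the variable and informative site counts, plus gc_pct and missing_pct, for a DNA alignment. It follows the definitions in the list above but is not Cladekit's code; the function names are hypothetical, the input is assumed to be non-empty, N is assumed to be the only unknown base, and only plain G and C are counted toward GC content.

```python
from collections import Counter

UNKNOWN = set("-?N")  # assumption: DNA data, where N means unknown


def site_counts(seqs):
    """Return (variable, informative) site counts for aligned sequences."""
    seqs = [s.upper() for s in seqs]
    variable = informative = 0
    for column in zip(*seqs):
        counts = Counter(c for c in column if c not in UNKNOWN)
        if len(counts) >= 2:
            variable += 1
        # Parsimony-informative: at least 2 residues seen 2+ times each.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return variable, informative


def gc_and_missing(seqs):
    """Return (gc_pct, missing_pct), each rounded to one decimal place."""
    text = "".join(seqs).upper()
    real = [c for c in text if c not in UNKNOWN]
    gc = sum(c in "GC" for c in real)
    gc_pct = 100 * gc / len(real) if real else float("nan")
    missing_pct = 100 * (len(text) - len(real)) / len(text)
    return round(gc_pct, 1), round(missing_pct, 1)


supermatrix = ["ATCGATCG", "ATCGNNNN", "NNNNATCG"]
print(site_counts(supermatrix))     # (0, 0)
print(gc_and_missing(supermatrix))  # (50.0, 33.3)
```

Run on the three-taxon supermatrix from the concat example, the sketch gives 50.0% GC, 33.3% missing, and no variable sites.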
Example:
$ cladekit stats supermatrix.fasta proteins.fasta
file sequences length type gc_pct missing_pct variable variable_pct informative informative_pct
supermatrix.fasta 3 8 DNA 50.0 33.3 0 0.0 0 0.0
proteins.fasta 4 20 AA NA 0.0 3 15.0 2 10.0
Flags:
- -d, --detailed: per-sequence statistics (header, length, GC%, missingness)
- -p, --pretty: column-aligned output for readability
The coverage subcommand summarizes taxa and loci coverage from a concat provenance TSV. It shows how many loci each taxon appears in, or how many taxa each locus has.
Example:
$ cladekit coverage -t prov.tsv
taxa loci_present loci_missing pct_missing
Mus_musculus 5/5 0/5 0.0%
Smilodon_populator 2/5 3/5 60.0%
$ cladekit coverage -l -p prov.tsv
loci appearance_count missing_pct
12S_aln.fas 6/8 25.0%
COX1_aln.fas 6/8 25.0%
Flags:
- -t, --taxa: show per-taxon coverage (how many loci each taxon has)
- -l, --loci: show per-locus coverage (how many taxa each locus has)
- -p, --pretty: column-aligned output for readability
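The per-taxon summary can be approximated with a short Python sketch that counts MISSING cells in a provenance TSV. It assumes the layout shown in the concat example (tab-separated, first column holding the alias, MISSING marking an absent locus); the function name taxon_coverage is hypothetical and this is not Cladekit's implementation.

```python
def taxon_coverage(tsv_text):
    """Summarize per-taxon locus coverage from provenance TSV text.

    Returns tuples of (alias, loci_present, loci_missing, pct_missing).
    """
    lines = [ln for ln in tsv_text.splitlines() if ln.strip()]
    n_loci = len(lines[0].split("\t")) - 1  # first column holds the alias
    report = []
    for line in lines[1:]:
        alias, *cells = line.split("\t")
        missing = sum(cell == "MISSING" for cell in cells)
        pct = round(100 * missing / n_loci, 1)
        report.append((alias, n_loci - missing, missing, pct))
    return report


prov = (
    "alias.txt\tgene1.fasta\tgene2.fasta\n"
    "Mus_musculus\tAB123.1 Mus musculus gene1 cds\tXM456.1 Mus musculus gene2 cds\n"
    "Rattus_rattus\tAB124.1 Rattus rattus gene1 cds\tMISSING\n"
    "Xenopus_laevis\tMISSING\tXM789.1 Xenopus laevis gene2 cds\n"
)
for row in taxon_coverage(prov):
    print(*row, sep="\t")
```

The per-locus view is the same count taken down each column instead of across each row.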
The convert subcommand converts between common sequence file formats. It auto-detects the input format from the file contents.
Supported formats:
- FASTA (f)
- NEXUS (n / nex / nexus)
- Relaxed PHYLIP (rp / phylip)
- Strict PHYLIP (sp)
Example:
$ cladekit convert -o n alignment.fasta
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=3 NCHAR=8;
FORMAT DATATYPE=DNA MISSING=N GAP=-;
MATRIX
Taxon_A ATCGATCG
Taxon_B ATCGATCG
Taxon_C ATCGNNNN
;
END;
$ cladekit convert -o rp alignment.nex
3 8
Taxon_A ATCGATCG
Taxon_B ATCGATCG
Taxon_C ATCGNNNN
Flags:
- -o, --output_format: output format, one of f (fasta), n (nexus), rp (relaxed phylip), sp (strict phylip)
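The difference between the two PHYLIP flavors is easy to miss, so here is a small Python sketch of a writer for both. It relies on the conventional definitions (strict PHYLIP uses a fixed 10-character name field, relaxed PHYLIP separates name and sequence with whitespace); the function name to_phylip is hypothetical, and how Cladekit handles names longer than 10 characters in strict mode is not documented here.

```python
def to_phylip(records, strict=False):
    """Format (name, sequence) pairs as sequential PHYLIP text.

    records must be non-empty and hold sequences of equal length.
    """
    nchar = len(records[0][1])
    lines = [f"{len(records)} {nchar}"]
    for name, seq in records:
        if strict:
            # Strict PHYLIP: name truncated or padded to exactly 10 columns.
            lines.append(name[:10].ljust(10) + seq)
        else:
            # Relaxed PHYLIP: full name, one space, then the sequence.
            lines.append(f"{name} {seq}")
    return "\n".join(lines)


records = [("Taxon_A", "ATCGATCG"), ("Taxon_B", "ATCGATCG"), ("Taxon_C", "ATCGNNNN")]
print(to_phylip(records))
print(to_phylip(records, strict=True))
```

Truncating to 10 characters can make two long names collide, which is one reason relaxed PHYLIP is often the safer choice for real taxon labels.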
Further subcommands, not documented above:
- filter: remove taxa exceeding a missingness threshold from a supermatrix
- scrub: alignment outlier detection via pairwise p-distances
- view: in-terminal alignment viewer
- slice: cut out and remove sections of an alignment (remove non-homologous sequences, extract homologous sequences)
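For readers unfamiliar with the term used in the scrub entry, a p-distance is the proportion of compared sites at which two aligned sequences differ. The Python sketch below shows the standard calculation; it is not Cladekit's code, the function name p_distance is hypothetical, and skipping any site where either sequence has a gap or unknown is an assumption.

```python
def p_distance(a, b, unknown="-?N"):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or unknown are skipped
    (an assumption; other conventions exist).
    """
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in unknown or y in unknown:
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else float("nan")


print(p_distance("ATCGATCG", "ATCGATCC"))  # 0.125
print(p_distance("ATCGNNNN", "ATCAATCG"))  # 0.25
```

A sequence whose distances to all other sequences are unusually large is a candidate outlier, which is the general idea behind distance-based alignment screening.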
Cladekit is being built as both a real research tool and a vehicle for learning Rust. Development is assisted by Claude (Anthropic), which serves as a teaching aid and coding partner. The design, domain knowledge, and direction are the author's own.
Andrew Budge