GitHub - andrewbudge/Cladekit: A lightweight CLI toolkit for phylogenetic analysis

Cladekit is a lightweight, composable command-line toolkit for phylogenetics. It ships as a single binary whose subcommands replace the chains of bash commands, or the collections of separate programs, commonly found in phylogenetic pipelines. Examples include header extraction, alignment concatenation, and alignment quality control.

Note: Cladekit is under active development. Subcommands may change or be added as the project matures.

Install

Requires Rust.

cargo install --git https://github.com/andrewbudge/cladekit

This builds the binary and installs it into Cargo's bin directory (~/.cargo/bin by default), which must be on your PATH.

To update to the latest version:

cargo install --force --git https://github.com/andrewbudge/cladekit

Subcommands

getheaders (ghd)

Extract headers from FASTA files. Pass -u to print each distinct header only once.

Example:

$ cladekit getheaders testdata/test_good.fasta
Sequence1
Sequence2
Sequence1

$ cladekit getheaders -u testdata/test_good.fasta
Sequence1
Sequence2
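
The behavior shown above can be sketched in a few lines of Python. This is an illustration only, not Cladekit's Rust implementation. The function name `get_headers` and the sample FASTA text are assumptions made for the example, since the contents of testdata/test_good.fasta are not shown here.

```python
def get_headers(fasta_text, unique=False):
    """Return FASTA headers in file order, optionally dropping repeats."""
    headers = []
    seen = set()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].strip()
        if unique and header in seen:
            continue  # already reported once
        seen.add(header)
        headers.append(header)
    return headers


# Hypothetical input with one repeated header.
fasta = ">Sequence1\nATCG\n>Sequence2\nATCG\n>Sequence1\nATCG\n"
print("\n".join(get_headers(fasta)))
print("\n".join(get_headers(fasta, unique=True)))
```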

concat (liger)

Concatenate multiple gene alignments into a supermatrix. Unlike other tools, concat accepts input files from any directory and also accepts globs.

Concat runs in two modes:

Exact match (default): headers must match exactly across files, like FASconCAT and AMAS.
Smart match (-a alias.txt): pass an alias list, a file of clean output names (one per line, e.g. Mus_musculus) that are matched to messy input headers by case-insensitive substring search. Underscores in aliases match spaces in headers, so Mus_musculus finds AB123.1 Mus musculus COX1 gene, partial cds. Longer aliases are tried first to prevent partial collisions. The alias list doubles as a rename map: input headers stay messy, and the output gets clean names. This mode requires -l, which writes a provenance TSV recording exactly which original header matched each alias.
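
The smart-match rules above can be sketched as follows. This is a minimal Python illustration, not Cladekit's actual code. The function name `match_alias` is hypothetical, and treating underscores and spaces as interchangeable on both sides is an assumption that goes slightly beyond what the description states.

```python
def match_alias(header, aliases):
    """Return the alias that matches a header, trying longer aliases first."""
    # Assumption: underscores and spaces are treated as the same character.
    haystack = header.replace("_", " ").lower()
    for alias in sorted(aliases, key=len, reverse=True):
        needle = alias.replace("_", " ").lower()
        if needle in haystack:  # case-insensitive substring search
            return alias
    return None  # no alias matched this header


aliases = ["Rattus_rattus", "Rattus_rattus_alexandrinus", "Mus_musculus"]
print(match_alias("AB123.1 Mus musculus COX1 gene, partial cds", aliases))
print(match_alias("XY1.1 RATTUS RATTUS ALEXANDRINUS COX1", aliases))
```

Sorting by length means the header in the second call resolves to Rattus_rattus_alexandrinus rather than the shorter Rattus_rattus, which is the kind of partial collision the ordering is meant to avoid.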

Concat auto-detects DNA vs amino acid data per gene and adjusts missing characters and partition labels accordingly. By default, FASTA output goes to stdout and partition boundaries go to stderr in RAxML/IQ-TREE format. NEXUS output bundles everything into one file.
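
As a rough picture of what concatenation involves, the Python sketch below pads taxa that are absent from a gene with the missing character and records 1-based partition boundaries. It is a simplified illustration rather than Cladekit's implementation: the `concat` function and its input layout are hypothetical, data type detection is left out, and the partition label is hard-coded to DNA.

```python
def concat(genes, missing="N"):
    """Join per-gene alignments into a supermatrix.

    genes: list of (name, {taxon: aligned_sequence}) pairs.
    Returns (matrix, partitions), where partitions are RAxML-style lines.
    """
    # Collect taxa in first-seen order across all genes.
    taxa = []
    for _, seqs in genes:
        for taxon in seqs:
            if taxon not in taxa:
                taxa.append(taxon)

    matrix = {taxon: "" for taxon in taxa}
    partitions = []
    start = 1
    for name, seqs in genes:
        length = len(next(iter(seqs.values())))  # alignment length of this gene
        for taxon in taxa:
            # A taxon absent from this gene gets a run of missing characters.
            matrix[taxon] += seqs.get(taxon, missing * length)
        end = start + length - 1
        partitions.append(f"DNA, {name} = {start}-{end}")  # label assumed DNA
        start = end + 1
    return matrix, partitions


genes = [
    ("gene1.fasta", {"Mus_musculus": "ATCG", "Rattus_rattus": "ATCG"}),
    ("gene2.fasta", {"Mus_musculus": "ATCG", "Xenopus_laevis": "ATCG"}),
]
matrix, partitions = concat(genes)
for taxon, seq in matrix.items():
    print(f">{taxon}\n{seq}")
print("\n".join(partitions))
```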

Exact match — clean headers:

$ cladekit concat gene1.fasta gene2.fasta > supermatrix.fasta
DNA, gene1.fasta = 1-4
DNA, gene2.fasta = 5-8

Smart match — messy headers with an alias list:

$ cat alias.txt
Mus_musculus
Rattus_rattus
Xenopus_laevis

$ cladekit concat -a alias.txt -l prov.tsv gene1.fasta gene2.fasta > supermatrix.fasta
DNA, gene1.fasta = 1-4
DNA, gene2.fasta = 5-8

$ cat supermatrix.fasta
>Mus_musculus
ATCGATCG
>Rattus_rattus
ATCGNNNN
>Xenopus_laevis
NNNNATCG

$ cat prov.tsv
alias.txt	gene1.fasta	gene2.fasta
Mus_musculus	AB123.1 Mus musculus gene1 cds	XM456.1 Mus musculus gene2 cds
Rattus_rattus	AB124.1 Rattus rattus gene1 cds	MISSING
Xenopus_laevis	MISSING	XM789.1 Xenopus laevis gene2 cds

NEXUS output:

$ cladekit concat -a alias.txt -l prov.tsv -f nexus gene1.fasta gene2.fasta
#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=3 NCHAR=8;
  FORMAT DATATYPE=DNA MISSING=N GAP=-;
  MATRIX
  Mus_musculus    ATCGATCG
  Rattus_rattus   ATCGNNNN
  Xenopus_laevis  NNNNATCG
;
END;
BEGIN SETS;
  CHARSET gene1.fasta = 1-4;
  CHARSET gene2.fasta = 5-8;
END;
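
For readers unfamiliar with the layout, the NEXUS output above can be assembled from a matrix and a list of character sets roughly as follows. This Python sketch is illustrative only. The function `to_nexus` and its arguments are hypothetical, and the column padding is an arbitrary choice.

```python
def to_nexus(matrix, charsets, datatype="DNA", missing="N"):
    """Render a DATA block and a SETS block as a single NEXUS string.

    matrix: {taxon: sequence}; charsets: list of (name, (start, end)).
    """
    ntax = len(matrix)
    nchar = len(next(iter(matrix.values())))
    width = max(len(name) for name in matrix) + 2  # pad names into a column
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={ntax} NCHAR={nchar};",
        f"  FORMAT DATATYPE={datatype} MISSING={missing} GAP=-;",
        "  MATRIX",
    ]
    for name, seq in matrix.items():
        lines.append(f"  {name.ljust(width)}{seq}")
    lines += [";", "END;", "BEGIN SETS;"]
    for name, (start, end) in charsets:
        lines.append(f"  CHARSET {name} = {start}-{end};")
    lines.append("END;")
    return "\n".join(lines)


nexus = to_nexus(
    {
        "Mus_musculus": "ATCGATCG",
        "Rattus_rattus": "ATCGNNNN",
        "Xenopus_laevis": "NNNNATCG",
    },
    [("gene1.fasta", (1, 4)), ("gene2.fasta", (5, 8))],
)
print(nexus)
```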

Flags:

-a, --alias — alias list for smart matching (clean output names that map to messy input headers)
-l, --log — provenance TSV output file (required with -a)
-f, --format — output format: fasta (default), nexus (also accepts n or nex)
-m, --missing — override missing data character (default: auto per data type — N for DNA, X for amino acid, ? for mixed)
-p, --partitions — partition format: raxml (default, also used by IQ-TREE) or nexus

stats

Get basic alignment statistics from FASTA files. Accepts multiple files via globs. Automatically detects DNA vs amino acid sequences.

Columns:

file — filename (path stripped)
sequences — number of sequences
length — alignment length (NA if unaligned)
type — DNA or AA (auto-detected, supports IUPAC ambiguity codes)
gc_pct — GC content as a percentage of real bases (NA for amino acid data)
missing_pct — percentage of gaps and unknown characters
variable — sites with at least 2 different residues (excluding gaps/unknowns)
variable_pct — variable sites as a percentage of alignment length
informative — parsimony-informative sites (at least 2 residues each appearing 2+ times)
informative_pct — informative sites as a percentage of alignment length
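
The column definitions above translate fairly directly into code. The Python sketch below is one possible reading of them for DNA alignments, not Cladekit's implementation. The function names and the set of characters counted as missing are assumptions, and IUPAC ambiguity codes are simply treated as distinct residues here.

```python
from collections import Counter

MISSING = {"-", "?", "N"}  # assumed gap/unknown symbols for DNA


def site_stats(seqs):
    """Count variable and parsimony-informative columns in an alignment."""
    variable = informative = 0
    for column in zip(*seqs):
        counts = Counter(c.upper() for c in column if c.upper() not in MISSING)
        if len(counts) >= 2:  # at least 2 different residues
            variable += 1
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1  # at least 2 residues, each seen 2+ times
    return variable, informative


def gc_and_missing(seqs):
    """Return (gc_pct, missing_pct), with GC measured over real bases only."""
    chars = "".join(seqs).upper()
    missing = sum(1 for c in chars if c in MISSING)
    real = len(chars) - missing
    gc = sum(1 for c in chars if c in "GC")
    gc_pct = 100.0 * gc / real if real else float("nan")
    return round(gc_pct, 1), round(100.0 * missing / len(chars), 1)


supermatrix = ["ATCGATCG", "ATCGNNNN", "NNNNATCG"]
print(site_stats(supermatrix), gc_and_missing(supermatrix))
print(site_stats(["AAGT", "AAGT", "ACTT", "ACTA"]))
```

In the second call, columns 2 and 3 are both variable and informative, while column 4 is variable only, because the residue A appears just once.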

Example:

$ cladekit stats supermatrix.fasta proteins.fasta
file	sequences	length	type	gc_pct	missing_pct	variable	variable_pct	informative	informative_pct
supermatrix.fasta	3	8	DNA	50.0	33.3	0	0.0	0	0.0
proteins.fasta	4	20	AA	NA	0.0	3	15.0	2	10.0

Flags:

-d, --detailed — per-sequence statistics (header, length, GC%, missingness)
-p, --pretty — column-aligned output for readability

coverage

Summarize taxon and locus coverage from the provenance TSV written by concat -l. Shows how many loci each taxon appears in, or how many taxa each locus has.

Example:

$ cladekit coverage -t prov.tsv
taxa	loci_present	loci_missing	pct_missing
Mus_musculus	5/5	0/5	0.0%
Smilodon_populator	2/5	3/5	60.0%

$ cladekit coverage -l -p prov.tsv
loci          appearance_count  missing_pct
12S_aln.fas   6/8               25.0%
COX1_aln.fas  6/8               25.0%
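
Per-taxon coverage can be derived from the provenance TSV by counting the cells that are not MISSING. The Python sketch below assumes the TSV layout shown in the concat example (a header row of file names, then one row per alias). The function name `taxon_coverage` is hypothetical, and this is not Cladekit's own code.

```python
def taxon_coverage(tsv_text):
    """Return (taxon, loci_present, loci_total, pct_missing) per TSV row."""
    rows = [line.split("\t") for line in tsv_text.splitlines() if line]
    loci = rows[0][1:]  # the first column holds the alias, the rest are loci
    result = []
    for row in rows[1:]:
        present = sum(1 for cell in row[1:] if cell != "MISSING")
        pct_missing = 100.0 * (len(loci) - present) / len(loci)
        result.append((row[0], present, len(loci), round(pct_missing, 1)))
    return result


prov = (
    "alias.txt\tgene1.fasta\tgene2.fasta\n"
    "Mus_musculus\tAB123.1 Mus musculus gene1 cds\tXM456.1 Mus musculus gene2 cds\n"
    "Rattus_rattus\tAB124.1 Rattus rattus gene1 cds\tMISSING\n"
    "Xenopus_laevis\tMISSING\tXM789.1 Xenopus laevis gene2 cds\n"
)
for taxon, present, total, pct in taxon_coverage(prov):
    print(f"{taxon}\t{present}/{total}\t{total - present}/{total}\t{pct:.1f}%")
```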

Flags:

-t, --taxa — show per-taxon coverage (how many loci each taxon has)
-l, --loci — show per-locus coverage (how many taxa each locus has)
-p, --pretty — column-aligned output for readability

convert

Convert between common sequence data file types. Auto-detects the input format from file contents.

Supported formats:

FASTA (f)
NEXUS (n / nex / nexus)
Relaxed PHYLIP (rp / phylip)
Strict PHYLIP (sp)

Example:

$ cladekit convert -o n alignment.fasta
#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=3 NCHAR=8;
  FORMAT DATATYPE=DNA MISSING=N GAP=-;
  MATRIX
  Taxon_A    ATCGATCG
  Taxon_B    ATCGATCG
  Taxon_C    ATCGNNNN
;
END;

$ cladekit convert -o rp alignment.nex
3 8
Taxon_A    ATCGATCG
Taxon_B    ATCGATCG
Taxon_C    ATCGNNNN
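
Detecting a format from file contents can be as simple as inspecting the first non-blank line. The Python sketch below shows one such heuristic together with a relaxed PHYLIP writer. Both functions are hypothetical illustrations, and the rules Cladekit actually uses for detection are not documented here.

```python
def detect_format(text):
    """Guess the format from the first non-blank line (assumed heuristic)."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.upper().startswith("#NEXUS"):
            return "nexus"
        if line.startswith(">"):
            return "fasta"
        parts = line.split()
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            return "phylip"  # header line "ntax nchar"
        return "unknown"
    return "unknown"


def to_relaxed_phylip(matrix):
    """Write {taxon: sequence} as relaxed PHYLIP with untruncated names."""
    ntax = len(matrix)
    nchar = len(next(iter(matrix.values())))
    lines = [f"{ntax} {nchar}"]
    for name, seq in matrix.items():
        lines.append(f"{name}    {seq}")
    return "\n".join(lines)


alignment = {"Taxon_A": "ATCGATCG", "Taxon_B": "ATCGATCG", "Taxon_C": "ATCGNNNN"}
phylip = to_relaxed_phylip(alignment)
print(detect_format(">Taxon_A\nATCGATCG\n"))
print(phylip)
```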

Flags:

-o, --output_format — output format: f (fasta), n (nexus), rp (relaxed phylip), sp (strict phylip)

Planned Subcommands

filter — remove taxa exceeding a missingness threshold from a supermatrix
scrub — alignment outlier detection via pairwise p-distances
view - in-terminal alignment viewer
slice - cut out or remove sections of an alignment (remove non-homologous sequences, extract homologous sequences)

Development Note

Cladekit is being built as both a real research tool and a vehicle for learning Rust. Development is assisted by Claude (Anthropic), which serves as a teaching aid and coding partner. The design, domain knowledge, and direction are the author's own.

Author

Andrew Budge
