Getting the Lineage of an Organism Using Python, BioSQL, and Taxonomy Data
Introduction
You can retrieve the complete taxonomic lineage of an organism by querying the NCBI Taxonomy database through the Entrez API. The method described here instead works offline, against a local copy of the taxonomy data, and is useful if you need to retrieve the lineage for a large number of organisms at once.
Much of the information in this blog post comes from the documentation of the BioSQL and BioPython projects. In addition, I wrote a Python script to query the taxonomy database and return the complete lineage of an organism.
Here is an example of running the script:
$ python3 lineager.py -n Bos taurus
And the resulting output:
2018-07-14 04:47 INFO Processing organism name provided
at the command line: Bos taurus
Organism,Lineage
Bos taurus,cellular organisms;Eukaryota;Opisthokonta;Metazoa;
Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;
Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;
Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;
Eutheria;Boreoeutheria;Laurasiatheria;Cetartiodactyla;
Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus
You can also provide an input file (-f) containing the names of organisms, one per line. See the example towards the end of this blog post.
Step 1: Install Required Software
These steps were tested on Ubuntu 24.04 LTS with the following programs installed.
MySQL 8.0.45
For storing taxonomy data.
Perl 5.38.2 with the DBI, DBD::mysql, and libwww-perl modules
For initialising the database with the BioSQL schema and for importing taxonomy data. The perl-doc package is optional and is only needed for viewing documentation.
Python 3.12.3 with the MySQLdb package
For running the lineager.py script.
Note:
sudo privileges are required only for installing packages, creating the database, and creating the database user.
All other commands are run from a regular user account.
Install MySQL using the command:
$ sudo apt install -y mysql-server
Once the installation is complete, check if the MySQL server is up and running using the command:
$ ss -tap | grep mysql
LISTEN 0 151 127.0.0.1:mysql 0.0.0.0:* users:(("mysqld",pid=1323,fd=23))
LISTEN 0 70 127.0.0.1:33060 0.0.0.0:* users:(("mysqld",pid=1323,fd=21))
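The same check can be done from Python, which is handy if you want to confirm the server is reachable before running a script. This is only a minimal sketch: the port_is_open helper is a hypothetical name, and it assumes MySQL listens on its default port 3306 on localhost, as in the ss output above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# 3306 is MySQL's default port (assumed here; ss shows it as "mysql").
print("MySQL reachable:", port_is_open("127.0.0.1", 3306))
```

A successful connection only shows that something is listening on the port. It does not verify credentials or that the biosql database exists.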
Note:
For additional information on installing and configuring MySQL, please refer to the Installing MySQL section of the Ubuntu Server documentation.
Install the remaining packages using apt:
$ sudo apt install -y libdbi-perl libdbd-mysql-perl \
libmysqlclient-dev python3-mysqldb perl-doc \
libwww-perl
Step 2: Create a Database for Storing Taxonomy Data
Log in as the MySQL admin user:
$ sudo mysql -u root
Create Database
Run the following SQL command. Note the use of backticks (`) around the database name (biosql), instead of single quotes:
CREATE DATABASE `biosql` COLLATE 'utf8_general_ci';
Create Database User
Run the following commands to create a database user and grant that user permission to access and modify the biosql database.
Note:
Replace ‘your-password-here’ with a strong password in the command below.
CREATE USER `lineager`@`localhost` IDENTIFIED BY 'your-password-here';
GRANT ALL PRIVILEGES ON `biosql`.* TO `lineager`@`localhost`;
Type exit to quit the MySQL shell.
Initialise Database With BioSQL Schema
First, create a file named .my.cnf in your home directory with the following content:
[client]
user = lineager
password = your-password-here
database = biosql
The mysql command-line client and the lineager.py script both read this file to connect to the database. Replace your-password-here with the password you set when creating the database user above.
Since the .my.cnf file contains the database password in plain text, it is a good idea to make it readable only by the current user. To do this, use the following command:
$ chmod 600 ~/.my.cnf
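If you want to double-check the file from Python, the sketch below parses the [client] section and confirms that group and other users have no access. The helper names read_client_options and is_private are hypothetical, and the parser assumes a simple option file like the one above (no !include directives). It is not how lineager.py reads the file.

```python
import configparser
import os
import stat


def read_client_options(path: str) -> dict:
    """Parse the [client] section of a simple MySQL option file."""
    # interpolation=None keeps '%' characters in passwords intact.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    with open(os.path.expanduser(path)) as handle:
        parser.read_file(handle)
    return dict(parser["client"])


def is_private(path: str) -> bool:
    """Return True if group and others have no permissions on the file."""
    mode = os.stat(os.path.expanduser(path)).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Example usage, once ~/.my.cnf exists:
#   options = read_client_options("~/.my.cnf")
#   print(options["user"], options["database"], is_private("~/.my.cnf"))
```

The MySQLdb package can also read this file directly through the read_default_file argument of MySQLdb.connect, so parsing it by hand is rarely necessary.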
Next, download the BioSQL schema:
$ wget https://github.com/biosql/biosql/archive/refs/heads/master.zip
Extract the archive and change directory:
$ unzip master.zip
$ cd biosql-master
Initialise the database by running the biosqldb-mysql.sql script:
$ mysql -u lineager -D biosql < sql/biosqldb-mysql.sql
Step 3: Import NCBI Taxonomy Data Into Database
While still in the biosql-master directory, create a directory to store taxonomy data and change into it:
$ mkdir taxdata
$ cd taxdata
Download the taxonomy database (~70MB) and its MD5 checksum:
$ wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
$ wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.md5
Ensure the file has downloaded correctly by verifying its MD5 checksum. A successful download should return “OK”:
$ md5sum -c taxdump.tar.gz.md5
taxdump.tar.gz: OK
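On a system without md5sum, the same verification can be sketched in Python with hashlib. The function names below are hypothetical, and the code assumes the .md5 file starts with the hexadecimal digest followed by the file name, which is the layout md5sum -c expects.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(data_path: str, md5_path: str) -> bool:
    """Compare a file's MD5 digest with the first field of an .md5 file."""
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(data_path) == expected


# Example usage, from inside the taxdata directory:
#   print(matches_checksum_file("taxdump.tar.gz", "taxdump.tar.gz.md5"))
```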
Uncompress the archive:
$ tar zxvf taxdump.tar.gz
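The extracted archive includes nodes.dmp and names.dmp, which the import script reads in the next step. If you are curious about their layout, fields are separated by a tab, a pipe, and a tab, and each line ends with a tab and a pipe. The sketch below parses lines in that layout; the helper names are hypothetical and the sample lines are illustrative rather than copied from the dump.

```python
def split_dmp_line(line: str) -> list[str]:
    """Split one line of an NCBI taxonomy .dmp file into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-line terminator
    return line.split("\t|\t")


def scientific_names(lines) -> dict[int, str]:
    """Map taxonomy IDs to scientific names from names.dmp-style lines."""
    names = {}
    for line in lines:
        # names.dmp columns: tax_id, name, unique name, name class
        tax_id, name, _unique, name_class = split_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name
    return names


# Illustrative lines in the names.dmp layout.
sample = [
    "9913\t|\tBos taurus\t|\t\t|\tscientific name\t|\n",
    "9913\t|\tcattle\t|\t\t|\tgenbank common name\t|\n",
]
print(scientific_names(sample))  # {9913: 'Bos taurus'}
```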
Before running the script to import taxonomy data, set the DBI_PASSWORD environment variable to the password you set for the database user in Step 2:
$ export DBI_PASSWORD='your-password-here'
Now run the script to import taxonomy data into the database:
$ cd ..
$ perl scripts/load_ncbi_taxonomy.pl --dbname biosql --dbuser lineager
Note:
This will take a long time to complete. Leave the terminal session open.
Output:
Loading NCBI taxon database in taxdata:
... retrieving all taxon nodes in the database
... reading in taxon nodes from nodes.dmp
... insert / update / delete taxon nodes
... updating new parent IDs
... (committing nodes)
... rebuilding nested set left/right values
... reading in taxon names from names.dmp
... deleting old taxon names
... inserting new taxon names
... cleaning up
Done.
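The "nested set left/right values" mentioned in the output are what make lineage lookups cheap: every ancestor of a taxon has a left value no greater than, and a right value no smaller than, the taxon's own. The sketch below illustrates the idea on a tiny in-memory SQLite database with simplified versions of the BioSQL taxon and taxon_name tables. The toy tree and the lineage function are assumptions for illustration only; the query that lineager.py actually runs is not shown in this post and may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the BioSQL taxon and taxon_name tables.
conn.executescript("""
CREATE TABLE taxon (
    taxon_id INTEGER PRIMARY KEY,
    parent_taxon_id INTEGER,
    left_value INTEGER,
    right_value INTEGER
);
CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT);
""")
# A toy tree; left/right values come from a depth-first walk.
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?, ?)", [
    (1, None, 1, 10), (2, 1, 2, 7), (3, 2, 3, 6), (4, 3, 4, 5), (5, 1, 8, 9),
])
conn.executemany("INSERT INTO taxon_name VALUES (?, ?, ?)", [
    (1, "cellular organisms", "scientific name"),
    (2, "Eukaryota", "scientific name"),
    (3, "Bos", "scientific name"),
    (4, "Bos taurus", "scientific name"),
    (4, "cattle", "genbank common name"),
    (5, "Bacteria", "scientific name"),
])


def lineage(conn, organism: str) -> list[str]:
    """Return the ancestors of an organism, root first, via nested sets."""
    rows = conn.execute("""
        SELECT an.name
        FROM taxon_name tn
        JOIN taxon t ON t.taxon_id = tn.taxon_id
        JOIN taxon a ON a.left_value <= t.left_value
                    AND a.right_value >= t.right_value
        JOIN taxon_name an ON an.taxon_id = a.taxon_id
                          AND an.name_class = 'scientific name'
        WHERE tn.name = ? AND tn.name_class = 'scientific name'
        ORDER BY a.left_value
    """, (organism,)).fetchall()
    return [row[0] for row in rows]


print(";".join(lineage(conn, "Bos taurus")))
# cellular organisms;Eukaryota;Bos;Bos taurus
```

Against the real MySQL database the same query shape should apply, but MySQLdb uses %s rather than ? as the parameter placeholder.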
To view all the options supported by the script, use the command:
$ perl scripts/load_ncbi_taxonomy.pl --help
Step 4: Run Python Script to Get Lineage
Download the source code of lineager using wget:
$ cd
$ wget https://codeberg.org/vimalkvn/lineager/archive/master.zip
$ unzip master.zip
Alternatively, clone the Git repository:
$ git clone https://codeberg.org/vimalkvn/lineager
Run the lineager.py script with the name of the organism, for example:
$ cd lineager
$ python3 lineager.py -n Escherichia coli
Output:
Organism,Lineage
Escherichia coli,cellular organisms;Bacteria;Pseudomonadati;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
You can also get the lineage for multiple organisms at once. To do this, create a text file with one organism name per line, like this:
Canis lupus familiaris
Bos taurus
Escherichia
AMBIGUOUS
Arabidopsis thaliana
Save the file as input.txt, then run the script:
$ python3 lineager.py -f input.txt
When the script finishes, it writes an output file, lineage.csv, to the same directory. The file contains lineage information for all organisms in the input file.
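For further processing, the output can be loaded with Python's csv module. The sketch below assumes that lineage.csv uses the same Organism,Lineage header and semicolon-separated lineage as the command-line output shown earlier. The read_lineages helper is a hypothetical name, and the sample lineage is shortened for readability.

```python
import csv
import io


def read_lineages(handle) -> dict:
    """Map each organism to its lineage as a list of taxon names."""
    reader = csv.DictReader(handle)
    return {row["Organism"]: row["Lineage"].split(";") for row in reader}


# A shortened sample in the assumed output format.
sample = io.StringIO(
    "Organism,Lineage\n"
    "Escherichia coli,cellular organisms;Bacteria;Escherichia;Escherichia coli\n"
)
lineages = read_lineages(sample)
print(lineages["Escherichia coli"][1])  # Bacteria

# With the real file:
#   with open("lineage.csv", newline="") as handle:
#       lineages = read_lineages(handle)
```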
If you have any questions or comments, please send me an email.