Getting the Lineage of an Organism Using Python, BioSQL, and Taxonomy Data

By Vimalkumar Velayudhan · Posted 14 Jul 2018 · Updated 07 Mar 2026

Table of Contents

Introduction
Step 1: Install Required Software
Step 2: Create a Database for Storing Taxonomy Data
Step 3: Import NCBI Taxonomy Data Into Database
Step 4: Run Python Script to Get Lineage

Introduction

You can retrieve the complete taxonomic lineage of an organism by querying the NCBI Taxonomy database through the Entrez API, but that requires network access. The method described here instead uses a local copy of the taxonomy data, so it works offline and is useful when you need the lineage for a large number of organisms at once.

Much of the information in this blog post comes from the documentation of the BioSQL and BioPython projects. In addition, I wrote a Python script to query the taxonomy database and return the complete lineage of an organism.

Here is how the script works:

$ python3 lineager.py -n Bos taurus

And the resulting output:

2018-07-14 04:47 INFO     Processing organism name provided
                          at the command line: Bos taurus
Organism,Lineage
Bos taurus,cellular organisms;Eukaryota;Opisthokonta;Metazoa;
Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;
Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;
Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;
Eutheria;Boreoeutheria;Laurasiatheria;Cetartiodactyla;
Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus

You can also provide an input file (-f), containing names of organisms. See the example towards the end of this blog post.

Step 1: Install Required Software

These steps were tested on Ubuntu 24.04 LTS with the following programs installed.

MySQL 8.0.45
For storing taxonomy data.

Perl 5.38.2 with the DBI, DBD, and libwww-perl modules
For initialising the database with the BioSQL schema and for importing taxonomy data. Optionally, the perl-doc package for viewing documentation.

Python 3.12.3 with the MySQLdb package
For running the lineager.py script.

Note:
sudo privileges are required only for installing packages, creating the database, and creating the database user.
All other commands are run from a regular user account.

Install MySQL using the command:

$ sudo apt install -y mysql-server

Once the installation is complete, check if the MySQL server is up and running using the command:

$ ss -tap | grep mysql
LISTEN 0      151        127.0.0.1:mysql       0.0.0.0:*    users:(("mysqld",pid=1323,fd=23))
LISTEN 0      70         127.0.0.1:33060       0.0.0.0:*    users:(("mysqld",pid=1323,fd=21))

Note:
For additional information on installing and configuring MySQL, please refer to the Installing MySQL section of the Ubuntu Server documentation.

Install remaining packages using apt:

$ sudo apt install -y libdbi-perl libdbd-mysql-perl \
  libmysqlclient-dev python3-mysqldb perl-doc \
  libwww-perl

Step 2: Create a Database for Storing Taxonomy Data

Open the MySQL shell as the root user:

$ sudo mysql -u root

Create Database

Run the following SQL command. Note the use of backticks (`) around the database name (biosql), instead of single quotes:

CREATE DATABASE `biosql` COLLATE 'utf8_general_ci';

Create Database User

Run the following commands to create a database user and grant it permission to access and modify the biosql database.

Note:
Replace ‘your-password-here’ with a strong password in the command below.

CREATE USER `lineager`@`localhost` IDENTIFIED BY 'your-password-here';
GRANT ALL PRIVILEGES ON `biosql`.* TO `lineager`@`localhost`;

Type exit to quit the MySQL shell.

Initialise Database With BioSQL Schema

First, create a file named .my.cnf in your home directory with the following content:

[client]
user = lineager
password = your-password-here
database = biosql

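The mysql command-line client reads this file automatically, which is why no password is given on the command line later in this step. The lineager.py script also uses it for connecting to the database. Replace your-password-here with the password you set when creating the database user.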

Since the .my.cnf file contains the database password in plain text, it is a good idea to make it readable only by the current user. To do this, use the following command:

$ chmod 600 ~/.my.cnf
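
If you want to check the file from Python, the following is a minimal sketch that reads the [client] section and confirms that the permissions are restricted. The function names read_client_options and is_private are my own for illustration and are not taken from lineager.py. The sketch assumes a simple option file like the one above; it does not handle quoted values or include directives.

```python
import configparser
import os
import stat


def read_client_options(path):
    """Return the [client] section of a MySQL option file as a dict."""
    # interpolation=None keeps '%' characters in passwords literal.
    # allow_no_value=True tolerates bare flags such as 'quick'.
    parser = configparser.ConfigParser(interpolation=None, allow_no_value=True)
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return dict(parser["client"])


def is_private(path):
    """Return True if group and others have no permissions on the file."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

For example, read_client_options(os.path.expanduser("~/.my.cnf"))["user"] should give lineager for the file shown above, and is_private should return True once chmod 600 has been applied.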

Next, download the BioSQL schema:

$ wget https://github.com/biosql/biosql/archive/refs/heads/master.zip

Extract the archive and change directory:

$ unzip master.zip
$ cd biosql-master

Initialise the database by running the biosqldb-mysql.sql script:

$ mysql -u lineager -D biosql < sql/biosqldb-mysql.sql

Step 3: Import NCBI Taxonomy Data Into Database

While still in the biosql-master directory, create a directory to store taxonomy data and change into it:

$ mkdir taxdata
$ cd taxdata

Download the taxonomy database (~70MB) and its MD5 checksum:

$ wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
$ wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.md5

Ensure the file has downloaded correctly by verifying its MD5 checksum. A successful download should return “OK”:

$ md5sum -c taxdump.tar.gz.md5
taxdump.tar.gz: OK
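
If md5sum is not available, the same check can be sketched in Python with the standard hashlib module. This assumes the .md5 file uses the usual md5sum layout, with the hex digest as the first whitespace-separated token. The function names are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path):
    """Read the digest from an md5sum-style file: '<hex digest>  <filename>'."""
    with open(md5_path, encoding="ascii") as handle:
        return handle.read().split()[0].lower()


def download_is_intact(archive_path, md5_path):
    """Return True if the archive matches the digest in the .md5 file."""
    return md5_of_file(archive_path) == expected_md5(md5_path)
```

Calling download_is_intact("taxdump.tar.gz", "taxdump.tar.gz.md5") in the taxdata directory should return True for a complete download.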

Uncompress the archive:

$ tar zxvf taxdump.tar.gz
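
The archive contains, among other files, nodes.dmp and names.dmp, which the import script reads in the next step. If you are curious about their layout, here is a small sketch for parsing names.dmp records. In these files the fields are separated by a tab, a pipe, and a tab, and each record ends with a tab and a pipe. The helper names are my own and the import itself does not need this code.

```python
def split_dmp_line(line):
    """Split one taxdump record into its fields."""
    line = line.rstrip("\r\n")
    # Each record ends with a tab followed by a pipe; drop that terminator.
    if line.endswith("\t|"):
        line = line[:-2]
    # Fields are separated by tab, pipe, tab.
    return line.split("\t|\t")


def scientific_names(lines):
    """Map tax_id to scientific name for an iterable of names.dmp records."""
    names = {}
    for line in lines:
        # names.dmp columns: tax_id, name, unique name, name class
        tax_id, name, _unique_name, name_class = split_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name
    return names
```

To try it on the real data, pass an open file object, for example scientific_names(open("names.dmp", encoding="utf-8")).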

Before running the script to import taxonomy data, set the DBI_PASSWORD environment variable. This is the same password you set for the database user in Step 2:

$ export DBI_PASSWORD='your-password-here'

Now run the script to import taxonomy data into the database:

$ cd ..
$ perl scripts/load_ncbi_taxonomy.pl --dbname biosql --dbuser lineager

Note:
This will take a long time to complete. Leave the terminal session open.

Output:

Loading NCBI taxon database in taxdata:
    ... retrieving all taxon nodes in the database
    ... reading in taxon nodes from nodes.dmp
    ... insert / update / delete taxon nodes
    ... updating new parent IDs
    ... (committing nodes)
    ... rebuilding nested set left/right values
    ... reading in taxon names from names.dmp
    ... deleting old taxon names
    ... inserting new taxon names
    ... cleaning up
Done.

To view all the options supported by the script, use the command:

$ perl scripts/load_ncbi_taxonomy.pl --help
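
The "nested set left/right values" mentioned in the output are what make lineage lookups possible with a single query: every ancestor of a node has a left_value less than or equal to the node's and a right_value greater than or equal to it. The sketch below illustrates the idea on a toy tree in an in-memory SQLite database, using only a few columns named after those in the BioSQL taxon and taxon_name tables. The data is made up for the example and the query is my own illustration; lineager.py may query the database differently. With MySQLdb, the placeholder would be %s instead of ?.

```python
import sqlite3

# Reduced versions of the BioSQL taxon and taxon_name tables.
SCHEMA = """
CREATE TABLE taxon (
    taxon_id INTEGER PRIMARY KEY,
    parent_taxon_id INTEGER,
    left_value INTEGER,
    right_value INTEGER
);
CREATE TABLE taxon_name (
    taxon_id INTEGER,
    name TEXT,
    name_class TEXT
);
"""

# Toy tree: (taxon_id, parent_taxon_id, left_value, right_value)
TOY_TAXA = [
    (1, None, 1, 10),  # cellular organisms
    (2, 1, 2, 7),      # Eukaryota
    (3, 2, 3, 6),      # Bos
    (4, 3, 4, 5),      # Bos taurus
    (5, 1, 8, 9),      # Bacteria
]
TOY_NAMES = [
    (1, "cellular organisms", "scientific name"),
    (2, "Eukaryota", "scientific name"),
    (3, "Bos", "scientific name"),
    (4, "Bos taurus", "scientific name"),
    (4, "cattle", "genbank common name"),
    (5, "Bacteria", "scientific name"),
]

# Ancestors enclose the query node's left/right interval.
LINEAGE_SQL = """
SELECT anc_name.name
FROM taxon_name AS query_name
JOIN taxon AS node ON node.taxon_id = query_name.taxon_id
JOIN taxon AS anc
  ON anc.left_value <= node.left_value
 AND anc.right_value >= node.right_value
JOIN taxon_name AS anc_name
  ON anc_name.taxon_id = anc.taxon_id
 AND anc_name.name_class = 'scientific name'
WHERE query_name.name = ?
  AND query_name.name_class = 'scientific name'
ORDER BY anc.left_value
"""


def build_toy_database():
    """Create the in-memory toy database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO taxon VALUES (?, ?, ?, ?)", TOY_TAXA)
    conn.executemany("INSERT INTO taxon_name VALUES (?, ?, ?)", TOY_NAMES)
    return conn


def lineage(conn, organism):
    """Return the lineage as a semicolon-separated string, root first."""
    rows = conn.execute(LINEAGE_SQL, (organism,)).fetchall()
    return ";".join(name for (name,) in rows)
```

Ordering by left_value lists the ancestors from the root down to the organism itself, which is the same order used in the lineage output shown in this post.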

Step 4: Run Python Script to Get Lineage

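Download the source code of lineager using wget and extract the archive. The -O option saves it under a different name, so that it does not clash with the BioSQL master.zip downloaded earlier:

$ cd
$ wget -O lineager.zip https://codeberg.org/vimalkvn/lineager/archive/master.zip
$ unzip lineager.zip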

Alternatively, clone the Git repository:

$ git clone https://codeberg.org/vimalkvn/lineager

Run the lineager.py script with the name of the organism, for example:

$ cd lineager
$ python3 lineager.py -n Escherichia coli

Output:

Organism,Lineage
Escherichia coli,cellular organisms;Bacteria;Pseudomonadati;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli

You can also get the lineage for multiple organisms at once. To do this, create a text file containing names of organisms like this:

Canis lupus familiaris
Bos taurus
Escherichia
AMBIGUOUS
Arabidopsis thaliana

Save the file as input.txt and then run the script like this:

$ python3 lineager.py -f input.txt

When complete, the script writes an output file, lineage.csv, to the same directory. It contains the lineage information for all organisms in the input file.
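
For further processing, the CSV output can be read back into Python. This is a small sketch that assumes lineage.csv has the same Organism,Lineage header and semicolon-separated lineage as the command-line output shown earlier. The function name read_lineages is illustrative, and rows with an empty lineage are mapped to an empty list.

```python
import csv


def read_lineages(handle):
    """Map each organism name to its lineage as a list, root first."""
    lineages = {}
    for row in csv.DictReader(handle):
        # Treat a missing or blank Lineage column as "no lineage found".
        lineage = (row.get("Lineage") or "").strip()
        lineages[row["Organism"]] = lineage.split(";") if lineage else []
    return lineages
```

To use it, open the file with newline="" as recommended for the csv module, for example: with open("lineage.csv", newline="") as handle: lineages = read_lineages(handle).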

If you have any questions or comments, please send me an email.