- Introduction
- Before you start
- Workflow
- Accessing the new gene functional annotations
Determining the functional roles of genes and their molecular products in non-model organisms presents substantial challenges. Unlike classical model species such as the house mouse (Mus musculus) or zebrafish (Danio rerio), non-model organisms typically lack extensive and well-curated genomic resources, including high-quality reference genomes, comprehensive multi-tissue transcriptomes, or experimentally validated gene-function datasets across different conditions and experiments. As a result, many genes and protein products remain poorly characterized, complicating efforts to link DNA sequence information with biological function. Consequently, computational tools must be used to infer the biological roles of genes and proteins in less-characterized species. Fortunately, several bioinformatic tool now enable functional information to be inferred from species that have been studied in far greater detail using different approaches:
- eggNOG-mapper infers gene function via orthology, identifying genes derived from a single ancestral gene through speciation events-genes that are generally assumed to retain ancestral functions.
- InterProScan predicts gene function by detecting protein signatures (motifs, domains, and families) that are structurally and functionally conserved across diverse taxa.
- BLAST – Basic Local Alignment Search Tool infers gene function through sequence similarity searches, based on the principle that similar sequences often share similar functions.
In this repository, I present a comprehensive framework in which these tools have been used to retrieve gene/protein functional annotations from multiple databases and vertebrate systems, including Gene Ontology, InterPro, KEGG, and PFAM, for the ruff (Calidris pugnax). This bird species exhibits exceptionally high intraspecific behavioural diversity, and our research group aims to establish it as a model system for studying the molecular and endocrine bases of complex social behaviours. However, genetic resources for this species remain limited; therefore, functional inference must rely heavily on information from other vertebrate systems.
Due to GitHub’s file size limitations, several large files (exceeding 50 MB) are not included in this repository. However, I provide instructions for downloading, generating, and naming these files, as well as for placing them in the correct directories. This workflow makes use of R scripts that I executed in RStudio on a standard laptop (8 cores, 16 GB RAM) running Linux Mint OS. With minor modifications, this workflow can be reproduced on Windows and macOS computers.
Clone this repository by entering the following command in the terminal:
git clone https://github.com/azemella/Ruff_GO_annotations_2025
The list of input files includes gene, transcript, and protein metadata and a protein sequence FASTA file for Calidris pugnax, all of which are publicly available on NCBI, as well as information from public databases such InterPro and Gene Ontology:
- The file
./NCBI_data_and_metadata/ncbi_dataset.csvcontains the complete list of genes and associated metadata for the current ruff reference genome (ASM143184v1). It was downloaded from the NCBI Gene dataset by searching for “Calidris pugnax” under “Taxon.” All available columns were selected under “Select columns,” and the resulting table was downloaded as a text file, renamed toncbi_datasets.csv, and moved to./NCBI_data_and_metadata/ncbi_datasets.csv. - The file
./NCBI_data_and_metadata/ncbi_dataset_protein.csvcontains analogous information at the protein level and is used to link functional annotation results to their corresponding genes. - The file
./NCBI_data_and_metadata/GCF_001431845.1_ASM143184v1_protein.faacontains the complete amino acid sequences of all proteins in the current version of the ruff proteome. This FASTA file serves as the primary input for all downstream bioinformatic tools. It was downloaded from NCBI FTP server. - The file
./NCBI_data_and_metadata/GCF_001431845.1_ASM143184v1_gene_ontology.gafcontains all Gene Ontology terms currently assigned to ruff genes. Because NCBI has not updated this file for several years, many genes lack GO annotations or have incomplete information. Nonetheless, it serves as the baseline to which newly inferred Gene Ontology terms will be added. It was downloaded from the same NCBI FTP directory. - The file
./InterProScan-5.75-106.0/InterPro_database.tsvcontains textual descriptions of all InterPro IDs. The “InterPro entry list” was downloaded from the EBI FTP server and saved as a TSV file.
The following files are not included in this repository due to their large size and have to be downloaded and processed separately:
-
The file
goa_uniprot_all.gaf.gzcontaining all Gene Ontology terms for each UniProtKB ID has to be downloaded from the Gene Ontology Annotation (GOA) Database and placed in the main directory.It can be uncompressed with:
gunzip goa_uniprot_all.gaf.gz
Because this file is ~200 GB, it cannot be opened directly in RStudio. To avoid crashes, it has to be split into ~250 MB chunks using:
split -C 250M -a 3 --numeric-suffixes=1 --additional-suffix=.txt goa_uniprot_all.gaf GOA_UniProt_split/goa_uniprot_split
Approximately 685 files will be generated in the directory GOA_UniProt_split.
-
The file
idmapping.dat.gzhas to be downloaded from the UniProt Database FTP server.It can be uncompressed with:
gunzip idmapping.dat.gz
It must then be split into ~125 MB chunks using:
split -C 125M -a 3 --numeric-suffixes=1 --additional-suffix=.txt idmapping.dat UniProt_IDMapping_split/id_mapping_split
Approximately 660 files will be generated in the directory UniProt_IDMapping_split.
The ruff proteome was first annotated using eggNOG-mapper. This can be done either through their web interface or by running the tool locally as described in their documentation. In this workflow, the proteome was annotated using the web interface (version 2.1.12) with default parameters, restricting the taxonomic scope to vertebrates (tax-scope 7742). The resulting TSV file with functional annotations was renamed ruff_eggnog-mapper_annotations.tsv and moved to ./EggNOG-mapper-2.1.12.
Annotation of the ruff proteome was then performed using a local installation of InterProScan v5.75-106.0 with default parameters:
./interproscan.sh -i ./ruff_annotations/GCF_001431845.1_ASM143184v1_protein.faa -dp -goterms -b ./ruff_annotations/ruff_interproscan_annotations -T ./temp -f TSV -cpu 4
The resulting file, ruff_interproscan_annotations.tsv, was moved to ./InterProScan-5.75-106.0.
Finally, the ruff proteome was annotated using the NCBI Protein BLAST web server (BlastP: Protein > Protein), as running the analysis locally on a standard laptop would be impractical. Because the proteome FASTA file contains more sequences than BLAST can process at once, it was split into 212 smaller FASTA files and moved to ./Ruff_Proteome_Split.
Each chunk was queried against three databases:
- NCBI NR clustered (nr_cluster_seq): a large, comprehensive database integrating protein sequences from multiple sources.
- RefSeq Select proteins (refseq_select): a curated subset of RefSeq protein-coding genes.
- UniProtKB/Swiss-Prot (swissprot): a manually curated section of the UniProt Knowledgebase.
In the Organism section, the search was restricted to vertebrates only (TaxID 7742), while Calidris pugnax was excluded to prevent self-matches. Under General Parameters, the default Max target sequences was reduced to 10, and the Expect threshold was set to 0.00001.
This analysis produced 212 output files for each of the three databases, which were moved to:
./NCBI-blast-2.17.0/NCBI_BlastP_NRclustered_results./NCBI-blast-2.17.0/NCBI_BlastP_RefSeqprotein_results./NCBI-blast-2.17.0/NCBI_BlastP_SwissProt_results
The directory ./Scripts contains 13 numbered R scripts that must be executed sequentially. These scripts process all the previously described files to create a custom Calidris pugnax “Organism Package” using the function makeOrgPackage() from the AnnotationForge package.
1_Prepare_NCBI_annotations.Rimports and merges the two NCBI gene and protein annotation datasets (ncbi_dataset_protein.csvandncbi_dataset.csv) for Calidris pugnax to generate a comprehensiveNCBI_annotations.csvtable. All newly inferred functional annotations will be added to this data frame.2_Prepare_NCBI_GOs.Rimports the current Gene Ontology annotations for Calidris pugnax fromGCF_001431845.1_ASM143184v1_gene_ontology.gafand formats them into a table containing only Gene IDs (GID) and GO terms.3_Prepare_EggNOG_annotations.Rprocesses EggNOG-mapper annotation results, linking protein-level annotations to gene IDs to generate four tables: Gene Ontology terms (EggNOG_GOs.csv), KEGG pathways (EggNOG_KEGGs.csv), PFAM domains (EggNOG_PFAMs.csv), and EggNOG gene symbols/descriptions (EggNOG_annotations.csv).4_Prepare_InterProScan_annotations.Rprocesses InterProScan results, linking protein-level annotations to gene IDs to generate three tables: Gene Ontology terms (InterProScan_GOs.csv), PFAM domains (InterProScan_PFAMs.csv), and InterPro IDs (InterProScan_annotations.csv).5_Prepare_BLAST_results_files.Rimports and cleans NCBI BlastP results from the SwissProt, RefSeq Selected, and NR Clustered databases, filters for the best hit for each query protein, and generates three tables:Blast_SwissProt_Hits.csv,Blast_RefSeqSel_Hits.csv, andBlast_NRClustered_Hits.csv.6_Combine_BLAST_SwissProt_with_GOs.Rextracts all GO terms associated with Swiss-Prot BLAST hits and links them at the gene level, producingBLAST_SwissProt_GOs.csv.7_Prepare_BLAST_RefSeqSel_annotations.Rprocesses RefSeq Selected BLAST hits and maps the resulting IDs to UniProtKB IDs.8_Combine_BLAST_RefSeqSel_with_GOs.Rextracts GO terms for each RefSeq Selected BLAST hit linked to UniProt IDs and summarizes them at the gene level inBLAST_RefSeqSel_GOs.csv.9_Prepare_BLAST_NRClustered_annotations.Rprocesses NR Clustered BLAST hits and maps the resulting IDs to UniProtKB IDs.10_Combine_BLAST_NRClustered_with_GOs.Rextracts GO terms for each NR Clustered BLAST hit linked to UniProtKB IDs and summarizes them at the gene level inBLAST_NRClustered_GOs.csv.11_Combine_all_GO_annotations.Rmerges all Gene Ontology annotation tables (from NCBI, EggNOG, InterProScan, and BLAST) and all PFAM annotation tables (from EggNOG and InterProScan) into two master files:All_combined_GOs.csvandAll_combined_PFAMs.csv.12_Prepare_BLAST_annotations.Radds gene symbols/descriptions to each Swiss-Prot BLAST hit using the corresponding UniProtKB ID, generatingBlast_SwissProt_annotations.csv.13_Prepare_MakeOrgDb_package.Rimports all data tables generated in the previous scripts and formats them for the AnnotationForge package, which creates a custom R gene annotation package (org.Cpugnax.eg.db) for Calidris pugnax.
The functional annotations generated by this workflow are stored in a custom AnnotationDbi 'Organism Package' for Calidris pugnax, named org.Cpugnax.eg.db. This R package is essentially a specialized SQLite database that integrates all the annotation data derived from NCBI, EggNOG-mapper, InterProScan, and BLAST, providing a unified and easily queryable resource within the R environment. Here I present a few example of how this R package can be interrogated to extract the new functional annotations and extra information for one or more genes:
Load the package (assuming you have already generated and installed it using the 13_Prepare_MakeOrgDb_package.R R script):
library("org.Cpugnax.eg.db")
library("GO.db") # Necessary to retrieve Gene Ontology term descriptions
The main function for accessing information from the package is select(). This function requires three main arguments:
keys: A vector of identifiers (e.g., Gene IDs, Gene Symbols, or GO terms) you wish to query.keytype: The type of identifier you are providing in thekeysargument.columns: The types of annotation data you want to retrieve (e.g., Gene Ontology terms, genomic coordinates, or BLAST results).
You can first check all available annotation types and key identifiers:
- Available columns (data types):
columns(org.Cpugnax.eg.db)
- Available keytypes (input identifiers):
keytypes(org.Cpugnax.eg.db)
Here are four examples illustrating how to retrieve specific types of information for the inversion gene HSD17B2 (Gene ID: 106888192) or for a specific Gene Ontology term (e.g., GO:0007610 Behavior).
This code snippet fetches all Gene Ontology terms associated with HSD17B2 and uses the GO.db package to add their descriptions.
# Define the columns to retrieve
GO_cols <- c("GID", "GOALL")
# Retrieve GO terms for Gene ID 106888192
GO_info <- select(org.Cpugnax.eg.db,
keys = "106888192",
keytype = "GID",
columns = GO_cols)
# Add the description for each GO term
GO_info$Description <- Term(GO_info$GOALL)
The output shows all Gene Ontology terms (and their descriptions) associated with the inversion gene HSD17B2.
This code snippet explores the quality of annotations by retrieving summary information derived from NCBI, EggNOG, InterPro, and BLAST for the gene HSD17B2.
# Define the columns to retrieve
cols <- c("Symbol", "NCBI_Description", "EggNOG_Symbol", "EggNOG_Description",
"InterPro_ID", "InterPro_Name", "BLAST_Symbol", "BLAST_Description")
# Retrieve annotations for the inversion gene HSD17B2
gene_info <- select(org.Cpugnax.eg.db,
keys = "HSD17B2",
keytype = "Symbol",
columns = cols)
The output shows that all databases have annotated this gene as "HSD17B2" and with the correct protein family and domains (e.g., "Short-chain dehydrogenase/reductase SDR")
This code snippet fetches the scaffold and start/end coordinates for the inversion gene HSD17B2.
# Define the columns to retrieve
genome_cols <- c("Scaffold", "Scaffold_Start", "Scaffold_End")
# Retrieve genomic coordinates for the inversion gene HSD17B2
gene_genome_info <- select(org.Cpugnax.eg.db,
keys = "106888192",
keytype = "GID",
columns = genome_cols)
The output shows that this gene is located on scaffold NW_015090842.1, which contains the autosomal inversion region. More precisely, this gene starts in position 6257506 and ends in position 6268838.
This code snippet performs a reverse query: it takes a Gene Ontology term (GO:0007610) and retrieves all Calidris pugnax genes that have been annotated with it.
# Define the GO term and columns to retrieve
GO_term <- "GO:0007610"
genes_cols <- c("GID", "Symbol")
# Retrieve all genes annotated with the GO term GO:0007610
GO_genes <- select(org.Cpugnax.eg.db,
keys = GO_term,
keytype = "GOALL",
columns = genes_cols)
The output shows that 950 unique genes in the ruff genome are annotated with the Gene Ontology term 'Behavior'.