Skip to content

azemella/Ruff_GO_annotations_2025

Repository files navigation

License: MIT Bioinformatics Made with R Linux

Gene functional annotations for the ruff (Calidris pugnax)

Table of Contents

Introduction

Determining the functional roles of genes and their molecular products in non-model organisms presents substantial challenges. Unlike classical model species such as the house mouse (Mus musculus) or zebrafish (Danio rerio), non-model organisms typically lack extensive and well-curated genomic resources, including high-quality reference genomes, comprehensive multi-tissue transcriptomes, or experimentally validated gene-function datasets across different conditions and experiments. As a result, many genes and protein products remain poorly characterized, complicating efforts to link DNA sequence information with biological function. Consequently, computational tools must be used to infer the biological roles of genes and proteins in less-characterized species. Fortunately, several bioinformatic tool now enable functional information to be inferred from species that have been studied in far greater detail using different approaches:

  • eggNOG-mapper infers gene function via orthology, identifying genes derived from a single ancestral gene through speciation events-genes that are generally assumed to retain ancestral functions.
  • InterProScan predicts gene function by detecting protein signatures (motifs, domains, and families) that are structurally and functionally conserved across diverse taxa.
  • BLAST – Basic Local Alignment Search Tool infers gene function through sequence similarity searches, based on the principle that similar sequences often share similar functions.

In this repository, I present a comprehensive framework in which these tools have been used to retrieve gene/protein functional annotations from multiple databases and vertebrate systems, including Gene Ontology, InterPro, KEGG, and PFAM, for the ruff (Calidris pugnax). This bird species exhibits exceptionally high intraspecific behavioural diversity, and our research group aims to establish it as a model system for studying the molecular and endocrine bases of complex social behaviours. However, genetic resources for this species remain limited; therefore, functional inference must rely heavily on information from other vertebrate systems.

Before you start

Due to GitHub’s file size limitations, several large files (exceeding 50 MB) are not included in this repository. However, I provide instructions for downloading, generating, and naming these files, as well as for placing them in the correct directories. This workflow makes use of R scripts that I executed in RStudio on a standard laptop (8 cores, 16 GB RAM) running Linux Mint OS. With minor modifications, this workflow can be reproduced on Windows and macOS computers.

Cloning this GitHub repository

Clone this repository by entering the following command in the terminal:

  • git clone https://github.com/azemella/Ruff_GO_annotations_2025

Input files

The list of input files includes gene, transcript, and protein metadata and a protein sequence FASTA file for Calidris pugnax, all of which are publicly available on NCBI, as well as information from public databases such InterPro and Gene Ontology:

  • The file ./NCBI_data_and_metadata/ncbi_dataset.csv contains the complete list of genes and associated metadata for the current ruff reference genome (ASM143184v1). It was downloaded from the NCBI Gene dataset by searching for “Calidris pugnax” under “Taxon.” All available columns were selected under “Select columns,” and the resulting table was downloaded as a text file, renamed to ncbi_datasets.csv, and moved to ./NCBI_data_and_metadata/ncbi_datasets.csv.
  • The file ./NCBI_data_and_metadata/ncbi_dataset_protein.csv contains analogous information at the protein level and is used to link functional annotation results to their corresponding genes.
  • The file ./NCBI_data_and_metadata/GCF_001431845.1_ASM143184v1_protein.faa contains the complete amino acid sequences of all proteins in the current version of the ruff proteome. This FASTA file serves as the primary input for all downstream bioinformatic tools. It was downloaded from NCBI FTP server.
  • The file ./NCBI_data_and_metadata/GCF_001431845.1_ASM143184v1_gene_ontology.gaf contains all Gene Ontology terms currently assigned to ruff genes. Because NCBI has not updated this file for several years, many genes lack GO annotations or have incomplete information. Nonetheless, it serves as the baseline to which newly inferred Gene Ontology terms will be added. It was downloaded from the same NCBI FTP directory.
  • The file ./InterProScan-5.75-106.0/InterPro_database.tsv contains textual descriptions of all InterPro IDs. The “InterPro entry list” was downloaded from the EBI FTP server and saved as a TSV file.

The following files are not included in this repository due to their large size and have to be downloaded and processed separately:

  • The file goa_uniprot_all.gaf.gz containing all Gene Ontology terms for each UniProtKB ID has to be downloaded from the Gene Ontology Annotation (GOA) Database and placed in the main directory.

    It can be uncompressed with:

gunzip goa_uniprot_all.gaf.gz

Because this file is ~200 GB, it cannot be opened directly in RStudio. To avoid crashes, it has to be split into ~250 MB chunks using:

split -C 250M -a 3 --numeric-suffixes=1 --additional-suffix=.txt goa_uniprot_all.gaf GOA_UniProt_split/goa_uniprot_split

Approximately 685 files will be generated in the directory GOA_UniProt_split.

gunzip idmapping.dat.gz

It must then be split into ~125 MB chunks using:

split -C 125M -a 3 --numeric-suffixes=1 --additional-suffix=.txt idmapping.dat UniProt_IDMapping_split/id_mapping_split

Approximately 660 files will be generated in the directory UniProt_IDMapping_split.

Workflow

1. EggNOG-mapper

The ruff proteome was first annotated using eggNOG-mapper. This can be done either through their web interface or by running the tool locally as described in their documentation. In this workflow, the proteome was annotated using the web interface (version 2.1.12) with default parameters, restricting the taxonomic scope to vertebrates (tax-scope 7742). The resulting TSV file with functional annotations was renamed ruff_eggnog-mapper_annotations.tsv and moved to ./EggNOG-mapper-2.1.12.

2. InterProScan

Annotation of the ruff proteome was then performed using a local installation of InterProScan v5.75-106.0 with default parameters:

./interproscan.sh -i ./ruff_annotations/GCF_001431845.1_ASM143184v1_protein.faa -dp -goterms -b ./ruff_annotations/ruff_interproscan_annotations -T ./temp -f TSV -cpu 4

The resulting file, ruff_interproscan_annotations.tsv, was moved to ./InterProScan-5.75-106.0.

3. NCBI Basic Local Alignment Search Tool (BLAST)

Finally, the ruff proteome was annotated using the NCBI Protein BLAST web server (BlastP: Protein > Protein), as running the analysis locally on a standard laptop would be impractical. Because the proteome FASTA file contains more sequences than BLAST can process at once, it was split into 212 smaller FASTA files and moved to ./Ruff_Proteome_Split.

Each chunk was queried against three databases:

In the Organism section, the search was restricted to vertebrates only (TaxID 7742), while Calidris pugnax was excluded to prevent self-matches. Under General Parameters, the default Max target sequences was reduced to 10, and the Expect threshold was set to 0.00001.

This analysis produced 212 output files for each of the three databases, which were moved to:

  • ./NCBI-blast-2.17.0/NCBI_BlastP_NRclustered_results
  • ./NCBI-blast-2.17.0/NCBI_BlastP_RefSeqprotein_results
  • ./NCBI-blast-2.17.0/NCBI_BlastP_SwissProt_results

R scripts

The directory ./Scripts contains 13 numbered R scripts that must be executed sequentially. These scripts process all the previously described files to create a custom Calidris pugnax “Organism Package” using the function makeOrgPackage() from the AnnotationForge package.

  1. 1_Prepare_NCBI_annotations.R imports and merges the two NCBI gene and protein annotation datasets (ncbi_dataset_protein.csv and ncbi_dataset.csv) for Calidris pugnax to generate a comprehensive NCBI_annotations.csv table. All newly inferred functional annotations will be added to this data frame.
  2. 2_Prepare_NCBI_GOs.R imports the current Gene Ontology annotations for Calidris pugnax from GCF_001431845.1_ASM143184v1_gene_ontology.gaf and formats them into a table containing only Gene IDs (GID) and GO terms.
  3. 3_Prepare_EggNOG_annotations.R processes EggNOG-mapper annotation results, linking protein-level annotations to gene IDs to generate four tables: Gene Ontology terms (EggNOG_GOs.csv), KEGG pathways (EggNOG_KEGGs.csv), PFAM domains (EggNOG_PFAMs.csv), and EggNOG gene symbols/descriptions (EggNOG_annotations.csv).
  4. 4_Prepare_InterProScan_annotations.R processes InterProScan results, linking protein-level annotations to gene IDs to generate three tables: Gene Ontology terms (InterProScan_GOs.csv), PFAM domains (InterProScan_PFAMs.csv), and InterPro IDs (InterProScan_annotations.csv).
  5. 5_Prepare_BLAST_results_files.R imports and cleans NCBI BlastP results from the SwissProt, RefSeq Selected, and NR Clustered databases, filters for the best hit for each query protein, and generates three tables: Blast_SwissProt_Hits.csv, Blast_RefSeqSel_Hits.csv, and Blast_NRClustered_Hits.csv.
  6. 6_Combine_BLAST_SwissProt_with_GOs.R extracts all GO terms associated with Swiss-Prot BLAST hits and links them at the gene level, producing BLAST_SwissProt_GOs.csv.
  7. 7_Prepare_BLAST_RefSeqSel_annotations.R processes RefSeq Selected BLAST hits and maps the resulting IDs to UniProtKB IDs.
  8. 8_Combine_BLAST_RefSeqSel_with_GOs.R extracts GO terms for each RefSeq Selected BLAST hit linked to UniProt IDs and summarizes them at the gene level in BLAST_RefSeqSel_GOs.csv.
  9. 9_Prepare_BLAST_NRClustered_annotations.R processes NR Clustered BLAST hits and maps the resulting IDs to UniProtKB IDs.
  10. 10_Combine_BLAST_NRClustered_with_GOs.R extracts GO terms for each NR Clustered BLAST hit linked to UniProtKB IDs and summarizes them at the gene level in BLAST_NRClustered_GOs.csv.
  11. 11_Combine_all_GO_annotations.R merges all Gene Ontology annotation tables (from NCBI, EggNOG, InterProScan, and BLAST) and all PFAM annotation tables (from EggNOG and InterProScan) into two master files: All_combined_GOs.csv and All_combined_PFAMs.csv.
  12. 12_Prepare_BLAST_annotations.R adds gene symbols/descriptions to each Swiss-Prot BLAST hit using the corresponding UniProtKB ID, generating Blast_SwissProt_annotations.csv.
  13. 13_Prepare_MakeOrgDb_package.R imports all data tables generated in the previous scripts and formats them for the AnnotationForge package, which creates a custom R gene annotation package (org.Cpugnax.eg.db) for Calidris pugnax.

How to access the new gene functional annotations

The functional annotations generated by this workflow are stored in a custom AnnotationDbi 'Organism Package' for Calidris pugnax, named org.Cpugnax.eg.db. This R package is essentially a specialized SQLite database that integrates all the annotation data derived from NCBI, EggNOG-mapper, InterProScan, and BLAST, providing a unified and easily queryable resource within the R environment. Here I present a few example of how this R package can be interrogated to extract the new functional annotations and extra information for one or more genes:

Querying the annotation database

Load the package (assuming you have already generated and installed it using the 13_Prepare_MakeOrgDb_package.R R script):

library("org.Cpugnax.eg.db")
library("GO.db") # Necessary to retrieve Gene Ontology term descriptions

The main function for accessing information from the package is select(). This function requires three main arguments:

  • keys: A vector of identifiers (e.g., Gene IDs, Gene Symbols, or GO terms) you wish to query.
  • keytype: The type of identifier you are providing in the keys argument.
  • columns: The types of annotation data you want to retrieve (e.g., Gene Ontology terms, genomic coordinates, or BLAST results).

You can first check all available annotation types and key identifiers:

  • Available columns (data types):
columns(org.Cpugnax.eg.db)
  • Available keytypes (input identifiers):
keytypes(org.Cpugnax.eg.db)

Here are four examples illustrating how to retrieve specific types of information for the inversion gene HSD17B2 (Gene ID: 106888192) or for a specific Gene Ontology term (e.g., GO:0007610 Behavior).

1. Retrieve Gene Ontology terms and descriptions for a gene (by Gene ID)

This code snippet fetches all Gene Ontology terms associated with HSD17B2 and uses the GO.db package to add their descriptions.

# Define the columns to retrieve
GO_cols <- c("GID", "GOALL")

# Retrieve GO terms for Gene ID 106888192
GO_info <- select(org.Cpugnax.eg.db, 
                  keys = "106888192",
                  keytype = "GID",
                  columns = GO_cols)

# Add the description for each GO term
GO_info$Description <- Term(GO_info$GOALL)

The output shows all Gene Ontology terms (and their descriptions) associated with the inversion gene HSD17B2.

2. Retrieve functional annotations from different sources (by Gene Symbol)

This code snippet explores the quality of annotations by retrieving summary information derived from NCBI, EggNOG, InterPro, and BLAST for the gene HSD17B2.

# Define the columns to retrieve
cols <- c("Symbol", "NCBI_Description", "EggNOG_Symbol", "EggNOG_Description", 
          "InterPro_ID", "InterPro_Name", "BLAST_Symbol", "BLAST_Description")

# Retrieve annotations for the inversion gene HSD17B2
gene_info <- select(org.Cpugnax.eg.db, 
                    keys = "HSD17B2",
                    keytype = "Symbol",
                    columns = cols)

The output shows that all databases have annotated this gene as "HSD17B2" and with the correct protein family and domains (e.g., "Short-chain dehydrogenase/reductase SDR")

3. Retrieve the genomic coordinates of a gene (by Gene ID)

This code snippet fetches the scaffold and start/end coordinates for the inversion gene HSD17B2.

# Define the columns to retrieve
genome_cols <- c("Scaffold", "Scaffold_Start", "Scaffold_End")

# Retrieve genomic coordinates for the inversion gene HSD17B2
gene_genome_info <- select(org.Cpugnax.eg.db, 
                           keys = "106888192",
                           keytype = "GID",
                           columns = genome_cols)

The output shows that this gene is located on scaffold NW_015090842.1, which contains the autosomal inversion region. More precisely, this gene starts in position 6257506 and ends in position 6268838.

4. Retrieve all genes annotated with a specific Gene Ontology term

This code snippet performs a reverse query: it takes a Gene Ontology term (GO:0007610) and retrieves all Calidris pugnax genes that have been annotated with it.

# Define the GO term and columns to retrieve
GO_term <- "GO:0007610"
genes_cols <- c("GID", "Symbol")

# Retrieve all genes annotated with the GO term GO:0007610
GO_genes <- select(org.Cpugnax.eg.db, 
                   keys = GO_term,
                   keytype = "GOALL",
                   columns = genes_cols)

The output shows that 950 unique genes in the ruff genome are annotated with the Gene Ontology term 'Behavior'.

About

Annotate the ruff sandpiper (Calidris pugnax) reference proteome with GO, KEGG, PFAM and other information using various bioinformatics tools and databases.

Topics

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages