Acceleration of a gene prediction method on multicore clusters





Project

Gene prediction mechanisms are part of the field of computational biology and are used to identify fragments of sequences (usually DNA) that are biologically functional (for example, identify those genes that encode proteins).

Gene identification is one of the first and most important steps to understand the genome of a species once it has been sequenced.

The FragGeneScan[1] tool makes it possible to predict genes in current metagenomic projects, such as the human genome.

FragGeneScan can achieve high precision, especially on large data sets with many short sequences. However, it has the drawback of requiring a very long computing time.

Original tool

Here you can find the original tool on which this project is based. The latest version of the original application can be downloaded from its website.

Documentation

Here you can see the documentation of the application, generated with the Doxygen tool, for a better comprehension of the code.

Project repository

This is the repository used for the development of the project. Here you can find the code and its documentation, including the latest updates.



[1] Rho, M., Tang, H., & Ye, Y. (2010). FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Research, 38(20), e191.

Objectives

The objective of this work is to develop a hybrid parallel application (with threads and MPI processes) that exploits the computational capacity of clusters of multicore systems to accelerate gene prediction.

The output of this tool must be the same as that of FragGeneScan in order to guarantee the correctness of the results.

OpenMP

Implement a multi-threaded parallelization to take advantage of all the resources (processors) offered by the system on which the tool is executed.
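As an illustration of this objective, the following is a minimal sketch of a thread-parallel loop over independent reads. It is not the project's actual code: process_read and process_reads are hypothetical names, and the GC count is only a stand-in for the real per-read prediction work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-read work: counts the G and C
 * bases in one sequence. The real tool would run its prediction here. */
static size_t process_read(const char *seq)
{
    size_t gc = 0;
    for (const char *p = seq; *p != '\0'; ++p) {
        if (*p == 'G' || *p == 'C' || *p == 'g' || *p == 'c') {
            ++gc;
        }
    }
    return gc;
}

/* Processes every read independently and stores one result per read.
 * Each iteration writes only to results[i], so no locking is needed.
 * schedule(dynamic) hands out reads as threads become free, which can
 * help when read lengths vary. If the compiler is not invoked with
 * OpenMP support, the pragma is ignored and the loop runs serially
 * with the same results. */
void process_reads(const char *const *reads, size_t n, size_t *results)
{
    assert(n == 0 || (reads != NULL && results != NULL));
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)n; ++i) {
        results[i] = process_read(reads[i]);
    }
}
```

With GCC or Clang, OpenMP is typically enabled with the -fopenmp flag, and the number of threads can be controlled through the OMP_NUM_THREADS environment variable.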

MPI

Parallelize the tool with MPI processes so that it can use the resources of different nodes in a distributed system and obtain the results faster.
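One common way to distribute work among MPI processes is a contiguous block partition, sketched below under that assumption. The functions block_start and block_count are hypothetical helpers, not part of the project; in an MPI program the rank and size arguments would come from MPI_Comm_rank and MPI_Comm_size.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the first item owned by `rank` when n items are split into
 * `size` contiguous blocks whose lengths differ by at most one.
 * The first (n % size) ranks receive one extra item.
 * rank == size is accepted and yields n (one past the last item). */
size_t block_start(size_t n, int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank <= size);
    size_t base = n / (size_t)size;
    size_t extra = n % (size_t)size;
    size_t r = (size_t)rank;
    return r * base + (r < extra ? r : extra);
}

/* Number of items owned by `rank` under the same partition. */
size_t block_count(size_t n, int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return block_start(n, rank + 1, size) - block_start(n, rank, size);
}
```

Each process would then handle the sequences in the range starting at block_start with block_count elements. This scheme balances the number of sequences, not the number of bases, so it can be uneven when sequence lengths differ widely.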

Results

Measure and analyze the results to quantify the performance improvement that this parallelization achieves over the original tool.
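Two standard metrics for this kind of analysis are speedup (serial time divided by parallel time) and parallel efficiency (speedup divided by the number of workers). The sketch below assumes wall-clock times in the same unit; the function names are hypothetical and the page does not state which metrics the project actually reports.

```c
#include <assert.h>

/* Speedup: how many times faster the parallel run is than the serial run. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Efficiency: speedup per worker (thread or process); 1.0 is ideal. */
double efficiency(double t_serial, double t_parallel, int workers)
{
    assert(workers > 0);
    return speedup(t_serial, t_parallel) / (double)workers;
}
```

For example, a run that takes 120 seconds serially and 30 seconds on 8 workers has a speedup of 4 and an efficiency of 0.5.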

Timeline

Publications of the different project updates in chronological order:

  • Jul 3, 2020

    Release version 1.2

    The version 1.2 release has been created with everything necessary to run this version of the tool. It can be found in the releases section of the GitHub project.

  • Jul 3, 2020

    MPI I/O version

    A version has been created that parallelizes the merging of the output files. This reduces the merging time, since each process is in charge of writing its own part of the output. Support for longer genomes has also been added.

  • Jun 11, 2020

    Release version 1.1

    The version 1.1 release has been created with everything necessary to run this version of the tool. It can be found in the releases section of the GitHub project.

  • Jun 11, 2020

    Load balanced version

    A load-balanced version of the tool has been implemented that takes into account the total number of bases to process across all the sequences. This evens out the work carried out by the different processes.

  • Jun 11, 2020

    Release version 1.0

    The version 1.0 release has been created with everything necessary to run this version of the tool. It can be found in the releases section of the GitHub project.

  • Jun 8, 2020

    First version MPI implementation

    A first version of the tool with MPI has been implemented. The correct operation of the implementation has been verified by using the test_correctness_mpi.sh script, obtaining the same results as the original tool.

  • Jun 5, 2020

    OpenMPI implementation tools

    Tools have been added to facilitate the implementation of a version with MPI. They simplify compiling, running, and testing implementations that use this library.

  • Jun 5, 2020

    .gitignore file

    The .gitignore file has been added to the GitHub repository so that files generated by builds or development environment configurations are not tracked.

  • Jun 4, 2020

    Code Cleaning

    The run_hmm.c code has been cleaned to facilitate its understanding and subsequent stages of implementation. Confusing structures have been modified as they only added complexity to the code.

  • Apr 3, 2020

    Markdown Readme

    The README of the project has been updated to make use of the Markdown language and to improve the visualization and understanding of its description.

  • Mar 29, 2020

    Clean directory

    Files resulting from compilations have been removed. The script to verify the correctness of the executions has also been modified so that it cleans the environment after its execution.

  • Mar 29, 2020

    GNU General Public License v3.0

    A copy of the GNU General Public License v3.0, under which this project is licensed, has been added. The license header has also been added to each of the source files of the project.

  • Mar 28, 2020

    Project website

    A "gh-pages" branch has been created to host the project website on GitHub and make it accessible from the browser. The website has been created using a Bootstrap template.

  • Mar 28, 2020

    Initial documentation

    The initial project documentation has been created with the Doxygen tool to allow a simple visualization and facilitate the understanding of the project structure.

  • Mar 28, 2020

    Correctness test

    A test script has been created in order to verify the correct execution of the program by comparing it with test executions of the original tool.

  • Mar 27, 2020

    Import FragGeneScan tool

    The program has been imported from the official page of the tool on SourceForge so that the project can be carried out under version control with Git.

  • Start the project!
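The load-balanced version in the timeline above distributes work according to the total number of bases. One possible way to do this, shown here only as a sketch and not as the project's actual algorithm, is to cut the list of sequences into contiguous chunks whose cumulative base counts are roughly equal. The name split_by_bases and its interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Splits n sequences into `parts` contiguous chunks with roughly equal
 * numbers of bases. lengths[i] is the base count of sequence i.
 * On return, chunk p covers sequences first[p] .. first[p + 1] - 1,
 * and first[parts] == n. `first` must hold parts + 1 entries. */
void split_by_bases(const size_t *lengths, size_t n, int parts, size_t *first)
{
    assert(parts > 0 && first != NULL);
    assert(n == 0 || lengths != NULL);

    size_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += lengths[i];
    }

    size_t seen = 0; /* bases in sequences already assigned */
    size_t i = 0;    /* next unassigned sequence */
    for (int p = 0; p < parts; ++p) {
        first[p] = i;
        /* Chunk p ends once the running total reaches (p + 1) / parts
         * of all bases. Assumes total * parts fits in size_t. */
        size_t target = total * (size_t)(p + 1) / (size_t)parts;
        while (i < n && seen < target) {
            seen += lengths[i];
            ++i;
        }
    }
    first[parts] = n; /* the last chunk absorbs any trailing sequences */
}
```

For lengths of 100, 10, 10, 10, and 70 bases split in two, this gives one chunk with the first sequence and one with the remaining four, 100 bases each, whereas splitting by sequence count would give 120 and 80 bases.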

Team

Currently the people developing this project are the following:

Bruno Cabado

HPC Master Student