A Bioinformatics Toolkit for Mouse Genome (mm9) Analysis
This project implements a bioinformatics pipeline designed to analyze genomic sequences with a specific focus on the Tricarboxylic Acid (TCA) Cycle enzymes. The tool aims to process large-scale genomic data (FASTA/GTF) to identify regulatory signatures, calculate thermodynamic properties of DNA, and predict primer suitability for biochemical assays.
Given my background in biochemistry engineering, the project focuses on bridging the gap between raw sequence data and physical properties such as melting temperature ($T_m$).
- Optimized Data Loading: Efficiently parses zipped FASTA files and gene annotation data using memory-mapped logic and state-machine parsing.
- TCA Pathway Focus: Isolates the Citrate Synthase (Cs) gene to analyze its promoter region.
- DNA Thermodynamics: Includes a $T_m$ prediction engine for genomic sequences based on nucleotide composition.
- PCR Primer Design: Automated scanning for candidate primers within regulatory regions.
- Biochemical Visualization: Generates sliding-window GC content plots. Plotting of regulatory signature comparisons is still in progress.
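As a rough illustration of the state-machine parsing mentioned in the feature list, the sketch below streams FASTA records out of a zip archive. It is not the code in LoadFASTA_Function.py: the function names `iter_fasta` and `load_zipped_fasta` are hypothetical, it assumes the archive's first member is the FASTA file, and it uses plain streaming rather than memory mapping.

```python
import io
import zipfile
from typing import Dict, Iterator, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs using a two-state parser.

    State 1: header is None  -> no record has started yet.
    State 2: header is a str -> sequence lines are being collected.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            # Upper-casing discards soft-masking (lowercase repeats).
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def load_zipped_fasta(zip_path) -> Dict[str, str]:
    """Read all records from the first member of a zip archive (assumption)."""
    with zipfile.ZipFile(zip_path) as archive:
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="ascii")
            return dict(iter_fasta(text))
```

Calling `load_zipped_fasta("selChroms_mm9.fa.zip")` would return a dictionary keyed by chromosome name. This holds every sequence in memory at once, so a real loader for whole chromosomes may prefer to process records one at a time from `iter_fasta`.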
.
├── main.py # Core execution pipeline
├── LoadFASTA_Function.py # High-performance sequence/gene loaders
├── sequence_analysis.py # GC-content, Tm prediction, and Primer design logic
├── tca_analysis.py # Enrichment and isolation of TCA cycle enzymes
├── visualization.py # Matplotlib/Seaborn genomic plotting logic
├── Summary.py # Automated biochemical report generation
└── README.md # Documentation
- OS: Ubuntu 24.04 LTS
- Environment: Python 3.12+
- Dependencies:
pip install numpy matplotlib pandas seaborn
- Ensure mm9_sel_chroms_knownGene.txt and selChroms_mm9.fa.zip are in the project root.
- Run the main pipeline:
python3 main.py
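The pipeline analyzes the upstream regions (promoters) of TCA cycle genes. It calculates the GC content, which largely determines DNA duplex stability:

$$GC\% = \frac{G + C}{A + T + G + C} \times 100$$

where $A$, $T$, $G$, and $C$ are the counts of each nucleotide in the region.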
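A minimal sketch of how GC content and its sliding-window profile could be computed is shown here. The function names, the default window of 100 bp, and the step of 10 bp are illustrative assumptions, not values taken from sequence_analysis.py. Ambiguous bases such as N are excluded from the denominator, which is one possible convention among several.

```python
from typing import List


def gc_content(seq: str) -> float:
    """Return GC percentage over unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # A window made only of N (or an empty string) is reported as 0.0.
    return 100.0 * gc / total if total else 0.0


def sliding_gc(seq: str, window: int = 100, step: int = 10) -> List[float]:
    """GC percentage of each full-length window along the sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(seq) - window
    return [
        gc_content(seq[i:i + window])
        for i in range(0, last_start + 1, step)
    ]
```

The list returned by `sliding_gc` can be passed directly to a Matplotlib line plot, with the window start positions on the x-axis.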
Using the predicted
A deep dive into the Cs promoter allows for the identification of potential transcription factor binding sites and the design of PCR primers for experimental verification.
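The primer scan can be pictured with the sketch below. It assumes the Wallace rule, $T_m = 2(A + T) + 4(G + C)$ in °C, which is a common approximation for short oligonucleotides. The text does not state which formula the project uses, so this is a stand-in. The function names, the 20 nt primer length, and the GC and $T_m$ acceptance ranges are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def wallace_tm(seq: str) -> float:
    """Wallace-rule melting temperature in degrees Celsius (assumed model)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


def find_primers(
    region: str,
    length: int = 20,
    gc_range: Tuple[float, float] = (40.0, 60.0),
    tm_range: Tuple[float, float] = (55.0, 65.0),
) -> List[Dict]:
    """Scan every window of `length` bases and keep plausible primers."""
    if length <= 0:
        raise ValueError("length must be positive")
    region = region.upper()
    hits = []
    for start in range(len(region) - length + 1):
        cand = region[start:start + length]
        # Skip windows containing ambiguous bases such as N.
        if set(cand) - set("ACGT"):
            continue
        gc = 100.0 * (cand.count("G") + cand.count("C")) / length
        tm = wallace_tm(cand)
        if gc_range[0] <= gc <= gc_range[1] and tm_range[0] <= tm <= tm_range[1]:
            hits.append({"start": start, "sequence": cand, "gc": gc, "tm": tm})
    return hits
```

Each hit records its offset within the scanned region, so candidates can be mapped back to genomic coordinates of the Cs promoter. A fuller design step would also check for hairpins, primer dimers, and 3' end composition, which this sketch leaves out.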
Andrés Benjamin Garcés Cifuentes, Biochemistry Engineer specializing in Genomic Data Science and Bioprocess Modeling.