This project implements and evaluates artificial neural networks (ANNs) to classify mice into eight experimental groups based on protein expression and modifications detected in the nuclear fraction of the cortex.
The classification task involves 8 classes determined by three factors:
- Genotype: control (c) or trisomic (t).
- Behavioral condition: context-shock (CS) or shock-context (SC).
- Treatment: memantine (m) or saline (s).
The 8 classes are:
- c-CS-s: control – context-shock – saline
- c-CS-m: control – context-shock – memantine
- c-SC-s: control – shock-context – saline
- c-SC-m: control – shock-context – memantine
- t-CS-s: trisomic – context-shock – saline
- t-CS-m: trisomic – context-shock – memantine
- t-SC-s: trisomic – shock-context – saline
- t-SC-m: trisomic – shock-context – memantine
The objective is the implementation and evaluation of fully connected neural networks (dense layers) for multiclass classification.
data3.csv: Protein expression levels (1080 samples × 73 columns).
- Column 1: sample identifier.
- Columns 2–73: protein expression values.
class3.csv: Class labels (1080 samples × 1 column, values 1–8).
No missing values are present in the dataset.
- Summary statistics were computed for all protein features.
- Class distribution was visualized (class_distribution.png).
- Protein correlation analysis was performed (correlation_matrix.png).
Outputs:
- summary_statistics.csv
- class_distribution.png
- correlation_matrix.png
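The summary-statistics step above can be sketched as follows. This is an illustrative, self-contained example: the toy DataFrame and its column names stand in for data3.csv, which in the actual notebook is loaded from the data/ folder.

```python
import pandas as pd
import numpy as np

# Toy stand-in for data3.csv: an ID column plus a few protein columns.
# (Column names here are illustrative, not the real protein identifiers.)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "sample_id": [f"s{i}" for i in range(6)],
    "protein_a": rng.random(6),
    "protein_b": rng.random(6),
})

# Per-feature summary statistics, as saved to summary_statistics.csv
summary = data.drop(columns="sample_id").describe().T
print(summary[["mean", "std", "min", "max"]])
```

On the real data the same `describe().T` call produces one row per protein feature, which is then written out with `summary.to_csv(...)`.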
- Removal of sample ID column.
- Normalization of protein features using Min-Max scaling to [0, 1].
- Train-test split: 67% training (723 samples), 33% test (357 samples).
- Encoding of class labels from 1–8 to integers 0–7.
Normalized data is saved in data_normalized.csv.
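The preprocessing steps above can be sketched with scikit-learn. The toy arrays below stand in for the real 1080 × 72 feature matrix and the 1–8 labels; the scaling, label shift, and 67/33 split mirror the pipeline described.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix standing in for the 72 protein columns.
rng = np.random.default_rng(42)
X = rng.normal(size=(12, 4))
y = rng.integers(1, 9, size=12)   # class labels 1-8, as in class3.csv

# Min-Max scaling of each feature to [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)

# Shift labels from 1-8 to 0-7 for sparse categorical crossentropy
y_encoded = y - 1

# 67/33 train-test split (723/357 on the real 1080 samples)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_encoded, test_size=0.33, random_state=0)
print(X_train.shape, X_test.shape)
```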
Model 1:
- Input layer: 72 protein features.
- Hidden layer: Dense(35) with ReLU activation.
- Dropout layer: Dropout(0.2).
- Output layer: Dense(8) with softmax activation.
Parameters:
- Trainable parameters: 2,843
- Optimizer: Adam
- Loss function: Sparse categorical crossentropy
- Epochs: 50
- Validation split: 20%
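The reported trainable-parameter figure can be checked by hand: each Dense layer contributes (inputs × units) weights plus one bias per unit, and Dropout adds no parameters.

```python
# Parameter count for the Dense(35) -> Dropout -> Dense(8) architecture
hidden = 72 * 35 + 35    # Dense(35) on 72 input features: 2,555
output = 35 * 8 + 8      # Dense(8) softmax head: 288
total = hidden + output
print(total)  # 2843, matching the reported trainable parameter count
```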
Outputs:
- learning_curve_model_1.png
- confusion_matrix_model_1.png
Model 2:
- Deeper fully connected neural network with additional dense layers.
- Same optimizer, loss function, and training procedure as Model 1.
Outputs:
- learning_curve_model_2.png
- confusion_matrix_model_2.png
Models were evaluated using:
- Accuracy on test set
- Confusion matrix
- Precision, recall, and F1-score
Visual outputs:
- Training vs validation learning curves
- Confusion matrices
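The evaluation metrics listed above can be computed with scikit-learn. The toy label vectors below are purely illustrative (covering 3 of the 8 classes), not actual model predictions.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Illustrative true/predicted labels; real runs use the 357-sample test set.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 1, 2, 1, 0]

acc = accuracy_score(y_true, y_pred)
print(f"accuracy: {acc:.3f}")
print(confusion_matrix(y_true, y_pred))           # rows: true, cols: predicted
print(classification_report(y_true, y_pred))      # per-class precision/recall/F1
```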
- Python 3.9+
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- tensorflow
- Clone the repository:
git clone https://github.com/albaaggbb/protein-expression-classification.git
cd protein-expression-classification
- Install dependencies:
pip install -r requirements.txt
- Place data3.csv and class3.csv in the data/ folder.
- Run the notebook:
jupyter notebook notebooks/protein_expression_classification.ipynb
Results will be saved in the results/ folder.
├── data/
│ ├── data3.csv
│ └── class3.csv
├── notebooks/
│ └── protein_expression_classification.ipynb
├── results/
│ ├── class_distribution.png
│ ├── confusion_matrix_model_1.png
│ ├── confusion_matrix_model_2.png
│ ├── correlation_matrix.png
│ ├── data_normalized.csv
│ ├── learning_curve_model_1.png
│ ├── learning_curve_model_2.png
│ ├── summary_statistics.csv
│ └── protein_expression_classification.html
├── requirements.txt
├── .gitignore
└── README.md

Project developed by Alba Górriz.