protpy - Package for generating protein physicochemical, biochemical and structural descriptors using their constituent amino acids.

License: MIT

  • 🧬 A demo of the software is available here
  • 📝 A Medium article about protPy and its background is available here

Introduction

protpy is a Python software package for generating a variety of physicochemical, biochemical and structural descriptors for proteins. All of these descriptors are calculated from sequence-derived or physicochemical features of the amino acids that make up the proteins. They have been widely studied and used in a range of bioinformatics applications, including protein engineering, sequence-activity relationships (SAR), prediction of protein structure and function, subcellular localization, protein-protein interactions and drug-target interactions.

This software is aimed at any researcher or developer working with protein sequence or structural data. It was created mainly for use in my own project, pySAR, which uses protein sequence data to identify Sequence Activity Relationships (SAR) using Machine Learning [1]. protpy is built and developed in Python 3.10.

The descriptors available in protpy include:

Composition Descriptors (23)
  • Amino Acid Composition (AAComp)
  • Dipeptide Composition (DPComp)
  • Tripeptide Composition (TPComp)
  • Grand Average of Hydropathy (GRAVY)
  • Aromaticity
  • Instability Index
  • Isoelectric Point
  • Molecular Weight
  • Charge Distribution
  • Hydrophobic/Polar/Charged Composition (HPC)
  • Secondary Structure Propensity (SSP)
  • k-mer Composition
  • Reduced Alphabet Composition
  • Motif Composition
  • Amino Acid Pair Composition
  • Aliphatic Index
  • Extinction Coefficient
  • Boman Index
  • Aggregation Propensity
  • Hydrophobic Moment
  • Shannon Entropy
  • Pseudo Amino Acid Composition (PAAComp)
  • Amphiphilic Amino Acid Composition (APAAComp)
Autocorrelation Descriptors (3)
  • MoreauBroto Autocorrelation (MBAuto)
  • Moran Autocorrelation (MAuto)
  • Geary Autocorrelation (GAuto)
Conjoint Triad (1)
  • Conjoint Triad (CTriad)
CTD Descriptors (4)
  • CTD Composition
  • CTD Transition
  • CTD Distribution
  • CTD Combined
Sequence Order Descriptors (5)
  • Sequence Order Coupling Number - single (SOCN)
  • Sequence Order Coupling Number - series
  • Sequence Order Coupling Number - all matrices
  • Quasi Sequence Order (QSO)
  • Quasi Sequence Order - all matrices

More detail on each descriptor is given in the markdown file DESCRIPTORS.md.

Requirements

Installation

Install the latest version of protpy using pip:

pip3 install protpy --upgrade

Install by cloning repository:

git clone https://github.com/amckenna41/protpy.git
cd protpy
python3 setup.py install

Usage

Import protpy after installation:

import protpy

Import protein sequence from fasta:

from Bio import SeqIO

with open("test_fasta.fasta") as pro:
    protein_seq = str(next(SeqIO.parse(pro,'fasta')).seq)

Composition Descriptors Usage Examples

Calculate Amino Acid Composition:

amino_acid_composition = protpy.amino_acid_composition(protein_seq)
# A      C      D      E      F ...
# 6.693  3.108  5.817  3.347  6.614 ...
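
As a rough illustration of what this descriptor measures, the sketch below computes the percentage of each of the 20 standard amino acids in plain Python. It is not protpy's implementation: the function name `aa_composition_sketch` and the rounding to 3 decimal places are assumptions chosen to resemble the output above.

```python
from collections import Counter

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition_sketch(sequence: str) -> dict[str, float]:
    """Return the percentage of each standard amino acid in `sequence`.

    Hypothetical helper for illustration only; not part of protpy.
    """
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    # Residues outside the 20-letter alphabet still count towards the length.
    return {aa: round(100 * counts[aa] / len(sequence), 3) for aa in AMINO_ACIDS}


print(aa_composition_sketch("ACDAA"))
```

Returning all 20 keys, including zero-valued ones, keeps the feature vector the same length for every sequence.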

Calculate Dipeptide Composition:

dipeptide_composition = protpy.dipeptide_composition(protein_seq)
# AA    AC    AD   AE    AF ...
# 0.72  0.16  0.48  0.4  0.24 ...

Calculate Tripeptide Composition:

tripeptide_composition = protpy.tripeptide_composition(protein_seq)
# AAA  AAC  AAD  AAE  AAF ...
# 1    0    0    2    0 ...

Calculate GRAVY (Grand Average of Hydropathy):

gravy = protpy.gravy(protein_seq)
# GRAVY
# -0.045
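
GRAVY is conventionally the mean Kyte-Doolittle hydropathy value over all residues. The following is a minimal sketch under that assumption, with a hypothetical function name; protpy's own implementation may differ in rounding or in how it treats non-standard residues (this sketch simply raises `KeyError` for them).

```python
# Kyte-Doolittle hydropathy scale (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy_sketch(sequence: str) -> float:
    """Mean hydropathy of the sequence (illustrative, not protpy's code)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    total = sum(KYTE_DOOLITTLE[aa] for aa in sequence)
    return round(total / len(sequence), 3)


print(gravy_sketch("ACDEFGHIK"))
```

Positive values indicate a more hydrophobic sequence overall, negative values a more hydrophilic one.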

Calculate Aromaticity:

aromaticity = protpy.aromaticity(protein_seq)
# Aromaticity
# 0.118
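
Aromaticity is commonly defined as the relative frequency of the aromatic residues Phe (F), Trp (W) and Tyr (Y). A minimal sketch, assuming that definition and a hypothetical function name:

```python
AROMATIC_RESIDUES = "FWY"


def aromaticity_sketch(sequence: str) -> float:
    """Fraction of residues that are F, W or Y (illustrative only)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    aromatic = sum(sequence.count(aa) for aa in AROMATIC_RESIDUES)
    return round(aromatic / len(sequence), 3)


print(aromaticity_sketch("FWYA"))
```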

Calculate Instability Index:

instability = protpy.instability_index(protein_seq)
# InstabilityIndex
# 31.836

Calculate Isoelectric Point:

pi = protpy.isoelectric_point(protein_seq)
# IsoelectricPoint
# 5.412

Calculate Molecular Weight:

mw = protpy.molecular_weight(protein_seq)
# MolecularWeight (Da)
# 139122.355

Calculate Charge Distribution:

charge = protpy.charge_distribution(protein_seq)
#using default parameters: ph=7.4

# PositiveCharge  NegativeCharge  NetCharge
# 99.526          114.956         -15.43

Calculate Hydrophobic/Polar/Charged Composition:

hpc = protpy.hydrophobic_polar_charged_composition(protein_seq)
# Hydrophobic  Polar   Charged
# 44.542       32.669  18.247

Calculate Secondary Structure Propensity:

ssp = protpy.secondary_structure_propensity(protein_seq)
# Helix  Sheet  Coil
# 0.983  1.05   1.043

Calculate k-mer Composition:

kmer = protpy.kmer_composition(protein_seq)
#using default parameters: k=2

# AA     AC     AD  ...
# 0.797  0.159  ... ...
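
The idea behind k-mer composition can be sketched with a sliding window: every overlapping substring of length k is counted and expressed as a percentage of all windows. This is an illustration under assumptions, not protpy's code: the function name is hypothetical, and unlike the output above it only reports k-mers that actually occur.

```python
from collections import Counter


def kmer_composition_sketch(sequence: str, k: int = 2) -> dict[str, float]:
    """Percentage of each observed overlapping k-mer (illustrative only)."""
    sequence = sequence.upper()
    if k < 1 or len(sequence) < k:
        raise ValueError("k must be between 1 and the sequence length")
    # Slide a window of width k along the sequence, one residue at a time.
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    return {
        kmer: round(100 * n / len(windows), 3)
        for kmer, n in sorted(counts.items())
    }


print(kmer_composition_sketch("AAAC", k=2))
```

With k=2 this reduces to a dipeptide composition over the observed pairs, and with k=3 to a tripeptide composition.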

Calculate Reduced Alphabet Composition:

reduced = protpy.reduced_alphabet_composition(protein_seq)
#using default parameters: alphabet_size=6

# Group_1  Group_2  Group_3  Group_4  Group_5  Group_6
# 25.339   34.741   9.163    9.084    10.837   10.837

Calculate Motif Composition:

motif = protpy.motif_composition(protein_seq)
# NxST_glycosylation  RGD_integrin  KDEL_retention  ...
# 23                  0             0               ...
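
Motif counting of this kind can be approximated with regular expressions. In the sketch below the motif names are taken from the output above, but the patterns themselves are assumptions (for example, the N-x-S/T sequon with x not Pro, and KDEL only at the C-terminus); the patterns protpy actually uses may differ.

```python
import re

# Assumed patterns for illustration; not taken from protpy's source.
MOTIF_PATTERNS = {
    "NxST_glycosylation": r"N[^P][ST]",  # N-glycosylation sequon, x != Pro
    "RGD_integrin": r"RGD",              # integrin-binding tripeptide
    "KDEL_retention": r"KDEL$",          # ER retention signal at the C-terminus
}


def motif_counts_sketch(sequence: str) -> dict[str, int]:
    """Count (possibly overlapping) occurrences of each motif."""
    sequence = sequence.upper()
    # A zero-width lookahead lets overlapping matches be counted separately.
    return {
        name: len(re.findall(f"(?={pattern})", sequence))
        for name, pattern in MOTIF_PATTERNS.items()
    }


print(motif_counts_sketch("MNGSRGDAANPTKDEL"))
```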

Calculate Amino Acid Pair Composition:

aapair = protpy.amino_acid_pair_composition(protein_seq)
# AA_Hydrophobic-Hydrophobic  AA_Hydrophobic-Polar  ...
# 0.797                       0.159                 ...

Calculate Aliphatic Index:

aliphatic = protpy.aliphatic_index(protein_seq)
# AliphaticIndex
# 82.725
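
The aliphatic index is usually defined from the mole percentages of Ala, Val, Ile and Leu as AI = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)). A minimal sketch, assuming that standard formula and a hypothetical function name:

```python
def aliphatic_index_sketch(sequence: str) -> float:
    """Aliphatic index from mole percentages of A, V, I and L."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")

    def mole_percent(aa: str) -> float:
        return 100 * sequence.count(aa) / len(sequence)

    index = (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )
    return round(index, 3)


print(aliphatic_index_sketch("AVIL"))
```

The coefficients 2.9 and 3.9 weight the larger aliphatic side chains of Val and of Ile/Leu relative to Ala.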

Calculate Extinction Coefficient:

extinction = protpy.extinction_coefficient(protein_seq)
# ExtCoeff_Reduced  ExtCoeff_Oxidized
# 140960            143335
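
A common estimate of the molar extinction coefficient at 280 nm sums per-residue contributions of 5500 for Trp, 1490 for Tyr and 125 for each cystine (disulfide-bonded Cys pair). The sketch below assumes those widely used values and that the oxidized form pairs all cysteines; the 2375 difference between the two values above equals 19 x 125, which is consistent with this convention, but protpy's exact method is not confirmed here.

```python
# Assumed molar extinction contributions at 280 nm (M^-1 cm^-1).
EXT_TRP = 5500
EXT_TYR = 1490
EXT_CYSTINE = 125


def extinction_coefficient_sketch(sequence: str) -> tuple[int, int]:
    """Return (reduced, oxidized) extinction coefficients (illustrative)."""
    sequence = sequence.upper()
    reduced = EXT_TRP * sequence.count("W") + EXT_TYR * sequence.count("Y")
    # In the oxidized form every pair of cysteines is assumed to form a cystine.
    oxidized = reduced + EXT_CYSTINE * (sequence.count("C") // 2)
    return reduced, oxidized


print(extinction_coefficient_sketch("WYCC"))
```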

Calculate Boman Index:

boman = protpy.boman_index(protein_seq)
# BomanIndex
# 0.119

Calculate Aggregation Propensity:

aggregation = protpy.aggregation_propensity(protein_seq)
# AggregProneRegions  AggregProneFraction
# 58                  11.793

Calculate Hydrophobic Moment:

hm = protpy.hydrophobic_moment(protein_seq)
#using default parameters: window=11, angle=100

# HydrophobicMoment_Mean  HydrophobicMoment_Max
# 0.272                   0.813

Calculate Shannon Entropy:

se = protpy.shannon_entropy(protein_seq)
# ShannonEntropy
# 4.163
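
Shannon entropy measures how evenly the residues are distributed: H = sum over residues of p * log2(1/p), where p is the residue's relative frequency. The sketch below assumes base-2 logarithms (the value above is below log2(20), about 4.32, which is consistent with that) and uses a hypothetical function name.

```python
import math
from collections import Counter


def shannon_entropy_sketch(sequence: str) -> float:
    """Shannon entropy in bits of the residue distribution (illustrative)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    n = len(sequence)
    # p * log2(1 / p) with p = count / n, summed over observed residues.
    entropy = sum((c / n) * math.log2(n / c) for c in Counter(sequence).values())
    return round(entropy, 3)


print(shannon_entropy_sketch("AACC"))
```

A sequence made of a single repeated residue has entropy 0, while a sequence using all 20 amino acids equally would reach the maximum of log2(20).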

Calculate Pseudo Composition:

pseudo_composition = protpy.pseudo_amino_acid_composition(protein_seq)
#using default parameters: lamda=30, weight=0.05, properties=[]

# PAAC_1  PAAC_2  PAAC_3  PAAC_4  PAAC_5 ...
# 0.127   0.059   0.111   0.064   0.126 ...

Calculate Amphiphilic Composition:

amphiphilic_composition = protpy.amphiphilic_pseudo_amino_acid_composition(protein_seq)
#using default parameters: lamda=30, weight=0.5, properties=[hydrophobicity_, hydrophilicity_]

# APAAC_1  APAAC_2  APAAC_3  APAAC_4  APAAC_5 ...
# 6.624    3.076    5.757    3.032    5.988 ...

Autocorrelation Descriptors Usage Examples

Calculate MoreauBroto Autocorrelation:

moreaubroto_autocorrelation = protpy.moreaubroto_autocorrelation(protein_seq)
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True

# MBAuto_CIDH920105_1  MBAuto_CIDH920105_2  MBAuto_CIDH920105_3  MBAuto_CIDH920105_4  MBAuto_CIDH920105_5 ...  
# -0.052               -0.104               -0.156               -0.208               0.246 ...

Calculate Moran Autocorrelation:

moran_autocorrelation = protpy.moran_autocorrelation(protein_seq)
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True

# MAuto_CIDH920105_1  MAuto_CIDH920105_2  MAuto_CIDH920105_3  MAuto_CIDH920105_4  MAuto_CIDH920105_5 ...
# -0.07786            -0.07879            -0.07906            -0.08001            0.14911 ...

Calculate Geary Autocorrelation:

geary_autocorrelation = protpy.geary_autocorrelation(protein_seq)
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True

# GAuto_CIDH920105_1  GAuto_CIDH920105_2  GAuto_CIDH920105_3  GAuto_CIDH920105_4  GAuto_CIDH920105_5 ...
# 1.057               1.077               1.04                1.02                1.013 ...
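
The three autocorrelation descriptors share one idea: map each residue to a property value, then compare values that are `lag` positions apart. The sketch below uses the commonly cited textbook formulas for a single property and a single lag. It is an illustration under assumptions: the function name and the toy property scale are hypothetical, and it skips the property standardization and AAindex lookup that protpy's defaults imply.

```python
def autocorrelation_sketch(values: list[float], lag: int) -> tuple[float, float, float]:
    """Return (Moreau-Broto, Moran, Geary) autocorrelation for one lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    # Pairs of property values separated by `lag` positions.
    pairs = list(zip(values, values[lag:]))
    squared_dev = sum((v - mean) ** 2 for v in values)
    if squared_dev == 0:
        raise ValueError("property values must not all be identical")

    # Normalized Moreau-Broto: average product of the paired values.
    moreau_broto = sum(a * b for a, b in pairs) / (n - lag)
    # Moran: lagged covariance divided by the overall variance.
    moran = (sum((a - mean) * (b - mean) for a, b in pairs) / (n - lag)) / (squared_dev / n)
    # Geary: mean squared difference of pairs, scaled by the sample variance.
    geary = (sum((a - b) ** 2 for a, b in pairs) / (2 * (n - lag))) / (squared_dev / (n - 1))
    return moreau_broto, moran, geary


# Toy property scale for four residues (hypothetical, for demonstration only).
TOY_SCALE = {"A": 1.8, "L": 3.8, "K": -3.9, "E": -3.5}
property_values = [TOY_SCALE[aa] for aa in "ALKEALKE"]
print(autocorrelation_sketch(property_values, lag=1))
```

Repeating the calculation for each lag from 1 to the maximum lag, and for each property, gives one feature per (property, lag) pair, matching the column naming shown above.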

Conjoint Triad Descriptors Usage Examples

Calculate Conjoint Triad:

conjoint_triad = protpy.conjoint_triad(protein_seq)
# 111  112  113  114  115 ...
# 7    17   11   3    6 ...
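
The conjoint triad method [9] groups the 20 amino acids into 7 classes and counts every run of three consecutive class labels, giving 7 x 7 x 7 = 343 features. A minimal sketch, assuming the class grouping from the original publication and a hypothetical function name; protpy's own implementation may order or normalize the output differently.

```python
from itertools import product

# Seven residue classes as commonly given for the conjoint triad method.
TRIAD_CLASSES = {
    "1": "AGV", "2": "ILFP", "3": "YMTS", "4": "HNQW",
    "5": "RK", "6": "DE", "7": "C",
}
AA_TO_CLASS = {aa: label for label, aas in TRIAD_CLASSES.items() for aa in aas}


def conjoint_triad_sketch(sequence: str) -> dict[str, int]:
    """Count each of the 343 class triads in the sequence (illustrative)."""
    # Raises KeyError for residues outside the 20 standard amino acids.
    encoded = "".join(AA_TO_CLASS[aa] for aa in sequence.upper())
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(encoded) - 2):
        counts[encoded[i:i + 3]] += 1
    return counts


triads = conjoint_triad_sketch("AGVC")
print({k: v for k, v in triads.items() if v})
```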

CTD Descriptors Usage Examples

Calculate CTD:

ctd = protpy.ctd(protein_seq)
#using default parameters: property="hydrophobicity", all_ctd=True

# hydrophobicity_CTD_C_01  hydrophobicity_CTD_C_02  hydrophobicity_CTD_C_03  normalized_vdwv_CTD_C_01 ...
# 0.279                    0.386                    0.335                    0.389 ...                   
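
To make the C and T parts of CTD concrete, the sketch below encodes a sequence into three hydrophobicity classes, then computes the fraction of residues in each class (Composition) and the fraction of adjacent pairs that switch between two classes (Transition). The grouping is the one commonly cited for hydrophobicity in the CTD literature [8]; the function name and output keys are hypothetical, and the Distribution part is omitted.

```python
# Hydrophobicity classes commonly used for CTD: polar, neutral, hydrophobic.
HYDROPHOBICITY_GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
AA_TO_GROUP = {aa: g for g, aas in HYDROPHOBICITY_GROUPS.items() for aa in aas}


def ctd_composition_transition_sketch(sequence: str) -> dict[str, float]:
    """Composition and Transition features for one property (illustrative)."""
    # Raises KeyError for residues outside the 20 standard amino acids.
    encoded = "".join(AA_TO_GROUP[aa] for aa in sequence.upper())
    if len(encoded) < 2:
        raise ValueError("sequence must contain at least two residues")

    features = {}
    # Composition: fraction of residues falling into each class.
    for group in "123":
        features[f"C_{group}"] = round(encoded.count(group) / len(encoded), 3)

    # Transition: fraction of neighbouring pairs that change between two classes,
    # counted in either direction.
    pairs = list(zip(encoded, encoded[1:]))
    for a, b in ("12", "13", "23"):
        hits = sum(1 for x, y in pairs if {x, y} == {a, b})
        features[f"T_{a}{b}"] = round(hits / len(pairs), 3)
    return features


print(ctd_composition_transition_sketch("RGCR"))
```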

Sequence Order Descriptors Usage Examples

Calculate Sequence Order Coupling Number (SOCN):

socn = protpy.sequence_order_coupling_number_(protein_seq)
#using default parameters: d=1, distance_matrix="schneider-wrede"

#401.387        
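
The sequence order coupling number for a gap d is usually defined [10] as the sum of squared distances between residues that are d positions apart. The sketch below assumes that definition and takes the distance matrix as a plain nested dictionary; the two-residue matrix is a toy example, not the Schneider-Wrede or Grantham values, and the function name is hypothetical.

```python
def socn_sketch(sequence: str, d: int, distance: dict[str, dict[str, float]]) -> float:
    """Sum of squared distances between residues d positions apart."""
    sequence = sequence.upper()
    if not 0 < d < len(sequence):
        raise ValueError("d must be between 1 and len(sequence) - 1")
    total = sum(
        distance[sequence[i]][sequence[i + d]] ** 2
        for i in range(len(sequence) - d)
    )
    return round(total, 3)


# Toy symmetric distance matrix for two residues (hypothetical values).
TOY_DISTANCE = {
    "A": {"A": 0.0, "C": 0.5},
    "C": {"A": 0.5, "C": 0.0},
}
print(socn_sketch("ACAC", d=1, distance=TOY_DISTANCE))
```

Evaluating this for d = 1 up to the maximum lag gives the series of coupling numbers that the examples after this one return as one column per lag.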

Calculate all SOCNs for a distance matrix:

#using default parameters: lag=30, distance_matrix="schneider-wrede"
socn_all = protpy.sequence_order_coupling_number(protein_seq)

# SOCN_SW1  SOCN_SW2  SOCN_SW3  SOCN_SW4  SOCN_SW5 ...
# 401.387    409.243    376.946    393.042    396.196 ...  

#using custom parameters: lag=10, distance_matrix="grantham"
socn_all = protpy.sequence_order_coupling_number(protein_seq, lag=10, distance_matrix="grantham")      

# SOCN_Grant1  SOCN_Grant_2  SOCN_Grant_3  SOCN_Grant_4  SOCN_Grant_5 ...
# 399.125    402.153    387.820    393.111    409.096 ...  

Calculate Quasi Sequence Order (QSO):

#using default parameters: lag=30, weight=0.1, distance_matrix="schneider-wrede"
qso = protpy.quasi_sequence_order(protein_seq)

# QSO_SW1   QSO_SW2   QSO_SW3   QSO_SW4   QSO_SW5 ...
# 0.005692  0.002643  0.004947  0.002846  0.005625 ...  

#using custom parameters: lag=10, weight=0.2, distance_matrix="grantham"
qso = protpy.quasi_sequence_order(protein_seq, lag=10, weight=0.2, distance_matrix="grantham")

# QSO_Grant1   QSO_Grant2   QSO_Grant3   QSO_Grant4   QSO_Grant5 ...
# 0.123287  0.079967  0.04332  0.039983  0.013332 ...  

Documentation

The documentation for protpy is hosted on ReadTheDocs and is available here.

Directories

  • /tests - unit and integration tests for protpy package.
  • /protpy - source code and all required external data files for package.
  • /docs - protpy documentation.
  • /examples - example notebook for protpy.

Tests

To run all tests, from the main protpy folder run:

python3 -m unittest discover tests -v
-v: verbose output flag

Contact

If you have any questions or comments, please contact [email protected] or raise an issue on the Issues tab.

References

[1]: Mckenna, A., & Dubey, S. (2022). Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors. Journal of Biomedical Informatics, 128(104016), 104016. https://doi.org/10.1016/j.jbi.2022.104016
[2]: Shuichi Kawashima, Minoru Kanehisa, AAindex: Amino Acid index database, Nucleic Acids Research, Volume 28, Issue 1, 1 January 2000, Page 374, https://doi.org/10.1093/nar/28.1.374
[3]: Dong, J., Yao, ZJ., Zhang, L. et al. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform 10, 16 (2018). https://doi.org/10.1186/s13321-018-0270-2
[4]: Reczko, M. and Bohr, H. (1994) The DEF data base of sequence based protein fold class predictions. Nucleic Acids Res, 22, 3616-3619.
[5]: Hua, S. and Sun, Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.
[6]: Broto P, Moreau G, Vandicke C: Molecular structures: perception, autocorrelation descriptor and SAR studies. Eur J Med Chem 1984, 19: 71-78.
[7]: Ong, S.A., Lin, H.H., Chen, Y.Z. et al. Efficacy of different protein descriptors in predicting protein functional families. BMC Bioinformatics 8, 300 (2007). https://doi.org/10.1186/1471-2105-8-300
[8]: Inna Dubchak, Ilya Muchnik, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA, 1995, 92, 8700-8704.
[9]: Juwen Shen, Jian Zhang, Xiaomin Luo, Weiliang Zhu, Kunqian Yu, Kaixian Chen, Yixue Li, Huanliang Jiang. Predicting protein-protein interactions based only on sequences information. PNAS. 2007 (104) 4337-4341.
[10]: Kuo-Chen Chou. Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications 2000, 278, 477-483.
[11]: Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246-255.
[12]: Kuo-Chen Chou. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 2005,21,10-19.
