GEM-PRO - Calculating Protein Properties

This notebook gives an example of how to calculate protein properties for a list of proteins. The main features demonstrated are:

  1. Information retrieval from UniProt and linking residue numbering sites to structure
  2. Calculating or predicting global protein sequence and structure properties
  3. Calculating or predicting local protein sequence and structure properties
Input: List of gene IDs
Output: Representative protein structures and properties associated with them

Imports

In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROJECT will be created
  • PROJECT
    • Your project name
  • LIST_OF_GENES
    • Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
Note: Methods for protein complexes and metabolites are still in development.
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'ssbio_protein_properties'
LIST_OF_GENES = ['b1276', 'b0118']
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type='pdb')
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: /tmp/ssbio_protein_properties: GEM-PRO project location
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: 2: number of genes

Mapping gene ID –> sequence

First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

Note: You only need to map gene IDs using one service. However you can run both if some genes don’t map in one service and do map in another!

In [8]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2017-11-08 14:27] [root] INFO: getUserAgent: Begin
[2017-11-08 14:27] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2017-11-08 14:27] [root] INFO: getUserAgent: End

[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: 2/2: number of genes mapped to UniProt
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Missing UniProt mapping:  []
Out[8]:
uniprot reviewed gene_name kegg refseq pdbs pfam description entry_date entry_version seq_date seq_version sequence_file metadata_file
gene
b0118 P36683 False acnB ecj:JW0114;eco:b0118 NP_414660.1;WP_001307570.1 1L5J PF00330;PF06434;PF11791 Aconitate hydratase B 2017-10-25 163 1997-11-01 3 P36683.fasta P36683.xml
b1276 P25516 False acnA ecj:JW1268;eco:b1276 NP_415792.1;WP_000099535.1 NaN PF00330;PF00694 Aconitate hydratase A 2017-10-25 151 2008-01-15 3 P25516.fasta P25516.xml
In [9]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()

[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative sequence
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.
Missing a representative sequence:  []
Out[9]:
uniprot kegg pdbs sequence_file metadata_file
gene
b0118 P36683 ecj:JW0114;eco:b0118 1L5J P36683.fasta P36683.xml
b1276 P25516 ecj:JW1268;eco:b1276 NaN P25516.fasta P25516.xml

Mapping representative sequence –> structure

These are the ways to map sequence to structure:

  1. Use the UniProt ID and their automatic mappings to the PDB
  2. BLAST the sequence to the PDB
  3. Make homology models or
  4. Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.

In [10]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2017-11-08 14:27] [root] INFO: getUserAgent: Begin
[2017-11-08 14:27] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2017-11-08 14:27] [root] INFO: getUserAgent: End

[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: 1/2: number of genes with at least one experimental structure
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
Out[10]:
pdb_id pdb_chain_id uniprot experimental_method resolution coverage start end unp_start unp_end rank
gene
b0118 1l5j A P36683 X-ray diffraction 2.4 1 1 865 1 865 1
b0118 1l5j B P36683 X-ray diffraction 2.4 1 1 865 1 865 2
In [11]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.7, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)

[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: 0: number of genes with additional structures added from BLAST
[2017-11-08 14:27] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[11]:
In [12]:
import pandas as pd
import os.path as op
In [13]:
# Creating manual mapping dictionary for ECOLI I-TASSER models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/zhang/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/zhang_data/160804-ZHANG_INFO.csv')
tmp = homology_models_df[['zhang_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]

homology_model_dict = {}

for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['zhang_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                        'file_type':'pdb'}}

my_gempro.get_manual_homology_models(homology_model_dict)

[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.
In [14]:
# Creating manual mapping dictionary for ECOLI SUNPRO models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/sunpro/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/sunpro_data/160609-SUNPRO_INFO.csv')
tmp = homology_models_df[['sunpro_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]

homology_model_dict = {}

for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['sunpro_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                         'file_type':'pdb'}}

my_gempro.get_manual_homology_models(homology_model_dict)

[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.

Downloading and ranking structures

Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [15]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)

[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: Saved 1 structures total
Out[15]:
pdb_id pdb_title description experimental_method mapped_chains resolution chemicals taxonomy_name structure_file
gene
b0118 1l5j CRYSTAL STRUCTURE OF E. COLI ACONITASE B. Aconitate hydratase 2 (E.C.4.2.1.3) X-ray diffraction A;B 2.4 F3S;TRA Escherichia coli 1l5j.pdb
In [16]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[16]:
id is_experimental file_type structure_file
gene
b0118 REP-1l5j True pdb 1l5j-A_clean.pdb
b1276 REP-ACON1_ECOLI False pdb ACON1_ECOLI_model1_clean-X_clean.pdb

Computing and storing protein properties

In [17]:
# Requires EMBOSS "pepstats" program
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
# Install using:
# sudo apt-get install emboss
my_gempro.get_sequence_properties()

In [18]:
# Requires SCRATCH installation, replace path_to_scratch with own path to script
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_scratch_predictions(path_to_scratch='/home/nathan/software/SCRATCH-1D_1.1/bin/run_SCRATCH-1D_predictors.sh',
                                  results_dir=my_gempro.data_dir,
                                  num_cores=4)
[2017-11-08 14:27] [ssbio.pipeline.gempro] INFO: /tmp/ssbio_protein_properties/data/ssbio_protein_properties.faa: wrote all representative sequences to file

###################################
#                                 #
#  SCRATCH-1D release 1.1 (2015)  #
#                                 #
###################################

[SCRATCH-1D_predictions.pl] 2 protein sequence(s) found
[SCRATCH-1D_predictions.pl] generating sequence profiles...
[SCRATCH-1D_predictions.pl] running SCRATCH-1D predictors...
[SCRATCH-1D_predictions.pl] running homology analysis...
[SCRATCH-1D_predictions.pl] writing SSpro predictions...
[SCRATCH-1D_predictions.pl] writing SSpro8 predictions...
[SCRATCH-1D_predictions.pl] writing ACCpro predictions...
[SCRATCH-1D_predictions.pl] writing ACCpro20 predictions...
[SCRATCH-1D_predictions.pl] job successfully completed!


[2017-11-08 14:36] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with SCRATCH predictions loaded
In [19]:
my_gempro.get_disulfide_bridges(representatives_only=False)

In [21]:
# Requires DSSP installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_dssp_annotations()

In [22]:
# Requires MSMS installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_msms_annotations()


Global protein properties

Properties of the entire protein sequence/structure are stored in the representative_sequence and representative_structure attributes. These properties describe aspects of the entire protein, such as its molecular weight, the percentage of amino acids in a particular secondary structure, the percentage of charged or

In [23]:
from pprint import pprint
In [24]:
for g in my_gempro.genes_with_a_representative_structure:
    repseq = g.protein.representative_sequence
    repstruct = g.protein.representative_structure
    repchain = g.protein.representative_chain

    print('Gene: {}'.format(g.id))
    print('Number of structures: {}'.format(g.protein.num_structures))
    print('Representative sequence: {}'.format(repseq.id))
    print('Representative structure: {}'.format(repstruct.id))

    print('Global properties of the representative sequence:')
    pprint(repseq.annotations)

    print('Global properties of the representative structure:')
    pprint(repstruct.chains.get_by_id(repchain).seq_record.annotations)

    print('--------------------------------------------------')
Gene: b1276
Number of structures: 3
Representative sequence: P25516
Representative structure: REP-ACON1_ECOLI
Global properties of the representative sequence:
{'amino_acids_percent': {'A': 0.08641975308641975,
                         'C': 0.007856341189674524,
                         'D': 0.06397306397306397,
                         'E': 0.06172839506172839,
                         'F': 0.025813692480359147,
                         'G': 0.08754208754208755,
                         'H': 0.020202020202020204,
                         'I': 0.04826038159371493,
                         'K': 0.04826038159371493,
                         'L': 0.09427609427609428,
                         'M': 0.028058361391694726,
                         'N': 0.037037037037037035,
                         'P': 0.05611672278338945,
                         'Q': 0.030303030303030304,
                         'R': 0.05723905723905724,
                         'S': 0.05723905723905724,
                         'T': 0.06060606060606061,
                         'V': 0.0819304152637486,
                         'W': 0.014590347923681257,
                         'Y': 0.03254769921436588},
 'aromaticity': 0.07295173961840629,
 'instability_index': 36.28239057239071,
 'isoelectric_point': 5.59344482421875,
 'molecular_weight': 97676.06830000057,
 'monoisotopic': False,
 'percent_B-sspro8': 0.010101010101010102,
 'percent_C-sspro': 0.43546576879910215,
 'percent_C-sspro8': 0.23905723905723905,
 'percent_E-sspro': 0.18855218855218855,
 'percent_E-sspro8': 0.1829405162738496,
 'percent_G-sspro8': 0.03254769921436588,
 'percent_H-sspro': 0.3759820426487093,
 'percent_H-sspro8': 0.3378226711560045,
 'percent_I-sspro8': 0.002244668911335578,
 'percent_S-sspro8': 0.0707070707070707,
 'percent_T-sspro8': 0.12457912457912458,
 'percent_acidic': 0.1257,
 'percent_aliphatic': 0.31089,
 'percent_aromatic': 0.09315,
 'percent_basic': 0.1257,
 'percent_buried-accpro': 0.5847362514029181,
 'percent_buried-accpro20': 0.6273849607182941,
 'percent_charged': 0.2514,
 'percent_exposed-accpro': 0.4152637485970819,
 'percent_exposed-accpro20': 0.372615039281706,
 'percent_helix_naive': 0.29741863075196406,
 'percent_non-polar': 0.56341,
 'percent_polar': 0.43659,
 'percent_small': 0.53872,
 'percent_strand_naive': 0.27048260381593714,
 'percent_tiny': 0.29966000000000004,
 'percent_turn_naive': 0.2379349046015713}
Global properties of the representative structure:
{'percent_B-dssp': 0.010101010101010102,
 'percent_C-dssp': 0.22222222222222221,
 'percent_E-dssp': 0.17396184062850731,
 'percent_G-dssp': 0.039281705948372617,
 'percent_H-dssp': 0.34567901234567899,
 'percent_I-dssp': 0.0056116722783389446,
 'percent_S-dssp': 0.094276094276094277,
 'percent_T-dssp': 0.10886644219977554}
--------------------------------------------------
Gene: b0118
Number of structures: 4
Representative sequence: P36683
Representative structure: REP-1l5j
Global properties of the representative sequence:
{'amino_acids_percent': {'A': 0.11213872832369942,
                         'C': 0.011560693641618497,
                         'D': 0.06358381502890173,
                         'E': 0.06589595375722543,
                         'F': 0.03352601156069364,
                         'G': 0.08786127167630058,
                         'H': 0.017341040462427744,
                         'I': 0.05433526011560694,
                         'K': 0.056647398843930635,
                         'L': 0.10173410404624278,
                         'M': 0.026589595375722544,
                         'N': 0.035838150289017344,
                         'P': 0.06242774566473988,
                         'Q': 0.028901734104046242,
                         'R': 0.04508670520231214,
                         'S': 0.04161849710982659,
                         'T': 0.05433526011560694,
                         'V': 0.06705202312138728,
                         'W': 0.009248554913294798,
                         'Y': 0.024277456647398842},
 'aromaticity': 0.06705202312138728,
 'instability_index': 32.79631213872841,
 'isoelectric_point': 5.23931884765625,
 'molecular_weight': 93497.01500000065,
 'monoisotopic': False,
 'percent_B-sspro8': 0.016184971098265895,
 'percent_C-sspro': 0.4254335260115607,
 'percent_C-sspro8': 0.2138728323699422,
 'percent_E-sspro': 0.15606936416184972,
 'percent_E-sspro8': 0.14335260115606938,
 'percent_G-sspro8': 0.027745664739884393,
 'percent_H-sspro': 0.4184971098265896,
 'percent_H-sspro8': 0.3895953757225434,
 'percent_I-sspro8': 0.0,
 'percent_S-sspro8': 0.07976878612716763,
 'percent_T-sspro8': 0.12947976878612716,
 'percent_acidic': 0.12948,
 'percent_aliphatic': 0.33526000000000006,
 'percent_aromatic': 0.08439,
 'percent_basic': 0.11907999999999999,
 'percent_buried-accpro': 0.6323699421965318,
 'percent_buried-accpro20': 0.6809248554913295,
 'percent_charged': 0.24855,
 'percent_exposed-accpro': 0.3676300578034682,
 'percent_exposed-accpro20': 0.3190751445086705,
 'percent_helix_naive': 0.29017341040462424,
 'percent_non-polar': 0.59075,
 'percent_polar': 0.40924999999999995,
 'percent_small': 0.53642,
 'percent_strand_naive': 0.3063583815028902,
 'percent_tiny': 0.30751,
 'percent_turn_naive': 0.22774566473988442}
Global properties of the representative structure:
{'percent_B-dssp': 0.016241299303944315,
 'percent_C-dssp': 0.20765661252900233,
 'percent_E-dssp': 0.14037122969837587,
 'percent_G-dssp': 0.034802784222737818,
 'percent_H-dssp': 0.38051044083526681,
 'percent_I-dssp': 0.0,
 'percent_S-dssp': 0.082366589327146175,
 'percent_T-dssp': 0.13805104408352667}
--------------------------------------------------

Local protein properties

In [25]:
[x for x in g.protein.representative_sequence.features if 'site' in x.type]
Out[25]:
[SeqFeature(FeatureLocation(ExactPosition(709), ExactPosition(710)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(771), ExactPosition(772)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(190), ExactPosition(191)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(497), ExactPosition(498)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(790), ExactPosition(791)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(796)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='mutagenesis site')]
In [26]:
for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'site' in f.type.lower():
            print(f)
type: metal ion-binding site
location: [434:435]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 1
    Key: type, Value: metal ion-binding site

type: metal ion-binding site
location: [500:501]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 1
    Key: type, Value: metal ion-binding site

type: metal ion-binding site
location: [503:504]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 1
    Key: type, Value: metal ion-binding site

type: metal ion-binding site
location: [709:710]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

type: metal ion-binding site
location: [768:769]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

type: metal ion-binding site
location: [771:772]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

type: binding site
location: [190:191]
qualifiers:
    Key: description, Value: Substrate
    Key: evidence, Value: 5
    Key: type, Value: binding site

type: binding site
location: [497:498]
qualifiers:
    Key: description, Value: Substrate
    Key: evidence, Value: 5
    Key: type, Value: binding site

type: binding site
location: [790:791]
qualifiers:
    Key: description, Value: Substrate
    Key: evidence, Value: 5
    Key: type, Value: binding site

type: binding site
location: [795:796]
qualifiers:
    Key: description, Value: Substrate
    Key: evidence, Value: 5
    Key: type, Value: binding site

type: mutagenesis site
location: [768:769]
qualifiers:
    Key: description, Value: Inhibits the dimer formation.
    Key: evidence, Value: 7
    Key: original, Value: C
    Key: type, Value: mutagenesis site
    Key: variation, Value: S

In [27]:
metal_info = []

for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
            res_info['gene_id'] = g.id
            res_info['seq_id'] = g.protein.representative_sequence.id
            res_info['struct_id'] = g.protein.representative_structure.id
            res_info['chain_id'] = g.protein.representative_chain
            metal_info.append(res_info)

cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
        'seq_residue', 'seq_resnum', 'struct_residue','struct_resnum',
        'seq_SS-sspro','seq_SS-sspro8','seq_RSA-accpro','seq_RSA-accpro20',
        'struct_SS-dssp','struct_RSA-dssp', 'struct_ASA-dssp',
        'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']

pd.DataFrame.from_records(metal_info, columns=cols).set_index(['gene_id', 'seq_id', 'struct_id', 'chain_id', 'seq_resnum'])
Out[27]:
seq_residue struct_residue struct_resnum seq_SS-sspro seq_SS-sspro8 seq_RSA-accpro seq_RSA-accpro20 struct_SS-dssp struct_RSA-dssp struct_ASA-dssp struct_PHI-dssp struct_PSI-dssp struct_CA_DEPTH-msms struct_RES_DEPTH-msms
gene_id seq_id struct_id chain_id seq_resnum
b1276 P25516 REP-ACON1_ECOLI X 435 C C 435 H H - 10 H 0.059259 8.0 -61.1 -26.6 2.656722 2.813536
501 C C 501 C C - 0 S 0.088889 12.0 -61.0 -50.0 1.999713 2.409119
504 C C 504 H G - 0 G 0.259259 35.0 -56.0 -45.6 1.999634 1.961484
b0118 P36683 REP-1l5j A 710 C C 710 C T - 10 T 0.118519 16.0 -67.1 -7.2 10.148960 10.009109
769 C C 769 C C - 5 - 0.088889 12.0 -67.8 -28.3 8.296585 8.049832
772 C C 772 H G - 5 G 0.081481 11.0 -50.2 -38.0 8.282292 8.239369

Visualizing residues

StructProp.view_structure(opacity=1.0, recolor=False, gui=False)[source]

Use NGLviewer to display a structure in a Jupyter notebook

Parameters:
  • opacity (float) – Opacity of the structure
  • recolor (bool) – If structure should be cleaned and recolored to silver
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

StructProp.add_residues_highlight_to_nglview(view, structure_resnums, chain=None, res_color=’red’)[source]

Add a residue number or numbers to an NGLWidget view object.

Parameters:
  • view (NGLWidget) – NGLWidget view object
  • structure_resnums (int, list) – Residue number(s) to highlight, structure numbering
  • chain (str, list) – Chain ID or IDs of which residues are a part of. If not provided, all chains in the mapped_chains attribute will be used. If that is also empty, and exception is raised.
  • res_color (str) – Color to highlight residues with
In [30]:
for g in my_gempro.genes:

    # Gather residue numbers
    metal_binding_structure_residues = []
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
            metal_binding_structure_residues.append(res_info['struct_resnum'])
    print(metal_binding_structure_residues)

    # Display structure
    view = g.protein.representative_structure.view_structure()
    g.protein.representative_structure.add_residues_highlight_to_nglview(view=view, structure_resnums=metal_binding_structure_residues)

    view
[435, 501, 504]
[2017-11-08 14:40] [ssbio.protein.structure.structprop] INFO: Selection: ( :X ) and not hydrogen and ( 504 or 435 or 501 )
[710, 769, 772]
[2017-11-08 14:40] [ssbio.protein.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and ( 769 or 772 or 710 )

Comparing features in different structures of the same protein

In [34]:
# Run all sequence to structure alignments
for g in my_gempro.genes:
    for s in g.protein.structures:
        g.protein.align_seqprop_to_structprop(seqprop=g.protein.representative_sequence, structprop=s)
In [35]:
metal_info_compared = []

for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            for s in g.protein.structures:
                for c in s.mapped_chains:
                    res_info = g.protein.get_residue_annotations(seq_resnum=f.location.end,
                                                                 seqprop=g.protein.representative_sequence,
                                                                 structprop=s, chain_id=c,
                                                                 use_representatives=False)
                    res_info['gene_id'] = g.id
                    res_info['seq_id'] = g.protein.representative_sequence.id
                    res_info['struct_id'] = s.id
                    res_info['chain_id'] = c
                    metal_info_compared.append(res_info)

cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
        'seq_residue', 'seq_resnum', 'struct_residue','struct_resnum',
        'seq_SS-sspro','seq_SS-sspro8','seq_RSA-accpro','seq_RSA-accpro20',
        'struct_SS-dssp','struct_RSA-dssp', 'struct_ASA-dssp',
        'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']

pd.DataFrame.from_records(metal_info_compared, columns=cols).sort_values(by=['seq_resnum','struct_id','chain_id']).set_index(['gene_id','seq_id','seq_resnum','seq_residue','struct_id'])
Out[35]:
chain_id struct_residue struct_resnum seq_SS-sspro seq_SS-sspro8 seq_RSA-accpro seq_RSA-accpro20 struct_SS-dssp struct_RSA-dssp struct_ASA-dssp struct_PHI-dssp struct_PSI-dssp struct_CA_DEPTH-msms struct_RES_DEPTH-msms
gene_id seq_id seq_resnum seq_residue struct_id
b1276 P25516 435 C ACON1_ECOLI X C 435 H H - 10 H 0.059259 8.0 -61.1 -26.6 NaN NaN
E01201 X C 435 H H - 10 H 0.037037 5.0 -66.5 -22.1 NaN NaN
REP-ACON1_ECOLI X C 435 H H - 10 H 0.059259 8.0 -61.1 -26.6 2.656722 2.813536
501 C ACON1_ECOLI X C 501 C C - 0 S 0.088889 12.0 -61.0 -50.0 NaN NaN
E01201 X C 501 C C - 0 S 0.066667 9.0 -62.1 -46.0 NaN NaN
REP-ACON1_ECOLI X C 501 C C - 0 S 0.088889 12.0 -61.0 -50.0 1.999713 2.409119
504 C ACON1_ECOLI X C 504 H G - 0 G 0.259259 35.0 -56.0 -45.6 NaN NaN
E01201 X C 504 H G - 0 G 0.170370 23.0 -59.6 -43.5 NaN NaN
REP-ACON1_ECOLI X C 504 H G - 0 G 0.259259 35.0 -56.0 -45.6 1.999634 1.961484
b0118 P36683 710 C 1l5j A C 710 C T - 10 T 0.118519 16.0 -67.1 -7.2 NaN NaN
1l5j B C 710 C T - 10 T 0.111111 15.0 -71.4 -6.4 NaN NaN
ACON2_ECOLI X C 710 C T - 10 T 0.096296 13.0 -66.2 -18.3 NaN NaN
E00113 X C 710 C T - 10 T 0.088889 12.0 -66.3 -17.5 NaN NaN
REP-1l5j A C 710 C T - 10 T 0.118519 16.0 -67.1 -7.2 10.148960 10.009109
769 C 1l5j A C 769 C C - 5 - 0.088889 12.0 -67.8 -28.3 NaN NaN
1l5j B C 769 C C - 5 - 0.081481 11.0 -66.2 -30.6 NaN NaN
ACON2_ECOLI X C 769 C C - 5 - 0.088889 12.0 -66.3 -37.7 NaN NaN
E00113 X C 769 C C - 5 - 0.081481 11.0 -65.8 -38.5 NaN NaN
REP-1l5j A C 769 C C - 5 - 0.088889 12.0 -67.8 -28.3 8.296585 8.049832
772 C 1l5j A C 772 H G - 5 G 0.081481 11.0 -50.2 -38.0 NaN NaN
1l5j B C 772 H G - 5 G 0.096296 13.0 -49.8 -38.4 NaN NaN
ACON2_ECOLI X C 772 H G - 5 G 0.088889 12.0 -53.4 -47.8 NaN NaN
E00113 X C 772 H G - 5 G 0.044444 6.0 -52.4 -48.7 NaN NaN
REP-1l5j A C 772 H G - 5 G 0.081481 11.0 -50.2 -38.0 8.282292 8.239369