GEM-PRO - Calculating Protein Properties

This notebook gives an example of how to calculate protein properties for a list of proteins, by first pulling information from UniProt and then using the 3D structure files to calculate the desired properties

Input: List of gene IDs
Output: Representative protein structures and properties associated with them

Imports

In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROJECT will be created
  • PROJECT
    • Your project name
  • LIST_OF_GENES
    • Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
Note: Methods for protein complexes and metabolites are still in development.
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'ssbio_protein_properties'
LIST_OF_GENES = ['b1276', 'b0118']
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type='pdb')
[2017-08-30 19:00] [ssbio.pipeline.gempro] INFO: /tmp/ssbio_protein_properties: GEM-PRO project location
[2017-08-30 19:00] [ssbio.pipeline.gempro] INFO: 2: number of genes

Mapping gene ID –> sequence

First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

Note: You only need to map gene IDs using one service. However you can run both if some genes don’t map in one service and do map in another!

Methods

In [8]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2017-08-30 19:00] [root] INFO: getUserAgent: Begin
[2017-08-30 19:00] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.3; Linux) Python-requests/2.14.2
[2017-08-30 19:00] [root] INFO: getUserAgent: End
[2017-08-30 19:00] [ssbio.pipeline.gempro] INFO: 2/2: number of genes mapped to UniProt
[2017-08-30 19:00] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.

Missing UniProt mapping:  []
Out[8]:
uniprot reviewed gene_name kegg refseq num_pdbs pdbs pfam seq_len description entry_date entry_version seq_date seq_version sequence_file metadata_file
gene
b0118 P36683 False acnB ecj:JW0114;eco:b0118 NP_414660.1;WP_001307570.1 1 1L5J PF00330;PF06434;PF11791 865 Aconitate hydratase B 2017-07-05 161 1997-11-01 3 P36683.fasta P36683.xml
b1276 P25516 False acnA ecj:JW1268;eco:b1276 NP_415792.1;WP_000099535.1 0 NaN PF00330;PF00694 891 Aconitate hydratase A 2017-07-05 149 2008-01-15 3 P25516.fasta P25516.xml
In [14]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative sequence
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.

Missing a representative sequence:  []
Out[14]:
uniprot kegg num_pdbs pdbs seq_len sequence_file metadata_file
gene
b0118 P36683 ecj:JW0114;eco:b0118 1 1L5J 865 P36683.fasta P36683.xml
b1276 P25516 ecj:JW1268;eco:b1276 0 NaN 891 P25516.fasta P25516.xml

Mapping representative sequence –> structure

These are the ways to map sequence to structure:

  1. Use the UniProt ID and their automatic mappings to the PDB
  2. BLAST the sequence to the PDB
  3. Make homology models or
  4. Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.

Methods

In [15]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2017-08-30 19:01] [root] INFO: getUserAgent: Begin
[2017-08-30 19:01] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.3; Linux) Python-requests/2.14.2
[2017-08-30 19:01] [root] INFO: getUserAgent: End
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: 1/2: number of genes with at least one experimental structure
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.

Out[15]:
pdb_id pdb_chain_id uniprot experimental_method resolution coverage start end unp_start unp_end rank
gene
b0118 1l5j A P36683 X-ray diffraction 2.4 1 1 865 1 865 1
b0118 1l5j B P36683 X-ray diffraction 2.4 1 1 865 1 865 2
In [16]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: 0: number of genes with additional structures added from BLAST
[2017-08-30 19:01] [ssbio.pipeline.gempro] WARNING: Empty dataframe

Out[16]:
In [17]:
import pandas as pd
import os.path as op
In [18]:
# Creating manual mapping dictionary for ECOLI I-TASSER models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/zhang/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/zhang_data/160804-ZHANG_INFO.csv')
tmp = homology_models_df[['zhang_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]

homology_model_dict = {}

for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['zhang_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                        'file_type':'pdb'}}

my_gempro.get_manual_homology_models(homology_model_dict)
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.

In [19]:
# Creating manual mapping dictionary for ECOLI SUNPRO models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/sunpro/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/sunpro_data/160609-SUNPRO_INFO.csv')
tmp = homology_models_df[['sunpro_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]

homology_model_dict = {}

for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['sunpro_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                         'file_type':'pdb'}}

my_gempro.get_manual_homology_models(homology_model_dict)
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.

Downloading and ranking structures

Methods

Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [20]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: Saved 1 structures total

Out[20]:
pdb_id pdb_title description experimental_method mapped_chains resolution chemicals taxonomy_name structure_file
gene
b0118 1l5j CRYSTAL STRUCTURE OF E. COLI ACONITASE B. Aconitate hydratase 2 (E.C.4.2.1.3) X-ray diffraction A;B 2.4 F3S;TRA Escherichia coli 1l5j.pdb
In [21]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2017-08-30 19:01] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.

Out[21]:
id is_experimental file_type structure_file
gene
b0118 1l5j-A True pdb 1l5j-A_clean.pdb
b1276 ACON1_ECOLI-X False pdb ACON1_ECOLI_model1_clean-X_clean.pdb

Computing sequence and structure properties

In [22]:
# Requires EMBOSS "pepstats" program
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
# Install using:
# sudo apt-get install emboss
my_gempro.get_sequence_properties()

In [18]:
# Requires SCRATCH installation, replace path_to_scratch with own path to script
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_scratch_predictions(path_to_scratch='/home/nathan/software/SCRATCH-1D_1.1/bin/run_SCRATCH-1D_predictors.sh', results_dir=my_gempro.data_dir)
[2017-08-29 17:00] [ssbio.pipeline.gempro] INFO: /tmp/ssbio_protein_properties/data/ssbio_protein_properties_cds.faa: wrote all representative sequences to file
[2017-08-29 17:00] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with SCRATCH predictions loaded

In [19]:
my_gempro.get_disulfide_bridges()

In [20]:
# Requires DSSP installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_dssp_annotations()

In [21]:
# Requires MSMS installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_msms_annotations()

Extracting residue-level properties for a list of residue numbers

In [22]:
from Bio.SeqFeature import SeqFeature, FeatureLocation

Looking at one protein

In [23]:
my_protein = my_gempro.genes.b0118.protein
In [24]:
# Here are the features stored for the sequence, parsed from UniProt
my_protein.representative_sequence.seq_record.features
Out[24]:
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(865)), type='chain', id='PRO_0000076675'),
 SeqFeature(FeatureLocation(ExactPosition(243), ExactPosition(246)), type='region of interest'),
 SeqFeature(FeatureLocation(ExactPosition(413), ExactPosition(416)), type='region of interest'),
 SeqFeature(FeatureLocation(ExactPosition(709), ExactPosition(710)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(771), ExactPosition(772)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(190), ExactPosition(191)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(497), ExactPosition(498)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(790), ExactPosition(791)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(796)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='mutagenesis site'),
 SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(14)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(35)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(41), ExactPosition(51)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(58), ExactPosition(72)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(82), ExactPosition(90)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(98), ExactPosition(104)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(104), ExactPosition(107)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(108), ExactPosition(111)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(111), ExactPosition(120)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(127), ExactPosition(137)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(140), ExactPosition(151)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(153), ExactPosition(157)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(165), ExactPosition(178)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(178), ExactPosition(182)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(184), ExactPosition(190)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(193), ExactPosition(197)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(197), ExactPosition(200)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(213), ExactPosition(216)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(219), ExactPosition(227)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(232), ExactPosition(243)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(247), ExactPosition(257)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(257), ExactPosition(261)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(263), ExactPosition(269)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(271), ExactPosition(278)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(279), ExactPosition(288)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(291), ExactPosition(295)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(305), ExactPosition(310)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(310), ExactPosition(314)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(314), ExactPosition(318)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(318), ExactPosition(321)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(323), ExactPosition(327)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(333), ExactPosition(341)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(343), ExactPosition(360)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(383), ExactPosition(391)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(391), ExactPosition(394)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(408), ExactPosition(413)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(414), ExactPosition(417)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(417), ExactPosition(427)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(437), ExactPosition(440)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(443), ExactPosition(448)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(450), ExactPosition(465)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(465), ExactPosition(468)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(478), ExactPosition(483)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(483), ExactPosition(486)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(491), ExactPosition(497)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(502), ExactPosition(507)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(511), ExactPosition(521)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(530), ExactPosition(538)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(545), ExactPosition(559)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(572), ExactPosition(576)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(576), ExactPosition(583)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(588), ExactPosition(597)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(597), ExactPosition(600)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(600), ExactPosition(603)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(604), ExactPosition(609)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(612), ExactPosition(632)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(637), ExactPosition(653)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(666), ExactPosition(673)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(673), ExactPosition(676)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(680), ExactPosition(683)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(690), ExactPosition(693)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(693), ExactPosition(696)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(696), ExactPosition(699)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(703), ExactPosition(707)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(713), ExactPosition(726)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(731), ExactPosition(737)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(741), ExactPosition(750)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(752), ExactPosition(760)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(769), ExactPosition(772)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(774), ExactPosition(777)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(783), ExactPosition(790)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(798)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(801), ExactPosition(805)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(807), ExactPosition(817)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(822), ExactPosition(834)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(836), ExactPosition(840)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(845), ExactPosition(848)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(849), ExactPosition(857)), type='helix')]
In [25]:
my_protein.representative_sequence.letter_annotations.keys()
Out[25]:
dict_keys(['P36683_1l5j-B_chain_index', 'P36683_1l5j-A_chain_index', 'SS-sspro', 'RSA-accpro20', 'RSA-accpro', 'SS-sspro8'])
In [26]:
my_protein.get_residue_annotations(1, use_representatives=True)
{}
{}
seq_resnum: 1 <class 'int'>
seq_residue M M <class 'Bio.Seq.Seq'>
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-26-937cd5d9be7f> in <module>()
----> 1 my_protein.get_residue_annotations(1, use_representatives=True)

/mnt/projects/ssbio/ssbio/core/protein.py in get_residue_annotations(self, seq_resnum, seq_id, struct_id, struct_chain_id, use_representatives)
   1776                                                                                      seqprop=seq,
   1777                                                                                      structprop=struct,
-> 1778                                                                                      chain_id=chain_id)
   1779
   1780         # Try finding the residue in the structure

/mnt/projects/ssbio/ssbio/core/protein.py in map_seqprop_resnums_to_structprop_resnums(self, resnums, seqprop, structprop, chain_id, use_representatives)
   1268                                                                                        structprop=structprop,
   1269                                                                                        chain_id=chain_id,
-> 1270                                                                                        use_representatives=use_representatives)
   1271
   1272         chain = structprop.chains.get_by_id(chain_id)

/mnt/projects/ssbio/ssbio/core/protein.py in map_seqprop_resnums_to_structprop_chain_index(self, resnums, seqprop, structprop, chain_id, use_representatives)
   1222         if access_key not in seqprop.letter_annotations:
   1223             raise KeyError('{}: structure mapping {} not available in sequence letter annotations. Was alignment parsed? '
-> 1224                            'Run ``align_seqprop_to_structprop`` with ``parse=True``.'.format(access_key, aln_id))
   1225         chain_index_mapping = seqprop.letter_annotations[access_key]
   1226

KeyError: 'P36683_1l5j-A-A_chain_index: structure mapping P36683_1l5j-A-A not available in sequence letter annotations. Was alignment parsed? Run ``align_seqprop_to_structprop`` with ``parse=True``.'
In [26]:
my_protein.representative_structure.chains
Out[26]:
[<ChainProp A at 0x7f3969fb9128>]
In [37]:
my_protein.representative_sequence.seq_record.annotations
Out[37]:
{'accessions': ['P36683', 'P36648', 'P75652', 'Q59382'],
 'alternativeName_ecNumber': ['4.2.1.99'],
 'alternativeName_fullName': ['(2R,3S)-2-methylisocitrate dehydratase',
  '(2S,3R)-3-hydroxybutane-1,2,3-tricarboxylate dehydratase',
  '2-methyl-cis-aconitate hydratase',
  'Iron-responsive protein-like',
  'RNA-binding protein'],
 'alternativeName_shortName': ['IRP-like'],
 'comment_biophysicochemicalproperties': ['Optimum pH is 7.4.'],
 'comment_catalyticactivity': ['Citrate = isocitrate.',
  '(2S,3R)-3-hydroxybutane-1,2,3-tricarboxylate = (Z)-but-2-ene-1,2,3-tricarboxylate + H(2)O.'],
 'comment_cofactor': ['Binds 1 [4Fe-4S] cluster per subunit.'],
 'comment_disruptionphenotype': ['Cells lacking this gene are more sensitive to peroxide stress. The acnAB double mutant does not grow on unsupplemented glucose minimal medium and does not respond under aerobic conditions to glutamate. The acnAB double mutant retains a low but significant aconitase activity.'],
 'comment_function': ["Involved in the catabolism of short chain fatty acids (SCFA) via the tricarboxylic acid (TCA)(acetyl degradation route) and the 2-methylcitrate cycle I (propionate degradation route). Catalyzes the reversible isomerization of citrate to isocitrate via cis-aconitate. Also catalyzes the hydration of 2-methyl-cis-aconitate to yield (2R,3S)-2-methylisocitrate. The apo form of AcnB functions as a RNA-binding regulatory protein. During oxidative stress inactive AcnB apo-enzyme without iron sulfur clusters binds the acnB mRNA 3' UTRs (untranslated regions), stabilizes acnB mRNA and increases AcnB synthesis, thus mediating a post-transcriptional positive autoregulatory switch. AcnB also decreases the stability of the sodA transcript."],
 'comment_miscellaneous': ['AcnB is sensitive to oxidation in vivo.'],
 'comment_pathway': ['Organic acid metabolism; propanoate degradation.',
  'Carbohydrate metabolism; tricarboxylic acid cycle; isocitrate from oxaloacetate: step 2/2.'],
 'comment_similarity': ['Belongs to the aconitase/IPM isomerase family.'],
 'comment_subunit': ['Monomer. AcnB can also form a homodimer. The monomer-homodimer transition is dependent on iron availability and the carboxymethylation of C-273 inhibits the dimer formation.'],
 'created': '1994-06-01',
 'dataset': 'Swiss-Prot',
 'gene_name_ordered locus': ['b0118', 'JW0114'],
 'gene_name_primary': 'acnB',
 'gene_name_synonym': ['yacI', 'yacJ'],
 'key': ['1',
  '2',
  '3',
  '4',
  '5',
  '6',
  '7',
  '8',
  '9',
  '10',
  '11',
  '12',
  '13',
  '14',
  '15',
  '16'],
 'keywords': ['3D-structure',
  '4Fe-4S',
  'Complete proteome',
  'Direct protein sequencing',
  'Iron',
  'Iron-sulfur',
  'Lyase',
  'Metal-binding',
  'Reference proteome',
  'RNA-binding',
  'Tricarboxylic acid cycle'],
 'modified': '2017-03-15',
 'organism': 'Escherichia coli (strain K12)',
 'proteinExistence': ['evidence at protein level'],
 'recommendedName_ecNumber': ['4.2.1.3'],
 'recommendedName_fullName': ['Aconitate hydratase B'],
 'recommendedName_shortName': ['ACN', 'Aconitase'],
 'references': [Reference(title='Systematic sequencing of the Escherichia coli genome: analysis of the 2.4-4.1 min (110,917-193,643 bp) region.', ...),
  Reference(title='The complete genome sequence of Escherichia coli K-12.', ...),
  Reference(title='Highly accurate genome sequences of Escherichia coli K-12 strains MG1655 and W3110.', ...),
  Reference(title='The second aconitase (AcnB) of Escherichia coli.', ...),
  Reference(title='Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K-12.', ...),
  Reference(title='Oxidation of propionate to pyruvate in Escherichia coli. Involvement of methylcitrate dehydratase and aconitase.', ...),
  Reference(title='Biochemical and spectroscopic characterization of Escherichia coli aconitases (AcnA and AcnB).', ...),
  Reference(title='Two genetically-distinct and differentially-regulated aconitases (AcnA and AcnB) in Escherichia coli.', ...),
  Reference(title='Construction and properties of aconitase mutants of Escherichia coli.', ...),
  Reference(title='Direct evidence for mRNA binding and post-transcriptional regulation by Escherichia coli aconitases.', ...),
  Reference(title='Escherichia coli aconitases and oxidative stress: post-transcriptional regulation of sodA expression.', ...),
  Reference(title='Switching aconitase B between catalytic and regulatory modes involves iron-dependent dimer formation.', ...),
  Reference(title='E. coli aconitase B structure reveals a HEAT-like domain with implications for protein-protein recognition.', ...)],
 'sequence_checksum': 'FEC8AA5D2A9DD2BD',
 'sequence_length': 865,
 'sequence_mass': 93498,
 'sequence_modified': '1997-11-01',
 'sequence_version': 3,
 'taxonomy': ['Bacteria',
  'Proteobacteria',
  'Gammaproteobacteria',
  'Enterobacterales',
  'Enterobacteriaceae',
  'Escherichia'],
 'type': ['ECO:0000244', 'ECO:0000269', 'ECO:0000303', 'ECO:0000305'],
 'version': 159}
In [34]:
print('********GENE: {}********'.format(my_gempro.genes.b0118.id))

for f in my_protein.representative_sequence.seq_record.features:
    if 'metal' in f.type.lower():
        print(f)
        print(f.type)

        # Get sequence properties
        sr = f.extract(my_protein.representative_sequence.seq_record)
        print('**Residue-level properties calculated or predicted from sequence:')
        print(sr.seq)
        print(sr.letter_annotations)

#         gene_info['seq_resnum'] = f.location.end.position
#         gene_info['seq_residue'] = sr.seq
        gene_info.update(sr.letter_annotations)

        # Get structure properties
        mapping_to_structure_index = my_protein.map_seqprop_resnums_to_structprop_chain_index(resnums=f.location.end.position,
                                                                                     use_representatives=True)
        new_f = SeqFeature(FeatureLocation(mapping_to_structure_index[f.location.end.position]-1,
                                           mapping_to_structure_index[f.location.end.position]))
        sr_st = new_f.extract(my_protein.representative_structure.representative_chain.seq_record)
        print('**Residue-level properties calculated from structure:')
        print(sr_st.seq)
        print(sr_st.letter_annotations)
        print('--------------------------------')

        # Get structure residue numbers
        mapping_to_structure_resnum = my_protein.map_seqprop_resnums_to_structprop_resnums(resnums=f.location.end.position,
                                                                                  use_representatives=True)
#         gene_info['struct_resnum'] = mapping_to_structure_resnum[f.location.end.position]
#         gene_info['struct_residue'] = sr_st.seq
        gene_info.update(sr_st.letter_annotations)

print('********************************')
print()
********GENE: b0118********
type: metal ion-binding site
location: [709:710]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

metal ion-binding site
**Residue-level properties calculated or predicted from sequence:
C
{}
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-34-b8e22af14079> in <module>()
     14 #         gene_info['seq_resnum'] = f.location.end.position
     15 #         gene_info['seq_residue'] = sr.seq
---> 16         gene_info.update(sr.letter_annotations)
     17
     18         # Get structure properties

NameError: name 'gene_info' is not defined
In [34]:
# Gathering properties for the metal binding sites

metal_info = []

for g in my_gempro.genes:


    p = g.protein
    print('********GENE: {}********'.format(g.id))


    # Saving the structure residue numbering for visualization
    metal_binding_structure_residues = []

    for f in p.representative_sequence.seq_record.features:
        if 'metal' in f.type.lower():
            print(f)
            gene_info = {}
            gene_info['gene'] = g.id
            gene_info['structure_id'] = g.protein.representative_structure.id
            gene_info['feature'] = f.type

            # Get sequence properties
            sr = f.extract(p.representative_sequence.seq_record)
            print('**Residue-level properties calculated or predicted from sequence:')
            print(sr.seq)
            print(sr.letter_annotations)

            gene_info['seq_resnum'] = f.location.end.position
            gene_info['seq_residue'] = sr.seq
            gene_info.update(sr.letter_annotations)

            # Get structure properties
            mapping_to_structure_index = p.map_seqprop_resnums_to_structprop_chain_index(resnums=f.location.end.position,
                                                                                         use_representatives=True)
            new_f = SeqFeature(FeatureLocation(mapping_to_structure_index[f.location.end.position]-1,
                                               mapping_to_structure_index[f.location.end.position]))
            sr_st = new_f.extract(p.representative_structure.representative_chain.seq_record)
            print('**Residue-level properties calculated from structure:')
            print(sr_st.seq)
            print(sr_st.letter_annotations)
            print('--------------------------------')

            # Get structure residue numbers
            mapping_to_structure_resnum = p.map_seqprop_resnums_to_structprop_resnums(resnums=f.location.end.position,
                                                                                      use_representatives=True)
            gene_info['struct_resnum'] = mapping_to_structure_resnum[f.location.end.position]
            gene_info['struct_residue'] = sr_st.seq
            gene_info.update(sr_st.letter_annotations)

            metal_info.append(gene_info)
    print('********************************')
    print()
********GENE: b1276********
type: metal ion-binding site
location: [434:435]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 1
    Key: type, Value: metal ion-binding site

**Residue-level properties calculated or predicted from sequence:
C
{}
**Residue-level properties calculated from structure:
S
{'structure_resnums': [434], 'CA_DEPTH-msms': [1.999802354873969], 'PHI-dssp': [93.2], 'SS-dssp': ['-'], 'PSI-dssp': [133.0], 'RES_DEPTH-msms': [2.4908282495041347], 'ASA-dssp': [7.0], 'RSA-dssp': [0.05384615384615385]}
--------------------------------
type: metal ion-binding site
location: [500:501]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 1
    Key: type, Value: metal ion-binding site

**Residue-level properties calculated or predicted from sequence:
C
{}
**Residue-level properties calculated from structure:
G
{'structure_resnums': [500], 'CA_DEPTH-msms': [4.872765262426097], 'PHI-dssp': [-171.3], 'SS-dssp': ['S'], 'PSI-dssp': [-178.9], 'RES_DEPTH-msms': [4.508099124970176], 'ASA-dssp': [3.0], 'RSA-dssp': [0.03571428571428571]}
--------------------------------
type: metal ion-binding site
location: [503:504]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 1
    Key: type, Value: metal ion-binding site

**Residue-level properties calculated or predicted from sequence:
C
{}
**Residue-level properties calculated from structure:
T
{'structure_resnums': [503], 'CA_DEPTH-msms': [3.1546567153990472], 'PHI-dssp': [-64.7], 'SS-dssp': ['G'], 'PSI-dssp': [-34.6], 'RES_DEPTH-msms': [3.017741837993942], 'ASA-dssp': [3.0], 'RSA-dssp': [0.02112676056338028]}
--------------------------------
********************************

********GENE: b0118********
type: metal ion-binding site
location: [709:710]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

**Residue-level properties calculated or predicted from sequence:
C
{}
**Residue-level properties calculated from structure:
S
{'structure_resnums': [709], 'CA_DEPTH-msms': [10.02414716892187], 'PHI-dssp': [-109.6], 'SS-dssp': ['S'], 'PSI-dssp': [-172.4], 'RES_DEPTH-msms': [9.792794250431943], 'ASA-dssp': [2.0], 'RSA-dssp': [0.015384615384615384]}
--------------------------------
type: metal ion-binding site
location: [768:769]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

**Residue-level properties calculated or predicted from sequence:
C
{}
**Residue-level properties calculated from structure:
G
{'structure_resnums': [768], 'CA_DEPTH-msms': [6.680980525988339], 'PHI-dssp': [158.5], 'SS-dssp': ['S'], 'PSI-dssp': [-175.6], 'RES_DEPTH-msms': [6.497142417086094], 'ASA-dssp': [4.0], 'RSA-dssp': [0.047619047619047616]}
--------------------------------
type: metal ion-binding site
location: [771:772]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

**Residue-level properties calculated or predicted from sequence:
C
{}
**Residue-level properties calculated from structure:
L
{'structure_resnums': [771], 'CA_DEPTH-msms': [6.8059331457785985], 'PHI-dssp': [-52.6], 'SS-dssp': ['G'], 'PSI-dssp': [-24.8], 'RES_DEPTH-msms': [6.339283606143073], 'ASA-dssp': [1.0], 'RSA-dssp': [0.006097560975609756]}
--------------------------------
********************************

In [27]:
cols = ['gene', 'structure_id', 'feature', 'seq_residue', 'seq_resnum',
        'struct_residue','struct_resnum', 'SS-sspro','SS-sspro8','RSA-accpro','RSA-accpro20','SS-dssp','RSA-dssp', 'ASA-dssp',
        'PHI-dssp', 'PSI-dssp', 'CA_DEPTH-msms', 'RES_DEPTH-msms', 'structure_resnums']
In [28]:
pd.DataFrame.from_records(metal_info, columns=cols)
Out[28]:
gene structure_id feature seq_residue seq_resnum struct_residue struct_resnum SS-sspro SS-sspro8 RSA-accpro RSA-accpro20 SS-dssp RSA-dssp ASA-dssp PHI-dssp PSI-dssp CA_DEPTH-msms RES_DEPTH-msms structure_resnums
0 b1276 ACON1_ECOLI-X metal ion-binding site (C) 435 (S) ( , 435, ) H H - [10] [-] [0.0538461538462] [7.0] [93.2] [133.0] [1.99980235487] [2.4908282495] [( , 434, )]
1 b1276 ACON1_ECOLI-X metal ion-binding site (C) 501 (G) ( , 501, ) C C - [0] [S] [0.0357142857143] [3.0] [-171.3] [-178.9] [4.87276526243] [4.50809912497] [( , 500, )]
2 b1276 ACON1_ECOLI-X metal ion-binding site (C) 504 (T) ( , 504, ) H G - [0] [G] [0.0211267605634] [3.0] [-64.7] [-34.6] [3.1546567154] [3.01774183799] [( , 503, )]
3 b0118 1l5j-A metal ion-binding site (C) 710 (S) ( , 710, ) C T - [10] [S] [0.0153846153846] [2.0] [-109.6] [-172.4] [10.0241471689] [9.79279425043] [( , 709, )]
4 b0118 1l5j-A metal ion-binding site (C) 769 (G) ( , 769, ) C C - [5] [S] [0.047619047619] [4.0] [158.5] [-175.6] [6.68098052599] [6.49714241709] [( , 768, )]
5 b0118 1l5j-A metal ion-binding site (C) 772 (L) ( , 772, ) H G - [5] [G] [0.00609756097561] [1.0] [-52.6] [-24.8] [6.80593314578] [6.33928360614] [( , 771, )]

Visualizing residues

In [31]:
metal_binding_structure_residues = [710, 769, 772]
In [29]:
my_protein.representative_structure.view_structure_and_highlight_residues(structure_resnums=metal_binding_structure_residues)
[2017-06-01 09:38] [ssbio.protein.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and ( )

Looking at residue depths in the same protein

In [34]:
for xx in g.protein.representative_sequence.structure_alignments:
    print(xx.id)
P36683_1l5j-A
P36683_1l5j-B
P36683_ACON2_ECOLI-X
P36683_E00113-X
In [35]:
# Run MSMS for all structures
for g in my_gempro.genes:
    print(g)
    for x in g.protein.structures:
        print(x)
        x.get_residue_depths(outdir=g.protein.structure_dir)
        x.align_seqprop_to_mapped_chains(g.protein.representative_sequence)
        x.map_seqprop_resnums_to_mapped_chains(g.protein.representative_sequence, [1,2,3])
b1276
ACON1_ECOLI
Out[35]:
{'X': {1: (' ', 1, ' '), 2: (' ', 2, ' '), 3: (' ', 3, ' ')}}
E01201
Out[35]:
{'X': {1: (' ', 1, ' '), 2: (' ', 2, ' '), 3: (' ', 3, ' ')}}
b1276_ITASSER
Out[35]:
{'A': {1: (' ', 1, ' '), 2: (' ', 2, ' '), 3: (' ', 3, ' ')}}
b0118
1l5j
Out[35]:
{'A': {1: (' ', 1, ' '), 2: (' ', 2, ' '), 3: (' ', 3, ' ')},
 'B': {1: (' ', 1, ' '), 2: (' ', 2, ' '), 3: (' ', 3, ' ')}}
ACON2_ECOLI
Out[35]:
{'X': {1: (' ', 1, ' '), 2: (' ', 2, ' '), 3: (' ', 3, ' ')}}
E00113
Out[35]:
{'X': {1: (' ', 1, ' '), 2: (' ', 2, ' '), 3: (' ', 3, ' ')}}
In [40]:
from Bio.SeqFeature import SeqFeature, FeatureLocation
In [50]:
# Gathering properties for the metal binding sites

metal_info = []

for g in my_gempro.genes:
    print(g)
    p = g.protein
    for f in p.representative_sequence.seq_record.features:
        if 'metal' in f.type.lower():
            print(f)
            # Get sequence properties
            sr = f.extract(p.representative_sequence.seq_record)

            for s in p.structures:
                print(s)
                # Get structure properties
                new_f = SeqFeature(FeatureLocation(int(sr.letter_annotations['repchain_resnums'][0]-1), int(sr.letter_annotations['repchain_resnums'][-1])))
                sr_st = new_f.extract(p.representative_structure.representative_chain.seq_record)

                print(sr_st.letter_annotations['RES_DEPTH-msms'])
    print()
b1276
type: metal ion-binding site
location: [434:435]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 1
    Key: type, Value: metal ion-binding site

ACON1_ECOLI
[2.8135356659097708]
E01201
[2.8135356659097708]
b1276_ITASSER
[2.8135356659097708]
type: metal ion-binding site
location: [500:501]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 1
    Key: type, Value: metal ion-binding site

ACON1_ECOLI
[2.4091192574526867]
E01201
[2.4091192574526867]
b1276_ITASSER
[2.4091192574526867]
type: metal ion-binding site
location: [503:504]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 1
    Key: type, Value: metal ion-binding site

ACON1_ECOLI
[1.961483950461508]
E01201
[1.961483950461508]
b1276_ITASSER
[1.961483950461508]

b0118
type: metal ion-binding site
location: [709:710]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

1l5j
[10.009108769044623]
ACON2_ECOLI
[10.009108769044623]
E00113
[10.009108769044623]
type: metal ion-binding site
location: [768:769]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

1l5j
[8.0498321032343032]
ACON2_ECOLI
[8.0498321032343032]
E00113
[8.0498321032343032]
type: metal ion-binding site
location: [771:772]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

1l5j
[8.2393688432309204]
ACON2_ECOLI
[8.2393688432309204]
E00113
[8.2393688432309204]