GEM-PRO - Genes & Sequences

This notebook gives an example of how to run the GEM-PRO pipeline with a dictionary of gene IDs and their protein sequences.

Input: Dictionary of gene IDs and protein sequences
Output: GEM-PRO model


In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

    • Only really important messages shown
    • Major errors
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
    • Really detailed information that will print out a lot of stuff
Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
logger.handlers = [handler]

Initialization of the project

Set these three things:

    • The directory where a folder named after your PROJECT will be created
    • Your project name
    • Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
Note: Methods for protein complexes and metabolites are still in development.
In [6]:
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'genes_and_sequences_GP'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_and_sequences=GENES_AND_SEQUENCES)
[2017-08-29 14:06] [ssbio.pipeline.gempro] INFO: /tmp/genes_and_sequences_GP: GEM-PRO project location
[2017-08-29 14:06] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences
[2017-08-29 14:06] [ssbio.pipeline.gempro] INFO: 2: number of genes

Mapping sequence –> structure

Since the sequences have been provided, we just need to BLAST them to the PDB.

Note: These methods do not download any 3D structure files.


In [8]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
[2017-08-29 14:06] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2017-08-29 14:06] [ssbio.pipeline.gempro] INFO: 2: number of genes with additional structures added from BLAST

pdb_id pdb_chain_id hit_score hit_evalue hit_percent_similar hit_percent_ident hit_num_ident hit_num_similar
b0870 4rjy D 1696.0 0.0 0.993994 0.987988 329 331
b0870 4rjy B 1696.0 0.0 0.993994 0.987988 329 331

Downloading and ranking structures


Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [9]:
# Download all mapped PDBs and gather the metadata
[2017-08-29 14:06] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2017-08-29 14:06] [ssbio.pipeline.gempro] INFO: Saved 11 structures total

pdb_id pdb_title description experimental_method mapped_chains resolution chemicals taxonomy_name structure_file
b0870 3wlx Crystal structure of low-specificity L-threoni... Low specificity L-threonine aldolase (E.C.4.1.... X-RAY DIFFRACTION A;B 2.51 PLG Escherichia coli 3wlx.cif
b0870 4lnj Structure of Escherichia coli Threonine Aldola... Low-specificity L-threonine aldolase (E.C.4.1.... X-RAY DIFFRACTION A;B 2.10 EPE;MG;PLR Escherichia coli 4lnj.cif
In [10]:
# Set representative structures
[2017-08-29 14:06] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2017-08-29 14:06] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.

id is_experimental file_type structure_file
b0870 3wlx-A True pdb 3wlx-A_clean.pdb
b3041 1iez-A True pdb 1iez-A_clean.pdb
In [11]:
# Looking at the information saved within a gene
<StructProp 3wlx-A at 0x7f468f1dd470>
{'_structure_dir': '/tmp/genes_and_sequences_GP/genes/b0870/b0870_protein/structures',
 'chains': [<ChainProp A at 0x7f46871dd6a0>],
 'date': None,
 'description': 'Low specificity L-threonine aldolase (E.C.',
 'file_type': 'pdb',
 'id': '3wlx-A',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'notes': {},
 'original_pdb_id': '3wlx',
 'resolution': 2.51,
 'structure_file': '3wlx-A_clean.pdb',
 'taxonomy_name': 'Escherichia coli'}

Creating homology models

For those proteins with no representative structure, we can create homology models for them. ssbio contains some built in functions for easily running I-TASSER locally or on machines with SLURM (ie. on NERSC) or Torque job scheduling.

You can load in I-TASSER models once they complete using the get_itasser_models later.

Info: Homology modeling can take a long time - about 24-72 hours per protein (highly dependent on the sequence length, as well as if there are available templates).


In [12]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_models('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 0 genes in folder /tmp/genes_and_sequences_GP/data/homology_models

Saving your GEM-PRO

Warning: Saving is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.

In [13]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(, compression=False)
[2017-08-29 14:07] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2017-08-29 14:07] [] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_and_sequences_GP) to /tmp/genes_and_sequences_GP/model/genes_and_sequences_GP.json