GEM-PRO - List of Gene IDs

This notebook gives an example of how to run the GEM-PRO pipeline with a list of gene IDs.

Input: List of gene IDs
Output: GEM-PRO model

Imports

In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROJECT will be created
  • PROJECT
    • Your project name
  • LIST_OF_GENES
    • Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
Note: Methods for protein complexes and metabolites are still in development.
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'genes_GP'
LIST_OF_GENES = ['b0761', 'b0889', 'b0995', 'b1013', 'b1014', 'b1040', 'b1130', 'b1187', 'b1221', 'b1299']
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES)
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: /tmp/genes_GP: GEM-PRO project location
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: 10: number of genes

Mapping gene ID –> sequence

First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

Note: You only need to map gene IDs using one service. However you can run both if some genes don’t map in one service and do map in another!

Methods

In [8]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='eco')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()
Widget Javascript not detected.  It may not be installed or enabled properly.
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: 10/10: number of genes mapped to KEGG
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.

Missing KEGG mapping:  []
Out[8]:
kegg refseq uniprot num_pdbs pdbs seq_len sequence_file metadata_file
gene
b0761 eco:b0761 NP_415282 P0A9G8 5 1B9M;1H9S;1B9N;1O7L;1H9R 262 eco-b0761.faa eco-b0761.kegg
b0889 eco:b0889 NP_415409 P0ACJ0 2 2GQQ;2L4A 164 eco-b0889.faa eco-b0889.kegg
b0995 eco:b0995 NP_415515 P38684 1 1ZGZ 230 eco-b0995.faa eco-b0995.kegg
b1013 eco:b1013 NP_415533 P0ACU2 4 4JYK;4XK4;4X1E;3LOC 212 eco-b1013.faa eco-b1013.kegg
b1014 eco:b1014 NP_415534 P09546 16 3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1... 1320 eco-b1014.faa eco-b1014.kegg
In [9]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2017-08-29 14:07] [root] INFO: getUserAgent: Begin
[2017-08-29 14:07] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.3; Linux) Python-requests/2.14.2
[2017-08-29 14:07] [root] INFO: getUserAgent: End
Widget Javascript not detected.  It may not be installed or enabled properly.
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: 10/10: number of genes mapped to UniProt
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.

Missing UniProt mapping:  []
Out[9]:
uniprot reviewed gene_name kegg refseq num_pdbs pdbs pfam seq_len description entry_date entry_version seq_date seq_version sequence_file metadata_file
gene
b0761 P0A9G8 False modE ecj:JW0744;eco:b0761 NP_415282.1;WP_001147439.1 5 1B9M;1B9N;1H9R;1H9S;1O7L PF00126;PF03459 262 Transcriptional regulator ModE 2017-07-05 104 2005-07-19 1 P0A9G8.fasta P0A9G8.xml
b0889 P0ACJ0 False lrp ecj:JW0872;eco:b0889 NP_415409.1;WP_000228473.1 2 2GQQ;2L4A PF01037 164 Leucine-responsive regulatory protein 2017-07-05 104 2007-01-23 2 P0ACJ0.fasta P0ACJ0.xml
b0995 P38684 False torR ecj:JW0980;eco:b0995 NP_415515.1;WP_001120125.1 1 1ZGZ PF00072;PF00486 230 TorCAD operon transcriptional regulatory prote... 2017-07-05 146 1997-11-01 2 P38684.fasta P38684.xml
b1013 P0ACU2 False rutR ecj:JW0998;eco:b1013 NP_415533.1;WP_000191701.1 4 3LOC;4JYK;4X1E;4XK4 PF00440;PF08362 212 HTH-type transcriptional regulator RutR 2017-07-05 98 2005-11-22 1 P0ACU2.fasta P0ACU2.xml
b1014 P09546 False putA ecj:JW0999;eco:b1014 NP_415534.1;WP_001326840.1 16 1TIW;1TJ0;1TJ1;1TJ2;2AY0;2FZM;2FZN;2GPE;2RBF;3... PF00171;PF01619;PF14850 1320 Bifunctional protein PutA 2017-07-05 177 1997-11-01 3 P09546.fasta P09546.xml
In [10]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
Widget Javascript not detected.  It may not be installed or enabled properly.
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: 10/10: number of genes with a representative sequence
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.

Missing a representative sequence:  []
Out[10]:
uniprot kegg num_pdbs pdbs seq_len sequence_file metadata_file
gene
b0761 P0A9G8 ecj:JW0744;eco:b0761 5 1B9M;1B9N;1H9R;1H9S;1O7L 262 P0A9G8.fasta P0A9G8.xml
b0889 P0ACJ0 ecj:JW0872;eco:b0889 2 2GQQ;2L4A 164 P0ACJ0.fasta P0ACJ0.xml
b0995 P38684 ecj:JW0980;eco:b0995 1 1ZGZ 230 P38684.fasta P38684.xml
b1013 P0ACU2 ecj:JW0998;eco:b1013 4 3LOC;4JYK;4X1E;4XK4 212 P0ACU2.fasta P0ACU2.xml
b1014 P09546 ecj:JW0999;eco:b1014 16 1TIW;1TJ0;1TJ1;1TJ2;2AY0;2FZM;2FZN;2GPE;2RBF;3... 1320 P09546.fasta P09546.xml

Mapping representative sequence –> structure

These are the ways to map sequence to structure:

  1. Use the UniProt ID and their automatic mappings to the PDB
  2. BLAST the sequence to the PDB
  3. Make homology models or
  4. Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.

Methods

In [11]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2017-08-29 14:07] [root] INFO: getUserAgent: Begin
[2017-08-29 14:07] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.3; Linux) Python-requests/2.14.2
[2017-08-29 14:07] [root] INFO: getUserAgent: End
Widget Javascript not detected.  It may not be installed or enabled properly.
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: 8/10: number of genes with at least one experimental structure
[2017-08-29 14:07] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.

Out[11]:
pdb_id pdb_chain_id uniprot experimental_method resolution coverage start end unp_start unp_end rank
gene
b0761 1h9s A P0A9G8 X-ray diffraction 1.82 0.534 1 140 123 262 9
b0761 1b9m A P0A9G8 X-ray diffraction 1.75 1.000 4 265 1 262 1
b0761 1b9m B P0A9G8 X-ray diffraction 1.75 1.000 4 265 1 262 2
b0761 1o7l D P0A9G8 X-ray diffraction 2.75 1.000 1 262 1 262 8
b0761 1o7l C P0A9G8 X-ray diffraction 2.75 1.000 1 262 1 262 7
In [12]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
Widget Javascript not detected.  It may not be installed or enabled properly.
[2017-08-29 14:08] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2017-08-29 14:08] [ssbio.pipeline.gempro] INFO: 1: number of genes with additional structures added from BLAST

Out[12]:
pdb_id pdb_chain_id hit_score hit_evalue hit_percent_similar hit_percent_ident hit_num_ident hit_num_similar
gene
b1013 4x1e A 966.0 1.372630e-104 0.910377 0.910377 193 193
b1013 4x1e B 966.0 1.372630e-104 0.910377 0.910377 193 193

Downloading and ranking structures

Methods

Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [13]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
Widget Javascript not detected.  It may not be installed or enabled properly.
[2017-08-29 14:08] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2017-08-29 14:08] [ssbio.pipeline.gempro] INFO: Saved 40 structures total

Out[13]:
chemicals description experimental_method mapped_chains pdb_id pdb_title resolution structure_file taxonomy_name
gene
b0761 CA;CL;MOO TRANSCRIPTIONAL REGULATOR MODE X-ray diffraction A;B;C;D 1o7l Molybdate-activated form of ModE from Escheric... 2.75 1o7l.cif ESCHERICHIA COLI
b0761 MOO MOLYBDENUM TRANSPORT PROTEIN MODE X-ray diffraction A;B 1h9s Molybdate bound complex of Dimop domain of Mod... 1.82 1h9s.cif ESCHERICHIA COLI;ESCHERICHIA COLI
In [14]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
Widget Javascript not detected.  It may not be installed or enabled properly.
[2017-08-29 14:09] [ssbio.core.protein] WARNING: b1014: no structures meet quality checks
[2017-08-29 14:09] [ssbio.core.protein] WARNING: b0995: no structures meet quality checks
[2017-08-29 14:09] [ssbio.core.protein] WARNING: b1130: no structures meet quality checks
[2017-08-29 14:09] [ssbio.pipeline.gempro] INFO: 5/10: number of genes with a representative structure
[2017-08-29 14:09] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.

Out[14]:
id is_experimental file_type structure_file
gene
b0761 1b9m-A True pdb 1b9m-A_clean.pdb
b0889 2gqq-A True pdb 2gqq-A_clean.pdb
b1013 4jyk-A True pdb 4jyk-A_clean.pdb
b1187 1hw1-A True pdb 1hw1-A_clean.pdb
b1221 1a04-A True pdb 1a04-A_clean.pdb
In [15]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b1187').protein.representative_structure
my_gempro.genes.get_by_id('b1187').protein.representative_structure.get_dict()
Out[15]:
<StructProp 1hw1-A at 0x7f35fed36cf8>
Out[15]:
{'_structure_dir': '/tmp/genes_GP/genes/b1187/b1187_protein/structures',
 'chains': [<ChainProp A at 0x7f35fee7f748>],
 'date': None,
 'description': 'FATTY ACID METABOLISM REGULATOR PROTEIN',
 'file_type': 'pdb',
 'id': '1hw1-A',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'notes': {},
 'original_pdb_id': '1hw1',
 'resolution': 1.5,
 'structure_file': '1hw1-A_clean.pdb',
 'taxonomy_name': 'Escherichia coli'}

Saving your GEM-PRO

Warning: Saving is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.

In [16]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2017-08-29 14:09] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2017-08-29 14:09] [ssbio.core.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_GP) to /tmp/genes_GP/model/genes_GP.json