GEM-PRO - SBML Model (iNJ661)

This notebook gives an example of how to run the GEM-PRO pipeline with a SBML model, in this case iNJ661, the metabolic model of M. tuberculosis.

Input: GEM (in SBML, JSON, or MAT formats)
Output: GEM-PRO model

Imports

In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROJECT will be created
  • PROJECT
    • Your project name
  • LIST_OF_GENES
    • Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
Note: Methods for protein complexes and metabolites are still in development.
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'mtuberculosis_gp_atlas'
GEM_FILE = '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/model/iNJ661.json'
GEM_FILE_TYPE = 'json'
PDB_FILE_TYPE = 'mmtf'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, gem_file_path=GEM_FILE, gem_file_type=GEM_FILE_TYPE, pdb_file_type=PDB_FILE_TYPE)
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: /tmp/mtuberculosis_gp_atlas: GEM-PRO project location
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: iNJ661: loaded model
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: 1025: number of reactions
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: 720: number of reactions linked to a gene
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: 661: number of genes (excluding spontaneous)
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: 826: number of metabolites
[2017-09-24 05:45] [ssbio.pipeline.gempro] WARNING: IMPORTANT: All Gene objects have been transformed into GenePro objects, and will be for any new ones
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: 661: number of genes

Mapping gene ID –> sequence

First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

Note: You only need to map gene IDs using one service. However you can run both if some genes don’t map in one service and do map in another!

However, you don’t need to map using these services if you already have the amino acid sequences for each protein. You can just manually load in the sequences as shown using the method manual_seq_mapping. Or, if you already have the UniProt IDs, you can load those in using the method manual_uniprot_mapping.

Methods

In [8]:
gene_to_seq_dict = {'Rv1295': 'MTVPPTATHQPWPGVIAAYRDRLPVGDDWTPVTLLEGGTPLIAATNLSKQTGCTIHLKVEGLNPTGSFKDRGMTMAVTDALAHGQRAVLCASTGNTSASAAAYAARAGITCAVLIPQGKIAMGKLAQAVMHGAKIIQIDGNFDDCLELARKMAADFPTISLVNSVNPVRIEGQKTAAFEIVDVLGTAPDVHALPVGNAGNITAYWKGYTEYHQLGLIDKLPRMLGTQAAGAAPLVLGEPVSHPETIATAIRIGSPASWTSAVEAQQQSKGRFLAASDEEILAAYHLVARVEGVFVEPASAASIAGLLKAIDDGWVARGSTVVCTVTGNGLKDPDTALKDMPSVSPVPVDPVAVVEKLGLA',
                    'Rv2233': 'VSSPRERRPASQAPRLSRRPPAHQTSRSSPDTTAPTGSGLSNRFVNDNGIVTDTTASGTNCPPPPRAAARRASSPGESPQLVIFDLDGTLTDSARGIVSSFRHALNHIGAPVPEGDLATHIVGPPMHETLRAMGLGESAEEAIVAYRADYSARGWAMNSLFDGIGPLLADLRTAGVRLAVATSKAEPTARRILRHFGIEQHFEVIAGASTDGSRGSKVDVLAHALAQLRPLPERLVMVGDRSHDVDGAAAHGIDTVVVGWGYGRADFIDKTSTTVVTHAATIDELREALGV'}
my_gempro.manual_seq_mapping(gene_to_seq_dict)
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences
In [9]:
manual_uniprot_dict = {'Rv1755c': 'P9WIA9', 'Rv2321c': 'P71891', 'Rv0619': 'Q79FY3', 'Rv0618': 'Q79FY4', 'Rv2322c': 'P71890'}
my_gempro.manual_uniprot_mapping(manual_uniprot_dict)
my_gempro.df_uniprot_metadata.tail(4)
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: Completed manual ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.

Out[9]:
uniprot reviewed gene_name kegg refseq pfam description entry_date entry_version seq_date seq_version sequence_file metadata_file
gene
Rv0619 Q79FY3 False galTb NaN NaN PF02744 Probable galactose-1-phosphate uridylyltransfe... 2017-07-05 78 2004-07-05 1 Q79FY3.fasta Q79FY3.xml
Rv1755c P9WIA9 False plcD NaN NaN PF04185 Phospholipase C 4 2017-07-05 18 2014-04-16 1 P9WIA9.fasta P9WIA9.xml
Rv2321c P71891 False rocD2 mtv:RVBD_2321c WP_003411956.1 PF00202 Probable ornithine aminotransferase (C-terminu... 2017-07-05 116 1997-02-01 1 P71891.fasta P71891.xml
Rv2322c P71890 False rocD1 mtv:RVBD_2322c WP_003411957.1 PF00202 Probable ornithine aminotransferase (N-terminu... 2017-06-07 117 1997-02-01 1 P71890.fasta P71890.xml
In [10]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='mtu')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no sequence file available
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no metadata file available
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no sequence file available
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no metadata file available
[2017-09-24 05:45] [ssbio.core.protein] WARNING: Rv2233: representative sequence does not match mapped KEGG sequence.
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no sequence file available
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no metadata file available
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv0618: no sequence file available
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no sequence file available
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no metadata file available
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no sequence file available
[2017-09-24 05:45] [root] WARNING: status is not ok with Not Found
[2017-09-24 05:45] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no metadata file available
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: 655/661: number of genes mapped to KEGG
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.

Missing KEGG mapping:  ['Rv2233', 'Rv2322c', 'Rv0619', 'Rv0618', 'Rv2321c', 'Rv1755c']
Out[10]:
kegg refseq uniprot pdbs sequence_file metadata_file
gene
Rv0013 mtu:Rv0013 YP_177615 P9WN35 NaN mtu-Rv0013.faa mtu-Rv0013.kegg
Rv0032 mtu:Rv0032 NP_214546 P9WQ85 NaN mtu-Rv0032.faa mtu-Rv0032.kegg
Rv0046c mtu:Rv0046c NP_214560 P9WKI1 1GR0 mtu-Rv0046c.faa mtu-Rv0046c.kegg
Rv0066c mtu:Rv0066c NP_214580 O53611 5KVU mtu-Rv0066c.faa mtu-Rv0066c.kegg
Rv0069c mtu:Rv0069c NP_214583 P9WGT5 NaN mtu-Rv0069c.faa mtu-Rv0069c.kegg
In [11]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='TUBERCULIST_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2017-09-24 05:45] [root] INFO: getUserAgent: Begin
[2017-09-24 05:45] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.3; Linux) Python-requests/2.14.2
[2017-09-24 05:45] [root] INFO: getUserAgent: End
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: 589/661: number of genes mapped to UniProt
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.

Missing UniProt mapping:  ['Rv0753c', 'Rv2458', 'Rv2471', 'Rv0649', 'Rv3758c', 'Rv1902c', 'Rv2233', 'Rv0156', 'Rv3468c', 'Rv2436', 'Rv2320c', 'Rv1916', 'Rv0511', 'Rv1163', 'Rv1928c', 'Rv1662', 'Rv1618', 'Rv0266c', 'Rv3737', 'Rv0812', 'Rv2380c', 'Rv2590', 'Rv3784', 'Rv2858c', 'Rv0253', 'Rv1844c', 'Rv2398c', 'Rv2524c', 'Rv1511', 'Rv2318', 'Rv3565', 'Rv0317c', 'Rv1704c', 'Rv1239c', 'Rv3113', 'Rv2062c', 'Rv2671', 'Rv0860', 'Rv0958', 'Rv1005c', 'Rv3759c', 'Rv2833c', 'Rv0252', 'Rv1127c', 'Rv3331', 'Rv1162', 'Rv2382c', 'Rv2379c', 'Rv0974c', 'Rv2381c', 'Rv0993', 'Rv0082', 'Rv0727c', 'Rv1647', 'Rv0143c', 'Rv3777', 'Rv3469c', 'Rv1915', 'Rv1512', 'Rv3379c', 'Rv0375c', 'Rv0155', 'Rv3281', 'Rv3317', 'Rv2316', 'Rv1164', 'Rv0147', 'Rv3332']
Out[11]:
uniprot reviewed gene_name kegg refseq pdbs pfam description entry_date entry_version seq_date seq_version sequence_file metadata_file
gene
Rv0013 P9WN35 False trpG mtu:Rv0013 WP_003899773.1;YP_177615.1 NaN PF00117 Anthranilate synthase component 2 2017-06-07 22 2014-04-16 1 P9WN35.fasta P9WN35.xml
Rv0032 P9WQ85 False bioF2 mtu:Rv0032 NP_214546.1;WP_003905217.1 NaN PF00155 Putative 8-amino-7-oxononanoate synthase 2 2017-05-10 20 2014-04-16 1 P9WQ85.fasta P9WQ85.xml
Rv0046c P9WKI1 False ino1 mtu:Rv0046c NP_214560.1;WP_003902822.1 1GR0 PF01658 Inositol-3-phosphate synthase 2017-06-07 25 2014-04-16 1 P9WKI1.fasta P9WKI1.xml
Rv0066c O53611 False icd2 mtu:Rv0066c;mtv:RVBD_0066c NP_214580.1;WP_003899797.1 NaN PF03971 Probable isocitrate dehydrogenase [NADP] Icd2 ... 2017-06-07 127 1998-06-01 1 O53611.fasta O53611.xml
Rv0069c P9WGT5 False sdaA mtu:Rv0069c NP_214583.1;WP_003400600.1 NaN PF03313;PF03315 L-serine dehydratase 2017-06-07 20 2014-04-16 1 P9WGT5.fasta P9WGT5.xml

If you have mapped with both KEGG and UniProt mappers, then you can set a representative sequence for the gene using this function. If you used just one, this will just set that ID as representative.

  • If any sequences or IDs were provided manually, these will be set as representative first.
  • UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
In [12]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: 661/661: number of genes with a representative sequence
[2017-09-24 05:45] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.

Missing a representative sequence:  []
Out[12]:
uniprot kegg pdbs sequence_file metadata_file
gene
Rv0013 P9WN35 mtu:Rv0013 NaN P9WN35.fasta P9WN35.xml
Rv0032 P9WQ85 mtu:Rv0032 NaN P9WQ85.fasta P9WQ85.xml
Rv0046c P9WKI1 mtu:Rv0046c 1GR0 P9WKI1.fasta P9WKI1.xml
Rv0066c O53611 mtu:Rv0066c;mtv:RVBD_0066c NaN O53611.fasta O53611.xml
Rv0069c P9WGT5 mtu:Rv0069c NaN P9WGT5.fasta P9WGT5.xml

Mapping representative sequence –> structure

These are the ways to map sequence to structure:

  1. Use the UniProt ID and their automatic mappings to the PDB
  2. BLAST the sequence to the PDB
  3. Make homology models or
  4. Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.

Methods

In [13]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2017-09-24 05:46] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2017-09-24 05:46] [root] INFO: getUserAgent: Begin
[2017-09-24 05:46] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.3; Linux) Python-requests/2.14.2
[2017-09-24 05:46] [root] INFO: getUserAgent: End
[2017-09-24 05:46] [ssbio.pipeline.gempro] INFO: 182/661: number of genes with at least one experimental structure
[2017-09-24 05:46] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.

Out[13]:
pdb_id pdb_chain_id uniprot experimental_method resolution coverage start end unp_start unp_end rank
gene
Rv0046c 1gr0 A P9WKI1 X-ray diffraction 1.95 1.0 1 367 1 367 1
Rv0098 3b18 A P9WM67 X-ray diffraction 2.75 1.0 1 183 1 183 2
Rv0098 2pfc A P9WM67 X-ray diffraction 2.30 1.0 1 183 1 183 1
Rv0137c 1nwa A P9WJM5 X-ray diffraction 1.50 1.0 22 203 1 182 1
Rv0211 4r43 A P9WIH3 X-ray diffraction 1.80 1.0 21 626 1 606 1
In [14]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2017-09-24 05:46] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2017-09-24 05:46] [ssbio.pipeline.gempro] INFO: 31: number of genes with additional structures added from BLAST

Out[14]:
pdb_id pdb_chain_id hit_score hit_evalue hit_percent_similar hit_percent_ident hit_num_ident hit_num_similar
gene
Rv0066c 5kvu D 3828.0 0.0 0.981208 0.981208 731 731
Rv0066c 5kvu B 3828.0 0.0 0.981208 0.981208 731 731
In [15]:
tb_homology_dir = '/home/nathan/projects_archive/homology_models/MTUBERCULOSIS/'

##### EXAMPLE SPECIFIC CODE #####
# Needed to map to older IDs used in this example
import pandas as pd
import os.path as op
old_gene_to_homology = pd.read_csv(op.join(tb_homology_dir, 'data/161031-old_gene_to_uniprot_mapping.csv'))
gene_to_uniprot = old_gene_to_homology.set_index('m_gene').to_dict()['u_uniprot_acc']
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'), custom_itasser_name_mapping=gene_to_uniprot)
### END EXAMPLE SPECIFIC CODE ###

# Organizing I-TASSER homology models
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'))
my_gempro.df_homology_models.head()
[2017-09-24 05:46] [ssbio.pipeline.gempro] INFO: Completed copying of 435 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.

[2017-09-24 05:46] [ssbio.pipeline.gempro] INFO: Completed copying of 9 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.

Out[15]:
c_score difficulty id model_date model_file rmsd rmsd_err tm_score tm_score_err top_template_chain top_template_pdb
gene
Rv0013 -0.53 easy P9WN35 2015-12-28 P9WN35_model1.pdb 6.8 4.0 0.65 0.13 B 1i7s
Rv0032 -2.89 easy P9WQ85 2016-01-11 P9WQ85_model1.pdb 15.7 3.3 0.39 0.13 A 3a2b
Rv0066c 1.91 easy O53611 2016-01-04 O53611_model1.pdb 4.1 2.8 0.99 0.04 A 1itw
Rv0069c 1.18 easy P9WGT5 2016-01-08 P9WGT5_model1.pdb 4.6 3.0 0.88 0.07 A 4rqo
Rv0070c 1.80 easy P9WGI7 2015-12-29 P9WGI7_model1.pdb 3.3 2.3 0.97 0.05 B 3h7f
In [16]:
homology_model_dict = {}
my_gempro.get_manual_homology_models(homology_model_dict)
[2017-09-24 05:56] [ssbio.pipeline.gempro] INFO: Updated homology model information for 0 genes.

Downloading and ranking structures

Methods

Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [17]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
555/|/ 84%|| 555/661 [04:40<00:53,  1.98it/s]
[2017-09-24 06:02] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2017-09-24 06:02] [ssbio.pipeline.gempro] INFO: Saved 973 structures total

Out[17]:
chemicals description experimental_method mapped_chains pdb_id pdb_title resolution structure_file taxonomy_name
gene
Rv0046c CAC;NAD;ZN INOSITOL-3-PHOSPHATE SYNTHASE (E.C.5.5.1.4) X-ray diffraction A 1gr0 myo-inositol 1-phosphate synthase from Mycobac... 1.95 1gr0.mmtf MYCOBACTERIUM TUBERCULOSIS
Rv0066c EDO;GOL;MLA;MLT;NAP;SIN Isocitrate dehydrogenase (E.C.1.1.1.42) X-RAY DIFFRACTION A;B;C;D 5kvu Crystal structure of isocitrate dehydrogenase-... 2.66 5kvu.mmtf Mycobacterium tuberculosis (strain ATCC 25618 ...
In [18]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2017-09-24 06:05] [ssbio.core.protein] WARNING: Rv0432: no structures meet quality checks
[2017-09-24 06:06] [ssbio.core.protein] WARNING: Rv1286: no structures meet quality checks
[2017-09-24 06:06] [ssbio.core.protein] WARNING: Rv2934: no structures meet quality checks
[2017-09-24 06:06] [ssbio.core.protein] WARNING: Rv2932: no structures meet quality checks
[2017-09-24 06:06] [ssbio.core.protein] WARNING: Rv2933: no structures meet quality checks
[2017-09-24 06:06] [ssbio.core.protein] WARNING: Rv2931: no structures meet quality checks
[2017-09-24 06:06] [ssbio.core.protein] WARNING: Rv2945c: no structures meet quality checks
[2017-09-24 06:06] [ssbio.core.protein] WARNING: Rv2941: no structures meet quality checks
[2017-09-24 06:06] [ssbio.core.protein] WARNING: Rv2495c: no structures meet quality checks
[2017-09-24 06:06] [ssbio.core.protein] WARNING: Rv2380c: no structures meet quality checks
[2017-09-24 06:06] [ssbio.core.protein] WARNING: Rv2987c: no structures meet quality checks
[2017-09-24 06:07] [ssbio.core.protein] WARNING: Rv3859c: no structures meet quality checks
[2017-09-24 06:07] [ssbio.core.protein] WARNING: Rv2476c: no structures meet quality checks
[2017-09-24 06:07] [ssbio.core.protein] WARNING: Rv1653: no structures meet quality checks
[2017-09-24 06:07] [ssbio.core.protein] WARNING: Rv3800c: no structures meet quality checks
[2017-09-24 06:08] [ssbio.core.protein] WARNING: Rv2498c: no structures meet quality checks
[2017-09-24 06:08] [ssbio.core.protein] WARNING: Rv1885c: no structures meet quality checks
[2017-09-24 06:08] [ssbio.core.protein] WARNING: Rv3601c: no structures meet quality checks
[2017-09-24 06:08] [ssbio.core.protein] WARNING: Rv3330: no structures meet quality checks
[2017-09-24 06:08] [ssbio.core.protein] WARNING: Rv3793: no structures meet quality checks
[2017-09-24 06:08] [ssbio.core.protein] WARNING: Rv2940c: no structures meet quality checks
[2017-09-24 06:08] [ssbio.core.protein] WARNING: Rv1662: no structures meet quality checks
[2017-09-24 06:08] [ssbio.core.protein] WARNING: Rv2524c: no structures meet quality checks
[2017-09-24 06:08] [ssbio.core.protein] WARNING: Rv1625c: no structures meet quality checks
[2017-09-24 06:08] [ssbio.pipeline.gempro] INFO: 590/661: number of genes with a representative structure
[2017-09-24 06:08] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.

Out[18]:
id is_experimental file_type structure_file
gene
Rv0013 REP-P9WN35 False pdb P9WN35_model1-X_clean.pdb
Rv0032 REP-P9WQ85 False pdb P9WQ85_model1-X_clean.pdb
Rv0046c REP-1gr0 True pdb 1gr0-A_clean.pdb
Rv0066c REP-5kvu True pdb 5kvu-A_clean.pdb
Rv0069c REP-P9WGT5 False pdb P9WGT5_model1-X_clean.pdb
In [19]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure.get_dict()
Out[19]:
<StructProp REP-2d1f at 0x7fc1468b42b0>
Out[19]:
{'_structure_dir': '/tmp/mtuberculosis_gp_atlas/genes/Rv1295/Rv1295_protein/structures',
 'chains': [<ChainProp A at 0x7fc145e106a0>],
 'date': None,
 'description': 'Threonine synthase (E.C.4.2.3.1)',
 'file_type': 'pdb',
 'id': 'REP-2d1f',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'notes': {},
 'original_pdb_id': '2d1f',
 'resolution': 2.5,
 'structure_file': '2d1f-A_clean.pdb',
 'taxonomy_name': 'Mycobacterium tuberculosis'}

Creating homology models

For those proteins with no representative structure, we can create homology models for them. ssbio contains some built in functions for easily running I-TASSER locally or on machines with SLURM (ie. on NERSC) or Torque job scheduling.

You can load in I-TASSER models once they complete using the get_itasser_models later.

Info: Homology modeling can take a long time - about 24-72 hours per protein (highly dependent on the sequence length, as well as if there are available templates).

Methods

In [20]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_models('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2934: I-TASSER modeling will not run as sequence length (1827) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2932: I-TASSER modeling will not run as sequence length (1538) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2933: I-TASSER modeling will not run as sequence length (2188) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2931: I-TASSER modeling will not run as sequence length (1876) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2380c: I-TASSER modeling will not run as sequence length (1682) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3859c: I-TASSER modeling will not run as sequence length (1527) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2476c: I-TASSER modeling will not run as sequence length (1624) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3800c: I-TASSER modeling will not run as sequence length (1733) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv0107c: I-TASSER modeling will not run as sequence length (1632) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2940c: I-TASSER modeling will not run as sequence length (2111) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv1662: I-TASSER modeling will not run as sequence length (1602) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2524c: I-TASSER modeling will not run as sequence length (3069) is not in the range [10, 1500]
[2017-09-24 04:17] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 71 genes in folder /tmp/mtuberculosis_gp_atlas/data/homology_models

Saving your GEM-PRO

Finally, you can save your GEM-PRO as a JSON or pickle file, so you don’t have to run the pipeline again.

For most functions, if you rerun them, they will check for existing results saved as files. The only function that would take a long time is setting the representative structure, as they are each rechecked and cleaned. This is where saving helps!

Warning: Saving in JSON format is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.

In [21]:
import os.path as op
my_gempro.save_pickle(op.join(my_gempro.model_dir, '{}.pckl'.format(my_gempro.id)))
In [5]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)

Loading a saved GEM-PRO

In [1]:
# Loading a pickle file
import pickle
with open('/tmp/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.pckl', 'rb') as f:
    my_saved_gempro = pickle.load(f)
In [13]:
my_saved_gempro.genes[0].__json_encode__()
Out[13]:
{'id': 'Rv0417',
 'name': 'thiG',
 'notes': {'original_bigg_ids': ['Rv0417']},
 'protein': <Protein Rv0417 at 0x7f103b45c1d0>,
 'root_dir': '/tmp/mtuberculosis_gp_atlas/genes'}
In [10]:
my_saved_gempro.genes[0].protein.__json_encode__()
Out[10]:
{'_root_dir': '/tmp/mtuberculosis_gp_atlas/genes/Rv0417',
 'description': None,
 'id': 'Rv0417',
 'notes': {},
 'pdb_file_type': 'mmtf',
 'representative_chain': 'X',
 'representative_chain_seq_coverage': 100.0,
 'representative_sequence': UniProtProp(seq=Seq('MAESKLVIGDRSFASRLIMGTGGATNLAVLEQALIASGTELTTVAIRRVDADGG...PAR', SingleLetterAlphabet()), id='P9WG73', name='sp|P9WG73|THIG_MYCTU', description='Thiazole synthase', dbxrefs=[]),
 'representative_structure': <StructProp REP-P9WG73 at 0x7f103b45c4a8>,
 'sequence_alignments': [<<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 252, SingleLetterAlphabet()) at 7f103b45ca20>],
 'sequences': [KEGGProp(seq=Seq('MAESKLVIGDRSFASRLIMGTGGATNLAVLEQALIASGTELTTVAIRRVDADGG...PAR', SingleLetterAlphabet()), id='mtu:Rv0417', name='mtu:Rv0417', description='mtu:Rv0417 K03149 thiazole synthase [EC:2.8.1.10] | (RefSeq) thiG; thiazole synthase (A)', dbxrefs=[]),
  UniProtProp(seq=Seq('MAESKLVIGDRSFASRLIMGTGGATNLAVLEQALIASGTELTTVAIRRVDADGG...PAR', SingleLetterAlphabet()), id='P9WG73', name='sp|P9WG73|THIG_MYCTU', description='Thiazole synthase', dbxrefs=[])],
 'structure_alignments': [],
 'structures': [<ITASSERProp P9WG73 at 0x7f103b45c208>,
  <StructProp REP-P9WG73 at 0x7f103b45c4a8>]}
In [6]:
# Loading a JSON file
import ssbio.core.io
my_saved_gempro = ssbio.core.io.load_json('/tmp/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.json', decompression=False)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-1a7ccd1e60af> in <module>()
      1 # Loading a JSON file
      2 import ssbio.core.io
----> 3 my_saved_gempro = ssbio.core.io.load_json('/tmp/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.json', decompression=False)

/mnt/projects/ssbio/ssbio/core/io.py in load_json(file, new_root_dir, decompression)
     23     else:
     24         with open(file, 'r') as f:
---> 25             my_object = load(f, decompression=decompression)
     26     if new_root_dir:
     27         my_object.root_dir = new_root_dir

/home/nathan/anaconda3/lib/python3.5/site-packages/json_tricks/np.py in load(fp, preserve_order, ignore_comments, decompression, obj_pairs_hooks, extra_obj_pairs_hooks, cls_lookup_map, allow_duplicates, **jsonkwargs)
    125    return nonp.load(fp, preserve_order=preserve_order, ignore_comments=ignore_comments, decompression=decompression,
    126                 obj_pairs_hooks=obj_pairs_hooks, extra_obj_pairs_hooks=extra_obj_pairs_hooks, cls_lookup_map=cls_lookup_map,
--> 127                 allow_duplicates=allow_duplicates, **jsonkwargs)
    128
    129

/home/nathan/anaconda3/lib/python3.5/site-packages/json_tricks/nonp.py in load(fp, preserve_order, ignore_comments, decompression, obj_pairs_hooks, extra_obj_pairs_hooks, cls_lookup_map, allow_duplicates, **jsonkwargs)
    191    return loads(string, preserve_order=preserve_order, ignore_comments=ignore_comments, decompression=decompression,
    192                 obj_pairs_hooks=obj_pairs_hooks, extra_obj_pairs_hooks=extra_obj_pairs_hooks, cls_lookup_map=cls_lookup_map,
--> 193                 allow_duplicates=allow_duplicates, **jsonkwargs)
    194
    195

/home/nathan/anaconda3/lib/python3.5/site-packages/json_tricks/nonp.py in loads(string, preserve_order, ignore_comments, decompression, obj_pairs_hooks, extra_obj_pairs_hooks, cls_lookup_map, allow_duplicates, **jsonkwargs)
    166         hooks = tuple(extra_obj_pairs_hooks) + obj_pairs_hooks
    167         hook = TricksPairHook(ordered=preserve_order, obj_pairs_hooks=hooks, allow_duplicates=allow_duplicates)
--> 168         return json_loads(string, object_pairs_hook=hook, **jsonkwargs)
    169
    170

/home/nathan/anaconda3/lib/python3.5/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    330     if parse_constant is not None:
    331         kw['parse_constant'] = parse_constant
--> 332     return cls(**kw).decode(s)

/home/nathan/anaconda3/lib/python3.5/json/decoder.py in decode(self, s, _w)
    337
    338         """
--> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    340         end = _w(s, end).end()
    341         if end != len(s):

/home/nathan/anaconda3/lib/python3.5/json/decoder.py in raw_decode(self, s, idx)
    353         """
    354         try:
--> 355             obj, end = self.scan_once(s, idx)
    356         except StopIteration as err:
    357             raise JSONDecodeError("Expecting value", s, err.value) from None

/home/nathan/anaconda3/lib/python3.5/site-packages/json_tricks/decoders.py in __call__(self, pairs)
     39                 map = self.map_type(pairs)
     40                 for hook in self.obj_pairs_hooks:
---> 41                         map = hook(map)
     42                 return map
     43

/home/nathan/anaconda3/lib/python3.5/site-packages/json_tricks/decoders.py in __call__(self, dct)
    147                                    '__new__ method and can\'t be restored').format(name))
    148                         if hasattr(obj, '__json_decode__'):
--> 149                                 obj.__json_decode__(**attrs)
    150                         else:
    151                                 obj.__dict__ = dict(attrs)

/mnt/projects/ssbio/ssbio/core/genepro.py in __json_decode__(self, **attrs)
     78     def __json_decode__(self, **attrs):
     79         for k, v in attrs.items():
---> 80             setattr(self, k, v)

/mnt/projects/ssbio/ssbio/core/genepro.py in root_dir(self, path)
     36             raise ValueError('{}: folder does not exist'.format(path))
     37
---> 38         if self._root_dir:
     39             log.debug('Changing root directory of Gene "{}" from {} to {}'.format(self.id, self.root_dir, path))
     40

AttributeError: 'GenePro' object has no attribute '_root_dir'