GEM-PRO - SBML Model (iNJ661)

This notebook gives an example of how to run the GEM-PRO pipeline with a SBML model, in this case iNJ661, the metabolic model of M. tuberculosis.

Input: GEM (in SBML, JSON, or MAT formats)
Output: GEM-PRO model

Imports

In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROJECT will be created
  • PROJECT
    • Your project name
  • LIST_OF_GENES
    • Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
Note: Methods for protein complexes and metabolites are still in development.
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'mtuberculosis_gp_atlas'
GEM_FILE = '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/model/iNJ661.json'
GEM_FILE_TYPE = 'json'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, gem_file_path=GEM_FILE, gem_file_type=GEM_FILE_TYPE)
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: iNJ661: loaded model
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: 1025: number of reactions
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: 720: number of reactions linked to a gene
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: 661: number of genes (excluding spontaneous)
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: 826: number of metabolites
[2017-03-29 19:24] [ssbio.pipeline.gempro] WARNING: IMPORTANT: All Gene objects have been transformed into GenePro objects, and will be for any new ones
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: /home/nathan/projects_unsynced/mtuberculosis_gp_atlas: GEM-PRO project location
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: 661: number of genes

Mapping gene ID –> sequence

First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

Note: You only need to map gene IDs using one service. However you can run both if some genes don’t map in one service and do map in another!

However, you don’t need to map using these services if you already have the amino acid sequences for each protein. You can just manually load in the sequences as shown using the method manual_seq_mapping. Or, if you already have the UniProt IDs, you can load those in using the method manual_uniprot_mapping.

Methods

In [8]:
gene_to_seq_dict = {'Rv1295': 'MTVPPTATHQPWPGVIAAYRDRLPVGDDWTPVTLLEGGTPLIAATNLSKQTGCTIHLKVEGLNPTGSFKDRGMTMAVTDALAHGQRAVLCASTGNTSASAAAYAARAGITCAVLIPQGKIAMGKLAQAVMHGAKIIQIDGNFDDCLELARKMAADFPTISLVNSVNPVRIEGQKTAAFEIVDVLGTAPDVHALPVGNAGNITAYWKGYTEYHQLGLIDKLPRMLGTQAAGAAPLVLGEPVSHPETIATAIRIGSPASWTSAVEAQQQSKGRFLAASDEEILAAYHLVARVEGVFVEPASAASIAGLLKAIDDGWVARGSTVVCTVTGNGLKDPDTALKDMPSVSPVPVDPVAVVEKLGLA',
                    'Rv2233': 'VSSPRERRPASQAPRLSRRPPAHQTSRSSPDTTAPTGSGLSNRFVNDNGIVTDTTASGTNCPPPPRAAARRASSPGESPQLVIFDLDGTLTDSARGIVSSFRHALNHIGAPVPEGDLATHIVGPPMHETLRAMGLGESAEEAIVAYRADYSARGWAMNSLFDGIGPLLADLRTAGVRLAVATSKAEPTARRILRHFGIEQHFEVIAGASTDGSRGSKVDVLAHALAQLRPLPERLVMVGDRSHDVDGAAAHGIDTVVVGWGYGRADFIDKTSTTVVTHAATIDELREALGV'}
my_gempro.manual_seq_mapping(gene_to_seq_dict)
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences
In [9]:
manual_uniprot_dict = {'Rv1755c': 'P9WIA9', 'Rv2321c': 'P71891', 'Rv0619': 'Q79FY3', 'Rv0618': 'Q79FY4', 'Rv2322c': 'P71890'}
my_gempro.manual_uniprot_mapping(manual_uniprot_dict)
my_gempro.df_uniprot_metadata.tail(4)
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: Completed manual ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Out[9]:
uniprot reviewed gene_name kegg refseq num_pdbs ec_number pfam seq_len description entry_version seq_version sequence_file metadata_file
gene
Rv0619 Q79FY3 False galTb NaN NaN 0 NaN PF02744 181 Probable galactose-1-phosphate uridylyltransfe... 2016-11-02 2004-07-05 Q79FY3.fasta Q79FY3.txt
Rv0618 Q79FY4 False galTa mtv:RVBD_0618 NaN 0 NaN PF01087 231 Probable galactose-1-phosphate uridylyltransfe... 2016-11-30 2004-07-05 Q79FY4.fasta Q79FY4.txt
Rv2321c P71891 False rocD2 mtv:RVBD_2321c NaN 0 NaN PF00202 181 Probable ornithine aminotransferase (C-terminu... 2016-11-02 1997-02-01 P71891.fasta P71891.txt
Rv2322c P71890 False rocD1 mtv:RVBD_2322c WP_003411957.1;NZ_KK339370.1 0 NaN PF00202 221 Probable ornithine aminotransferase (N-terminu... 2016-11-30 1997-02-01 P71890.fasta P71890.txt
In [10]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='mtu')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no sequence file available
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no metadata file available
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no sequence file available
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no metadata file available
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no sequence file available
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no metadata file available
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv0618: no sequence file available
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no sequence file available
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no metadata file available
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no sequence file available
[2017-03-29 19:24] [root] WARNING: status is not ok with Not Found
[2017-03-29 19:24] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no metadata file available
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: 655/661: number of genes mapped to KEGG
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.

Missing KEGG mapping:  ['Rv2322c', 'Rv2233', 'Rv0619', 'Rv1755c', 'Rv0618', 'Rv2321c']
Out[10]:
kegg refseq uniprot num_pdbs pdbs seq_len sequence_file metadata_file
gene
Rv0417 mtu:Rv0417 NP_214931 P9WG73 0 NaN 252 mtu-Rv0417.faa mtu-Rv0417.kegg
Rv2291 mtu:Rv2291 NP_216807 P9WHF5 0 NaN 284 mtu-Rv2291.faa mtu-Rv2291.kegg
Rv3737 mtu:Rv3737 NP_218254 O69704 0 NaN 529 mtu-Rv3737.faa mtu-Rv3737.kegg
Rv1295 mtu:Rv1295 NP_215811 P9WG59 1 2D1F 360 mtu-Rv1295.faa mtu-Rv1295.kegg
Rv1559 mtu:Rv1559 NP_216075 P9WG95 0 NaN 429 mtu-Rv1559.faa mtu-Rv1559.kegg
In [11]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='TUBERCULIST_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2017-03-29 19:24] [root] INFO: getUserAgent: Begin
[2017-03-29 19:24] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.2; Linux) Python-requests/2.12.4
[2017-03-29 19:24] [root] INFO: getUserAgent: End
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: 589/661: number of genes mapped to UniProt
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.

Missing UniProt mapping:  ['Rv0156', 'Rv2398c', 'Rv0649', 'Rv0511', 'Rv0266c', 'Rv2458', 'Rv2062c', 'Rv3777', 'Rv3758c', 'Rv2379c', 'Rv1164', 'Rv0753c', 'Rv0317c', 'Rv1163', 'Rv3469c', 'Rv0147', 'Rv2318', 'Rv2671', 'Rv2382c', 'Rv0860', 'Rv0993', 'Rv3113', 'Rv0727c', 'Rv3784', 'Rv1512', 'Rv3737', 'Rv1005c', 'Rv1162', 'Rv3281', 'Rv2380c', 'Rv3331', 'Rv1647', 'Rv2858c', 'Rv2471', 'Rv1844c', 'Rv0812', 'Rv1662', 'Rv2381c', 'Rv2524c', 'Rv1127c', 'Rv1902c', 'Rv1511', 'Rv1916', 'Rv1618', 'Rv2316', 'Rv0958', 'Rv1915', 'Rv0252', 'Rv0082', 'Rv3317', 'Rv3379c', 'Rv0143c', 'Rv1704c', 'Rv2233', 'Rv2436', 'Rv3759c', 'Rv2320c', 'Rv0155', 'Rv0253', 'Rv3565', 'Rv0974c', 'Rv3468c', 'Rv2833c', 'Rv2590', 'Rv3332', 'Rv0375c', 'Rv1239c', 'Rv1928c']
Out[11]:
uniprot reviewed gene_name kegg refseq num_pdbs pdbs ec_number pfam seq_len description entry_version seq_version sequence_file metadata_file
gene
Rv0417 P9WG73 True thiG mtu:Rv0417 NP_214931.1;NC_000962.3;WP_003916659.1;NZ_KK33... 0 NaN 2.8.1.10 NaN 252 Thiazole synthase {ECO:0000255|HAMAP-Rule:MF_0... 2016-11-02 2014-04-16 P9WG73.fasta P9WG73.txt
Rv2291 P9WHF5 True sseB mtu:Rv2291 NP_216807.1;NC_000962.3;WP_003899253.1;NZ_KK33... 0 NaN 2.8.1.1 PF00581 284 Putative thiosulfate sulfurtransferase SseB 2016-11-02 2014-04-16 P9WHF5.fasta P9WHF5.txt
Rv1295 P9WG59 True thrC mtu:Rv1295 NP_215811.1;NC_000962.3;WP_003406652.1;NZ_KK33... 1 2D1F 4.2.3.1 PF00291 360 TS;Threonine synthase 2016-11-02 2014-04-16 P9WG59.fasta P9WG59.txt
Rv1559 P9WG95 True ilvA mtu:Rv1559 NP_216075.1;NC_000962.3;WP_003407781.1;NZ_KK33... 0 NaN 4.3.1.19 PF00291;PF00585 429 Threonine deaminase;L-threonine dehydratase bi... 2016-11-02 2014-04-16 P9WG95.fasta P9WG95.txt
Rv2447c I6Y0R5 False folC mtu:Rv2447c;mtv:RVBD_2447c NP_216963.1;NC_000962.3;WP_003899324.1;NZ_KK33... 0 NaN NaN PF02875;PF08245 487 Probable folylpolyglutamate synthase protein F... 2016-11-02 2012-10-03 I6Y0R5.fasta I6Y0R5.txt

If you have mapped with both KEGG and UniProt mappers, then you can set a representative sequence for the gene using this function. If you used just one, this will just set that ID as representative.

  • If any sequences or IDs were provided manually, these will be set as representative first.
  • UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
In [12]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: 661/661: number of genes with a representative sequence
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.

Missing a representative sequence:  []
Out[12]:
uniprot kegg num_pdbs pdbs seq_len sequence_file
gene
Rv0417 P9WG73 mtu:Rv0417 0 NaN 252 P9WG73.fasta
Rv2291 P9WHF5 mtu:Rv2291 0 NaN 284 P9WHF5.fasta
Rv3737 O69704 mtu:Rv3737 0 NaN 529 mtu-Rv3737.faa
Rv1295 P9WG59 mtu:Rv1295 1 2D1F 360 Rv1295.faa
Rv1559 P9WG95 mtu:Rv1559 0 NaN 429 P9WG95.fasta

Mapping representative sequence –> structure

These are the ways to map sequence to structure:

  1. Use the UniProt ID and their automatic mappings to the PDB
  2. BLAST the sequence to the PDB
  3. Make homology models or
  4. Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.

Methods

In [13]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2017-03-29 19:24] [root] INFO: getUserAgent: Begin
[2017-03-29 19:24] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.2; Linux) Python-requests/2.12.4
[2017-03-29 19:24] [root] INFO: getUserAgent: End
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: 178/661: number of genes with at least one experimental structure
[2017-03-29 19:24] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.

Out[13]:
pdb_id pdb_chain_id uniprot experimental_method resolution coverage start end unp_start unp_end rank
gene
Rv1295 2d1f A P9WG59 X-ray diffraction 2.50 1.000 1 360 1 360 1
Rv1295 2d1f B P9WG59 X-ray diffraction 2.50 1.000 1 360 1 360 2
Rv1201c 3fsy A P9WP21 X-ray diffraction 1.97 0.997 4 319 2 317 1
Rv1201c 3fsy B P9WP21 X-ray diffraction 1.97 0.997 4 319 2 317 2
Rv1201c 3fsy C P9WP21 X-ray diffraction 1.97 0.997 4 319 2 317 3
In [14]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2017-03-29 19:25] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2017-03-29 19:25] [ssbio.pipeline.gempro] INFO: 30: number of genes with additional structures added from BLAST

Out[14]:
pdb_id pdb_chain_id hit_score hit_evalue hit_percent_similar hit_percent_ident hit_num_ident hit_num_similar
gene
Rv1908c 4c50 A 3652.0 0.0 0.974324 0.974324 721 721
Rv1908c 4c50 B 3652.0 0.0 0.974324 0.974324 721 721
In [15]:
tb_homology_dir = '/home/nathan/projects_archive/homology_models/MTUBERCULOSIS/'

##### EXAMPLE SPECIFIC CODE #####
# Needed to map to older IDs used in this example
import pandas as pd
import os.path as op
old_gene_to_homology = pd.read_csv(op.join(tb_homology_dir, 'data/161031-old_gene_to_uniprot_mapping.csv'))
gene_to_uniprot = old_gene_to_homology.set_index('m_gene').to_dict()['u_uniprot_acc']
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'), custom_itasser_name_mapping=gene_to_uniprot)
### END EXAMPLE SPECIFIC CODE ###

# Organizing I-TASSER homology models
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'))
my_gempro.df_homology_models.head()
[2017-03-29 19:25] [ssbio.pipeline.gempro] INFO: Completed copying of 435 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.

[2017-03-29 19:25] [ssbio.pipeline.gempro] INFO: Completed copying of 9 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.

Out[15]:
c_score difficulty id model_date model_file rmsd rmsd_err tm_score tm_score_err top_template_chain top_template_pdb
gene
Rv0417 1.66 easy P9WG73 2015-12-30 P9WG73_model1.pdb 2.6 1.9 0.95 0.05 C 2htm
Rv2291 1.38 easy P9WHF5 2016-01-04 P9WHF5_model1.pdb 3.3 2.3 0.91 0.06 A 3olh
Rv1559 0.73 easy P9WG95 2016-01-08 P9WG95_model1.pdb 5.4 3.4 0.81 0.09 A 1tdj
Rv3113 0.72 easy O05790 2015-12-30 O05790_model1.pdb 4.1 2.8 0.81 0.09 A 3sd7
Rv2447c 0.07 easy I6Y0R5 2016-01-08 I6Y0R5_model1.pdb 7.1 4.2 0.72 0.11 A 2vos
In [16]:
homology_model_dict = {}
my_gempro.get_manual_homology_models(homology_model_dict)
[2017-03-29 19:25] [ssbio.pipeline.gempro] INFO: Updated homology model information for 0 genes.

Downloading and ranking structures

Methods

Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [17]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
616/|/ 93%|| 616/661 [31:01<02:16,  3.02s/it]
[2017-03-29 19:57] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2017-03-29 19:57] [ssbio.pipeline.gempro] INFO: Saved 937 structures total

Out[17]:
chemicals date description experimental_method mapped_chains pdb_id pdb_title resolution structure_file taxonomy_name
gene
Rv1295 PLP 2006-09-05;2009-02-24;2009-04-28;2011-07-13 Threonine synthase (E.C.4.2.3.1) X-ray diffraction A;B 2d1f Structure of Mycobacterium tuberculosis threon... 2.50 2d1f.cif Mycobacterium tuberculosis
Rv1201c SCA;MPD;MG;NA;ACY 2009-06-23;2011-07-13 Tetrahydrodipicolinate N-succinyltransferase (... X-ray diffraction A;B;C;D;E 3fsy Structure of tetrahydrodipicolinate N-succinyl... 1.97 3fsy.cif Mycobacterium tuberculosis
In [18]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2017-03-29 19:58] [ssbio.core.protein] WARNING: Rv0432: no structures meet quality checks
[2017-03-29 19:58] [ssbio.core.protein] WARNING: Rv1286: no structures meet quality checks
[2017-03-29 19:59] [ssbio.core.protein] WARNING: Rv2934: no structures meet quality checks
[2017-03-29 19:59] [ssbio.core.protein] WARNING: Rv2932: no structures meet quality checks
[2017-03-29 19:59] [ssbio.core.protein] WARNING: Rv2933: no structures meet quality checks
[2017-03-29 19:59] [ssbio.core.protein] WARNING: Rv2931: no structures meet quality checks
[2017-03-29 19:59] [ssbio.core.protein] WARNING: Rv2945c: no structures meet quality checks
[2017-03-29 19:59] [ssbio.core.protein] WARNING: Rv2941: no structures meet quality checks
[2017-03-29 19:59] [ssbio.core.protein] WARNING: Rv2495c: no structures meet quality checks
[2017-03-29 20:00] [ssbio.core.protein] WARNING: Rv2380c: no structures meet quality checks
[2017-03-29 20:01] [ssbio.core.protein] WARNING: Rv2987c: no structures meet quality checks
[2017-03-29 20:02] [ssbio.core.protein] WARNING: Rv3859c: no structures meet quality checks
[2017-03-29 20:02] [ssbio.core.protein] WARNING: Rv2476c: no structures meet quality checks
[2017-03-29 20:03] [ssbio.core.protein] WARNING: Rv1653: no structures meet quality checks
[2017-03-29 20:04] [ssbio.core.protein] WARNING: Rv3800c: no structures meet quality checks
[2017-03-29 20:05] [ssbio.core.protein] WARNING: Rv2498c: no structures meet quality checks
[2017-03-29 20:05] [ssbio.core.protein] WARNING: Rv1885c: no structures meet quality checks
[2017-03-29 20:05] [ssbio.core.protein] WARNING: Rv3601c: no structures meet quality checks
[2017-03-29 20:05] [ssbio.core.protein] WARNING: Rv3330: no structures meet quality checks
[2017-03-29 20:06] [ssbio.core.protein] WARNING: Rv3793: no structures meet quality checks
[2017-03-29 20:06] [ssbio.core.protein] WARNING: Rv2940c: no structures meet quality checks
[2017-03-29 20:06] [ssbio.core.protein] WARNING: Rv1662: no structures meet quality checks
[2017-03-29 20:06] [ssbio.core.protein] WARNING: Rv2524c: no structures meet quality checks
[2017-03-29 20:06] [ssbio.core.protein] WARNING: Rv1625c: no structures meet quality checks
[2017-03-29 20:07] [ssbio.pipeline.gempro] INFO: 590/661: number of genes with a representative structure
[2017-03-29 20:07] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.

Out[18]:
id is_experimental reference_seq_top_coverage structure_file
gene
Rv0417 P9WG73-X False 100.0 P9WG73_model1-X_clean.pdb
Rv2291 P9WHF5-X False 100.0 P9WHF5_model1-X_clean.pdb
Rv1295 2d1f-A True 96.9 2d1f-A_clean.pdb
Rv1559 P9WG95-X False 100.0 P9WG95_model1-X_clean.pdb
Rv3113 O05790-X False 100.0 O05790_model1-X_clean.pdb
In [19]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure.get_dict()
Out[19]:
<StructProp 2d1f-A at 0x7fb5340cbb00>
Out[19]:
{'_structure_dir': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/genes/Rv1295/Rv1295_protein/structures',
 'chains': [<ChainProp A at 0x7fb52c11cfd0>],
 'date': ['2006-09-05', '2009-02-24', '2009-04-28', '2011-07-13'],
 'description': 'Threonine synthase (E.C.4.2.3.1)',
 'file_type': 'pdb',
 'id': '2d1f-A',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'original_pdb_id': '2d1f',
 'reference_seq_top_coverage': 96.9,
 'representative_chain': <ChainProp A at 0x7fb52c11cc88>,
 'resolution': 2.5,
 'structure_file': '2d1f-A_clean.pdb',
 'taxonomy_name': 'Mycobacterium tuberculosis'}

Creating homology models

For those proteins with no representative structure, we can create homology models for them. ssbio contains some built in functions for easily running I-TASSER locally or on machines with SLURM (ie. on NERSC) or Torque job scheduling.

You can load in I-TASSER models once they complete using the get_itasser_models later.

Info: Homology modeling can take a long time - about 24-72 hours per protein (highly dependent on the sequence length, as well as if there are available templates).

Methods

In [20]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_models('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2934: I-TASSER modeling will not run as sequence length (1827) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2932: I-TASSER modeling will not run as sequence length (1538) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2933: I-TASSER modeling will not run as sequence length (2188) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2931: I-TASSER modeling will not run as sequence length (1876) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2380c: I-TASSER modeling will not run as sequence length (1682) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3859c: I-TASSER modeling will not run as sequence length (1527) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2476c: I-TASSER modeling will not run as sequence length (1624) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3800c: I-TASSER modeling will not run as sequence length (1733) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv0107c: I-TASSER modeling will not run as sequence length (1632) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2940c: I-TASSER modeling will not run as sequence length (2111) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv1662: I-TASSER modeling will not run as sequence length (1602) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2524c: I-TASSER modeling will not run as sequence length (3069) is not in the range [10, 1500]
[2017-03-29 20:07] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 71 genes in folder /home/nathan/projects_unsynced/mtuberculosis_gp_atlas/data/homology_models

Saving your GEM-PRO

Finally, you can save your GEM-PRO as a JSON or pickle file, so you don’t have to run the pipeline again.

For most functions, if you rerun them, they will check for existing results saved as files. The only function that would take a long time is setting the representative structure, as they are each rechecked and cleaned. This is where saving helps!

Warning: Saving in JSON format is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.

In [21]:
import os.path as op
my_gempro.save_pickle(op.join(my_gempro.model_dir, '{}.pckl'.format(my_gempro.id)))
Out[21]:
'/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.pckl'
In [22]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2017-03-29 20:07] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2017-03-29 20:08] [ssbio.core.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: mtuberculosis_gp_atlas) to /home/nathan/projects_unsynced/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.json

Loading a saved GEM-PRO

In [22]:
# Loading a pickle file
import pickle
with open(op.join(my_gempro.model_dir, '{}.pckl'.format(my_gempro.id)), 'rb') as f:
    my_saved_gempro = pickle.load(f)
In [23]:
# Loading a JSON file
import ssbio.core.io
my_saved_gempro = ssbio.core.io.load_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), decompression=False)