I-TASSER

Homology modeling

Description

I-TASSER (Iterative Threading ASSEmbly Refinement) is a program for protein homology modeling and functional prediction from a protein sequence. The I-TASSER suite also provides tools for ligand-binding site prediction, model refinement, secondary structure prediction, B-factor estimation, and more. ssbio mainly provides tools to run I-TASSER and parse its homology modeling results, as well as COACH consensus binding site predictions (optionally with EC number and GO term predictions). Scripts are also provided to automate large-scale homology modeling with the TORQUE or Slurm job schedulers in a cluster computing environment.

Installation instructions

Note

These instructions were created on an Ubuntu 17.04 system.

Note

Read the README on the I-TASSER Suite page for the most up-to-date instructions.

  1. Make sure Java is installed and can be run from the command line with java

  2. Head to the I-TASSER download page and register for a license (academic only) to get a password emailed to you

  3. Log in to the I-TASSER download page and download the archive

  4. Unpack the software archive into a convenient directory - a library should also be downloaded to this directory

  5. Run download_lib.pl to download the library files (this will take some time):

    /path/to/<I-TASSER_directory>/download_lib.pl -libdir ITLIB
    
  6. Now, I-TASSER can be run according to the README under section 4

  7. To enable GO term predictions…

    1. under construction…
  8. Tip: to update the template libraries automatically, add a new entry to your crontab (first run crontab -e). A per-user crontab takes no user field; put a username before the command only if you edit the system-wide /etc/crontab instead:

    0 4 * * 1,5 /path/to/I-TASSER4.4/download_lib.pl -libdir /path/to/ITLIB
    

    That will run the library update at 4 am every Monday and Friday.

Program execution

In the shell

To run the program on its own in the shell…

<code>

With ssbio

To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()

FAQs

  • What is a homology model?

    • A predicted 3D structure model of a protein sequence. Models can be template-based, when they are based on an existing experimental structure; or ab initio, generated without a template. Generally, ab initio models are much less reliable.
  • Can I just run I-TASSER using their web server and parse those results with ssbio?

    • Not yet, but you can manually input the model1.pdb file as a new structure for now.
  • How do I cite I-TASSER?

    • Roy A, Kucukural A & Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5: 725–738. Available at: http://dx.doi.org/10.1038/nprot.2010.5
  • How do I run I-TASSER with TORQUE or Slurm job schedulers?

    • under construction…
  • I’m having issues running I-TASSER…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!

API

class ssbio.protein.structure.homology.itasser.itasserprep.ITASSERPrep(ident, seq_str, root_dir, itasser_path, itlib_path, execute_dir=None, light=True, runtype='local', print_exec=False, java_home=None, binding_site_pred=False, ec_pred=False, go_pred=False, additional_options=None, job_scheduler_header=None)[source]

Prepare a protein sequence for an I-TASSER homology modeling run.

The main utilities of this class are to:

  • Allow for the input of a protein sequence string and paths to I-TASSER to create execution scripts
  • Automate large-scale homology modeling efforts by creating Slurm or TORQUE job scheduling scripts
Parameters:
  • ident – Identifier for your sequence. Will be used as the global ID (folder name, sequence name)
  • seq_str – Sequence in string format
  • root_dir – Local directory where I-TASSER folder will be created
  • itasser_path – Path to I-TASSER folder, e.g. ‘~/software/I-TASSER4.4’
  • itlib_path – Path to ITLIB folder, e.g. ‘~/software/ITLIB’
  • execute_dir – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
  • light – If simulations should be limited to 5 runs
  • runtype – How you will be running I-TASSER - local, slurm, or torque
  • print_exec – If the execution script should be printed out
  • java_home – Path to Java executable
  • binding_site_pred – If binding site predictions should be run
  • ec_pred – If EC number predictions should be run
  • go_pred – If GO term predictions should be run
  • additional_options – Any other additional I-TASSER options, appended to the command
  • job_scheduler_header – Any job scheduling options, prepended as a header to the file
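As a rough illustration of how job_scheduler_header and additional_options could fit together, the sketch below prepends header lines to a command and appends extra options to it. The function name build_script and the placeholder command are hypothetical and are not part of ssbio; the #SBATCH lines are ordinary Slurm directives used only as an example header.

```python
def build_script(command, job_scheduler_header=None, additional_options=None):
    """Assemble a shell script: shebang, optional scheduler header, then the command."""
    lines = ["#!/bin/bash"]
    if job_scheduler_header:
        # Header lines (e.g. #SBATCH or #PBS directives) go before the command
        lines.append(job_scheduler_header.strip())
    if additional_options:
        # Extra options are appended to the end of the command
        command = "{} {}".format(command, additional_options.strip())
    lines.append(command)
    return "\n".join(lines) + "\n"


header = "#SBATCH --time=24:00:00\n#SBATCH --nodes=1"
# Placeholder command (hypothetical); the real one is generated by ITASSERPrep
script = build_script("echo 'run I-TASSER here'", job_scheduler_header=header)
print(script)
```

Keeping the header as free text means the same function works for TORQUE (#PBS) and Slurm (#SBATCH) without knowing either scheduler's options.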
prep_folder(seq)[source]

Takes in a sequence string and prepares the folder for the I-TASSER run.
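A minimal sketch of what preparing such a folder might involve, assuming the run directory is named after ident and the sequence is written to a FASTA file called seq.fasta. Both are assumptions for illustration, as are the function name prep_folder_sketch, the identifier, and the sequence; this is not ssbio's implementation.

```python
import os
import tempfile


def prep_folder_sketch(root_dir, ident, seq_str):
    """Create <root_dir>/<ident>/ and write the sequence as seq.fasta inside it."""
    run_dir = os.path.join(root_dir, ident)
    os.makedirs(run_dir, exist_ok=True)
    fasta_path = os.path.join(run_dir, "seq.fasta")  # assumed file name
    with open(fasta_path, "w") as f:
        f.write(">{}\n".format(ident))
        # Wrap the sequence at 60 characters per line
        for i in range(0, len(seq_str), 60):
            f.write(seq_str[i:i + 60] + "\n")
    return fasta_path


root = tempfile.mkdtemp()
# Hypothetical identifier and arbitrary sequence, for illustration only
path = prep_folder_sketch(root, "P12345", "MKTAYIAKQRQISFVKSHFSRQ")
print(path)
```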

class ssbio.protein.structure.homology.itasser.itasserprop.ITASSERProp(ident, original_results_path, coach_results_folder='model1/coach', model_to_use='model1')[source]

Parse all available information for a local I-TASSER modeling run.

Initializes a class to collect I-TASSER modeling information and optionally copy results to a new directory. See https://zhanglab.ccmb.med.umich.edu/papers/2015_1.pdf for detailed information.

Parameters:
  • ident (str) – ID of I-TASSER modeling run
  • original_results_path (str) – Path to I-TASSER modeling folder
  • coach_results_folder (str) – Path to original COACH results
  • model_to_use (str) – Which I-TASSER model to use. Default is “model1”
copy_results(copy_to_dir, rename_model_to=None, force_rerun=False)[source]

Copy the raw information from I-TASSER modeling to a new folder.

Copies all files in the list _attrs_to_copy.

Parameters:
  • copy_to_dir (str) – Directory to copy the minimal set of results per sequence.
  • rename_model_to (str) – New file name (without extension)
  • force_rerun (bool) – If existing models and results should be overwritten.
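The force_rerun behavior can be sketched with the standard library alone. The helper copy_files below is hypothetical and is not how ssbio implements copy_results; in particular, silently skipping missing source files is an assumption made for this example.

```python
import os
import shutil
import tempfile


def copy_files(paths, copy_to_dir, force_rerun=False):
    """Copy each existing file into copy_to_dir and return the paths written.

    Files already present in copy_to_dir are left alone unless force_rerun is True.
    """
    os.makedirs(copy_to_dir, exist_ok=True)
    written = []
    for src in paths:
        if not os.path.isfile(src):
            continue  # missing outputs are skipped (assumption)
        dest = os.path.join(copy_to_dir, os.path.basename(src))
        if os.path.exists(dest) and not force_rerun:
            continue  # keep the existing copy
        shutil.copy(src, dest)
        written.append(dest)
    return written


src_dir, dest_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
model = os.path.join(src_dir, "model1.pdb")
with open(model, "w") as f:
    f.write("REMARK placeholder\n")

first = copy_files([model], dest_dir)                     # copies the file
second = copy_files([model], dest_dir)                    # nothing to do
forced = copy_files([model], dest_dir, force_rerun=True)  # overwrites
print(len(first), len(second), len(forced))
```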
get_dict(only_attributes=None, exclude_attributes=None, df_format=False)[source]

Summarize the I-TASSER run in a dictionary containing modeling results and top predictions from COACH

Parameters:
  • only_attributes (str, list) – Attributes that should be returned. If not provided, all are returned.
  • exclude_attributes (str, list) – Attributes that should be excluded.
  • df_format (bool) – If dictionary values should be formatted for a dataframe (everything possible is transformed into strings, int, or float - if something can’t be transformed it is excluded)
Returns:

Dictionary of attributes

Return type:

dict
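To make the three parameters concrete, here is a small sketch of the filtering they describe, applied to a plain dictionary. The function filter_attributes and the example values are hypothetical. As a simplification, df_format here only drops values that are not already str, int, or float, rather than trying to convert them first.

```python
def filter_attributes(attrs, only_attributes=None, exclude_attributes=None,
                      df_format=False):
    """Return a filtered copy of attrs following the get_dict-style options."""
    # Accept a single name or a list of names for both filters
    if isinstance(only_attributes, str):
        only_attributes = [only_attributes]
    if isinstance(exclude_attributes, str):
        exclude_attributes = [exclude_attributes]

    result = {}
    for key, value in attrs.items():
        if only_attributes is not None and key not in only_attributes:
            continue
        if exclude_attributes is not None and key in exclude_attributes:
            continue
        if df_format and not isinstance(value, (str, int, float)):
            continue  # drop values that do not fit in a single dataframe cell
        result[key] = value
    return result


# Example values are made up for illustration
run = {"id": "P12345", "difficulty": "easy",
       "top_template_pdb": "1guq", "templates": ["1guqA"]}
only = filter_attributes(run, only_attributes="difficulty")
flat = filter_attributes(run, exclude_attributes=["id"], df_format=True)
print(only)
print(flat)
```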

load_structure_path(structure_path, file_type='pdb')[source]

Load a structure file and provide pointers to its location

Parameters:
  • structure_path (str) – Path to structure file
  • file_type (str) – Type of structure file
ssbio.protein.structure.homology.itasser.itasserprop.parse_bfp_dat(infile)[source]

Parse the B-factor predictions in BFP.dat

Parameters:infile (str) – Path to BFP.dat
Returns:List of B-factor predictions for all residues
Return type:list
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_bsites_inf(infile)[source]

Parse the Bsites.inf output file of COACH and return a list of rank-ordered binding site predictions

Bsites.inf contains the summary of COACH clustering results after all other prediction algorithms have finished. For each site (cluster), there are three lines:

  • Line 1: site number, c-score of coach prediction, cluster size
  • Line 2: algorithm, PDB ID, ligand ID, center of binding site (cartesian coordinates), c-score of the algorithm’s prediction, binding residues from single template
  • Line 3: Statistics of ligands in the cluster

C-score information:

Parameters:infile (str) – Path to Bsites.inf
Returns:Ranked list of dictionaries, keys defined below
  • site_num: cluster which is the consensus binding site
  • c_score: confidence score of the cluster prediction
  • cluster_size: number of predictions within this cluster
  • algorithm: main? algorithm used to make the prediction
  • pdb_template_id: PDB ID of the template used to make the prediction
  • pdb_template_chain: chain of the PDB which has the ligand
  • pdb_ligand: predicted ligand to bind
  • binding_location_coords: centroid of the predicted ligand position in the homology model
  • c_score_method: confidence score for the main algorithm
  • binding_residues: predicted residues to bind the ligand
  • ligand_cluster_counts: number of predictions per ligand
Return type:list
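The three-lines-per-site layout can be sketched as follows. This assumes tab-separated fields on line 1 in the order listed above, and it leaves lines 2 and 3 unparsed. The function name group_site_records and the sample text are invented for illustration and are not real COACH output.

```python
def group_site_records(text):
    """Group the non-empty lines of a Bsites.inf-style text into one dict per site."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    sites = []
    # Three consecutive lines describe one site (cluster); ignore a trailing partial record
    for i in range(0, len(lines) - len(lines) % 3, 3):
        site_num, c_score, cluster_size = lines[i].split("\t")[:3]
        sites.append({
            "site_num": int(site_num),
            "c_score": float(c_score),
            "cluster_size": int(cluster_size),
            "template_line": lines[i + 1],      # line 2, left unparsed here
            "ligand_stats_line": lines[i + 2],  # line 3, left unparsed here
        })
    return sites


# Invented sample, for illustration only
sample = (
    "1\t0.85\t12\n"
    "ALGO\t1abcA\tLIG\t(placeholder line 2)\n"
    "(placeholder line 3)\n"
    "2\t0.10\t3\n"
    "ALGO\t2xyzB\tLIG\t(placeholder line 2)\n"
    "(placeholder line 3)\n"
)
sites = group_site_records(sample)
print(sites[0]["site_num"], sites[0]["c_score"], sites[0]["cluster_size"])
```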
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_ec(infile)[source]

Parse the EC.dat output file of COACH and return a list of rank-ordered EC number predictions

EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues

Parameters:infile (str) – Path to EC.dat
Returns:Ranked list of dictionaries, keys defined below
  • pdb_template_id: PDB ID of the template used to make the prediction
  • pdb_template_chain: chain of the PDB which has the ligand
  • tm_score: TM-score of the template to the model (similarity score)
  • rmsd: RMSD of the template to the model (also a measure of similarity)
  • seq_ident: percent sequence identity
  • seq_coverage: percent sequence coverage
  • c_score: confidence score of the EC prediction
  • ec_number: predicted EC number
  • binding_residues: predicted residues to bind the ligand
Return type:list
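A sketch of how one EC.dat row could map onto the keys above, assuming whitespace-separated columns and a PDB_ID column made of a four-character PDB ID followed by the chain. The helper parse_ec_line and the example row are invented for illustration; this is not ssbio's parser.

```python
EC_COLUMNS = ["pdb_id", "tm_score", "rmsd", "seq_ident", "seq_coverage",
              "c_score", "ec_number", "binding_residues"]


def parse_ec_line(line):
    """Turn one whitespace-separated EC.dat-style row into a dictionary."""
    fields = dict(zip(EC_COLUMNS, line.split()))
    pdb_id = fields["pdb_id"]
    return {
        "pdb_template_id": pdb_id[:4],     # first four characters: PDB ID (assumption)
        "pdb_template_chain": pdb_id[4:],  # remainder: chain (assumption)
        "tm_score": float(fields["tm_score"]),
        "rmsd": float(fields["rmsd"]),
        "seq_ident": float(fields["seq_ident"]),
        "seq_coverage": float(fields["seq_coverage"]),
        "c_score": float(fields["c_score"]),
        "ec_number": fields["ec_number"],
        "binding_residues": fields["binding_residues"],
    }


# Invented example row, for illustration only
row = parse_ec_line("1abcA 0.912 1.20 0.350 0.980 0.450 1.1.1.1 45,78,120")
print(row["pdb_template_id"], row["pdb_template_chain"], row["ec_number"])
```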
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_ec_df(infile)[source]

Parse the EC.dat output file of COACH and return a dataframe of results

EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues

Parameters:infile (str) – Path to EC.dat
Returns:Pandas DataFrame summarizing EC number predictions
Return type:DataFrame
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_go(infile)[source]

Parse a GO output file from COACH and return a rank-ordered list of GO term predictions

The columns in all files are: GO terms, Confidence score, Name of GO terms. The files are:

  • GO_MF.dat - GO terms in ‘molecular function’
  • GO_BP.dat - GO terms in ‘biological process’
  • GO_CC.dat - GO terms in ‘cellular component’
Parameters:infile (str) – Path to any COACH GO prediction file
Returns:Organized dataframe of results, columns defined below
  • go_id: GO term ID
  • go_term: GO term text
  • c_score: confidence score of the GO prediction
Return type:Pandas DataFrame
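The three-column layout can be read with the standard library alone, as in the sketch below. It assumes whitespace-separated columns and returns a list of dictionaries rather than a DataFrame. The function name parse_go_text and the confidence scores in the sample are made up for illustration.

```python
def parse_go_text(text):
    """Parse GO prediction rows into dictionaries ranked by confidence score."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two runs of whitespace so multi-word names stay intact
        go_id, c_score, go_term = line.split(None, 2)
        rows.append({"go_id": go_id,
                     "c_score": float(c_score),
                     "go_term": go_term.strip()})
    # Rank by confidence, highest first
    return sorted(rows, key=lambda r: r["c_score"], reverse=True)


# Sample rows with made-up scores, for illustration only
sample = "GO:0005524 0.55 ATP binding\nGO:0016301 0.72 kinase activity\n"
ranked = parse_go_text(sample)
print([r["go_id"] for r in ranked])
```

Limiting the split to two separators matters because GO term names usually contain spaces.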
ssbio.protein.structure.homology.itasser.itasserprop.parse_cscore(infile)[source]

Parse the cscore file to return a dictionary of scores.

Parameters:infile (str) – Path to cscore
Returns:Dictionary of scores
Return type:dict
ssbio.protein.structure.homology.itasser.itasserprop.parse_exp_dat(infile)[source]

Parse the solvent accessibility predictions in exp.dat

Parameters:infile (str) – Path to exp.dat
Returns:List of solvent accessibility predictions for all residues
Return type:list
ssbio.protein.structure.homology.itasser.itasserprop.parse_init_dat(infile)[source]

Parse the main init.dat file which contains the modeling results

The first line of init.dat contains a summary such as:

"120 easy  40   8"

The remaining lines look like this:

"     161   11.051   1  1guqA MUSTER"

Reading the first 10 of these lines gives the top 10 templates used in modeling.

Parameters:infile (str) – Path to init.dat
Returns:Dictionary of parsed information
Return type:dict
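Using the two sample lines quoted above, a minimal parsing sketch might look like this. Reading the second token of the first line as a difficulty label and the fourth token of a template line as a PDB ID plus chain is an assumption based on those samples. The function name parse_init_head and the dictionary keys are hypothetical.

```python
def parse_init_head(first_line, template_line):
    """Extract a few fields from the first line and one template line of init.dat."""
    summary = first_line.split()
    template = template_line.split()
    pdb_chain = template[3]  # e.g. "1guqA"
    return {
        "difficulty": summary[1],             # assumed meaning of the second token
        "top_template_pdb": pdb_chain[:4],    # four-character PDB ID
        "top_template_chain": pdb_chain[4:],  # remaining characters: chain
    }


info = parse_init_head("120 easy  40   8",
                       "     161   11.051   1  1guqA MUSTER")
print(info)
```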
ssbio.protein.structure.homology.itasser.itasserprop.parse_seq_dat(infile)[source]

Parse the secondary structure predictions in seq.dat

Parameters:infile (str) – Path to seq.dat
Returns:List of secondary structure predictions for all residues
Return type:list