I-TASSER¶

Description¶

Home page: I-TASSER
Download link: I-TASSER Suite

I-TASSER (Iterative Threading ASSEmbly Refinement) is a program for protein homology modeling and functional prediction from a protein sequence. The I-TASSER suite also provides tools for ligand-binding site prediction, model refinement, secondary structure prediction, B-factor estimation, and more. ssbio mainly provides tools to run I-TASSER homology modeling and parse its results, as well as COACH consensus binding site predictions (optionally with EC number and GO term predictions). ssbio also provides scripts to automate large-scale homology modeling with the TORQUE or Slurm job schedulers in a cluster computing environment.

Installation instructions¶

Note

These instructions were created on an Ubuntu 17.04 system.

Note

Read the README on the I-TASSER Suite page for the most up-to-date instructions.

1. Make sure Java is installed and can be run from the command line with java
2. Go to the I-TASSER download page and register for a license (academic only); a password will be emailed to you
3. Log in to the I-TASSER download page and download the archive
4. Unpack the software archive into a convenient directory; a library should also be downloaded to this directory
5. Run download_lib.pl to download the library files (this will take some time):
   /path/to/<I-TASSER_directory>/download_lib.pl -libdir ITLIB
6. I-TASSER can now be run according to section 4 of the README
To enable GO term predictions…
1. under construction…
Tip: to keep the template libraries up to date, add a new entry to your crontab (run crontab -e to edit it). If you instead add the entry to the system-wide /etc/crontab, insert your username between the schedule and the command:
0 4 * * 1,5 /path/to/I-TASSER4.4/download_lib.pl -libdir /path/to/ITLIB
That will run the library update at 4 am every Monday and Friday.

Program execution¶

In the shell¶

To run the program on its own in the shell…

<code>

With ssbio¶

To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()

FAQs¶

What is a homology model?
- A predicted 3D structure model of a protein sequence. Models can be template-based, when they are based on an existing experimental structure; or ab initio, generated without a template. Generally, ab initio models are much less reliable.
Can I just run I-TASSER using their web server and parse those results with ssbio?
- Not yet, but you can manually input the model1.pdb file as a new structure for now.
How do I cite I-TASSER?
- Roy A, Kucukural A & Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5: 725–738 Available at: http://dx.doi.org/10.1038/nprot.2010.5
How do I run I-TASSER with TORQUE or Slurm job schedulers?
- under construction…
I’m having issues running I-TASSER…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!

API¶

class ssbio.protein.structure.homology.itasser.itasserprep.ITASSERPrep(ident, seq_str, root_dir, itasser_path, itlib_path, execute_dir=None, light=True, runtype='local', print_exec=False, java_home=None, binding_site_pred=False, ec_pred=False, go_pred=False, additional_options=None, job_scheduler_header=None)[source]¶

Prepare a protein sequence for an I-TASSER homology modeling run.

The main utilities of this class are to:

Allow for the input of a protein sequence string and paths to I-TASSER to create execution scripts
Automate large-scale homology modeling efforts by creating Slurm or TORQUE job scheduling scripts

Parameters:

ident – Identifier for your sequence. Will be used as the global ID (folder name, sequence name)
seq_str – Sequence in string format
root_dir – Local directory where I-TASSER folder will be created
itasser_path – Path to I-TASSER folder, e.g. ‘~/software/I-TASSER4.4’
itlib_path – Path to ITLIB folder, e.g. ‘~/software/ITLIB’
execute_dir – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
light – If simulations should be limited to 5 runs
runtype – How you will be running I-TASSER - local, slurm, or torque
print_exec – If the execution script should be printed out
java_home – Path to Java executable
binding_site_pred – If binding site predictions should be run
ec_pred – If EC number predictions should be run
go_pred – If GO term predictions should be run
additional_options – Any other additional I-TASSER options, appended to the command
job_scheduler_header – Any job scheduling options, prepended as a header to the file

prep_folder(seq)[source]¶: Takes a sequence string and prepares the folder for the I-TASSER run.
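As a rough illustration of what preparing a run folder and an execution script can involve, here is a minimal standard-library sketch. It is not ssbio's implementation: the helper names `write_seq_fasta` and `build_run_script`, the `seq.fasta` file name, the example `#SBATCH` header, and the placeholder command are all assumptions made for this example.

```python
import os
import tempfile


def write_seq_fasta(ident, seq_str, root_dir):
    """Create <root_dir>/<ident>/ and write the sequence there in FASTA format."""
    run_dir = os.path.join(root_dir, ident)
    os.makedirs(run_dir, exist_ok=True)
    fasta_path = os.path.join(run_dir, "seq.fasta")  # assumed file name
    with open(fasta_path, "w") as handle:
        handle.write(">{}\n{}\n".format(ident, seq_str))
    return fasta_path


def build_run_script(command, job_scheduler_header=None, additional_options=None):
    """Return shell script text: optional scheduler header first, then the command."""
    lines = ["#!/bin/bash"]
    if job_scheduler_header:
        # Scheduler directives are prepended as a header
        lines.append(job_scheduler_header.strip())
    if additional_options:
        # Extra options are appended to the end of the command
        command = "{} {}".format(command, additional_options.strip())
    lines.append(command)
    return "\n".join(lines) + "\n"


# Hypothetical usage with a placeholder command and a throwaway directory
demo_root = tempfile.mkdtemp()
demo_fasta = write_seq_fasta("P12345", "MKTAYIAKQR", demo_root)
demo_script = build_run_script(
    "echo 'I-TASSER command goes here'",
    job_scheduler_header="#SBATCH --time=24:00:00",
)
```

Keeping the header and the extra options as plain strings mirrors the `job_scheduler_header` and `additional_options` parameters described above, so the same script builder can serve local, Slurm, and TORQUE runs.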

class ssbio.protein.structure.homology.itasser.itasserprop.ITASSERProp(ident, original_results_path, coach_results_folder='model1/coach', model_to_use='model1')[source]¶

Parse all available information for a local I-TASSER modeling run.

Initializes a class to collect I-TASSER modeling information and optionally copy results to a new directory. See https://zhanglab.ccmb.med.umich.edu/papers/2015_1.pdf for detailed information.

Parameters:

ident (str) – ID of I-TASSER modeling run
original_results_path (str) – Path to I-TASSER modeling folder
coach_results_folder (str) – Path to original COACH results
model_to_use (str) – Which I-TASSER model to use. Default is “model1”

copy_results(copy_to_dir, rename_model_to=None, force_rerun=False)[source]¶

Copy the raw information from I-TASSER modeling to a new folder.

Copies all files in the list _attrs_to_copy.

Parameters:

copy_to_dir (str) – Directory to copy the minimal set of results per sequence.
rename_model_to (str) – New file name (without extension)
force_rerun (bool) – If existing models and results should be overwritten.

get_dict(only_attributes=None, exclude_attributes=None, df_format=False)[source]¶

Summarize the I-TASSER run in a dictionary containing modeling results and top predictions from COACH

Parameters:

only_attributes (str, list) – Attributes that should be returned. If not provided, all are returned.
exclude_attributes (str, list) – Attributes that should be excluded.
df_format (bool) – If dictionary values should be formatted for a dataframe (everything possible is transformed into strings, int, or float - if something can’t be transformed it is excluded)

Returns:	Dictionary of attributes
Return type:	dict
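The filtering behavior described for `only_attributes`, `exclude_attributes`, and `df_format` can be sketched as follows. This is an illustration under assumptions, not the ssbio code: `summarize_attributes` and the sample attribute names are hypothetical, and for brevity the sketch drops non-scalar values instead of attempting to transform them.

```python
def summarize_attributes(attrs, only_attributes=None, exclude_attributes=None,
                         df_format=False):
    """Filter a dict of attributes, optionally keeping only dataframe-friendly values."""

    def as_list(value):
        # Accept either a single attribute name or a list of names
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    only = as_list(only_attributes)
    exclude = as_list(exclude_attributes)

    summary = {}
    for key, value in attrs.items():
        if only and key not in only:
            continue
        if key in exclude:
            continue
        if df_format and not isinstance(value, (str, int, float)):
            # Simplification: anything that is not a simple scalar is excluded
            continue
        summary[key] = value
    return summary


# Made-up attributes for demonstration only
demo_attrs = {"c_score": -0.5, "difficulty": "easy", "top_bsite": {"c_score": 0.4}}
demo_summary = summarize_attributes(
    demo_attrs, exclude_attributes="difficulty", df_format=True
)
```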

load_structure_path(structure_path, file_type='pdb')[source]¶

Load a structure file and provide pointers to its location

Parameters:

structure_path (str) – Path to structure file
file_type (str) – Type of structure file

ssbio.protein.structure.homology.itasser.itasserprop.parse_bfp_dat(infile)[source]¶

Parse the B-factor predictions in BFP.dat

Parameters:	infile (str) – Path to BFP.dat
Returns:	List of B-factor predictions for all residues
Return type:	list

ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_bsites_inf(infile)[source]¶

Parse the Bsites.inf output file of COACH and return a list of rank-ordered binding site predictions

Bsites.inf contains the summary of COACH clustering results after all other prediction algorithms have finished. For each site (cluster), there are three lines:

Line 1: site number, c-score of coach prediction, cluster size

Line 2: algorithm, PDB ID, ligand ID, center of binding site (cartesian coordinates), c-score of the algorithm’s prediction, binding residues from single template

Line 3: Statistics of ligands in the cluster

C-score information:

“In our training data, a prediction with C-score>0.35 has average false positive and false negative rates below 0.16 and 0.13, respectively.” (https://zhanglab.ccmb.med.umich.edu/COACH/COACH.pdf)

Parameters:	infile (str) – Path to Bsites.inf
Returns:	Ranked list of dictionaries, keys defined below

`site_num`: cluster which is the consensus binding site
`c_score`: confidence score of the cluster prediction
`cluster_size`: number of predictions within this cluster
`algorithm`: main? algorithm used to make the prediction
`pdb_template_id`: PDB ID of the template used to make the prediction
`pdb_template_chain`: chain of the PDB which has the ligand
`pdb_ligand`: predicted ligand to bind
`binding_location_coords`: centroid of the predicted ligand position in the homology model
`c_score_method`: confidence score for the main algorithm
`binding_residues`: predicted residues to bind the ligand
`ligand_cluster_counts`: number of predictions per ligand

Return type:	list
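The three-lines-per-site layout and the C-score cutoff quoted above can be illustrated with a small sketch. It assumes that the fields on the first line of each site are whitespace-separated, and it deliberately leaves lines 2 and 3 unparsed because their exact layout is not specified here. The function name `group_bsites_lines`, the output keys `template_line` and `ligand_stats_line`, and the sample text are hypothetical.

```python
def group_bsites_lines(text, min_c_score=None):
    """Group non-empty lines in threes and parse the first line of each site."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    sites = []
    # Step through complete groups of three lines; a trailing partial group is ignored
    for i in range(0, len(lines) - 2, 3):
        site_num, c_score, cluster_size = lines[i].split()[:3]
        site = {
            "site_num": int(site_num),
            "c_score": float(c_score),
            "cluster_size": int(cluster_size),
            "template_line": lines[i + 1],      # left unparsed in this sketch
            "ligand_stats_line": lines[i + 2],  # left unparsed in this sketch
        }
        if min_c_score is None or site["c_score"] > min_c_score:
            sites.append(site)
    return sites


# Made-up input: placeholders stand in for the real line 2 and line 3 content
demo_text = (
    "1 0.62 45\n"
    "<line 2: template details>\n"
    "<line 3: ligand statistics>\n"
    "2 0.10 3\n"
    "<line 2: template details>\n"
    "<line 3: ligand statistics>\n"
)
all_sites = group_bsites_lines(demo_text)
confident_sites = group_bsites_lines(demo_text, min_c_score=0.35)
```

Passing `min_c_score=0.35` keeps only the sites above the cutoff mentioned in the COACH quote.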

ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_ec(infile)[source]¶

Parse the EC.dat output file of COACH and return a list of rank-ordered EC number predictions

EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues

Parameters:	infile (str) – Path to EC.dat
Returns:	Ranked list of dictionaries, keys defined below

`pdb_template_id`: PDB ID of the template used to make the prediction
`pdb_template_chain`: chain of the PDB which has the ligand
`tm_score`: TM-score of the template to the model (similarity score)
`rmsd`: RMSD of the template to the model (also a measure of similarity)
`seq_ident`: percent sequence identity
`seq_coverage`: percent sequence coverage
`c_score`: confidence score of the EC prediction
`ec_number`: predicted EC number
`binding_residues`: predicted residues to bind the ligand

Return type:	list
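To make the column-to-key mapping concrete, here is a minimal sketch that turns one EC.dat row into a dictionary with the keys listed above. It assumes whitespace-separated columns, a first column made of a four-character PDB ID followed by the chain, and residues given as a single token. `parse_ec_line` and the sample row are hypothetical, and the values in the row are made up.

```python
# Numeric columns that follow the PDB_ID column, in file order
EC_NUMERIC_KEYS = ("tm_score", "rmsd", "seq_ident", "seq_coverage", "c_score")


def parse_ec_line(line):
    """Split one whitespace-delimited row into a dictionary of named fields."""
    fields = line.split()
    pdb_and_chain = fields[0]
    record = {
        "pdb_template_id": pdb_and_chain[:4],     # assumed: first four characters
        "pdb_template_chain": pdb_and_chain[4:],  # assumed: remainder is the chain
    }
    for key, raw in zip(EC_NUMERIC_KEYS, fields[1:6]):
        record[key] = float(raw)
    record["ec_number"] = fields[6]
    # The residue column may be absent, so fall back to an empty string
    record["binding_residues"] = fields[7] if len(fields) > 7 else ""
    return record


# Made-up row for demonstration only
demo_ec_line = "1abcA 0.900 1.20 0.450 0.980 0.500 1.1.1.1 10,42"
demo_ec = parse_ec_line(demo_ec_line)
```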

ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_ec_df(infile)[source]¶

Parse the EC.dat output file of COACH and return a dataframe of results

EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues

Parameters:	infile (str) – Path to EC.dat
Returns:	Pandas DataFrame summarizing EC number predictions
Return type:	DataFrame

ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_go(infile)[source]¶

Parse a GO output file from COACH and return a rank-ordered list of GO term predictions

The columns in all files are: GO terms, Confidence score, Name of GO terms. The files are:

GO_MF.dat - GO terms in ‘molecular function’

GO_BP.dat - GO terms in ‘biological process’

GO_CC.dat - GO terms in ‘cellular component’

Parameters:	infile (str) – Path to any COACH GO prediction file
Returns:	Organized dataframe of results, columns defined below

`go_id`: GO term ID
`go_term`: GO term text
`c_score`: confidence score of the GO prediction

Return type:	Pandas DataFrame
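The three-column layout shared by the GO files can be read with a few lines of standard-library Python. The sketch below assumes whitespace-separated columns, with everything after the second column belonging to the GO term name. It returns plain dictionaries rather than a DataFrame, and `parse_go_line`, `parse_go_text`, and the confidence scores in the sample text are hypothetical.

```python
def parse_go_line(line):
    """Parse one row: GO term ID, confidence score, then the GO term name."""
    fields = line.split()
    return {
        "go_id": fields[0],
        "c_score": float(fields[1]),
        "go_term": " ".join(fields[2:]),  # the name may contain spaces
    }


def parse_go_text(text):
    """Parse every non-empty row, keeping the order found in the file."""
    return [parse_go_line(ln) for ln in text.splitlines() if ln.strip()]


# Sample rows with made-up confidence scores
demo_go_text = (
    "GO:0005524 0.85 ATP binding\n"
    "GO:0003824 0.40 catalytic activity\n"
)
demo_go = parse_go_text(demo_go_text)
```

The resulting list of dictionaries maps directly onto the `go_id`, `go_term`, and `c_score` columns described above.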

ssbio.protein.structure.homology.itasser.itasserprop.parse_cscore(infile)[source]¶

Parse the cscore file to return a dictionary of scores.

Parameters:	infile (str) – Path to cscore
Returns:	Dictionary of scores
Return type:	dict

ssbio.protein.structure.homology.itasser.itasserprop.parse_exp_dat(infile)[source]¶

Parse the solvent accessibility predictions in exp.dat

Parameters:	infile (str) – Path to exp.dat
Returns:	List of solvent accessibility predictions for all residues
Return type:	list

ssbio.protein.structure.homology.itasser.itasserprop.parse_init_dat(infile)[source]¶

Parse the main init.dat file which contains the modeling results

The first line of init.dat looks like this:

"120 easy  40   8"

The template lines look like this:

"     161   11.051   1  1guqA MUSTER"

Reading the first 10 of these gives the top 10 templates used in modeling.

Parameters:	infile (str) – Path to init.dat
Returns:	Dictionary of parsed information
Return type:	dict
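Using the two example lines shown above, a minimal sketch of pulling out the template information might look like this. The interpretation of the fields is an assumption: the second token of the first line is taken as the target difficulty, and the last two tokens of a template line are taken as the template (PDB ID plus chain) and the threading program. `parse_template_line` and its output keys are hypothetical.

```python
def parse_template_line(line):
    """Extract the template ID, chain, and program from one template line."""
    fields = line.split()
    pdb_and_chain, program = fields[3], fields[4]
    return {
        "pdb": pdb_and_chain[:4],    # assumed: first four characters are the PDB ID
        "chain": pdb_and_chain[4:],  # assumed: remainder is the chain ID
        "program": program,
    }


# The two example lines from the description above
demo_first_line = "120 easy  40   8"
demo_template_line = "     161   11.051   1  1guqA MUSTER"

demo_difficulty = demo_first_line.split()[1]  # assumed to be the difficulty label
demo_template = parse_template_line(demo_template_line)
```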

ssbio.protein.structure.homology.itasser.itasserprop.parse_seq_dat(infile)[source]¶

Parse the secondary structure predictions in seq.dat

Parameters:	infile (str) – Path to seq.dat
Returns:	List of secondary structure predictions for all residues
Return type:	list