I-TASSER
Description
- Home page: I-TASSER
- Download link: I-TASSER Suite
I-TASSER (Iterative Threading ASSEmbly Refinement) is a program for protein homology modeling and functional prediction from a protein sequence. The I-TASSER Suite also provides tools for ligand-binding site prediction, model refinement, secondary structure prediction, B-factor estimation, and more. ssbio mainly provides tools to run I-TASSER homology modeling and parse its results, as well as COACH consensus binding site predictions (optionally with EC number and GO term predictions). Scripts are also provided to automate large-scale homology modeling using the TORQUE or Slurm job schedulers in a cluster computing environment.
Installation instructions
Note
These instructions were created on an Ubuntu 17.04 system.
Note
Read the README on the I-TASSER Suite page for the most up-to-date instructions.
Make sure you have Java installed and that it can be run from the command line with java
Head to the I-TASSER download page and register for a license (academic only) to get a password emailed to you
Log in to the I-TASSER download page and download the archive
Unpack the software archive into a convenient directory - a library should also be downloaded to this directory
Run download_lib.pl to download the library files (this will take some time):
/path/to/<I-TASSER_directory>/download_lib.pl -libdir ITLIB
Now, I-TASSER can be run according to the README under section 4
To enable GO term predictions…
- under construction…
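Tip: to update the template libraries automatically, add a new entry to your crontab (first run crontab -e):
0 4 * * 1,5 /path/to/I-TASSER4.4/download_lib.pl -libdir /path/to/ITLIB
A per-user crontab opened with crontab -e takes no user field. If you add the entry to the system-wide /etc/crontab instead, insert a user field before the command and make sure to replace <USERNAME> with your username:
0 4 * * 1,5 <USERNAME> /path/to/I-TASSER4.4/download_lib.pl -libdir /path/to/ITLIB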
That will run the library update at 4 am every Monday and Friday.
Program execution
With ssbio
To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()
FAQs
What is a homology model?
- A predicted 3D structure model of a protein sequence. Models can be template-based, when they are based on an existing experimental structure; or ab initio, generated without a template. Generally, ab initio models are much less reliable.
Can I just run I-TASSER using their web server and parse those results with ssbio?
- Not yet, but you can manually input the model1.pdb file as a new structure for now.
How do I cite I-TASSER?
- Roy A, Kucukural A & Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5: 725–738 Available at: http://dx.doi.org/10.1038/nprot.2010.5
How do I run I-TASSER with TORQUE or Slurm job schedulers?
- under construction…
I’m having issues running I-TASSER…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
class ssbio.protein.structure.homology.itasser.itasserprep.ITASSERPrep(ident, seq_str, root_dir, itasser_path, itlib_path, execute_dir=None, light=True, runtype='local', print_exec=False, java_home=None, binding_site_pred=False, ec_pred=False, go_pred=False, additional_options=None, job_scheduler_header=None)
Prepare a protein sequence for an I-TASSER homology modeling run.
The main utilities of this class are to:
- Allow for the input of a protein sequence string and paths to I-TASSER to create execution scripts
- Automate large-scale homology modeling efforts by creating Slurm or TORQUE job scheduling scripts
Parameters:
- ident – Identifier for your sequence. Will be used as the global ID (folder name, sequence name)
- seq_str – Sequence in string format
- root_dir – Local directory where I-TASSER folder will be created
- itasser_path – Path to I-TASSER folder, e.g. ‘~/software/I-TASSER4.4’
- itlib_path – Path to ITLIB folder, e.g. ‘~/software/ITLIB’
- execute_dir – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
- light – If simulations should be limited to 5 runs
- runtype – How you will be running I-TASSER - local, slurm, or torque
- print_exec – If the execution script should be printed out
- java_home – Path to Java executable
- binding_site_pred – If binding site predictions should be run
- ec_pred – If EC number predictions should be run
- go_pred – If GO term predictions should be run
- additional_options – Any other additional I-TASSER options, appended to the command
- job_scheduler_header – Any job scheduling options, prepended as a header to the file
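As a rough sketch of how the parameters above fit together for a batch of sequences, the helper below assembles keyword arguments that mirror the documented ITASSERPrep signature. The helper name build_itasserprep_kwargs, the example identifiers, sequences, and the root directory are hypothetical; only the parameter names and the three runtype values are taken from the description above.

```python
# Allowed values for ``runtype``, as listed in the parameter description above.
VALID_RUNTYPES = ("local", "slurm", "torque")


def build_itasserprep_kwargs(ident, seq_str, root_dir, itasser_path, itlib_path,
                             runtype="local", light=True,
                             job_scheduler_header=None):
    """Hypothetical helper: collect and sanity-check arguments for ITASSERPrep."""
    if runtype not in VALID_RUNTYPES:
        raise ValueError("runtype must be one of {}".format(VALID_RUNTYPES))
    if not seq_str:
        raise ValueError("seq_str must be a non-empty sequence string")
    return {
        "ident": ident,
        "seq_str": seq_str,
        "root_dir": root_dir,
        "itasser_path": itasser_path,
        "itlib_path": itlib_path,
        "runtype": runtype,
        "light": light,
        "job_scheduler_header": job_scheduler_header,
    }


# Made-up identifiers and sequences, for illustration only.
sequences = {"protA": "MKTAYIAKQR", "protB": "MSDNELLKQA"}
all_kwargs = [
    build_itasserprep_kwargs(ident, seq, "/tmp/homology_models",
                             "~/software/I-TASSER4.4", "~/software/ITLIB",
                             runtype="slurm")
    for ident, seq in sorted(sequences.items())
]

# With ssbio installed, each entry could then be passed on, for example:
# from ssbio.protein.structure.homology.itasser.itasserprep import ITASSERPrep
# preps = [ITASSERPrep(**kwargs) for kwargs in all_kwargs]
```

Validating runtype before creating any objects means a typo is caught once, up front, rather than after scripts for many sequences have already been written.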
class ssbio.protein.structure.homology.itasser.itasserprop.ITASSERProp(ident, original_results_path, coach_results_folder='model1/coach', model_to_use='model1')
Parse all available information for a local I-TASSER modeling run.
Initializes a class to collect I-TASSER modeling information and optionally copy results to a new directory. SEE: https://zhanglab.ccmb.med.umich.edu/papers/2015_1.pdf for detailed information.
Parameters:
- ident (str) – ID of I-TASSER modeling run
- original_results_path (str) – Path to I-TASSER modeling folder
- coach_results_folder (str) – Path to original COACH results
- model_to_use (str) – Which I-TASSER model to use. Default is “model1”
copy_results(copy_to_dir, rename_model_to=None, force_rerun=False)
Copy the raw information from I-TASSER modeling to a new folder.
Copies all files in the list _attrs_to_copy.
Parameters:
- copy_to_dir (str) – Directory to copy the minimal set of results per sequence.
- rename_model_to (str) – New file name (without extension)
- force_rerun (bool) – If existing models and results should be overwritten.
get_dict(only_attributes=None, exclude_attributes=None, df_format=False)
Summarize the I-TASSER run in a dictionary containing modeling results and top predictions from COACH.
Parameters:
- only_attributes (str, list) – Attributes that should be returned. If not provided, all are returned.
- exclude_attributes (str, list) – Attributes that should be excluded.
- df_format (bool) – If dictionary values should be formatted for a dataframe (everything possible is transformed into strings, int, or float - if something can’t be transformed it is excluded)
Returns: Dictionary of attributes
Return type: dict
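The df_format behaviour described above (keep what can be expressed as a string, int, or float; exclude the rest) can be illustrated with a small standalone sketch. This is not the ssbio implementation: the function name to_dataframe_friendly, the choice to join simple lists with ";", and the sample attribute names are all assumptions made for illustration.

```python
def to_dataframe_friendly(attributes):
    """Hypothetical sketch: keep str/int/float values, flatten simple lists, drop the rest."""
    cleaned = {}
    for key, value in attributes.items():
        if isinstance(value, bool):
            # bool is a subclass of int, so store it as a plain int explicitly
            cleaned[key] = int(value)
        elif isinstance(value, (str, int, float)):
            cleaned[key] = value
        elif isinstance(value, (list, tuple)) and all(
                isinstance(v, (str, int, float)) for v in value):
            # assumption: a flat list is joined into one delimited string
            cleaned[key] = ";".join(str(v) for v in value)
        # anything else (nested dicts, None, arbitrary objects) is excluded
    return cleaned


# Made-up attribute values, for illustration only.
sample = {
    "id": "protA",
    "c_score": 1.2,
    "top_ec": ["1.1.1.1", "2.7.1.1"],
    "raw": {"a": 1},
    "note": None,
}
result = to_dataframe_friendly(sample)
```

Here "raw" and "note" are dropped because neither can be reduced to a single scalar cell, which is what makes the result safe to load as one dataframe row.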
ssbio.protein.structure.homology.itasser.itasserprop.parse_bfp_dat(infile)
Parse the B-factor predictions in BFP.dat
Parameters: infile (str) – Path to BFP.dat
Returns: List of B-factor predictions for all residues
Return type: list
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_bsites_inf(infile)
Parse the Bsites.inf output file of COACH and return a list of rank-ordered binding site predictions
Bsites.inf contains the summary of COACH clustering results after all other prediction algorithms have finished. For each site (cluster), there are three lines:
- Line 1: site number, c-score of coach prediction, cluster size
- Line 2: algorithm, PDB ID, ligand ID, center of binding site (cartesian coordinates), c-score of the algorithm’s prediction, binding residues from single template
- Line 3: Statistics of ligands in the cluster
C-score information:
- “In our training data, a prediction with C-score>0.35 has average false positive and false negative rates below 0.16 and 0.13, respectively.” (https://zhanglab.ccmb.med.umich.edu/COACH/COACH.pdf)
Parameters: infile (str) – Path to Bsites.inf
Returns: Ranked list of dictionaries, keys defined below
- site_num: cluster which is the consensus binding site
- c_score: confidence score of the cluster prediction
- cluster_size: number of predictions within this cluster
- algorithm: main? algorithm used to make the prediction
- pdb_template_id: PDB ID of the template used to make the prediction
- pdb_template_chain: chain of the PDB which has the ligand
- pdb_ligand: predicted ligand to bind
- binding_location_coords: centroid of the predicted ligand position in the homology model
- c_score_method: confidence score for the main algorithm
- binding_residues: predicted residues to bind the ligand
- ligand_cluster_counts: number of predictions per ligand
Return type: list
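Because each returned entry carries a c_score, the C-score > 0.35 guideline quoted above can be applied as a simple filter. The sketch below works on hand-written dictionaries that only mimic the documented keys; the values are made up, and the function name confident_sites is hypothetical rather than part of ssbio.

```python
C_SCORE_CUTOFF = 0.35  # threshold quoted from the COACH documentation above

# Hypothetical entries using a subset of the documented keys; values are made up.
binding_sites = [
    {"site_num": 1, "c_score": 0.62, "cluster_size": 41, "pdb_ligand": "NAD"},
    {"site_num": 2, "c_score": 0.21, "cluster_size": 9, "pdb_ligand": "ZN"},
    {"site_num": 3, "c_score": 0.36, "cluster_size": 5, "pdb_ligand": "ATP"},
]


def confident_sites(sites, cutoff=C_SCORE_CUTOFF):
    """Keep predictions whose cluster C-score is strictly above the cutoff."""
    # float() guards against scores that were parsed from text and left as strings
    return [site for site in sites if float(site["c_score"]) > cutoff]


kept = confident_sites(binding_sites)
```

The list comprehension preserves the input order, so a rank-ordered list stays rank-ordered after filtering.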
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_ec(infile)
Parse the EC.dat output file of COACH and return a list of rank-ordered EC number predictions
EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues
Parameters: infile (str) – Path to EC.dat
Returns: Ranked list of dictionaries, keys defined below
- pdb_template_id: PDB ID of the template used to make the prediction
- pdb_template_chain: chain of the PDB which has the ligand
- tm_score: TM-score of the template to the model (similarity score)
- rmsd: RMSD of the template to the model (also a measure of similarity)
- seq_ident: percent sequence identity
- seq_coverage: percent sequence coverage
- c_score: confidence score of the EC prediction
- ec_number: predicted EC number
- binding_residues: predicted residues to bind the ligand
Return type: list
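To make the column-to-key mapping concrete, here is a minimal standalone sketch that turns one EC.dat-style line into a dictionary with the keys listed above. It assumes whitespace-separated columns in the documented order and a PDB_ID column made of a four-character PDB ID followed by the chain; the function name parse_ec_line and the example line are hypothetical, and this is not the ssbio implementation.

```python
# Column order as documented for EC.dat; the first column is split further below.
EC_COLUMNS = ("pdb_template", "tm_score", "rmsd", "seq_ident",
              "seq_coverage", "c_score", "ec_number", "binding_residues")
NUMERIC_KEYS = ("tm_score", "rmsd", "seq_ident", "seq_coverage", "c_score")


def parse_ec_line(line):
    """Hypothetical sketch: map one whitespace-separated line to the documented keys."""
    fields = line.split()
    if len(fields) != len(EC_COLUMNS):
        raise ValueError("expected {} columns, got {}".format(
            len(EC_COLUMNS), len(fields)))
    record = dict(zip(EC_COLUMNS, fields))
    # assumption: the template column is a 4-character PDB ID plus the chain
    template = record.pop("pdb_template")
    record["pdb_template_id"] = template[:4]
    record["pdb_template_chain"] = template[4:]
    for key in NUMERIC_KEYS:
        record[key] = float(record[key])
    return record


# Made-up line in the assumed layout, for illustration only.
example_line = "1abcA 0.912 1.25 0.431 0.987 0.512 1.1.1.1 45,78,102"
parsed = parse_ec_line(example_line)
```

The EC number is deliberately kept as a string, since a value such as 1.1.1.1 is an identifier rather than a number.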
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_ec_df(infile)
Parse the EC.dat output file of COACH and return a dataframe of results
EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues
Parameters: infile (str) – Path to EC.dat
Returns: Pandas DataFrame summarizing EC number predictions
Return type: DataFrame
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_go(infile)
Parse a GO output file from COACH and return a rank-ordered list of GO term predictions
The columns in all files are: GO terms, Confidence score, Name of GO terms. The files are:
- GO_MF.dat - GO terms in ‘molecular function’
- GO_BP.dat - GO terms in ‘biological process’
- GO_CC.dat - GO terms in ‘cellular component’
Parameters: infile (str) – Path to any COACH GO prediction file
Returns: Organized dataframe of results, columns defined below
- go_id: GO term ID
- go_term: GO term text
- c_score: confidence score of the GO prediction
Return type: Pandas DataFrame
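The three-column layout shared by the GO files (GO term, confidence score, GO term name) can be read with a few lines of plain Python. The sketch below is not the ssbio parser: it assumes whitespace-separated columns where only the last column (the name) may contain spaces, the function name parse_go_lines is hypothetical, and the scores in the example lines are made up.

```python
def parse_go_lines(lines):
    """Hypothetical sketch: parse 'GO term, confidence score, name' lines in file order."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split at most twice so that multi-word GO term names stay intact
        go_id, c_score, go_term = line.split(None, 2)
        records.append({
            "go_id": go_id,
            "go_term": go_term,
            "c_score": float(c_score),
        })
    return records


# Illustrative lines in the assumed layout; the scores are made up.
example_lines = [
    "GO:0016491 0.45 oxidoreductase activity",
    "GO:0005524 0.31 ATP binding",
    "",
]
go_records = parse_go_lines(example_lines)
```

File order is preserved rather than re-sorted, on the assumption that the files are already ranked by confidence.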
ssbio.protein.structure.homology.itasser.itasserprop.parse_cscore(infile)
Parse the cscore file to return a dictionary of scores.
Parameters: infile (str) – Path to cscore
Returns: Dictionary of scores
Return type: dict
ssbio.protein.structure.homology.itasser.itasserprop.parse_exp_dat(infile)
Parse the solvent accessibility predictions in exp.dat
Parameters: infile (str) – Path to exp.dat
Returns: List of solvent accessibility predictions for all residues
Return type: list
ssbio.protein.structure.homology.itasser.itasserprop.parse_init_dat(infile)
Parse the main init.dat file which contains the modeling results
The first line of the file init.dat looks like this:
"120 easy 40 8"
The other lines look like this:
" 161 11.051 1 1guqA MUSTER"
Taking the first 10 of these lines gives you the top 10 templates used in modeling.
Parameters: infile (str) – Path to init.dat
Returns: Dictionary of parsed information
Return type: dict
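The two example lines above are enough for a small illustrative parser. The sketch below is an assumption-laden stand-in, not the ssbio implementation: it treats the first non-empty line as a header kept as raw fields, reads a template entry from any later line with exactly five whitespace-separated fields, and splits the fourth field into a four-character PDB ID plus chain. The function name parse_init_lines and the output keys are hypothetical.

```python
def parse_init_lines(lines, n_templates=10):
    """Hypothetical sketch: pull the header fields and the top templates from init.dat lines."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("no content to parse")
    header = rows[0]  # e.g. ['120', 'easy', '40', '8'], kept uninterpreted
    templates = []
    for fields in rows[1:]:
        if len(fields) != 5:
            continue  # assumption: only five-field lines describe a template
        pdb_chain, program = fields[3], fields[4]
        templates.append({
            "pdb_id": pdb_chain[:4],   # e.g. '1guq'
            "chain": pdb_chain[4:],    # e.g. 'A'
            "program": program,        # e.g. 'MUSTER'
        })
        if len(templates) == n_templates:
            break
    return {"header_fields": header, "templates": templates}


# The two example lines quoted in the description above.
example = ["120 easy 40 8", " 161 11.051 1 1guqA MUSTER"]
init_info = parse_init_lines(example)
```

Stopping at n_templates entries mirrors the note above that the first 10 template lines correspond to the top 10 templates.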