Homology modeling


I-TASSER (Iterative Threading ASSEmbly Refinement) is a program for protein homology modeling and functional prediction from a protein sequence. The I-TASSER suite also provides tools for ligand-binding site prediction, model refinement, secondary structure prediction, B-factor estimation, and more. ssbio mainly provides tools to run I-TASSER and parse its homology modeling results, as well as COACH consensus binding site predictions (optionally with EC number and GO term predictions). Scripts are also provided to automate large-scale homology modeling using the TORQUE or Slurm job schedulers in a cluster computing environment.

Installation instructions


These instructions were created on an Ubuntu 17.04 system.


Read the README on the I-TASSER Suite page for the most up-to-date instructions.

  1. Make sure you have Java installed and it can be run from the command line with java

  2. Head to the I-TASSER download page and register for a license (academic only) to get a password emailed to you

  3. Log in to the I-TASSER download page and download the archive

  4. Unpack the software archive into a convenient directory - a library should also be downloaded to this directory

  5. Run download_lib.pl to then download the library files - this will take some time:

    /path/to/<I-TASSER_directory>/download_lib.pl -libdir ITLIB
  6. Now, I-TASSER can be run according to the README under section 4

  7. To enable GO term predictions…

    1. under construction…
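  8. Tip: to update template libraries automatically, add a new entry to your crontab (first run crontab -e). A per-user crontab has no user field, so the entry is just the schedule followed by the command; only the system-wide /etc/crontab needs a <USERNAME> field (replaced with your username) before the command:

    0 4 * * 1,5 /path/to/I-TASSER4.4/download_lib.pl -libdir /path/to/ITLIB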

    That will run the library update at 4 am every Monday and Friday.

Program execution

In the shell

To run the program on its own in the shell…


With ssbio

To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()

FAQs


  • What is a homology model?

    • A predicted 3D structure model of a protein sequence. Models can be template-based, when they are based on an existing experimental structure; or ab initio, generated without a template. Generally, ab initio models are much less reliable.
  • Can I just run I-TASSER using their web server and parse those results with ssbio?

    • Not yet, but you can manually input the model1.pdb file as a new structure for now.
  • How do I cite I-TASSER?

    • Roy A, Kucukural A & Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5: 725–738. Available at: http://dx.doi.org/10.1038/nprot.2010.5
  • How do I run I-TASSER with TORQUE or Slurm job schedulers?

    • under construction…
  • I’m having issues running I-TASSER…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!


class ssbio.protein.structure.homology.itasser.itasserprep.ITASSERPrep(ident, seq_str, root_dir, itasser_path, itlib_path, execute_dir=None, light=True, runtype='local', print_exec=False, java_home=None, binding_site_pred=False, ec_pred=False, go_pred=False, additional_options=None, job_scheduler_header=None)

Prepare a protein sequence for an I-TASSER homology modeling run.

The main utilities of this class are to:

  • Allow for the input of a protein sequence string and paths to I-TASSER to create execution scripts
  • Automate large-scale homology modeling efforts by creating Slurm or TORQUE job scheduling scripts

Parameters:
  • ident – Identifier for your sequence. Will be used as the global ID (folder name, sequence name)
  • seq_str – Sequence in string format
  • root_dir – Local directory where I-TASSER folder will be created
  • itasser_path – Path to I-TASSER folder, e.g. ‘~/software/I-TASSER4.4’
  • itlib_path – Path to ITLIB folder, e.g. ‘~/software/ITLIB’
  • execute_dir – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
  • light – If simulations should be limited to 5 runs
  • runtype – How you will be running I-TASSER - local, slurm, or torque
  • print_exec – If the execution script should be printed out
  • java_home – Path to Java executable
  • binding_site_pred – If binding site predictions should be run
  • ec_pred – If EC number predictions should be run
  • go_pred – If GO term predictions should be run
  • additional_options – Any other additional I-TASSER options, appended to the command
  • job_scheduler_header – Any job scheduling options, prepended as a header to the file

Takes in a sequence string and prepares the folder for the I-TASSER run.
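
As a rough sketch of how a job_scheduler_header string might be assembled for a Slurm run: the helper build_slurm_header below is hypothetical (it is not part of ssbio), and the job name, wall time and task count are placeholder values. Only the #SBATCH directives themselves are standard Slurm options.

```python
def build_slurm_header(job_name, walltime="24:00:00", ntasks=1, partition=None):
    """Build a block of #SBATCH lines (hypothetical helper, not part of ssbio)."""
    lines = [
        "#SBATCH --job-name={}".format(job_name),
        "#SBATCH --time={}".format(walltime),
        "#SBATCH --ntasks={}".format(ntasks),
    ]
    # Only name a partition when the caller asks for one
    if partition is not None:
        lines.append("#SBATCH --partition={}".format(partition))
    return "\n".join(lines) + "\n"


# Placeholder identifier and wall time, for illustration only
header = build_slurm_header("P12345", walltime="48:00:00")
print(header)
```

The resulting string would then be passed as job_scheduler_header together with runtype='slurm', following the constructor signature documented above.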

class ssbio.protein.structure.homology.itasser.itasserprop.ITASSERProp(ident, original_results_path, coach_results_folder='model1/coach', model_to_use='model1')

Parse all available information for a local I-TASSER modeling run.

Initializes a class to collect I-TASSER modeling information and optionally copy results to a new directory. SEE: https://zhanglab.ccmb.med.umich.edu/papers/2015_1.pdf for detailed information.

  • ident (str) – ID of I-TASSER modeling run
  • original_results_path (str) – Path to I-TASSER modeling folder
  • coach_results_folder (str) – Path to original COACH results
  • model_to_use (str) – Which I-TASSER model to use. Default is “model1”

copy_results(copy_to_dir, rename_model_to=None, force_rerun=False)

Copy the raw information from I-TASSER modeling to a new folder.

Copies all files in the list _attrs_to_copy.

  • copy_to_dir (str) – Directory to copy the minimal set of results per sequence.
  • rename_model_to (str) – New file name (without extension)
  • force_rerun (bool) – If existing models and results should be overwritten.
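
The interplay of rename_model_to and force_rerun can be illustrated with a small stand-alone sketch. The function copy_if_needed is hypothetical and only mirrors the documented parameters; the real ssbio implementation may differ. The file contents and the new name are throwaway placeholders.

```python
import os
import shutil
import tempfile


def copy_if_needed(src, dest_dir, new_name=None, force_rerun=False):
    """Copy src into dest_dir, optionally renaming it (extension is kept).

    Hypothetical helper; an existing copy is only overwritten when
    force_rerun is True.
    """
    base, ext = os.path.splitext(os.path.basename(src))
    dest = os.path.join(dest_dir, (new_name or base) + ext)
    if force_rerun or not os.path.exists(dest):
        shutil.copy(src, dest)
    return dest


# Demonstration with throwaway files in a temporary directory
workdir = tempfile.mkdtemp()
src_dir = os.path.join(workdir, "run")
out_dir = os.path.join(workdir, "copied")
os.makedirs(src_dir)
os.makedirs(out_dir)

model = os.path.join(src_dir, "model1.pdb")
with open(model, "w") as fh:
    fh.write("REMARK first version\n")
copied = copy_if_needed(model, out_dir, new_name="P12345_model1")

# Change the source; without force_rerun the existing copy is left alone
with open(model, "w") as fh:
    fh.write("REMARK second version\n")
copy_if_needed(model, out_dir, new_name="P12345_model1")
with open(copied) as fh:
    unchanged = fh.read()

# With force_rerun=True the copy is refreshed
copy_if_needed(model, out_dir, new_name="P12345_model1", force_rerun=True)
with open(copied) as fh:
    overwritten = fh.read()

shutil.rmtree(workdir)
```

Skipping existing files by default keeps repeated runs over many sequences cheap, while force_rerun gives an explicit way to refresh stale copies.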

get_dict(only_attributes=None, exclude_attributes=None, df_format=False)

Summarize the I-TASSER run in a dictionary containing modeling results and top predictions from COACH

  • only_attributes (str, list) – Attributes that should be returned. If not provided, all are returned.
  • exclude_attributes (str, list) – Attributes that should be excluded.
  • df_format (bool) – If dictionary values should be formatted for a dataframe (everything possible is transformed into strings, int, or float - if something can’t be transformed it is excluded)

Returns:Dictionary of attributes
Return type:dict
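
To make the filtering options concrete, here is a minimal stand-alone sketch of the same idea. The function summarize and the toy Run class are hypothetical and are not the ssbio code; in particular, joining lists into a semicolon-separated string is only one assumed way of making a value dataframe-friendly.

```python
def summarize(obj, only_attributes=None, exclude_attributes=None, df_format=False):
    """Return an object's public attributes as a dict (hypothetical sketch)."""

    def as_list(value):
        # Accept either a single attribute name or a list of names
        if value is None:
            return None
        return [value] if isinstance(value, str) else list(value)

    only = as_list(only_attributes)
    exclude = as_list(exclude_attributes) or []

    result = {}
    for key, value in vars(obj).items():
        if key.startswith("_"):
            continue
        if only is not None and key not in only:
            continue
        if key in exclude:
            continue
        if df_format:
            if isinstance(value, (list, tuple)):
                # Assumed convention: flatten sequences into one string
                value = ";".join(str(v) for v in value)
            elif not isinstance(value, (str, int, float)):
                continue  # cannot be represented simply, so it is dropped
        result[key] = value
    return result


class Run:
    """Toy object; the attribute names are illustrative only."""

    def __init__(self):
        self.id = "P12345"
        self.c_score = 1.25
        self.top_templates = ["1guqA", "9xyzB"]  # second ID is made up
        self.raw = {"k": "v"}


run = Run()
flat = summarize(run, df_format=True)
only_id = summarize(run, only_attributes="id")
trimmed = summarize(run, exclude_attributes=["raw", "top_templates"])
```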


load_structure_path(structure_path, file_type='pdb')

Load a structure file and provide pointers to its location

  • structure_path (str) – Path to structure file
  • file_type (str) – Type of structure file

Parse the B-factor predictions in BFP.dat

Parameters:infile (str) – Path to BFP.dat
Returns:List of B-factor predictions for all residues
Return type:list

Parse the Bsites.inf output file of COACH and return a list of rank-ordered binding site predictions

Bsites.inf contains the summary of COACH clustering results after all other prediction algorithms have finished. For each site (cluster), there are three lines:

  • Line 1: site number, c-score of coach prediction, cluster size
  • Line 2: algorithm, PDB ID, ligand ID, center of binding site (cartesian coordinates), c-score of the algorithm’s prediction, binding residues from single template
  • Line 3: Statistics of ligands in the cluster

C-score information:

Parameters:infile (str) – Path to Bsites.inf
Returns:Ranked list of dictionaries, keys defined below
  • site_num: cluster which is the consensus binding site
  • c_score: confidence score of the cluster prediction
  • cluster_size: number of predictions within this cluster
  • algorithm: main? algorithm used to make the prediction
  • pdb_template_id: PDB ID of the template used to make the prediction
  • pdb_template_chain: chain of the PDB which has the ligand
  • pdb_ligand: predicted ligand to bind
  • binding_location_coords: centroid of the predicted ligand position in the homology model
  • c_score_method: confidence score for the main algorithm
  • binding_residues: predicted residues to bind the ligand
  • ligand_cluster_counts: number of predictions per ligand
Return type:list
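
The three-lines-per-site layout described above can be handled by grouping non-empty lines in threes before any field-level parsing. The sketch below does only that grouping; group_site_records is a hypothetical helper, the sample text is placeholder content rather than real Bsites.inf output, and no assumption is made about how fields are delimited within a line.

```python
def group_site_records(text):
    """Group non-empty lines into three-line site records (hypothetical helper)."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError(
            "expected three lines per site, got {} lines".format(len(lines))
        )
    sites = []
    for i in range(0, len(lines), 3):
        sites.append({
            "summary": lines[i],           # line 1: site number, c-score, cluster size
            "template": lines[i + 1],      # line 2: algorithm, template, ligand, ...
            "ligand_stats": lines[i + 2],  # line 3: ligand statistics for the cluster
        })
    return sites


# Placeholder text standing in for file contents
sample = """\
site-1 line 1
site-1 line 2
site-1 line 3

site-2 line 1
site-2 line 2
site-2 line 3
"""

sites = group_site_records(sample)
print(len(sites))
```

Checking that the line count is a multiple of three catches truncated files early, before individual fields are interpreted.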

Parse the EC.dat output file of COACH and return a list of rank-ordered EC number predictions

EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues

Parameters:infile (str) – Path to EC.dat
Returns:Ranked list of dictionaries, keys defined below
  • pdb_template_id: PDB ID of the template used to make the prediction
  • pdb_template_chain: chain of the PDB which has the ligand
  • tm_score: TM-score of the template to the model (similarity score)
  • rmsd: RMSD of the template to the model (also a measure of similarity)
  • seq_ident: percent sequence identity
  • seq_coverage: percent sequence coverage
  • c_score: confidence score of the EC prediction
  • ec_number: predicted EC number
  • binding_residues: predicted residues to bind the ligand
Return type:list
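
A minimal sketch of mapping one EC.dat row onto the keys listed above. It assumes the eight columns are whitespace-separated and that the PDB_ID column is a four-character PDB ID followed by a chain letter; neither detail is confirmed here, and the sample row is made up. parse_ec_line is a hypothetical name.

```python
NUMERIC_KEYS = ["tm_score", "rmsd", "seq_ident", "seq_coverage", "c_score"]


def parse_ec_line(line):
    """Split one row into a dict (sketch; whitespace-separated columns assumed)."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 columns, got {}".format(len(fields)))
    pdb = fields[0]
    record = {
        # Assumption: four-character PDB ID followed by the chain ID
        "pdb_template_id": pdb[:4],
        "pdb_template_chain": pdb[4:],
    }
    for key, raw in zip(NUMERIC_KEYS, fields[1:6]):
        record[key] = float(raw)
    record["ec_number"] = fields[6]
    record["binding_residues"] = fields[7]  # kept as the raw string
    return record


# Made-up row for illustration only
sample_row = "1abcA 0.85 1.90 0.35 0.95 0.62 1.1.1.1 45,78,120"
rec = parse_ec_line(sample_row)
print(rec["pdb_template_id"], rec["ec_number"])
```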

Parse the EC.dat output file of COACH and return a dataframe of results

EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues

Parameters:infile (str) – Path to EC.dat
Returns:Pandas DataFrame summarizing EC number predictions
Return type:DataFrame

Parse a GO output file from COACH and return a rank-ordered list of GO term predictions

The columns in all files are: GO terms, Confidence score, Name of GO terms. The files are:

  • GO_MF.dat - GO terms in ‘molecular function’
  • GO_BP.dat - GO terms in ‘biological process’
  • GO_CC.dat - GO terms in ‘cellular component’
Parameters:infile (str) – Path to any COACH GO prediction file
Returns:Organized dataframe of results, columns defined below
  • go_id: GO term ID
  • go_term: GO term text
  • c_score: confidence score of the GO prediction
Return type:Pandas DataFrame
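
Because the third column (the GO term name) can contain spaces, a row is best split on the first two whitespace runs only. The sketch below assumes whitespace-separated columns and ranks by descending confidence score; parse_go_lines is a hypothetical helper that returns plain dictionaries rather than a DataFrame, and the scores in the sample are made up.

```python
def parse_go_lines(text):
    """Parse 'GO term, confidence score, name' rows (sketch, not the ssbio code)."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # maxsplit=2 keeps multi-word GO term names in one piece
        go_id, score, name = line.split(None, 2)
        records.append({
            "go_id": go_id,
            "go_term": name.strip(),
            "c_score": float(score),
        })
    # Assumed ranking: highest confidence first
    records.sort(key=lambda r: r["c_score"], reverse=True)
    return records


# Made-up scores for illustration only
sample = """\
GO:0005524 0.72 ATP binding
GO:0003824 0.91 catalytic activity
"""

go_records = parse_go_lines(sample)
print([r["go_id"] for r in go_records])
```

A list of dictionaries like this can be handed directly to pandas.DataFrame if a dataframe is wanted.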

Parse the cscore file to return a dictionary of scores.

Parameters:infile (str) – Path to cscore
Returns:Dictionary of scores
Return type:dict

Parse the solvent accessibility predictions in exp.dat

Parameters:infile (str) – Path to exp.dat
Returns:List of solvent accessibility predictions for all residues
Return type:list

Parse the main init.dat file which contains the modeling results

The first line of init.dat is a header such as:

"120 easy  40   8"

The other lines look like this:

"     161   11.051   1  1guqA MUSTER"

Taking the first 10 of these lines gives the top 10 templates used in modeling.

Parameters:infile (str) – Path to init.dat
Returns:Dictionary of parsed information
Return type:dict
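
Using only the two line shapes quoted above, a parser could be sketched as follows. parse_init_text is a hypothetical helper, and reading the last two tokens of a template line as the template ID and the threading program is an assumption based on the quoted example. The meaning of the other fields is not stated here, so they are kept as raw tokens, and any line that does not have exactly five tokens is skipped.

```python
def parse_init_text(text, n_templates=10):
    """Extract the header and the first n template lines (hypothetical sketch)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header_fields = lines[0].split()

    templates = []
    for line in lines[1:]:
        tokens = line.split()
        if len(tokens) != 5:
            continue  # not shaped like the quoted template line
        templates.append({
            "template_id": tokens[3],   # assumed: PDB ID plus chain, e.g. 1guqA
            "program": tokens[4],       # assumed: threading program name
            "other_fields": tokens[:3], # meaning not stated; kept raw
        })
        if len(templates) == n_templates:
            break

    return {"header_fields": header_fields, "top_templates": templates}


# The two example lines quoted in the description above
sample = "120 easy  40   8\n     161   11.051   1  1guqA MUSTER\n"
parsed = parse_init_text(sample)
print(parsed["top_templates"][0]["template_id"])
```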

Parse the secondary structure predictions in seq.dat

Parameters:infile (str) – Path to seq.dat
Returns:List of secondary structure predictions for all residues
Return type:list