Python API

Information on select functions, classes, or methods.

ssbio.pipeline.gempro

GEMPRO

class ssbio.pipeline.gempro.GEMPRO(gem_name, root_dir=None, pdb_file_type='mmtf', gem=None, gem_file_path=None, gem_file_type=None, genes_list=None, genes_and_sequences=None, genome_path=None, write_protein_fasta_files=True, description=None, custom_spont_id=None)[source]

Generic class to represent all information for a GEM-PRO project.

Initialize the GEM-PRO project with a genome-scale model, a list of genes, or a dict of genes and sequences. Specify the name of your project, along with the root directory where a folder with that name will be created.

Main methods provided are:

  1. Automated mapping of sequence IDs

    • With KEGG mapper
    • With UniProt mapper
    • Allowing manual gene ID –> protein sequence entry
    • Allowing manual gene ID –> UniProt ID
  2. Consolidating sequence IDs and setting a representative sequence

    • Currently these are set based on available PDB IDs
  3. Mapping of representative sequence –> structures

    • With UniProt –> ranking of PDB structures
    • BLAST representative sequence –> PDB database
  4. Preparation of files for homology modeling (currently for I-TASSER)

    • Mapping to existing models
    • Preparation for running I-TASSER
    • Parsing I-TASSER runs
  5. Running QC/QA on structures and setting a representative structure

    • Various cutoffs (mutations, insertions, deletions) can be set to filter structures
  6. Automation of protein sequence and structure property calculation

  7. Creation of Pandas DataFrame summaries directly from downloaded metadata

Parameters:
  • gem_name (str) – The name of your GEM or just your project in general. This will be the name of the main folder that is created in root_dir.
  • root_dir (str) – Path to where the folder named after gem_name will be created. If not provided, directories will not be created and output directories need to be specified for some steps.
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • gem (Model) – COBRApy Model object
  • gem_file_path (str) – Path to GEM file
  • gem_file_type (str) – GEM model type - sbml (or xml), mat, or json formats
  • genes_list (list) – List of gene IDs that you want to map
  • genes_and_sequences (dict) – Dictionary of gene IDs and their amino acid sequence strings
  • genome_path (str) – FASTA file of all protein sequences
  • write_protein_fasta_files (bool) – If individual protein FASTA files should be written out
  • description (str) – Description string of your project
  • custom_spont_id (str) – ID of spontaneous genes in a COBRA model which will be ignored for analysis
add_gene_ids(genes_list)[source]

Add gene IDs manually into the GEM-PRO project.

Parameters:genes_list (list) – List of gene IDs as strings.
base_dir

str – GEM-PRO project folder.

blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]

BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).

Parameters:
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
  • display_link (bool, optional) – Set to True if links to the HTML results should be displayed
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
custom_spont_id = None

str – ID of spontaneous genes in a COBRA model which will be ignored for analysis

data_dir

str – Directory where all data are stored.

df_homology_models

DataFrame – Get a dataframe of I-TASSER homology model results

df_kegg_metadata

DataFrame – Pandas DataFrame of KEGG metadata per protein.

df_pdb_blast

DataFrame – Get a dataframe of PDB BLAST results

df_pdb_metadata

DataFrame – Get a dataframe of PDB metadata (PDBs have to be downloaded first).

df_pdb_ranking

DataFrame – Get a dataframe of UniProt -> best structure in PDB results

df_proteins

DataFrame – Get a summary dataframe of all proteins in the project.

df_representative_sequences

DataFrame – Pandas DataFrame of representative sequence information per protein.

df_representative_structures

DataFrame – Get a dataframe of representative protein structure information.

df_uniprot_metadata

DataFrame – Pandas DataFrame of UniProt metadata per protein.

find_disulfide_bridges(representatives_only=True)[source]

Run Biopython’s disulfide bridge finder and store found bridges.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']

Parameters:representative_only (bool) – If analysis should only be run on the representative structure
find_disulfide_bridges_parallelize(sc, representatives_only=True)[source]

Run Biopython’s disulfide bridge finder and store found bridges.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']

Parameters:representative_only (bool) – If analysis should only be run on the representative structure
functional_genes

DictList – All genes with a representative protein structure.

genes = None

DictList – All protein-coding genes in this GEM-PRO project

genes_dir

str – Directory where all gene specific information is stored.

genes_with_a_representative_sequence

DictList – All genes with a representative sequence.

genes_with_a_representative_structure

DictList – All genes with a representative protein structure.

genes_with_experimental_structures

DictList – All genes that have at least one experimental structure.

genes_with_homology_models

DictList – All genes that have at least one homology model.

genes_with_structures

DictList – All genes with any mapped protein structures.

genome_path = None

str – Simple link to the filepath of the FASTA file containing all protein sequences

get_dssp_annotations(representatives_only=True, force_rerun=False)[source]

Run DSSP on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']

Parameters:
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_dssp_annotations_parallelize(sc, representatives_only=True, force_rerun=False)[source]

Run DSSP on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']

Parameters:
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_freesasa_annotations(include_hetatms=False, representatives_only=True, force_rerun=False)[source]

Run freesasa on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-freesasa']

Parameters:
  • include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_freesasa_annotations_parallelize(sc, include_hetatms=False, representatives_only=True, force_rerun=False)[source]

Run freesasa on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-freesasa']

Parameters:
  • include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_itasser_models(homology_raw_dir, custom_itasser_name_mapping=None, outdir=None, force_rerun=False)[source]

Copy generated I-TASSER models from a directory to the GEM-PRO directory.

Parameters:
  • homology_raw_dir (str) – Root directory of I-TASSER folders.
  • custom_itasser_name_mapping (dict) – Use this if your I-TASSER folder names differ from your model gene names. Input a dict of {model_gene: ITASSER_folder}.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
get_manual_homology_models(input_dict, outdir=None, clean=True, force_rerun=False)[source]

Copy homology models to the GEM-PRO project.

Requires an input of a dictionary formatted like so:

{
    model_gene: {
                    homology_model_id1: {
                                            'model_file': '/path/to/homology/model.pdb',
                                            'file_type': 'pdb'
                                            'additional_info': info_value
                                        },
                    homology_model_id2: {
                                            'model_file': '/path/to/homology/model.pdb'
                                            'file_type': 'pdb'
                                        }
                }
}
Parameters:
  • input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • clean (bool) – If homology files should be cleaned and saved as a new PDB file
  • force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
get_msms_annotations(representatives_only=True, force_rerun=False)[source]

Run MSMS on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-msms']

Parameters:
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_msms_annotations_parallelize(sc, representatives_only=True, force_rerun=False)[source]

Run MSMS on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-msms']

Parameters:
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_scratch_predictions(path_to_scratch, results_dir, scratch_basename='scratch', num_cores=1, exposed_buried_cutoff=25, custom_gene_mapping=None)[source]

Run and parse SCRATCH results to predict secondary structure and solvent accessibility. Annotations are stored in the protein’s representative sequence at:

  • .annotations
  • .letter_annotations
Parameters:
  • path_to_scratch (str) – Path to SCRATCH executable
  • results_dir (str) – Path to SCRATCH results folder, which will have the files (scratch.ss, scratch.ss8, scratch.acc, scratch.acc20)
  • scratch_basename (str) – Basename of the SCRATCH results (‘scratch’ is default)
  • num_cores (int) – Number of cores to use to parallelize SCRATCH run
  • exposed_buried_cutoff (int) – Cutoff of exposed/buried for the acc20 predictions
  • custom_gene_mapping (dict) – Default parsing of SCRATCH output files is to look for the model gene IDs. If your output files contain IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
get_sequence_properties(representatives_only=True)[source]

Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of all protein sequences. Results are stored in the protein’s respective SeqProp objects at .annotations

Parameters:representative_only (bool) – If analysis should only be run on the representative sequences
get_tmhmm_predictions(tmhmm_results, custom_gene_mapping=None)[source]

Parse TMHMM results and store in the representative sequences.

This is a basic function to parse pre-run TMHMM results. Run TMHMM from the web service (http://www.cbs.dtu.dk/services/TMHMM/) by doing the following:

  1. Write all representative sequences in the GEM-PRO using the function write_representative_sequences_file
  2. Upload the file to http://www.cbs.dtu.dk/services/TMHMM/ and choose “Extensive, no graphics” as the output
  3. Copy and paste the results (ignoring the top header and above “HELP with output formats”) into a file and save it
  4. Run this function on that file
Parameters:
  • tmhmm_results (str) – Path to TMHMM results (long format)
  • custom_gene_mapping (dict) – Default parsing of TMHMM output is to look for the model gene IDs. If your output file contains IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
kegg_mapping_and_metadata(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:
  1. Download all metadata and sequence files in the sequences directory
  2. Creates a KEGGProp object in the protein.sequences attribute
  3. Returns a Pandas DataFrame of mapping results
Parameters:
  • kegg_organism_code (str) – The three letter KEGG code of your organism
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
kegg_mapping_and_metadata_parallelize(sc, kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:
  1. Download all metadata and sequence files in the sequences directory
  2. Creates a KEGGProp object in the protein.sequences attribute
  3. Returns a Pandas DataFrame of mapping results
Parameters:
  • sc (SparkContext) – Spark Context to parallelize this function
  • kegg_organism_code (str) – The three letter KEGG code of your organism
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
load_cobra_model(model)[source]

Load a COBRApy Model object into the GEM-PRO project.

Parameters:model (Model) – COBRApy Model object
manual_seq_mapping(gene_to_seq_dict, outdir=None, write_fasta_files=True, set_as_representative=True)[source]

Read a manual input dictionary of model gene IDs –> protein sequences. By default sets them as representative.

Parameters:
  • gene_to_seq_dict (dict) – Mapping of gene IDs to their protein sequence strings
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • write_fasta_files (bool) – If individual protein FASTA files should be written out
  • set_as_representative (bool) – If mapped sequences should be set as representative
manual_uniprot_mapping(gene_to_uniprot_dict, outdir=None, set_as_representative=True)[source]

Read a manual dictionary of model gene IDs –> UniProt IDs. By default sets them as representative.

This allows for mapping of the missing genes, or overriding of automatic mappings.

Input a dictionary of:

{
    <gene_id1>: <uniprot_id1>,
    <gene_id2>: <uniprot_id2>,
}
Parameters:
  • gene_to_uniprot_dict – Dictionary of mappings as shown above
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]

Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.

The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html The list of PDB structures mapping to a UniProt accession sorted by coverage of the protein and, if the same, resolution.

Parameters:
  • seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
  • outdir (str) – Output directory to cache JSON results of search
  • force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns:

A rank-ordered list of PDBProp objects that map to the UniProt ID

Return type:

list

missing_homology_models

list – List of genes with no mapping to any homology models.

missing_kegg_mapping

list – List of genes with no mapping to KEGG.

missing_pdb_structures

list – List of genes with no mapping to any experimental PDB structure.

missing_representative_sequence

list – List of genes with no mapping to a representative sequence.

missing_representative_structure

list – List of genes with no mapping to a representative structure.

missing_uniprot_mapping

list – List of genes with no mapping to UniProt.

model = None

Model – COBRApy model object

model_dir

str – Directory where original GEMs and GEM-related files are stored.

pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to each protein’s structures directory.

Parameters:
  • outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
  • pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
  • force_rerun (bool) – If files should be re-downloaded if they already exist
pdb_file_type = None

strpdb, mmCif, xml, mmtf - file type for files downloaded from the PDB

prep_itasser_modeling(itasser_installation, itlib_folder, runtype, create_in_dir=None, execute_from_dir=None, all_genes=False, print_exec=False, **kwargs)[source]

Prepare to run I-TASSER homology modeling for genes without structures, or all genes.

Parameters:
  • itasser_installation (str) – Path to I-TASSER folder, i.e. ~/software/I-TASSER4.4
  • itlib_folder (str) – Path to ITLIB folder, i.e. ~/software/ITLIB
  • runtype – How you will be running I-TASSER - local, slurm, or torque
  • create_in_dir (str) – Local directory where folders will be created
  • execute_from_dir (str) – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
  • all_genes (bool) – If all genes should be prepped, or only those without any mapped structures
  • print_exec (bool) – If the execution statement should be printed to run modelling

Todo

  • Document kwargs - extra options for I-TASSER, SLURM or Torque execution
  • Allow modeling of any sequence in sequences attribute, select by ID or provide SeqProp?
root_dir

str – Directory where GEM-PRO project folder named after the attribute base_dir is located.

set_representative_sequence(force_rerun=False)[source]

Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.

Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.

Parameters:force_rerun (bool) – Set to True to recheck stored sequences
set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]

Set all representative structure for proteins from a structure in the structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

  • Homology model(s)
  • Ranked PDBs
  • BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
  • struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
  • clean (bool) – If structures should be cleaned
  • force_rerun (bool) – If sequence to structure alignment should be rerun

Todo

  • Remedy large structure representative setting
structures_dir

str – Directory where all structures are stored.

uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.

Parameters:
  • model_gene_source (str) –

    the database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:

    • Ensembl Genomes - ENSEMBLGENOME_ID (i.e. E. coli b-numbers)
    • Entrez Gene (GeneID) - P_ENTREZGENEID
    • RefSeq Protein - P_REFSEQ_AC
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
write_representative_sequences_file(outname, outdir=None, set_ids_from_model=True)[source]

Write all the model’s sequences as a single FASTA file. By default, sets IDs to model gene IDs.

Parameters:
  • outname (str) – Name of the output FASTA file without the extension
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_ids_from_model (bool) – If the gene ID source should be the model gene IDs, not the original sequence ID

ssbio.databases

PDBProp

class ssbio.databases.pdb.PDBProp(ident, description=None, chains=None, mapped_chains=None, structure_path=None, file_type=None)[source]

Store information about a protein structure from the Protein Data Bank.

Extends the StructProp class to allow initialization of the structure by its PDB ID, and then enabling downloads of the structure file as well as parsing its metadata.

Parameters:
  • ident (str) –
  • description (str) –
  • chains (str) –
  • mapped_chains (str) –
  • structure_path (str) –
  • file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
biological_assemblies = None

DictList – A list for storing Bioassembly objects related to this PDB ID

download_structure_file(outdir, file_type=None, load_header_metadata=True, force_rerun=False)[source]

Download a structure file from the PDB, specifying an output directory and a file type. Optionally download the mmCIF header file and parse data from it to store within this object.

Parameters:
  • outdir (str) – Path to output directory
  • file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • load_header_metadata (bool) – If header metadata should be loaded into this object, fastest with mmtf files
  • force_rerun (bool) – If structure file should be downloaded even if it already exists
ssbio.databases.pdb.best_structures(uniprot_id, outname=None, outdir=None, seq_ident_cutoff=0.0, force_rerun=False)[source]

Use the PDBe REST service to query for the best PDB structures for a UniProt ID.

More information found here: https://www.ebi.ac.uk/pdbe/api/doc/sifts.html Link used to retrieve results: https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/:accession The list of PDB structures mapping to a UniProt accession sorted by coverage of the protein and, if the same, resolution.

Here is the ranking algorithm described by the PDB paper: https://nar.oxfordjournals.org/content/44/D1/D385.full

“Finally, a single quality indicator is also calculated for each entry by taking the harmonic average of all the percentile scores representing model and model-data-fit quality measures and then subtracting 10 times the numerical value of the resolution (in Angstrom) of the entry to ensure that resolution plays a role in characterising the quality of a structure. This single empirical ‘quality measure’ value is used by the PDBe query system to sort results and identify the ‘best’ structure in a given context. At present, entries determined by methods other than X-ray crystallography do not have similar data quality information available and are not considered as ‘best structures’.”

Parameters:
  • uniprot_id (str) – UniProt Accession ID
  • outname (str) – Basename of the output file of JSON results
  • outdir (str) – Path to output directory of JSON results
  • seq_ident_cutoff (float) – Cutoff results based on percent coverage (in decimal form)
  • force_rerun (bool) – Obtain best structures mapping ignoring previously downloaded results
Returns:

Rank-ordered list of dictionaries representing chain-specific PDB entries. Keys are:
  • pdb_id: the PDB ID which maps to the UniProt ID
  • chain_id: the specific chain of the PDB which maps to the UniProt ID
  • coverage: the percent coverage of the entire UniProt sequence
  • resolution: the resolution of the structure
  • start: the structure residue number which maps to the start of the mapped sequence
  • end: the structure residue number which maps to the end of the mapped sequence
  • unp_start: the sequence residue number which maps to the structure start
  • unp_end: the sequence residue number which maps to the structure end
  • experimental_method: type of experiment used to determine structure
  • tax_id: taxonomic ID of the protein’s original organism

Return type:

list

ssbio.databases.pdb.blast_pdb(seq, outfile='', outdir='', evalue=0.0001, seq_ident_cutoff=0.0, link=False, force_rerun=False)[source]

Returns a list of BLAST hits of a sequence to available structures in the PDB.

Parameters:
  • seq (str) – Your sequence, in string format
  • outfile (str) – Name of output file
  • outdir (str, optional) – Path to output directory. Default is the current directory.
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
  • link (bool, optional) – Set to True if a link to the HTML results should be displayed
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
Returns:

Rank ordered list of BLAST hits in dictionaries.

Return type:

list

ssbio.databases.pdb.blast_pdb_df(blast_results)[source]

Make a dataframe of BLAST results

ssbio.databases.pdb.download_mmcif_header(pdb_id, outdir='', force_rerun=False)[source]

Download a mmCIF header file from the RCSB PDB by ID.

Parameters:
  • pdb_id – PDB ID
  • outdir – Optional output directory, default is current working directory
  • force_rerun – If the file should be downloaded again even if it exists
Returns:

Path to outfile

Return type:

str

ssbio.databases.pdb.download_sifts_xml(pdb_id, outdir='', force_rerun=False)[source]

Download the SIFTS file for a PDB ID.

Parameters:
  • pdb_id (str) – PDB ID
  • outdir (str) – Output directory, current working directory if not specified.
  • force_rerun (bool) – If the file should be downloaded again even if it exists
Returns:

Path to downloaded file

Return type:

str

ssbio.databases.pdb.download_structure(pdb_id, file_type, outdir='', only_header=False, force_rerun=False)[source]

Download a structure from the RCSB PDB by ID. Specify the file type desired.

Parameters:
  • pdb_id – PDB ID
  • file_type – pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz
  • outdir – Optional output directory
  • only_header – If only the header file should be downloaded
  • force_rerun – If the file should be downloaded again even if it exists
Returns:

Path to outfile

Return type:

str

Deprecated since version 1.0: This will be removed in 2.0. Use Biopython’s PDBList.retrieve_pdb_file function instead

ssbio.databases.pdb.get_bioassembly_info(pdb_id, biomol_num, cache=False, outdir=None, force_rerun=False)[source]

Get metadata about a bioassembly from the RCSB PDB’s REST API.

See: https://www.rcsb.org/pdb/rest/bioassembly/bioassembly?structureId=1hv4&nr=1 The API returns an XML file containing the information on a biological assembly that looks like this:

<bioassembly structureId="1HV4" assemblyNr="1" method="PISA" desc="author_and_software_defined_assembly">
    <transformations operator="1" chainIds="A,B,C,D">
        <transformation index="1">
            <matrix m11="1.00000000" m12="0.00000000" m13="0.00000000" m21="0.00000000" m22="1.00000000" m23="0.00000000" m31="0.00000000" m32="0.00000000" m33="1.00000000"/>
            <shift v1="0.00000000" v2="0.00000000" v3="0.00000000"/>
        </transformation>
    </transformations>
</bioassembly>
Parameters:
  • pdb_id (str) – PDB ID
  • biomol_num (int) – Biological assembly number you are interested in
  • cache (bool) – If the XML file should be downloaded
  • outdir (str) – If cache, then specify the output directory
  • force_rerun (bool) – If cache, and if file exists, specify if API should be queried again
ssbio.databases.pdb.get_num_bioassemblies(pdb_id, cache=False, outdir=None, force_rerun=False)[source]

Check if there are bioassemblies using the PDB REST API, and if there are, get the number of bioassemblies available.

See: https://www.rcsb.org/pages/webservices/rest, section ‘List biological assemblies’

Not all PDB entries have biological assemblies available and some have multiple. Details that are necessary to recreate a biological assembly from the asymmetric unit can be accessed from the following requests.

  • Number of biological assemblies associated with a PDB entry
  • Access the transformation information needed to generate a biological assembly (nr=0 will return information for the asymmetric unit, nr=1 will return information for the first assembly, etc.)

A query of https://www.rcsb.org/pdb/rest/bioassembly/nrbioassemblies?structureId=1hv4 returns this:

<nrBioAssemblies structureId="1HV4" hasAssemblies="true" count="2"/>
Parameters:
  • pdb_id (str) – PDB ID
  • cache (bool) – If the XML file should be downloaded
  • outdir (str) – If cache, then specify the output directory
  • force_rerun (bool) – If cache, and if file exists, specify if API should be queried again
ssbio.databases.pdb.get_release_date(pdb_id)[source]

Quick way to get the release date of a PDB ID using the table of results from the REST service

Returns None if the release date is not available.

Returns:Organism of a PDB ID
Return type:str
ssbio.databases.pdb.get_resolution(pdb_id)[source]

Quick way to get the resolution of a PDB ID using the table of results from the REST service

Returns infinity if the resolution is not available.

Returns:resolution of a PDB ID in Angstroms
Return type:float

Todo

  • Unit test
ssbio.databases.pdb.map_uniprot_resnum_to_pdb(uniprot_resnum, chain_id, sifts_file)[source]

Map a UniProt residue number to its corresponding PDB residue number.

This function requires that the SIFTS file be downloaded, and also a chain ID (as different chains may have different mappings).

Parameters:
  • uniprot_resnum (int) – integer of the residue number you’d like to map
  • chain_id (str) – string of the PDB chain to map to
  • sifts_file (str) – Path to the SIFTS XML file
Returns:

tuple containing:

mapped_resnum (int): Mapped residue number is_observed (bool): Indicates if the 3D structure actually shows the residue

Return type:

(tuple)

ssbio.databases.pdb.parse_mmcif_header(infile)[source]

Parse a couple important fields from the mmCIF file format with some manual curation of ligands.

If you want full access to the mmCIF file just use the MMCIF2Dict class in Biopython.

Parameters:infile – Path to mmCIF file
Returns:Dictionary of parsed header
Return type:dict
ssbio.databases.pdb.parse_mmtf_header(infile)[source]

Parse an MMTF file and return basic header-like information.

Parameters:infile (str) – Path to MMTF file
Returns:Dictionary of parsed header
Return type:dict

Todo

  • Can this be sped up by not parsing the 3D coordinate info somehow?
  • OR just store the sequences when this happens since it is already being parsed.

PISA

ssbio.databases.pisa.download_pisa_multimers_xml(pdb_ids, save_single_xml_files=True, outdir=None, force_rerun=False)[source]

Download the PISA XML file for multimers.

See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info

XML description of macromolecular assemblies:
http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contain XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.
Parameters:
  • pdb_ids (str, list) – PDB ID or list of IDs
  • save_single_xml_files (bool) – If single XML files should be saved per PDB ID. If False, if multiple PDB IDs are provided, then a single, combined XML output file is downloaded
  • outdir (str) – Directory to output PISA XML files
  • force_rerun (bool) – Redownload files if they already exist
Returns:

of files downloaded

Return type:

list

ssbio.databases.pisa.parse_pisa_multimers_xml(pisa_multimers_xml, download_structures=False, outdir=None, force_rerun=False)[source]

Retrieve PISA information from an XML results file

See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info

XML description of macromolecular assemblies:
http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contain XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.
Parameters:
  • pisa_multimers_xml (str) – Path to PISA XML output file
  • download_structures (bool) – If assembly files should be downloaded
  • outdir (str) – Directory to output assembly files
  • force_rerun (bool) – Redownload files if they already exist
Returns:

of parsed PISA information

Return type:

dict

ssbio.databases.pisa.pdb_chain_stoichiometry_biomolone(pdbid)[source]

Get the stoichiometry of the chains in biological assembly 1 as a dictionary.

Steps taken are: 1) Download PDB and parse header, make biomolecule if provided 2) Count how many times each chain appears in biomolecule #1 3) Convert chain id to uniprot id 4) Return final dictionary

Parameters:pdbid (str) – 4 character PDB ID
Returns:{(ChainID,UniProtID): # occurences}
Return type:dict

SWISSMODEL

class ssbio.databases.swissmodel.SWISSMODEL(metadata_dir)[source]

Methods to parse through a SWISS-MODEL metadata set.

Download a particular organism’s metadata from SWISS-MODEL here: https://swissmodel.expasy.org/repository

Parameters:metadata_dir (str) – Path to the extracted SWISS-MODEL_Repository folder
all_models = None

dict – Dictionary of lists, UniProt ID as the keys

download_models(uniprot_acc, outdir='', force_rerun=False)[source]

Download all models available for a UniProt accession number.

Parameters:
  • uniprot_acc (str) – UniProt ACC/ID
  • outdir (str) – Path to output directory, uses working directory if not set
  • force_rerun (bool) – Force a redownload the models if they already exist
Returns:

Paths to the downloaded models

Return type:

list

get_model_filepath(infodict)[source]

Get the path to the homology model using information from the index dictionary for a single model.

Example: use self.get_models(UNIPROT_ID) to get all the models, which returns a list of dictionaries.
Use one of those dictionaries as input to this function to get the filepath to the model itself.
Parameters:infodict (dict) – Information about a model from get_models
Returns:Path to homology model
Return type:str
get_models(uniprot_acc)[source]

Return all available models for a UniProt accession number.

Parameters:uniprot_acc (str) – UniProt ACC/ID
Returns:All available models in SWISS-MODEL for this UniProt entry
Return type:dict
metadata_dir = None

str – Path to the extracted SWISS-MODEL_Repository folder

metadata_index_json

str – Path to the INDEX_JSON file.

organize_models(outdir, force_rerun=False)[source]

Organize and rename SWISS-MODEL models to a single folder with a name containing template information.

Parameters:
  • outdir (str) – New directory to copy renamed models to
  • force_rerun (bool) – If models should be copied again even if they already exist
Returns:

Dictionary of lists, UniProt IDs as the keys and new file paths as the values

Return type:

dict

parse_metadata()[source]

Parse the INDEX_JSON file and reorganize it as a dictionary of lists.

uniprots_modeled

list – Return all UniProt accession numbers with at least one model

ssbio.databases.swissmodel.get_oligomeric_state(swiss_model_path)[source]

Parse the oligomeric prediction in a SWISS-MODEL repository file

As of 2018-02-26, works on all E. coli models. Untested on other pre-made organism models.

Parameters:swiss_model_path (str) – Path to SWISS-MODEL PDB file
Returns:Information parsed about the oligomeric state
Return type:dict
ssbio.databases.swissmodel.translate_ostat(ostat)[source]

Translate the OSTAT field to an integer.

As of 2018-02-26, works on all E. coli models. Untested on other pre-made organism models.

Parameters:ostat (str) – Predicted oligomeric state of the PDB file
Returns:Translated string to integer
Return type:int

UniProtProp

class ssbio.databases.uniprot.UniProtProp(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, xml_path=None, gff_path=None)[source]

Generic class to store information on a UniProt entry, extended from a SeqProp object.

The main utilities of this class are to:

  1. Download and/or parse UniProt text or xml files
  2. Store extra parsed information in attributes
uniprot

str – Main UniProt accession code

alt_uniprots

list – Alternate accession codes that point to the main one

file_type

str – Metadata file type

reviewed

bool – If this entry is a “reviewed” entry. If None, then status is unknown.

ec_number

str – EC number

pfam

list – PFAM IDs

entry_version

str – Date of last update of the UniProt entry

seq_version

str – Date of last update of the UniProt sequence

download_metadata_file(outdir, force_rerun=False)[source]

Download and load the UniProt XML file

download_seq_file(outdir, force_rerun=False)[source]

Download and load the UniProt FASTA file

features

list – Get the features from the feature file, metadata file, or in memory

ranking_score()[source]

Provide a score for this UniProt ID based on reviewed (True=1, False=0) + number of PDBs

Returns:Scoring for this ID
Return type:int
seq

Seq – Get the Seq object from the sequence file, metadata file, or in memory

ssbio.databases.uniprot.blast_uniprot(seq_str, seq_ident=1, evalue=0.0001, reviewed_only=True, organism=None)[source]

BLAST the UniProt db to find what IDs match the sequence input

Parameters:
  • seq_str – Sequence string
  • seq_ident – Percent identity to match
  • evalue – E-value of BLAST hit

Returns:

ssbio.databases.uniprot.download_uniprot_file(uniprot_id, filetype, outdir='', force_rerun=False)[source]

Download a UniProt file for a UniProt ID/ACC

Parameters:
  • uniprot_id – Valid UniProt ID
  • filetype – txt, fasta, xml, rdf, or gff
  • outdir – Directory to download the file
Returns:

Absolute path to file

Return type:

str

ssbio.databases.uniprot.get_fasta(uniprot_id)[source]

Get the protein sequence for a UniProt ID as a string.

Parameters:uniprot_id – Valid UniProt ID
Returns:String of the protein (amino acid) sequence
Return type:str
ssbio.databases.uniprot.is_valid_uniprot_id(instring)[source]

Check if a string is a valid UniProt ID.

See regex from: http://www.uniprot.org/help/accession_numbers

Parameters:instring – any string identifier

Returns: True if the string is a valid UniProt ID

ssbio.databases.uniprot.old_parse_uniprot_txt_file(infile)[source]

From: boscoh/uniprot github Parses the text of metadata retrieved from uniprot.org.

Only a few fields have been parsed, but this provides a template for the other fields.

A single description is generated from joining alternative descriptions.

Returns a dictionary with the main UNIPROT ACC as keys.

ssbio.databases.uniprot.parse_uniprot_txt_file(infile)[source]

Parse a raw UniProt metadata file and return a dictionary.

Parameters:infile – Path to metadata file
Returns:Metadata dictionary
Return type:dict
ssbio.databases.uniprot.parse_uniprot_xml_metadata(sr)[source]

Load relevant attributes and dbxrefs from a parsed UniProt XML file in a SeqRecord.

Returns:All parsed information
Return type:dict
ssbio.databases.uniprot.uniprot_ec(uniprot_id)[source]

Retrieve the EC number annotation for a UniProt ID.

Parameters:uniprot_id – Valid UniProt ID

Returns:

ssbio.databases.uniprot.uniprot_reviewed_checker(uniprot_id)[source]

Check if a single UniProt ID is reviewed or not.

Parameters:uniprot_id
Returns:If the entry is reviewed
Return type:bool
ssbio.databases.uniprot.uniprot_reviewed_checker_batch(uniprot_ids)[source]

Batch check if uniprot IDs are reviewed or not

Parameters:uniprot_ids – UniProt ID or list of UniProt IDs
Returns:Boolean}
Return type:A dictionary of {UniProtID
ssbio.databases.uniprot.uniprot_sites(uniprot_id)[source]

Retrieve a list of UniProt sites parsed from the feature file

Sites are defined here: http://www.uniprot.org/help/site and here: http://www.uniprot.org/help/function_section

Parameters:uniprot_id – Valid UniProt ID

Returns:

KEGGProp

class ssbio.databases.kegg.KEGGProp(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, txt_path=None, gff_path=None)[source]
ssbio.databases.kegg.download_kegg_aa_seq(gene_id, outdir=None, force_rerun=False)[source]

Download a FASTA sequence of a protein from the KEGG database and return the path.

Parameters:
  • gene_id – the gene identifier
  • outdir – optional path to output directory
Returns:

Path to FASTA file

ssbio.databases.kegg.download_kegg_gene_metadata(gene_id, outdir=None, force_rerun=False)[source]

Download the KEGG flatfile for a KEGG ID and return the path.

Parameters:
  • gene_id – KEGG gene ID (with organism code), i.e. “eco:1244”
  • outdir – optional output directory of metadata
Returns:

Path to metadata file

ssbio.databases.kegg.map_kegg_all_genes(organism_code, target_db)[source]

Map all of an organism’s gene IDs to the target database.

This is faster than supplying a specific list of genes to map, plus there seems to be a limit on the number you can map with a manual REST query anyway.

Parameters:
  • organism_code – the three letter KEGG code of your organism
  • target_db – ncbi-proteinid | ncbi-geneid | uniprot
Returns:

Dictionary of ID mapping

ssbio.databases.kegg.parse_kegg_gene_metadata(infile)[source]

Parse the KEGG flatfile and return a dictionary of metadata.

Dictionary keys are:
refseq uniprot pdbs taxonomy
Parameters:infile – Path to KEGG flatfile
Returns:Dictionary of metadata
Return type:dict

ssbio.protein.structure.utils

CleanPDB

ssbio.protein.structure.utils.cleanpdb.clean_pdb(pdb_file, out_suffix='_clean', outdir=None, force_rerun=False, remove_atom_alt=True, keep_atom_alt_id='A', remove_atom_hydrogen=True, add_atom_occ=True, remove_res_hetero=True, keep_chemicals=None, keep_res_only=None, add_chain_id_if_empty='X', keep_chains=None)[source]

Clean a PDB file.

Parameters:
  • pdb_file (str) – Path to input PDB file
  • out_suffix (str) – Suffix to append to original filename
  • outdir (str) – Path to output directory
  • force_rerun (bool) – If structure should be re-cleaned if a clean file exists already
  • remove_atom_alt (bool) – Remove alternate positions
  • keep_atom_alt_id (str) – If removing alternate positions, which alternate ID to keep
  • remove_atom_hydrogen (bool) – Remove hydrogen atoms
  • add_atom_occ (bool) – Add atom occupancy fields if not present
  • remove_res_hetero (bool) – Remove all HETATMs
  • keep_chemicals (str, list) – If removing HETATMs, keep specified chemical names
  • keep_res_only (str, list) – Keep ONLY specified resnames, deletes everything else!
  • add_chain_id_if_empty (str) – Add a chain ID if not present
  • keep_chains (str, list) – Keep only these chains
Returns:

Path to cleaned PDB file

Return type:

str

MutatePDB

DOCK

class ssbio.protein.structure.utils.dock.DOCK(structure_id, pdb_file, amb_file, flex1_file, flex2_file, root_dir=None)[source]

Class to prepare a structure file for docking with DOCK6.

Attributes:

auto_flexdock(binding_residues, radius, ligand_path=None, force_rerun=False)[source]

Run DOCK6 on a PDB file, given its binding residues and a radius around them.

Provide a path to a ligand to dock a ligand to it. If no ligand is provided, DOCK6 preparations will be run on that structure file.

Parameters:
  • binding_residues (str) – Comma separated string of residues (eg: ‘144,170,199’)
  • radius (int, float) – Radius around binding residues to dock to
  • ligand_path (str) – Path to ligand (mol2 format) to dock to protein
  • force_rerun (bool) – If method should be rerun even if output files exist
binding_site_mol2(residues, force_rerun=False)[source]

Create mol2 of only binding site residues from the receptor

This function will take in a .pdb file (preferably the _receptor_noH.pdb file) and a string of residues (eg: ‘144,170,199’) and delete all other residues in the .pdb file. It then saves the coordinates of the selected residues as a .mol2 file. This is necessary for Chimera to select spheres within the radius of the binding site.

Parameters:
  • residues (str) – Comma separated string of residues (eg: ‘144,170,199’)
  • force_rerun (bool) – If method should be rerun even if output file exists
dms_maker(force_rerun=False)[source]

Create surface representation (dms file) of receptor

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
do_dock6_flexible(ligand_path, force_rerun=False)[source]

Dock a ligand to the protein.

Parameters:
  • ligand_path (str) – Path to ligand (mol2 format) to dock to protein
  • force_rerun (bool) – If method should be rerun even if output file exists
dock_dir

str – DOCK folder

dockprep(force_rerun=False)[source]

Prepare a PDB file for docking by first converting it to mol2 format.

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
grid(force_rerun=False)[source]

Create the scoring grid within the dummy box.

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
protein_only_and_noH(keep_ligands=None, force_rerun=False)[source]

Isolate the receptor by stripping everything except protein and specified ligands.

Parameters:
  • keep_ligands (str, list) – Ligand(s) to keep in PDB file
  • force_rerun (bool) – If method should be rerun even if output file exists
root_dir

str – Directory where DOCK project folder is located

showbox(force_rerun=False)[source]

Create the dummy PDB box around the selected spheres.

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
sphere_selector_using_residues(radius, force_rerun=False)[source]

Select spheres based on binding site residues

Parameters:
  • radius (int, float) – Radius around binding residues to dock to
  • force_rerun (bool) – If method should be rerun even if output file exists
sphgen(force_rerun=False)[source]

Create sphere representation (sph file) of receptor from the surface representation

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
ssbio.protein.structure.utils.dock.parse_results_mol2(mol2_outpath)[source]

Parse a DOCK6 mol2 output file, return a Pandas DataFrame of the results.

Parameters:mol2_outpath (str) – Path to mol2 output file
Returns:Pandas DataFrame of the results
Return type:DataFrame

ssbio.protein.structure.properties

Structure Residues

ssbio.protein.structure.properties.residues.distance_to_site(residue_of_interest, residues, model)[source]

Calculate the distance between an amino acid and a group of amino acids.

Parameters:
  • residue_of_interest – Residue number you are interested in (ie. a mutation)
  • residues – List of residue numbers
Returns:

Distance (in Angstroms) to the group of residues

Return type:

float

ssbio.protein.structure.properties.residues.get_structure_seqrecords(model)[source]

Get a dictionary of a PDB file’s sequences.

Special cases include:
  • Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
  • HETATMs. Currently written as an “X”, or unknown amino acid.
Parameters:model – Biopython Model object of a Structure
Returns:List of SeqRecords
Return type:list
ssbio.protein.structure.properties.residues.get_structure_seqs(pdb_file, file_type)[source]

Get a dictionary of a PDB file’s sequences.

Special cases include:
  • Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
  • HETATMs. Currently written as an “X”, or unknown amino acid.
Parameters:pdb_file – Path to PDB file
Returns:Dictionary of: {chain_id: sequence}
Return type:dict
ssbio.protein.structure.properties.residues.hse_output(pdb_file, file_type)[source]

The solvent exposure of an amino acid residue is important for analyzing, understanding and predicting aspects of protein structure and function [73]. A residue’s solvent exposure can be classified as four categories: exposed, partly exposed, buried and deeply buried residues. Hamelryck et al. [73] established a new 2D measure that provides a different view of solvent exposure, i.e. half-sphere exposure (HSE). By conceptually dividing the sphere of a residue into two halves- HSE-up and HSE-down, HSE provides a more detailed description of an amino acid residue’s spatial neighborhood. HSE is calculated by the hsexpo module implemented in the BioPython package [74] from a PDB file.

http://onlinelibrary.wiley.com/doi/10.1002/prot.20379/abstract

Parameters:pdb_file

Returns:

ssbio.protein.structure.properties.residues.match_structure_sequence(orig_seq, new_seq, match='X', fill_with='X', ignore_excess=False)[source]

Correct a sequence to match inserted X’s in a structure sequence

This is useful for mapping a sequence obtained from structural tools like MSMS or DSSP
to the sequence obtained by the get_structure_seqs method.

Examples

>>> structure_seq = 'XXXABCDEF'
>>> prop_list = [4, 5, 6, 7, 8, 9]
>>> match_structure_sequence(structure_seq, prop_list)
['X', 'X', 'X', 4, 5, 6, 7, 8, 9]
>>> match_structure_sequence(structure_seq, prop_list, fill_with=float('Inf'))
[inf, inf, inf, 4, 5, 6, 7, 8, 9]
>>> structure_seq = '---ABCDEF---'
>>> prop_list = ('H','H','H','C','C','C')
>>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-')
('-', '-', '-', 'H', 'H', 'H', 'C', 'C', 'C', '-', '-', '-')
>>> structure_seq = 'ABCDEF---'
>>> prop_list = 'HHHCCC'
>>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-')
'HHHCCC---'
>>> structure_seq = 'AXBXCXDXEXF'
>>> prop_list = ['H', 'H', 'H', 'C', 'C', 'C']
>>> match_structure_sequence(structure_seq, prop_list, match='X', fill_with='X')
['H', 'X', 'H', 'X', 'H', 'X', 'C', 'X', 'C', 'X', 'C']
Parameters:
  • orig_seq (str, Seq, SeqRecord) – Sequence to match to
  • new_seq (str, tuple, list) – Sequence to fill in
  • match (str) – What to match
  • fill_with – What to fill in when matches are found
  • ignore_excess (bool) – If excess sequence on the tail end of new_seq should be ignored
Returns:

new_seq which will match the length of orig_seq

Return type:

str, tuple, list

ssbio.protein.structure.properties.residues.resname_in_proximity(resname, model, chains, resnums, threshold=5)[source]

Search within the proximity of a defined list of residue numbers and their chains for any specifed residue name.

Parameters:
  • resname (str) – Residue name to search for in proximity of specified chains + resnums
  • model – Biopython Model object
  • chains (str, list) – Chain ID or IDs to check
  • resnums (int, list) – Residue numbers within the chain to check
  • threshold (float) – Cutoff in Angstroms for returning True if a RESNAME is near
Returns:

True if a RESNAME is within the threshold cutoff

Return type:

bool

ssbio.protein.structure.properties.residues.search_ss_bonds(model, threshold=3.0)[source]

Searches S-S bonds based on distances between atoms in the structure (first model only). Average distance is 2.05A. Threshold is 3A default. Returns iterator with tuples of residues.

ADAPTED FROM JOAO RODRIGUES’ BIOPYTHON GSOC PROJECT (http://biopython.org/wiki/GSOC2010_Joao)

ssbio.protein.structure.properties.residues.site_centroid(residues, model)[source]

Get the XYZ coordinate of the center of a list of residues.

Parameters:
  • residues – List of residue numbers
  • pdb_file – Path to PDB file
Returns:

(X, Y, Z) coordinate of centroid

Return type:

tuple

ssbio.protein.sequence.utils

Sequence Alignment

ssbio.protein.sequence.utils.alignment.get_alignment_df(a_aln_seq, b_aln_seq, a_seq_id=None, b_seq_id=None)[source]

Summarize two alignment strings in a dataframe.

Parameters:
  • a_aln_seq (str) – Aligned sequence string
  • b_aln_seq (str) – Aligned sequence string
  • a_seq_id (str) – Optional ID of a_seq
  • b_seq_id (str) – Optional ID of b_aln_seq
Returns:

a per-residue level annotation of the alignment

Return type:

DataFrame

ssbio.protein.sequence.utils.alignment.get_alignment_df_from_file(alignment_file, a_seq_id=None, b_seq_id=None)[source]

Get a Pandas DataFrame of the Needle alignment results. Contains all positions of the sequences.

Parameters:
  • alignment_file
  • a_seq_id – Optional specification of the ID of the reference sequence
  • b_seq_id – Optional specification of the ID of the aligned sequence
Returns:

all positions in the alignment

Return type:

Pandas DataFrame

ssbio.protein.sequence.utils.alignment.get_deletions(aln_df)[source]

Get a list of tuples indicating the first and last residues of a deletion region, as well as the length of the deletion.

Examples

# Deletion of residues 1 to 4, length 4 >>> test = {‘id_a’: {0: ‘a’, 1: ‘a’, 2: ‘a’, 3: ‘a’}, ‘id_a_aa’: {0: ‘M’, 1: ‘G’, 2: ‘I’, 3: ‘T’}, ‘id_a_pos’: {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}, ‘id_b’: {0: ‘b’, 1: ‘b’, 2: ‘b’, 3: ‘b’}, ‘id_b_aa’: {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan}, ‘id_b_pos’: {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan}, ‘type’: {0: ‘deletion’, 1: ‘deletion’, 2: ‘deletion’, 3: ‘deletion’}} >>> my_alignment = pd.DataFrame.from_dict(test) >>> get_deletions(my_alignment) [((1.0, 4.0), 4)]

Parameters:aln_df (DataFrame) – Alignment DataFrame
Returns:A list of tuples with the format ((deletion_start_resnum, deletion_end_resnum), deletion_length)
Return type:list
ssbio.protein.sequence.utils.alignment.get_insertions(aln_df)[source]

Get a list of tuples indicating the first and last residues of a insertion region, as well as the length of the insertion.

If the first tuple is:
(-1, 1) that means the insertion is at the beginning of the original protein (X, Inf) where X is the length of the original protein, that means the insertion is at the end of the protein

Examples

# Insertion at beginning, length 3 >>> test = {‘id_a’: {0: ‘a’, 1: ‘a’, 2: ‘a’, 3: ‘a’}, ‘id_a_aa’: {0: np.nan, 1: np.nan, 2: np.nan, 3: ‘M’}, ‘id_a_pos’: {0: np.nan, 1: np.nan, 2: np.nan, 3: 1.0}, ‘id_b’: {0: ‘b’, 1: ‘b’, 2: ‘b’, 3: ‘b’}, ‘id_b_aa’: {0: ‘M’, 1: ‘M’, 2: ‘L’, 3: ‘M’}, ‘id_b_pos’: {0: 1, 1: 2, 2: 3, 3: 4}, ‘type’: {0: ‘insertion’, 1: ‘insertion’, 2: ‘insertion’, 3: ‘match’}} >>> my_alignment = pd.DataFrame.from_dict(test) >>> get_insertions(my_alignment) [((-1, 1.0), 3)]

Parameters:aln_df (DataFrame) – Alignment DataFrame
Returns:A list of tuples with the format ((insertion_start_resnum, insertion_end_resnum), insertion_length)
Return type:list
ssbio.protein.sequence.utils.alignment.get_mutations(aln_df)[source]

Get a list of residue numbers (in the original sequence’s numbering) that are mutated

Parameters:
  • aln_df (DataFrame) – Alignment DataFrame
  • just_resnums – If only the residue numbers should be returned, instead of a list of tuples of (original_residue, resnum, mutated_residue)
Returns:

Residue mutations

Return type:

list

ssbio.protein.sequence.utils.alignment.get_percent_identity(a_aln_seq, b_aln_seq)[source]

Get the percent identity between two alignment strings

ssbio.protein.sequence.utils.alignment.get_unresolved(aln_df)[source]

Get a list of residue numbers (in the original sequence’s numbering) that are unresolved

Parameters:aln_df (DataFrame) – Alignment DataFrame
Returns:Residue numbers that are mutated
Return type:list
ssbio.protein.sequence.utils.alignment.map_resnum_a_to_resnum_b(a_resnum, a_aln, b_aln)[source]

Map a residue number in a sequence to the corresponding residue number in an aligned sequence.

Examples: >>> map_resnum_a_to_resnum_b(5, ‘–ABCDEF’, ‘XXABCDEF’) 7

Parameters:
  • a_resnum (int) – Residue number in the first aligned sequence
  • a_aln (str, Seq, SeqRecord) – Aligned sequence string
  • b_aln (str, Seq, SeqRecord) – Aligned sequence string
Returns:

Residue number in the second aligned sequence

Return type:

int

ssbio.protein.sequence.utils.alignment.needle_statistics(infile)[source]

Reads in a needle alignment file and spits out statistics of the alignment.

Parameters:infile (str) – Alignment file name
Returns:alignment_properties - a dictionary telling you the number of gaps, identity, etc.
Return type:dict
ssbio.protein.sequence.utils.alignment.pairwise_sequence_alignment(a_seq, b_seq, engine, a_seq_id=None, b_seq_id=None, gapopen=10, gapextend=0.5, outfile=None, outdir=None, force_rerun=False)[source]

Run a global pairwise sequence alignment between two sequence strings.

Parameters:
  • a_seq (str, Seq, SeqRecord, SeqProp) – Reference sequence
  • b_seq (str, Seq, SeqRecord, SeqProp) – Sequence to be aligned to reference
  • engine (str) – biopython or needle - which pairwise alignment program to use
  • a_seq_id (str) – Reference sequence ID. If not set, is “a_seq”
  • b_seq_id (str) – Sequence to be aligned ID. If not set, is “b_seq”
  • gapopen (int) – Only for needle - Gap open penalty is the score taken away when a gap is created
  • gapextend (float) – Only for needle - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outfile (str) – Only for needle - name of output file. If not set, is {id_a}_{id_b}_align.txt
  • outdir (str) – Only for needle - Path to output directory. Default is the current directory.
  • force_rerun (bool) – Only for needle - Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:

Biopython object to represent an alignment

Return type:

MultipleSeqAlignment

ssbio.protein.sequence.utils.alignment.run_needle_alignment(seq_a, seq_b, gapopen=10, gapextend=0.5, outdir=None, outfile=None, force_rerun=False)[source]

Run the needle alignment program for two strings and return the raw alignment result.

More info: EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84 Using strings as input: https://www.biostars.org/p/91124/

Parameters:
  • id_a – ID of reference sequence
  • seq_a (str, Seq, SeqRecord) – Reference sequence
  • id_b – ID of sequence to be aligned
  • seq_b (str, Seq, SeqRecord) – String representation of sequence to be aligned
  • gapopen – Gap open penalty is the score taken away when a gap is created
  • gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outdir (str, optional) – Path to output directory. Default is the current directory.
  • outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt
  • force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:

Raw alignment result of the needle alignment in srspair format.

Return type:

str

ssbio.protein.sequence.utils.alignment.run_needle_alignment_on_files(id_a, faa_a, id_b, faa_b, gapopen=10, gapextend=0.5, outdir='', outfile='', force_rerun=False)[source]

Run the needle alignment program for two fasta files and return the raw alignment result.

More info: EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84

Parameters:
  • id_a – ID of reference sequence
  • faa_a – File path to reference sequence
  • id_b – ID of sequence to be aligned
  • faa_b – File path to sequence to be aligned
  • gapopen – Gap open penalty is the score taken away when a gap is created
  • gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outdir (str, optional) – Path to output directory. Default is the current directory.
  • outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt
  • force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:

Raw alignment result of the needle alignment in srspair format.

Return type:

str

Sequence BLAST

ssbio.protein.sequence.utils.blast.calculate_bbh(blast_results_1, blast_results_2, r_name=None, g_name=None, outdir='')[source]

Calculate the best bidirectional BLAST hits (BBH) and save a dataframe of results.

Parameters:
  • blast_results_1 (str) – BLAST results for reference vs. other genome
  • blast_results_2 (str) – BLAST results for other vs. reference genome
  • r_name – Name of reference genome
  • g_name – Name of other genome
  • outdir – Directory where BLAST results are stored.
Returns:

Path to Pandas DataFrame of the BBH results.

ssbio.protein.sequence.utils.blast.create_orthology_matrix(r_name, genome_to_bbh_files, pid_cutoff=None, bitscore_cutoff=None, evalue_cutoff=None, filter_condition='OR', outname='', outdir='', force_rerun=False)[source]

Create an orthology matrix using best bidirectional BLAST hits (BBH) outputs.

Parameters:
  • r_name (str) – Name of the reference genome
  • genome_to_bbh_files (dict) – Mapping of genome names to the BBH csv output from the calculate_bbh() method
  • pid_cutoff (float) – Minimum percent identity between BLAST hits to filter for in the range [0, 100]
  • bitscore_cutoff (float) – Minimum bitscore allowed between BLAST hits
  • evalue_cutoff (float) – Maximum E-value allowed between BLAST hits
  • filter_condition (str) – ‘OR’ or ‘AND’, how to combine cutoff filters. ‘OR’ gives more results since it is less stringent, as you will be filtering for hits with (>80% PID or >30 bitscore or <0.0001 evalue).
  • outname – Name of output file of orthology matrix
  • outdir – Path to output directory
  • force_rerun (bool) – Force recreation of the orthology matrix even if the outfile exists
Returns:

Path to orthologous genes matrix.

Return type:

str

ssbio.protein.sequence.utils.blast.print_run_bidirectional_blast(reference, other_genome, dbtype, outdir)[source]

Write torque submission files for running bidirectional blast on a server and print execution command.

Parameters:
  • reference (str) – Path to “reference” genome, aka your “base strain”
  • other_genome (str) – Path to other genome which will be BLASTed to the reference
  • dbtype (str) – “nucl” or “prot” - what format your genome files are in
  • outdir (str) – Path to folder where Torque scripts should be placed
ssbio.protein.sequence.utils.blast.run_bidirectional_blast(reference, other_genome, dbtype, outdir='')[source]

BLAST a genome against another, and vice versa.

This function requires BLAST to be installed, do so by running: sudo apt install ncbi-blast+

Parameters:
  • reference (str) – path to “reference” genome, aka your “base strain”
  • other_genome (str) – path to other genome which will be BLASTed to the reference
  • dbtype (str) – “nucl” or “prot” - what format your genome files are in
  • outdir (str) – path to folder where BLAST outputs should be placed
Returns:

Paths to BLAST output files. (reference_vs_othergenome.out, othergenome_vs_reference.out)

ssbio.protein.sequence.utils.blast.run_makeblastdb(infile, dbtype, outdir='')[source]

Make the BLAST database for a genome file.

Parameters:
  • infile (str) – path to genome FASTA file
  • dbtype (str) – “nucl” or “prot” - what format your genome files are in
  • outdir (str) – path to directory to output database files (default is original folder)
Returns:

Paths to BLAST databases.

ssbio.protein.sequence.properties

Thermostability

This module provides functions to predict thermostability parameters (specifically the free energy of unfolding dG)
of an amino acid sequence.

These methods are adapted from:

Oobatake, M., & Ooi, T. (1993). ‘Hydration and heat stability effects on protein unfolding’,
Progress in biophysics and molecular biology, 59/3: 237–84.
Dill, K. A., Ghosh, K., & Schmit, J. D. (2011). ‘Physical limits of cells and proteomes’,
Proceedings of the National Academy of Sciences of the United States of America, 108/44: 17876–82. DOI: 10.1073/pnas.1114477108

For an example of usage of these parameters in a genome-scale model:

Chen, K., Gao, Y., Mih, N., O’Brien, E., Yang, L., Palsson, B.O. (2017).
‘Thermo-sensitivity of growth is determined by chaperone-mediated proteome re-allocation.’, Submitted to PNAS.
ssbio.protein.sequence.properties.thermostability.calculate_dill_dG(seq_len, temp)[source]

Get free energy of unfolding (dG) using Dill method in units J/mol.

Parameters:
  • seq_len (int) – Length of amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

Free energy of unfolding dG (J/mol)

Return type:

float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dG(seq, temp)[source]

Get free energy of unfolding (dG) using Oobatake method in units cal/mol.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

Free energy of unfolding dG (J/mol)

Return type:

float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dH(seq, temp)[source]

Get dH using Oobatake method in units cal/mol.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

dH in units cal/mol

Return type:

float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dS(seq, temp)[source]

Get dS using Oobatake method in units cal/mol.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

dS in units cal/mol

Return type:

float

ssbio.protein.sequence.properties.thermostability.get_dG_at_T(seq, temp)[source]

Predict dG at temperature T, using best predictions from Dill or Oobatake methods.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

tuple containing:

dG (float) Free energy of unfolding dG (cal/mol) keq (float): Equilibrium constant Keq method (str): Method used to calculate

Return type:

(tuple)