Python API

Information on select functions, classes, or methods.

ssbio.databases

PDBProp

class ssbio.databases.pdb.PDBProp(ident, description=None, chains=None, mapped_chains=None, structure_path=None, file_type=None)[source]

Store information about a protein structure from the Protein Data Bank.

Extends the StructProp class to allow initialization of the structure by its PDB ID, and then enabling downloads of the structure file as well as parsing its metadata.

ssbio.databases.pdb.best_structures(uniprot_id, outname=None, outdir=None, seq_ident_cutoff=0.0, force_rerun=False)[source]

Use the PDBe REST service to query for the best PDB structures for a UniProt ID.

More information found here: https://www.ebi.ac.uk/pdbe/api/doc/sifts.html Link used to retrieve results: https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/:accession The list of PDB structures mapping to a UniProt accession sorted by coverage of the protein and, if the same, resolution.

Here is the ranking algorithm described by the PDB paper: https://nar.oxfordjournals.org/content/44/D1/D385.full

“Finally, a single quality indicator is also calculated for each entry by taking the harmonic average of all the percentile scores representing model and model-data-fit quality measures and then subtracting 10 times the numerical value of the resolution (in Angstrom) of the entry to ensure that resolution plays a role in characterising the quality of a structure. This single empirical ‘quality measure’ value is used by the PDBe query system to sort results and identify the ‘best’ structure in a given context. At present, entries determined by methods other than X-ray crystallography do not have similar data quality information available and are not considered as ‘best structures’.”

Parameters:
  • uniprot_id (str) – UniProt Accession ID
  • outname (str) – Basename of the output file of JSON results
  • outdir (str) – Path to output directory of JSON results
  • seq_ident_cutoff (float) – Cutoff results based on percent coverage (in decimal form)
  • force_rerun (bool) – Obtain best structures mapping ignoring previously downloaded results
Returns:

Rank-ordered list of dictionaries representing chain-specific PDB entries. Keys are:
  • pdb_id: the PDB ID which maps to the UniProt ID
  • chain_id: the specific chain of the PDB which maps to the UniProt ID
  • coverage: the percent coverage of the entire UniProt sequence
  • resolution: the resolution of the structure
  • start: the structure residue number which maps to the start of the mapped sequence
  • end: the structure residue number which maps to the end of the mapped sequence
  • unp_start: the sequence residue number which maps to the structure start
  • unp_end: the sequence residue number which maps to the structure end
  • experimental_method: type of experiment used to determine structure
  • tax_id: taxonomic ID of the protein’s original organism

Return type:

list

ssbio.databases.pdb.blast_pdb(seq, outfile='', outdir='', evalue=0.0001, seq_ident_cutoff=0.0, link=False, force_rerun=False)[source]

Returns a list of BLAST hits of a sequence to available structures in the PDB.

Parameters:
  • seq (str) – Your sequence, in string format
  • outfile (str) – Name of output file
  • outdir (str, optional) – Path to output directory. Default is the current directory.
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
  • link (bool, optional) – Set to True if a link to the HTML results should be displayed
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
Returns:

Rank ordered list of BLAST hits in dictionaries.

Return type:

list

ssbio.databases.pdb.blast_pdb_df(blast_results)[source]

Make a dataframe of BLAST results

ssbio.databases.pdb.download_biological_assemblies(pdb_id, outdir)[source]

Downloads biological assembly file from: ftp://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/divided/

Parameters:outdir (str) – Output directory of the decompressed assembly
ssbio.databases.pdb.download_sifts_xml(pdb_id, outdir='', outfile='')[source]

Download the SIFTS file for a PDB ID.

Parameters:
  • pdb_id
  • outdir
  • outfile

Returns:

ssbio.databases.pdb.download_structure(pdb_id, file_type, outdir='', outfile='', only_header=False, force_rerun=False)[source]

Download a structure from the RCSB PDB by ID. Specify the file type desired.

Parameters:
  • pdb_id – PDB ID
  • file_type – pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz
  • outdir – Optional output directory
  • outfile – Optional output name
  • only_header – If only the header file should be downloaded
  • force_rerun – If the file should be downloaded again even if it exists
Returns:

Path to outfile

Return type:

str

ssbio.databases.pdb.get_release_date(pdb_id)[source]

Quick way to get the release date of a PDB ID using the table of results from the REST service

Returns None if the release date is not available.

Returns:Organism of a PDB ID
Return type:str
ssbio.databases.pdb.get_resolution(pdb_id)[source]

Quick way to get the resolution of a PDB ID using the table of results from the REST service

Returns infinity if the resolution is not available.

Returns:resolution of a PDB ID in Angstroms
Return type:float

Todo

  • Unit test
ssbio.databases.pdb.map_uniprot_resnum_to_pdb(uniprot_resnum, chain_id, sifts_file)[source]

Map a UniProt residue number to its corresponding PDB residue number.

This function requires that the SIFTS file be downloaded, and also a chain ID (as different chains may have different mappings).

Parameters:
  • uniprot_resnum (int) – integer of the residue number you’d like to map
  • chain_id (str) – string of the PDB chain to map to
  • sifts_file (str) – Path to the SIFTS XML file
Returns:

tuple containing:

mapped_resnum (int): Mapped residue number is_observed (bool): Indicates if the 3D structure actually shows the residue

Return type:

(tuple)

ssbio.databases.pdb.parse_mmcif_header(infile)[source]

Parse a couple important fields from the mmCIF file format with some manual curation of ligands.

If you want full access to the mmCIF file just use the MMCIF2Dict class in Biopython.

Parameters:infile – Path to mmCIF file
Returns:Dictionary of parsed header
Return type:dict
ssbio.databases.pdb.parse_pdb_header(infile)[source]

Parse a couple important fields from the mmCIF file format with some manual curation of ligands.

If you want full access to the mmCIF file just use the MMCIF2Dict class in Biopython.

Parameters:infile – Path to mmCIF file
Returns:Dictionary of parsed header
Return type:dict

PISA

ssbio.databases.pisa.download_pisa_multimers_xml(pdb_ids, save_single_xml_files=True, outdir=None, force_rerun=False)[source]

Download the PISA XML file for multimers.

See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info

XML description of macromolecular assemblies:
http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contain XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.
Parameters:
  • pdb_ids (str, list) – PDB ID or list of IDs
  • save_single_xml_files (bool) – If single XML files should be saved per PDB ID. If False, if multiple PDB IDs are provided, then a single, combined XML output file is downloaded
  • outdir (str) – Directory to output PISA XML files
  • force_rerun (bool) – Redownload files if they already exist
Returns:

of files downloaded

Return type:

list

ssbio.databases.pisa.parse_pisa_multimers_xml(pisa_multimers_xml, download_structures=False, outdir=None, force_rerun=False)[source]

Retrieve PISA information from an XML results file

See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info

XML description of macromolecular assemblies:
http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contain XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.
Parameters:
  • pisa_multimers_xml (str) – Path to PISA XML output file
  • download_structures (bool) – If assembly files should be downloaded
  • outdir (str) – Directory to output assembly files
  • force_rerun (bool) – Redownload files if they already exist
Returns:

of parsed PISA information

Return type:

dict

ssbio.databases.pisa.pdb_chain_stoichiometry_biomolone(pdbid)[source]

Get the stoichiometry of the chains in biological assembly 1 as a dictionary.

Steps taken are: 1) Download PDB and parse header, make biomolecule if provided 2) Count how many times each chain appears in biomolecule #1 3) Convert chain id to uniprot id 4) Return final dictionary

Parameters:pdbid (str) – 4 character PDB ID
Returns:{(ChainID,UniProtID): # occurences}
Return type:dict

SWISSMODEL

class ssbio.databases.swissmodel.SWISSMODEL(metadata_dir)[source]

Methods to parse through a SWISS-MODEL metadata set.

Download a particular organism’s metadata from SWISS-MODEL here: https://swissmodel.expasy.org/repository

Parameters:metadata_dir (str) – Path to the extracted SWISS-MODEL_Repository folder
all_models = None

dict – Dictionary of lists, UniProt ID as the keys

download_models(uniprot_acc, outdir='', force_rerun=False)[source]

Download all models available for a UniProt accession number.

Parameters:
  • uniprot_acc (str) – UniProt ACC/ID
  • outdir (str) – Path to output directory, uses working directory if not set
  • force_rerun (bool) – Force a redownload the models if they already exist
Returns:

Paths to the downloaded models

Return type:

list

get_models(uniprot_acc)[source]

Return all available models for a UniProt accession number.

Parameters:uniprot_acc (str) – UniProt ACC/ID
Returns:All available models in SWISS-MODEL for this UniProt entry
Return type:dict
metadata_dir = None

str – Path to the extracted SWISS-MODEL_Repository folder

metadata_index_json

str – Path to the INDEX_JSON file.

parse_metadata()[source]

Parse the INDEX_JSON file and reorganize it as a dictionary of lists.

uniprots_modeled

list – Return all UniProt accession numbers with at least one model

UniProtProp

class ssbio.databases.uniprot.UniProtProp(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, xml_path=None, gff_path=None)[source]

Generic class to store information on a UniProt entry, extended from a SeqProp object.

The main utilities of this class are to:

  1. Download and/or parse UniProt text or xml files
  2. Store extra parsed information in attributes
uniprot

str – Main UniProt accession code

alt_uniprots

list – Alternate accession codes that point to the main one

file_type

str – Metadata file type

reviewed

bool – If this entry is a “reviewed” entry. If None, then status is unknown.

ec_number

str – EC number

pfam

list – PFAM IDs

entry_version

str – Date of last update of the UniProt entry

seq_version

str – Date of last update of the UniProt sequence

download_metadata_file(outdir, force_rerun=False)[source]

Download and load the UniProt XML file

download_seq_file(outdir, force_rerun=False)[source]

Download and load the UniProt FASTA file

features

list – Get the features from the feature file, metadata file, or in memory

ranking_score()[source]

Provide a score for this UniProt ID based on reviewed (True=1, False=0) + number of PDBs

Returns:Scoring for this ID
Return type:int
seq

Seq – Get the Seq object from the sequence file, metadata file, or in memory

ssbio.databases.uniprot.blast_uniprot(seq_str, seq_ident=1, evalue=0.0001, reviewed_only=True, organism=None)[source]

BLAST the UniProt db to find what IDs match the sequence input

Parameters:
  • seq_str – Sequence string
  • seq_ident – Percent identity to match
  • evalue – E-value of BLAST hit

Returns:

ssbio.databases.uniprot.download_uniprot_file(uniprot_id, filetype, outdir='', force_rerun=False)[source]

Download a UniProt file for a UniProt ID/ACC

Parameters:
  • uniprot_id – Valid UniProt ID
  • filetype – txt, fasta, xml, rdf, or gff
  • outdir – Directory to download the file
Returns:

Absolute path to file

Return type:

str

ssbio.databases.uniprot.get_fasta(uniprot_id)[source]

Get the protein sequence for a UniProt ID as a string.

Parameters:uniprot_id – Valid UniProt ID
Returns:String of the protein (amino acid) sequence
Return type:str
ssbio.databases.uniprot.is_valid_uniprot_id(instring)[source]

Check if a string is a valid UniProt ID.

See regex from: http://www.uniprot.org/help/accession_numbers

Parameters:instring – any string identifier

Returns: True if the string is a valid UniProt ID

ssbio.databases.uniprot.old_parse_uniprot_txt_file(infile)[source]

From: boscoh/uniprot github Parses the text of metadata retrieved from uniprot.org.

Only a few fields have been parsed, but this provides a template for the other fields.

A single description is generated from joining alternative descriptions.

Returns a dictionary with the main UNIPROT ACC as keys.

ssbio.databases.uniprot.parse_uniprot_txt_file(infile)[source]

Parse a raw UniProt metadata file and return a dictionary.

Parameters:infile – Path to metadata file
Returns:Metadata dictionary
Return type:dict
ssbio.databases.uniprot.parse_uniprot_xml_metadata(sr)[source]

Load relevant attributes and dbxrefs from a parsed UniProt XML file in a SeqRecord.

Returns:All parsed information
Return type:dict
ssbio.databases.uniprot.uniprot_ec(uniprot_id)[source]

Retrieve the EC number annotation for a UniProt ID.

Parameters:uniprot_id – Valid UniProt ID

Returns:

ssbio.databases.uniprot.uniprot_reviewed_checker(uniprot_id)[source]

Check if a single UniProt ID is reviewed or not.

Parameters:uniprot_id
Returns:If the entry is reviewed
Return type:bool
ssbio.databases.uniprot.uniprot_reviewed_checker_batch(uniprot_ids)[source]

Batch check if uniprot IDs are reviewed or not

Parameters:uniprot_ids – UniProt ID or list of UniProt IDs
Returns:Boolean}
Return type:A dictionary of {UniProtID
ssbio.databases.uniprot.uniprot_sites(uniprot_id)[source]

Retrieve a list of UniProt sites parsed from the feature file

Sites are defined here: http://www.uniprot.org/help/site and here: http://www.uniprot.org/help/function_section

Parameters:uniprot_id – Valid UniProt ID

Returns:

KEGGProp

ssbio.databases.kegg.download_kegg_aa_seq(gene_id, outdir=None, force_rerun=False)[source]

Download a FASTA sequence of a protein from the KEGG database and return the path.

Parameters:
  • gene_id – the gene identifier
  • outdir – optional path to output directory
Returns:

Path to FASTA file

ssbio.databases.kegg.download_kegg_gene_metadata(gene_id, outdir=None, force_rerun=False)[source]

Download the KEGG flatfile for a KEGG ID and return the path.

Parameters:
  • gene_id – KEGG gene ID (with organism code), i.e. “eco:1244”
  • outdir – optional output directory of metadata
Returns:

Path to metadata file

ssbio.databases.kegg.map_kegg_all_genes(organism_code, target_db)[source]

Map all of an organism’s gene IDs to the target database.

This is faster than supplying a specific list of genes to map, plus there seems to be a limit on the number you can map with a manual REST query anyway.

Parameters:
  • organism_code – the three letter KEGG code of your organism
  • target_db – ncbi-proteinid | ncbi-geneid | uniprot
Returns:

Dictionary of ID mapping

ssbio.databases.kegg.parse_kegg_gene_metadata(infile)[source]

Parse the KEGG flatfile and return a dictionary of metadata.

Dictionary keys are:
refseq uniprot pdbs taxonomy
Parameters:infile – Path to KEGG flatfile
Returns:Dictionary of metadata
Return type:dict

ssbio.protein.structure.utils

CleanPDB

ssbio.protein.structure.utils.cleanpdb.clean_pdb(pdb_file, out_suffix='_clean', outdir=None, force_rerun=False, remove_atom_alt=True, keep_atom_alt_id='A', remove_atom_hydrogen=True, add_atom_occ=True, remove_res_hetero=True, keep_chemicals=None, keep_res_only=None, add_chain_id_if_empty='X', keep_chains=None)[source]

Clean a PDB file.

Parameters:
  • pdb_file (str) – Path to input PDB file
  • out_suffix (str) – Suffix to append to original filename
  • outdir (str) – Path to output directory
  • force_rerun (bool) – If structure should be re-cleaned if a clean file exists already
  • remove_atom_alt (bool) – Remove alternate positions
  • keep_atom_alt_id (str) – If removing alternate positions, which alternate ID to keep
  • remove_atom_hydrogen (bool) – Remove hydrogen atoms
  • add_atom_occ (bool) – Add atom occupancy fields if not present
  • remove_res_hetero (bool) – Remove all HETATMs
  • keep_chemicals (str, list) – If removing HETATMs, keep specified chemical names
  • keep_res_only (str, list) – Keep ONLY specified resnames, deletes everything else!
  • add_chain_id_if_empty (str) – Add a chain ID if not present
  • keep_chains (str, list) – Keep only these chains
Returns:

Path to cleaned PDB file

Return type:

str

MutatePDB

DOCK

class ssbio.protein.structure.utils.dock.DOCK(structure_id, pdb_file, amb_file, flex1_file, flex2_file, root_dir=None)[source]

Class to prepare a structure file for docking with DOCK6.

Attributes:

auto_flexdock(binding_residues, radius, ligand_path=None, force_rerun=False)[source]

Run DOCK6 on a PDB file, given its binding residues and a radius around them.

Provide a path to a ligand to dock a ligand to it. If no ligand is provided, DOCK6 preparations will be run on that structure file.

Parameters:
  • binding_residues (str) – Comma separated string of residues (eg: ‘144,170,199’)
  • radius (int, float) – Radius around binding residues to dock to
  • ligand_path (str) – Path to ligand (mol2 format) to dock to protein
  • force_rerun (bool) – If method should be rerun even if output files exist
binding_site_mol2(residues, force_rerun=False)[source]

Create mol2 of only binding site residues from the receptor

This function will take in a .pdb file (preferably the _receptor_noH.pdb file) and a string of residues (eg: ‘144,170,199’) and delete all other residues in the .pdb file. It then saves the coordinates of the selected residues as a .mol2 file. This is necessary for Chimera to select spheres within the radius of the binding site.

Parameters:
  • residues (str) – Comma separated string of residues (eg: ‘144,170,199’)
  • force_rerun (bool) – If method should be rerun even if output file exists
dms_maker(force_rerun=False)[source]

Create surface representation (dms file) of receptor

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
do_dock6_flexible(ligand_path, force_rerun=False)[source]

Dock a ligand to the protein.

Parameters:
  • ligand_path (str) – Path to ligand (mol2 format) to dock to protein
  • force_rerun (bool) – If method should be rerun even if output file exists
dock_dir

str – DOCK folder

dockprep(force_rerun=False)[source]

Prepare a PDB file for docking by first converting it to mol2 format.

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
grid(force_rerun=False)[source]

Create the scoring grid within the dummy box.

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
protein_only_and_noH(keep_ligands=None, force_rerun=False)[source]

Isolate the receptor by stripping everything except protein and specified ligands.

Parameters:
  • keep_ligands (str, list) – Ligand(s) to keep in PDB file
  • force_rerun (bool) – If method should be rerun even if output file exists
root_dir

str – Directory where DOCK project folder is located

showbox(force_rerun=False)[source]

Create the dummy PDB box around the selected spheres.

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
sphere_selector_using_residues(radius, force_rerun=False)[source]

Select spheres based on binding site residues

Parameters:
  • radius (int, float) – Radius around binding residues to dock to
  • force_rerun (bool) – If method should be rerun even if output file exists
sphgen(force_rerun=False)[source]

Create sphere representation (sph file) of receptor from the surface representation

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
ssbio.protein.structure.utils.dock.parse_results_mol2(mol2_outpath)[source]

Parse a DOCK6 mol2 output file, return a Pandas DataFrame of the results.

Parameters:mol2_outpath (str) – Path to mol2 output file
Returns:Pandas DataFrame of the results
Return type:DataFrame

ssbio.protein.structure.properties

Structure Residues

ssbio.protein.structure.properties.residues.distance_to_site(residue_of_interest, residues, model)[source]

Calculate the distance between an amino acid and a group of amino acids.

Parameters:
  • residue_of_interest – Residue number you are interested in (ie. a mutation)
  • residues – List of residue numbers
Returns:

Distance (in Angstroms) to the group of residues

Return type:

float

ssbio.protein.structure.properties.residues.get_structure_seqrecords(model)[source]

Get a dictionary of a PDB file’s sequences.

Special cases include:
  • Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
  • HETATMs. Currently written as an “X”, or unknown amino acid.
Parameters:model – Biopython Model object of a Structure
Returns:List of SeqRecords
Return type:list
ssbio.protein.structure.properties.residues.get_structure_seqs(pdb_file, file_type)[source]

Get a dictionary of a PDB file’s sequences.

Special cases include:
  • Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
  • HETATMs. Currently written as an “X”, or unknown amino acid.
Parameters:pdb_file – Path to PDB file
Returns:Dictionary of: {chain_id: sequence}
Return type:dict
ssbio.protein.structure.properties.residues.hse_output(pdb_file, file_type)[source]

The solvent exposure of an amino acid residue is important for analyzing, understanding and predicting aspects of protein structure and function [73]. A residue’s solvent exposure can be classified as four categories: exposed, partly exposed, buried and deeply buried residues. Hamelryck et al. [73] established a new 2D measure that provides a different view of solvent exposure, i.e. half-sphere exposure (HSE). By conceptually dividing the sphere of a residue into two halves- HSE-up and HSE-down, HSE provides a more detailed description of an amino acid residue’s spatial neighborhood. HSE is calculated by the hsexpo module implemented in the BioPython package [74] from a PDB file.

http://onlinelibrary.wiley.com/doi/10.1002/prot.20379/abstract

Parameters:pdb_file

Returns:

ssbio.protein.structure.properties.residues.match_structure_sequence(orig_seq, new_seq, match='X', fill_with='X', ignore_excess=False)[source]

Correct a sequence to match inserted X’s in a structure sequence

This is useful for mapping a sequence obtained from structural tools like MSMS or DSSP
to the sequence obtained by the get_structure_seqs method.

Examples

>>> structure_seq = 'XXXABCDEF'
>>> prop_list = [4, 5, 6, 7, 8, 9]
>>> match_structure_sequence(structure_seq, prop_list)
['X', 'X', 'X', 4, 5, 6, 7, 8, 9]
>>> match_structure_sequence(structure_seq, prop_list, fill_with=float('Inf'))
[inf, inf, inf, 4, 5, 6, 7, 8, 9]
>>> structure_seq = '---ABCDEF---'
>>> prop_list = ('H','H','H','C','C','C')
>>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-')
('-', '-', '-', 'H', 'H', 'H', 'C', 'C', 'C', '-', '-', '-')
>>> structure_seq = 'ABCDEF---'
>>> prop_list = 'HHHCCC'
>>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-')
'HHHCCC---'
>>> structure_seq = 'AXBXCXDXEXF'
>>> prop_list = ['H', 'H', 'H', 'C', 'C', 'C']
>>> match_structure_sequence(structure_seq, prop_list, match='X', fill_with='X')
['H', 'X', 'H', 'X', 'H', 'X', 'C', 'X', 'C', 'X', 'C']
Parameters:
  • orig_seq (str, Seq, SeqRecord) – Sequence to match to
  • new_seq (str, tuple, list) – Sequence to fill in
  • match (str) – What to match
  • fill_with – What to fill in when matches are found
  • ignore_excess (bool) – If excess sequence on the tail end of new_seq should be ignored
Returns:

new_seq which will match the length of orig_seq

Return type:

str, tuple, list

ssbio.protein.structure.properties.residues.resname_in_proximity(resname, model, chains, resnums, threshold=5)[source]

Search within the proximity of a defined list of residue numbers and their chains for any specifed residue name.

Parameters:
  • resname (str) – Residue name to search for in proximity of specified chains + resnums
  • model – Biopython Model object
  • chains (str, list) – Chain ID or IDs to check
  • resnums (int, list) – Residue numbers within the chain to check
  • threshold (float) – Cutoff in Angstroms for returning True if a RESNAME is near
Returns:

True if a RESNAME is within the threshold cutoff

Return type:

bool

ssbio.protein.structure.properties.residues.search_ss_bonds(model, threshold=3.0)[source]

Searches S-S bonds based on distances between atoms in the structure (first model only). Average distance is 2.05A. Threshold is 3A default. Returns iterator with tuples of residues.

ADAPTED FROM JOAO RODRIGUES’ BIOPYTHON GSOC PROJECT (http://biopython.org/wiki/GSOC2010_Joao)

ssbio.protein.structure.properties.residues.site_centroid(residues, model)[source]

Get the XYZ coordinate of the center of a list of residues.

Parameters:
  • residues – List of residue numbers
  • pdb_file – Path to PDB file
Returns:

(X, Y, Z) coordinate of centroid

Return type:

tuple

ssbio.protein.sequence.utils

Sequence Alignment

ssbio.protein.sequence.utils.alignment.get_alignment_df(a_aln_seq, b_aln_seq, a_seq_id=None, b_seq_id=None)[source]

Summarize two alignment strings in a dataframe.

Parameters:
  • a_aln_seq (str) – Aligned sequence string
  • b_aln_seq (str) – Aligned sequence string
  • a_seq_id (str) – Optional ID of a_seq
  • b_seq_id (str) – Optional ID of b_aln_seq
Returns:

a per-residue level annotation of the alignment

Return type:

DataFrame

ssbio.protein.sequence.utils.alignment.get_alignment_df_from_file(alignment_file, a_seq_id=None, b_seq_id=None)[source]

Get a Pandas DataFrame of the Needle alignment results. Contains all positions of the sequences.

Parameters:
  • alignment_file
  • a_seq_id – Optional specification of the ID of the reference sequence
  • b_seq_id – Optional specification of the ID of the aligned sequence
Returns:

all positions in the alignment

Return type:

Pandas DataFrame

ssbio.protein.sequence.utils.alignment.get_deletions(aln_df)[source]

Get a list of tuples indicating the first and last residues of a deletion region, as well as the length of the deletion.

Examples

# Deletion of residues 1 to 4, length 4 >>> test = {‘id_a’: {0: ‘a’, 1: ‘a’, 2: ‘a’, 3: ‘a’}, ‘id_a_aa’: {0: ‘M’, 1: ‘G’, 2: ‘I’, 3: ‘T’}, ‘id_a_pos’: {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}, ‘id_b’: {0: ‘b’, 1: ‘b’, 2: ‘b’, 3: ‘b’}, ‘id_b_aa’: {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan}, ‘id_b_pos’: {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan}, ‘type’: {0: ‘deletion’, 1: ‘deletion’, 2: ‘deletion’, 3: ‘deletion’}} >>> my_alignment = pd.DataFrame.from_dict(test) >>> get_deletions(my_alignment) [((1.0, 4.0), 4)]

Parameters:aln_df (DataFrame) – Alignment DataFrame
Returns:A list of tuples with the format ((deletion_start_resnum, deletion_end_resnum), deletion_length)
Return type:list
ssbio.protein.sequence.utils.alignment.get_insertions(aln_df)[source]

Get a list of tuples indicating the first and last residues of a insertion region, as well as the length of the insertion.

If the first tuple is:
(-1, 1) that means the insertion is at the beginning of the original protein (X, Inf) where X is the length of the original protein, that means the insertion is at the end of the protein

Examples

# Insertion at beginning, length 3 >>> test = {‘id_a’: {0: ‘a’, 1: ‘a’, 2: ‘a’, 3: ‘a’}, ‘id_a_aa’: {0: np.nan, 1: np.nan, 2: np.nan, 3: ‘M’}, ‘id_a_pos’: {0: np.nan, 1: np.nan, 2: np.nan, 3: 1.0}, ‘id_b’: {0: ‘b’, 1: ‘b’, 2: ‘b’, 3: ‘b’}, ‘id_b_aa’: {0: ‘M’, 1: ‘M’, 2: ‘L’, 3: ‘M’}, ‘id_b_pos’: {0: 1, 1: 2, 2: 3, 3: 4}, ‘type’: {0: ‘insertion’, 1: ‘insertion’, 2: ‘insertion’, 3: ‘match’}} >>> my_alignment = pd.DataFrame.from_dict(test) >>> get_insertions(my_alignment) [((-1, 1.0), 3)]

Parameters:aln_df (DataFrame) – Alignment DataFrame
Returns:A list of tuples with the format ((insertion_start_resnum, insertion_end_resnum), insertion_length)
Return type:list
ssbio.protein.sequence.utils.alignment.get_mutations(aln_df)[source]

Get a list of residue numbers (in the original sequence’s numbering) that are mutated

Parameters:
  • aln_df (DataFrame) – Alignment DataFrame
  • just_resnums – If only the residue numbers should be returned, instead of a list of tuples of (original_residue, resnum, mutated_residue)
Returns:

Residue mutations

Return type:

list

ssbio.protein.sequence.utils.alignment.get_percent_identity(a_aln_seq, b_aln_seq)[source]

Get the percent identity between two alignment strings

ssbio.protein.sequence.utils.alignment.get_unresolved(aln_df)[source]

Get a list of residue numbers (in the original sequence’s numbering) that are unresolved

Parameters:aln_df (DataFrame) – Alignment DataFrame
Returns:Residue numbers that are mutated
Return type:list
ssbio.protein.sequence.utils.alignment.map_resnum_a_to_resnum_b(a_resnum, a_aln, b_aln)[source]

Map a residue number in a sequence to the corresponding residue number in an aligned sequence.

Examples: >>> map_resnum_a_to_resnum_b(5, ‘–ABCDEF’, ‘XXABCDEF’) 7

Parameters:
  • a_resnum (int) – Residue number in the first aligned sequence
  • a_aln (str, Seq, SeqRecord) – Aligned sequence string
  • b_aln (str, Seq, SeqRecord) – Aligned sequence string
Returns:

Residue number in the second aligned sequence

Return type:

int

ssbio.protein.sequence.utils.alignment.needle_statistics(infile)[source]

Reads in a needle alignment file and spits out statistics of the alignment.

Parameters:infile (str) – Alignment file name
Returns:alignment_properties - a dictionary telling you the number of gaps, identity, etc.
Return type:dict
ssbio.protein.sequence.utils.alignment.pairwise_sequence_alignment(a_seq, b_seq, engine, a_seq_id=None, b_seq_id=None, gapopen=10, gapextend=0.5, outfile=None, outdir=None, force_rerun=False)[source]

Run a global pairwise sequence alignment between two sequence strings.

Parameters:
  • a_seq (str, Seq, SeqRecord, SeqProp) – Reference sequence
  • b_seq (str, Seq, SeqRecord, SeqProp) – Sequence to be aligned to reference
  • engine (str) – biopython or needle - which pairwise alignment program to use
  • a_seq_id (str) – Reference sequence ID. If not set, is “a_seq”
  • b_seq_id (str) – Sequence to be aligned ID. If not set, is “b_seq”
  • gapopen (int) – Only for needle - Gap open penalty is the score taken away when a gap is created
  • gapextend (float) – Only for needle - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outfile (str) – Only for needle - name of output file. If not set, is {id_a}_{id_b}_align.txt
  • outdir (str) – Only for needle - Path to output directory. Default is the current directory.
  • force_rerun (bool) – Only for needle - Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:

Biopython object to represent an alignment

Return type:

MultipleSeqAlignment

ssbio.protein.sequence.utils.alignment.run_needle_alignment(seq_a, seq_b, gapopen=10, gapextend=0.5, outdir=None, outfile=None, force_rerun=False)[source]

Run the needle alignment program for two strings and return the raw alignment result.

More info: EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84 Using strings as input: https://www.biostars.org/p/91124/

Parameters:
  • id_a – ID of reference sequence
  • seq_a (str, Seq, SeqRecord) – Reference sequence
  • id_b – ID of sequence to be aligned
  • seq_b (str, Seq, SeqRecord) – String representation of sequence to be aligned
  • gapopen – Gap open penalty is the score taken away when a gap is created
  • gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outdir (str, optional) – Path to output directory. Default is the current directory.
  • outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt
  • force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:

Raw alignment result of the needle alignment in srspair format.

Return type:

str

ssbio.protein.sequence.utils.alignment.run_needle_alignment_on_files(id_a, faa_a, id_b, faa_b, gapopen=10, gapextend=0.5, outdir='', outfile='', force_rerun=False)[source]

Run the needle alignment program for two fasta files and return the raw alignment result.

More info: EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84

Parameters:
  • id_a – ID of reference sequence
  • faa_a – File path to reference sequence
  • id_b – ID of sequence to be aligned
  • faa_b – File path to sequence to be aligned
  • gapopen – Gap open penalty is the score taken away when a gap is created
  • gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outdir (str, optional) – Path to output directory. Default is the current directory.
  • outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt
  • force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:

Raw alignment result of the needle alignment in srspair format.

Return type:

str

Sequence BLAST

ssbio.protein.sequence.utils.blast.calculate_bbh(blast_results_1, blast_results_2, r_name=None, g_name=None, outdir='')[source]

Calculate the best bidirectional BLAST hits (BBH) and save a dataframe of results.

Parameters:
  • blast_results_1 (str) – BLAST results for reference vs. other genome
  • blast_results_2 (str) – BLAST results for other vs. reference genome
  • r_name – Name of reference genome
  • g_name – Name of other genome
  • outdir – Directory where BLAST results are stored.
Returns:

Path to Pandas DataFrame of the BBH results.

ssbio.protein.sequence.utils.blast.create_orthology_matrix(r_name, genome_to_bbh_files, pid_cutoff=None, bitscore_cutoff=None, evalue_cutoff=None, filter_condition='OR', outname='', outdir='', force_rerun=False)[source]

Create an orthology matrix using best bidirectional BLAST hits (BBH) outputs.

Parameters:
  • r_name (str) – Name of the reference genome
  • genome_to_bbh_files (dict) – Mapping of genome names to the BBH csv output from the calculate_bbh() method
  • pid_cutoff (float) – Minimum percent identity between BLAST hits to filter for in the range [0, 100]
  • bitscore_cutoff (float) – Minimum bitscore allowed between BLAST hits
  • evalue_cutoff (float) – Maximum E-value allowed between BLAST hits
  • filter_condition (str) – ‘OR’ or ‘AND’, how to combine cutoff filters. ‘OR’ gives more results since it is less stringent, as you will be filtering for hits with (>80% PID or >30 bitscore or <0.0001 evalue).
  • outname – Name of output file of orthology matrix
  • outdir – Path to output directory
  • force_rerun (bool) – Force recreation of the orthology matrix even if the outfile exists
Returns:

Path to orthologous genes matrix.

Return type:

str

ssbio.protein.sequence.utils.blast.print_run_bidirectional_blast(reference, other_genome, dbtype, outdir)[source]

Write torque submission files for running bidirectional blast on a server and print execution command.

Parameters:
  • reference (str) – Path to “reference” genome, aka your “base strain”
  • other_genome (str) – Path to other genome which will be BLASTed to the reference
  • dbtype (str) – “nucl” or “prot” - what format your genome files are in
  • outdir (str) – Path to folder where Torque scripts should be placed
ssbio.protein.sequence.utils.blast.run_bidirectional_blast(reference, other_genome, dbtype, outdir='')[source]

BLAST a genome against another, and vice versa.

This function requires BLAST to be installed, do so by running: sudo apt install ncbi-blast+

Parameters:
  • reference (str) – path to “reference” genome, aka your “base strain”
  • other_genome (str) – path to other genome which will be BLASTed to the reference
  • dbtype (str) – “nucl” or “prot” - what format your genome files are in
  • outdir (str) – path to folder where BLAST outputs should be placed
Returns:

Paths to BLAST output files. (reference_vs_othergenome.out, othergenome_vs_reference.out)

ssbio.protein.sequence.utils.blast.run_makeblastdb(infile, dbtype, outdir='')[source]

Make the BLAST database for a genome file.

Parameters:
  • infile (str) – path to genome FASTA file
  • dbtype (str) – “nucl” or “prot” - what format your genome files are in
  • outdir (str) – path to directory to output database files (default is original folder)
Returns:

Paths to BLAST databases.

ssbio.protein.sequence.properties

Thermostability

This module provides functions to predict thermostability parameters (specifically the free energy of unfolding dG)
of an amino acid sequence.

These methods are adapted from:

Oobatake, M., & Ooi, T. (1993). ‘Hydration and heat stability effects on protein unfolding’,
Progress in biophysics and molecular biology, 59/3: 237–84.
Dill, K. A., Ghosh, K., & Schmit, J. D. (2011). ‘Physical limits of cells and proteomes’,
Proceedings of the National Academy of Sciences of the United States of America, 108/44: 17876–82. DOI: 10.1073/pnas.1114477108

For an example of usage of these parameters in a genome-scale model:

Chen, K., Gao, Y., Mih, N., O’Brien, E., Yang, L., Palsson, B.O. (2017).
‘Thermo-sensitivity of growth is determined by chaperone-mediated proteome re-allocation.’, Submitted to PNAS.
ssbio.protein.sequence.properties.thermostability.calculate_dill_dG(seq_len, temp)[source]

Get free energy of unfolding (dG) using Dill method in units J/mol.

Parameters:
  • seq_len (int) – Length of amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

Free energy of unfolding dG (J/mol)

Return type:

float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dG(seq, temp)[source]

Get free energy of unfolding (dG) using Oobatake method in units cal/mol.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

Free energy of unfolding dG (J/mol)

Return type:

float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dH(seq, temp)[source]

Get dH using Oobatake method in units cal/mol.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

dH in units cal/mol

Return type:

float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dS(seq, temp)[source]

Get dS using Oobatake method in units cal/mol.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

dS in units cal/mol

Return type:

float

ssbio.protein.sequence.properties.thermostability.get_dG_at_T(seq, temp)[source]

Predict dG at temperature T, using best predictions from Dill or Oobatake methods.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

tuple containing:

dG (float) Free energy of unfolding dG (cal/mol) keq (float): Equilibrium constant Keq method (str): Method used to calculate

Return type:

(tuple)