Python API¶

Information on select functions, classes, or methods.

`ssbio.pipeline.gempro`¶

GEMPRO¶

class ssbio.pipeline.gempro.GEMPRO(gem_name, root_dir=None, pdb_file_type='mmtf', gem=None, gem_file_path=None, gem_file_type=None, genes_list=None, genes_and_sequences=None, genome_path=None, write_protein_fasta_files=True, description=None, custom_spont_id=None)[source]¶

Generic class to represent all information for a GEM-PRO project.

Initialize the GEM-PRO project with a genome-scale model, a list of genes, or a dict of genes and sequences. Specify the name of your project, along with the root directory where a folder with that name will be created.

Main methods provided are:

Automated mapping of sequence IDs
- With KEGG mapper
- With UniProt mapper
- Allowing manual gene ID –> protein sequence entry
- Allowing manual gene ID –> UniProt ID
Consolidating sequence IDs and setting a representative sequence
- Currently these are set based on available PDB IDs
Mapping of representative sequence –> structures
- With UniProt –> ranking of PDB structures
- BLAST representative sequence –> PDB database
Preparation of files for homology modeling (currently for I-TASSER)
- Mapping to existing models
- Preparation for running I-TASSER
- Parsing I-TASSER runs
Running QC/QA on structures and setting a representative structure
- Various cutoffs (mutations, insertions, deletions) can be set to filter structures
Automation of protein sequence and structure property calculation
Creation of Pandas DataFrame summaries directly from downloaded metadata

Parameters:

gem_name (str) – The name of your GEM or just your project in general. This will be the name of the main folder that is created in root_dir.
root_dir (str) – Path to where the folder named after gem_name will be created. If not provided, directories will not be created and output directories need to be specified for some steps.
pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
gem (Model) – COBRApy Model object
gem_file_path (str) – Path to GEM file
gem_file_type (str) – GEM model type - sbml (or xml), mat, or json formats
genes_list (list) – List of gene IDs that you want to map
genes_and_sequences (dict) – Dictionary of gene IDs and their amino acid sequence strings
genome_path (str) – FASTA file of all protein sequences
write_protein_fasta_files (bool) – If individual protein FASTA files should be written out
description (str) – Description string of your project
custom_spont_id (str) – ID of spontaneous genes in a COBRA model which will be ignored for analysis

add_gene_ids(genes_list)[source]¶

Add gene IDs manually into the GEM-PRO project.

Parameters:	genes_list (list) – List of gene IDs as strings.

base_dir¶: str – GEM-PRO project folder.

blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]¶

BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).

Parameters:

seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
display_link (bool, optional) – Set to True if links to the HTML results should be displayed
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False

custom_spont_id = None¶: str – ID of spontaneous genes in a COBRA model which will be ignored for analysis

data_dir¶: str – Directory where all data are stored.

df_homology_models¶: DataFrame – Get a dataframe of I-TASSER homology model results

df_kegg_metadata¶: DataFrame – Pandas DataFrame of KEGG metadata per protein.

df_pdb_blast¶: DataFrame – Get a dataframe of PDB BLAST results

df_pdb_metadata¶: DataFrame – Get a dataframe of PDB metadata (PDBs have to be downloaded first).

df_pdb_ranking¶: DataFrame – Get a dataframe of UniProt -> best structure in PDB results

df_proteins¶: DataFrame – Get a summary dataframe of all proteins in the project.

df_representative_sequences¶: DataFrame – Pandas DataFrame of representative sequence information per protein.

df_representative_structures¶: DataFrame – Get a dataframe of representative protein structure information.

df_uniprot_metadata¶: DataFrame – Pandas DataFrame of UniProt metadata per protein.

find_disulfide_bridges(representatives_only=True)[source]¶

Run Biopython’s disulfide bridge finder and store found bridges.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']

Parameters:	representative_only (bool) – If analysis should only be run on the representative structure

find_disulfide_bridges_parallelize(sc, representatives_only=True)[source]¶

Run Biopython’s disulfide bridge finder and store found bridges.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']

Parameters:	representative_only (bool) – If analysis should only be run on the representative structure

functional_genes¶: DictList – All genes with a representative protein structure.

genes = None¶: DictList – All protein-coding genes in this GEM-PRO project

genes_dir¶: str – Directory where all gene specific information is stored.

genes_with_a_representative_sequence¶: DictList – All genes with a representative sequence.

genes_with_a_representative_structure¶: DictList – All genes with a representative protein structure.

genes_with_experimental_structures¶: DictList – All genes that have at least one experimental structure.

genes_with_homology_models¶: DictList – All genes that have at least one homology model.

genes_with_structures¶: DictList – All genes with any mapped protein structures.

genome_path = None¶: str – Simple link to the filepath of the FASTA file containing all protein sequences

get_dssp_annotations(representatives_only=True, force_rerun=False)[source]¶

Run DSSP on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']

Parameters:	representative_only (bool) – If analysis should only be run on the representative structure force_rerun (bool) – If calculations should be rerun even if an output file exists

get_dssp_annotations_parallelize(sc, representatives_only=True, force_rerun=False)[source]¶

Run DSSP on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']

Parameters:	representative_only (bool) – If analysis should only be run on the representative structure force_rerun (bool) – If calculations should be rerun even if an output file exists

get_freesasa_annotations(include_hetatms=False, representatives_only=True, force_rerun=False)[source]¶

Run freesasa on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-freesasa']

Parameters:	include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to `False`. representative_only (bool) – If analysis should only be run on the representative structure force_rerun (bool) – If calculations should be rerun even if an output file exists

get_freesasa_annotations_parallelize(sc, include_hetatms=False, representatives_only=True, force_rerun=False)[source]¶

Run freesasa on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-freesasa']

Parameters:	include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to `False`. representative_only (bool) – If analysis should only be run on the representative structure force_rerun (bool) – If calculations should be rerun even if an output file exists

get_itasser_models(homology_raw_dir, custom_itasser_name_mapping=None, outdir=None, force_rerun=False)[source]¶

Copy generated I-TASSER models from a directory to the GEM-PRO directory.

Parameters:

homology_raw_dir (str) – Root directory of I-TASSER folders.
custom_itasser_name_mapping (dict) – Use this if your I-TASSER folder names differ from your model gene names. Input a dict of {model_gene: ITASSER_folder}.
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory

get_manual_homology_models(input_dict, outdir=None, clean=True, force_rerun=False)[source]¶

Copy homology models to the GEM-PRO project.

Requires an input of a dictionary formatted like so:

{
    model_gene: {
                    homology_model_id1: {
                                            'model_file': '/path/to/homology/model.pdb',
                                            'file_type': 'pdb'
                                            'additional_info': info_value
                                        },
                    homology_model_id2: {
                                            'model_file': '/path/to/homology/model.pdb'
                                            'file_type': 'pdb'
                                        }
                }
}

Parameters:

input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
clean (bool) – If homology files should be cleaned and saved as a new PDB file
force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory

get_msms_annotations(representatives_only=True, force_rerun=False)[source]¶

Run MSMS on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-msms']

Parameters:	representative_only (bool) – If analysis should only be run on the representative structure force_rerun (bool) – If calculations should be rerun even if an output file exists

get_msms_annotations_parallelize(sc, representatives_only=True, force_rerun=False)[source]¶

Run MSMS on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-msms']

Parameters:	representative_only (bool) – If analysis should only be run on the representative structure force_rerun (bool) – If calculations should be rerun even if an output file exists

get_scratch_predictions(path_to_scratch, results_dir, scratch_basename='scratch', num_cores=1, exposed_buried_cutoff=25, custom_gene_mapping=None)[source]¶

Run and parse SCRATCH results to predict secondary structure and solvent accessibility. Annotations are stored in the protein’s representative sequence at:

.annotations

.letter_annotations

Parameters:

path_to_scratch (str) – Path to SCRATCH executable
results_dir (str) – Path to SCRATCH results folder, which will have the files (scratch.ss, scratch.ss8, scratch.acc, scratch.acc20)
scratch_basename (str) – Basename of the SCRATCH results (‘scratch’ is default)
num_cores (int) – Number of cores to use to parallelize SCRATCH run
exposed_buried_cutoff (int) – Cutoff of exposed/buried for the acc20 predictions
custom_gene_mapping (dict) – Default parsing of SCRATCH output files is to look for the model gene IDs. If your output files contain IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.

get_sequence_properties(representatives_only=True)[source]¶

Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of all protein sequences. Results are stored in the protein’s respective SeqProp objects at .annotations

Parameters:	representative_only (bool) – If analysis should only be run on the representative sequences

get_tmhmm_predictions(tmhmm_results, custom_gene_mapping=None)[source]¶

Parse TMHMM results and store in the representative sequences.

This is a basic function to parse pre-run TMHMM results. Run TMHMM from the web service (http://www.cbs.dtu.dk/services/TMHMM/) by doing the following:

Write all representative sequences in the GEM-PRO using the function write_representative_sequences_file

Upload the file to http://www.cbs.dtu.dk/services/TMHMM/ and choose “Extensive, no graphics” as the output

Copy and paste the results (ignoring the top header and above “HELP with output formats”) into a file and save it

Run this function on that file

Parameters:	tmhmm_results (str) – Path to TMHMM results (long format) custom_gene_mapping (dict) – Default parsing of TMHMM output is to look for the model gene IDs. If your output file contains IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.

kegg_mapping_and_metadata(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:

Download all metadata and sequence files in the sequences directory
Creates a KEGGProp object in the protein.sequences attribute
Returns a Pandas DataFrame of mapping results

Parameters:

kegg_organism_code (str) – The three letter KEGG code of your organism
custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
force_rerun (bool) – If you want to overwrite any existing mappings and files

kegg_mapping_and_metadata_parallelize(sc, kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:

Download all metadata and sequence files in the sequences directory
Creates a KEGGProp object in the protein.sequences attribute
Returns a Pandas DataFrame of mapping results

Parameters:

sc (SparkContext) – Spark Context to parallelize this function
kegg_organism_code (str) – The three letter KEGG code of your organism
custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
force_rerun (bool) – If you want to overwrite any existing mappings and files

load_cobra_model(model)[source]¶

Load a COBRApy Model object into the GEM-PRO project.

Parameters:	model (Model) – COBRApy `Model` object

manual_seq_mapping(gene_to_seq_dict, outdir=None, write_fasta_files=True, set_as_representative=True)[source]¶

Read a manual input dictionary of model gene IDs –> protein sequences. By default sets them as representative.

Parameters:

gene_to_seq_dict (dict) – Mapping of gene IDs to their protein sequence strings
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
write_fasta_files (bool) – If individual protein FASTA files should be written out
set_as_representative (bool) – If mapped sequences should be set as representative

manual_uniprot_mapping(gene_to_uniprot_dict, outdir=None, set_as_representative=True)[source]¶

Read a manual dictionary of model gene IDs –> UniProt IDs. By default sets them as representative.

This allows for mapping of the missing genes, or overriding of automatic mappings.

Input a dictionary of:

{
    <gene_id1>: <uniprot_id1>,
    <gene_id2>: <uniprot_id2>,
}

Parameters:	gene_to_uniprot_dict – Dictionary of mappings as shown above outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences

map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]¶

Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.

The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html The list of PDB structures mapping to a UniProt accession sorted by coverage of the protein and, if the same, resolution.

Parameters:	seq_ident_cutoff (float) – Sequence identity cutoff in decimal form outdir (str) – Output directory to cache JSON results of search force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns:	A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type:	list

missing_homology_models¶: list – List of genes with no mapping to any homology models.

missing_kegg_mapping¶: list – List of genes with no mapping to KEGG.

missing_pdb_structures¶: list – List of genes with no mapping to any experimental PDB structure.

missing_representative_sequence¶: list – List of genes with no mapping to a representative sequence.

missing_representative_structure¶: list – List of genes with no mapping to a representative structure.

missing_uniprot_mapping¶: list – List of genes with no mapping to UniProt.

model = None¶: Model – COBRApy model object

model_dir¶: str – Directory where original GEMs and GEM-related files are stored.

pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]¶

Download ALL mapped experimental structures to each protein’s structures directory.

Parameters:	outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired force_rerun (bool) – If files should be re-downloaded if they already exist

pdb_file_type = None¶: str – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB

prep_itasser_modeling(itasser_installation, itlib_folder, runtype, create_in_dir=None, execute_from_dir=None, all_genes=False, print_exec=False, **kwargs)[source]¶

Prepare to run I-TASSER homology modeling for genes without structures, or all genes.

Parameters:

itasser_installation (str) – Path to I-TASSER folder, i.e. ~/software/I-TASSER4.4
itlib_folder (str) – Path to ITLIB folder, i.e. ~/software/ITLIB
runtype – How you will be running I-TASSER - local, slurm, or torque
create_in_dir (str) – Local directory where folders will be created
execute_from_dir (str) – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
all_genes (bool) – If all genes should be prepped, or only those without any mapped structures
print_exec (bool) – If the execution statement should be printed to run modelling

Todo

Document kwargs - extra options for I-TASSER, SLURM or Torque execution
Allow modeling of any sequence in sequences attribute, select by ID or provide SeqProp?

root_dir¶: str – Directory where GEM-PRO project folder named after the attribute base_dir is located.

set_representative_sequence(force_rerun=False)[source]¶

Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.

Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.

Parameters:	force_rerun (bool) – Set to True to recheck stored sequences

set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]¶

Set all representative structure for proteins from a structure in the structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

Homology model(s)

Ranked PDBs

BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:

seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
always_use_homology (bool) – If homology models should always be set as the representative structure
rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
allow_mutants (bool) – If mutations should be allowed or checked for
allow_deletions (bool) – If deletions should be allowed or checked for
allow_insertions (bool) – If insertions should be allowed or checked for
allow_unresolved (bool) – If unresolved residues should be allowed or checked for
skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
clean (bool) – If structures should be cleaned
force_rerun (bool) – If sequence to structure alignment should be rerun

Todo

Remedy large structure representative setting

structures_dir¶: str – Directory where all structures are stored.

uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶

Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.

Parameters:

model_gene_source (str) –
the database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:
- Ensembl Genomes - ENSEMBLGENOME_ID (i.e. E. coli b-numbers)
- Entrez Gene (GeneID) - P_ENTREZGENEID
- RefSeq Protein - P_REFSEQ_AC
custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
force_rerun (bool) – If you want to overwrite any existing mappings and files

write_representative_sequences_file(outname, outdir=None, set_ids_from_model=True)[source]¶

Write all the model’s sequences as a single FASTA file. By default, sets IDs to model gene IDs.

Parameters:	outname (str) – Name of the output FASTA file without the extension outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially set_ids_from_model (bool) – If the gene ID source should be the model gene IDs, not the original sequence ID

`ssbio.databases`¶

PDBProp¶

class ssbio.databases.pdb.PDBProp(ident, description=None, chains=None, mapped_chains=None, structure_path=None, file_type=None)[source]¶

Store information about a protein structure from the Protein Data Bank.

Extends the StructProp class to allow initialization of the structure by its PDB ID, and then enabling downloads of the structure file as well as parsing its metadata.

Parameters:	ident (str) – description (str) – chains (str) – mapped_chains (str) – structure_path (str) – file_type (str) – `pdb`, `mmCif`, `xml`, `mmtf` - file type for files downloaded from the PDB

biological_assemblies = None¶: DictList – A list for storing Bioassembly objects related to this PDB ID

download_structure_file(outdir, file_type=None, load_header_metadata=True, force_rerun=False)[source]¶

Download a structure file from the PDB, specifying an output directory and a file type. Optionally download the mmCIF header file and parse data from it to store within this object.

Parameters:	outdir (str) – Path to output directory file_type (str) – `pdb`, `mmCif`, `xml`, `mmtf` - file type for files downloaded from the PDB load_header_metadata (bool) – If header metadata should be loaded into this object, fastest with mmtf files force_rerun (bool) – If structure file should be downloaded even if it already exists

ssbio.databases.pdb.best_structures(uniprot_id, outname=None, outdir=None, seq_ident_cutoff=0.0, force_rerun=False)[source]¶

Use the PDBe REST service to query for the best PDB structures for a UniProt ID.

More information found here: https://www.ebi.ac.uk/pdbe/api/doc/sifts.html Link used to retrieve results: https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/:accession The list of PDB structures mapping to a UniProt accession sorted by coverage of the protein and, if the same, resolution.

Here is the ranking algorithm described by the PDB paper: https://nar.oxfordjournals.org/content/44/D1/D385.full

“Finally, a single quality indicator is also calculated for each entry by taking the harmonic average of all the percentile scores representing model and model-data-fit quality measures and then subtracting 10 times the numerical value of the resolution (in Angstrom) of the entry to ensure that resolution plays a role in characterising the quality of a structure. This single empirical ‘quality measure’ value is used by the PDBe query system to sort results and identify the ‘best’ structure in a given context. At present, entries determined by methods other than X-ray crystallography do not have similar data quality information available and are not considered as ‘best structures’.”

Parameters:

uniprot_id (str) – UniProt Accession ID
outname (str) – Basename of the output file of JSON results
outdir (str) – Path to output directory of JSON results
seq_ident_cutoff (float) – Cutoff results based on percent coverage (in decimal form)
force_rerun (bool) – Obtain best structures mapping ignoring previously downloaded results

Returns:

Rank-ordered list of dictionaries representing chain-specific PDB entries. Keys are:

pdb_id: the PDB ID which maps to the UniProt ID
chain_id: the specific chain of the PDB which maps to the UniProt ID
coverage: the percent coverage of the entire UniProt sequence
resolution: the resolution of the structure
start: the structure residue number which maps to the start of the mapped sequence
end: the structure residue number which maps to the end of the mapped sequence
unp_start: the sequence residue number which maps to the structure start
unp_end: the sequence residue number which maps to the structure end
experimental_method: type of experiment used to determine structure
tax_id: taxonomic ID of the protein’s original organism

Return type:

list

ssbio.databases.pdb.blast_pdb(seq, outfile='', outdir='', evalue=0.0001, seq_ident_cutoff=0.0, link=False, force_rerun=False)[source]¶

Returns a list of BLAST hits of a sequence to available structures in the PDB.

Parameters:	seq (str) – Your sequence, in string format outfile (str) – Name of output file outdir (str, optional) – Path to output directory. Default is the current directory. evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default). seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form) link (bool, optional) – Set to True if a link to the HTML results should be displayed force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
Returns:	Rank ordered list of BLAST hits in dictionaries.
Return type:	list

ssbio.databases.pdb.blast_pdb_df(blast_results)[source]¶: Make a dataframe of BLAST results

ssbio.databases.pdb.download_mmcif_header(pdb_id, outdir='', force_rerun=False)[source]¶

Download a mmCIF header file from the RCSB PDB by ID.

Parameters:	pdb_id – PDB ID outdir – Optional output directory, default is current working directory force_rerun – If the file should be downloaded again even if it exists
Returns:	Path to outfile
Return type:	str

ssbio.databases.pdb.download_sifts_xml(pdb_id, outdir='', force_rerun=False)[source]¶

Download the SIFTS file for a PDB ID.

Parameters:	pdb_id (str) – PDB ID outdir (str) – Output directory, current working directory if not specified. force_rerun (bool) – If the file should be downloaded again even if it exists
Returns:	Path to downloaded file
Return type:	str

ssbio.databases.pdb.download_structure(pdb_id, file_type, outdir='', only_header=False, force_rerun=False)[source]¶

Download a structure from the RCSB PDB by ID. Specify the file type desired.

Parameters:	pdb_id – PDB ID file_type – pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz outdir – Optional output directory only_header – If only the header file should be downloaded force_rerun – If the file should be downloaded again even if it exists
Returns:	Path to outfile
Return type:	str

Deprecated since version 1.0: This will be removed in 2.0. Use Biopython’s PDBList.retrieve_pdb_file function instead

ssbio.databases.pdb.get_bioassembly_info(pdb_id, biomol_num, cache=False, outdir=None, force_rerun=False)[source]¶

Get metadata about a bioassembly from the RCSB PDB’s REST API.

See: https://www.rcsb.org/pdb/rest/bioassembly/bioassembly?structureId=1hv4&nr=1 The API returns an XML file containing the information on a biological assembly that looks like this:

<bioassembly structureId="1HV4" assemblyNr="1" method="PISA" desc="author_and_software_defined_assembly">
    <transformations operator="1" chainIds="A,B,C,D">
        <transformation index="1">
            <matrix m11="1.00000000" m12="0.00000000" m13="0.00000000" m21="0.00000000" m22="1.00000000" m23="0.00000000" m31="0.00000000" m32="0.00000000" m33="1.00000000"/>
            <shift v1="0.00000000" v2="0.00000000" v3="0.00000000"/>
        </transformation>
    </transformations>
</bioassembly>

Parameters:	pdb_id (str) – PDB ID biomol_num (int) – Biological assembly number you are interested in cache (bool) – If the XML file should be downloaded outdir (str) – If cache, then specify the output directory force_rerun (bool) – If cache, and if file exists, specify if API should be queried again

ssbio.databases.pdb.get_num_bioassemblies(pdb_id, cache=False, outdir=None, force_rerun=False)[source]¶

Check if there are bioassemblies using the PDB REST API, and if there are, get the number of bioassemblies available.

See: https://www.rcsb.org/pages/webservices/rest, section ‘List biological assemblies’

Not all PDB entries have biological assemblies available and some have multiple. Details that are necessary to recreate a biological assembly from the asymmetric unit can be accessed from the following requests.

Number of biological assemblies associated with a PDB entry
Access the transformation information needed to generate a biological assembly (nr=0 will return information for the asymmetric unit, nr=1 will return information for the first assembly, etc.)

A query of https://www.rcsb.org/pdb/rest/bioassembly/nrbioassemblies?structureId=1hv4 returns this:

<nrBioAssemblies structureId="1HV4" hasAssemblies="true" count="2"/>

Parameters:	pdb_id (str) – PDB ID cache (bool) – If the XML file should be downloaded outdir (str) – If cache, then specify the output directory force_rerun (bool) – If cache, and if file exists, specify if API should be queried again

ssbio.databases.pdb.get_release_date(pdb_id)[source]¶

Quick way to get the release date of a PDB ID using the table of results from the REST service

Returns None if the release date is not available.

Returns:	Organism of a PDB ID
Return type:	str

ssbio.databases.pdb.get_resolution(pdb_id)[source]¶

Quick way to get the resolution of a PDB ID using the table of results from the REST service

Returns infinity if the resolution is not available.

Returns:	resolution of a PDB ID in Angstroms
Return type:	float

Todo

Unit test

ssbio.databases.pdb.map_uniprot_resnum_to_pdb(uniprot_resnum, chain_id, sifts_file)[source]¶

Map a UniProt residue number to its corresponding PDB residue number.

This function requires that the SIFTS file be downloaded, and also a chain ID (as different chains may have different mappings).

Parameters:

uniprot_resnum (int) – integer of the residue number you’d like to map
chain_id (str) – string of the PDB chain to map to
sifts_file (str) – Path to the SIFTS XML file

Returns:

tuple containing:

mapped_resnum (int): Mapped residue number is_observed (bool): Indicates if the 3D structure actually shows the residue

Return type:

(tuple)

ssbio.databases.pdb.parse_mmcif_header(infile)[source]¶

Parse a couple important fields from the mmCIF file format with some manual curation of ligands.

If you want full access to the mmCIF file just use the MMCIF2Dict class in Biopython.

Parameters:	infile – Path to mmCIF file
Returns:	Dictionary of parsed header
Return type:	dict

ssbio.databases.pdb.parse_mmtf_header(infile)[source]¶

Parse an MMTF file and return basic header-like information.

Parameters:	infile (str) – Path to MMTF file
Returns:	Dictionary of parsed header
Return type:	dict

Todo

Can this be sped up by not parsing the 3D coordinate info somehow?
OR just store the sequences when this happens since it is already being parsed.

PISA¶

ssbio.databases.pisa.download_pisa_multimers_xml(pdb_ids, save_single_xml_files=True, outdir=None, force_rerun=False)[source]¶

Download the PISA XML file for multimers.

See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info

XML description of macromolecular assemblies:: http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contain XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.

Parameters:	pdb_ids (str, list) – PDB ID or list of IDs save_single_xml_files (bool) – If single XML files should be saved per PDB ID. If False, if multiple PDB IDs are provided, then a single, combined XML output file is downloaded outdir (str) – Directory to output PISA XML files force_rerun (bool) – Redownload files if they already exist
Returns:	of files downloaded
Return type:	list

ssbio.databases.pisa.parse_pisa_multimers_xml(pisa_multimers_xml, download_structures=False, outdir=None, force_rerun=False)[source]¶

Retrieve PISA information from an XML results file

See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info

XML description of macromolecular assemblies:: http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contain XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.

Parameters:	pisa_multimers_xml (str) – Path to PISA XML output file download_structures (bool) – If assembly files should be downloaded outdir (str) – Directory to output assembly files force_rerun (bool) – Redownload files if they already exist
Returns:	of parsed PISA information
Return type:	dict

ssbio.databases.pisa.pdb_chain_stoichiometry_biomolone(pdbid)[source]¶

Get the stoichiometry of the chains in biological assembly 1 as a dictionary.

Steps taken are: 1) Download PDB and parse header, make biomolecule if provided 2) Count how many times each chain appears in biomolecule #1 3) Convert chain id to uniprot id 4) Return final dictionary

Parameters:	pdbid (str) – 4 character PDB ID
Returns:	{(ChainID,UniProtID): # occurences}
Return type:	dict

SWISSMODEL¶

class ssbio.databases.swissmodel.SWISSMODEL(metadata_dir)[source]¶

Methods to parse through a SWISS-MODEL metadata set.

Download a particular organism’s metadata from SWISS-MODEL here: https://swissmodel.expasy.org/repository

Parameters:	metadata_dir (str) – Path to the extracted SWISS-MODEL_Repository folder

all_models = None¶: dict – Dictionary of lists, UniProt ID as the keys

download_models(uniprot_acc, outdir='', force_rerun=False)[source]¶

Download all models available for a UniProt accession number.

Parameters:	uniprot_acc (str) – UniProt ACC/ID outdir (str) – Path to output directory, uses working directory if not set force_rerun (bool) – Force a redownload the models if they already exist
Returns:	Paths to the downloaded models
Return type:	list

get_model_filepath(infodict)[source]¶

Get the path to the homology model using information from the index dictionary for a single model.

Example: use self.get_models(UNIPROT_ID) to get all the models, which returns a list of dictionaries.: Use one of those dictionaries as input to this function to get the filepath to the model itself.

Parameters:	infodict (dict) – Information about a model from get_models
Returns:	Path to homology model
Return type:	str

get_models(uniprot_acc)[source]¶

Return all available models for a UniProt accession number.

Parameters:	uniprot_acc (str) – UniProt ACC/ID
Returns:	All available models in SWISS-MODEL for this UniProt entry
Return type:	dict

metadata_dir = None¶: str – Path to the extracted SWISS-MODEL_Repository folder

metadata_index_json¶: str – Path to the INDEX_JSON file.

organize_models(outdir, force_rerun=False)[source]¶

Organize and rename SWISS-MODEL models to a single folder with a name containing template information.

Parameters:	outdir (str) – New directory to copy renamed models to force_rerun (bool) – If models should be copied again even if they already exist
Returns:	Dictionary of lists, UniProt IDs as the keys and new file paths as the values
Return type:	dict

parse_metadata()[source]¶: Parse the INDEX_JSON file and reorganize it as a dictionary of lists.

uniprots_modeled¶: list – Return all UniProt accession numbers with at least one model

ssbio.databases.swissmodel.get_oligomeric_state(swiss_model_path)[source]¶

Parse the oligomeric prediction in a SWISS-MODEL repository file

As of 2018-02-26, works on all E. coli models. Untested on other pre-made organism models.

Parameters:	swiss_model_path (str) – Path to SWISS-MODEL PDB file
Returns:	Information parsed about the oligomeric state
Return type:	dict

ssbio.databases.swissmodel.translate_ostat(ostat)[source]¶

Translate the OSTAT field to an integer.

As of 2018-02-26, works on all E. coli models. Untested on other pre-made organism models.

Parameters:	ostat (str) – Predicted oligomeric state of the PDB file
Returns:	Translated string to integer
Return type:	int

UniProtProp¶

class ssbio.databases.uniprot.UniProtProp(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, xml_path=None, gff_path=None)[source]¶

Generic class to store information on a UniProt entry, extended from a SeqProp object.

The main utilities of this class are to:

Download and/or parse UniProt text or xml files
Store extra parsed information in attributes

uniprot¶: str – Main UniProt accession code

alt_uniprots¶: list – Alternate accession codes that point to the main one

file_type¶: str – Metadata file type

reviewed¶: bool – If this entry is a “reviewed” entry. If None, then status is unknown.

ec_number¶: str – EC number

pfam¶: list – PFAM IDs

entry_version¶: str – Date of last update of the UniProt entry

seq_version¶: str – Date of last update of the UniProt sequence

download_metadata_file(outdir, force_rerun=False)[source]¶: Download and load the UniProt XML file

download_seq_file(outdir, force_rerun=False)[source]¶: Download and load the UniProt FASTA file

features¶: list – Get the features from the feature file, metadata file, or in memory

ranking_score()[source]¶

Provide a score for this UniProt ID based on reviewed (True=1, False=0) + number of PDBs

Returns:	Scoring for this ID
Return type:	int

seq¶: Seq – Get the Seq object from the sequence file, metadata file, or in memory

ssbio.databases.uniprot.blast_uniprot(seq_str, seq_ident=1, evalue=0.0001, reviewed_only=True, organism=None)[source]¶

BLAST the UniProt db to find what IDs match the sequence input

Parameters:	seq_str – Sequence string seq_ident – Percent identity to match evalue – E-value of BLAST hit

Returns:

ssbio.databases.uniprot.download_uniprot_file(uniprot_id, filetype, outdir='', force_rerun=False)[source]¶

Download a UniProt file for a UniProt ID/ACC

Parameters:	uniprot_id – Valid UniProt ID filetype – txt, fasta, xml, rdf, or gff outdir – Directory to download the file
Returns:	Absolute path to file
Return type:	str

ssbio.databases.uniprot.get_fasta(uniprot_id)[source]¶

Get the protein sequence for a UniProt ID as a string.

Parameters:	uniprot_id – Valid UniProt ID
Returns:	String of the protein (amino acid) sequence
Return type:	str

ssbio.databases.uniprot.is_valid_uniprot_id(instring)[source]¶

Check if a string is a valid UniProt ID.

See regex from: http://www.uniprot.org/help/accession_numbers

Parameters:	instring – any string identifier

Returns: True if the string is a valid UniProt ID

ssbio.databases.uniprot.old_parse_uniprot_txt_file(infile)[source]¶

From: boscoh/uniprot github Parses the text of metadata retrieved from uniprot.org.

Only a few fields have been parsed, but this provides a template for the other fields.

A single description is generated from joining alternative descriptions.

Returns a dictionary with the main UNIPROT ACC as keys.

ssbio.databases.uniprot.parse_uniprot_txt_file(infile)[source]¶

Parse a raw UniProt metadata file and return a dictionary.

Parameters:	infile – Path to metadata file
Returns:	Metadata dictionary
Return type:	dict

ssbio.databases.uniprot.parse_uniprot_xml_metadata(sr)[source]¶

Load relevant attributes and dbxrefs from a parsed UniProt XML file in a SeqRecord.

Returns:	All parsed information
Return type:	dict

ssbio.databases.uniprot.uniprot_ec(uniprot_id)[source]¶

Retrieve the EC number annotation for a UniProt ID.

Parameters:	uniprot_id – Valid UniProt ID

Returns:

ssbio.databases.uniprot.uniprot_reviewed_checker(uniprot_id)[source]¶

Check if a single UniProt ID is reviewed or not.

Parameters:	uniprot_id –
Returns:	If the entry is reviewed
Return type:	bool

ssbio.databases.uniprot.uniprot_reviewed_checker_batch(uniprot_ids)[source]¶

Batch check if uniprot IDs are reviewed or not

Parameters:	uniprot_ids – UniProt ID or list of UniProt IDs
Returns:	Boolean}
Return type:	A dictionary of {UniProtID

ssbio.databases.uniprot.uniprot_sites(uniprot_id)[source]¶

Retrieve a list of UniProt sites parsed from the feature file

Sites are defined here: http://www.uniprot.org/help/site and here: http://www.uniprot.org/help/function_section

Parameters:	uniprot_id – Valid UniProt ID

Returns:

KEGGProp¶

class ssbio.databases.kegg.KEGGProp(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, txt_path=None, gff_path=None)[source]¶

ssbio.databases.kegg.download_kegg_aa_seq(gene_id, outdir=None, force_rerun=False)[source]¶

Download a FASTA sequence of a protein from the KEGG database and return the path.

Parameters:	gene_id – the gene identifier outdir – optional path to output directory
Returns:	Path to FASTA file

ssbio.databases.kegg.download_kegg_gene_metadata(gene_id, outdir=None, force_rerun=False)[source]¶

Download the KEGG flatfile for a KEGG ID and return the path.

Parameters:	gene_id – KEGG gene ID (with organism code), i.e. “eco:1244” outdir – optional output directory of metadata
Returns:	Path to metadata file

ssbio.databases.kegg.map_kegg_all_genes(organism_code, target_db)[source]¶

Map all of an organism’s gene IDs to the target database.

This is faster than supplying a specific list of genes to map, plus there seems to be a limit on the number you can map with a manual REST query anyway.

Parameters:	organism_code – the three letter KEGG code of your organism target_db – ncbi-proteinid \| ncbi-geneid \| uniprot
Returns:	Dictionary of ID mapping

ssbio.databases.kegg.parse_kegg_gene_metadata(infile)[source]¶

Parse the KEGG flatfile and return a dictionary of metadata.

Dictionary keys are:: refseq uniprot pdbs taxonomy

Parameters:	infile – Path to KEGG flatfile
Returns:	Dictionary of metadata
Return type:	dict

`ssbio.protein.structure.utils`¶

CleanPDB¶

ssbio.protein.structure.utils.cleanpdb.clean_pdb(pdb_file, out_suffix='_clean', outdir=None, force_rerun=False, remove_atom_alt=True, keep_atom_alt_id='A', remove_atom_hydrogen=True, add_atom_occ=True, remove_res_hetero=True, keep_chemicals=None, keep_res_only=None, add_chain_id_if_empty='X', keep_chains=None)[source]¶

Clean a PDB file.

Parameters:	pdb_file (str) – Path to input PDB file out_suffix (str) – Suffix to append to original filename outdir (str) – Path to output directory force_rerun (bool) – If structure should be re-cleaned if a clean file exists already remove_atom_alt (bool) – Remove alternate positions keep_atom_alt_id (str) – If removing alternate positions, which alternate ID to keep remove_atom_hydrogen (bool) – Remove hydrogen atoms add_atom_occ (bool) – Add atom occupancy fields if not present remove_res_hetero (bool) – Remove all HETATMs keep_chemicals (str, list) – If removing HETATMs, keep specified chemical names keep_res_only (str, list) – Keep ONLY specified resnames, deletes everything else! add_chain_id_if_empty (str) – Add a chain ID if not present keep_chains (str, list) – Keep only these chains
Returns:	Path to cleaned PDB file
Return type:	str

MutatePDB¶

DOCK¶

class ssbio.protein.structure.utils.dock.DOCK(structure_id, pdb_file, amb_file, flex1_file, flex2_file, root_dir=None)[source]¶

Class to prepare a structure file for docking with DOCK6.

Attributes:

auto_flexdock(binding_residues, radius, ligand_path=None, force_rerun=False)[source]¶

Run DOCK6 on a PDB file, given its binding residues and a radius around them.

Provide a path to a ligand to dock a ligand to it. If no ligand is provided, DOCK6 preparations will be run on that structure file.

Parameters:	binding_residues (str) – Comma separated string of residues (eg: ‘144,170,199’) radius (int, float) – Radius around binding residues to dock to ligand_path (str) – Path to ligand (mol2 format) to dock to protein force_rerun (bool) – If method should be rerun even if output files exist

binding_site_mol2(residues, force_rerun=False)[source]¶

Create mol2 of only binding site residues from the receptor

This function will take in a .pdb file (preferably the _receptor_noH.pdb file) and a string of residues (eg: ‘144,170,199’) and delete all other residues in the .pdb file. It then saves the coordinates of the selected residues as a .mol2 file. This is necessary for Chimera to select spheres within the radius of the binding site.

Parameters:	residues (str) – Comma separated string of residues (eg: ‘144,170,199’) force_rerun (bool) – If method should be rerun even if output file exists

dms_maker(force_rerun=False)[source]¶

Create surface representation (dms file) of receptor

Parameters:	force_rerun (bool) – If method should be rerun even if output file exists

do_dock6_flexible(ligand_path, force_rerun=False)[source]¶

Dock a ligand to the protein.

Parameters:	ligand_path (str) – Path to ligand (mol2 format) to dock to protein force_rerun (bool) – If method should be rerun even if output file exists

dock_dir¶: str – DOCK folder

dockprep(force_rerun=False)[source]¶

Prepare a PDB file for docking by first converting it to mol2 format.

Parameters:	force_rerun (bool) – If method should be rerun even if output file exists

grid(force_rerun=False)[source]¶

Create the scoring grid within the dummy box.

Parameters:	force_rerun (bool) – If method should be rerun even if output file exists

protein_only_and_noH(keep_ligands=None, force_rerun=False)[source]¶

Isolate the receptor by stripping everything except protein and specified ligands.

Parameters:	keep_ligands (str, list) – Ligand(s) to keep in PDB file force_rerun (bool) – If method should be rerun even if output file exists

root_dir¶: str – Directory where DOCK project folder is located

showbox(force_rerun=False)[source]¶

Create the dummy PDB box around the selected spheres.

Parameters:	force_rerun (bool) – If method should be rerun even if output file exists

sphere_selector_using_residues(radius, force_rerun=False)[source]¶

Select spheres based on binding site residues

Parameters:	radius (int, float) – Radius around binding residues to dock to force_rerun (bool) – If method should be rerun even if output file exists

sphgen(force_rerun=False)[source]¶

Create sphere representation (sph file) of receptor from the surface representation

Parameters:	force_rerun (bool) – If method should be rerun even if output file exists

ssbio.protein.structure.utils.dock.parse_results_mol2(mol2_outpath)[source]¶

Parse a DOCK6 mol2 output file, return a Pandas DataFrame of the results.

Parameters:	mol2_outpath (str) – Path to mol2 output file
Returns:	Pandas DataFrame of the results
Return type:	DataFrame

`ssbio.protein.structure.properties`¶

Structure Residues¶

ssbio.protein.structure.properties.residues.distance_to_site(residue_of_interest, residues, model)[source]¶

Calculate the distance between an amino acid and a group of amino acids.

Parameters:	residue_of_interest – Residue number you are interested in (ie. a mutation) residues – List of residue numbers
Returns:	Distance (in Angstroms) to the group of residues
Return type:	float

ssbio.protein.structure.properties.residues.get_structure_seqrecords(model)[source]¶

Get a dictionary of a PDB file’s sequences.

Special cases include:

Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
HETATMs. Currently written as an “X”, or unknown amino acid.

Parameters:	model – Biopython Model object of a Structure
Returns:	List of SeqRecords
Return type:	list

ssbio.protein.structure.properties.residues.get_structure_seqs(pdb_file, file_type)[source]¶

Get a dictionary of a PDB file’s sequences.

Special cases include:

Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
HETATMs. Currently written as an “X”, or unknown amino acid.

Parameters:	pdb_file – Path to PDB file
Returns:	Dictionary of: {chain_id: sequence}
Return type:	dict

ssbio.protein.structure.properties.residues.hse_output(pdb_file, file_type)[source]¶

The solvent exposure of an amino acid residue is important for analyzing, understanding and predicting aspects of protein structure and function [73]. A residue’s solvent exposure can be classified as four categories: exposed, partly exposed, buried and deeply buried residues. Hamelryck et al. [73] established a new 2D measure that provides a different view of solvent exposure, i.e. half-sphere exposure (HSE). By conceptually dividing the sphere of a residue into two halves- HSE-up and HSE-down, HSE provides a more detailed description of an amino acid residue’s spatial neighborhood. HSE is calculated by the hsexpo module implemented in the BioPython package [74] from a PDB file.

http://onlinelibrary.wiley.com/doi/10.1002/prot.20379/abstract

Parameters:	pdb_file –

Returns:

ssbio.protein.structure.properties.residues.match_structure_sequence(orig_seq, new_seq, match='X', fill_with='X', ignore_excess=False)[source]¶

Correct a sequence to match inserted X’s in a structure sequence

This is useful for mapping a sequence obtained from structural tools like MSMS or DSSP: to the sequence obtained by the get_structure_seqs method.

Examples

>>> structure_seq = 'XXXABCDEF'
>>> prop_list = [4, 5, 6, 7, 8, 9]
>>> match_structure_sequence(structure_seq, prop_list)
['X', 'X', 'X', 4, 5, 6, 7, 8, 9]

>>> match_structure_sequence(structure_seq, prop_list, fill_with=float('Inf'))
[inf, inf, inf, 4, 5, 6, 7, 8, 9]

>>> structure_seq = '---ABCDEF---'
>>> prop_list = ('H','H','H','C','C','C')
>>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-')
('-', '-', '-', 'H', 'H', 'H', 'C', 'C', 'C', '-', '-', '-')

>>> structure_seq = 'ABCDEF---'
>>> prop_list = 'HHHCCC'
>>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-')
'HHHCCC---'

>>> structure_seq = 'AXBXCXDXEXF'
>>> prop_list = ['H', 'H', 'H', 'C', 'C', 'C']
>>> match_structure_sequence(structure_seq, prop_list, match='X', fill_with='X')
['H', 'X', 'H', 'X', 'H', 'X', 'C', 'X', 'C', 'X', 'C']

Parameters:	orig_seq (str, Seq, SeqRecord) – Sequence to match to new_seq (str, tuple, list) – Sequence to fill in match (str) – What to match fill_with – What to fill in when matches are found ignore_excess (bool) – If excess sequence on the tail end of new_seq should be ignored
Returns:	new_seq which will match the length of orig_seq
Return type:	str, tuple, list

ssbio.protein.structure.properties.residues.resname_in_proximity(resname, model, chains, resnums, threshold=5)[source]¶

Search within the proximity of a defined list of residue numbers and their chains for any specifed residue name.

Parameters:	resname (str) – Residue name to search for in proximity of specified chains + resnums model – Biopython Model object chains (str, list) – Chain ID or IDs to check resnums (int, list) – Residue numbers within the chain to check threshold (float) – Cutoff in Angstroms for returning True if a RESNAME is near
Returns:	True if a RESNAME is within the threshold cutoff
Return type:	bool

ssbio.protein.structure.properties.residues.search_ss_bonds(model, threshold=3.0)[source]¶

Searches S-S bonds based on distances between atoms in the structure (first model only). Average distance is 2.05A. Threshold is 3A default. Returns iterator with tuples of residues.

ADAPTED FROM JOAO RODRIGUES’ BIOPYTHON GSOC PROJECT (http://biopython.org/wiki/GSOC2010_Joao)

ssbio.protein.structure.properties.residues.site_centroid(residues, model)[source]¶

Get the XYZ coordinate of the center of a list of residues.

Parameters:	residues – List of residue numbers pdb_file – Path to PDB file
Returns:	(X, Y, Z) coordinate of centroid
Return type:	tuple

`ssbio.protein.sequence.utils`¶

Sequence Alignment¶

ssbio.protein.sequence.utils.alignment.get_alignment_df(a_aln_seq, b_aln_seq, a_seq_id=None, b_seq_id=None)[source]¶

Summarize two alignment strings in a dataframe.

Parameters:	a_aln_seq (str) – Aligned sequence string b_aln_seq (str) – Aligned sequence string a_seq_id (str) – Optional ID of a_seq b_seq_id (str) – Optional ID of b_aln_seq
Returns:	a per-residue level annotation of the alignment
Return type:	DataFrame

ssbio.protein.sequence.utils.alignment.get_alignment_df_from_file(alignment_file, a_seq_id=None, b_seq_id=None)[source]¶

Get a Pandas DataFrame of the Needle alignment results. Contains all positions of the sequences.

Parameters:	alignment_file – a_seq_id – Optional specification of the ID of the reference sequence b_seq_id – Optional specification of the ID of the aligned sequence
Returns:	all positions in the alignment
Return type:	Pandas DataFrame

ssbio.protein.sequence.utils.alignment.get_deletions(aln_df)[source]¶

Get a list of tuples indicating the first and last residues of a deletion region, as well as the length of the deletion.

Examples

# Deletion of residues 1 to 4, length 4 >>> test = {‘id_a’: {0: ‘a’, 1: ‘a’, 2: ‘a’, 3: ‘a’}, ‘id_a_aa’: {0: ‘M’, 1: ‘G’, 2: ‘I’, 3: ‘T’}, ‘id_a_pos’: {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}, ‘id_b’: {0: ‘b’, 1: ‘b’, 2: ‘b’, 3: ‘b’}, ‘id_b_aa’: {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan}, ‘id_b_pos’: {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan}, ‘type’: {0: ‘deletion’, 1: ‘deletion’, 2: ‘deletion’, 3: ‘deletion’}} >>> my_alignment = pd.DataFrame.from_dict(test) >>> get_deletions(my_alignment) [((1.0, 4.0), 4)]

Parameters:	aln_df (DataFrame) – Alignment DataFrame
Returns:	A list of tuples with the format ((deletion_start_resnum, deletion_end_resnum), deletion_length)
Return type:	list

ssbio.protein.sequence.utils.alignment.get_insertions(aln_df)[source]¶

Get a list of tuples indicating the first and last residues of a insertion region, as well as the length of the insertion.

If the first tuple is:: (-1, 1) that means the insertion is at the beginning of the original protein (X, Inf) where X is the length of the original protein, that means the insertion is at the end of the protein

Examples

# Insertion at beginning, length 3 >>> test = {‘id_a’: {0: ‘a’, 1: ‘a’, 2: ‘a’, 3: ‘a’}, ‘id_a_aa’: {0: np.nan, 1: np.nan, 2: np.nan, 3: ‘M’}, ‘id_a_pos’: {0: np.nan, 1: np.nan, 2: np.nan, 3: 1.0}, ‘id_b’: {0: ‘b’, 1: ‘b’, 2: ‘b’, 3: ‘b’}, ‘id_b_aa’: {0: ‘M’, 1: ‘M’, 2: ‘L’, 3: ‘M’}, ‘id_b_pos’: {0: 1, 1: 2, 2: 3, 3: 4}, ‘type’: {0: ‘insertion’, 1: ‘insertion’, 2: ‘insertion’, 3: ‘match’}} >>> my_alignment = pd.DataFrame.from_dict(test) >>> get_insertions(my_alignment) [((-1, 1.0), 3)]

Parameters:	aln_df (DataFrame) – Alignment DataFrame
Returns:	A list of tuples with the format ((insertion_start_resnum, insertion_end_resnum), insertion_length)
Return type:	list

ssbio.protein.sequence.utils.alignment.get_mutations(aln_df)[source]¶

Get a list of residue numbers (in the original sequence’s numbering) that are mutated

Parameters:	aln_df (DataFrame) – Alignment DataFrame just_resnums – If only the residue numbers should be returned, instead of a list of tuples of (original_residue, resnum, mutated_residue)
Returns:	Residue mutations
Return type:	list

ssbio.protein.sequence.utils.alignment.get_percent_identity(a_aln_seq, b_aln_seq)[source]¶: Get the percent identity between two alignment strings

ssbio.protein.sequence.utils.alignment.get_unresolved(aln_df)[source]¶

Get a list of residue numbers (in the original sequence’s numbering) that are unresolved

Parameters:	aln_df (DataFrame) – Alignment DataFrame
Returns:	Residue numbers that are mutated
Return type:	list

ssbio.protein.sequence.utils.alignment.map_resnum_a_to_resnum_b(a_resnum, a_aln, b_aln)[source]¶

Map a residue number in a sequence to the corresponding residue number in an aligned sequence.

Examples: >>> map_resnum_a_to_resnum_b(5, ‘–ABCDEF’, ‘XXABCDEF’) 7

Parameters:	a_resnum (int) – Residue number in the first aligned sequence a_aln (str, Seq, SeqRecord) – Aligned sequence string b_aln (str, Seq, SeqRecord) – Aligned sequence string
Returns:	Residue number in the second aligned sequence
Return type:	int

ssbio.protein.sequence.utils.alignment.needle_statistics(infile)[source]¶

Reads in a needle alignment file and spits out statistics of the alignment.

Parameters:	infile (str) – Alignment file name
Returns:	alignment_properties - a dictionary telling you the number of gaps, identity, etc.
Return type:	dict

ssbio.protein.sequence.utils.alignment.pairwise_sequence_alignment(a_seq, b_seq, engine, a_seq_id=None, b_seq_id=None, gapopen=10, gapextend=0.5, outfile=None, outdir=None, force_rerun=False)[source]¶

Run a global pairwise sequence alignment between two sequence strings.

Parameters:	a_seq (str, Seq, SeqRecord, SeqProp) – Reference sequence b_seq (str, Seq, SeqRecord, SeqProp) – Sequence to be aligned to reference engine (str) – biopython or needle - which pairwise alignment program to use a_seq_id (str) – Reference sequence ID. If not set, is “a_seq” b_seq_id (str) – Sequence to be aligned ID. If not set, is “b_seq” gapopen (int) – Only for needle - Gap open penalty is the score taken away when a gap is created gapextend (float) – Only for needle - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap outfile (str) – Only for needle - name of output file. If not set, is {id_a}_{id_b}_align.txt outdir (str) – Only for needle - Path to output directory. Default is the current directory. force_rerun (bool) – Only for needle - Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:	Biopython object to represent an alignment
Return type:	MultipleSeqAlignment

ssbio.protein.sequence.utils.alignment.run_needle_alignment(seq_a, seq_b, gapopen=10, gapextend=0.5, outdir=None, outfile=None, force_rerun=False)[source]¶

Run the needle alignment program for two strings and return the raw alignment result.

More info: EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84 Using strings as input: https://www.biostars.org/p/91124/

Parameters:	id_a – ID of reference sequence seq_a (str, Seq, SeqRecord) – Reference sequence id_b – ID of sequence to be aligned seq_b (str, Seq, SeqRecord) – String representation of sequence to be aligned gapopen – Gap open penalty is the score taken away when a gap is created gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap outdir (str, optional) – Path to output directory. Default is the current directory. outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:	Raw alignment result of the needle alignment in srspair format.
Return type:	str

ssbio.protein.sequence.utils.alignment.run_needle_alignment_on_files(id_a, faa_a, id_b, faa_b, gapopen=10, gapextend=0.5, outdir='', outfile='', force_rerun=False)[source]¶

Run the needle alignment program for two fasta files and return the raw alignment result.

More info: EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84

Parameters:	id_a – ID of reference sequence faa_a – File path to reference sequence id_b – ID of sequence to be aligned faa_b – File path to sequence to be aligned gapopen – Gap open penalty is the score taken away when a gap is created gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap outdir (str, optional) – Path to output directory. Default is the current directory. outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:	Raw alignment result of the needle alignment in srspair format.
Return type:	str

Sequence BLAST¶

ssbio.protein.sequence.utils.blast.calculate_bbh(blast_results_1, blast_results_2, r_name=None, g_name=None, outdir='')[source]¶

Calculate the best bidirectional BLAST hits (BBH) and save a dataframe of results.

Parameters:	blast_results_1 (str) – BLAST results for reference vs. other genome blast_results_2 (str) – BLAST results for other vs. reference genome r_name – Name of reference genome g_name – Name of other genome outdir – Directory where BLAST results are stored.
Returns:	Path to Pandas DataFrame of the BBH results.

ssbio.protein.sequence.utils.blast.create_orthology_matrix(r_name, genome_to_bbh_files, pid_cutoff=None, bitscore_cutoff=None, evalue_cutoff=None, filter_condition='OR', outname='', outdir='', force_rerun=False)[source]¶

Create an orthology matrix using best bidirectional BLAST hits (BBH) outputs.

Parameters:	r_name (str) – Name of the reference genome genome_to_bbh_files (dict) – Mapping of genome names to the BBH csv output from the `calculate_bbh()` method pid_cutoff (float) – Minimum percent identity between BLAST hits to filter for in the range [0, 100] bitscore_cutoff (float) – Minimum bitscore allowed between BLAST hits evalue_cutoff (float) – Maximum E-value allowed between BLAST hits filter_condition (str) – ‘OR’ or ‘AND’, how to combine cutoff filters. ‘OR’ gives more results since it is less stringent, as you will be filtering for hits with (>80% PID or >30 bitscore or <0.0001 evalue). outname – Name of output file of orthology matrix outdir – Path to output directory force_rerun (bool) – Force recreation of the orthology matrix even if the outfile exists
Returns:	Path to orthologous genes matrix.
Return type:	str

ssbio.protein.sequence.utils.blast.print_run_bidirectional_blast(reference, other_genome, dbtype, outdir)[source]¶

Write torque submission files for running bidirectional blast on a server and print execution command.

Parameters:	reference (str) – Path to “reference” genome, aka your “base strain” other_genome (str) – Path to other genome which will be BLASTed to the reference dbtype (str) – “nucl” or “prot” - what format your genome files are in outdir (str) – Path to folder where Torque scripts should be placed

ssbio.protein.sequence.utils.blast.run_bidirectional_blast(reference, other_genome, dbtype, outdir='')[source]¶

BLAST a genome against another, and vice versa.

This function requires BLAST to be installed, do so by running: sudo apt install ncbi-blast+

Parameters:	reference (str) – path to “reference” genome, aka your “base strain” other_genome (str) – path to other genome which will be BLASTed to the reference dbtype (str) – “nucl” or “prot” - what format your genome files are in outdir (str) – path to folder where BLAST outputs should be placed
Returns:	Paths to BLAST output files. (reference_vs_othergenome.out, othergenome_vs_reference.out)

ssbio.protein.sequence.utils.blast.run_makeblastdb(infile, dbtype, outdir='')[source]¶

Make the BLAST database for a genome file.

Parameters:	infile (str) – path to genome FASTA file dbtype (str) – “nucl” or “prot” - what format your genome files are in outdir (str) – path to directory to output database files (default is original folder)
Returns:	Paths to BLAST databases.

`ssbio.protein.sequence.properties`¶

Thermostability¶

This module provides functions to predict thermostability parameters (specifically the free energy of unfolding dG): of an amino acid sequence.

These methods are adapted from:

Oobatake, M., & Ooi, T. (1993). ‘Hydration and heat stability effects on protein unfolding’,

Progress in biophysics and molecular biology, 59/3: 237–84.

Dill, K. A., Ghosh, K., & Schmit, J. D. (2011). ‘Physical limits of cells and proteomes’,

Proceedings of the National Academy of Sciences of the United States of America, 108/44: 17876–82. DOI: 10.1073/pnas.1114477108

For an example of usage of these parameters in a genome-scale model:

Chen, K., Gao, Y., Mih, N., O’Brien, E., Yang, L., Palsson, B.O. (2017).

‘Thermo-sensitivity of growth is determined by chaperone-mediated proteome re-allocation.’, Submitted to PNAS.

ssbio.protein.sequence.properties.thermostability.calculate_dill_dG(seq_len, temp)[source]¶

Get free energy of unfolding (dG) using Dill method in units J/mol.

Parameters:	seq_len (int) – Length of amino acid sequence temp (float) – Temperature in degrees C
Returns:	Free energy of unfolding dG (J/mol)
Return type:	float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dG(seq, temp)[source]¶

Get free energy of unfolding (dG) using Oobatake method in units cal/mol.

Parameters:	seq (str, Seq, SeqRecord) – Amino acid sequence temp (float) – Temperature in degrees C
Returns:	Free energy of unfolding dG (J/mol)
Return type:	float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dH(seq, temp)[source]¶

Get dH using Oobatake method in units cal/mol.

Parameters:	seq (str, Seq, SeqRecord) – Amino acid sequence temp (float) – Temperature in degrees C
Returns:	dH in units cal/mol
Return type:	float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dS(seq, temp)[source]¶

Get dS using Oobatake method in units cal/mol.

Parameters:	seq (str, Seq, SeqRecord) – Amino acid sequence temp (float) – Temperature in degrees C
Returns:	dS in units cal/mol
Return type:	float

ssbio.protein.sequence.properties.thermostability.get_dG_at_T(seq, temp)[source]¶

Predict dG at temperature T, using best predictions from Dill or Oobatake methods.

Parameters:

seq (str, Seq, SeqRecord) – Amino acid sequence
temp (float) – Temperature in degrees C

Returns:

tuple containing:

dG (float) Free energy of unfolding dG (cal/mol) keq (float): Equilibrium constant Keq method (str): Method used to calculate

Return type:

(tuple)

Python API¶

ssbio.pipeline.gempro¶

GEMPRO¶

ssbio.databases¶

PDBProp¶

PISA¶

SWISSMODEL¶

UniProtProp¶

KEGGProp¶

ssbio.protein.structure.utils¶

CleanPDB¶

MutatePDB¶

DOCK¶

ssbio.protein.structure.properties¶

Structure Residues¶

ssbio.protein.sequence.utils¶

Sequence Alignment¶

Sequence BLAST¶

ssbio.protein.sequence.properties¶

Thermostability¶

`ssbio.pipeline.gempro`¶

`ssbio.databases`¶

`ssbio.protein.structure.utils`¶

`ssbio.protein.structure.properties`¶

`ssbio.protein.sequence.utils`¶

`ssbio.protein.sequence.properties`¶