Python API¶
Information on select functions, classes, or methods.
ssbio.pipeline.gempro
¶
GEMPRO¶
-
class
ssbio.pipeline.gempro.
GEMPRO
(gem_name, root_dir=None, pdb_file_type='mmtf', gem=None, gem_file_path=None, gem_file_type=None, genes_list=None, genes_and_sequences=None, genome_path=None, write_protein_fasta_files=True, description=None, custom_spont_id=None)[source]¶ Generic class to represent all information for a GEM-PRO project.
Initialize the GEM-PRO project with a genome-scale model, a list of genes, or a dict of genes and sequences. Specify the name of your project, along with the root directory where a folder with that name will be created.
Main methods provided are:
Automated mapping of sequence IDs
- With KEGG mapper
- With UniProt mapper
- Allowing manual gene ID –> protein sequence entry
- Allowing manual gene ID –> UniProt ID
Consolidating sequence IDs and setting a representative sequence
- Currently these are set based on available PDB IDs
Mapping of representative sequence –> structures
- With UniProt –> ranking of PDB structures
- BLAST representative sequence –> PDB database
Preparation of files for homology modeling (currently for I-TASSER)
- Mapping to existing models
- Preparation for running I-TASSER
- Parsing I-TASSER runs
Running QC/QA on structures and setting a representative structure
- Various cutoffs (mutations, insertions, deletions) can be set to filter structures
Automation of protein sequence and structure property calculation
Creation of Pandas DataFrame summaries directly from downloaded metadata
Parameters: - gem_name (str) – The name of your GEM or just your project in general. This will be the name of the main folder that is created in root_dir.
- root_dir (str) – Path to where the folder named after
gem_name
will be created. If not provided, directories will not be created and output directories need to be specified for some steps. - pdb_file_type (str) –
pdb
,mmCif
,xml
,mmtf
- file type for files downloaded from the PDB - gem (Model) – COBRApy Model object
- gem_file_path (str) – Path to GEM file
- gem_file_type (str) – GEM model type -
sbml
(orxml
),mat
, orjson
formats - genes_list (list) – List of gene IDs that you want to map
- genes_and_sequences (dict) – Dictionary of gene IDs and their amino acid sequence strings
- genome_path (str) – FASTA file of all protein sequences
- write_protein_fasta_files (bool) – If individual protein FASTA files should be written out
- description (str) – Description string of your project
- custom_spont_id (str) – ID of spontaneous genes in a COBRA model which will be ignored for analysis
-
add_gene_ids
(genes_list)[source]¶ Add gene IDs manually into the GEM-PRO project.
Parameters: genes_list (list) – List of gene IDs as strings.
-
base_dir
¶ str – GEM-PRO project folder.
-
blast_seqs_to_pdb
(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]¶ BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).
Parameters: - seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
-
custom_spont_id
= None¶ str – ID of spontaneous genes in a COBRA model which will be ignored for analysis
-
data_dir
¶ str – Directory where all data are stored.
-
df_homology_models
¶ DataFrame – Get a dataframe of I-TASSER homology model results
-
df_kegg_metadata
¶ DataFrame – Pandas DataFrame of KEGG metadata per protein.
-
df_pdb_blast
¶ DataFrame – Get a dataframe of PDB BLAST results
-
df_pdb_metadata
¶ DataFrame – Get a dataframe of PDB metadata (PDBs have to be downloaded first).
-
df_pdb_ranking
¶ DataFrame – Get a dataframe of UniProt -> best structure in PDB results
-
df_proteins
¶ DataFrame – Get a summary dataframe of all proteins in the project.
-
df_representative_sequences
¶ DataFrame – Pandas DataFrame of representative sequence information per protein.
-
df_representative_structures
¶ DataFrame – Get a dataframe of representative protein structure information.
-
df_uniprot_metadata
¶ DataFrame – Pandas DataFrame of UniProt metadata per protein.
-
find_disulfide_bridges
(representatives_only=True)[source]¶ Run Biopython’s disulfide bridge finder and store found bridges.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.annotations['SSBOND-biopython']
Parameters: representative_only (bool) – If analysis should only be run on the representative structure
-
find_disulfide_bridges_parallelize
(sc, representatives_only=True)[source]¶ Run Biopython’s disulfide bridge finder and store found bridges.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.annotations['SSBOND-biopython']
Parameters: representative_only (bool) – If analysis should only be run on the representative structure
-
functional_genes
¶ DictList – All genes with a representative protein structure.
-
genes
= None¶ DictList – All protein-coding genes in this GEM-PRO project
-
genes_dir
¶ str – Directory where all gene specific information is stored.
-
genes_with_a_representative_sequence
¶ DictList – All genes with a representative sequence.
-
genes_with_a_representative_structure
¶ DictList – All genes with a representative protein structure.
-
genes_with_experimental_structures
¶ DictList – All genes that have at least one experimental structure.
-
genes_with_homology_models
¶ DictList – All genes that have at least one homology model.
-
genes_with_structures
¶ DictList – All genes with any mapped protein structures.
-
genome_path
= None¶ str – Simple link to the filepath of the FASTA file containing all protein sequences
-
get_dssp_annotations
(representatives_only=True, force_rerun=False)[source]¶ Run DSSP on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-dssp']
Parameters: - representative_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_dssp_annotations_parallelize
(sc, representatives_only=True, force_rerun=False)[source]¶ Run DSSP on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-dssp']
Parameters: - representative_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_freesasa_annotations
(include_hetatms=False, representatives_only=True, force_rerun=False)[source]¶ Run freesasa on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-freesasa']
Parameters: - include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to
False
. - representative_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
- include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to
-
get_freesasa_annotations_parallelize
(sc, include_hetatms=False, representatives_only=True, force_rerun=False)[source]¶ Run freesasa on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-freesasa']
Parameters: - include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to
False
. - representative_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
- include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to
-
get_itasser_models
(homology_raw_dir, custom_itasser_name_mapping=None, outdir=None, force_rerun=False)[source]¶ Copy generated I-TASSER models from a directory to the GEM-PRO directory.
Parameters: - homology_raw_dir (str) – Root directory of I-TASSER folders.
- custom_itasser_name_mapping (dict) – Use this if your I-TASSER folder names differ from your model gene names. Input a dict of {model_gene: ITASSER_folder}.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
-
get_manual_homology_models
(input_dict, outdir=None, clean=True, force_rerun=False)[source]¶ Copy homology models to the GEM-PRO project.
Requires an input of a dictionary formatted like so:
{ model_gene: { homology_model_id1: { 'model_file': '/path/to/homology/model.pdb', 'file_type': 'pdb' 'additional_info': info_value }, homology_model_id2: { 'model_file': '/path/to/homology/model.pdb' 'file_type': 'pdb' } } }
Parameters: - input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- clean (bool) – If homology files should be cleaned and saved as a new PDB file
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
-
get_msms_annotations
(representatives_only=True, force_rerun=False)[source]¶ Run MSMS on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-msms']
Parameters: - representative_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_msms_annotations_parallelize
(sc, representatives_only=True, force_rerun=False)[source]¶ Run MSMS on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-msms']
Parameters: - representative_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_scratch_predictions
(path_to_scratch, results_dir, scratch_basename='scratch', num_cores=1, exposed_buried_cutoff=25, custom_gene_mapping=None)[source]¶ Run and parse
SCRATCH
results to predict secondary structure and solvent accessibility. Annotations are stored in the protein’s representative sequence at:.annotations
.letter_annotations
Parameters: - path_to_scratch (str) – Path to SCRATCH executable
- results_dir (str) – Path to SCRATCH results folder, which will have the files (scratch.ss, scratch.ss8, scratch.acc, scratch.acc20)
- scratch_basename (str) – Basename of the SCRATCH results (‘scratch’ is default)
- num_cores (int) – Number of cores to use to parallelize SCRATCH run
- exposed_buried_cutoff (int) – Cutoff of exposed/buried for the acc20 predictions
- custom_gene_mapping (dict) – Default parsing of SCRATCH output files is to look for the model gene IDs. If your output files contain IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
-
get_sequence_properties
(representatives_only=True)[source]¶ Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of all protein sequences. Results are stored in the protein’s respective SeqProp objects at
.annotations
Parameters: representative_only (bool) – If analysis should only be run on the representative sequences
-
get_tmhmm_predictions
(tmhmm_results, custom_gene_mapping=None)[source]¶ Parse TMHMM results and store in the representative sequences.
This is a basic function to parse pre-run TMHMM results. Run TMHMM from the web service (http://www.cbs.dtu.dk/services/TMHMM/) by doing the following:
- Write all representative sequences in the GEM-PRO using the function
write_representative_sequences_file
- Upload the file to http://www.cbs.dtu.dk/services/TMHMM/ and choose “Extensive, no graphics” as the output
- Copy and paste the results (ignoring the top header and above “HELP with output formats”) into a file and save it
- Run this function on that file
Parameters: - tmhmm_results (str) – Path to TMHMM results (long format)
- custom_gene_mapping (dict) – Default parsing of TMHMM output is to look for the model gene IDs. If your output file contains IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
- Write all representative sequences in the GEM-PRO using the function
-
kegg_mapping_and_metadata
(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶ Map all genes in the model to KEGG IDs using the KEGG service.
- Steps:
- Download all metadata and sequence files in the sequences directory
- Creates a KEGGProp object in the protein.sequences attribute
- Returns a Pandas DataFrame of mapping results
Parameters: - kegg_organism_code (str) – The three letter KEGG code of your organism
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
-
kegg_mapping_and_metadata_parallelize
(sc, kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶ Map all genes in the model to KEGG IDs using the KEGG service.
- Steps:
- Download all metadata and sequence files in the sequences directory
- Creates a KEGGProp object in the protein.sequences attribute
- Returns a Pandas DataFrame of mapping results
Parameters: - sc (SparkContext) – Spark Context to parallelize this function
- kegg_organism_code (str) – The three letter KEGG code of your organism
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
-
load_cobra_model
(model)[source]¶ Load a COBRApy Model object into the GEM-PRO project.
Parameters: model (Model) – COBRApy Model
object
-
manual_seq_mapping
(gene_to_seq_dict, outdir=None, write_fasta_files=True, set_as_representative=True)[source]¶ Read a manual input dictionary of model gene IDs –> protein sequences. By default sets them as representative.
Parameters: - gene_to_seq_dict (dict) – Mapping of gene IDs to their protein sequence strings
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- write_fasta_files (bool) – If individual protein FASTA files should be written out
- set_as_representative (bool) – If mapped sequences should be set as representative
-
manual_uniprot_mapping
(gene_to_uniprot_dict, outdir=None, set_as_representative=True)[source]¶ Read a manual dictionary of model gene IDs –> UniProt IDs. By default sets them as representative.
This allows for mapping of the missing genes, or overriding of automatic mappings.
Input a dictionary of:
{ <gene_id1>: <uniprot_id1>, <gene_id2>: <uniprot_id2>, }
Parameters: - gene_to_uniprot_dict – Dictionary of mappings as shown above
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
-
map_uniprot_to_pdb
(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]¶ Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s
sequences
folder.The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html The list of PDB structures mapping to a UniProt accession sorted by coverage of the protein and, if the same, resolution.
Parameters: - seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
- outdir (str) – Output directory to cache JSON results of search
- force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns: A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type: list
-
missing_homology_models
¶ list – List of genes with no mapping to any homology models.
-
missing_kegg_mapping
¶ list – List of genes with no mapping to KEGG.
-
missing_pdb_structures
¶ list – List of genes with no mapping to any experimental PDB structure.
-
missing_representative_sequence
¶ list – List of genes with no mapping to a representative sequence.
-
missing_representative_structure
¶ list – List of genes with no mapping to a representative structure.
-
missing_uniprot_mapping
¶ list – List of genes with no mapping to UniProt.
-
model
= None¶ Model – COBRApy model object
-
model_dir
¶ str – Directory where original GEMs and GEM-related files are stored.
-
pdb_downloader_and_metadata
(outdir=None, pdb_file_type=None, force_rerun=False)[source]¶ Download ALL mapped experimental structures to each protein’s structures directory.
Parameters: - outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
-
pdb_file_type
= None¶ str –
pdb
,mmCif
,xml
,mmtf
- file type for files downloaded from the PDB
-
prep_itasser_modeling
(itasser_installation, itlib_folder, runtype, create_in_dir=None, execute_from_dir=None, all_genes=False, print_exec=False, **kwargs)[source]¶ Prepare to run I-TASSER homology modeling for genes without structures, or all genes.
Parameters: - itasser_installation (str) – Path to I-TASSER folder, i.e.
~/software/I-TASSER4.4
- itlib_folder (str) – Path to ITLIB folder, i.e.
~/software/ITLIB
- runtype – How you will be running I-TASSER - local, slurm, or torque
- create_in_dir (str) – Local directory where folders will be created
- execute_from_dir (str) – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
- all_genes (bool) – If all genes should be prepped, or only those without any mapped structures
- print_exec (bool) – If the execution statement should be printed to run modelling
Todo
- Document kwargs - extra options for I-TASSER, SLURM or Torque execution
- Allow modeling of any sequence in sequences attribute, select by ID or provide SeqProp?
- itasser_installation (str) – Path to I-TASSER folder, i.e.
-
root_dir
¶ str – Directory where GEM-PRO project folder named after the attribute
base_dir
is located.
-
set_representative_sequence
(force_rerun=False)[source]¶ Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.
Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
Parameters: force_rerun (bool) – Set to True to recheck stored sequences
-
set_representative_structure
(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]¶ Set all representative structure for proteins from a structure in the structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the
always_use_homology
flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.Parameters: - seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
- pdb_file_type (str) –
pdb
,mmCif
,xml
,mmtf
- file type for files downloaded from the PDB - engine (str) –
biopython
orneedle
- which pairwise alignment program to use.needle
is the standard EMBOSS tool to run pairwise alignments.biopython
is Biopython’s implementation of needle. Results can differ! - always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
- clean (bool) – If structures should be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
Todo
- Remedy large structure representative setting
-
structures_dir
¶ str – Directory where all structures are stored.
-
uniprot_mapping_and_metadata
(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶ Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.
Parameters: - model_gene_source (str) –
the database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:
- Ensembl Genomes -
ENSEMBLGENOME_ID
(i.e. E. coli b-numbers) - Entrez Gene (GeneID) -
P_ENTREZGENEID
- RefSeq Protein -
P_REFSEQ_AC
- Ensembl Genomes -
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
- model_gene_source (str) –
-
write_representative_sequences_file
(outname, outdir=None, set_ids_from_model=True)[source]¶ Write all the model’s sequences as a single FASTA file. By default, sets IDs to model gene IDs.
Parameters: - outname (str) – Name of the output FASTA file without the extension
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_ids_from_model (bool) – If the gene ID source should be the model gene IDs, not the original sequence ID
ssbio.databases
¶
PDBProp¶
-
class
ssbio.databases.pdb.
PDBProp
(ident, description=None, chains=None, mapped_chains=None, structure_path=None, file_type=None)[source]¶ Store information about a protein structure from the Protein Data Bank.
Extends the
StructProp
class to allow initialization of the structure by its PDB ID, and then enabling downloads of the structure file as well as parsing its metadata.Parameters: - ident (str) –
- description (str) –
- chains (str) –
- mapped_chains (str) –
- structure_path (str) –
- file_type (str) –
pdb
,mmCif
,xml
,mmtf
- file type for files downloaded from the PDB
-
biological_assemblies
= None¶ DictList – A list for storing Bioassembly objects related to this PDB ID
-
download_structure_file
(outdir, file_type=None, load_header_metadata=True, force_rerun=False)[source]¶ Download a structure file from the PDB, specifying an output directory and a file type. Optionally download the mmCIF header file and parse data from it to store within this object.
Parameters: - outdir (str) – Path to output directory
- file_type (str) –
pdb
,mmCif
,xml
,mmtf
- file type for files downloaded from the PDB - load_header_metadata (bool) – If header metadata should be loaded into this object, fastest with mmtf files
- force_rerun (bool) – If structure file should be downloaded even if it already exists
-
ssbio.databases.pdb.
best_structures
(uniprot_id, outname=None, outdir=None, seq_ident_cutoff=0.0, force_rerun=False)[source]¶ Use the PDBe REST service to query for the best PDB structures for a UniProt ID.
More information found here: https://www.ebi.ac.uk/pdbe/api/doc/sifts.html Link used to retrieve results: https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/:accession The list of PDB structures mapping to a UniProt accession sorted by coverage of the protein and, if the same, resolution.
Here is the ranking algorithm described by the PDB paper: https://nar.oxfordjournals.org/content/44/D1/D385.full
“Finally, a single quality indicator is also calculated for each entry by taking the harmonic average of all the percentile scores representing model and model-data-fit quality measures and then subtracting 10 times the numerical value of the resolution (in Angstrom) of the entry to ensure that resolution plays a role in characterising the quality of a structure. This single empirical ‘quality measure’ value is used by the PDBe query system to sort results and identify the ‘best’ structure in a given context. At present, entries determined by methods other than X-ray crystallography do not have similar data quality information available and are not considered as ‘best structures’.”
Parameters: - uniprot_id (str) – UniProt Accession ID
- outname (str) – Basename of the output file of JSON results
- outdir (str) – Path to output directory of JSON results
- seq_ident_cutoff (float) – Cutoff results based on percent coverage (in decimal form)
- force_rerun (bool) – Obtain best structures mapping ignoring previously downloaded results
Returns: - Rank-ordered list of dictionaries representing chain-specific PDB entries. Keys are:
- pdb_id: the PDB ID which maps to the UniProt ID
- chain_id: the specific chain of the PDB which maps to the UniProt ID
- coverage: the percent coverage of the entire UniProt sequence
- resolution: the resolution of the structure
- start: the structure residue number which maps to the start of the mapped sequence
- end: the structure residue number which maps to the end of the mapped sequence
- unp_start: the sequence residue number which maps to the structure start
- unp_end: the sequence residue number which maps to the structure end
- experimental_method: type of experiment used to determine structure
- tax_id: taxonomic ID of the protein’s original organism
Return type: list
-
ssbio.databases.pdb.
blast_pdb
(seq, outfile='', outdir='', evalue=0.0001, seq_ident_cutoff=0.0, link=False, force_rerun=False)[source]¶ Returns a list of BLAST hits of a sequence to available structures in the PDB.
Parameters: - seq (str) – Your sequence, in string format
- outfile (str) – Name of output file
- outdir (str, optional) – Path to output directory. Default is the current directory.
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- link (bool, optional) – Set to True if a link to the HTML results should be displayed
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
Returns: Rank ordered list of BLAST hits in dictionaries.
Return type: list
-
ssbio.databases.pdb.
download_mmcif_header
(pdb_id, outdir='', force_rerun=False)[source]¶ Download a mmCIF header file from the RCSB PDB by ID.
Parameters: - pdb_id – PDB ID
- outdir – Optional output directory, default is current working directory
- force_rerun – If the file should be downloaded again even if it exists
Returns: Path to outfile
Return type: str
-
ssbio.databases.pdb.
download_sifts_xml
(pdb_id, outdir='', force_rerun=False)[source]¶ Download the SIFTS file for a PDB ID.
Parameters: - pdb_id (str) – PDB ID
- outdir (str) – Output directory, current working directory if not specified.
- force_rerun (bool) – If the file should be downloaded again even if it exists
Returns: Path to downloaded file
Return type: str
-
ssbio.databases.pdb.
download_structure
(pdb_id, file_type, outdir='', only_header=False, force_rerun=False)[source]¶ Download a structure from the RCSB PDB by ID. Specify the file type desired.
Parameters: - pdb_id – PDB ID
- file_type – pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz
- outdir – Optional output directory
- only_header – If only the header file should be downloaded
- force_rerun – If the file should be downloaded again even if it exists
Returns: Path to outfile
Return type: str
Deprecated since version 1.0: This will be removed in 2.0. Use Biopython’s PDBList.retrieve_pdb_file function instead
-
ssbio.databases.pdb.
get_bioassembly_info
(pdb_id, biomol_num, cache=False, outdir=None, force_rerun=False)[source]¶ Get metadata about a bioassembly from the RCSB PDB’s REST API.
See: https://www.rcsb.org/pdb/rest/bioassembly/bioassembly?structureId=1hv4&nr=1 The API returns an XML file containing the information on a biological assembly that looks like this:
<bioassembly structureId="1HV4" assemblyNr="1" method="PISA" desc="author_and_software_defined_assembly"> <transformations operator="1" chainIds="A,B,C,D"> <transformation index="1"> <matrix m11="1.00000000" m12="0.00000000" m13="0.00000000" m21="0.00000000" m22="1.00000000" m23="0.00000000" m31="0.00000000" m32="0.00000000" m33="1.00000000"/> <shift v1="0.00000000" v2="0.00000000" v3="0.00000000"/> </transformation> </transformations> </bioassembly>
Parameters: - pdb_id (str) – PDB ID
- biomol_num (int) – Biological assembly number you are interested in
- cache (bool) – If the XML file should be downloaded
- outdir (str) – If cache, then specify the output directory
- force_rerun (bool) – If cache, and if file exists, specify if API should be queried again
-
ssbio.databases.pdb.
get_num_bioassemblies
(pdb_id, cache=False, outdir=None, force_rerun=False)[source]¶ Check if there are bioassemblies using the PDB REST API, and if there are, get the number of bioassemblies available.
See: https://www.rcsb.org/pages/webservices/rest, section ‘List biological assemblies’
Not all PDB entries have biological assemblies available and some have multiple. Details that are necessary to recreate a biological assembly from the asymmetric unit can be accessed from the following requests.
- Number of biological assemblies associated with a PDB entry
- Access the transformation information needed to generate a biological assembly (nr=0 will return information for the asymmetric unit, nr=1 will return information for the first assembly, etc.)
A query of https://www.rcsb.org/pdb/rest/bioassembly/nrbioassemblies?structureId=1hv4 returns this:
<nrBioAssemblies structureId="1HV4" hasAssemblies="true" count="2"/>
Parameters: - pdb_id (str) – PDB ID
- cache (bool) – If the XML file should be downloaded
- outdir (str) – If cache, then specify the output directory
- force_rerun (bool) – If cache, and if file exists, specify if API should be queried again
-
ssbio.databases.pdb.
get_release_date
(pdb_id)[source]¶ Quick way to get the release date of a PDB ID using the table of results from the REST service
Returns None if the release date is not available.
Returns: Organism of a PDB ID Return type: str
-
ssbio.databases.pdb.
get_resolution
(pdb_id)[source]¶ Quick way to get the resolution of a PDB ID using the table of results from the REST service
Returns infinity if the resolution is not available.
Returns: resolution of a PDB ID in Angstroms Return type: float Todo
- Unit test
-
ssbio.databases.pdb.
map_uniprot_resnum_to_pdb
(uniprot_resnum, chain_id, sifts_file)[source]¶ Map a UniProt residue number to its corresponding PDB residue number.
This function requires that the SIFTS file be downloaded, and also a chain ID (as different chains may have different mappings).
Parameters: - uniprot_resnum (int) – integer of the residue number you’d like to map
- chain_id (str) – string of the PDB chain to map to
- sifts_file (str) – Path to the SIFTS XML file
Returns: tuple containing:
mapped_resnum (int): Mapped residue number is_observed (bool): Indicates if the 3D structure actually shows the residue
Return type: (tuple)
-
ssbio.databases.pdb.
parse_mmcif_header
(infile)[source]¶ Parse a couple important fields from the mmCIF file format with some manual curation of ligands.
If you want full access to the mmCIF file just use the MMCIF2Dict class in Biopython.
Parameters: infile – Path to mmCIF file Returns: Dictionary of parsed header Return type: dict
-
ssbio.databases.pdb.
parse_mmtf_header
(infile)[source]¶ Parse an MMTF file and return basic header-like information.
Parameters: infile (str) – Path to MMTF file Returns: Dictionary of parsed header Return type: dict Todo
- Can this be sped up by not parsing the 3D coordinate info somehow?
- OR just store the sequences when this happens since it is already being parsed.
PISA¶
-
ssbio.databases.pisa.
download_pisa_multimers_xml
(pdb_ids, save_single_xml_files=True, outdir=None, force_rerun=False)[source]¶ Download the PISA XML file for multimers.
See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info
- XML description of macromolecular assemblies:
- http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contain XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.
Parameters: - pdb_ids (str, list) – PDB ID or list of IDs
- save_single_xml_files (bool) – If single XML files should be saved per PDB ID. If False, if multiple PDB IDs are provided, then a single, combined XML output file is downloaded
- outdir (str) – Directory to output PISA XML files
- force_rerun (bool) – Redownload files if they already exist
Returns: of files downloaded
Return type: list
-
ssbio.databases.pisa.
parse_pisa_multimers_xml
(pisa_multimers_xml, download_structures=False, outdir=None, force_rerun=False)[source]¶ Retrieve PISA information from an XML results file
See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info
- XML description of macromolecular assemblies:
- http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contain XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.
Parameters: - pisa_multimers_xml (str) – Path to PISA XML output file
- download_structures (bool) – If assembly files should be downloaded
- outdir (str) – Directory to output assembly files
- force_rerun (bool) – Redownload files if they already exist
Returns: of parsed PISA information
Return type: dict
-
ssbio.databases.pisa.
pdb_chain_stoichiometry_biomolone
(pdbid)[source]¶ Get the stoichiometry of the chains in biological assembly 1 as a dictionary.
Steps taken are: 1) Download PDB and parse header, make biomolecule if provided 2) Count how many times each chain appears in biomolecule #1 3) Convert chain id to uniprot id 4) Return final dictionary
Parameters: pdbid (str) – 4 character PDB ID Returns: {(ChainID,UniProtID): # occurences} Return type: dict
SWISSMODEL¶
-
class
ssbio.databases.swissmodel.
SWISSMODEL
(metadata_dir)[source]¶ Methods to parse through a SWISS-MODEL metadata set.
Download a particular organism’s metadata from SWISS-MODEL here: https://swissmodel.expasy.org/repository
Parameters: metadata_dir (str) – Path to the extracted SWISS-MODEL_Repository folder -
all_models
= None¶ dict – Dictionary of lists, UniProt ID as the keys
-
download_models
(uniprot_acc, outdir='', force_rerun=False)[source]¶ Download all models available for a UniProt accession number.
Parameters: - uniprot_acc (str) – UniProt ACC/ID
- outdir (str) – Path to output directory, uses working directory if not set
- force_rerun (bool) – Force a redownload the models if they already exist
Returns: Paths to the downloaded models
Return type: list
-
get_model_filepath
(infodict)[source]¶ Get the path to the homology model using information from the index dictionary for a single model.
- Example: use self.get_models(UNIPROT_ID) to get all the models, which returns a list of dictionaries.
- Use one of those dictionaries as input to this function to get the filepath to the model itself.
Parameters: infodict (dict) – Information about a model from get_models Returns: Path to homology model Return type: str
-
get_models
(uniprot_acc)[source]¶ Return all available models for a UniProt accession number.
Parameters: uniprot_acc (str) – UniProt ACC/ID Returns: All available models in SWISS-MODEL for this UniProt entry Return type: dict
-
metadata_dir
= None¶ str – Path to the extracted SWISS-MODEL_Repository folder
-
metadata_index_json
¶ str – Path to the INDEX_JSON file.
-
organize_models
(outdir, force_rerun=False)[source]¶ Organize and rename SWISS-MODEL models to a single folder with a name containing template information.
Parameters: - outdir (str) – New directory to copy renamed models to
- force_rerun (bool) – If models should be copied again even if they already exist
Returns: Dictionary of lists, UniProt IDs as the keys and new file paths as the values
Return type: dict
-
uniprots_modeled
¶ list – Return all UniProt accession numbers with at least one model
-
-
ssbio.databases.swissmodel.
get_oligomeric_state
(swiss_model_path)[source]¶ Parse the oligomeric prediction in a SWISS-MODEL repository file
As of 2018-02-26, works on all E. coli models. Untested on other pre-made organism models.
Parameters: swiss_model_path (str) – Path to SWISS-MODEL PDB file Returns: Information parsed about the oligomeric state Return type: dict
-
ssbio.databases.swissmodel.
translate_ostat
(ostat)[source]¶ Translate the OSTAT field to an integer.
As of 2018-02-26, works on all E. coli models. Untested on other pre-made organism models.
Parameters: ostat (str) – Predicted oligomeric state of the PDB file Returns: Translated string to integer Return type: int
UniProtProp¶
-
class
ssbio.databases.uniprot.
UniProtProp
(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, xml_path=None, gff_path=None)[source]¶ Generic class to store information on a UniProt entry, extended from a SeqProp object.
The main utilities of this class are to:
- Download and/or parse UniProt text or xml files
- Store extra parsed information in attributes
-
uniprot
¶ str – Main UniProt accession code
-
alt_uniprots
¶ list – Alternate accession codes that point to the main one
-
file_type
¶ str – Metadata file type
-
reviewed
¶ bool – If this entry is a “reviewed” entry. If None, then status is unknown.
-
ec_number
¶ str – EC number
-
pfam
¶ list – PFAM IDs
-
entry_version
¶ str – Date of last update of the UniProt entry
-
seq_version
¶ str – Date of last update of the UniProt sequence
-
features
¶ list – Get the features from the feature file, metadata file, or in memory
-
ranking_score
()[source]¶ Provide a score for this UniProt ID based on reviewed (True=1, False=0) + number of PDBs
Returns: Scoring for this ID Return type: int
-
seq
¶ Seq – Get the Seq object from the sequence file, metadata file, or in memory
-
ssbio.databases.uniprot.
blast_uniprot
(seq_str, seq_ident=1, evalue=0.0001, reviewed_only=True, organism=None)[source]¶ BLAST the UniProt db to find what IDs match the sequence input
Parameters: - seq_str – Sequence string
- seq_ident – Percent identity to match
- evalue – E-value of BLAST hit
Returns:
-
ssbio.databases.uniprot.
download_uniprot_file
(uniprot_id, filetype, outdir='', force_rerun=False)[source]¶ Download a UniProt file for a UniProt ID/ACC
Parameters: - uniprot_id – Valid UniProt ID
- filetype – txt, fasta, xml, rdf, or gff
- outdir – Directory to download the file
Returns: Absolute path to file
Return type: str
-
ssbio.databases.uniprot.
get_fasta
(uniprot_id)[source]¶ Get the protein sequence for a UniProt ID as a string.
Parameters: uniprot_id – Valid UniProt ID Returns: String of the protein (amino acid) sequence Return type: str
-
ssbio.databases.uniprot.
is_valid_uniprot_id
(instring)[source]¶ Check if a string is a valid UniProt ID.
See regex from: http://www.uniprot.org/help/accession_numbers
Parameters: instring – any string identifier Returns: True if the string is a valid UniProt ID
-
ssbio.databases.uniprot.
old_parse_uniprot_txt_file
(infile)[source]¶ From: boscoh/uniprot github Parses the text of metadata retrieved from uniprot.org.
Only a few fields have been parsed, but this provides a template for the other fields.
A single description is generated from joining alternative descriptions.
Returns a dictionary with the main UNIPROT ACC as keys.
-
ssbio.databases.uniprot.
parse_uniprot_txt_file
(infile)[source]¶ Parse a raw UniProt metadata file and return a dictionary.
Parameters: infile – Path to metadata file Returns: Metadata dictionary Return type: dict
-
ssbio.databases.uniprot.
parse_uniprot_xml_metadata
(sr)[source]¶ Load relevant attributes and dbxrefs from a parsed UniProt XML file in a SeqRecord.
Returns: All parsed information Return type: dict
-
ssbio.databases.uniprot.
uniprot_ec
(uniprot_id)[source]¶ Retrieve the EC number annotation for a UniProt ID.
Parameters: uniprot_id – Valid UniProt ID Returns:
-
ssbio.databases.uniprot.
uniprot_reviewed_checker
(uniprot_id)[source]¶ Check if a single UniProt ID is reviewed or not.
Parameters: uniprot_id – Returns: If the entry is reviewed Return type: bool
-
ssbio.databases.uniprot.
uniprot_reviewed_checker_batch
(uniprot_ids)[source]¶ Batch check if uniprot IDs are reviewed or not
Parameters: uniprot_ids – UniProt ID or list of UniProt IDs Returns: Boolean} Return type: A dictionary of {UniProtID
-
ssbio.databases.uniprot.
uniprot_sites
(uniprot_id)[source]¶ Retrieve a list of UniProt sites parsed from the feature file
Sites are defined here: http://www.uniprot.org/help/site and here: http://www.uniprot.org/help/function_section
Parameters: uniprot_id – Valid UniProt ID Returns:
KEGGProp¶
-
class
ssbio.databases.kegg.
KEGGProp
(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, txt_path=None, gff_path=None)[source]¶
-
ssbio.databases.kegg.
download_kegg_aa_seq
(gene_id, outdir=None, force_rerun=False)[source]¶ Download a FASTA sequence of a protein from the KEGG database and return the path.
Parameters: - gene_id – the gene identifier
- outdir – optional path to output directory
Returns: Path to FASTA file
-
ssbio.databases.kegg.
download_kegg_gene_metadata
(gene_id, outdir=None, force_rerun=False)[source]¶ Download the KEGG flatfile for a KEGG ID and return the path.
Parameters: - gene_id – KEGG gene ID (with organism code), i.e. “eco:1244”
- outdir – optional output directory of metadata
Returns: Path to metadata file
-
ssbio.databases.kegg.
map_kegg_all_genes
(organism_code, target_db)[source]¶ Map all of an organism’s gene IDs to the target database.
This is faster than supplying a specific list of genes to map, plus there seems to be a limit on the number you can map with a manual REST query anyway.
Parameters: - organism_code – the three letter KEGG code of your organism
- target_db – ncbi-proteinid | ncbi-geneid | uniprot
Returns: Dictionary of ID mapping
ssbio.protein.structure.utils
¶
CleanPDB¶
-
ssbio.protein.structure.utils.cleanpdb.
clean_pdb
(pdb_file, out_suffix='_clean', outdir=None, force_rerun=False, remove_atom_alt=True, keep_atom_alt_id='A', remove_atom_hydrogen=True, add_atom_occ=True, remove_res_hetero=True, keep_chemicals=None, keep_res_only=None, add_chain_id_if_empty='X', keep_chains=None)[source]¶ Clean a PDB file.
Parameters: - pdb_file (str) – Path to input PDB file
- out_suffix (str) – Suffix to append to original filename
- outdir (str) – Path to output directory
- force_rerun (bool) – If structure should be re-cleaned if a clean file exists already
- remove_atom_alt (bool) – Remove alternate positions
- keep_atom_alt_id (str) – If removing alternate positions, which alternate ID to keep
- remove_atom_hydrogen (bool) – Remove hydrogen atoms
- add_atom_occ (bool) – Add atom occupancy fields if not present
- remove_res_hetero (bool) – Remove all HETATMs
- keep_chemicals (str, list) – If removing HETATMs, keep specified chemical names
- keep_res_only (str, list) – Keep ONLY specified resnames, deletes everything else!
- add_chain_id_if_empty (str) – Add a chain ID if not present
- keep_chains (str, list) – Keep only these chains
Returns: Path to cleaned PDB file
Return type: str
MutatePDB¶
DOCK¶
-
class
ssbio.protein.structure.utils.dock.
DOCK
(structure_id, pdb_file, amb_file, flex1_file, flex2_file, root_dir=None)[source]¶ Class to prepare a structure file for docking with DOCK6.
Attributes:
-
auto_flexdock
(binding_residues, radius, ligand_path=None, force_rerun=False)[source]¶ Run DOCK6 on a PDB file, given its binding residues and a radius around them.
Provide a path to a ligand to dock a ligand to it. If no ligand is provided, DOCK6 preparations will be run on that structure file.
Parameters: - binding_residues (str) – Comma separated string of residues (eg: ‘144,170,199’)
- radius (int, float) – Radius around binding residues to dock to
- ligand_path (str) – Path to ligand (mol2 format) to dock to protein
- force_rerun (bool) – If method should be rerun even if output files exist
-
binding_site_mol2
(residues, force_rerun=False)[source]¶ Create mol2 of only binding site residues from the receptor
This function will take in a .pdb file (preferably the _receptor_noH.pdb file) and a string of residues (eg: ‘144,170,199’) and delete all other residues in the .pdb file. It then saves the coordinates of the selected residues as a .mol2 file. This is necessary for Chimera to select spheres within the radius of the binding site.
Parameters: - residues (str) – Comma separated string of residues (eg: ‘144,170,199’)
- force_rerun (bool) – If method should be rerun even if output file exists
-
dms_maker
(force_rerun=False)[source]¶ Create surface representation (dms file) of receptor
Parameters: force_rerun (bool) – If method should be rerun even if output file exists
-
do_dock6_flexible
(ligand_path, force_rerun=False)[source]¶ Dock a ligand to the protein.
Parameters: - ligand_path (str) – Path to ligand (mol2 format) to dock to protein
- force_rerun (bool) – If method should be rerun even if output file exists
-
dock_dir
¶ str – DOCK folder
-
dockprep
(force_rerun=False)[source]¶ Prepare a PDB file for docking by first converting it to mol2 format.
Parameters: force_rerun (bool) – If method should be rerun even if output file exists
-
grid
(force_rerun=False)[source]¶ Create the scoring grid within the dummy box.
Parameters: force_rerun (bool) – If method should be rerun even if output file exists
-
protein_only_and_noH
(keep_ligands=None, force_rerun=False)[source]¶ Isolate the receptor by stripping everything except protein and specified ligands.
Parameters: - keep_ligands (str, list) – Ligand(s) to keep in PDB file
- force_rerun (bool) – If method should be rerun even if output file exists
-
root_dir
¶ str – Directory where DOCK project folder is located
-
showbox
(force_rerun=False)[source]¶ Create the dummy PDB box around the selected spheres.
Parameters: force_rerun (bool) – If method should be rerun even if output file exists
-
ssbio.protein.structure.properties
¶
Structure Residues¶
-
ssbio.protein.structure.properties.residues.
distance_to_site
(residue_of_interest, residues, model)[source]¶ Calculate the distance between an amino acid and a group of amino acids.
Parameters: - residue_of_interest – Residue number you are interested in (ie. a mutation)
- residues – List of residue numbers
Returns: Distance (in Angstroms) to the group of residues
Return type: float
-
ssbio.protein.structure.properties.residues.
get_structure_seqrecords
(model)[source]¶ Get a dictionary of a PDB file’s sequences.
- Special cases include:
- Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
- HETATMs. Currently written as an “X”, or unknown amino acid.
Parameters: model – Biopython Model object of a Structure Returns: List of SeqRecords Return type: list
-
ssbio.protein.structure.properties.residues.
get_structure_seqs
(pdb_file, file_type)[source]¶ Get a dictionary of a PDB file’s sequences.
- Special cases include:
- Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
- HETATMs. Currently written as an “X”, or unknown amino acid.
Parameters: pdb_file – Path to PDB file Returns: Dictionary of: {chain_id: sequence} Return type: dict
-
ssbio.protein.structure.properties.residues.
hse_output
(pdb_file, file_type)[source]¶ The solvent exposure of an amino acid residue is important for analyzing, understanding and predicting aspects of protein structure and function [73]. A residue’s solvent exposure can be classified as four categories: exposed, partly exposed, buried and deeply buried residues. Hamelryck et al. [73] established a new 2D measure that provides a different view of solvent exposure, i.e. half-sphere exposure (HSE). By conceptually dividing the sphere of a residue into two halves- HSE-up and HSE-down, HSE provides a more detailed description of an amino acid residue’s spatial neighborhood. HSE is calculated by the hsexpo module implemented in the BioPython package [74] from a PDB file.
http://onlinelibrary.wiley.com/doi/10.1002/prot.20379/abstract
Parameters: pdb_file – Returns:
-
ssbio.protein.structure.properties.residues.
match_structure_sequence
(orig_seq, new_seq, match='X', fill_with='X', ignore_excess=False)[source]¶ Correct a sequence to match inserted X’s in a structure sequence
- This is useful for mapping a sequence obtained from structural tools like MSMS or DSSP
- to the sequence obtained by the get_structure_seqs method.
Examples
>>> structure_seq = 'XXXABCDEF' >>> prop_list = [4, 5, 6, 7, 8, 9] >>> match_structure_sequence(structure_seq, prop_list) ['X', 'X', 'X', 4, 5, 6, 7, 8, 9]
>>> match_structure_sequence(structure_seq, prop_list, fill_with=float('Inf')) [inf, inf, inf, 4, 5, 6, 7, 8, 9]
>>> structure_seq = '---ABCDEF---' >>> prop_list = ('H','H','H','C','C','C') >>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-') ('-', '-', '-', 'H', 'H', 'H', 'C', 'C', 'C', '-', '-', '-')
>>> structure_seq = 'ABCDEF---' >>> prop_list = 'HHHCCC' >>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-') 'HHHCCC---'
>>> structure_seq = 'AXBXCXDXEXF' >>> prop_list = ['H', 'H', 'H', 'C', 'C', 'C'] >>> match_structure_sequence(structure_seq, prop_list, match='X', fill_with='X') ['H', 'X', 'H', 'X', 'H', 'X', 'C', 'X', 'C', 'X', 'C']
Parameters: - orig_seq (str, Seq, SeqRecord) – Sequence to match to
- new_seq (str, tuple, list) – Sequence to fill in
- match (str) – What to match
- fill_with – What to fill in when matches are found
- ignore_excess (bool) – If excess sequence on the tail end of new_seq should be ignored
Returns: new_seq which will match the length of orig_seq
Return type: str, tuple, list
-
ssbio.protein.structure.properties.residues.
resname_in_proximity
(resname, model, chains, resnums, threshold=5)[source]¶ Search within the proximity of a defined list of residue numbers and their chains for any specifed residue name.
Parameters: - resname (str) – Residue name to search for in proximity of specified chains + resnums
- model – Biopython Model object
- chains (str, list) – Chain ID or IDs to check
- resnums (int, list) – Residue numbers within the chain to check
- threshold (float) – Cutoff in Angstroms for returning True if a RESNAME is near
Returns: True if a RESNAME is within the threshold cutoff
Return type: bool
-
ssbio.protein.structure.properties.residues.
search_ss_bonds
(model, threshold=3.0)[source]¶ Searches S-S bonds based on distances between atoms in the structure (first model only). Average distance is 2.05A. Threshold is 3A default. Returns iterator with tuples of residues.
ADAPTED FROM JOAO RODRIGUES’ BIOPYTHON GSOC PROJECT (http://biopython.org/wiki/GSOC2010_Joao)
ssbio.protein.sequence.utils
¶
Sequence Alignment¶
-
ssbio.protein.sequence.utils.alignment.
get_alignment_df
(a_aln_seq, b_aln_seq, a_seq_id=None, b_seq_id=None)[source]¶ Summarize two alignment strings in a dataframe.
Parameters: - a_aln_seq (str) – Aligned sequence string
- b_aln_seq (str) – Aligned sequence string
- a_seq_id (str) – Optional ID of a_seq
- b_seq_id (str) – Optional ID of b_aln_seq
Returns: a per-residue level annotation of the alignment
Return type: DataFrame
-
ssbio.protein.sequence.utils.alignment.
get_alignment_df_from_file
(alignment_file, a_seq_id=None, b_seq_id=None)[source]¶ Get a Pandas DataFrame of the Needle alignment results. Contains all positions of the sequences.
Parameters: - alignment_file –
- a_seq_id – Optional specification of the ID of the reference sequence
- b_seq_id – Optional specification of the ID of the aligned sequence
Returns: all positions in the alignment
Return type: Pandas DataFrame
-
ssbio.protein.sequence.utils.alignment.
get_deletions
(aln_df)[source]¶ Get a list of tuples indicating the first and last residues of a deletion region, as well as the length of the deletion.
Examples
# Deletion of residues 1 to 4, length 4 >>> test = {‘id_a’: {0: ‘a’, 1: ‘a’, 2: ‘a’, 3: ‘a’}, ‘id_a_aa’: {0: ‘M’, 1: ‘G’, 2: ‘I’, 3: ‘T’}, ‘id_a_pos’: {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}, ‘id_b’: {0: ‘b’, 1: ‘b’, 2: ‘b’, 3: ‘b’}, ‘id_b_aa’: {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan}, ‘id_b_pos’: {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan}, ‘type’: {0: ‘deletion’, 1: ‘deletion’, 2: ‘deletion’, 3: ‘deletion’}} >>> my_alignment = pd.DataFrame.from_dict(test) >>> get_deletions(my_alignment) [((1.0, 4.0), 4)]
Parameters: aln_df (DataFrame) – Alignment DataFrame Returns: A list of tuples with the format ((deletion_start_resnum, deletion_end_resnum), deletion_length) Return type: list
-
ssbio.protein.sequence.utils.alignment.
get_insertions
(aln_df)[source]¶ Get a list of tuples indicating the first and last residues of a insertion region, as well as the length of the insertion.
- If the first tuple is:
- (-1, 1) that means the insertion is at the beginning of the original protein (X, Inf) where X is the length of the original protein, that means the insertion is at the end of the protein
Examples
# Insertion at beginning, length 3 >>> test = {‘id_a’: {0: ‘a’, 1: ‘a’, 2: ‘a’, 3: ‘a’}, ‘id_a_aa’: {0: np.nan, 1: np.nan, 2: np.nan, 3: ‘M’}, ‘id_a_pos’: {0: np.nan, 1: np.nan, 2: np.nan, 3: 1.0}, ‘id_b’: {0: ‘b’, 1: ‘b’, 2: ‘b’, 3: ‘b’}, ‘id_b_aa’: {0: ‘M’, 1: ‘M’, 2: ‘L’, 3: ‘M’}, ‘id_b_pos’: {0: 1, 1: 2, 2: 3, 3: 4}, ‘type’: {0: ‘insertion’, 1: ‘insertion’, 2: ‘insertion’, 3: ‘match’}} >>> my_alignment = pd.DataFrame.from_dict(test) >>> get_insertions(my_alignment) [((-1, 1.0), 3)]
Parameters: aln_df (DataFrame) – Alignment DataFrame Returns: A list of tuples with the format ((insertion_start_resnum, insertion_end_resnum), insertion_length) Return type: list
-
ssbio.protein.sequence.utils.alignment.
get_mutations
(aln_df)[source]¶ Get a list of residue numbers (in the original sequence’s numbering) that are mutated
Parameters: - aln_df (DataFrame) – Alignment DataFrame
- just_resnums – If only the residue numbers should be returned, instead of a list of tuples of (original_residue, resnum, mutated_residue)
Returns: Residue mutations
Return type: list
-
ssbio.protein.sequence.utils.alignment.
get_percent_identity
(a_aln_seq, b_aln_seq)[source]¶ Get the percent identity between two alignment strings
-
ssbio.protein.sequence.utils.alignment.
get_unresolved
(aln_df)[source]¶ Get a list of residue numbers (in the original sequence’s numbering) that are unresolved
Parameters: aln_df (DataFrame) – Alignment DataFrame Returns: Residue numbers that are mutated Return type: list
-
ssbio.protein.sequence.utils.alignment.
map_resnum_a_to_resnum_b
(a_resnum, a_aln, b_aln)[source]¶ Map a residue number in a sequence to the corresponding residue number in an aligned sequence.
Examples: >>> map_resnum_a_to_resnum_b(5, ‘–ABCDEF’, ‘XXABCDEF’) 7
Parameters: - a_resnum (int) – Residue number in the first aligned sequence
- a_aln (str, Seq, SeqRecord) – Aligned sequence string
- b_aln (str, Seq, SeqRecord) – Aligned sequence string
Returns: Residue number in the second aligned sequence
Return type: int
-
ssbio.protein.sequence.utils.alignment.
needle_statistics
(infile)[source]¶ Reads in a needle alignment file and spits out statistics of the alignment.
Parameters: infile (str) – Alignment file name Returns: alignment_properties - a dictionary telling you the number of gaps, identity, etc. Return type: dict
-
ssbio.protein.sequence.utils.alignment.
pairwise_sequence_alignment
(a_seq, b_seq, engine, a_seq_id=None, b_seq_id=None, gapopen=10, gapextend=0.5, outfile=None, outdir=None, force_rerun=False)[source]¶ Run a global pairwise sequence alignment between two sequence strings.
Parameters: - a_seq (str, Seq, SeqRecord, SeqProp) – Reference sequence
- b_seq (str, Seq, SeqRecord, SeqProp) – Sequence to be aligned to reference
- engine (str) – biopython or needle - which pairwise alignment program to use
- a_seq_id (str) – Reference sequence ID. If not set, is “a_seq”
- b_seq_id (str) – Sequence to be aligned ID. If not set, is “b_seq”
- gapopen (int) – Only for needle - Gap open penalty is the score taken away when a gap is created
- gapextend (float) – Only for needle - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
- outfile (str) – Only for needle - name of output file. If not set, is {id_a}_{id_b}_align.txt
- outdir (str) – Only for needle - Path to output directory. Default is the current directory.
- force_rerun (bool) – Only for needle - Default False, set to True if you want to rerun the alignment if outfile exists.
Returns: Biopython object to represent an alignment
Return type: MultipleSeqAlignment
-
ssbio.protein.sequence.utils.alignment.
run_needle_alignment
(seq_a, seq_b, gapopen=10, gapextend=0.5, outdir=None, outfile=None, force_rerun=False)[source]¶ Run the needle alignment program for two strings and return the raw alignment result.
More info: EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84 Using strings as input: https://www.biostars.org/p/91124/
Parameters: - id_a – ID of reference sequence
- seq_a (str, Seq, SeqRecord) – Reference sequence
- id_b – ID of sequence to be aligned
- seq_b (str, Seq, SeqRecord) – String representation of sequence to be aligned
- gapopen – Gap open penalty is the score taken away when a gap is created
- gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
- outdir (str, optional) – Path to output directory. Default is the current directory.
- outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt
- force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns: Raw alignment result of the needle alignment in srspair format.
Return type: str
-
ssbio.protein.sequence.utils.alignment.
run_needle_alignment_on_files
(id_a, faa_a, id_b, faa_b, gapopen=10, gapextend=0.5, outdir='', outfile='', force_rerun=False)[source]¶ Run the needle alignment program for two fasta files and return the raw alignment result.
More info: EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84
Parameters: - id_a – ID of reference sequence
- faa_a – File path to reference sequence
- id_b – ID of sequence to be aligned
- faa_b – File path to sequence to be aligned
- gapopen – Gap open penalty is the score taken away when a gap is created
- gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
- outdir (str, optional) – Path to output directory. Default is the current directory.
- outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt
- force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns: Raw alignment result of the needle alignment in srspair format.
Return type: str
Sequence BLAST¶
-
ssbio.protein.sequence.utils.blast.
calculate_bbh
(blast_results_1, blast_results_2, r_name=None, g_name=None, outdir='')[source]¶ Calculate the best bidirectional BLAST hits (BBH) and save a dataframe of results.
Parameters: - blast_results_1 (str) – BLAST results for reference vs. other genome
- blast_results_2 (str) – BLAST results for other vs. reference genome
- r_name – Name of reference genome
- g_name – Name of other genome
- outdir – Directory where BLAST results are stored.
Returns: Path to Pandas DataFrame of the BBH results.
-
ssbio.protein.sequence.utils.blast.
create_orthology_matrix
(r_name, genome_to_bbh_files, pid_cutoff=None, bitscore_cutoff=None, evalue_cutoff=None, filter_condition='OR', outname='', outdir='', force_rerun=False)[source]¶ Create an orthology matrix using best bidirectional BLAST hits (BBH) outputs.
Parameters: - r_name (str) – Name of the reference genome
- genome_to_bbh_files (dict) – Mapping of genome names to the BBH csv output from the
calculate_bbh()
method - pid_cutoff (float) – Minimum percent identity between BLAST hits to filter for in the range [0, 100]
- bitscore_cutoff (float) – Minimum bitscore allowed between BLAST hits
- evalue_cutoff (float) – Maximum E-value allowed between BLAST hits
- filter_condition (str) – ‘OR’ or ‘AND’, how to combine cutoff filters. ‘OR’ gives more results since it is less stringent, as you will be filtering for hits with (>80% PID or >30 bitscore or <0.0001 evalue).
- outname – Name of output file of orthology matrix
- outdir – Path to output directory
- force_rerun (bool) – Force recreation of the orthology matrix even if the outfile exists
Returns: Path to orthologous genes matrix.
Return type: str
-
ssbio.protein.sequence.utils.blast.
print_run_bidirectional_blast
(reference, other_genome, dbtype, outdir)[source]¶ Write torque submission files for running bidirectional blast on a server and print execution command.
Parameters: - reference (str) – Path to “reference” genome, aka your “base strain”
- other_genome (str) – Path to other genome which will be BLASTed to the reference
- dbtype (str) – “nucl” or “prot” - what format your genome files are in
- outdir (str) – Path to folder where Torque scripts should be placed
-
ssbio.protein.sequence.utils.blast.
run_bidirectional_blast
(reference, other_genome, dbtype, outdir='')[source]¶ BLAST a genome against another, and vice versa.
This function requires BLAST to be installed, do so by running: sudo apt install ncbi-blast+
Parameters: - reference (str) – path to “reference” genome, aka your “base strain”
- other_genome (str) – path to other genome which will be BLASTed to the reference
- dbtype (str) – “nucl” or “prot” - what format your genome files are in
- outdir (str) – path to folder where BLAST outputs should be placed
Returns: Paths to BLAST output files. (reference_vs_othergenome.out, othergenome_vs_reference.out)
-
ssbio.protein.sequence.utils.blast.
run_makeblastdb
(infile, dbtype, outdir='')[source]¶ Make the BLAST database for a genome file.
Parameters: - infile (str) – path to genome FASTA file
- dbtype (str) – “nucl” or “prot” - what format your genome files are in
- outdir (str) – path to directory to output database files (default is original folder)
Returns: Paths to BLAST databases.
ssbio.protein.sequence.properties
¶
Thermostability¶
- This module provides functions to predict thermostability parameters (specifically the free energy of unfolding dG)
- of an amino acid sequence.
These methods are adapted from:
- Oobatake, M., & Ooi, T. (1993). ‘Hydration and heat stability effects on protein unfolding’,
- Progress in biophysics and molecular biology, 59/3: 237–84.
- Dill, K. A., Ghosh, K., & Schmit, J. D. (2011). ‘Physical limits of cells and proteomes’,
- Proceedings of the National Academy of Sciences of the United States of America, 108/44: 17876–82. DOI: 10.1073/pnas.1114477108
For an example of usage of these parameters in a genome-scale model:
- Chen, K., Gao, Y., Mih, N., O’Brien, E., Yang, L., Palsson, B.O. (2017).
- ‘Thermo-sensitivity of growth is determined by chaperone-mediated proteome re-allocation.’, Submitted to PNAS.
-
ssbio.protein.sequence.properties.thermostability.
calculate_dill_dG
(seq_len, temp)[source]¶ Get free energy of unfolding (dG) using Dill method in units J/mol.
Parameters: - seq_len (int) – Length of amino acid sequence
- temp (float) – Temperature in degrees C
Returns: Free energy of unfolding dG (J/mol)
Return type: float
-
ssbio.protein.sequence.properties.thermostability.
calculate_oobatake_dG
(seq, temp)[source]¶ Get free energy of unfolding (dG) using Oobatake method in units cal/mol.
Parameters: - seq (str, Seq, SeqRecord) – Amino acid sequence
- temp (float) – Temperature in degrees C
Returns: Free energy of unfolding dG (J/mol)
Return type: float
-
ssbio.protein.sequence.properties.thermostability.
calculate_oobatake_dH
(seq, temp)[source]¶ Get dH using Oobatake method in units cal/mol.
Parameters: - seq (str, Seq, SeqRecord) – Amino acid sequence
- temp (float) – Temperature in degrees C
Returns: dH in units cal/mol
Return type: float
-
ssbio.protein.sequence.properties.thermostability.
calculate_oobatake_dS
(seq, temp)[source]¶ Get dS using Oobatake method in units cal/mol.
Parameters: - seq (str, Seq, SeqRecord) – Amino acid sequence
- temp (float) – Temperature in degrees C
Returns: dS in units cal/mol
Return type: float
-
ssbio.protein.sequence.properties.thermostability.
get_dG_at_T
(seq, temp)[source]¶ Predict dG at temperature T, using best predictions from Dill or Oobatake methods.
Parameters: - seq (str, Seq, SeqRecord) – Amino acid sequence
- temp (float) – Temperature in degrees C
Returns: tuple containing:
dG (float) Free energy of unfolding dG (cal/mol) keq (float): Equilibrium constant Keq method (str): Method used to calculate
Return type: (tuple)