Python API

class ssbio.core.protein.Protein(ident, description=None, root_dir=None, pdb_file_type='cif')[source]

Store information on a protein that represents the translated unit of a gene.

The main utilities of this class are to:

  1. Load, parse, and store multiple sources of the same or similar (i.e., from different strains) protein sequences as SeqProp objects in the sequences attribute
  2. Load, parse, and store multiple experimental or predicted protein structures as StructProp objects in the structures attribute
  3. Calculate, store, and access sequence alignments to stored sequences or structures
  4. Provide summaries of alignments and mutations seen
  5. Map between residue numbers of sequences and structures
id

str – Unique identifier for this protein

description

str – Optional description for this protein

sequences

DictList – Stored amino acid sequences related to this protein

structures

DictList – Stored protein structures which are related to this protein

representative_sequence

SeqProp – Sequence set to represent this protein

representative_structure

StructProp – Structure set to represent this protein, optionally in monomeric form

representative_chain

str – Chain ID in the representative structure which best represents a sequence

representative_chain_seq_coverage

float – Percent identity of sequence coverage for the representative chain

sequence_alignments

DictList – Pairwise or multiple sequence alignments stored as Bio.Align.MultipleSeqAlignment objects

structure_alignments

DictList – Pairwise or multiple structure alignments - incomplete implementation

root_dir

str – Path to where the folder named by this protein’s ID will be created. Default is current working directory.

pdb_file_type

str – File type for files downloaded from the PDB: pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, or mmtf.gz

align_seqprop_to_structprop(seqprop, structprop, chains=None, outdir=None, engine='needle', parse=True, force_rerun=False, **kwargs)[source]

Run and store alignments of a SeqProp to chains in the mapped_chains attribute of a StructProp.

Alignments are stored in the sequence_alignments attribute, with IDs formatted as <SeqProp_ID>_<StructProp_ID>-<Chain_ID>. Although it is more intuitive to align to individual ChainProps, StructProps should be loaded as little as possible to reduce run times.
Parameters:
  • seqprop (SeqProp) – SeqProp object with a loaded sequence
  • structprop (StructProp) – StructProp object with a loaded structure
  • chains (str, list) – Chain ID or IDs to map to. If not specified, mapped_chains attribute is inspected for chains. If no chains there, all chains will be aligned to.
  • outdir (str) – Directory to output sequence alignment files (only if running with needle)
  • engine (str) – Which pairwise alignment tool to use (“needle” or “biopython”)
  • parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
  • force_rerun (bool) – If alignments should be rerun
  • **kwargs – Other alignment options
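
The alignment ID convention described above (<SeqProp_ID>_<StructProp_ID>-<Chain_ID>) can be illustrated with a small sketch. The helper names and example IDs below are hypothetical and not part of ssbio. The parser assumes that the StructProp ID contains no underscore and the chain ID contains no hyphen.

```python
def make_alignment_id(seqprop_id, structprop_id, chain_id):
    """Build an ID of the form <SeqProp_ID>_<StructProp_ID>-<Chain_ID>."""
    return "{}_{}-{}".format(seqprop_id, structprop_id, chain_id)


def split_alignment_id(alignment_id):
    """Split an alignment ID back into its three parts.

    Splitting from the right keeps underscores inside the SeqProp ID intact
    (assumption: the StructProp ID has no underscore, the chain ID no hyphen).
    """
    prefix, chain_id = alignment_id.rsplit("-", 1)
    seqprop_id, structprop_id = prefix.rsplit("_", 1)
    return seqprop_id, structprop_id, chain_id


# Hypothetical IDs, for illustration only
example_id = make_alignment_id("NP_414542.1", "1a69", "B")
```

Such a helper may be convenient when looking up a specific entry in the sequence_alignments DictList by ID.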
blast_representative_sequence_to_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]

BLAST the representative sequence against the PDB and return the list of new structures added. The results are also saved in df_pdb_blast.

df_homology_models

Get a dataframe of homology models

df_pdb_blast

Get a dataframe of PDB BLAST results

df_pdb_metadata

Get a dataframe of PDB metadata (PDBs have to be downloaded first)

df_pdb_ranking

Get a dataframe of UniProt -> best structure in PDB results

filter_sequences(seq_type)[source]

Return a DictList of only specified types in the sequences attribute.

Parameters:seq_type (SeqProp) – Object type
Returns:A filtered DictList of specified object type only
Return type:DictList
find_representative_chain(seqprop, structprop, chains_to_check=None, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True)[source]

Set and return the representative chain based on sequence quality checks to a reference sequence.

Parameters:
  • seqprop (SeqProp) – SeqProp object to compare to chain sequences
  • structprop (StructProp) – StructProp object with chains to compare to in the mapped_chains attribute. If there are none present, chains_to_check can be specified, otherwise all chains are checked.
  • chains_to_check (str, list) – Chain ID or IDs to check for sequence coverage quality
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
Returns:

the best chain ID, if any

Return type:

str
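
The allow_missing_on_termini example above (0.1 on a 100 AA reference means only residues 5 to 95 are checked) can be worked through in a short sketch. This is not ssbio's implementation: the function name is hypothetical, and the rounding and boundary conventions are assumptions chosen to reproduce the documented example.

```python
def checked_residue_window(ref_length, allow_missing_on_termini):
    """Return the (first, last) residue numbers that would be checked.

    The ignored fraction of the reference length is split evenly between
    the N- and C-termini (assumption based on the documented example).
    """
    ignored_per_terminus = int(round(ref_length * allow_missing_on_termini / 2.0))
    # Clamp to 1 so that a fraction of 0 yields the full 1-based range
    first_checked = max(1, ignored_per_terminus)
    last_checked = ref_length - ignored_per_terminus
    return first_checked, last_checked


window = checked_residue_window(100, 0.1)
```

With the default of 0.2, the same 100 AA reference would leave residues 10 to 90 under these assumptions.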

get_disulfide_bridges(representative_only=True)[source]

Run Biopython’s disulfide bridge finder and store found bridges. Annotations are stored in the protein structure’s chain sequence at: seq_record.annotations['SSBOND-biopython']

Parameters:representative_only (bool) – If analysis should only be run on the representative structure
get_dssp_annotations(representative_only=True, force_rerun=False)[source]

Run DSSP on structures and store calculations. Annotations are stored in the protein structure’s chain sequence at: seq_record.letter_annotations['*-dssp']

Todo

  • Some errors arise from storing annotations for nonstandard amino acids, need to run DSSP separately for those
Parameters:
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_experimental_structures()[source]

DictList: Return a DictList of all experimental structures in self.structures

get_freesasa_annotations(include_hetatms=False, representative_only=True, force_rerun=False)[source]

Run freesasa on structures and store calculations. Annotations are stored in the protein structure’s chain sequence at: seq_record.letter_annotations['*-freesasa']

Parameters:
  • include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_homology_models()[source]

DictList: Return a DictList of all homology models in self.structures

get_msms_annotations(representative_only=True, force_rerun=False)[source]

Run MSMS on structures and store calculations. Annotations are stored in the protein structure’s chain sequence at: seq_record.letter_annotations['*-msms']

Parameters:
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_residue_annotations(seq_resnum, seq_id=None, struct_id=None, struct_chain_id=None, use_representatives=True)[source]

Get all residue-level annotations stored in the SeqRecord letter_annotations field for a given residue number

Uses the representative sequence, structure, and chain ID stored by default. If other properties from other structures are desired, input the proper IDs. An alignment for the given sequence to the structure must be present in the sequence_alignments list.
Parameters:
  • seq_resnum (int) – Residue number in the sequence
  • seq_id (str) – ID of the sequence
  • struct_id (str) – ID of the structure
  • struct_chain_id (str) – ID of the structure’s chain
  • use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
Returns:

All available letter_annotations for this residue number

Return type:

dict
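
To make the idea of residue-level lookup concrete, here is a minimal sketch of pulling every per-residue annotation for one residue number out of a letter_annotations-style dictionary. It uses a plain dict rather than a Biopython SeqRecord, and the annotation keys are hypothetical names that merely follow the '*-dssp' and '*-freesasa' patterns mentioned in this reference.

```python
def residue_annotations(letter_annotations, resnum):
    """Collect all per-residue annotations for a 1-based residue number."""
    index = resnum - 1
    result = {}
    for key, values in letter_annotations.items():
        # Each value holds one entry per residue (a string or a list)
        if 0 <= index < len(values):
            result[key] = values[index]
    return result


# Hypothetical annotations for a 4-residue sequence
example_annotations = {
    "SS-dssp": "-HHH",
    "RSA-freesasa": [0.9, 0.4, 0.1, 0.3],
}
annotations_for_residue_2 = residue_annotations(example_annotations, 2)
```

The documented method additionally maps the sequence residue number onto a structure chain through a stored alignment, which this sketch leaves out.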

get_sequence_properties(representative_only=True)[source]

Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of the protein sequences. Annotations are stored in the protein’s respective seqprop objects at: .seq_record.annotations

Parameters:representative_only (bool) – If analysis should only be run on the representative sequence
load_itasser_folder(ident, itasser_folder, organize=False, outdir=None, organize_name=None, set_as_representative=False, representative_chain='X', create_dfs=False, force_rerun=False)[source]

Load the results folder from an I-TASSER run (local, not from the server), copy structure files over, and create summary dataframes.

Parameters:
  • ident – I-TASSER ID
  • itasser_folder – Path to results folder
  • organize (bool) – If select files from modeling should be copied to the Protein directory
  • outdir (str) – Path to directory where files will be copied and organized to
  • organize_name (str) – Basename of files to rename results to. If not provided, will use id attribute.
  • set_as_representative – If this structure should be set as the representative structure
  • representative_chain (str) – If set_as_representative is True, provide the representative chain ID
  • create_dfs – If summary dataframes should be created
  • force_rerun (bool) – If the PDB should be reloaded if it is already in the list of structures
Returns:

The object that is now contained in the structures attribute

Return type:

ITASSERProp

load_kegg(kegg_id, kegg_organism_code=None, kegg_seq_file=None, kegg_metadata_file=None, set_as_representative=False, download=False, outdir=None, force_rerun=False)[source]

Load a KEGG ID, sequence, and metadata files into the sequences attribute.

Parameters:
  • kegg_id (str) – KEGG ID
  • kegg_organism_code (str) – KEGG organism code to prepend to the kegg_id if not part of it already. Example: eco:b1244, eco is the organism code
  • kegg_seq_file (str) – Path to KEGG FASTA file
  • kegg_metadata_file (str) – Path to KEGG metadata file (raw KEGG format)
  • set_as_representative (bool) – If this KEGG ID should be set as the representative sequence
  • download (bool) – If the KEGG sequence and metadata files should be downloaded if not provided
  • outdir (str) – Where the sequence and metadata files should be downloaded to
  • force_rerun (bool) – If ID should be reloaded and files redownloaded
Returns:

object contained in the sequences attribute

Return type:

KEGGProp

load_manual_sequence(seq, ident=None, write_fasta_file=False, outname=None, outdir=None, set_as_representative=False, force_rewrite=False)[source]

Load a manual sequence given as a string and optionally set it as the representative sequence. Also store it in the sequences attribute.

Parameters:
  • seq (str, Seq, SeqRecord) – Sequence string, Biopython Seq or SeqRecord object
  • ident (str) – Optional identifier for the sequence, required if seq is a string. If set, it also overrides existing IDs in Seq or SeqRecord objects.
  • write_fasta_file
  • outname
  • outdir
  • set_as_representative
  • force_rewrite

load_manual_sequence_file(ident, seq_file, copy_file=False, outdir=None, set_as_representative=False)[source]

Load a manual sequence given as a FASTA file and optionally set it as the representative sequence. Also store it in the sequences attribute.

Parameters:
  • ident
  • seq_file
  • copy_file
  • outdir
  • set_as_representative
load_pdb(pdb_id, mapped_chains=None, pdb_file=None, file_type=None, is_experimental=True, set_as_representative=False, representative_chain=None, force_rerun=False)[source]

Load a structure ID and optional structure file into the structures attribute.

Parameters:
  • pdb_id (str) – PDB ID
  • mapped_chains (str, list) – Chain ID or list of IDs which you are interested in
  • pdb_file (str) – Path to PDB file
  • file_type (str) – Type of PDB file
  • is_experimental (bool) – If this structure file is experimental
  • set_as_representative (bool) – If this structure should be set as the representative structure
  • representative_chain (str) – If set_as_representative is True, provide the representative chain ID
  • force_rerun (bool) – If the PDB should be reloaded if it is already in the list of structures
Returns:

The object that is now contained in the structures attribute

Return type:

PDBProp

load_uniprot(uniprot_id, uniprot_seq_file=None, uniprot_xml_file=None, download=False, outdir=None, set_as_representative=False, force_rerun=False)[source]

Load a UniProt ID and associated sequence/metadata files into the sequences attribute.

Sequence and metadata files can be provided, or alternatively downloaded with the download flag set to True.
Metadata files will be downloaded as XML files.
Parameters:
  • uniprot_id (str) – UniProt ID/ACC
  • uniprot_seq_file (str) – Path to FASTA file
  • uniprot_xml_file (str) – Path to UniProt XML file
  • download (bool) – If sequence and metadata files should be downloaded
  • outdir (str) – Output directory for sequence and metadata files
  • set_as_representative (bool) – If this sequence should be set as the representative one
  • force_rerun (bool) – If files should be redownloaded and metadata reloaded
Returns:

sequence that was loaded into the sequences attribute

Return type:

UniProtProp

map_seqprop_resnums_to_structprop_chain_index(resnums, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Map a residue number in the seqprop to the mapping index in the structprop/chain_id

Use this to get the indices of the chain to then get structure residue number.

Parameters:
  • resnums (int, list) – Residue numbers in the sequence
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – Chain ID to map to index
  • use_representatives (bool) – If representative sequence/structure/chain should be used in mapping
Returns:

Mapping of resnums to indices

Return type:

dict

map_seqprop_resnums_to_structprop_resnums(resnums, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Map a residue number in the seqprop to the structure’s residue number for a specified chain.

Parameters:
  • resnums (int, list) – Residue numbers in the sequence
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – Chain ID to map to
  • use_representatives (bool) – If the representative sequence and structure should be used. If True, seqprop, structprop, and chain_id do not need to be defined.
Returns:

Mapping of resnums to structure residue IDs

Return type:

dict
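
The mapping between sequence residue numbers and structure residue numbers relies on a stored pairwise alignment. The following sketch shows one way such a mapping could be derived from two gapped, equal-length alignment strings. It illustrates the general technique only and is not ssbio's code; the function name and inputs are assumptions.

```python
def map_seq_to_struct_resnums(aligned_seq, aligned_struct, struct_resnums, gap="-"):
    """Map 1-based sequence residue numbers to structure residue numbers.

    aligned_seq, aligned_struct: gapped alignment strings of equal length.
    struct_resnums: residue numbers of the ungapped structure chain, in order.
    """
    if len(aligned_seq) != len(aligned_struct):
        raise ValueError("Aligned strings must have the same length")

    mapping = {}
    seq_pos = 0        # 1-based position in the ungapped sequence
    struct_index = 0   # 0-based index into the ungapped structure chain
    for seq_char, struct_char in zip(aligned_seq, aligned_struct):
        seq_present = seq_char != gap
        struct_present = struct_char != gap
        if seq_present:
            seq_pos += 1
        if seq_present and struct_present:
            mapping[seq_pos] = struct_resnums[struct_index]
        if struct_present:
            struct_index += 1
    return mapping


# The structure lacks the first two residues and one internal residue
resnum_map = map_seq_to_struct_resnums("MKTAYIA", "--TAY-A", [10, 11, 12, 14])
```

Sequence residues with no counterpart in the structure (unresolved or deleted) are simply absent from the returned dictionary.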

map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]

Map the representative sequence’s UniProt ID to PDB IDs using the PDBe “Best Structures” API.

Parameters:
  • seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
  • outdir (str) – Output directory to cache JSON results of search
  • force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns:

A rank-ordered list of PDB IDs that map to the UniProt ID

Return type:

list

num_sequences

int – Return the total number of sequences

num_structures

int – Return the total number of structures

num_structures_experimental

int – Return the total number of experimental structures

num_structures_homology

int – Return the total number of homology models

pairwise_align_sequences_to_representative(gapopen=10, gapextend=0.5, outdir=None, engine='needle', parse=True, force_rerun=False)[source]

Align all sequences in the sequences attribute to the representative sequence. Stores the alignments in the sequence_alignments DictList.

Parameters:
  • gapopen
  • gapextend
  • outdir
  • engine
  • parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
  • force_rerun

pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download experimental structures

prep_itasser_modeling(itasser_installation, itlib_folder, runtype, create_in_dir=None, execute_from_dir=None, print_exec=False, **kwargs)[source]

Prepare to run I-TASSER homology modeling for a sequence.

Parameters:
  • itasser_installation (str) – Path to I-TASSER folder, e.g. ~/software/I-TASSER4.4
  • itlib_folder (str) – Path to ITLIB folder, e.g. ~/software/ITLIB
  • runtype – How you will be running I-TASSER - local, slurm, or torque
  • create_in_dir (str) – Local directory where folders will be created
  • execute_from_dir (str) – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
  • all_genes (bool) – If all genes should be prepped, or only those without any mapped structures
  • print_exec (bool) – If the execution statement should be printed to run modelling
  • **kwargs

Todo

  • kwargs - extra options for SLURM or Torque execution
protein_dir

str – Protein folder

protein_statistics

Get a dictionary of basic statistics describing this protein

root_dir

str – Directory where Protein project folder is located

sequence_dir

str – Directory where sequence related files are stored

sequence_mutation_summary(sequence_ids=None, alignment_type=None)[source]

Summarize all mutations found in the sequence_alignments attribute.

Returns 2 dictionaries, single_counter and fingerprint_counter.

single_counter:

Dictionary of {point mutation: list of genes/strains}. Example:

{('A', 24, 'V'): ['Strain1', 'Strain2', 'Strain4'],
 ('R', 33, 'T'): ['Strain2']}

Here, we report which genes/strains have the single point mutation.

fingerprint_counter:

Dictionary of {mutation group: list of genes/strains}. Example:

{(('A', 24, 'V'), ('R', 33, 'T')): ['Strain2'],
 (('A', 24, 'V'),): ['Strain1', 'Strain4']}

Here, we report which genes/strains have the specific combinations (or “fingerprints”) of point mutations.

Parameters:
  • sequence_ids (str, list) – Specified alignment ID or IDs to use
  • alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object
Returns:

single_counter, fingerprint_counter

Return type:

dict, dict
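
The two dictionaries described above can be reproduced with a short sketch. This is an illustration of the data structures only, not ssbio's implementation: the function name and the input format (a dict of strain ID to list of point mutations) are assumptions.

```python
from collections import defaultdict


def summarize_mutations(mutations_by_strain):
    """Build single_counter and fingerprint_counter from per-strain mutations.

    mutations_by_strain: dict of strain ID -> list of (wild type, position,
    mutant) tuples (hypothetical input format).
    """
    single_counter = defaultdict(list)
    fingerprint_counter = defaultdict(list)
    for strain, mutations in mutations_by_strain.items():
        if not mutations:
            continue  # strains identical to the reference contribute nothing
        for mutation in mutations:
            single_counter[mutation].append(strain)
        # The fingerprint is the full set of mutations, ordered by position
        fingerprint = tuple(sorted(mutations, key=lambda m: m[1]))
        fingerprint_counter[fingerprint].append(strain)
    return dict(single_counter), dict(fingerprint_counter)


single_counter, fingerprint_counter = summarize_mutations({
    "Strain1": [("A", 24, "V")],
    "Strain2": [("A", 24, "V"), ("R", 33, "T")],
    "Strain3": [],
    "Strain4": [("A", 24, "V")],
})
```

Note that a fingerprint holding a single mutation is a one-element tuple, so it is written with a trailing comma, for example (('A', 24, 'V'),).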

set_representative_sequence(force_rerun=False)[source]

Consolidate sequences that were loaded and set a single representative sequence.

set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, clean=True, keep_chemicals=None, force_rerun=False)[source]

Set a representative structure from a structure in self.structures

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files
  • struct_outdir (str) – Path to output directory of structure files
  • pdb_file_type (str) – pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz - PDB structure file type that should be downloaded
  • engine (str) – Which pairwise sequence alignment engine to use: “needle” or “biopython”. needle is the standard EMBOSS tool for running pairwise alignments; biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • clean (bool) – If structure should be cleaned
  • keep_chemicals (str, list) – Keep specified chemical names if structure is to be cleaned
  • force_rerun (bool) – If sequence to structure alignment should be rerun
Returns:

Representative structure from the list of structures.

This is not a reference to the original structure; it is a copy of it.

Return type:

StructProp
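
A possible end-to-end workflow, using only the signatures listed in this reference, might look like the sketch below. It requires ssbio and network access, so the function is defined but not called here. The wrapper name, the cutoff values, and the assumption that map_uniprot_to_pdb also makes the mapped PDB entries available in the structures attribute are not confirmed by this reference.

```python
def pick_representative_structure(uniprot_id, root_dir):
    """Sketch: load a UniProt sequence and choose a representative structure."""
    # Imported lazily so the sketch can be read without ssbio installed
    from ssbio.core.protein import Protein

    protein = Protein(ident=uniprot_id, root_dir=root_dir, pdb_file_type="cif")

    # Download the sequence and XML metadata and use it as the representative
    protein.load_uniprot(uniprot_id=uniprot_id, download=True,
                         set_as_representative=True)

    # Rank PDB entries for this UniProt ID (assumed to populate structures)
    protein.map_uniprot_to_pdb(seq_ident_cutoff=0.9)

    # Quality-check chains against the representative sequence
    return protein.set_representative_structure(engine="biopython",
                                                seq_ident_cutoff=0.9,
                                                allow_mutants=False)
```

Choosing engine="biopython" avoids the dependency on an EMBOSS installation, at the cost of results that may differ from needle.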

structure_dir

str – Directory where structure related files are stored

view_all_mutations(grouped=False, color='red', unique_colors=True, structure_opacity=0.5, opacity_range=(0.8, 1), scale_range=(1, 5), gui=False)[source]

Map all sequence alignment mutations to the structure.

Parameters:
  • grouped (bool) – If groups of mutations should be colored and sized together
  • color (str) – Color of the mutations (overridden if unique_colors=True)
  • unique_colors (bool) – If each mutation/mutation group should be colored uniquely
  • structure_opacity (float) – Opacity of the protein structure cartoon representation
  • opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
  • scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

ssbio.core.protein.log = <logging.Logger object>[source]

protein.py

Todo

  • Implement structural alignment objects
class ssbio.protein.sequence.seqprop.SeqProp(ident, description='<unknown description>', seq=None, sequence_path=None, metadata_path=None, feature_path=None, sync_description=True, sync_annotations=True, sync_letter_annotations=True, sync_features=True, write_fasta_file=False, outfile=None, force_rewrite=False)[source]

Generic class to store information on a protein sequence.

The main utilities of this class are to:

  1. Provide database identifier mappings as top level attributes
  2. Manipulate a sequence as a Biopython SeqRecord
  3. Calculate, store, and access sequence features and annotations
  4. File I/O (sequence and feature files)
bigg

str – BiGG ID for this protein

kegg

str – KEGG ID for this protein

refseq

str – RefSeq ID for this protein

uniprot

str – UniProt ID for this protein

gene_name

str – Gene name encoding this protein

pdbs

list – List of PDB IDs mapped to this protein

go

list – List of GO terms

pfam

list – List of PFAMs

ec_number

EC numbers for this protein

seq_record

SeqRecord – Biopython SeqRecord representation of sequence

sequence_file

str – Path to FASTA file

metadata_file

str – Path to generic metadata file

annotations

dict – Freeform dictionary of annotations, copied from any SeqRecord annotations

letter_annotations

dict – Per-residue annotations, copied from any SeqRecord letter_annotations

features

list – Sequence features, copied from any SeqRecord features

blast_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]

BLAST this sequence to the PDB

equal_to(seq_prop)[source]

Test if the sequence is equal to another SeqProp’s sequence

Parameters:seq_prop – SeqProp object
Returns:If the sequences are the same
Return type:bool
equal_to_fasta(seq_file)[source]

Test if this sequence is equal to another sequence file.

Parameters:seq_file – Path to another sequence file
Returns:If the sequences are the same
Return type:bool
get_biopython_pepstats()[source]

Run Biopython’s built in ProteinAnalysis module.

Stores statistics in the annotations attribute.

get_dict(only_keys=None, exclude_attributes=None, df_format=False)[source]

Get a copy of all attributes as a dictionary, including object properties

Parameters:
  • only_keys
  • exclude_attributes
  • df_format

get_emboss_pepstats()[source]

Run the EMBOSS pepstats program on the protein sequence.

Stores statistics in the annotations attribute. Saves a .pepstats file of the results where the sequence file is located.

load_feature_path(gff_path)[source]

Load a GFF file with information on a single sequence and store features in the features attribute

Parameters:gff_path – Path to GFF file.
load_from_seq_record_attributes(sr)[source]

Load specific attributes from a SeqRecord object into a SeqProp.

load_metadata_path(metadata_path)[source]

Provide a pointer to the path of the metadata file

Parameters:metadata_path – Path to metadata file
load_sequence_path(sequence_path)[source]

Provide a pointer to the path of the sequence file

Parameters:sequence_path – Path to sequence file
load_to_seq_record_attributes(sr)[source]

Load specific attributes from the SeqProp to a SeqRecord object

num_pdbs

int – Report the number of PDB IDs stored in the pdbs attribute

seq_len

int – Get the sequence length

seq_record

SeqRecord – Dynamically loaded SeqRecord object from the sequence or metadata file

seq_str

str – Get the sequence formatted as a string

write_fasta_file(outfile, force_rerun=False)[source]

Write a FASTA file for the protein sequence

If sequence was initialized in the object, it is cleared and seq_record will instead load from this file.

Parameters:
  • outfile (str) – Path to new FASTA file to be written to
  • force_rerun (bool) – If an existing file should be overwritten
ssbio.protein.sequence.seqprop.log = <logging.Logger object>[source]

seqprop.py

Todo

  • Include methods to read and write GFF files so features don’t need to be stored in memory
  • load_json for SeqProp objects needs to load letter_annotations as a Bio.SeqRecord._RestrictedDict object, otherwise newly stored annotations can be of any length

class ssbio.protein.structure.structprop.StructProp(ident, description=None, chains=None, mapped_chains=None, is_experimental=False, structure_path=None, file_type=None)[source]

Class for protein structural properties.

add_chain_ids(chains)[source]

Add chains by ID into the chains attribute

Parameters:chains (str, list) – Chain ID or list of IDs
add_mapped_chain_ids(mapped_chains)[source]

Add chains by ID into the mapped_chains attribute

Parameters:mapped_chains (str, list) – Chain ID or list of IDs
clean_structure(out_suffix='_clean', outdir=None, force_rerun=False, remove_atom_alt=True, keep_atom_alt_id='A', remove_atom_hydrogen=True, add_atom_occ=True, remove_res_hetero=True, keep_chemicals=None, keep_res_only=None, add_chain_id_if_empty='X', keep_chains=None)[source]

Clean the structure file associated with this structure, and save it as a new file. Returns the file path.

Parameters:
  • out_suffix (str) – Suffix to append to original filename
  • outdir (str) – Path to output directory
  • force_rerun (bool) – If structure should be re-cleaned if a clean file exists already
  • remove_atom_alt (bool) – Remove alternate positions
  • keep_atom_alt_id (str) – If removing alternate positions, which alternate ID to keep
  • remove_atom_hydrogen (bool) – Remove hydrogen atoms
  • add_atom_occ (bool) – Add atom occupancy fields if not present
  • remove_res_hetero (bool) – Remove all HETATMs
  • keep_chemicals (str, list) – If removing HETATMs, keep specified chemical names
  • keep_res_only (str, list) – Keep ONLY specified resnames, deletes everything else!
  • add_chain_id_if_empty (str) – Add a chain ID if not present
  • keep_chains (str, list) – Keep only these chains
Returns:

Path to cleaned PDB file

Return type:

str
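
The cleaning options above can be pictured as a set of filters over the ATOM and HETATM records of a PDB-format file. The sketch below applies a few of them to a list of text lines using the fixed-width PDB column layout. It is an illustration only and not ssbio's implementation; the function name is hypothetical, and its parameters cover only part of clean_structure's options.

```python
def clean_pdb_lines(lines, keep_atom_alt_id="A", remove_hydrogens=True,
                    remove_hetero=True, keep_chemicals=None, keep_chains=None):
    """Filter PDB-format lines roughly along the lines of the options above."""
    keep_chemicals = set(keep_chemicals or [])
    cleaned = []
    for line in lines:
        padded = line.ljust(80)          # guard against short lines
        record = padded[:6].strip()
        if record not in ("ATOM", "HETATM"):
            cleaned.append(line)         # pass through HEADER, TER, END, ...
            continue
        altloc = padded[16]              # column 17: alternate location ID
        resname = padded[17:20].strip()  # columns 18-20: residue name
        chain = padded[21]               # column 22: chain ID
        element = padded[76:78].strip().upper()  # columns 77-78: element
        if altloc not in (" ", keep_atom_alt_id):
            continue                     # drop other alternate positions
        if remove_hydrogens and element == "H":
            continue
        if remove_hetero and record == "HETATM" and resname not in keep_chemicals:
            continue
        if keep_chains is not None and chain not in keep_chains:
            continue
        cleaned.append(line)
    return cleaned
```

For example, clean_pdb_lines(lines, keep_chemicals=["ZN"], keep_chains=["A"]) would keep chain A protein atoms and zinc ions while dropping waters, hydrogens, and extra alternate positions.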

get_dict_with_chain(chain, only_keys=None, chain_keys=None, exclude_attributes=None, df_format=False)[source]

get_dict method which incorporates attributes found in a specific chain. Does not overwrite any attributes in the original StructProp.

Parameters:
  • chain
  • only_keys
  • chain_keys
  • exclude_attributes
  • df_format
Returns:

attributes of StructProp + the chain specified

Return type:

dict

get_disulfide_bridges(threshold=3.0)[source]

Run Biopython’s search_ss_bonds to find potential disulfide bridges for each chain and store in ChainProp.
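
The threshold parameter suggests a distance criterion. The sketch below shows one plausible version of that idea: report pairs of cysteine SG atoms that lie within the threshold distance (in Angstroms) of each other. This is an assumption about the criterion, not a copy of Biopython's search_ss_bonds; the function name and input format are hypothetical.

```python
import math
from itertools import combinations


def find_disulfide_candidates(sg_atoms, threshold=3.0):
    """Return residue-number pairs whose SG atoms are within the threshold.

    sg_atoms: dict of cysteine residue number -> (x, y, z) coordinates of
    its SG atom (hypothetical input format).
    """
    pairs = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(sg_atoms.items()), 2):
        # Whether the real check is inclusive of the threshold is an assumption
        if math.dist(xyz_a, xyz_b) <= threshold:
            pairs.append((res_a, res_b))
    return pairs


candidates = find_disulfide_candidates({
    6: (0.0, 0.0, 0.0),
    30: (10.0, 0.0, 0.0),
    127: (2.0, 0.0, 0.0),
})
```
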

get_dssp_annotations(outdir, force_rerun=False)[source]

Run DSSP on this structure and store the DSSP annotations in the corresponding ChainProp SeqRecords

Parameters:
  • outdir (str) – Path to where DSSP dataframe will be stored.
  • force_rerun (bool) – If DSSP results should be recalculated
get_freesasa_annotations(outdir, include_hetatms=False, force_rerun=False)[source]

Run freesasa on this structure and store the calculated properties in the corresponding ChainProp SeqRecords

get_residue_depths(outdir, force_rerun=False)[source]

Run MSMS on this structure and store the residue depths and Cα depths in the corresponding ChainProp SeqRecords

get_structure_seqs(model)[source]

Store chain sequences in the corresponding ChainProp objects in the chains attribute.

load_structure_path(structure_path, file_type)[source]

Load a structure file and provide pointers to its location

Parameters:
  • structure_path – Path to structure file
  • file_type – Type of structure file
parse_structure()[source]

Read the 3D coordinates of a structure file and return it as a Biopython Structure object

Also create ChainProp objects in the chains attribute

Returns:Biopython Structure object
Return type:Structure
view_structure(opacity=1.0, recolor=True, gui=False)[source]

Use NGLviewer to display a structure in a Jupyter notebook

Parameters:
  • opacity (float) – Opacity of the structure
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

view_structure_and_highlight_residues(structure_resnums, chain=None, color='red', structure_opacity=0.5, gui=False)[source]

Input a residue number or numbers to view on the structure.

Parameters:
  • structure_resnums (int, list) – Residue number(s) to highlight, structure numbering
  • chain (str, list) – Chain ID or IDs of which residues are a part of. If not provided, all chains in the mapped_chains attribute will be used. IMPORTANT: if that is also empty, all residues in all chains matching the residue numbers will be shown, which may not always be correct.
  • color (str) – Color to highlight with
  • structure_opacity (float) – Opacity of the protein structure cartoon representation
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

view_structure_and_highlight_residues_scaled(structure_resnums, chain=None, color='red', unique_colors=False, structure_opacity=0.5, opacity_range=(0.5, 1), scale_range=(0.7, 10), gui=False)[source]

Input a list of residue numbers to view on the structure, or input a dictionary of residue numbers to counts to scale residues by counts (useful to view mutations).

Parameters:
  • structure_resnums (int, list, dict) – Residue number(s) to highlight, or a dictionary of residue number to frequency count
  • chain (str, list) – Chain ID or IDs of which residues are a part of. If not provided, all chains in the mapped_chains attribute will be used. PLEASE NOTE: if that is also empty, all residues in all chains matching the residue numbers will be shown.
  • color (str) – Color to highlight with
  • unique_colors (bool) – If each mutation should be colored uniquely (will override color argument)
  • structure_opacity (float) – Opacity of the protein structure cartoon representation
  • opacity_range (tuple) – Min/max opacity values (residues that have higher frequency counts will be opaque)
  • scale_range (tuple) – Min/max size values (residues that have higher frequency counts will be bigger)
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object
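
How frequency counts translate into sizes and opacities is not spelled out in this reference. One simple possibility is linear interpolation between the minimum and maximum of scale_range and opacity_range, sketched below. The function name and the choice of linear scaling are assumptions and may differ from what ssbio does.

```python
def scale_by_count(counts, scale_range=(0.7, 10), opacity_range=(0.5, 1)):
    """Map residue number -> (size, opacity) by linear scaling of counts.

    counts: dict of residue number -> frequency count.
    """
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    scaled = {}
    for resnum, count in counts.items():
        # If every count is equal, treat all residues as maximally frequent
        fraction = (count - lowest) / span if span else 1.0
        size = scale_range[0] + fraction * (scale_range[1] - scale_range[0])
        opacity = opacity_range[0] + fraction * (opacity_range[1] - opacity_range[0])
        scaled[resnum] = (size, opacity)
    return scaled


scaled = scale_by_count({24: 3, 33: 1, 50: 2},
                        scale_range=(1, 5), opacity_range=(0.8, 1))
```

Under this scheme the most frequent residue receives the largest size and full opacity, matching the behavior described for opacity_range and scale_range.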