The SeqProp Class

Introduction

This section gives an overview of the methods that can be run on a single protein sequence.

Available functions

Sequence-based predictions

| Function | Description | Internal Python class used and functions provided | External software to install | Web server | Alternate external software to install |
| --- | --- | --- | --- | --- | --- |
| Secondary structure and solvent accessibilities | Predictions of secondary structure and relative solvent accessibilities per residue | scratch module | SCRATCH | | |
| Thermostability | Free energy of unfolding (ΔG), adapted from Oobatake (Oobatake & Ooi 1993) and Dill (Dill et al. 2011) | thermostability module | | | |
| Transmembrane domains | Prediction of transmembrane domains from sequence | tmhmm module | TMHMM | | |
| Aggregation propensity | Consensus method to predict the aggregation propensity of proteins, specifically the number of aggregation-prone segments on an unfolded protein sequence | aggregation_propensity module | | AMYLPRED2 | |

Sequence-based calculations

| Function | Description | Internal Python class used and functions provided | External software to install | Web server | Alternate external software to install |
| --- | --- | --- | --- | --- | --- |
| Various sequence properties | Basic properties of the sequence, such as percent of polar, non-polar, hydrophobic or hydrophilic residues. | | | | EMBOSS pepstats |
| Sequence alignment | Basic functions to run pairwise or multiple sequence alignments | | | | EMBOSS needle |

API

SeqProp

class ssbio.protein.sequence.seqprop.SeqProp(seq, id, name='<unknown name>', description='<unknown description>', sequence_path=None, metadata_path=None, feature_path=None)

Generic class to represent information for a protein sequence.

Extends the Biopython SeqRecord class. The main functionality added is the ability to set and load directly from sequence, metadata, and feature files. Additionally, methods are provided to calculate and store sequence properties in the annotations and letter_annotations field of a SeqProp. These can then be accessed for a range of residue numbers.

id

str – Unique identifier for this protein sequence

seq

Seq – Protein sequence as a Biopython Seq object

name

str – Optional name for this sequence

description

str – Optional description for this sequence

bigg

str, list – BiGG IDs mapped to this sequence

kegg

str, list – KEGG IDs mapped to this sequence

refseq

str, list – RefSeq IDs mapped to this sequence

uniprot

str, list – UniProt IDs mapped to this sequence

gene_name

str, list – Gene names mapped to this sequence

pdbs

list – PDB IDs mapped to this sequence

go

str, list – GO terms mapped to this sequence

pfam

str, list – PFAMs mapped to this sequence

ec_number

str, list – EC numbers mapped to this sequence

sequence_file

str – FASTA file for this sequence

metadata_file

str – Metadata file (any format) for this sequence

feature_file

str – GFF file for this sequence

features

list – List of protein sequence features, which define regions of the protein

annotations

dict – Annotations of this protein sequence, which summarize global properties

letter_annotations

RestrictedDict – Residue-level annotations, which describe single residue properties

Todo

  • Properly inherit methods from the Object class…

add_point_feature(resnum, feat_type=None, feat_id=None)

Add a feature to the features list describing a single residue.

Parameters:
  • resnum (int) – Protein sequence residue number
  • feat_type (str, optional) – Optional description of the feature type (e.g. ‘catalytic residue’)
  • feat_id (str, optional) – Optional ID of the feature type (e.g. ‘TM1’)

add_region_feature(start_resnum, end_resnum, feat_type=None, feat_id=None)

Add a feature to the features list describing a region of the protein sequence.

Parameters:
  • start_resnum (int) – Start residue number of the protein sequence feature
  • end_resnum (int) – End residue number of the protein sequence feature
  • feat_type (str, optional) – Optional description of the feature type (e.g. ‘binding domain’)
  • feat_id (str, optional) – Optional ID of the feature type (e.g. ‘TM1’)
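
The two feature methods above differ only in span: a point feature covers one residue, while a region feature covers an inclusive range of residues. A minimal standard-library sketch of that bookkeeping follows. The `Feature` dataclass and `FeatureList` helper are hypothetical stand-ins rather than ssbio's implementation, and the conversion to a zero-based, end-exclusive span is an assumption about how residue numbers could map onto Python slices.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """Hypothetical stand-in for a sequence feature (not an ssbio class)."""
    start: int  # zero-based, inclusive
    end: int    # zero-based, exclusive
    feat_type: Optional[str] = None
    feat_id: Optional[str] = None


@dataclass
class FeatureList:
    """Hypothetical container that mimics the two add_* methods."""
    seq: str
    features: List[Feature] = field(default_factory=list)

    def add_region_feature(self, start_resnum, end_resnum, feat_type=None, feat_id=None):
        # Residue numbers are 1-based and inclusive, so only the start is shifted.
        if not 1 <= start_resnum <= end_resnum <= len(self.seq):
            raise ValueError("residue range falls outside the sequence")
        self.features.append(Feature(start_resnum - 1, end_resnum, feat_type, feat_id))

    def add_point_feature(self, resnum, feat_type=None, feat_id=None):
        # A point feature is treated as a region of length one.
        self.add_region_feature(resnum, resnum, feat_type, feat_id)


feats = FeatureList("MKTAYIAKQR")  # toy sequence
feats.add_point_feature(3, feat_type="catalytic residue")
feats.add_region_feature(5, 8, feat_type="binding domain", feat_id="BD1")

# Recover the residues each feature covers by slicing the sequence.
covered = [feats.seq[f.start:f.end] for f in feats.features]
```

Storing the span in slice form keeps later lookups simple, at the cost of converting back to residue numbers when reporting.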

blast_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)

BLAST this sequence against the PDB.

equal_to(seq_prop)

Test if the sequence is equal to another SeqProp’s sequence.

Parameters:
  • seq_prop (SeqProp) – SeqProp object to compare against
Returns:

True if the sequences are the same

Return type:

bool

feature_path_unset()

Copy features to memory and remove the association of the feature file.

features

list – Get the features stored in memory or in the GFF file

get_aggregation_propensity(email, password, cutoff_v=5, cutoff_n=5, run_amylmuts=False, outdir=None)

Run the AMYLPRED2 web server to calculate the aggregation propensity of this protein sequence, which is the number of aggregation-prone segments on the unfolded protein sequence.

Stores statistics in the annotations attribute, under the key aggprop-amylpred.

See ssbio.protein.sequence.properties.aggregation_propensity for instructions and details.

get_biopython_pepstats()

Run Biopython’s built-in ProteinAnalysis class and store the resulting statistics in the annotations attribute.
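
To illustrate the kind of whole-sequence statistic that ends up in the annotations dictionary, here is a standard-library sketch. It does not call Biopython and does not reproduce the values ProteinAnalysis reports; the `simple_pepstats` function, the residue grouping, and the `sketch-` key prefix are all assumptions made for this example.

```python
from collections import Counter

# Illustrative grouping only; other hydrophobicity scales classify residues differently.
HYDROPHOBIC = set("AVILMFWC")


def simple_pepstats(seq):
    """Return a few whole-sequence statistics as a plain dict."""
    if not seq:
        raise ValueError("sequence is empty")
    counts = Counter(seq)
    length = len(seq)
    hydrophobic = sum(counts[aa] for aa in HYDROPHOBIC)
    return {
        "length": length,
        "percent_hydrophobic": 100.0 * hydrophobic / length,
    }


annotations = {}
# Hypothetical key prefix, chosen for this example only.
for key, value in simple_pepstats("MKTAYIAKQR").items():
    annotations["sketch-" + key] = value
```

Prefixing the keys with the name of the tool that produced them keeps results from different programs from overwriting one another in the same dictionary.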

get_dict(only_attributes=None, exclude_attributes=None, df_format=False)

Get a dictionary of this object’s attributes. Optional format for storage in a Pandas DataFrame.

Parameters:
  • only_attributes (str, list) – Attributes that should be returned. If not provided, all are returned.
  • exclude_attributes (str, list) – Attributes that should be excluded.
  • df_format (bool) – Whether dictionary values should be formatted for a DataFrame: values are converted to str, int, or float where possible, and anything that cannot be converted is excluded
Returns:

Dictionary of attributes

Return type:

dict
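
The filtering described by these parameters can be sketched with the standard library alone. The `attributes_as_dict` helper and the toy `Record` class below are hypothetical; in particular, joining lists into a single string when `df_format` is set is an assumed flattening rule, not necessarily what ssbio does.

```python
def attributes_as_dict(obj, only_attributes=None, exclude_attributes=None, df_format=False):
    """Hypothetical helper mirroring the get_dict parameters described above."""
    def as_list(value):
        # Accept a single name or a list of names, as the parameter docs allow.
        if value is None:
            return None
        return [value] if isinstance(value, str) else list(value)

    only = as_list(only_attributes)
    exclude = as_list(exclude_attributes) or []

    result = {}
    for key, value in vars(obj).items():
        if only is not None and key not in only:
            continue
        if key in exclude:
            continue
        if df_format:
            if isinstance(value, (str, int, float)):
                pass  # already a scalar that fits in a table cell
            elif isinstance(value, (list, tuple, set)):
                # Assumed flattening rule: join simple collections into one string.
                value = ";".join(str(v) for v in value)
            else:
                continue  # cannot be represented as a scalar, so drop it
        result[key] = value
    return result


class Record:
    """Toy object standing in for a SeqProp; all values are made up."""
    def __init__(self):
        self.id = "seq1"
        self.seq_len = 10
        self.pdbs = ["1abc", "2xyz"]
        self.annotations = {"note": "nested dicts are dropped when df_format=True"}


row = attributes_as_dict(Record(), exclude_attributes="seq_len", df_format=True)
```

A dictionary shaped like `row` can be passed to a table constructor directly, because every value is a scalar.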

get_emboss_pepstats()

Run the EMBOSS pepstats program on the protein sequence.

Stores statistics in the annotations attribute. Saves a .pepstats file of the results where the sequence file is located.

get_kinetic_folding_rate(secstruct, at_temp=None)

Run the FOLD-RATE web server to calculate the kinetic folding rate given an amino acid sequence and its structural classification (alpha/beta/mixed).

Stores statistics in the annotations attribute, under the key kinetic_folding_rate_<TEMP>-foldrate.

See ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate() for instructions and details.

get_residue_annotations(start_resnum, end_resnum=None)

Retrieve letter annotations for a residue or a range of residues.

Parameters:
  • start_resnum (int) – Residue number
  • end_resnum (int) – Optional residue number, specify if a range is desired
Returns:

Letter annotations for this residue or residues

Return type:

dict
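
Per-residue lookup of this kind comes down to slicing every annotation with the same residue range. The sketch below uses only the standard library; the `residue_annotations` function and the annotation key names are hypothetical, and treating residue numbers as 1-based and inclusive is an assumption.

```python
def residue_annotations(letter_annotations, start_resnum, end_resnum=None):
    """Slice every per-residue annotation for a 1-based, inclusive residue range."""
    if end_resnum is None:
        end_resnum = start_resnum
    if start_resnum < 1 or end_resnum < start_resnum:
        raise ValueError("invalid residue range")
    # The 1-based inclusive range [start, end] maps to the slice [start - 1:end].
    return {key: values[start_resnum - 1:end_resnum]
            for key, values in letter_annotations.items()}


# Toy per-residue annotations for a 10-residue sequence; key names are made up.
letter_annotations = {
    "SS-sketch": "CCHHHHHHCC",
    "RSA-sketch": [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.2, 0.4, 0.7, 0.9],
}

single = residue_annotations(letter_annotations, 3)
window = residue_annotations(letter_annotations, 2, 4)
```

Each annotation must have exactly one entry per residue for the slices to line up, which is the constraint a restricted dictionary enforces.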

get_thermostability(at_temp)

Run the thermostability calculator using either the Dill or Oobatake methods.

Stores calculated (dG, Keq) tuple in the annotations attribute, under the key thermostability_<TEMP>-<METHOD_USED>.

See ssbio.protein.sequence.properties.thermostability.get_dG_at_T() for instructions and details.
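
The stored tuple pairs a free energy with an equilibrium constant, and the two are linked by the standard relation Keq = exp(-ΔG / RT). The sketch below shows that relation and the key format described above. The units (cal/mol and degrees Celsius), the `keq_from_dG` helper, the ΔG value, and the `sketch` method label are all assumptions made for illustration, not ssbio's own calculation.

```python
import math

R_CAL = 1.987  # gas constant in cal/(mol*K)


def keq_from_dG(dG_cal_per_mol, temp_celsius):
    """Equilibrium constant from a free energy via Keq = exp(-dG / (R * T))."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(-dG_cal_per_mol / (R_CAL * temp_kelvin))


annotations = {}
at_temp = 37
dG = 5000.0        # made-up illustrative value, not a computed prediction
method = "sketch"  # hypothetical label standing in for <METHOD_USED>

# Build the key in the thermostability_<TEMP>-<METHOD_USED> form.
key = "thermostability_{}-{}".format(at_temp, method)
annotations[key] = (dG, keq_from_dG(dG, at_temp))
```

With ΔG defined as the free energy of unfolding, a positive value gives Keq below 1, meaning the folded state is favored at that temperature.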

num_pdbs

int – Report the number of PDB IDs stored in the pdbs attribute

seq

Seq – Dynamically loaded Seq object from the sequence file

seq_len

int – Get the sequence length

seq_str

str – Get the sequence formatted as a string

write_fasta_file(outfile, force_rerun=False)

Write a FASTA file for the protein sequence; seq will then load directly from this file.

Parameters:
  • outfile (str) – Path to new FASTA file to be written to
  • force_rerun (bool) – If an existing file should be overwritten
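
The effect of `force_rerun` can be sketched with the standard library: an existing file is left alone unless overwriting is requested. The `write_fasta` function below is a hypothetical stand-in, not the ssbio method, and wrapping the sequence at 60 characters is just a common FASTA convention.

```python
import os
import tempfile
import textwrap


def write_fasta(seq_id, seq_str, outfile, force_rerun=False):
    """Write one FASTA record; skip the write if the file exists and force_rerun is False."""
    if os.path.exists(outfile) and not force_rerun:
        return outfile
    with open(outfile, "w") as handle:
        handle.write(">{}\n".format(seq_id))
        # Wrap the sequence at 60 characters per line.
        handle.write("\n".join(textwrap.wrap(seq_str, 60)) + "\n")
    return outfile


outdir = tempfile.mkdtemp()
path = os.path.join(outdir, "example.faa")

write_fasta("seq1", "MKTAYIAKQR", path)
write_fasta("seq1", "AAAA", path)  # ignored: file exists and force_rerun is False
with open(path) as handle:
    first_pass = handle.read()

write_fasta("seq1", "AAAA", path, force_rerun=True)  # overwrites the file
with open(path) as handle:
    second_pass = handle.read()
```

Skipping the write by default makes repeated runs of a pipeline cheap, while `force_rerun=True` gives an explicit way to refresh stale files.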

write_gff_file(outfile, force_rerun=False)

Write a GFF file for the protein features; features will then load directly from this file.

Parameters:
  • outfile (str) – Path to the new GFF file to be written
  • force_rerun (bool) – If an existing file should be overwritten