Protein - Structure Mapping, Alignments, and Visualization

This notebook gives an example of how to map a single protein sequence to its structure, along with conducting sequence alignments and visualizing the mutations.

Input: Protein ID + amino acid sequence + mutated sequence(s)
Output: Representative protein structure, sequence alignments, and visualization of mutations

Imports

In [1]:
import sys
import logging
In [2]:
# Import the Protein class
from ssbio.core.protein import Protein
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff

Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!

In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROTEIN_ID will be created
  • PROTEIN_ID
    • Your protein ID
  • PROTEIN_SEQ
    • Your protein sequence

A directory will be created in ROOT_DIR with your PROTEIN_ID name. The folders are organized like so:

ROOT_DIR
└── PROTEIN_ID
    ├── sequences  # Protein sequence files, alignments, etc.
    └── structures  # Protein structure files, calculations, etc.
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROTEIN_ID = 'SRR1753782_00918'
PROTEIN_SEQ = 'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
In [7]:
# Create the Protein object
my_protein = Protein(ident=PROTEIN_ID, root_dir=ROOT_DIR)
In [8]:
# Load the protein sequence
# This sets the loaded sequence as the representative one
my_protein.load_manual_sequence(seq=PROTEIN_SEQ, ident='WT', write_fasta_file=True, set_as_representative=True)
Out[8]:
<SeqProp WT at 0x7f89f05bd0f0>

Mapping sequence –> structure

Since the sequence has been provided, we just need to BLAST it to the PDB.

Note: These methods do not download any 3D structure files.

Methods

Protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]

BLAST repseq to PDB and return the list of new structures added, also saves df_pdb_blast

In [9]:
# Mapping using BLAST
my_protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0.9, evalue=0.00001)
my_protein.df_pdb_blast.head()
Out[9]:
['2zyd', '2zya', '3fwn', '2zyg']
Out[9]:
pdb_chain_id hit_score hit_evalue hit_percent_similar hit_percent_ident hit_num_ident hit_num_similar
pdb_id
2zya A 2319.0 0.0 0.987179 0.963675 451 462
2zya B 2319.0 0.0 0.987179 0.963675 451 462
2zyd A 2319.0 0.0 0.987179 0.963675 451 462
2zyd B 2319.0 0.0 0.987179 0.963675 451 462
2zyg A 2284.0 0.0 0.982906 0.950855 445 460

Downloading and ranking structures

Methods

Protein.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download experimental structures

Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [10]:
# Download all mapped PDBs and gather the metadata
my_protein.pdb_downloader_and_metadata()
my_protein.df_pdb_metadata.head(2)
Out[10]:
['2zyd', '2zya', '3fwn', '2zyg']
Out[10]:
pdb_title description experimental_method mapped_chains resolution chemicals taxonomy_name structure_file
pdb_id
2zya Dimeric 6-phosphogluconate dehydrogenase compl... 6-phosphogluconate dehydrogenase, decarboxylat... X-RAY DIFFRACTION A;B 1.6 6PG Escherichia coli 2zya.cif
2zyd Dimeric 6-phosphogluconate dehydrogenase compl... 6-phosphogluconate dehydrogenase, decarboxylat... X-RAY DIFFRACTION A;B 1.5 GLO Escherichia coli 2zyd.cif
Protein.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, clean=True, keep_chemicals=None, force_rerun=False)[source]

Set a representative structure from a structure in self.structures

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files
  • struct_outdir (str) – Path to output directory of structure files
  • pdb_file_type (str) – pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz - PDB structure file type that should be downloaded
  • engine (str) – “needle” or “biopython” - which pairwise sequence alignment engine should be used needle is the standard EMBOSS tool to run pairwise alignments biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • clean (bool) – If structure should be cleaned
  • keep_chemicals (str, list) – Keep specified chemical names if structure is to be cleaned
  • force_rerun (bool) – If sequence to structure alignment should be rerun
Returns:

Representative structure from the list of structures.

This is a not a map to the original structure, it is copied from its reference.

Return type:

StructProp

In [11]:
# Set representative structures
my_protein.set_representative_structure(engine='biopython', force_rerun=True)
[2017-09-04 17:50] [ssbio.protein.sequence.utils.alignment] WARNING: Gap penalties not implemented in Biopython yet
[2017-09-04 17:50] [ssbio.protein.sequence.utils.alignment] WARNING: Gap penalties not implemented in Biopython yet
[2017-09-04 17:50] [ssbio.protein.sequence.utils.alignment] WARNING: Gap penalties not implemented in Biopython yet
[2017-09-04 17:50] [ssbio.protein.sequence.utils.alignment] WARNING: Gap penalties not implemented in Biopython yet
[2017-09-04 17:50] [ssbio.protein.sequence.utils.alignment] WARNING: Gap penalties not implemented in Biopython yet
[2017-09-04 17:50] [ssbio.protein.sequence.utils.alignment] WARNING: Gap penalties not implemented in Biopython yet
[2017-09-04 17:50] [ssbio.protein.sequence.utils.alignment] WARNING: Gap penalties not implemented in Biopython yet
[2017-09-04 17:50] [ssbio.protein.sequence.utils.alignment] WARNING: Gap penalties not implemented in Biopython yet
[2017-09-04 17:50] [ssbio.core.protein] WARNING: SRR1753782_00918: no structures meet quality checks

Loading and aligning new sequences

You can load additional sequences into this protein object and align them to the representative sequence.

Methods

Protein.load_manual_sequence(seq, ident=None, write_fasta_file=False, outname=None, outdir=None, set_as_representative=False, force_rewrite=False)[source]

Load a manual sequence given as a string and optionally set it as the representative sequence. Also store it in the sequences attribute.

Parameters:
  • seq (str, Seq, SeqRecord) – Sequence string, Biopython Seq or SeqRecord object
  • ident (str) – Optional identifier for the sequence, required if seq is a string. Also will override existing IDs in Seq or SeqRecord objects if set.
  • write_fasta_file
  • outname
  • outdir
  • set_as_representative
  • force_rewrite

Returns:

In [12]:
# Input your mutated sequence and load it
mutated_protein1_id = 'N17P_SNP'
mutated_protein1_seq = 'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein1_id, seq=mutated_protein1_seq)
Out[12]:
<SeqProp N17P_SNP at 0x7f8e304effd0>
In [13]:
# Input another mutated sequence and load it
mutated_protein2_id = 'Q4S_N17P_SNP'
mutated_protein2_seq = 'MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein2_id, seq=mutated_protein2_seq)
Out[13]:
<SeqProp Q4S_N17P_SNP at 0x7f8e3033d5c0>
Protein.pairwise_align_sequences_to_representative(gapopen=10, gapextend=0.5, outdir=None, engine='needle', parse=True, force_rerun=False)[source]

Align all sequences in the sequences attribute to the representative sequence. Stores the alignments the sequence_alignments DictList

Parameters:
  • gapopen
  • gapextend
  • outdir
  • engine
  • parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
  • force_rerun

Returns:

In [14]:
# Conduct pairwise sequence alignments
my_protein.pairwise_align_sequences_to_representative()
In [15]:
# View the stored information for one of the alignments
my_protein.representative_sequence.sequence_alignments
my_protein.representative_sequence.sequence_alignments[0].annotations

str(my_protein.representative_sequence.sequence_alignments[0][0].seq)
str(my_protein.representative_sequence.sequence_alignments[0][1].seq)
Out[15]:
[<<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 7f8e3033d518>,
 <<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 7f8e3033d9b0>]
Out[15]:
{'a_seq': 'WT',
 'b_seq': 'N17P_SNP',
 'deletions': [],
 'insertions': [],
 'mutations': [('N', 17, 'P')],
 'percent_gaps': 0.0,
 'percent_identity': 99.8,
 'percent_similarity': 99.8,
 'score': 2381.0}
Out[15]:
'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
Out[15]:
'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
In [16]:
# Summarize all the mutations in all alignments
s,f = my_protein.representative_sequence.sequence_mutation_summary()
print('Single mutations:')
s
print('---------------------')
print('Mutation fingerprints')
f
Single mutations:
Out[16]:
{('N', 17, 'P'): ['N17P_SNP', 'Q4S_N17P_SNP'], ('Q', 4, 'S'): ['Q4S_N17P_SNP']}
---------------------
Mutation fingerprints
Out[16]:
{(('N', 17, 'P'),): ['N17P_SNP'],
 (('Q', 4, 'S'), ('N', 17, 'P')): ['Q4S_N17P_SNP']}

Some additional methods

Getting binding site/other information from UniProt

In [17]:
import ssbio.databases.uniprot
In [18]:
this_examples_uniprot = 'A0A0N2BFZ3'
sites_df = ssbio.databases.uniprot.uniprot_sites(this_examples_uniprot)
sites_df
Out[18]:
type seq_start seq_end notes
0 Domain 179 467 Note=6PGD;Ontology_term=ECO:0000259;evidence=E...
1 Nucleotide binding 10 15 Note=NADP;Ontology_term=ECO:0000256;evidence=E...
2 Nucleotide binding 33 35 Note=NADP;Ontology_term=ECO:0000256;evidence=E...
3 Nucleotide binding 74 76 Note=NADP;Ontology_term=ECO:0000256;evidence=E...
4 Region 128 130 Note=Substrate binding;Ontology_term=ECO:00002...
5 Region 186 187 Note=Substrate binding;Ontology_term=ECO:00002...
6 Active site 183 183 Note=Proton acceptor;Ontology_term=ECO:0000256...
7 Active site 190 190 Note=Proton donor;Ontology_term=ECO:0000256;ev...
8 Binding site 102 102 Note=NADP;Ontology_term=ECO:0000256;evidence=E...
9 Binding site 102 102 Note=Substrate;Ontology_term=ECO:0000256;evide...
10 Binding site 191 191 Note=Substrate;Ontology_term=ECO:0000256;evide...
11 Binding site 260 260 Note=Substrate%3B via amide nitrogen;Ontology_...
12 Binding site 287 287 Note=Substrate;Ontology_term=ECO:0000256;evide...
13 Binding site 445 445 Note=Substrate%3B shared with dimeric partner;...
14 Binding site 451 451 Note=Substrate%3B shared with dimeric partner;...
In [19]:
# Saving a list of the nucleotide binding site residues
nucleotide_binding_sites = []
for i, r in sites_df[sites_df.type=='Nucleotide binding'][['seq_start', 'seq_end']].iterrows():
    start = r.seq_start
    end = r.seq_end
    for x in range(start, end+1):
        nucleotide_binding_sites.append(x)
nucleotide_binding_sites
Out[19]:
[10, 11, 12, 13, 14, 15, 33, 34, 35, 74, 75, 76]

Mapping sequence residue numbers to structure residue numbers

Methods

In [20]:
# Returns a dictionary mapping sequence residue numbers to structure residue identifiers
structure_sites = my_protein.representative_structure.map_repseq_resnums_to_structure_resnums(nucleotide_binding_sites)
structure_sites
Out[20]:
{10: (' ', 10, ' '),
 11: (' ', 11, ' '),
 12: (' ', 12, ' '),
 13: (' ', 13, ' '),
 14: (' ', 14, ' '),
 15: (' ', 15, ' '),
 33: (' ', 33, ' '),
 34: (' ', 34, ' '),
 35: (' ', 35, ' '),
 74: (' ', 74, ' '),
 75: (' ', 75, ' '),
 76: (' ', 76, ' ')}
In [21]:
# For viewing below, we can remap binding site residues to the structure (luckily they are the same here)
nucleotide_binding_site_remapped_to_structure = [x[1] for x in structure_sites.values()]
nucleotide_binding_site_remapped_to_structure
Out[21]:
[33, 34, 35, 76, 74, 11, 12, 13, 14, 15, 75, 10]
In [22]:
# Will warn you if residues are not present in the structure
my_protein.representative_structure.map_repseq_resnums_to_structure_resnums([1,2,3])
[2017-03-09 15:29] [ssbio.structure.structprop] WARNING: 2zyd-A, 1: structure file does not contain coordinates for this residue
[2017-03-09 15:29] [ssbio.structure.structprop] WARNING: 2zyd-A, 2: structure file does not contain coordinates for this residue
Out[22]:
{3: (' ', 3, ' ')}

Viewing structures

The awesome package nglview is utilized as a backend for viewing structures within a Jupyter notebook. There are many more options which can be set if you run:

import nglview
view = nglview.show_structure_file(my_protein.representative_structure.structure_path)
view

ssbio provides some wrapper functions to easily view structures and also map sequence residue numbers to structure residue numbers:

Methods

StructProp.view_structure(opacity=1.0, recolor=True, gui=False)[source]

Use NGLviewer to display a structure in a Jupyter notebook

Parameters:
  • opacity (float) – Opacity of the structure
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

In [23]:
# View just the structure
my_protein.representative_structure.view_structure()
Protein.view_all_mutations(grouped=False, color='red', unique_colors=True, structure_opacity=0.5, opacity_range=(0.8, 1), scale_range=(1, 5), gui=False)[source]

Map all sequence alignment mutations to the structure.

Parameters:
  • grouped (bool) – If groups of mutations should be colored and sized together
  • color (str) – Color of the mutations (overridden if unique_colors=True)
  • unique_colors (bool) – If each mutation/mutation group should be colored uniquely
  • structure_opacity (float) – Opacity of the protein structure cartoon representation
  • opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
  • scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

In [24]:
# Map the mutations on the visualization (scale increased)
my_protein.view_all_mutations(scale_range=(4,7))
[2017-03-09 15:29] [ssbio.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and 17
[2017-03-09 15:29] [ssbio.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and 4
StructProp.view_structure_and_highlight_residues(structure_resnums, chain=None, color='red', structure_opacity=0.5, gui=False)[source]

Input a residue number or numbers to view on the structure.

Parameters:
  • structure_resnums (int, list) – Residue number(s) to highlight, structure numbering
  • chain (str, list) – Chain ID or IDs of which residues are a part of. If not provided, all chains in the mapped_chains attribute will be used. IMPORTANT: if that is also empty, all residues in all chains matching the residue numbers will be shown, which may not always be correct.
  • color (str) – Color to highlight with
  • structure_opacity (float) – Opacity of the protein structure cartoon representation
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

In [25]:
# View just the structure with selected residues
my_protein.representative_structure.view_structure_and_highlight_residues(structure_resnums=[1,2,3])
[2017-03-09 15:29] [ssbio.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and ( 1 or 2 or 3 )
In [26]:
# View the previously saved binding site
my_protein.representative_structure.view_structure_and_highlight_residues(structure_resnums=nucleotide_binding_site_remapped_to_structure)
[2017-03-09 15:29] [ssbio.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and ( 33 or 34 or 35 or 74 or 11 or 76 or 12 or 13 or 14 or 15 or 75 or 10 )

Saving

Protein.save_json(outfile, compression=True)

Save the object as a JSON file using json_tricks

In [27]:
import os.path as op
my_protein.save_json(op.join(my_protein.protein_dir, '{}.json'.format(my_protein.id)), compression=False)
[2017-03-09 15:29] [ssbio.core.io] INFO: Saved <class 'ssbio.core.protein.Protein'> (id: SRR1753782_00918) to /tmp/SRR1753782_00918/SRR1753782_00918.json