We'll continue to explore the powerful database sequence search program BLAST as the course progresses. Other database searches are also an important part
of the bioinformatician's arsenal. When we screen a new sequence against
a database of known sequences, we are trying to answer the following
questions:
- Is there any protein of known structure that has sufficient
similarity to the sequence of the unknown protein to suggest
a familial relationship?
- If not, which sequence of any known proteins is most similar to
the sequence of the unknown protein?
If we can identify a relationship to a protein of known structure,
it is possible to infer that the new protein shares a common structure
with its relative and to assign its general fold. However, what if
the homologue has no known structure? If its function has been identified
then we might expect our unknown protein to have a similar or related
function. However, exceptions do exist. A classic example is lysozyme,
which shares around 50% sequence identity and 70% sequence similarity
with alpha-lactalbumin. The two proteins also share similar folds,
but their functions are entirely different: the two key catalytic residues
of lysozyme are not conserved in alpha-lactalbumin, and the acidic
calcium binding motif important to the function of alpha-lactalbumin
is not present in most lysozymes. It is essential that you confirm
any computer-based predictions with bench work.
What can you do if sequence similarity alone does not satisfactorily
identify a relative? We will show you a few more applications within
EMBOSS that can help you predict the function of your sequence.
In a number of cases, the active site of a protein can be recognized
by a specific "fingerprint" or "template", a fairly
small set of residues that are unique to a family of proteins. An example
is the sequence GXGXXG (where G=glycine and X=any amino acid), which
is characteristic of many nucleotide binding sites. Searching for a
(rather loose) predefined string of characters in a sequence is called
pattern matching, which should be familiar to you from a previous class.
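As a reminder of what pattern matching looks like in practice, here is a minimal Python sketch that searches for the GXGXXG motif with a regular expression. The sequence is an invented fragment used only for illustration, and the function name find_motif is hypothetical; this is not how any EMBOSS program is implemented.

```python
import re

# Hypothetical protein fragment, invented for this illustration only.
sequence = "MKVAVIGAGSGGLAAAKLLQ"

# G-X-G-X-X-G: glycine, any residue, glycine, any two residues, glycine.
# Wrapping the motif in a lookahead lets overlapping occurrences be reported.
pattern = re.compile(r"(?=(G.G..G))")

def find_motif(seq):
    """Return (1-based start, matched text) for every occurrence of the motif."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(seq)]

for start, text in find_motif(sequence):
    print(start, text)
```

Positions are reported 1-based, as is usual for residue numbering. Because the pattern is so loose, a match on its own is only a hint and not evidence of function.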
The EMBOSS program patmatmotifs scans a given protein sequence
for the patterns defined in the PROSITE database. PROSITE is a database
of protein families and domains, based on the observation that, while
there are a huge number of different proteins, most of them can be
grouped, on the basis of similarities in their sequences, into a limited
number of families. Proteins or protein domains belonging to a particular
family generally share functional attributes and are derived from a
common ancestor.
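PROSITE patterns are written in their own compact notation, for example [AG]-x(4)-G-K-[ST] for the ATP/GTP-binding P-loop. The sketch below translates the common parts of that notation into a Python regular expression, to show the idea behind this kind of search. The function name prosite_to_regex and the test fragment are assumptions made for this example, only a subset of the syntax is handled, and patmatmotifs itself may work quite differently.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")          # patterns often end with a period
    parts = []
    for element in pattern.split("-"):     # elements are separated by '-'
        # '<' and '>' anchor the pattern to the N- or C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Separate an optional repeat count such as (4) or (2,5).
        repeat = ""
        if element.endswith(")"):
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":                 # x means any residue
            core = "."
        elif element.startswith("{"):      # {P} means any residue except P
            core = "[^" + element[1:-1] + "]"
        else:                              # single residue or [alternatives]
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
fragment = "MTEYKLVVVGAGGVGKSALT"          # illustrative fragment only
match = re.search(regex, fragment)
if match:
    print(regex, match.start() + 1, match.group())
```

A converter like this makes it easy to see why short patterns produce chance hits: the resulting expression constrains only a handful of positions.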
PRINTS is a database that defines functional protein families, identifying
each family by a group of short, particularly well-conserved sequence motifs.
A full match to one of these "fingerprints" will match all
the relevant short sequences in the correct order. A partial match
is recorded if some are missing or if they occur in an incorrect order.
The PRINTS database can be searched using the pscan program, which is
available within EMBOSS.
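The difference between a full and a partial fingerprint match can be sketched in a few lines of Python. This is a deliberately simplified illustration: the motifs are treated as exact strings and only the first occurrence of each is considered, both of which are assumptions made here for clarity. The function name classify_fingerprint and the example motifs are hypothetical, and this is not the scoring method used by pscan.

```python
def classify_fingerprint(seq, motifs):
    """Classify how an ordered list of motifs matches a sequence.

    Returns 'full' when every motif is found and the motifs occur in the
    listed order, 'partial' when at least one motif is found but some are
    missing or out of order, and 'none' when no motif is found.
    """
    positions = [seq.find(m) for m in motifs]   # -1 marks a missing motif
    found = [p for p in positions if p >= 0]
    if not found:
        return "none"
    all_present = len(found) == len(motifs)
    in_order = found == sorted(found)
    return "full" if all_present and in_order else "partial"

# Hypothetical three-motif fingerprint and invented test sequences.
fingerprint = ["GAG", "KST", "WYV"]
print(classify_fingerprint("AAGAGCCKSTDDWYVEE", fingerprint))  # all three, in order
print(classify_fingerprint("AAWYVCCKSTDDGAGEE", fingerprint))  # all three, wrong order
print(classify_fingerprint("AAGAGCCDDWYVEE", fingerprint))     # middle motif missing
```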
Exercise 9 - patmatmotifs and pscan
A) Use patmatmotifs and pscan to try to uncover functional domains. Frankly, the globin protein is of limited interest here, so let's use a membrane protein instead: the platelet-derived growth factor receptor (PDGF-R). Search for known motifs in that protein.
Which motif(s) are found, and at what location(s)?
B) Sometimes the patterns sought in proteins are stochastic: there is a probability, rather than a certainty, that one or more residues are associated with one another to produce a particular function. Use pscan to scan our PDGF-R. Look at the documentation linked from the pscan page. How would you interpret the signatures that are found?
Page last modified
September 29, 2008