|
|
Research
Summary
Bioinformatics
Predicting Secondary Structure of Proteins using
Genetic Algorithms
Genetic algorithms are artificial intelligence procedures
that utilize the technique of Darwinian selection. They
"learn" solutions to complex problems - just as organisms
have done over billions of years - by the evolution of
increasingly effective rules.
Protein Secondary Structure
A particularly complex problem that concerns many in the
scientific community is to predict the secondary structures
of proteins using only amino acid sequence data. Secondary
structures are regular features found in proteins that
involve associations between the main chain NH and C=O
groups. The most important secondary elements are the alpha
helix and beta sheet (composed of beta strands). The
elements that connect alpha helices and beta strands are
called loops, essentially assemblages of amino acids,
usually located on the outside of proteins, that form an
irregular shape. Classifying a given protein's amino acids
into one of these three categories is the goal of the
research in my laboratory.
Our Implementation of the Genetic Algorithm
The rules of genetic algorithms are most often encoded as
binary strings. We therefore assigned one or two five digit
codes to each amino acid. As shown below, amino acids with
similar properties have similar code words.
|
Encoding Number
|
Amino Acid
|
|
00000 and 00001
|
Tryptophan
|
|
00010
|
Tyrosine
|
|
00011
|
Phenylalanine
|
|
00100
|
Leucine
|
|
00101
|
Methionine
|
|
00110
|
Isoleucine
|
|
00111
|
Valine
|
|
01000 and 01001
|
Cysteine
|
|
01010 and 01011
|
Alanine
|
|
01100 and 01101
|
Proline
|
|
01110 and 01111
|
Glycine
|
|
10000 and 10001
|
Threonine
|
|
10010 and 10011
|
Serine
|
|
10100 and 10101
|
Asparagine
|
|
10110 and 10111
|
Glutamine
|
|
11000 and 11001
|
Aspartic acid
|
|
11010 and 11011
|
Glutamic acid
|
|
11100 and 11101
|
Histidine
|
|
11110
|
Lysine
|
|
11111
|
Arginine
|
We then obtained amino acid sequence data from the DSSP -
http://www.sander.embl-heidelberg.de/dssp/
- a total of 118,854 amino acids-worth of data representing
474 proteins whose sequences were less than 25% similar.
These proteins had all been analyzed by X-ray
crystallography or NMR and, consequently, their secondary
structures were known.
Our overall strategy was
to begin with a set of rules like the one illustrated below
(the spaces are introduced for clarity):
10010 01001 00001 00100 => helix (2)
Translated into English using the coding scheme outline
above, this rule says that if a protein contains the
sequence serine (10010), cysteine (01001), tryptophan
(00001), and leucine (00100) then the second amino acid in
this sequence (cysteine) will be part of an alpha helix.
Each rule also has a value called "strength" associated with
it. A rule's strength is an indication of how well the rule
does in its prediction.
The program compares
this rule against all the amino acid sequences in the
database. When there is a match; that is, when the left-hand
portion of this rule matches some sequence found in the
database, the program looks to see if the second amino acid
(cysteine) is actually part of an alpha helix. If it is,
then the rule is rewarded and has its strength increased; if
it is not, then it is penalized decreasing its strength.
There is a further
complication. To make the rules more general, a third
character - # - is introduced in addition to the 0's and
1's. '#' is a "don't care" symbol and means either 1 or 0.
Thus a rule is more likely to look like the one below:
100#0 #1001 0#000 ##### => helix (3)
This rule can be translated as follows: If a protein
contains the sequence threonine (10000) or serine (10010) in
the first position, cysteine (01001) or aspartic acid
(11001) in the second, tryptophan (00000) or cysteine
(01000) in the third, and any amino acid at all in the
fourth, then the third amino acid is part of an alpha helix.
The reward and penalty structure is the same as outlined
above. Similar rules are created for predicting strands and
loop regions.
In ordinary genetic
algorithms, rules obtained by "training" in this way compete
with one another. In addition, good rules are recombined (in
a process that closely resembles genetic recombination) to
produce better ones. In the end, a set of "best" rules
evolve that are effective in solving the problem.
Our program works
somewhat differently. We begin by generating a rule with
'0's, '1's and '#'s in random positions. This rule is
created to predict either a helix, sheet or loop randomly.
For each rule, nine more are generated by changing one
position from a 1 to a 0, a 0 to a 1, a 1 to a #, etc. These
ten similar rules comprise a species. Each rule in each
species is then compared to the entire database and the
number of times it has predicted correctly and incorrectly
is computed.
Within a species, one or
more rules will usually be clearly better than others. These
are recombined with each other to produce new rules that
contain elements of both parents. Recombination consists of
taking a sequence from one rule and replacing it with a
sequence from the other, and the reciprocal. Such
recombination mimics natural sexual processes that occur
within a species. In addition, at infrequent intervals
mutations are introduced that change a 0 to a # or a 1 to a
# at random positions in the rule (in our system mutations
thus always increase the generality of a rule). Each time a
new rule is created or mutated, it is compared against the
entire database and its success rate is computed. While it
may not be intuitively obvious, the process of recombination
and selection soon results in all ten rules within a given
species becoming the same (as long as the mutation rate is
much lower than the recombination rate). When all ten rules
become identical, the computer determines whether their
strength surpass a certain threshold. If it does, the rule
is maintained. If not, it is discarded, and a new random
rule is created to undergo the same process. Over time, the
algorithm generates large numbers of relatively successful
rules.
Results
The algorithm has been trained on 100,000 amino acids
utilizing the data set mentioned above. The remaining 18,854
were held out; i.e., these amino acid sequences were never
exposed to the training and testing program. Four sets of
rules have been generated. Each set has a different number
of amino acids per rule. Four, five, six or eight amino
acids were used. Hundreds of thousands of rules have been
produced to date in each set.
A second program has
been developed to use these rules to predict the secondary
structure of the 18,854 amino acids to which the program had
not been previously exposed. This second program
progressively tries to match each amino acid sequence (four,
five, six or eight amino acids at a time) with the rule
collection. In some cases, no rules matched a particular
sequence. In other cases, more than one rule matched, and
these predicted different structures. Often one or more
rules matched a sequence, and all predicted the same
structure. By easing the stringency of the criteria used to
make a prediction, we can increase the number of predictions
made, usually at the expense of accuracy.
As expected, four length
rules were the least effective in predicting secondary
structure. Even with 64,000 rules, the best that the program
could do was to predict secondary structure with 60%
accuracy for 80% of the amino sampled. Increasing the number
of four length rules to 85,000 did not increase accuracy or
generality. Five and six length rules did better, achieving
about 64% accuracy with about 100,000 rules. We do not know
if an increase in the number of five and six length rules
will increase accuracy. We are currently generating eight
amino acid length rules and are investigating their
potential for prediction. It should be borne in mind that
the very best prediction programs, using procedures other
than genetic algorithms, achieve about 70% accuracy in
predicting secondary structure.
Publications
1. Freidman R, Hotaling E, Borack L, Sofer W. 1996.
Interactions between the regulatory regions of two Adh
alleles. Genetica Jan;97(1):1-14
2. Sofer W, Tompkins L. 1994.
Drosophila genetics in the classroom. Genetics
Jan;136(1):417-422
3. Shen NL, Hotaling EC, Subrahmanyam G, Martin PF, Sofer W.
1991.
Analysis of sequences regulating larval expression of the
Adh gene of Drosophila melanogaster. Genetics
Nov;129(3):763-771
4. Rothberg I, Hotaling E, Sofer W. 1991.
A Drosophila Adh gene can be activated in trans by an
enhancer. Nucleic Acids Res Oct 25;19(20):5713-5717
5. Shen NL, Subrahmanyam G, Clark W, Martin P, Sofer W.
1989.
Analysis of Adh gene regulation in Drosophila: studies using
somatic transformation. Dev Genet;10(3):210-219
6. Place AR, Benyajati C, Sofer W. 1987.
Molecular consequences of two formaldehyde-induced mutations
in the alcohol dehydrogenase gene of Drosophila
melanogaster. Biochem Genet Oct;25(9-10):621-638.
7. Sofer W, Martin P. 1987.
Analysis of densitometric data obtained from electrophoretic
analysis. Comput Appl Biosci Jun;3(2):129.
8. Fargnoli J, Hyde J, Sofer W. 1987.
Chemical selection for beta-galactosidase activity in
Drosophila melanogaster. Biochem Genet 1987
Jun;25(5-6):327-333.
9. Fargnoli J, Sofer W. 1987 May 1.
In situ detection of enzymes in yeast. Anal
Biochem;162(2):384-388.
10. Sofer W, Martin PF. 1987.
Analysis of alcohol dehydrogenase gene expression in
Drosophila. Annu Rev Genet;21:203-225.
11. Martin P, Martin A, Osmani A, Sofer W. 1986 Oct.
A transient expression assay for tissue-specific gene
expression of alcohol dehydrogenase in Drosophila. Dev
Biol;117(2):574-580.
12. Martin PF, Place AR, Pentz E, Sofer W. 1985 Jul 20.
UGA nonsense mutation in the alcohol dehydrogenase gene of
Drosophila melanogaster. J Mol Biol;184(2):221-229.
13. Benyajati C, Place AR, Sofer W. 1983 Sep.
Formaldehyde mutagenesis in Drosophila. Molecular analysis
of ADH-negative mutants. Mutat Res;111(1):1-7.
14. Kubli E, Schmidt T, Martin PF, Sofer W. 1982 Nov 25.
In vitro suppression of a nonsense mutant of Drosophila
melanogaster. Nucleic Acids Res;10(22):7145-7152.
15. Benyajati C, Place AR, Wang N, Pentz E, Sofer W. 1982
Nov 25.
Deletions at intervening sequence splice sites in the
alcohol dehydrogenase gene of Drosophila. Nucleic Acids
Res;10(22):7261-7272.
16. Pelliccia JG, Sofer W. 1982 Apr.
Synthesis and degradation of alcohol dehydrogenase in
wild-type and Adh-null activity mutants of Drosophila
melanogaster. Biochem Genet;20(3-4):297-313.
17. Reddy AR, Sofer W. 1981 Dec 15.
A rapid procedure to detect in situ DNA/RNA hybrids.
Biochem Biophys Res Commun;103(3):959-967.
18. Benyajati C, Place AR, Powers DA, Sofer W. 1981 May.
Alcohol dehydrogenase gene of Drosophila melanogaster:
relationship of intervening sequences to functional domains
in the protein. Proc Natl Acad Sci U S
A;78(5):2717-2721.
19. Benyajati C, Wang N, Reddy A, Weinberg E, Sofer W. 1980
Dec 11.
Alcohol dehydrogenase in Drosophila: isolation and
characterization of messenger RNA and cDNA clone.
Nucleic Acids Res;8(23):5649-5667.
20. Fishbein JC, Place AR, Ropson IJ, Powers DA, Sofer W.
1980 Oct.
Thin-layer peptide mapping: quantitative analysis and
sequencing at the nanomole level. Anal
Biochem;108(1):193-201.
21. Reddy AR, Pelliccia JG, Sofer W. 1980 Apr.
Adh-negative mutants: detection of an altered tryptic
peptide in a mutant enzyme of Drosophila. Biochem
Genet;18(3-4):339-351.
22. Papel I, Henderson M, Van Herrewege J, David J, Sofer W.
1979 Jun.
Drosophila alcohol dehydrogenase activity in vitro and in
vivo: effects of acetone feeding. Biochem
Genet;17(5-6):553-563.
23. Schwartz M, O'Donnell J, Sofer W. 1979 May.
Origin of the multiple forms of alcohol dehydrogenase from
Drosophila melanogaster. Arch Biochem
Biophys;194(2):365-378.
24. O'Donnell J, Mandel HC, Krauss M, Sofer W. 1977 Jul.
Genetic and cytogenetic analysis of the Adh region in
Drosophila melanogaster. Genetics;86(3):553-566.
25. Schwartz M, Sofer W. 1976 Sep 9.
Diet-induced alterations in distribution of multiple forms
of alcohol dehydrogenase in Drosophila.
Nature;263(5573):129-131.
26. Schwartz M, Sofer W. 1976 May.
Alcohol dehydrogenase-negative mutants in Drosophila:
defects at the structural locus?
Genetics;83(1):125-136.
27. Vigue C, Sofer W. 1976 Feb.
Chemical selection of mutants that affect ADH activity in
Drosophila. III. Effects of ethanol. Biochem
Genet;14(1-2):127-135.
28. O'Donnell J, Gerace L, Leister F, Sofer W. 1975 Jan.
Chemical selection of mutants that affect alcohol
dehydrogenase in Drosophila. II. Use of 1-pentyne-3-ol.
Genetics;79(1):73-83.
29. Vigue C, Sofer W. 1974 May.
Adh-n5: a temperature-sensitive mutant at the Adh locus in
Drosophila. Biochem Genet;11(5):387-396.
30. Borack LI, Sofer W. 1971 Sep 10.
Drosophila beta-L-hydroxy acid dehydrogenase. Purification
and properties. J Biol Chem;246(17):5345-5350.
31. Sofer W, Ursprung H. 1968 Jun 10.
Drosophila alcohol dehydrogenase. Purification and partial
characterization. J Biol Chem;243(11):3110-3115.
Lab
Support
Hui Zhang, Graduate Student
Glen
Charydczak, Research Associate
|