Research Summary

Bioinformatics

Predicting Secondary Structure of Proteins using Genetic Algorithms
Genetic algorithms are artificial intelligence procedures that utilize the technique of Darwinian selection. They "learn" solutions to complex problems - just as organisms have done over billions of years - by the evolution of increasingly effective rules.


Protein Secondary Structure
A particularly complex problem that concerns many in the scientific community is to predict the secondary structures of proteins using only amino acid sequence data. Secondary structures are regular features found in proteins that involve associations between the main chain NH and C=O groups. The most important secondary elements are the alpha helix and beta sheet (composed of beta strands). The elements that connect alpha helices and beta strands are called loops, essentially assemblages of amino acids, usually located on the outside of proteins, that form an irregular shape. Classifying a given protein's amino acids into one of these three categories is the goal of the research in my laboratory.

Our Implementation of the Genetic Algorithm
The rules of genetic algorithms are most often encoded as binary strings. We therefore assigned one or two five digit codes to each amino acid. As shown below, amino acids with similar properties have similar code words.




Encoding Number

Amino Acid

00000 and 00001

Tryptophan

00010

Tyrosine

00011

Phenylalanine

00100

Leucine

00101

Methionine

00110

Isoleucine

00111

Valine

01000 and 01001

Cysteine

01010 and 01011

Alanine

01100 and 01101

Proline

01110 and 01111

Glycine

10000 and 10001

Threonine

10010 and 10011

Serine

10100 and 10101

Asparagine

10110 and 10111

Glutamine

11000 and 11001

Aspartic acid

11010 and 11011

Glutamic acid

11100 and 11101

Histidine

11110

Lysine

11111

Arginine

We then obtained amino acid sequence data from the DSSP - http://www.sander.embl-heidelberg.de/dssp/ - a total of 118,854 amino acids-worth of data representing 474 proteins whose sequences were less than 25% similar. These proteins had all been analyzed by X-ray crystallography or NMR and, consequently, their secondary structures were known.
      Our overall strategy was to begin with a set of rules like the one illustrated below (the spaces are introduced for clarity):

10010 01001 00001 00100 => helix (2)


Translated into English using the coding scheme outline above, this rule says that if a protein contains the sequence serine (10010), cysteine (01001), tryptophan (00001), and leucine (00100) then the second amino acid in this sequence (cysteine) will be part of an alpha helix. Each rule also has a value called "strength" associated with it. A rule's strength is an indication of how well the rule does in its prediction.
      The program compares this rule against all the amino acid sequences in the database. When there is a match; that is, when the left-hand portion of this rule matches some sequence found in the database, the program looks to see if the second amino acid (cysteine) is actually part of an alpha helix. If it is, then the rule is rewarded and has its strength increased; if it is not, then it is penalized decreasing its strength.
      There is a further complication. To make the rules more general, a third character - # - is introduced in addition to the 0's and 1's. '#' is a "don't care" symbol and means either 1 or 0. Thus a rule is more likely to look like the one below:

100#0 #1001 0#000 ##### => helix (3)


This rule can be translated as follows: If a protein contains the sequence threonine (10000) or serine (10010) in the first position, cysteine (01001) or aspartic acid (11001) in the second, tryptophan (00000) or cysteine (01000) in the third, and any amino acid at all in the fourth, then the third amino acid is part of an alpha helix. The reward and penalty structure is the same as outlined above. Similar rules are created for predicting strands and loop regions.
      In ordinary genetic algorithms, rules obtained by "training" in this way compete with one another. In addition, good rules are recombined (in a process that closely resembles genetic recombination) to produce better ones. In the end, a set of "best" rules evolve that are effective in solving the problem.
      Our program works somewhat differently. We begin by generating a rule with '0's, '1's and '#'s in random positions. This rule is created to predict either a helix, sheet or loop randomly. For each rule, nine more are generated by changing one position from a 1 to a 0, a 0 to a 1, a 1 to a #, etc. These ten similar rules comprise a species. Each rule in each species is then compared to the entire database and the number of times it has predicted correctly and incorrectly is computed.
      Within a species, one or more rules will usually be clearly better than others. These are recombined with each other to produce new rules that contain elements of both parents. Recombination consists of taking a sequence from one rule and replacing it with a sequence from the other, and the reciprocal. Such recombination mimics natural sexual processes that occur within a species. In addition, at infrequent intervals mutations are introduced that change a 0 to a # or a 1 to a # at random positions in the rule (in our system mutations thus always increase the generality of a rule). Each time a new rule is created or mutated, it is compared against the entire database and its success rate is computed. While it may not be intuitively obvious, the process of recombination and selection soon results in all ten rules within a given species becoming the same (as long as the mutation rate is much lower than the recombination rate). When all ten rules become identical, the computer determines whether their strength surpass a certain threshold. If it does, the rule is maintained. If not, it is discarded, and a new random rule is created to undergo the same process. Over time, the algorithm generates large numbers of relatively successful rules.

Results
The algorithm has been trained on 100,000 amino acids utilizing the data set mentioned above. The remaining 18,854 were held out; i.e., these amino acid sequences were never exposed to the training and testing program. Four sets of rules have been generated. Each set has a different number of amino acids per rule. Four, five, six or eight amino acids were used. Hundreds of thousands of rules have been produced to date in each set.
      A second program has been developed to use these rules to predict the secondary structure of the 18,854 amino acids to which the program had not been previously exposed. This second program progressively tries to match each amino acid sequence (four, five, six or eight amino acids at a time) with the rule collection. In some cases, no rules matched a particular sequence. In other cases, more than one rule matched, and these predicted different structures. Often one or more rules matched a sequence, and all predicted the same structure. By easing the stringency of the criteria used to make a prediction, we can increase the number of predictions made, usually at the expense of accuracy.
      As expected, four length rules were the least effective in predicting secondary structure. Even with 64,000 rules, the best that the program could do was to predict secondary structure with 60% accuracy for 80% of the amino sampled. Increasing the number of four length rules to 85,000 did not increase accuracy or generality. Five and six length rules did better, achieving about 64% accuracy with about 100,000 rules. We do not know if an increase in the number of five and six length rules will increase accuracy. We are currently generating eight amino acid length rules and are investigating their potential for prediction. It should be borne in mind that the very best prediction programs, using procedures other than genetic algorithms, achieve about 70% accuracy in predicting secondary structure.

Publications

1. Freidman R, Hotaling E, Borack L, Sofer W. 1996. Interactions between the regulatory regions of two Adh alleles. Genetica Jan;97(1):1-14

2. Sofer W, Tompkins L. 1994. Drosophila genetics in the classroom. Genetics Jan;136(1):417-422

3. Shen NL, Hotaling EC, Subrahmanyam G, Martin PF, Sofer W. 1991. Analysis of sequences regulating larval expression of the Adh gene of Drosophila melanogaster. Genetics Nov;129(3):763-771

4. Rothberg I, Hotaling E, Sofer W. 1991. A Drosophila Adh gene can be activated in trans by an enhancer. Nucleic Acids Res Oct 25;19(20):5713-5717

5. Shen NL, Subrahmanyam G, Clark W, Martin P, Sofer W. 1989. Analysis of Adh gene regulation in Drosophila: studies using somatic transformation. Dev Genet;10(3):210-219

6. Place AR, Benyajati C, Sofer W. 1987. Molecular consequences of two formaldehyde-induced mutations in the alcohol dehydrogenase gene of Drosophila melanogaster. Biochem Genet Oct;25(9-10):621-638.

7. Sofer W, Martin P. 1987. Analysis of densitometric data obtained from electrophoretic analysis. Comput Appl Biosci Jun;3(2):129.

8. Fargnoli J, Hyde J, Sofer W. 1987. Chemical selection for beta-galactosidase activity in Drosophila melanogaster. Biochem Genet 1987 Jun;25(5-6):327-333.

9. Fargnoli J, Sofer W. 1987 May 1. In situ detection of enzymes in yeast. Anal Biochem;162(2):384-388.

10. Sofer W, Martin PF. 1987. Analysis of alcohol dehydrogenase gene expression in Drosophila. Annu Rev Genet;21:203-225.

11. Martin P, Martin A, Osmani A, Sofer W. 1986 Oct. A transient expression assay for tissue-specific gene expression of alcohol dehydrogenase in Drosophila. Dev Biol;117(2):574-580.

12. Martin PF, Place AR, Pentz E, Sofer W. 1985 Jul 20. UGA nonsense mutation in the alcohol dehydrogenase gene of Drosophila melanogaster. J Mol Biol;184(2):221-229.

13. Benyajati C, Place AR, Sofer W. 1983 Sep. Formaldehyde mutagenesis in Drosophila. Molecular analysis of ADH-negative mutants. Mutat Res;111(1):1-7.

14. Kubli E, Schmidt T, Martin PF, Sofer W. 1982 Nov 25. In vitro suppression of a nonsense mutant of Drosophila melanogaster. Nucleic Acids Res;10(22):7145-7152.

15. Benyajati C, Place AR, Wang N, Pentz E, Sofer W. 1982 Nov 25. Deletions at intervening sequence splice sites in the alcohol dehydrogenase gene of Drosophila. Nucleic Acids Res;10(22):7261-7272.

16. Pelliccia JG, Sofer W. 1982 Apr. Synthesis and degradation of alcohol dehydrogenase in wild-type and Adh-null activity mutants of Drosophila melanogaster. Biochem Genet;20(3-4):297-313.

17. Reddy AR, Sofer W. 1981 Dec 15. A rapid procedure to detect in situ DNA/RNA hybrids. Biochem Biophys Res Commun;103(3):959-967.

18. Benyajati C, Place AR, Powers DA, Sofer W. 1981 May. Alcohol dehydrogenase gene of Drosophila melanogaster: relationship of intervening sequences to functional domains in the protein. Proc Natl Acad Sci U S A;78(5):2717-2721.

19. Benyajati C, Wang N, Reddy A, Weinberg E, Sofer W. 1980 Dec 11. Alcohol dehydrogenase in Drosophila: isolation and characterization of messenger RNA and cDNA clone. Nucleic Acids Res;8(23):5649-5667.

20. Fishbein JC, Place AR, Ropson IJ, Powers DA, Sofer W. 1980 Oct. Thin-layer peptide mapping: quantitative analysis and sequencing at the nanomole level. Anal Biochem;108(1):193-201.

21. Reddy AR, Pelliccia JG, Sofer W. 1980 Apr. Adh-negative mutants: detection of an altered tryptic peptide in a mutant enzyme of Drosophila. Biochem Genet;18(3-4):339-351.

22. Papel I, Henderson M, Van Herrewege J, David J, Sofer W. 1979 Jun. Drosophila alcohol dehydrogenase activity in vitro and in vivo: effects of acetone feeding. Biochem Genet;17(5-6):553-563.

23. Schwartz M, O'Donnell J, Sofer W. 1979 May. Origin of the multiple forms of alcohol dehydrogenase from Drosophila melanogaster. Arch Biochem Biophys;194(2):365-378.

24. O'Donnell J, Mandel HC, Krauss M, Sofer W. 1977 Jul. Genetic and cytogenetic analysis of the Adh region in Drosophila melanogaster. Genetics;86(3):553-566.

25. Schwartz M, Sofer W. 1976 Sep 9. Diet-induced alterations in distribution of multiple forms of alcohol dehydrogenase in Drosophila. Nature;263(5573):129-131.

26. Schwartz M, Sofer W. 1976 May. Alcohol dehydrogenase-negative mutants in Drosophila: defects at the structural locus? Genetics;83(1):125-136.

27. Vigue C, Sofer W. 1976 Feb. Chemical selection of mutants that affect ADH activity in Drosophila. III. Effects of ethanol. Biochem Genet;14(1-2):127-135.

28. O'Donnell J, Gerace L, Leister F, Sofer W. 1975 Jan. Chemical selection of mutants that affect alcohol dehydrogenase in Drosophila. II. Use of 1-pentyne-3-ol. Genetics;79(1):73-83.

29. Vigue C, Sofer W. 1974 May. Adh-n5: a temperature-sensitive mutant at the Adh locus in Drosophila. Biochem Genet;11(5):387-396.

30. Borack LI, Sofer W. 1971 Sep 10. Drosophila beta-L-hydroxy acid dehydrogenase. Purification and properties. J Biol Chem;246(17):5345-5350.

31. Sofer W, Ursprung H. 1968 Jun 10. Drosophila alcohol dehydrogenase. Purification and partial characterization. J Biol Chem;243(11):3110-3115.


Lab Support

Hui Zhang, Graduate Student
Glen Charydczak, Research Associate