Repetition as the Essence of Life on this Earth: Music and Genes
S. Ohno     Hämatol. Bluttransf. Vol. 31

Beckman Research Institute of the City of Hope, Duarte, California 91010, USA

A. Introduction

While life on this earth is believed to have started a few billion or more years ago, the number of true innovations in evolution appears to have been rather dismally small. Most of the successful adaptive radiation of living organisms has apparently been accomplished by extensive plagiarization of those precious few innovations via the mechanism of gene duplication [1]. Furthermore, it appears that most of these true innovations occurred at the very beginning, before the division of prokaryotes from eukaryotes. For example, nearly all the sugar-metabolizing enzymes appear to have achieved their inviolable functional competence at the above-noted early date. Natural selection has since been spinning its wheels in the air.

B. The Story of Glyceraldehyde 3-Phosphate Dehydrogenase

It will be noted in Fig. 1 that the 332-residue-long glyceraldehyde 3-phosphate dehydrogenase of the pig differs from the lobster enzyme at only 86 positions. Inasmuch as vertebrates, or rather chordates, diverged from crustaceans roughly 500 million years ago, one can conclude from the above and similar data on additional species that this enzyme has been undergoing 1% amino acid sequence divergence every 20 million years, thus accumulating 26% amino acid sequence difference in 500 million years. If such a rate calculation can be extended indefinitely, however, even at this snail's pace one still expects this enzyme to have undergone 100% amino acid sequence divergence in 2 billion years. Now 2 billion years ago would have been about the time prokaryotes diverged from eukaryotes. Yet the bacterial amino acid sequence from Bacillus stearothermophilus, also shown in Fig. 1, still maintains homology with the pig enzyme at 177 of the 332 sites (53%), and a similar homology with the lobster enzyme at 180 of the 332 sites. In fact, there are 19 segments (tripeptidic or longer), comprising 92 residues in total, that remain invariant in all three species. The longest conserved segment, a tridecapeptide occupying the 144th to 156th positions, represents the most critical of the substrate binding sites, the 149th residue, Cys, forming the thiol linkage with substrate intermediates [2]. Indeed, after achieving the appropriate degree of functional competence 2 billion or more years ago, glyceraldehyde 3-phosphate dehydrogenase has not changed in its essence; the evolutionarily compatible amino acid substitutions that accompanied successive diversification and speciation merely symbolize futile spinning of the wheel. Such futility is also evident in Fig. 1, for at 14 positions a eukaryote (the pig) and a prokaryote (Bacillus stearothermophilus) share the identical residue, while the other eukaryote (the lobster) is left out as an oddball; e.g., the third position of the pig and the bacillus is Val, while that of the lobster is Ile. At these and many other positions, the game of musical chairs has apparently been in play among a limited number of functionally compatible amino acids.

Fig. 1. The amino acid sequences of glyceraldehyde 3-phosphate dehydrogenases from three divergent species are compared. Bacillus refers to Bacillus stearothermophilus. Discordant and identical residues are shown slightly displaced from each other; discordant ones are placed a little above identical ones. Amino acid residues of tripeptidic or longer conserved segments are shown in large capital letters, and these segments are boxed in. Deleted residues are identified as black boxes

Analogous situations have been found with regard to other sugar-metabolizing enzymes, e.g., phosphoglycerate kinase, triose isomerase, etc. Furthermore, all these sugar-metabolizing enzymes are cast in the same mould: the amino terminal half and the carboxyl terminal half form two distinct domains, and a cleft between the two accommodates the substrate and the coenzyme. The amino terminal half is for coenzyme binding and the carboxyl terminal half is for substrate binding. Furthermore, Rossmann [3], among others, has pointed out that in the case of kinases, the mononucleotide (e.g., ATP) binding site of the amino terminal half is comprised of three β-sheet-forming segments and two α-helix-forming segments in the following order from the amino terminus: β α β α β. The dinucleotide (NAD or NADP) binding site of dehydrogenases, on the other hand, evolved from the above by duplication; thus, it can be expressed as 2 × β α β α β. Inasmuch as the most critical portion of the substrate binding site evolved within the last segment of the duplicate (e.g., the 144th to 156th tridecapeptide of Fig. 1), this intrusion of the substrate binding active site into the dinucleotide binding domain froze the dinucleotide binding domain of each enzyme as uniquely its own. Thus, there is no more than 20% amino acid sequence homology between dinucleotide binding sites of different enzymes, in spite of the fact that all are made of the same 2 × β α β α β mould. It will be recalled that within the same enzyme, conservation of greater than 50% homology is the rule for the whole enzyme and, therefore, for its dinucleotide binding amino terminal half. At any rate, two notable facts emerge from the above. First, coding sequences for sugar-metabolizing enzymes and probably for many other enzymes (e.g., proteases) had already achieved the appropriate degree of functional competence before the division of prokaryotes from eukaryotes. Second, repetitions were the rule of the game from the very onset of life on this earth: the dinucleotide binding site evolved from the mononucleotide binding site by duplication, and the mononucleotide binding site itself is likely to have evolved by 2.5-fold duplication of one β α or α β unit.
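The rate arithmetic of this section can be checked in a few lines of Python. The sketch below uses only the counts quoted in the text (332 sites, 86 pig-lobster differences, 177 and 180 bacterial identities, 92 invariant residues); the variable names are hypothetical, and the extrapolation is the simple linear one described above, not a corrected evolutionary distance.

```python
# Counts quoted in the text for glyceraldehyde 3-phosphate dehydrogenase.
SITES = 332
PIG_LOBSTER_DIFFERENCES = 86
DIVERGENCE_MYR = 500          # chordate-crustacean split, millions of years
PROKARYOTE_SPLIT_MYR = 2000   # prokaryote-eukaryote split, millions of years


def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


# Observed pig-lobster divergence and the rate it implies.
observed_divergence = percent(PIG_LOBSTER_DIFFERENCES, SITES)
myr_per_percent = DIVERGENCE_MYR / observed_divergence

# Naive linear extrapolation of that rate back to the prokaryote split.
extrapolated_divergence = observed_divergence * PROKARYOTE_SPLIT_MYR / DIVERGENCE_MYR

# What is actually observed between the bacterial and animal enzymes.
bacillus_pig_identity = percent(177, SITES)
bacillus_lobster_identity = percent(180, SITES)
invariant_fraction = percent(92, SITES)

print(f"pig vs lobster divergence: {observed_divergence:.1f}%")
print(f"million years per 1% divergence: {myr_per_percent:.1f}")
print(f"linear extrapolation to 2 billion years: {extrapolated_divergence:.1f}%")
print(f"Bacillus vs pig identity: {bacillus_pig_identity:.1f}%")
print(f"Bacillus vs lobster identity: {bacillus_lobster_identity:.1f}%")
print(f"residues invariant in all three species: {invariant_fraction:.1f}%")
```

The linear extrapolation passes 100%, whereas roughly half of the sites are in fact still shared with the bacterial enzyme; that gap is the discrepancy on which the argument rests.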

C. Ingeniousness Embodied in the First Set of Coding Sequences that Were Repeats of Base Oligomers

Orgel's group [4] has shown that in the presence of Zn ion, nonenzymatic synthesis of nucleic acids occurs in the proper 3'-5' linkage, provided that there is a template. Thus, it would appear that what was in short supply in the prebiotic world, before the emergence of life on this earth, was long templates from which copies could be made. Put more succinctly, the first primordial question is: "How did oligonucleotides manage to extend themselves to become worthy coding sequences?" There is one simple answer: one tandem duplication of a preexisting oligomer assures indefinite extension of that template, as illustrated at the top of Fig. 2. What if the heptameric template CAGCCTG duplicated to become a tetradecamer? After completion of its complementary strand, the two might pair in the manner shown, the second copy pairing with the first copy of the complementary strand. The paired portion would now serve as the primer for the next round of nucleic acid synthesis. At the completion of the second round, the 14-mer template becomes a 21-mer. In this way, the indefinite extension of the template is assured a priori, a paired segment always serving as a primer for the next round of nucleic acid synthesis. The above, then, is the first reason for believing that the first set of coding sequences, or rather all nucleic acids in the prebiotic world that presaged the emergence of life on this earth, were repeats of various base oligomers.

Fig. 2. Replication of nucleic acids is based upon the inherent complementarity that exists between two purine-pyrimidine pairs; A pairs with T or U, while G pairs with C. Accordingly, provided that there is a template (the heptamer CAGCCTG shown at the top), mononucleotides would readily assemble themselves in the 3', 5' linkage to form a complementary strand in the presence of Zn [4], as shown at the top. What was in short supply in the prebiotic world, then, were templates of substantial length. What if the above-noted heptamer repeated itself in tandem, or some of the base oligomers were by chance tandem repeats (two copies of a shorter oligomer) to begin with? It and its complementary strand can pair unequally in the manner depicted in the middle. As a paired segment now functions as a primer for the next round of nucleic acid synthesis, infinite extension of templates is now assured. All it takes to start this process is one tandem duplication. Of the long oligomeric repeats thus formed, those that evolved to be the first set of coding sequences likely started from oligomeric units whose numbers of bases were not multiples of three. There were two distinct advantages: (1) They gave longer periodicities to polypeptide chains; e.g., repeats of a base octamer would have given octapeptidic periodicity, while repeats of a base nonamer would have given only tripeptidic periodicity. (2) They would have encoded polypeptide chains of identical periodicity in all three reading frames. Within the periodic unit, such repeats could have given both an α-helical segment and a β-sheet-forming segment, as shown at the bottom. Such alternating α/β structures gave rise to the mononucleotide binding site [3], which would have been utilized immediately as part of the primitive nucleic acid polymerase. Later they gave rise to the ATP and NAD, NADP binding sites of many enzymes, as discussed in the text

How accurate was the copying function of nonenzymatic nucleic acid replication? Of the various nucleic acid polymerases known, the most error prone appear to be the reverse transcriptases of retroviruses, for their error rate has been estimated to be of the order of 10^-3 per base pair per year [5]. This is a one million times higher error rate than that of the DNA polymerases of vertebrates, and at this rate there would be 100% base sequence change every one thousand years. The inherent error rate of prebiotic, nonenzymatic nucleic acid replication is therefore expected to be higher than the above-noted 10^-3; as error prone as they are, reverse transcriptases are, after all, enzymes of a sort. Prebiotic coding sequences had to contend with this very high replication error rate and should still have been able to encode polypeptide chains of potential function. Provided that the number of bases in the oligomeric unit was not a multiple of three, repeats of the base oligomer would have been very stable under this most trying circumstance of constant base substitutions, deletions, and insertions. This is also illustrated at the bottom of Fig. 2. Since the length of the monodecamer CGAAGCTGCTG (11 bases) is not divisible by 3, three consecutive copies of it, translated in three different reading frames, give monodecapeptidic periodicity to a polypeptide chain. Contrast the above with repeats of a base dodecamer, which can give only tetrapeptidic periodicity to the polypeptide chain. Furthermore, since within a given reading frame three consecutive copies of the monodecamer are translated in all three reading frames, such repeats encode polypeptide chains of identical periodicity in all three reading frames. This openness of all three reading frames gives them a great deal of imperviousness to base substitutions, deletions, and insertions. Repeats of the monodecamer shown at the bottom of Fig. 2 encode both a potentially α-helix-forming segment and a potentially β-sheet-forming segment within one monodecapeptidic unit. In fact, sugar-metabolizing enzymes in general and phosphoglycerate kinase in particular might originally have been encoded by repeats of such a monodecamer, for the AAGCTGCTG portion of the monodecameric unit recurs in many variations in the modern coding sequence (e.g., of man) for phosphoglycerate kinase, as already noted in our previous paper [6].
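The reading-frame argument of this section lends itself to a small computational check. The Python sketch below translates tandem repeats of the monodecamer CGAAGCTGCTG in all three frames using the standard genetic code, and compares the result with a 12-base unit. Everything in it is illustrative: the function names are hypothetical, the rule that the peptide period equals the unit length divided by its greatest common divisor with 3 is a restatement of the text's claim, and the dodecamer GTGGTCCTTTGG is borrowed from the Fig. 4 discussion purely as a convenient 12-base example.

```python
from math import gcd

# Standard genetic code for DNA codons, with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(sequence, frame=0):
    """Translate a DNA string from frame offset 0, 1, or 2 (one-letter code)."""
    codons = (sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def peptide_period(unit_length):
    """Residues per repeat when a unit of this many bases is tandemly repeated."""
    return unit_length // gcd(unit_length, 3)


def smallest_period(text):
    """Smallest shift p with text[i] == text[i + p] wherever both exist."""
    for p in range(1, len(text) + 1):
        if all(text[i] == text[i + p] for i in range(len(text) - p)):
            return p


def translate_all_frames(unit, copies=6):
    """Translate a tandem repeat of the unit in each of the three frames."""
    repeat = unit * copies
    return [translate(repeat, frame) for frame in range(3)]


MONODECAMER = "CGAAGCTGCTG"   # 11 bases, from the text
DODECAMER = "GTGGTCCTTTGG"    # 12 bases, used here only as an example

for unit in (MONODECAMER, DODECAMER):
    print(unit, "predicted peptide period:", peptide_period(len(unit)))
    for frame, peptide in enumerate(translate_all_frames(unit)):
        print("  frame", frame, peptide, "period", smallest_period(peptide))
```

Under these assumptions the 11-base unit yields the same 11-residue repeat in every frame, merely entered at a different point, whereas the 12-base unit yields a different tetrapeptide repeat in each frame.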

D. Repetition as the Essence of Coding Sequences and Musical Compositions

The earth on which life has evolved has always been governed by a hierarchy of periodicities. First, the earth rotates on its own axis to create days, while the moon's revolution around the earth gives months, with neap tides and spring tides, to be topped by years, reflecting the earth's travel around the sun. It is small wonder if life itself was born out of periodicities embodied in the repetition of unit base oligomers. Just as man eventually devised seconds, minutes, and hours as arbitrary units of time measurement, one of the periodicities embodied in polypeptide chains encoded by the first set of coding sequences that were oligomeric repeats must soon have been chosen as the arbitrary time-measuring unit by the ancestral biological clock. It now appears that this arbitrarily chosen unit was the simplest, dipeptidic periodicity. The polypeptide chain encoded by the per locus of Drosophila melanogaster, fundamentally involved in the expression of biological rhythms such as circadian behaviors and the 55-s rhythm of the courtship song, is largely comprised of Gly-Thr dipeptidic repeats interspersed with short stretches of its deviant, Gly-Ser dipeptidic repeats, and the homologous gene encoding a polypeptide chain of the above-noted dipeptidic periodicity is conserved in the mouse as well [7]. Observing the per locus coding sequence, one notices that there have been numerous neutral base substitutions, e.g., free base substitutions at the redundant 3rd base position of glycine codons. Thus, it would appear that the timekeeping was done from the beginning at the polypeptide level rather than at the level of coding sequences, although the initial periodicity of that polypeptide chain had to be the consequence of its coding sequence being repeats of unit base oligomers.

Now we come to the origin, shrouded in mist, of the prehistory of musical compositions. Inasmuch as songs of canaries and skylarks are as pleasing to our ears as they must be to their mates as well as to themselves, it is clear that melodies as such are no human invention. Furthermore, the vocal cords and other sound-making apparatuses of our immediate relatives (e.g., Homo neanderthalensis) appear to have been rather underdeveloped. Accordingly, I wonder if early Homo sapiens were capable of imitating the beautiful bird songs noted above even if they wanted to. I would rather believe that music as such was invented by primitive man as a purely rhythmic timekeeping device. For example, a hunting party intent on bringing down a mammoth or two would have had to coordinate the activities of several cohorts spread over a wide arc surrounding the herd of mammoths. This, I suspect, was done by rhythmic beating of hollowed tree trunks, for example, fast repetitions of a given rhythm conveying an urgent need to close in, whereas slow repetitions of the same rhythm meant a cautious approach. It would thus appear that music, too, was initially born out of repetitious rendition. Even today, in the age of wondrous melodies, music is still used as a timekeeping device, as in dancing and military parades. The rhythm of the latter, marching music, is essentially that of our heartbeat. Our heart beats slowly in slumber and contemplation, while it beats uncontrollably fast in fright. The rhythm of marching music should be somewhere in between, to indicate willingness either to go forth against formidable adversaries or to defend against adversaries until death. Because of this homage to the periodicity inherent both in coding sequence construction and musical composition, a way was sought to interconvert the two.

The solution that we arrived at is to assign a space and a line on the octave scale to each base in the ascending order of A, G, T, C, in such a way that the classical middle-C position is occupied by C on the line, with A in the space occupying the position immediately above [6]. In Fig. 3, the treble-clef musical score of Prelude No. 1 from the Well-Tempered Clavichord by J. S. Bach, the great master of the early Baroque, is accompanied by the base sequence transcribed from it according to the rule stated above. It will be noted that with regard to every 4/4th or 8/8th time signature unit, the second half is the exact repeat of the first half. Furthermore, until the 3rd line, each half is repeats of four notes, the four-note subunit consisting of one 3/16th note and three 1/16th notes followed by one 1/4th note and four 1/16th notes. Translated to base sequence, the first time signature unit is comprised of four exact copies of the AGCA tetramer followed by four copies of ATCA, a single-base substituted deviant of the above-noted tetramer. The AGCA recurs again 8 times. Since 4 is not a multiple of three, these tetrameric repeats are capable of giving tetrapeptidic periodicity to a polypeptide chain, but alas, the chain terminators TAA and TAG come in pairs at the extreme right of the 2nd line. From the 4th line onward, one 3/16th note and a quarter note are relegated to the bass clef; therefore, the treble-clef score becomes trimeric repeats. When translated, this portion yields polyserine interspersed with tetraisoleucine and tetraarginine. In general, I found musical compositions of the early Baroque period to be repeats of short base oligomers, these oligomers being single-base substituted variants of each other. Indeed, their resemblance to what I conceived as the first set of coding sequences at the very beginning of life on this earth is uncanny (see Fig. 2).

Fig. 3. An initial part of the treble-clef musical score of Prelude No. 1 from the Well-Tempered Clavichord by J. S. Bach, accompanied by the base sequence and the amino acid sequence transcribable from that base sequence

Most of the coding sequences possessed by modern organisms have endured for hundreds of millions of years; in the case of those for sugar-metabolizing enzymes, for 2 billion years or more, as already noted. Thus, their original periodicities are obvious only to discerning eyes. Not surprisingly, musical compositions of the late Romantic period resemble these coding sequences. We have previously shown that Frederic Chopin's Nocturne Opus 55, No. 1, resembles the last exon for the largest subunit of RNA polymerase II [6]. In Fig. 4, the musical transformation for violin of the most functionally critical part of the tyrosine kinase domain of the human insulin receptor β-chain [8] is shown. This segment includes the two active site segments most critical for the assigned function of tyrosine kinase. Amino acid residues of these two active site oligopeptides are shown in large capital letters. It will be noted that nearly all of the second active site is encoded by tandem repeats of the dodecamer GTGGTCCTTTGG, thickly underlined by solid bars (2nd from the last line of Fig. 4). Its two truncated derivatives at the top line of Fig. 4 are also underlined by solid bars. Other, more musically pertinent repeats are underlined by open bars and shaded bars; e.g., the hexamer TCCCTG in the 3rd and 4th lines of Fig. 4.

Fig. 4. The heart of the coding segment for the tyrosine kinase domain of the human insulin receptor β-chain [8]. Amino acid residues of the two active site segments are shown in large capital letters. This musical transformation for violin of the coding segment is in E minor, 4/4th or 8/8th time signature
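The base-to-note rule stated earlier in this section can be read as follows: middle C (do) stands for the base C, and each base in the ascending order A, G, T then takes the next space-and-line pair, so that D and E stand for A, F and G for G, and A and B for T, with the C of the next octave again standing for C. The Python sketch below encodes that reading. It is one interpretation of the one-sentence rule and not necessarily the exact procedure of [6]; the function names and the note spelling (pitch letter followed by an octave number) are hypothetical conveniences, and accidentals are simply ignored.

```python
# One reading of the rule: pitch letter -> base, the same in every octave.
NOTE_TO_BASE = {
    "C": "C",
    "D": "A", "E": "A",
    "F": "G", "G": "G",
    "A": "T", "B": "T",
}

# The inverse direction leaves a choice of pitch for A, G, and T.
BASE_TO_NOTES = {
    "C": ("C",),
    "A": ("D", "E"),
    "G": ("F", "G"),
    "T": ("A", "B"),
}


def note_to_base(note):
    """Map a note such as 'E4' to a base using only its pitch letter."""
    return NOTE_TO_BASE[note[0].upper()]


def notes_to_sequence(notes):
    """Transcribe a list of notes into a base sequence."""
    return "".join(note_to_base(note) for note in notes)


def pitch_choices(sequence):
    """List, for each base, the pitch letters that could render it."""
    return [BASE_TO_NOTES[base] for base in sequence.upper()]


# One ascending octave from middle C, spelled with octave numbers.
scale = ["C4", "D4", "E4", "F4", "G4", "A4", "B4", "C5"]
print(notes_to_sequence(scale))
print(pitch_choices("AGCA"))
```

Under this reading the transcription from score to base sequence is unambiguous, whereas rendering a coding sequence as music leaves a choice between two pitches for three of the four bases, which is presumably where musical judgment enters.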

E. Summary

In prebiotic nucleic acid replication, templates appear to have been in short supply. A single round of tandem duplication of existing oligomers assured progressive extension of templates to a length adequate for the encoding of polypeptide chains. Thus, the first set of coding sequences had to be repeats of base oligomers encoding polypeptide chains of various periodicities. On the one hand, the readiness of these periodical polypeptide chains to assume α-helical and/or β-sheet secondary structures contributed to the extremely rapid initial functional diversification of these polypeptide chains. It will be recalled that most, if not all, of the sugar-metabolizing enzymes had already achieved inviolable functional competence before the division of prokaryotes from eukaryotes. On the other hand, a certain one (dipeptidic?) of the peptidic periodicities was apparently chosen as the timekeeping unit by the biological clock. Musical compositions, too, apparently evolved originally as a timekeeping device. Accordingly, repetitiousness is evident in all musical compositions. The evolution of musical compositions from the early Baroque to the late Romantic parallels that of coding sequences from rather exact repeats of base oligomers to more complex modern coding sequences in which repetitious elements are less conspicuous and more varied. Inasmuch as the earth is governed by a hierarchy of periodicities (days, months, and years), such reliance on periodicities is rather to be expected.


References

1. Ohno S (1970) Evolution by gene duplication. Springer-Verlag, Berlin Heidelberg New York
2. Dayhoff MO (ed) (1972) Atlas of protein sequence and structure. National Biomedical Research Foundation, Silver Spring, Maryland
3. Rossmann MG (1981) Evolution of glycolytic enzymes. Philos Trans R Soc Lond [Biol] B293:191-203
4. Bridson PK, Orgel LE (1980) Catalysis of accurate poly(C)-directed synthesis of 3'-5'-linked oligoguanylates by Zn2+. J Mol Biol 144:567-577
5. Gojobori T, Yokoyama S (1985) Rates of evolution of the retroviral oncogene of Moloney murine sarcoma virus and of its cellular homologues. Proc Natl Acad Sci USA 82:4198-4201
6. Ohno S, Ohno M (1985) The all-pervasive principle of repetitious recurrence governs not only coding sequence construction but also human endeavor in musical composition. J Immunogenet 24:71-78
7. Shin H-S, Bargiello TA, Clark BT, Jackson FR, Young MW (1985) An unusual coding sequence from a Drosophila clock gene is conserved in vertebrates. Nature 317:445-451
8. Ullrich A, Bell JR, Chen EY, Herrera R, Petruzzelli LM, Dull TJ, Gray A, Coussens L, Liao Y-C, Tsubokawa M, Mason A, Seeburg PH, Grunfeld C, Rosen OM, Ramachandran J (1985) Human insulin receptor and its relationship to the tyrosine kinase family of oncogenes. Nature 313:756-761