OF WORDS, GENES AND MUSIC

Susumu Ohno NATO ASI Series, Vol H23 Springer-Verlag Berlin Heidelberg

Beckman Research Institute of The City of Hope 1450 East Duarte Road Duarte, CA 91010

I pay homage to Umberto Eco, who attended this meeting, by quoting the very first sentence of the Prologue to his widely acclaimed book "THE NAME OF THE ROSE" in English translation: "In the beginning was the Word and the Word was with God and the Word was God". It would be noted that in this 17-word-long sentence, the word "Word" recurs thrice and "God" twice; all together, recurring words occupy half of the sentence. Indeed, the essence of any good writing appears to depend upon the recurrence, at more or less regular intervals, of the same or similar-sounding words, which give a lyrical quality to it. Julius Caesar's announcement to the Roman senate of his victory at Zela (47 B.C.) survives to this day only because all three word-combinations uttered in succession began with v and ended in i: "ve'ni, vi'di, vi'ci". This, then, is the extreme of recurrence with no interval at all.

In immunology, the antigen-specific cooperation between helper T-cells and antibody-producing B-cells appears again to depend upon recurrence of the same word (the same signal). As schematically illustrated in Figure 1, a macrophage phagocytizes antigen "A" and digests it to a number of small peptide fragments. Of those digested fragments, an amphipathic alpha-helical fragment is preferentially chosen and presented to the outside world by the antigen-presenting macrophage in conjunction with the self class II MHC antigen (extreme left of Figure 1). A clone of T-cells which happens to possess the membrane-bound receptor that fits this complex of antigen "A" alpha-helical fragment and self class II MHC antigen now becomes antigen-specific helper T-cells (middle of Figure 1). But how can this clone of T-cells selectively recognize and help not a single clone but a number of clones of B-cells equipped with membrane-bound anti-antigen "A" antibodies of IgM and IgD types? For it would be recalled that antibodies are constructed to recognize antigens per se, not complexed with self class II MHC antigen. Furthermore, each polypeptide antigen usually presents not one but a number of antigenic determinants. Accordingly, there ought to be, and are, several different clones of B-cells, their antibodies being directed against different antigenic determinants of antigen "A". Nevertheless, having been endowed with membrane-bound anti-antigen "A" antibodies, all these different clones of B-cells would manage to concentrate antigen "A" on their cell surface. These antigen-antibody complexes shall be lysostripped off the plasma membrane and digested inside B-cells. As the digestive enzyme involved is of the same sort as that present in the macrophage, antigen "A" is digested to the same variety of peptide fragments, and the same amphipathic alpha-helical fragment is chosen and presented by B-cells to the outside world in conjunction with self class II MHC antigen. It is this complex which the receptor of helper T-cells recognizes, thus resulting in antigen-specific help to expand all clones of anti-antigen "A" B-cells (extreme right of Figure 1).

Figure 1. How helper T cells are able to provide antigen-specific help to multiple clones of B cells is schematically illustrated. This help is based upon antigen-presenting macrophages and antigen-specific clones of B cells uttering the same word, which is perceived as such by the membrane-bound receptor of helper T cells. At the extreme left, the specific antigen (antigen "A") is depicted as a polypeptide chain comprised of alternating alpha-helical and β-sheet forming segments; of those, one amphipathic alpha-helical segment is shown as a black barrel. A macrophage, at the left, phagocytizes antigen "A", and in a specific intracellular locality, but not in a lysosome, antigen "A" is digested by a particular protease to several peptidic fragments. One amphipathic alpha-helical fragment is then preferentially chosen and presented to the outside world, complexed with self class II MHC antigen. A clone of T cells equipped with the receptor that fits this complex presented by the antigen-presenting macrophage now becomes anti-antigen "A" specific helper T cells, as shown in the middle. On the other hand, the still membrane-bound antibodies of dormant anti-antigen "A" B cells shown at the right recognize any of the antigenic determinants present on antigen "A", but never a complex formed between self class II MHC and the antigen "A" amphipathic alpha-helical fragment. Yet they can be recipients of the help from anti-antigen "A" specific T cells, because a complex formed between antigen "A" and specific membrane-bound antibody is taken inside B cells by pinocytosis, and subsequently antigen "A" is digested by the same protease as is present in macrophages. Of the digested fragments, the same amphipathic alpha-helical fragment is preferentially chosen and presented by B cells to the outside world complexed with self class II MHC antigen.
This enables anti-antigen "A" specific helper T cells to see the same word on the plasma membrane of anti-antigen "A" specific B cells as they had seen on the antigen-presenting macrophage plasma membrane; hence antigen-specific help to cause clonal expansion of, and antibody secretion by, anti-antigen "A" specific B cells.

All in all, it would thus appear that the antigen-specific cooperation between T-cells and B-cells is based upon one principle: when confronted with the same sentence (antigen "A"), both the antigen-presenting macrophage and anti-antigen "A" B-cells choose the same word (a particular amphipathic alpha-helical segment) out of that sentence and crown it with the same adjective (self class II MHC antigen).

COMPLEMENTARY RECOGNITION IS BUT A FORM OF THE HOMOLOGOUS RECOGNITION

In immunology, one often speaks of specific antigen-antibody interactions as examples of recognition based upon the complementarity between two components. Nevertheless, the fact is that all components of the adaptive immune system are composed of strings of repeating units ultimately derived from a common ancestral unit. This unit, commonly referred to as the β2-microglobulin-like domain, is made of 90 to 100 mostly hydrophobic amino acid residues, the relative abundance of hydrophilic Ser and/or Thr also being a conspicuous feature. These residues are folded into three to five loops of anti-parallel β-sheet forming segments. Contacts between neighboring β-sheet forming segments are maintained through hydrogen bonds mostly formed between Thr-Thr, Thr-Ser or Ser-Ser, and the whole structure is compacted by the presence of one intradomain disulfide bridge. It now appears that the immediate ancestor of genes for the adaptive immune system was a CAM (cell adhesion molecule) gene engaged in organogenesis of early embryos. In the extracellular portion of N-CAM, specific for neuronal organization, four successive β2-microglobulin-like domains were found (Hemperly et al., 1986). Through these domains, N-CAM engages in homologous recognition, thus aggregating similar neuronal cells, the first step in neuronal organization. It is fitting that all components of the adaptive immune system evolved from CAM, the original mediator of cell-cell interaction. The point to be made here is that the β2-microglobulin-like domain originally evolved to engage in homologous, not complementary, recognition. Accordingly, recognition of class I and class II MHC antigens by T-cell receptors and B-cell antibodies, as well as that of idiotypes by other T-cell receptors and B-cell antibodies, is homologous recognition sensu stricto, the notion of complementary recognition being more of an illusion than reality.

SEARCH FOR THE ULTIMATE ANCESTOR

Implicit in the above stated notion that the immediate ancestor of various components of the adaptive immune system was one of the cell adhesion molecules (CAM) involved in the initial stage of embryonic organogenesis is the assumption that the four β2-microglobulin-like domains of N-CAM arose in situ. Were they borrowed from other molecules (even from immunoglobulins themselves) by the so-called domain exchange, the whole notion of CAM being the immediate ancestor of various components of the adaptive immune system becomes ridiculous. Fortunately, it looks as though these β2-microglobulin-like domains of N-CAM indeed evolved in situ, for there is a noticeable similarity in construction between these β2-microglobulin-like domains and other parts of N-CAM. Each of these β2-microglobulin-like domains contains three absolutely invariant residues: 1) Cys in the 12th position, 2) Trp in the 24th position, and 3) Cys in the 62nd position. These three invariant residues tend to be included in Thr-X, Thr-X dipeptidic repeats. This is illustrated at the top of Figure 2 on the 3rd of the four successive β2-microglobulin-like domains, for the 3rd is the only complete domain, the other three sustaining deletions of three to six residues (Hemperly et al., 1986). The four successive β2-microglobulin-like domains comprise but 40% of the N-CAM polypeptide chain. The 362-residue-long carboxyl terminal domain remaining within the cell is constructed in a simpler mode, thus suggesting that this segment remained close to the original design of the entire CAM polypeptide chain. As also shown at the top of Figure 2, the 699th to 728th residues of N-CAM are essentially made of Thr-X, Thr-X dipeptide repeats. Thus, it is conceivable that the entire coding sequence for the ancestral CAM was simple repeats of something like A C T C C A A, β2-microglobulin-like domains too evolving from parts of it.
Three consecutive copies of such a heptamer, 21 bases in total length, would have given the heptapeptidic periodicity to the original peptide chain as shown below:
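The heptapeptidic periodicity can be checked with a minimal sketch. It assumes that the heptamer reads A C T C C A A as printed above and that the standard genetic code applies; the function and constant names are illustrative only and do not come from the original work.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

HEPTAMER = "ACTCCAA"  # assumed reading of the heptamer quoted in the text


def translate(dna):
    """Translate a DNA string in frame 0, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def peptide_of_repeats(unit, copies):
    """Peptide (one-letter code) encoded by tandem copies of a base oligomer."""
    return translate(unit * copies)


if __name__ == "__main__":
    # 3 copies = 21 bases = 7 codons: one full heptapeptide period.
    print(peptide_of_repeats(HEPTAMER, 3))
    # 6 copies repeat the same heptapeptide twice.
    print(peptide_of_repeats(HEPTAMER, 6))
```

Because seven is not a multiple of three, the reading frame shifts by one base with each copy of the unit and returns to its starting frame only after three copies, which is why the encoded peptide repeats every seven residues.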

Two base substitutions affecting the above noted periodicity unit would have produced three consecutive Thr-X dipeptides as shown below; the two substituted bases are underlined:

At the top of Figure 2, the A C T C portion of this hypothetical heptameric unit and its single base substituted deviants are solidly underlined. Is there any validity to the above noted proposal as to the ultimate origin of CAM coding sequences? The first CAM must have come into being when the first multicellular eukaryote evolved from unicellular eukaryotes. Slime molds of the genus Dictyostelium indeed occupy a unique position of being an intermediate between unicellular and multicellular eukaryotes, for these organisms in nutrient-rich environments live as unicellular

Figure 2. The indication of propinquity of descent between the chicken N-CAM at the top (Hemperly et al., 1986), csA CAM of the slime mold (Dictyostelium discoideum) in the middle (Noegel et al., 1986) and the mouse transcript of primordial T A T C, T G T C repeats at the bottom (Ohno and Epplen, 1983). At the top, internal homology within N-CAM between β2-microglobulin-like domains and the apparently more ancient intracellular domain is indicated. Within each β2-microglobulin-like domain, the three most invariant residues are a pair of cysteines for the intradomain disulfide bridge (12th and 62nd positions) and TRP at the 24th position. As exemplified in the 3rd β2-microglobulin-like domain of the chicken N-CAM, THR-X, THR-X dipeptidic repeats invariably occur in the vicinity of these three most invariant residues. THR-X, THR-X dipeptidic repeats are an even more prominent feature of the intracellular domain. The principal tetramer A C T C and its single base substituted deviants are solidly underlined. One T A T C primordial tetramer is identified by a shaded bar. Although not identified, both T G T G and T G A C tetramers recur twice each in the six short coding segments of the chicken N-CAM shown at the top. Both T G T G and T G A C are single base substituted deviants of T G T C. In the middle, four coding segments of the slime mold csA CAM which essentially encode THR-X, THR-X dipeptidic repeats are shown. A C T C and its single base substituted deviants are again identified by solid bars. The 30-base-long tandem repeats are noteworthy: the 441st to 450th codons differ from the 451st to 460th codons by a single base. At the bottom, a portion of the mouse primordial transcript, which is mixed repeats of T A T C and its single base substituted deviant T G T C, is shown as the ultimate ancestor of CAM coding sequences.

amoeboid creatures. When surroundings become unfavorable, however, they begin to aggregate with each other to form the stalk and fruiting body, much in the manner of fungal species that include various mushrooms. This aggregation is induced by cyclic AMP and mediated through csA CAM, and the 494-residue-long amino acid sequence of Dictyostelium discoideum csA CAM has recently been deduced from the cDNA base sequence (Noegel et al., 1986). Indeed, it appears as though this primordial CAM has evolved from Thr-X, Thr-X dipeptidic repeats, as shown in the middle of Figure 2. Particularly noteworthy is the coding segment encoding the 431st to 460th residues, for it is made of three consecutive copies of the 30-base-long unit. It would be noted that the 2nd and 3rd copies differ from each other only by a single base substitution, while 11 base substitutions separate the 1st from the 2nd. The already noted tetrameric unit A C T C and its single base substituted deviants are again very prominent in the csA CAM coding sequence. However, it appears that this is a derived oligomeric unit and not the original repeating unit. The AT/GC ratio of the A C T C tetramer is 50/50, but the csA coding sequence is quite unusual in that 62.6% of the sequence is A and T. The original repeating unit of the ultimate ancestor of CAM coding sequences had to contain considerably more A and T than G and C. Thus, we come to the tetrameric repeat coding sequence of the mouse which we previously reported as one of the few ultimate ancestors of all coding sequences (Ohno and Epplen, 1983). This primordial coding sequence is mixed repeats of two tetrameric units, T A T C and its single base substituted deviant T G T C, as shown at the bottom of Figure 2. The even representation of the two tetramers gives the primordial coding sequence an AT/GC ratio of 62.5/37.5. It would be recalled that this is almost exactly the ratio found in csA CAM of the slime mold.
Indeed, overall, T A T C, T G T C and their single base substituted deviants are as prominent as A C T C and its single base substituted deviants in the csA CAM as well as the N-CAM coding sequences. However, the latter gains prominence in segments dominated by Thr-X, Thr-X dipeptidic repeats, which are preferentially shown at the top and middle of Figure 2. It would also be noted that each of the pair of invariant CYS of each β2-microglobulin-like domain of N-CAM is invariably encoded by a part of the T G T G tetramer, which is a single base substituted deviant of T G T C, as shown at the top of Figure 2. This applies to invariant Cys in components of the adaptive immune system as well. Thus, we have traced the ultimate ancestor of various CAM genes engaged in cell-cell recognition of early embryonic organogenesis, as well as of genes for various components of the adaptive immune system, to mixed repeats of two base tetramers, T A T C and T G T C.
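The base-composition arithmetic and the notion of a "single base substituted deviant" used above can be verified with a short sketch. The helper names are hypothetical, and only the tetramers quoted in the text are used as inputs.

```python
def at_percent(seq):
    """Percentage of A and T bases in a sequence (spaces are ignored)."""
    seq = seq.replace(" ", "").upper()
    return 100.0 * sum(base in "AT" for base in seq) / len(seq)


def substitutions(a, b):
    """Number of positions at which two equal-length oligomers differ."""
    a, b = a.replace(" ", ""), b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("oligomers must have the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(at_percent("ACTC"))             # derived tetramer: 50% A+T
    print(at_percent("TATC" + "TGTC"))    # even mix of the two primordial tetramers
    for deviant in ("TATC", "TGTG", "TGAC"):
        print(deviant, substitutions(deviant, "TGTC"))
```

An even mix of T A T C and T G T C contains five A or T bases out of eight, which is the 62.5/37.5 ratio quoted above, and each of T A T C, T G T G and T G A C differs from T G T C at exactly one position.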

THE PRINCIPLE OF RECURRING UNITS IN CONSTRUCTION OF CODING SEQUENCES,
LANGUAGES AND MUSICAL COMPOSITIONS

In our galaxy and others, stars have been formed and are still being formed by gravitational condensation of molecular clouds that contain large quantities of molecular hydrogen, water, ammonia, carbon monoxide, methyl alcohol, hydrocyanic acid and others. When the earth was formed some 4.5 billion years ago, the primeval atmosphere surrounding it must have also contained the chemically reducing compounds noted above (Hoyle, 1979; Dyson, 1985). In the classical experiment of Miller in 1953, electric sparks passed through a mixture of methane, ammonia, molecular hydrogen and water yielded large fractions of amino acids, notably alanine at a 2% yield. Oro in 1960, on the other hand, prepared a concentrated solution of ammonium cyanide in water. After a period, he found spontaneous conversion of ammonium cyanide to adenine with a 0.5% yield (Miller and Orgel, 1974). Thus, it might be said that the yielding of various building materials of life was and is inherent in the composition of molecular clouds. What is life but a form that reproduces near exact replicas of itself? Thus, we owe our lives to the inherent complementarity that exists between the two purine-pyrimidine pairs of bases. Adenine pairs with uracil or thymine, while guanine forms hydrogen bonds with cytosine. Accordingly, when the two complementary strands of a double-stranded nucleic acid fall asunder, each can form its complementary strand. In this way, nucleic acids are inherently designed to perpetuate their base sequences.
Inasmuch as the copying of the template, that is to say the building of a new single-stranded RNA complementary to the preexisting single-stranded RNA, is based upon the above noted inherent complementarity between A and U as well as G and C, this could have taken place in the prebiotic world, for if provided with a template 60 to 100 bases long, ATP, GTP, UTP and CTP would align themselves in the proper 3'-5' linkage to form a complementary strand in the presence of Zn++ metal ion alone (Bridson and Orgel, 1980). The major obstacle in the prebiotic world against spontaneous generation of the first cell on this earth was thus the formation of long enough templates directly from ATP, GTP, UTP and CTP, for even in the presence of imidazole and Zn++, autopolymerization of nucleotide triphosphates yields only base hexamers to decamers. It follows then that unless these base oligomers were endowed with the inherent property of self-elongation, long enough templates would not have come into being to start life on this earth. What if a given base octamer was repeats of a base tetramer such as the T A T C already noted? This octamer and its complementary strand formed after the first round of copying may have reannealed unequally, the first copy to the second copy, after falling asunder as illustrated below:

T A T C T A T C

        A T A G A T A G
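The unequal reannealing illustrated above, and the lengthening it permits, can be sketched as follows. This is a simplified model under stated assumptions: the two strands are written left to right in the same register, strand polarity and enzymology are ignored, and the function name and the shift parameter are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Base-by-base complement, written in the same left-to-right register."""
    return "".join(COMPLEMENT[base] for base in strand)


def slipped_elongation(top, shift):
    """Reanneal a strand with its complement displaced by `shift` bases,
    then fill in the overhang; return the lengthened top strand.

    Raises ValueError if the displaced strands cannot pair in the overlap,
    which is the case unless the sequence repeats with period `shift`.
    """
    bottom = complement(top)
    overlap = len(top) - shift
    if overlap <= 0 or complement(top[shift:]) != bottom[:overlap]:
        raise ValueError("strands do not pair when displaced by this shift")
    # The unpaired end of the bottom strand templates the extension.
    return top + complement(bottom[overlap:])


if __name__ == "__main__":
    octamer = "TATC" * 2
    dodecamer = slipped_elongation(octamer, 4)
    print(octamer, "->", dodecamer, len(dodecamer))
```

A sequence without internal repeats, such as T A T C G G C A, fails the pairing check when displaced by four bases, which illustrates why self-elongation is a property of oligomer repeats in particular.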

The hydrogen-bonded paired portion would have served as a primer for the next round of copying (replication), and after this round, the octameric template would have elongated itself to a dodecameric template. Indeed, self-elongation is inherent in repeats of base oligomers (prebiotic nucleic acids were RNA rather than DNA, thus the two T's of T A T C should have been substituted by U's, but for the sake of continuity, U A U C is shown as T A T C). This, then, is one of the many reasons for believing that the first set of coding sequences that emerged at the very beginning of life on this earth were all repeats of base oligomers (Ohno and Epplen, 1983). Indeed, we have already seen that mixed repeats of T A T C and its single base substituted deviant T G T C appear to have served as the ultimate ancestor of one superfamily of genes: first, various CAMs for general cell-cell recognition during the initial stage of organogenesis of all multicellular eukaryotes and, through them, various components of the adaptive immune system unique to vertebrates. It would be noted that such tetrameric repeats resemble in construction Julius Caesar's remark already cited. If vi'di in the middle is considered as T A T C, then ve'ni preceding it becomes its two base substituted copy such as T G A C, while vi'ci following it becomes its single base substituted copy such as T G C C. Such tetrameric repeats also resemble musical compositions of the Baroque period. As an example, the treble clef musical score of Prelude No. 1 for Well-Tempered Clavichord by Johann Sebastian Bach (1685-1750) is shown in Figure 3. It would be noted that the initial part of this treble clef score in C major (the top 2 and 2/3rd lines of Figure 3) is essentially four-note repeats, the second half of each 8/8th time signature segment being the exact copy of the first half.
Each half of the time signature segment is comprised of two sets of the identical four notes, the 4th note of the 1st set overlapping with the 1st note of the 2nd set. From the last one-third of the 3rd line of Figure 3 downward, the theme changes to three-note repeats. This is because the 1st note of each previous four-note unit is now relegated to the bass clef score. Such striking resemblance between Baroque musical compositions and primordial coding sequences that are repeats of base oligomers tempted us to devise one invariant rule by which treble clef scores of musical compositions and coding base sequences become interchangeable. After considering their respective molecular weights and complementarity, we have decided to assign a space and a line above it of the treble clef staff to each of the four bases in the ascending order of A G T C, C on the line of the previous scale occupying the classical middle C position (Ohno and Ohno, 1986). This assignment of bases to the treble clef staff afforded a needed freedom in transmutating coding base sequences to treble clef musical scores. This freedom is analogous to that accorded to coding sequences by the redundancy of

Figure 3. The treble clef score of an initial portion of J.S. Bach's Prelude No. 1 from Well-Tempered Clavichord is shown, accompanied by a base sequence transcribed from it according to the previously devised invariant rule (Ohno and Ohno, 1986). The initial tetrameric repeat portion should have encoded a polypeptide chain of tetrapeptidic periodicities, except for an unfortunate concentration of chain terminators T A A and T A G at the extreme right of the 2nd line. Subsequently, the treble clef score becomes trimeric repeats monotonously encoding homoserines occasionally interspersed by stretches of homoisoleucines and homoarginines.
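The invariant rule referred to in this caption can be sketched in code under one reading of its description: middle C carries the base C, and each successive space together with the line above it carries A, G, T and C in turn, the cycle then repeating. The pitch encoding, the function names and the example notes are assumptions made for illustration and are not taken from Ohno and Ohno (1986).

```python
LETTERS = "CDEFGAB"   # diatonic note names within one octave
BASE_CYCLE = "CAGT"   # pair 0 (ending on middle C) = C, then A, G, T, C, ...


def staff_step(letter, octave):
    """Diatonic steps above middle C (C4 = 0). Even steps lie on lines
    (middle C sits on a ledger line); odd steps lie in spaces."""
    return 7 * (octave - 4) + LETTERS.index(letter.upper())


def note_to_base(letter, octave):
    """Base assigned to a staff position: a space and the line directly
    above it share one base (assumed reading of the rule)."""
    pair = (staff_step(letter, octave) + 1) // 2
    return BASE_CYCLE[pair % 4]


def notes_to_bases(notes):
    """Transcribe a sequence of (letter, octave) notes into a base string."""
    return "".join(note_to_base(letter, octave) for letter, octave in notes)


if __name__ == "__main__":
    # Hypothetical four-note units, chosen because each note of the second
    # lies one step away from the corresponding note of the first.
    first_unit = [("E", 4), ("G", 4), ("C", 5), ("E", 5)]
    second_unit = [("D", 4), ("A", 4), ("D", 5), ("F", 5)]
    print(notes_to_bases(first_unit), notes_to_bases(second_unit))
```

Under this reading, a step from a line to the space above it, or from a space to the line below it, crosses into a neighboring pair and so changes the base, whereas a step from a space to the line above it, or from a line to the space below it, stays within one pair and leaves the base unchanged.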

Figure 4, Part I

Figure 4, Part II

Figure 4, Part III

base tetramer A G C A, the last A of the first unit overlapping with the 1st A of the 2nd unit, and the same with regard to the 3rd and 4th units. This (A G C A) x 4 recurs as the 4th segment. The 2nd segment, on the other hand, appears as four repeats of A T C A, A T C A being a single base substituted deviant of the previous A G C A. However, it would be noted that all four notes of the first segment unit changed a step each in the second segment unit; in musical notation, from e g b d to d h c f. In our devised rule, however, only two of the four possible single step changes can be detected as base substitutions: from a position on the line to a space above, as well as from a position in the space to a line below. The two other single step changes, from a position on the line to a space below as well as from a position in the space to a line above, are perceived as synonymous. In compliance with this rule, a portion of the primordial T A T C, T G T C repeats corresponding to the 91st to 144th codons in its longest open reading frame (Ohno and Epplen, 1983) has been transmutated to the musical score in A minor and 8/8th time signature shown as Part I of Figure 4. This is to be regarded as a prelude, for as Part II of Figure 4, the transmutation in C major, and again in 8/8th time signature, of the 431st to 464th codons of the slime mold csA CAM coding sequence is shown. As shown in the middle of Figure 2, this portion of the csA CAM coding sequence is comprised of three copies of the 30-base-long unit. The unit itself, however, apparently arose as repeats of shorter oligomers. One such tetramer, A C T C, and its single base substituted deviants are identified by solid bars. This evolutionary trilogy ends in Part III of Figure 4, which celebrates the birth of original β2-microglobulin-like domains in N-CAM-like cell adhesion proteins.
The initial one-third of the coding sequence for the 3rd β2-microglobulin-like domain of the chicken N-CAM (Hemperly et al., 1986) has been transmutated to the treble clef musical score of Part III. Accordingly, Part III contains the first CYS for the invariably present intradomain disulfide bridge (in the middle of the 2nd line of Figure 4, Part III) as well as the equally invariant TRP seen at the extreme right of the 3rd line of Figure 4, Part III. Both CYS and TRP noted above are parts of THR-X, THR-X dipeptidic units. Even though the coding segment depicted in Part III is comprised of only 108 bases, there are still base oligomers recurring within. Tandem repeats of the pentamer G A T C A are seen at the extreme right and extreme left of the 1st line, and the base octamer C T T C C A T C encoding 210th to 213th PRO-SER-ILE (in the middle of the 3rd line of Part III) is a single base deviant of C T T C C A C C encoding 217th to 219th THR-SER-THR, seen straddling the 3rd and 4th lines of Part III. These recurring base oligomers still provide a melodious quality to aged coding sequences a billion or more years removed from their ultimately ancestral oligomeric repeats.

SUMMARY

Common denominators in all our cognitive processes are recurring elements. For example, the first step in deciphering ancient writings left on excavated tablets of a long lost civilization would be to identify the most frequently recurring set of symbols, for such a set likely represents the main subject with which those writings were concerned, be it a king of a particular dynasty or a taxable unit of land. Similarly, our vision perceives a pattern as a pattern only if it is repeated, and a melody becomes a melody only when it is repeated. The same applies to all components of the adaptive immune system.
All together they form a cognitive pattern because they were all derived from the ancestral β2-microglobulin-like unit, which probably arose in cell adhesion molecules (CAM); those plasma membrane proteins, through homologous recognition, contributed and are still contributing to the initial stage in organogenesis of all multicellular eukaryotes. This reliance of our biological system on repetitions appears to have started at the very beginning of life on this earth, for the first set of coding sequences were likely to have been repeats of base oligomers. I have composed a musical trilogy to celebrate the birth of original β2-microglobulin-like domains in CAM-like molecules. Part I represents the ultimately ancestral mixed repeats of T A T C and its single base substituted deviant T G T C. Part II depicts a portion of the csA CAM coding sequence of the slime mold (a link between unicellular and multicellular eukaryotes) which encodes THR-X, THR-X dipeptidic repeats. Finally, Part III represents the 3rd β2-microglobulin-like domain of the chicken N-CAM. On one hand, this symbolizes the immediate ancestor of all the components of the adaptive immune system. On the other hand, it is linked to the past through recurring THR-X, THR-X dipeptidic repeats.

REFERENCES

1. Bridson, P.K. and Orgel, L.E. (1980) Catalysis of accurate poly(C)-directed synthesis of 3'-5'-linked oligoguanylates by Zn2+. J. Mol. Biol. 144:567-577.
2. Dyson, F. (1985) Origins of life. Cambridge Univ. Press, Cambridge, London.
3. Hemperly, J.J., Murray, B.A., Edelman, G.M. and Cunningham, B.A. (1986) Sequence of a cDNA clone encoding the polysialic acid-rich and cytoplasmic domains of the neural cell adhesion molecule N-CAM. Proc. Natl. Acad. Sci. USA 83:3037-3041.
4. Hoyle, F. (1979) Ten faces of the universe. Freeman Press, London.
5. Miller, S.L. and Orgel, L.E. (1974) The origin of life on the earth. Prentice-Hall, New York.
6. Noegel, A., Gerisch, G., Stadler, J. and Westphal, M. (1986) Complete sequence and transcript regulation of a cell adhesion protein from aggregating Dictyostelium cells. EMBO J. 5:1473-1476.
7. Ohno, S. and Epplen, J. (1983) The primitive code and repeats of base oligomers as the primordial protein-encoding sequence. Proc. Natl. Acad. Sci. USA 80:3391-3395.
8. Ohno, S. and Ohno, M. (1986) The all pervasive principle of repetitious recurrence governs not only coding sequence construction but also human endeavor in musical composition. Immunogenetics 24:71-78.