From Context-Dependence of Mutations to Molecular Mechanisms of Mutagenesis
I.B. Rogozin, B.A. Malyarchuk, Y.I. Pavlov, and L. Milanesi
Pacific Symposium on Biocomputing 10:409-420 (2005)

FROM CONTEXT-DEPENDENCE OF MUTATIONS TO MOLECULAR MECHANISMS OF MUTAGENESIS

IGOR B. ROGOZIN
National Center for Biotechnology Information NLM, National Institutes of Health, Bethesda, MD 20894, USA; rogozin@ncbi.nlm.nih.gov

BORIS A. MALYARCHUK
Institute of Biological Problems of the North, Far-East Branch of the Russian Academy of Sciences, Magadan 685000, Russia; malyar@ibpn.kolyma.ru

YOURI I. PAVLOV
University of Nebraska Medical Center, Omaha, NE 68198, USA; ypavlov@unmc.edu

LUCIANO MILANESI
Institute of Biomedical Technologies CNR, Milano 20090, Italy; luciano.milanesi@itb.cnr.it

Mutation frequencies vary significantly along nucleotide sequences, such that mutations often concentrate at certain positions called hotspots. Mutation hotspots in DNA reflect intrinsic properties of the mutation process, such as sequence specificity, which manifests itself at the level of interaction between mutagens, DNA, and the action of the repair and replication machineries. The nucleotide sequence context of mutational hotspots is a fingerprint of interactions between DNA and repair/replication/modification enzymes, and the analysis of hotspot context provides evidence of such interactions. The hotspots might also reflect structural and functional features of the respective DNA sequences and provide information about natural selection. We discuss the analysis of 8-oxoguanine-induced mutations in pro- and eukaryotic genes, polymorphic positions in human mitochondrial DNA, and mutations in the HIV-1 retrovirus. Comparative analysis of 8-oxoguanine-induced mutations and spontaneous mutation spectra suggested that a substantial fraction of spontaneous A·T→C·G mutations is caused by 8-oxoGTP in nucleotide pools.
In the case of human mitochondrial DNA, significant differences were detected between the molecular mechanisms of mutation in the hypervariable segments and in the coding part of the DNA. Analysis of mutations in the HIV-1 retrovirus suggested a complex interplay between molecular mechanisms of mutagenesis and natural selection.

Author footnote: Adjunct research scientist at the Institute of Cytology and Genetics RAS, Novosibirsk, Russia.

1. Mutation spectra and mutation hotspots

Genetic variation is a necessary prerequisite of evolution. A genome is replicated at a level of fidelity that is "determined" by deep evolutionary forces, by the life history it has adopted, and by accidents of its evolutionary history [1]. The mechanisms of spontaneous and induced mutagenesis are complex, and much research is devoted to understanding these mechanisms and the factors that alter the mutation rate. Mutation spectra (distributions of mutations along the nucleotide sequence of a target gene) are frequently used for such studies (Figure 1) [2,3]. Mutations in target sequences are usually revealed either by phenotypic selection in experimental test systems or, in the case of disease-causing genes in humans, by clinical studies in which certain genes are sequenced in groups of patients and in control groups. Both the experimental test systems and the clinical studies rely on detectable (mutable) positions, which are sites where DNA sequence changes cause phenotypic changes [2,3].

A standard representation of a mutation spectrum is the nucleotide sequence of a target gene with all detected changes placed above it. The base substitution mutation spectrum [4,5] in Figure 1 includes two principal elements: (i) the target sequence (Fig. 1; lower line of continuous DNA sequence) and (ii) the mutations in the target sequence (Fig. 1). Mutations in DNA/RNA molecules are classified as point mutations, deletions/insertions, duplications, inversions, and chromosomal rearrangements.
Point mutations are further classified as base pair substitutions, including transitions (purine [R = A/G] mutates to R, or pyrimidine [Y = C/T] mutates to Y) and transversions (R mutates to Y, or Y mutates to R), and +1 and -1 frameshifts (insertions and deletions of a single base pair). Complex mutations include combinations of several point mutations and are relatively rare.

[Figure 1 graphic: mutations plotted above the target sequence. Target sequence (lower line, 100 bases): GATATCAGCTGATATCCAGCTGGATATCACAGCTGAGATATCAACAGCTGAAGATATCACACAGCTGACAGATATCACCACAGCTGACCAGATATCAGTT. Monomeric units are labelled EA, PA, EB, PB, EC, PC, ED, PD, EE, PE, EF, PF, EG.]

Figure 1. Somatic hypermutation spectrum in an artificially synthesized EPS sequence [4,5]. Potential hotspot positions within AGCT and TA mutable sequences are underlined. The AGCT and TA mutable sequences match the well-known RGYW/WRCY and WA/TW mutable motifs [6,7].

Mutability varies significantly along nucleotide sequences: mutations, whether induced or spontaneous, occur at higher frequencies at certain positions of a nucleotide sequence (mutation "hotspots") [8]. Some mutation hotspots are thought to depend on the nucleotide sequence and the mechanism of mutagenesis per se; these hotspots are called intrinsic mutation hotspots. In contrast, some hotspots may be due to preferential expansion of mutants with high fitness; for example, hotspots in the p53 gene might reflect both intrinsic mutability and selection during tumor progression [3]. Thus, the study of mutation hotspots can help reveal mutagenic mechanisms, or can reveal information about the functional domains of a target protein [2,3,9].

2. Nucleotide context of mutation hotspots

2.1. Local context

Intrinsic mutation hotspots are frequently caused by mutable motifs (hotspot motifs) (reviewed in [2,3,10]).
One well-known example is CpG dinucleotides, which are correlated with mutation hotspots in mammalian genomes [11]. The mutational mechanism for this effect is likely to involve deamination of 5-methylcytosine, which is frequently found at CpG dinucleotides. Thus, C·G→T·A mutations occur at CG mutable motifs (hotspot bases are underlined) due to deamination of 5-methylcytosine followed by replication of the resulting T/G mispair. Another well-known example is the hotspots of somatic mutations in immunoglobulin V genes [12]. In this case mutation hotspots are associated with RGYW/WRCY and WA/TW mutable motifs (potential hotspot sites are underlined, W = A/T) [6,7]. Mutation hotspots usually emerge at a specific position of a mutable motif; for example, only G·C bases are mutation-prone within RGYW/WRCY motifs (potential hotspot sites are underlined). Many other nucleotide sequence context effects on mutation rate have been studied and characterized; some examples are shown in Table 1.

Table 1. Examples of mutable motifs.

Spectrum/test system/mutagen                                 Mutable motif   Comments
Spontaneous G·C→A·T mutations in mammalian genomes           CG              May result from the spontaneous deamination of 5-methylcytosine [11]
Somatic mutations in immunoglobulin V genes                  RGYW            AGYW is more mutable than GGYW
                                                             WA              TA is more mutable than AA [7]
Hotspots of error produced by cytidine deaminase APOBEC3G    GG              in vivo experiment
8-Oxoguanine induced hotspots                                AA              This motif was found to be mutable in pro- and eukaryotic genes
LINEs and SINEs [17]                                         TTAAAA          Target signal of retroposable elements

Hotspot positions are underlined. R = A or G; Y = T or C; S = G or C; W = A or T; K = G or T; M = A or C; B = T, C or G; H = A, T or C; V = A, C or G; D = A, T or G.

Alternatively, repetitive sequences such as homonucleotide runs, direct and inverted repeats, and microsatellite repeats are involved in specific types of high-frequency mutational events (reviewed in [13]).
For these mutations, the exact DNA sequence is not critical; what matters is only that a sequence motif is repeated. The theoretical basis of these observations was suggested by Streisinger and coworkers [14]: it was proposed that short deletions and insertions within homonucleotide or homopolymeric tracts arise by misalignment of DNA strands during replication. This misalignment can lead to heterogeneity in the length of homopolymeric tracts; similar arguments apply to the more complex tandemly repeated structures of microsatellites (reviewed in [2,13]).

Dislocation mutagenesis is similar to misalignment mutagenesis, but involves transient strand slippage in a monotonous run of nucleotides in the primer or template strand, which is followed by incorporation of the next correct nucleotide (Figure 2) [15]. This mechanism was proposed based on studies of the in vitro mutation spectra of DNA polymerase [15]. Dislocation mutagenesis may also play an important role in vivo, generating base substitution hotspots in the control region of human mitochondrial DNA [16].

[Figure 2 graphic: schematic of the primer and template strands (template 5'-G-A-T-C-C-T-A-3') before and after primer strand dislocation; the template-strand subsequence shown below the schematic is 5'-TCC-3' and 5'-CCC-3'.]

Figure 2. Dislocation mutagenesis. For the primer strand dislocation, a three-nucleotide subsequence of the template strand is shown below a schematic representation of the dislocation model.

There is strong evidence that short direct repeats mediate deletions and duplications in DNA [13]. Two possible mechanisms for these events are: 1) recombination between short homologous repeats or 2) DNA polymerase slippage between short repeated sequences [13]. In addition, if heteroduplexes form between imperfect direct repeats, repair of the mismatches could cause base substitutions and frameshift mutations in a concerted manner [18].
This mechanism has been suggested for some classes of spontaneous mutations in bacterial and eukaryotic genes [19] and for somatic mutations in immunoglobulin genes [20]. Long inverted repeats (40-150 bases) are also particularly unstable in bacterial cells [21]. This instability is likely due to the formation of hairpin structures in single-stranded DNA and/or DNA polymerase "jumps" [13]. Correction of a quasipalindrome to a perfect inverted repeat may occur by either inter- or intramolecular strand switch [18]. Many mutations of this type have been observed in bacteria, yeast and human cells [18].

2.2. Global factors

Mutable motifs alone are not enough for the emergence of hotspots; this is illustrated by the distribution of somatic mutations across AGCT sites in an artificially synthesized EPS sequence inserted into an immunoglobulin gene (Figure 1). The EPS sequence contains AGCT sequences, matching the well-known RGYW/WRCY mutable motifs [6], repeated six times (PA-PF monomeric units in Figure 1) [4,5]. The number of mutations at G:C bases within AGCT motifs varied from 4 (the PF monomer) to 21 (the PD monomer). Two significantly different classes of AGCT motifs were revealed by the CLUSTERM program (www.itb.cnr.it/webmutation/) [22]: the hotspot class includes the PA, PB, PC, and PD sequences (Figure 1), while the other class consists of the PE and PF sequences, which have a significantly lower frequency of mutations. This result shows that significant heterogeneity of the mutation rate exists even in monotonously repeated AGCT motifs. Notably, the frequency of mutations in AGCT motifs dropped significantly at the end of the EPS sequence. This illustrates that mutation hotspots are not equivalent to mutable motifs. The emergence of hotspots is a complex process depending on high-order structures which are hard to detect.

Many factors may influence mutation frequency in a particular nucleotide sequence.
However, in most cases, only the local nucleotide sequence context has been studied. It is likely that other higher-level features of gene or chromatin structure also have a significant influence on the mutation frequency of a mutable motif at a specific site. An important factor could be the rate of DNA repair. DNA repair rates vary for transcribed and non-transcribed strands of the same gene and for more and less highly expressed genes [23]. Inherent asymmetry between the two DNA strands at the replication fork could also influence mutation frequency and specificity [24]. Other potential factors include asymmetric base composition or higher order chromatin structure (reviewed by Boulikas [25]).

In general, the impact of mutation rate heterogeneity is not clear, and there are some contradictions about neutral mutation rate variation across genomes. It was suggested that mutation rates differ substantially among regions of mammalian genomes [26]. However, analysis of genomic alignments of human, chimpanzee, and baboon suggested that since the time of the human-chimpanzee ancestor there has been little or no regional variation in mutation [27]. The controversy about mutation rate variation can be crucial for estimates of the fraction of human non-coding DNA that is under purifying selection. In general, the problem of mutation rate variation is important for understanding fundamental problems of molecular biology and evolution. In this paper we discuss three examples of mutation spectra analysis.

3. 8-Oxoguanine-induced mutations

Chemical agents, ionizing radiation and oxidative stress cause DNA oxidation [28]. 8-Oxoguanine (8-oxoG) is one of the most prominent base oxidation products and has been implicated in mutagenesis, carcinogenesis and aging [29]. It has been shown to cause G·C→T·A and A·T→C·G mutations in vivo and in vitro, depending on whether guanine is oxidized in DNA or in the DNA precursor pools, respectively [30].
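The base-pair notation used here groups a substitution with its complement on the opposite strand, so G→T on one strand and C→A on the other both belong to the G·C→T·A class. A minimal Python sketch of that bookkeeping, together with the transition/transversion split from Section 1, might look as follows. The function names, the ASCII label format (`G:C>T:A`), and the choice to report each class from the strand carrying the purine are illustrative assumptions, not part of the original analysis.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = {"A", "G"}


def _check(ref, alt):
    """Normalise case and reject anything that is not a base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    return ref, alt


def pair_class(ref, alt):
    """Map a single-strand substitution to base-pair notation, e.g. G>T -> 'G:C>T:A'."""
    ref, alt = _check(ref, alt)
    # Assumed convention: describe the change from the strand on which the
    # original base is a purine, so a substitution and its complement
    # receive the same label.
    if ref not in PURINES:
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    ref, alt = _check(ref, alt)
    return (ref in PURINES) == (alt in PURINES)


if __name__ == "__main__":
    for ref, alt in [("G", "T"), ("C", "A"), ("A", "C"), ("T", "G"), ("C", "T")]:
        kind = "transition" if is_transition(ref, alt) else "transversion"
        print(ref, alt, pair_class(ref, alt), kind)
```

With this convention, both classes attributed to 8-oxoG in the text are transversions, and each can arise from either of two single-strand changes.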
A spontaneous mutation spectrum in the mutT-deficient E. coli strain (186 mutations; lacI-d test system) is composed almost exclusively of A·T→C·G transversions, which is in general consistent with the mutagenic properties of 8-oxoGTP [31]. Hotspot context analysis of these transversions [31] using the CLUSTERM program [22] and regression trees [32] revealed the AA mutable sequence (the hotspot position is underlined) (Figure 3). Comparison of the mutT- spectrum with the A·T→C·G transversions in a spectrum of spontaneous mutations in the lacI gene (lacI-d test system) [33] did not reveal significant differences between them (probability that these two spectra are different [34], P(χ²) = 0.69). Furthermore, a highly significant positive correlation was found (Kendall's τ correlation coefficient [35] = 0.65, P < 0.01). This result suggested that a substantial fraction of spontaneous A·T→C·G mutations in E. coli is caused by 8-oxoGTP in nucleotide pools.

Table 2. Comparison of A·T→C·G transversions in the lacI gene from mutT- and wild-type strains of E. coli [31,33].

Position: 41 81 72 64 87 168 79 189 192 195 167 83 117 96 128 177 77 141 105 54 4 10 5 2 4 9 7
A·T→C·G mutations in mutT- strain: 23 18 10 37 5 15 4 0 7 20 3 4 5 0 2 1 8 3
A·T→C·G spontaneous mutations: 23221 2 2 8 1 6 10 0 10

Positions of AA mutable motifs are underlined.

Reconstructed spontaneous mutations in human pseudogenes [36] (ftp.bionet.nsc.ru/pub/biology/dbms/PSEUDO.ZIP) were also analyzed, and the frequencies of nucleotides surrounding A·T→C·G transversions are shown in Table 3. Notably, AA and TT are the most frequent dinucleotide combinations. This excess is statistically significant (P(χ²) < 0.01) as compared to dinucleotide frequencies in the reconstructed ancestral sequences (Table 3). A substantially higher frequency of A in position +1 was observed for A/C SNPs (A >> T C >> G) [37], which is consistent with the AA mutable motif.

Table 3.
Frequencies of bases in positions +1 and -1 in a set of spontaneous A·T→C·G transversions found in human pseudogenes [36].

Substitution   Position -1                 Position +1
               A     T     G     C         A     T     G     C
A→C                                        0.35  0.24  0.17  0.24
T→G            0.25  0.32  0.21  0.22
Expected       0.25  0.22  0.22  0.31      0.26  0.23  0.29  0.22

Expected values (frequencies of AN and NT dinucleotides, N = A/T/G/C) were calculated in the ancestral sequences used for reconstruction of spontaneous mutations [36].

These results suggested that mutagenesis due to 8-oxoG is significantly influenced by the nearest neighboring bases and that the context is quite evolutionarily stable. The revealed context properties could be fingerprints of interactions between DNA and repair/replication/modification enzymes. There are, at present, no data on evolutionary conservation of the context specificity of such interactions between pro- and eukaryotes [3]. Thus, the alternative view, that the revealed context properties reflect intrinsic properties of interactions between 8-oxoG and DNA, might be the current model of choice.

4. Human mitochondrial DNA

Most studies of mitochondrial DNA (mtDNA) variability have been based on sequence variation of the fast-evolving major non-coding (or control) region, which spans 1122 bases between the tRNA genes for proline (tRNAPro) and phenylalanine (tRNAPhe) [38]. The majority of mutations are concentrated in two hypervariable segments, HVS I (positions 16024-16365) and HVS II (positions 73-340) [39]. Our analysis of phylogenetically reconstructed mutation spectra of the mtDNA HVS I and II regions has suggested that dislocation mutagenesis (Figure 2) plays an important role in generating base substitutions in these regions [16,40]. However, the impact of dislocation mutagenesis on the remaining part of mtDNA remains unclear.
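One simple reading of the dislocation model in Figure 2 is that the substituted base becomes identical to a neighbouring base on the same strand, as in 5'-TCC-3' to 5'-CCC-3'. The Python sketch below flags substitutions that satisfy this condition. It is an illustration only and not the Monte-Carlo procedure of [40]; the function name, the 0-based coordinates, and the decision to accept either the 5' or the 3' neighbour are assumptions.

```python
def consistent_with_dislocation(seq, pos, new_base):
    """Return True if replacing seq[pos] with new_base copies an adjacent base.

    Assumption: a dislocation-type substitution makes the mutated base
    identical to its 5' or 3' neighbour, as in 5'-TCC-3' -> 5'-CCC-3'.
    `pos` is a 0-based index into `seq`.
    """
    seq = seq.upper()
    new_base = new_base.upper()
    if not 0 <= pos < len(seq):
        raise IndexError("pos is outside the sequence")
    if new_base == seq[pos]:
        raise ValueError("new_base equals the original base: not a substitution")
    neighbours = []
    if pos > 0:
        neighbours.append(seq[pos - 1])   # 5' neighbour
    if pos + 1 < len(seq):
        neighbours.append(seq[pos + 1])   # 3' neighbour
    return new_base in neighbours


if __name__ == "__main__":
    # The template-strand example from Figure 2: T -> C next to a C.
    print(consistent_with_dislocation("TCC", 0, "C"))
    # A change to a base that is absent from both neighbouring positions.
    print(consistent_with_dislocation("TCC", 0, "G"))
```

Counting how many observed substitutions pass such a check, and comparing that count with randomly placed substitutions, is the general kind of question a Monte-Carlo test of the model addresses.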
To study spontaneous base substitutions in regions of human mtDNA other than HVS I and II, we reconstructed mutation spectra of the mtDNA region containing the ND3, tRNAArg, and ND4L genes (positions 10171-10659) using published data on polymorphisms in various human populations. We analyzed different phylogenetic haplogroups of mtDNA revealed by means of median network analysis [39] (http://fluxus-engineering.com, the Network 3.1 program). We used only published population data comprising mtDNA sequences with known phylogenetic status, as described by Malyarchuk and coworkers [16,40]. The reconstructed mutation spectrum contained 93 mutations in 489 bases. The dislocation mutagenesis model was analyzed for the reconstructed mutation spectrum using a Monte-Carlo procedure [40]. No statistically significant support for this model was found (P(W ≤ Wrandom) = 0.68). This result suggested that dislocation mutagenesis does not play an important role in generating substitutions in the coding regions of mtDNA.

A higher rate of molecular evolution in HVS regions than in the remaining part of mtDNA can be explained by differences in either mutation or selection pressure [39]. The observed differences in dislocation mutagenesis suggested that the higher rate of mutations in HVS regions is caused by intrinsic properties of the mutation process. HVS regions are associated with initiation/termination of mtDNA replication and with the RNA/DNA transition, and one of these processes may be prone to DNA strand dislocation mutagenesis.

5. Hypermutation in HIV-1

Genomic heterogeneity is a hallmark of retroviruses, especially of HIV, and helps the virus to escape the host immune system. Hypermutability is linked to pathogenicity and has generally been attributed to the relatively low fidelity of reverse transcriptase [41]. A high rate of mutagenesis in retroviruses, in addition to being a way to elude the immune system, may lead to their low viability.
It was hypothesized that even a relatively small increase of the mutation rate in a retrovirus will lead to the accumulation of many deleterious mutations and to virus extinction. Indeed, the treatment of HIV-infected human cells with mutagenic nucleoside analogs resulted in the loss of viral replicative potential [42]. Recent discoveries suggest that nature has already exploited this mechanism for protection from retroviruses. A unique cellular gene, CEM15, was found to confer resistance to HIV. Its antiviral action could be overcome by the presence of the virion infectivity factor (Vif), encoded by the viral genome [43]. CEM15 appeared to be identical to the cytidine deaminase APOBEC3G. It is already known that APOBEC3G is a strong mutator when expressed in E. coli, suggesting that it could deaminate cytosines in DNA [44]. Viruses without Vif experienced hypermutation, and all these mutations were transitions that could be explained by deamination of the (-) DNA strand of the virus [45]. The current model for APOBEC3G antiviral action proposes that the deaminase is packaged into Vif- HIV virions and induces massive deamination in the viral (-) strand [45,46]. This deamination can lead to hypermutagenesis or to the destruction of the viral genome during repair of uracil [46].

We analyzed the published spectra of APOBEC3G-induced mutations (434 mutations) in the GFP gene [47] (Fig. 3A). Analysis of the nucleotide context of mutation hotspots using regression trees [32] suggested that almost all APOBEC3G hotspots are located in the GG motif (Fig. 3A). Not all GG motifs appeared to be hotspots for mutations. This can be explained by the GFP selection system (mutations in some GG sites cannot be detected by this system); however, some unknown global context properties of GG hotspot sites might also modulate mutability in some GG sites. We analyzed the correlation between this motif and mutations of HIV-1 DNA (30 mutations) in the absence of the Vif protein [45] (Fig.
3B) using the CONSEN program [6,7]. A low probability PW