# Balancing natural selection

x

### Abstract

Our understanding of balancing selection is currently becoming greatly clarified by new sequence data being gathered from genes in which polymorphisms are known to be maintained by selection. The data can be interpreted in conjunction with results from population genetics models that include recombination between selected sites and nearby neutral marker variants. This understanding is making possible tests for balancing selection using molecular evolutionary approaches. Such tests do not necessarily require knowledge of the functional types of the different alleles at a locus, but such information, as well as information about the geographic distribution of alleles and markers near the genes, can potentially help towards understanding what form of balancing selection is acting, and how long alleles have been maintained.

Citation: Charlesworth D (2006) Balancing Selection and Its Effects on Sequences in Nearby Genome Regions. PLoS Genet 2(4): e64. https://doi.org/10.1371/journal.pgen.0020064

Published: April 28, 2006

Copyright: © 2006 Deborah Charlesworth. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Funding from the National Environment Research Council (NERC) is gratefully acknowledged.

Competing interests: The author has declared that no competing interests exist.

Abbreviations: CMS, cytoplasmic male sterility; LD, linkage disequilibrium; MHC, histocompatibility

### Introduction

The concept of balancing selection is well-established, appearing in every genetics and evolution textbook. Classic examples are known in humans and other organisms, and two different forms of balancing selection are very familiar—heterozygote advantage at a locus (often called overdominance), and frequency-dependent selection with a rare-allele advantage (although overdominance is often incorrectly used as synonymous with balancing selection). It is well-known that balancing selection maintains different alleles at the selected loci. A familiar example of frequency dependence (though not always viewed in this way) is the selection on sex ratio that maintains males and females in a population with an X/Y sex chromosome system, which behaves like a single gene, since large parts of the sex chromosomes, including the regions containing the sex-determining region, do not undergo genetic crossing-over.

More complex models of selection maintaining diversity include temporally or spatially heterogeneous selection, which can sometimes maintain different alleles [1,2], and systems with interactions between species (or other genetically interacting systems). An example is male sterility in plants. Maternally transmitted selfish cytoplasmic male-sterility (CMS) factors that increase seed production can invade hermaphrodite populations, but generally cannot spread throughout a population, because, as females become common, hermaphrodites become the only source of pollen to fertilise females' seeds; relative fitness of non male-sterile individuals then increases. Thus, as with the male/female sex polymorphism, a balanced polymorphism is established, with females and hermaphrodites in the population. Restorer alleles, suppressing the sterility of plants with the sterility cytoplasmic genotype, may then often spread. Although fixation is possible, polymorphism may again sometimes remain within populations for both cytoplasmic and restorer genes, with complex frequency dependence leading to stable equilibrium frequencies of the genetic factors, or to stable or slowly decaying oscillations in the frequencies [3].

Host–pathogen systems can behave similarly, with a pathogen invading a host population, thus creating selection for resistant genotypes. Mutant resistance alleles can either become fixed (with, potentially, a succession of such fixations as new pathogens appear, i.e., an “evolutionary arms race”) or establish polymorphisms, with different resistance alleles present within populations or in different populations of a species [4,5].

The timescales of the different kinds of balancing selection determine how the selection affects sequences in nearby genome regions (Table 1). Sex chromosomes are maintained for long evolutionary times, while situations involving pathogens might often be ephemeral, since hosts might evolve resistance alleles that become fixed in the species. Many of the classic cases have now been restudied using DNA sequence data, and clear “footprints” of selection are sometimes found, which can be used to detect selection and to distinguish balancing from other forms of selection. These tests depend on the interaction of balancing selection at amino acids or other sites in genes with recombination, affecting non-selected (neutral) variation in nearby genome regions. This review does not deal with technical aspects of testing for balancing selection, but focuses on an understanding of how such selection affects diversity at neutral sites near sites under balancing selection, and how balancing selection over different timescales leads to different footprints.

### Long-Term Balancing Selection: High Sequence Diversity in Genes where Polymorphisms Are Maintained for Long Evolutionary Times

The familiar textbook view of balancing selection stresses the most dramatic cases, with alleles maintained for very long evolutionary times. Balancing selection is often portrayed as “diversifying,” meaning that there is an advantage to new alleles, as with plant self-incompatibility (S) alleles, where the frequency-dependent selective advantage of rare pollen and pistil types is well understood to maintain many alleles [6,7], or fungal incompatibility alleles [8,9], whose selective maintenance remains unclear, despite evident similarity to plant SI.

When the same alleles persist for long times, balancing selection may be detectable from its effects at nearby neutral sites. The population genetics of balancing selection shows that, as well as maintaining diversity at the selected sites themselves (generally maintaining different amino acids), it increases diversity at closely linked neutral sites [10–12]. Regions of genome close to a site under balancing selection, which rarely recombine with the selected site(s), will have common ancestors longer ago than other regions (longer coalescence times), because migration of variants between allelic classes depends on recombination. This high diversity is not due to diversifying selection, since systems with just two states, such as sex-determining genes, where selection on sex ratio gives the rarer sex an advantage, with no diversifying selection, can also be maintained in the long term (though sometimes a sex-determination system is replaced by a new one [13]). This example clearly illustrates the evolution of high diversity. The divergence of the X and Y chromosomes were once homologues. With the acquisition of sex-determining functions and loss of recombination, genes on these chromosomes now have, in several taxa, higher sequence divergence than between related species [14–16].

If different functional types of alleles at a locus persist long enough, each allele class can acquire its own unique set of neutral mutations, each associated with the class in which it arose, until eventually recombination causes “migration” into a different allele (reviewed in [17]). The region around alleles of functionally different types can thus differ at multiple non-selected sites, so that polymorphism will be higher than in unlinked genome regions, over a distance depending on the local recombination frequency, and variants in the region will show linkage disequilibrium (LD) due to associations between functionally different alleles [11,18].

High diversity can thus provide evidence for balancing selection. In plant species with CMS, large frequency differences of females in natural populations, and differences in the frequency of restoration of male fertility when females from one population are pollinated by males from elsewhere, indicate highly variable frequencies of the genetic factors involved. This might reflect regular turnover of the sterility and restorer factors, in an arms race [19], or perhaps frequency oscillations [3,20]. However, high diversity has been found in sequences of a mitochondrial gene within populations of Silene acaulis, a plant with CMS [21], which excludes turnover of cytoplasmic genotypes, or in prolonged periods of low frequency for any of these genotypes. In this species at least, the male-sterility polymorphisms must therefore have been maintained for long times.

The CMS case is extreme, because, like sex chromosomes, mitochondrial genomes rarely recombine, since heteroplasmy is rare. Even with recombination, however, considerable sequence diversity can exist several kilobases from a selected site, in systems with many different alleles (Figure 1). Long-term maintenance of honeybee sex-determining alleles may be one such case, with high amino acid and synonymous site diversity [22]. Nucleotide diversity is also extremely high throughout the sequences of multi-allelic pistil recognition genes of plants with gametophytic self-incompatibility, e.g. [23–25], and in the pistil and pollen S-loci of species with sporophytic incompatibility [26,27]. Recombination rates between the pollen and pistil S-loci are not known, but may be low, because selection against self-fertile recombinants is likely to be strong.

Figure 1. Sequence Diversity Expected at Neutral Sites at Different Distances from a Site under Balancing Selection

The figure shows the dependence of diversity at neutral sites in a gene on the number of different alleles maintained (n values) and the distance from the selected sites. A recombination rate of 1 cM/Mb is assumed, which is appropriate for humans, but much lower than the estimated rate for A. thaliana or maize. The example calculated is based on equations in the Appendix of [12], which are appropriate for selection at loci, such as MHC, where homozygous genotypes can be formed (e.g., a system with heterozygous advantage in which homozygotes are viable); note, however, that heterozygous advantage is unlikely to maintain very large allele numbers [35,82]. In the example shown, the turnover rate of alleles at the selected locus (or site) is assumed to be 10−7.

(A) Shows predicted nucleotide diversity (π) between and within haplotypes of allelic classes (defined as having different alleles at the selected site or sites) for the case when 50 different alleles are maintained.

(B) Shows the proportion of the overall diversity that is between allelic classes (analogous to FST in a subdivided population), showing differentiation between the haplotypes across several kb when there are many alleles, even when recombination occurs.

https://doi.org/10.1371/journal.pgen.0020064.g001

If host–pathogen co-evolution leads to long-term maintenance of variation, this should therefore be detectable from these “footprints” at nearby silent sites and marker loci, even if we are unable to classify the functional types of alleles and determine their number (though fewer alleles are expected than for incompatibility loci). Some loci known to be involved in defence processes indeed have high sequence polymorphism. One such locus in Arabidopsis thaliana, is estimated to have nucleotide diversity above 4% for synonymous sites, and even for non-synonymous ones [28], much above the average for this species (<1% for synonymous sites [29]). These genes are difficult to study, because they are often members of gene families, and it is essential in studying polymorphism to be sure that the sequences are from a single locus, and to exclude “migration” from paralogous genes, which might occur by gene conversion or other exchange processes.

If exchanges between alleles are frequent, or allele numbers are not large, even long-term balancing selection causes high differentiation between alleles only very close to the selected sites [12,30], while exchanges erode differences at synonymous and intron sites elsewhere in the gene (Figure 1). It may thus be difficult to distinguish between long-term balancing selection with recombination, and short-term maintenance of alleles (the likely situation for allozyme loci, discussed later). Recombination also implies that tests may fail to detect selected loci by searching for high diversity genes, e.g., [31]. Loci will be missed where selection has not acted for long enough, or exchanges are too frequent, to allow for diversity to build up between alleles.

Recombination clearly occurs at the histocompatibility (MHC) loci [32–34]. Although their diversity per nucleotide site is only a few percent [33], this is exceptionally high for human sequences (though much lower than diversity in plant or fungal incompatibility gene sequences). These genes' much-cited high allelic diversity largely results from recombination between differentiated haplotypes, and this differentiation clearly indicates long-term balancing selection. Arguments against MHC alleles being maintained by overdominance are based on the difficulty of maintaining large allele numbers [35], but although numbers of functionally different alleles are currently unknown they must be lower than haplotype numbers.

### Trans-Specific Polymorphism

Another effect of long-term balancing selection, also relying on the evolution of highly differentiated alleles, is trans-specific polymorphism. When the same alleles persist for long times, and are not regularly replaced by new alleles (“turnover”), allele ages can exceed the ages of related species [36]. If a species with such a balanced polymorphism splits into two, multiple different haplotypes will often pass to the daughter species. Figure 2 gives a hypothetical example for sites in and near a gene. Initially, the associations of variants will be the same as in the ancestor, but, over evolutionary time, this signal will become indistinct, as the sequences of each daughter species' copies of each allele type recombine with other haplotypes of the locus, acquire new mutations, or are lost or evolve into new, functionally different alleles (allele turnover). After enough time, the sequences will cluster by species, rather than by functional types.

Figure 2. Lineages at a Locus under Long-Term Balancing Selection

Two haplotypes with different alleles, Ax and Ay, which diverged before the common ancestor of two species (1 and 2), are denoted by black and grey lines and boxes, respectively (denoting genes). Variants in the regions in and around the selected locus will remain associated with the haplotype in which they arose until recombination occurs with a different haplotype, even after the species become isolated. Species–specific differences (shown as thin horizontal lines in the tree and vertical lines in the haplotypes) will also accumulate. Recombination between different haplotypes (indicated by mixed black–grey haplotypes) will mean that sites close to the selected sites will be most differentiated between alleles (see Figure 1).

https://doi.org/10.1371/journal.pgen.0020064.g002

Trans-specific polymorphism is highly unlikely under neutrality, except between species that are so closely related that they are likely to share variants present in their common ancestors, so it should provide a test for long-term balancing selection [18]. For plant and fungal incompatibility systems, the same types are sometimes detectable in different species [8,9]. In Brassica oleracea and Brassica rapa, S-alleles with similar sequences of the pistil receptor gene reject each others' pollen [37]. Even in incompatibility systems, however, typing is laborious, and most analyses infer trans-specific polymorphism from gene trees using sequences from multiple species (e.g., [23,38–40]). When sequences do not cluster by species, long-term maintenance of alleles is likely. However, reconstructing trees is inappropriate, since sequences may recombine within the study species or their ancestors [41], and, in the absence of functional information, trans-specific or trans-generic alleles are determined arbitrarily. The signal of fixed differences between species also quickly overwhelms that of shared variants unless sequences of functionally different alleles differ hugely, as in MHC systems [42]. Ideally, individual variants in sequences should be examined to test rigorously for unexpected numbers of trans-specific polymorphisms [43].

In most cases, we do not know the nature of the selection maintaining polymorphisms with multiple alleles, except in very general terms (e.g., MHC polymorphisms may be connected with resistance to diseases), and therefore cannot generally classify alleles into functional “types” and recognise when the same functional alleles are shared between different species. It is extremely complex to determine the effects of amino acid changes in the peptide binding regions of MHC proteins on the strength and specificity of their binding to peptides. However, shared amino acid sequence motifs determining similar binding properties between primate species are sometimes recognisable [44]. Even this does not necessarily imply long-term balancing selection. Despite apparently similar ABO blood groups in different primate species and high sequence diversity [45], the sequences of A, B, and O alleles have few trans-specific variants, so recombination may occur between alleles, and convergent evolution between species has been suggested [46].

Tests for trans-specific polymorphism at silent sites in sequences should nevertheless help us to detect long-term balancing selection, even without being able to classify alleles functionally. Searching human and chimpanzee gene sequences for trans-specific polymorphism, we uncovered little evidence for long-term balancing selection, except for MHC sequences [43,47]. For MHC genes, frequently observed high diversity [48] and trans-specific polymorphism [42] rule out a high turnover rate and, thus, arms-race scenarios, though this does not necessarily suggest overdominant selection.

### Short-Term Balancing Selection

Long-term balancing selection is, however, probably unusual. The evolutionary lifespans of alleles (or, inversely, their turnover rates) are likely to be very important in understanding pathogen systems, in which frequency-dependent selection can sometimes maintain allelic diversity, but directional selection for resistance involving arms races may sometimes occur. Estimating diversity at known disease-resistance loci, without knowing the alleles' functional types, or even the relevant pathogens in nature, suggests that some of them maintain long-term polymorphisms [49,50].

Even with directional selection due to pathogens (or to human disturbance, e.g., a pesticide), polymorphisms may establish, because heterozygote advantage can arise simply from a disadvantage of a new allele when homozygous. When a resistance mutation arises, if heterozygotes are resistant, and have no other strong disadvantage, the allele will increase in frequency. In an outcrossing population, homozygotes for the mutation are initially rare. Consequently, even a strong survival or fertility disadvantage of the mutation in homozygotes cannot prevent its increase to an intermediate frequency, but may lead to a balanced polymorphism.

When such alleles arose recently, the sequences at the locus can show a characteristic pattern in which the new alleles are uniform throughout a large region surrounding the gene. A mutation with a strong selective advantage, which increases in frequency rapidly, has too little time to recombine with variants in the surrounding region of genome, or to incorporate variants by mutation, especially in the case of partially dominant advantageous alleles, which quickly increase in frequency [51]. Low diversity due to such “selective sweeps” (Figure 3) is the basis of one type of test for recent spread of advantageous allele, using silent site variants in the sequence of the locus itself and its introns, e.g., [52–54], or markers such as microsatellite alleles at closely linked loci [55].

Figure 3. Haplotypes in a Genome Region after Spread of an Advantageous Mutation That Establishes a Balanced Polymorphism

An advantageous mutation (denoted by the star) arises and quickly spreads to a high frequency. Variants (black dots) in the region of genome around the selected site will be carried to high frequency in the haplotypes with the mutation, and recombination subsequently introduces variants from the rest of the population, especially at sites distant from the selected site. Mutations may also occur. Note that the hitchhiking does not contribute to differentiation between haplotypes, since the variants were present before the selective event.

https://doi.org/10.1371/journal.pgen.0020064.g003

When selection opposing fixation has led to a recently established balanced situation, the initial sweep (or hitchhiking) event is potentially detectable from “homozygosity” of variants in and around the locus, since it creates a high frequency of one haplotype [53,56]. Such footprints of recently increased frequency of a uniform haplotype are evident near the β-globin locus in African populations with the classic balancing selection example, sickle-cell allele [57], and across large regions of the chromosome carrying the rat warfarin-resistance gene [58]. The regions affected by such selective sweeps are generally much larger than the region of LD around a locus under long-term balancing selection, because recombination has not yet eroded differences between the selected allele and others, but when the advantageous allele is maintained by balancing selection and does not become fixed in the population, diversity is severely reduced only in haplotypes carrying that allele (Figure 3). Other previously known cases of balancing selection showing such patterns include the human polymorphisms PTC taster/non-taster [59], glucose-6-phosphate dehydrogenase alleles [60], and haemoglobin E, another globin variant involved in resistance to malaria [61]. Allozyme polymorphisms may also be due to recent selective events, and the classic case of balancing selection, Drosophila inversion polymorphisms [62], may also often not be maintained for very long times [see 63–65].

It might seem to be straightforward to discover new examples of selection by using these signs of selection in genome scans. However, rapidly increased frequency of one allele at a locus occasionally happens by chance, as genetic drift occurs in a population, and will be indistinguishable from selection events. Thus, tests must also show that loci identified lie outside the bounds of what might occur if the variants were neutral. This is difficult, because real populations are often subdivided, and the sub-populations (and thus the species as a whole) have unknown complicated histories involving size changes and migration, which cannot be taken into account [66]. This is proving to be a problem for inferring selection in human populations, despite very large studies [67,68], and it surely applies to other populations. It may be helpful to compare the extent of uniform haplotypes of different alleles [53] or many different loci [54], which should often share similar histories [69]. Certainly, given the problems for established tests for selection including McDonald–Kreitman, Hudson, Kreitman–Aguadé, and Tajima's tests [reviewed in 70] that can be caused by unknown population subdivision and history [71–73], these tests are now often supplemented by other evidence.

The difficulties are greatest for weak selection, or selection events that occurred long ago. Weak balancing selection may often occur, and could be the basis of much quantitative variability, including variation in fitness. In finite populations, fixation can occur for alleles under weak balancing selection, which can complicate tests for selection [74].

There is much evidence from whole organism studies, such as reciprocal transplant experiments, for selective differences between populations [75], which could create balancing selection at the metapopulation scale. The detailed genetic basis of such local adaptations is interesting, as is the duration of differences. Estimates of genetic subdivision may be helpful, particularly FST, which estimates the proportion of diversity between populations. FST-based tests for selection are already in use [76,77]. This approach should help with an understanding of selection of host–pathogen systems in natural populations, which may often involve local adaptation [e.g., 78]. If sequences of loci involved in pathogen defences suggest unusually high subdivision compared with other loci, this might suggest selection for locally adaptive alleles, differing from one population to another. In contrast, loci where balancing selection maintains alleles within populations, such as CMS haplotypes, or incompatibility loci, should show less evidence for subdivision than the average locus [73,79]. This approach may be helpful in understanding selection on MHC genes [80]. Due to recombination, different populations will generally have different sets of alleles, but sequences should reveal whether populations share variants or differ significantly more than other loci (suggesting local selection differences). In an MHC locus studied in deer mouse populations, low FST was found, suggesting similar selection across populations [81].

### Conclusions

Analyses of DNA sequences have the promise to advance understanding of the different forms of balancing selection. Sequences can uncover highly polymorphic loci (even in the presence of recombination), define regions of LD and detect trans-specific polymorphism of neutral variants. It will be particularly interesting to combine such results with FST estimates from sampling multiple natural populations, to test which cases involve maintenance of diversity within populations, and which do not. Even if they include false positives, sequence-based tests can provide interesting sets of genes that may be under selection, of interest for detailed studies.

### References

1. 1. Levene H (1953) Genetic equilibrium when more than one niche is available. Am Nat 87: 331–333.
2. 2. Gillespie JH (1978) A general model to account for enzyme variation in natural populations. V. The SAS/CFF model. Theor Popul Biol 14: 1–45.
3. 3. Gouyon PH, Vichot F, Damme Jv (1991) Nuclear-cytoplasmic male sterility: Single point equilibria versus limit cycles. Am Nat 137: 498–514.
4. 4. Seger J (1988) Dynamics of some simple host–parasite models with more than two genotypes in each species. Philos Trans R Soc Lond B Biol Sci 319: 541–555.
5. 5. Frank SA (1993) Coevolutionary genetics of plants and pathogens. Evol Ecol 7: 45–75.
6. 6. Takahata N (1990) A simple genealogical structure of strongly balanced allelic lines and trans-species polymorphism. Proc Natl Acad Sci U S A 87: 2419–2423.
7. 7. Vekemans X, Slatkin M (1994) Gene and allelic genealogies at a gametophytic self-incompatibility locus. Genetics 137: 1157–1165.
8. 8. Wu J, Saupe SJ, Glass NL (1998) Evidence for balancing selection operating at the het-c heterokaryon incompatibility locus in filamentous fungi. Proc Natl Acad Sci U S A 95: 12398–12402.
9. 9. May G, Shaw F, Badrane H, Vekemans X (1999) The signature of balancing selection: Fungal mating compatibility gene evolution. Proc Natl Acad Sci U S A 96: 9172–9177.
10. 10. Hudson RR, Kaplan NL (1988) The coalescent process in models with selection and recombination. Genetics 120: 831–840.
11. 11. Charlesworth B, Nordborg M, Charlesworth D (1997) The effects of local selection, balanced polymorphism and background selection on equilibrium patterns of genetic diversity in subdivided inbreeding and outcrossing populations. Genet Res 70: 155–174.
12. 12. Takahata N, Satta Y (1998) Footprints of intragenic recombination at HLA loci. Immunogenetics 47: 430–441.
13. 13. Bull JJ (1983) Evolution of sex determining mechanisms. Menlo Park (California): Benjamin/Cummings. 316 p.
14. 14. Lahn BT, Page DC (1999) Four evolutionary strata on the human X chromosome. Science 286: 964–967.
15. 15. Lawson-Handley LJ, Ceplitis H, Ellegren H (2004) Evolutionary strata on the chicken Z chromosome: Implications for sex chromosome evolution. Genetics 167: 367–376.
16. 16. Nicolas M, Marais G, Hykelova V, Janousek B, Laporte V, et al. (2005) A gradual process of recombination restriction in the evolutionary history of the sex chromosomes in dioecious plants. PLoS Biology 3(1). e4. DOI: https://doi.org/10.1371/journal.pbio.0030004.
17. 17. Charlesworth B, Charlesworth D, Barton NH (2003) The effects of genetic and geographic structure on neutral variation. Annu Rev Ecol Evol S 34: 99–125.
18. 18. Wiuf C, Zhao K, Innan H, Nordborg M (2004) The probability and chromosomal extent of trans-specific polymorphism. Genetics 168: 2363–2372.
19. 19. Frank SA (1989) The evolutionary dynamics of cytoplasmic male sterility. Am Nat 133: 345–576.
20. 20. Charlesworth D (1981) A further study of the problem of the maintenance of females in gynodioecious species. Heredity 46: 27–39.
21. 21. Städler T, Delph LF (2002) Ancient mitochondrial haplotypes and evidence for intragenic recombination in a gynodioecious plant. Proc Natl Acad Sci U S A 99: 11730–11735.
22. 22. Hasselmann M, Beye M (2004) Signatures of selection among sex-determining alleles of the honey bee. Proc Natl Acad Sci U S A 101: 4888–4893.
23. 23. Richman AD, Uyenoyama MK, Kohn JR (1996) Allelic diversity and gene genealogy at the self-incompatibility locus in the Solanaceae. Science 273: 1212–1216.
24. 24. Lu Y (2002) Molecular evolution at the self-incompatibility locus of Physalis longifolia (Solanaceae). J Mol Evol 54: 784–793.
25. 25. Lu Y (2001) Roles of lineage sorting and phylogenetic relationship in the genetic diversity at the self-incompatibility locus of Solanaceae. Heredity 86: 195–205.
26. 26. Sato T, Nishio T, Kimura R, Kusaba M, Suzuki G, et al. (2002) Coevolution of the S-locus genes SRK, SLG and SP11/SCR in Brassica oleracea and B. rapa. Genetics 162: 931–940.
27. 27. Charlesworth D, Bartolomé C, Schierup MH, Mable BK (2003) Haplotype structure of the stigmatic self-incompatibility gene in natural populations of Arabidopsis lyrata. Mol Biol Evol 20: 1741–1753.
28. 28. Rose LE, Bittner-Eddy PD, Langley CH, Holub EB, Michelmore RW, et al. (2004) The maintenance of extreme amino acid diversity at the disease resistance gene, RPP13, in Arabidopsis thaliana. Genetics 166: 1517–1527.
29. 29. Nordborg M, Hu TT, Ishino Y, Jhaveri Y, Toomajian C, et al. (2005) The pattern of polymorphism in Arabidopsis thaliana. PLoS Biol 3(7): e196..
30. 30. Navarro A, Barton NH (2002) The effects of multilocus balancing selection on neutral variability. Genetics 161: 849–863.
31. 31. Cork JM, Purugganan MD (2005) High-diversity genes in the Arabidopsis genome. Genetics 170: 1897–1911.
32. 32. Carrington M (1999) Recombination within the human MHC. Immunol Rev 167: 245–256.
33. 33. Raymond CK, Kas A, Qiu R, Zhou Y, Subramanian Sa, et al. (2005) Ancient haplotypes of the HLA Class II region. Genome Res 15: 1250–1257.
34. 34. Traherne JA, Horton R, Roberts AN, Miretti MM, Hurles ME, et al. (2006) Genetic analysis of completely sequenced disease-associated MHC haplotypes identifies shuffling of segments in recent human history. PLoS Genetics 2(1): 81–92. DOI: https://doi.org/10.1371/journal.pgen.0020009.
35. 35. Boer RJD, Borghans JAM, Boven MV, Kesmir C, Weissing FJ (2004) Heterozygote advantage fails to explain the high degree of polymorphism of the MHC. Immunogenetics 55: 725–731.
36. 36. Muirhead CA, Glass NL, Slatkin M (2002) Multilocus self-recognition systems in fungi as a cause of trans-species polymorphism. Genetics 161: 633–641.
37. 37. Sato T, Fujimoto R, Toriyama K, Nishio T (2003) Commonality of self-recognition specificity of S haplotypes between Brassica oleracea and Brassica rapa. Plant Mol Biol 52: 617–626.
38. 38. Adams EJ, Cooper S, Thomson G, Parham P (2000) Common chimpanzees have greater diversity than humans at two of the three highly polymorphic MHC class I genes. Immunogenetics 51: 410–424.
39. 39. Klein J, Takahata N, O'hUigin C (1993) Trans-specific Mhc polymorphism and the origin of species in primates. J Med Primatol 22: 57–64.
40. 40. Ioerger TR, Clark AG, Kao T-H (1990) Polymorphism at the self-incompatibility locus in Solanaceae predates speciation. Proc Natl Acad Sci U S A 87: 9732–9735.
41. 41. Schierup MH, Mikkelsen AM, Hein J (2001) Recombination, balancing selection and phylogenies in MHC and self-incompatibility genes. Genetics 159: 1833–1844.
42. 42. Garrigan D, Hedrick PW (2003) Perspective: Detecting adaptive molecular polymorphism: Lessons from the MHC. Evolution 57: 1707–1722.
43. 43. Clark AG (1997) Neutral behavior of shared polymorphism. Proc Natl Acad Sci U S A 94: 7730–7734.
44. 44. Geluk A, Elferink DG, Slierendregt BL, Meijgaarden KEv, Vries RRd, et al. (1993) Evolutionary conservation of major histocompatibility complex-DR/peptide/T cell interactions in primates. J Exp Med 177: 979–987.
45. 45. Stajich JE, Hahn MW (2005) Disentangling the effects of demography and selection in human history. Mol Biol Evol 22: 63–73.
46. 46. O'hUigin C, Sato A, Klein J (1997) Evidence for convergent evolution of A and B blood group antigens in primates. Hum Genetics 101: 141–148.
47. 47. Asthana S, Schmidt S, Sunyaev S (2005) A limited role for balancing selection. Trends Genet 21: 30–32.
48. 48. Hughes A, Nei M (1988) Pattern of nucleotide substitution at MHC class I loci reveals overdominant selection. Nature 335: 167–170.
49. 49. Bergelson J, Kreitman M, Stahl EA, Tian D (2001) Evolutionary dynamics of plant R-genes. Science 292: 2281–2285.
50. 50. Shen J, Araki H, Chen L, Chen Q, Tian D (2006) Unique evolutionary mechanism in R-Genes under the presence/absence polymorphism in Arabidopsis thaliana. Genetics 172: 1243–1250.
51. 51. Teshima KM, Przeworski M (2006) Directional positive selection on an allele of arbitrary dominance. Genetics 172: 713–718.
52. 52. Kim Y, Stephan W (2003) Selective sweeps in the presence of interference among partially linked loci. Genetics 164: 389–398.
53. 53. Wang ET, Kodama G, Baldi P, Moyzis RK (2006) Global landscape of recent inferred Darwinian selection for Homo sapiens. Proc Natl Acad Sci U S A 103: 135–140.
54. 54. Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, et al. (2005) Genomic scans for selective sweeps using SNP data. Genome Res 15: 1566–1575.
55. 55. Nair S, Williams JT, Brockman A, Paiphun L, Mayxay M, et al. (2003) A selective sweep driven by pyrimethamine treatment in Southeast Asian malaria parasites. Mol Biol Evol 20: 1526–1536.
56. 56. Sabeti PC, Reich DE, Higgins JM, Levine HZ, Richter DJ, et al. (2002) Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837.
57. 57. Currat M, Trabuchet G, Rees D, Perrin P, Harding RM, et al. (2002) Molecular analysis of the beta-globin gene cluster in the Niokholo Mandenka population reveals a recent origin of the beta(S) Senegal mutation. Am J Hum Genet 70: 207–223.
58. 58. Kohn MH, Pelz H-J, Wayne RK (2000) Natural selection mapping of the warfarin-resistance gene. Proc Natl Acad Sci U S A 97: 7911–7915.
59. 59. Wooding S, Kim UK, Bamshad MJ, Larsen J, Jorde LB, et al. (2004) Natural selection and molecular evolution in PTC, a bitter-taste receptor gene. Am J Hum Genet 74: 637–646.
60. 60. Verrelli BC, McDonald JH, Argyropoulos G, Destro-Bisol G, Froment A, et al. (2002) Evidence for balancing selection from nucleotide sequence analyses of human. Am J Hum Genet 71: 455–462.
61. 61. Ohashi J, Naka I, Patarapotikul J, Hananantachai H, Brittenham G, et al. (2004) Extended linkage disequilibrium surrounding the hemoglobin e variant due to malarial selection. Am J Hum Genet 74: 1198–1208.
62. 62. Dobzhansky T (1943) Genetics of natural populations IX. Temporal changes in the composition of populations of Drosophila pseudoobscura. Genetics 28: 162–186.
63. 63. Eanes WF (1999) Analysis of selection on enzyme polymorphisms. Annu Rev Ecol Syst 30: 301–326.
64. 64. Sezgin E, Duvernell DD, Matzkin LM, Duan Y, Zhu C-T, et al. (2004) Single-locus latitudinal clines and their relationship to temperate adaptation in metabolic genes and derived alleles in Drosophila melanogaster. Genetics 168: 923–931.
65. 65. Andolfatto P, Depaulis F, Navarro A (2001) Inversion polymorphisms and nucleotide variability in Drosophila. Genet Res 77: 1–8.
66. 66. Li H, Stephan W (2005) Maximum-likelihood methods for detecting recent positive selection and localizing the selected site in the genome. Genetics 171: 377–378.
67. 67. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, et al. (2005) Calibrating a coalescent simulation of human genome sequence variation. Genome Res 15: 1576–1583.
68. 68. Novembre JA, Galvani AP, Slatkin M (2005) The geographic spread of the CCR5 Δ32 HIV-resistance allele. PLoS Biology 3: e339.. DOI: https://doi.org/10.1371/journal.pbio.0030339.
69. 69. Galtier N, Depaulis F, Barton NH (2000) Detecting bottlenecks and selective sweeps from DNA sequence polymorphism. Genetics 155: 981–987.
70. 70. Kreitman M, Akashi H (1995) Molecular evidence for natural selection. Annu Rev Ecol Syst 26: 403–422.
71. 71. Eyre-Walker A (2003) Changing effective population size and the McDonald–Kreitman test. Genetics 162: 2017–2024.
72. 72. Ingvarsson PK (2004) Population subdivision and the Hudson–Kreitman–Aguade test: Testing for deviations from the neutral model in organelle genomes. Genet Res 83: 31–39.
73. 73. Schierup MH, Vekemans X, Charlesworth D (2000) The effect of subdivision on variation at multi-allelic loci under balancing selection. Genet Res 76: 51–62.
74. 74. Williamson S, Fledel-Alon A, Bustamante CD (2004) Population genetics of polymorphism and divergence for diploid selection models with arbitrary dominance. Genetics 168: 463–475.
75. 75. Linhart YB, Grant MC (1996) Evolutionary significance of local genetic differentiation in plants. Ann Rev Ecol Syst 27: 237–277.
76. 76. Akey JM, Zhang G, Zhang K, Jin L, Shriver MD (2002) Interrogating a high-density SNP map for signatures of natural selection. Genome Res 12: 1805–1814.
77. 77. Beaumont MA, Balding DJ (2004) Identifying adaptive genetic divergence among populations from genome scans. Mol Ecol 13: 969–980.
78. 78. Kaltz O, Gandon S, Michalakis Y, Shykoff JA (1999) Local maladaptation in the anther-smut fungus Microbotryum violaceum to its host plant Silene latifolia: Evidence from a cross-inoculation experiment. Evolution 53: 39–407.
79. 79. Schierup MH, Vekemans X, Charlesworth D (2000) The effect of hitch-hiking on genes linked to a balanced polymorphism in a subdivided population. Genet Res 76: 63–73.
80. 80. Muirhead CA (2001) Consequences of population structure on genes under balancing selection. Evolution 55: 1532–1541.
81. 81. Richman AD, Herrera LG, Nash D, Schierup MH (2003) Relative roles of mutation and recombination in generating allelic polymorphism at an MHC class II locus in Peromyscus maniculatus. Genet Res 82: 89–99.
82. 82. Lewontin RC, Ginzburg LR, Tuljapurkar SD (1978) Heterosis as an explanation for large amounts of genic polymorphism. Genetics 88: 149–169.

• Alleles
• Natural selection
• Genetic loci
• Haplotypes
• Population genetics
• Evolutionary genetics
• Species diversity
• Genomics
Sours: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020064

## Detecting Long-Term Balancing Selection Using Allele Frequency Correlation

Katherine M. Siewert1 and Benjamin F. Voight2,3,4

### Katherine M. Siewert

1Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA

Find articles by Katherine M. Siewert

### Benjamin F. Voight

2Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA

3Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA

4Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA

Find articles by Benjamin F. Voight

1Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA

2Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA

3Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA

4Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA

*Corresponding author: E-mail: [email protected]

Associate editor: Rasmus Nielsen

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

### Abstract

Balancing selection occurs when multiple alleles are maintained in a population, which can result in their preservation over long evolutionary time periods. A characteristic signature of this long-term balancing selection is an excess number of intermediate frequency polymorphisms near the balanced variant. However, the expected distribution of allele frequencies at these loci has not been extensively detailed, and therefore existing summary statistic methods do not explicitly take it into account. Using simulations, we show that new mutations which arise in close proximity to a site targeted by balancing selection accumulate at frequencies nearly identical to that of the balanced allele. In order to scan the genome for balancing selection, we propose a new summary statistic, β, which detects these clusters of alleles at similar frequencies. Simulation studies show that compared with existing summary statistics, our measure has improved power to detect balancing selection, and is reasonably powered in non-equilibrium demographic models and under a range of recombination and mutation rates. We compute β on 1000 Genomes Project data to identify loci potentially subjected to long-term balancing selection in humans. We report two balanced haplotypes—localized to the genes WFS1 and CADM2—that are strongly linked to association signals for complex traits. Our approach is computationally efficient and applicable to species that lack appropriate outgroup sequences, allowing for well-powered analysis of selection in the wide variety of species for which population data are rapidly being generated.

Keywords: balancing selection, human evolution, selection scans

### Introduction

The availability of high-quality, population-level genomic data from a wide variety of species has spurred recent efforts to detect genomic regions subjected to natural selection (Singh etal. 2012; Vitti etal. 2013; Xu etal. 2015). One type of pressure, balancing selection, occurs when more than one allele is maintained at a locus. This selection can arise from overdominance (in which the fitness of heterozygotes at a locus is higher than either type of homozygote) or from frequency, temporally, or spatially dependent selection (Charlesworth 2006). A classic case of overdominance occurs at the hemoglobin-β locus in populations located in malaria-endemic regions. Homozygotes for one allele have sickle-cell anemia, and homozygotes for the other allele have an increased risk of malaria. In contrast, heterozygotes are protected from malaria, and at most have a mild case of sickle-cell anemia (Aidoo etal. 2002; Luzzatto 2012).

The discovery of novel targets of balancing selection could help us better understand the role this selection has played in evolution, uncover traits that have been preserved for long evolutionary time periods, and aid in interpreting regions previously associated with phenotypes of interest. In addition, theory predicts that signatures of long-term balancing selection will be confined to regions of at most a few kilobases in human (Gao etal. 2015). This feature of balancing selections' targets leads to fewer possible causal variants than other types of selection, potentially aiding in understanding the underlying biology and associated mechanism.

Patterns of genetic variation around a locus targeted by balancing selection are distorted relative to a neutral locus. Because both alleles at a balanced locus are maintained in the population, the time to the most recent common ancestor (TMRCA) will be substantially increased if selection is maintained long enough (Charlesworth 2006). This elevates the levels of polymorphism around the balanced locus and leads to a corresponding reduction in substitutions (i.e., fixed differences relative to an outgroup species) (Charlesworth 2006).

This deviation in the site frequency spectrum has been harnessed to identify signals of balancing selection in population data, genome-wide. These methods include Tajima’s D (Tajima 1989), which detects an excess number of intermediate frequency alleles. Another commonly used method, the HKA test (Hudson etal. 1987), uses the signal of high diversity and/or a deficit of substitutions. Although these methods are easily implemented and widely applicable, their power under certain demographic scenarios or equilibrium frequencies is modest (DeGiorgio etal. 2014).

If the selection began prior to the divergence of two species, then both species can share the balanced haplotype or variant (Charlesworth 2006). These shared variants are unlikely to occur under neutrality (Gao etal. 2015). Several recent studies have utilized primate outgroups to identify these trans-species polymorphisms (Andrés etal. 2009; Leffler etal. 2013; Teixeira etal. 2015). While specific, this approach fails to identify selection if the balanced variant was lost in at least one of the species under consideration.

More powerful methods to detect balancing selection have been developed, though challenges have limited their broader application. DeGiorgio etal. (2014) proposed two model-based summaries, T1 and T2, which generate a composite likelihood of a site being under balancing selection. However, the most powerful measure (T2) requires the existence of a closely related outgroup sequence and knowledge of the underlying demographic history from which an extensive grid of simulations must first be generated. New advances in estimating population-scale coalescent trees have also been harnessed to detect regions of the genome showing an unusually old TMRCA, but genome-wide application may be computationally prohibitive (Rasmussen etal. 2014).

Despite these methodological advances, the exact frequencies of the excess intermediate frequency alleles seen under balancing selection have not been precisely quantified. The key insight motivating our work was the observation that the frequencies of these excess variants closely match the balanced allele’s frequency. We confirm this signature using simulations.

Motivated by this observation, and inspired by the structure of summary-spectrum based statistics (Tajima 1989; Fay and Wu 2000), we developed a new summary statistic that detects these clusters of variants at highly correlated allele frequencies. This statistic is computationally efficient and does not require knowledge of the ancestral state or an outgroup sequence. Using simulations, we show that our approach has equivalent or higher power to identify balancing selection than similar approaches, and retains power over a range of population genetic models and assumptions (i.e., demography, mutation, or recombination).

We report a genome-wide scan applying our statistic to humans using 1000 Genomes Project data (The 1000 Genomes Consortium 2015), focusing on regions of high sequence quality. We highlight signals of balancing selection at two loci (WFS1 and CAMD2) with functional evidence supporting these as the target genes, as well as signals at several previously known loci.

### Allelic Class Build-up

We begin with an idealized model generating the expected distribution of allele frequencies around a balanced variant. Consider a new neutral mutation that arises within an outcrossing, diploid population. In a genomic region not experiencing selection, this mutation is expected to eventually either drift out of the population, or become fixed (i.e., become a substitution). However, if the locus is under balancing selection, then the allele’s frequency can reach no higher than the frequency of the balanced allele it arose in linkage with, assuming no recombination (fig. 1). This is because the frequency is constrained by selection. Without a recombination event and given enough time, variants that are fixed within these allelic classes (defined by the selected variant) accumulate (Hey 1991; Hudson 1991; Charlesworth 2006). We used Wright–Fisher forward simulations to model neutral variants in a region closely linked to a variant under balancing selection (see Materials and Methods). Within a region not expected to have experienced recombination since the start of the selection, we observed an excess number of variants with frequencies identical to that of the balanced variant, as predicted by this model (fig. 2A).

Open in a separate window

Fig. 1.

Model of allelic class build-up. (1) A new SNP (red star) arises in the population and is subject to balancing selection. (2) It sweeps up to its equilibrium frequency. (3) New SNPs enter the population linked to one of the two balanced alleles and some drift up in frequency. However, unlike in the neutral case, their maximum frequency is that of the balanced allele they are linked to, so variants build-up at this frequency (e.g., blue diamond or brown circle). (4) Recombination decouples SNPs (e.g., purple pentagon) from the balanced site, allowing them to experience further genetic drift.

Open in a separate window

Fig. 2.

Simulations demonstrating build-up of alleles at frequencies similar to balanced alleles as compared with selectively neutral counterparts. The blue bars indicate the fraction of SNPs in simulation replicates at specific frequency differences away from a balanced core site. In contrast, the orange bars represent simulation replicates that lack a balanced variant. Instead, the core site is chosen to be a neutral variant within frequency 10% of the equilibrium frequency of variants introduced in the balanced simulations. (A) Folded frequency differences between the core SNP and each other SNP in a 400-bp window surrounding the core site. Recombination is not expected to have occurred in this region since the start of selection (Gao etal. 2015). (B) Frequency differences in 2,000-bp windows, where recombination is expected to have occurred since the start of selection.

Eventually, recombination decouples variants from the balanced allele, which allows them to drift to loss or fixation within the population. However, even after recombination, the frequency of the variants previously fixed in their allelic class will remain close to that of their previous class until enough time has passed for genetic drift to significantly change their frequencies (Hey 1991; Hudson 1991; Charlesworth 2006). In our simulations of balancing selection, a window expected to have experienced recombination since the onset of selection still has an excess number of variants at similar frequencies to the balanced variant. However, there is a diminished excess at identical frequencies relative to the more narrow window, demonstrating the effects of recombination (fig. 2B).

### A Measure for Allele Frequency Correlation

To capture this signature, we derive a measurement of frequency similarity between a core variant and a second variant of interest. Let n be the number of chromosomes sampled, f0 be frequency of the core SNP, fi be the frequency of the second SNP, i, and p be a scaling constant (see Supplementary Material online). Finally, g(f) returns the folded allele frequency and m is the maximum possible folded allele frequency difference between the core SNP and SNP i, We then measure the similarity in frequency, di, by:

(2)

(3)

Thus, g(f0)−g(fi) is the folded frequency difference between the core SNP and the SNP under consideration. We then subtract this value from m, the maximum folded frequency difference possible with the core SNP, and then divide by m. This gives the fraction of the maximum folded frequency difference of the SNP under consideration compared with the core SNP. We then raise it to the power p so that we can weight variants in a nonlinear fashion with respect to this fraction. Therefore, di can range from 0 if a SNP has the maximum frequency difference with the core SNP, to 1 if SNP i is at the same frequency as the core SNP. We give guidance on the choice of p in the Supplementary Material online. However, the power of β is fairly insensitive to its value (supplementary fig. S12, Supplementary Material online). We use the folded site frequency spectrum in calculating di, as the frequency difference between the core variant and the second variant is independent of whether the derived or ancestral allele of the nearby allele is in linkage with the derived or ancestral core allele.

In a region under long-term balancing selection, the average di between a core SNP and the surrounding variants is expected to be elevated. However, di alone is not optimally powered to detect balancing selection, as its value will be sensitive to changes in the mutation rate in the surrounding region, and it does not take into account the probability of observing each allele frequency under neutrality.

### Capturing Allelic Class Build-up

We propose a statistic, β, that uses our measure of allele frequency correlation, di, combined with a measure of the overall mutation rate, to detect balancing selection. Our approach is inspired by previous summary statistics based on the site frequency spectrum (Tajima 1989; Fay and Wu 2000). These methods compute the difference between two estimators of θ, the population mutation rate parameter, one of which is more sensitive to characteristics of the site frequency spectrum distorted in the presence of natural selection. We propose to calculate β at each SNP in a region of interest to identify loci in which there is an excess of variants near the core SNP’s allele frequency, as evidence of balancing selection.

It has been previously shown that the mutation rate in a region can be estimated as: , where Si is the total number of derived variants found i times in the window from a sample of n chromosomes in the population (Fu 1995). An estimator of θ can then be obtained by taking a weighted average of θi. In our method, we weight by the similarity in allele frequency to the core SNP, as measured by di. If there is an excess of variants at frequencies close to the core SNP allele frequency, then our new estimator, θβ, will be elevated. We propose:

(4)

(5)

(6)

θw is simply Watterson’s estimator (Watterson 1975). β is, in effect, a weighted average of SNP counts based on their frequency similarity to the core SNP. We exclude the core site from our estimation of θw and θβ.

To better understand the properties of β, we used simulations to examine its distribution with and without a balanced SNP (supplementary fig. S2, Supplementary Material online). As expected, under long-term balancing selection β tends to be greater than 0, and under neutrality it tends to be close to 0.

We note that the mean value of β in our neutral simulations generally increases slightly with higher equilibrium frequencies. This behavior is expected because higher frequency alleles will tend to have a longer TMRCA and therefore higher diversity. The exception to this trend is neutral SNPs of frequency 0.5, which we posit is due to the fact that this allele frequency requires the most time for mutations to drift up to the equilibrium frequency needed to fix in their allelic class.

In this version of β, knowledge of the ancestral state for each variant is required. To address this possible shortcoming, we developed a version of the statistic based on a folded site frequency spectrum. This formulation is available in the Supplementary Material online.

Although our statistic can be calculated on any window size, previous work has suggested that the effects of balancing selection localize to a narrow region surrounding the balanced site (Gao etal. 2015). Ultimately, the optimal window size depends on the recombination rate, as it breaks up allelic classes. In the Supplementary Material online, we present some mathematical formulations to suggest reasonable window sizes.

### Power Analysis

We used forward simulations (Haller and Messer 2017) to calculate the power of our approach to detect balancing selection relative to other commonly utilized statistics. Initially, we simulated a single, overdominant mutation for each simulation replicate in an equilibrium demographic model, varied over a range of balancing selection equilibrium frequencies and onset times (see Materials and Methods). We also simulated genomic regions in which all variants were selectively neutral. We then computed the power of β, Tajima’s D, HKA, and T1 to distinguish between simulation replicates with a balanced variant (i.e., our balanced simulations) or those with only neutral mutations (i.e., our neutral simulations). As a reference, we also measured the likelihood-based statistic, T2.

β, Tajima’s D and HKA use a sliding window approach, in units of base pairs, when scanning the genome, whereas T1 and T2 use the number of informative sites (polymorphisms plus substitutions). In order to make a fair comparison between these methods, we first determined the most powerful window size for each method using simulations (supplementary fig. S5, Supplementary Material online). For the summary statistics, a 1-kb window size did well across a range of selection timings and equilibrium frequencies. This 1-kb region matches the approximate size of the ancestral region, in which there have been no expected recombination events between allelic classes (see Supplementary Material online).

For T1 and T2, a number of informative sites of ∼20, or 10 on either side of the core site, achieved maximum power in simulations (supplementary fig. S6, Supplementary Material online). Furthermore, this roughly matches the expected number of informative sites in a 1-kb region under selection (see Supplementary Material online). Therefore, a window of 20 total informative sites is roughly equal to the expected ancestral region size, which is roughly equal to the window at which all these methods achieve optimal power. For this reason, we used a 1-kb window or 20 informative sites, as applicable, when calculating each statistic.

Compared with other summaries, β had the greatest performance across most parameter combinations (fig. 3 and supplementary figs. S4–S17, Supplementary Material online). As expected, β performs slightly worse than T2 under many conditions. However, unlike T2, our method does not require an outgroup sequence, or grids of simulations which are computationally expensive.

Open in a separate window

Fig. 3.

Power of methods to detect ancient balancing selection. Power was calculated based on simulation replicates containing only neutral variants (True Negatives) or containing a balanced variant that was introduced (True Positives). Columns correspond to simulations of balanced alleles at equilibrium frequencies 0.25, 0.50, and 0.75. Rows correspond to older and more recent selection, beginning 250,000 and 100,000 generations prior to sampling, respectively.

We next investigated the power of β under more complex demographic scenarios (see Materials and Methods) compatible with recent human history (DeGiorgio etal. 2014). We found that β performs well under bottleneck and expansion models. Under an expansion scenario, the performance of all methods decreased (supplementary fig. S7, Supplementary Material online), consistent with results from previous studies (DeGiorgio etal. 2014), possibly due to the larger population size increasing the expected time until an allele can fix in its allelic class. The effect of a population bottleneck on power was less drastic and led to a slight increase in power to detect more recent selection (supplementary fig. S8, Supplementary Material online).

Population substructure can confound scans for selection (Schierup etal. 2000; Ingvarsson 2004). To investigate the power of our method in these scenarios, we simulated two models of population substructure. First, we considered a model of two completely subdivided populations. We pooled together 50 individuals from each subpopulation with which to perform the statistical calculations. In this case, the power of all methods to detect balancing selection at equilibrium frequency 0.5 decreased considerably (supplementary fig. S9, Supplementary Material online). This matches expectation, as this situation is expected to drastically increase the number of variants at frequency 0.5.

Next, we considered a two-pulse model of ancient admixture. We selected this model because of its approximation of Neanderthal admixture into human (Vernot and Akey 2015), which may be thought to confound scans for selection in humans. Power with Neanderthal admixture stayed roughly the same as without (supplementary fig. S10, Supplementary Material online). This is as expected, as most haplotypes introduced through admixture are expected to be at very low frequency.

We next examined the power for all methods under models of variable mutation rates, recombination rates, and sample sizes. As expected, the power of all methods was positively correlated with mutation rate (supplementary figs. S13 and S15, Supplementary Material online), and negatively correlated with recombination rate (supplementary figs. S14 and S16, Supplementary Material online). A higher mutation rate provides more variants that can accumulate within an allelic class, whereas a lower recombination rate allows for longer haplotypes upon which mutations can accumulate.

β has reasonable power down to very small sample sizes, achieving near maximum power with as few as 20 sampled chromosomes (supplementary figs. S19 and S20, Supplementary Material online). In practice, the sample size used to calculate the frequency of each variant may differ between variants. We tested the power of β when the sample size of each variant is downsampled from the original size of 100 by a random amount from 0 to 25 individuals. We found that this decreases power very slightly, and that lower values of p perform better in this scenario (supplementary fig. S11, Supplementary Material online).

Finally, power remained high under frequency-dependent selection (supplementary fig. S18, Supplementary Material online), and when a lower selection coefficient was simulated (supplementary fig. S17, Supplementary Material online). This matches expectation, as frequency-dependent selection is expected to maintain haplotypes in the population for long time periods, causing allelic class build-up. A lower selective coefficient would be expected to lower the probability of maintenance of the balanced allele in the population, but conditioned on this maintenance, should not affect power, as we observed.

Simulations show that the power of the folded version of β is similar to the unfolded version at intermediate allele frequencies, but has reduced power at very high frequencies (supplementary fig. S4, Supplementary Material online). However, even at these frequencies, it still outperforms Tajima’s D, the only other statistic of those tested which does not require knowledge of the ancestral state or an outgroup.

### Genome-Wide Scan in Human Populations

We applied the unfolded version of β to population data obtained by the 1000 Genomes Project (Phase 3) to detect signatures of balancing selection (The 1000 Genomes Consortium 2015). We calculated the value of β in 1-kb windows around each SNP in all 26 populations, separately. We focused on regions that passed sequencing accessibility and repeat filters (see Materials and Methods).

In addition, we filtered out variants which did not have a folded frequency of at least 15% in a minimum of one population. The purpose of the frequency filter is to prevent false positives: we were unable to simulate balancing selection with a folded equilibrium frequency of <15%, due to the high frequency of one allele drifting out of the population. Although this phenomenon has not been described for population sizes near that of humans to our knowledge, it has been detailed for lower effective population sizes (Ewens and Thomson 1970). Therefore, it seems unlikely that a balanced variant with a folded equilibrium frequency <15% could successfully be maintained in a population.

We defined extreme β scores as those in the top 1% in the population under consideration (see Materials and Methods). We analyzed the autosomes and X-chromosome separately. Because our method is substantially better powered to detect older selection, we focus on signals of selection that predate the split of modern populations. For this reason, we further filtered for loci that were top-scoring in at least half of the populations tested (see Materials and Methods). We focus on results of our unfolded β scan, however, we also scanned using the folded β statistic to test for robustness of our top scoring sites.

We identified 8,702 autosomal, and 317 X-chromosomal, top-scoring variants that were shared among at least half ( ≥ 13) of the 1000 Genomes populations (see Supplementary Material online). Together, these variants comprise 2,453 distinct autosomal and 86 X-chromosomal loci, and these signatures overlapped 692 autosomal and 29 X-chromosomal genes.

### Characterization of Identified Signals

Trans-species haplotypes are defined as two or more variants are found in tight linkage and are shared between humans and a primate outgroup (in our case, chimpanzee). These haplotypes are highly unlikely to occur by chance, unlike trans-species SNPs, which are expected to be observed in the genome due to recurrent mutations (Gao etal. 2015). These haplotypes present a signature of balancing selection independent from the signature captured by β. If β captures true signatures of balancing selection, one would expect an enrichment of high β values at trans-species haplotypes. We found that β was indeed predictive of trans-species haplotype status from Leffler etal. (2013), even after including adjustments for the distance to the nearest gene (P < 2 × 10−16, see Materials and Methods).

Our scan identified several loci that have been previously implicated as putative targets of balancing selection (see Supplementary Material online). Several major signals occurred on chromosome 6 near the HLA, a region long presumed to be subjected to balancing selection (Hedrick 1998; Hughes and Nei 1988). In particular, we found a strong signal in the HLA at a locus influencing response to Hepatitis B infection, rs3077 (Thursz etal. 1997; DeGiorgio etal. 2014; Jiang etal. 2015). Several additional top sites in our scan matched those from DeGiorgio etal. (2014). These include sites that tag phenotypic associations (Welter etal. 2014), such as MYRIP, involved with sleep-related phenotypes (Gottlieb etal. 2007), and BICC1, associated with corneal astigmatism (Lopes etal. 2013). We focus on two of our top-scoring regions, located in the CADM2 and WFS1 genes. In addition to passing the 1000 Genomes strict filter and the RepeatMasker test, these haplotypes also passed Hardy–Weinberg filtering (see Materials and Methods).

### A Signature of Selection at the CADM2 Locus

One of our top-scoring regions fell within an intron of the cell adhesion molecule 2 gene, CADM2. This locus contains a haplotype with β scores falling in the top 0.25 percentile in 17 of the 1000 Genomes populations, and scoring in the top 0.75 percentile across all 26 populations (fig. 4). This site was also a top scoring SNP in the CEU population based on the T2 statistic (DeGiorgio etal. 2014). In our scan using the folded β statistic, this haplotype contained top-scoring variants in 20 populations, indicating the result was not due to ancestral allele miscalling. In the remaining six populations, the haplotype was at folded frequency 0.15 or lower, where the folded version of β has significantly reduced power.

Open in a separate window

Fig. 4.

Signal of balancing selection at CADM2. The signal of selection is located in an intron of CADM2. Top: rs17518584 is the lead GWAS SNP for several cognitive traits and is marked by the brown vertical dashed line. The purple dashed line marks two regulatory variants found on the balanced haplotype. β scores were calculated using a rolling average with windows of size 5 kb, including only SNPs at the same frequency as the core SNP in the average. In addition, we show the allele frequencies of the GWAS and a top-scoring β SNP in each representative population. Bottom: Approximate haplotype spans for each population.

To elucidate the potential mechanisms contributing to the signal in this region, we overlapped multiple genomic data sets to identify potential functional variants that were tightly linked with our haplotype signature. First, one variant that perfectly tags (EUR r2 = 1.0) our signature, rs17518584, has been genome-wide significantly associated with cognitive functions, including information processing speed (Davies etal. 2015; Ibrahim-Verbaas etal. 2016). Second, multiple variants in this region colocalized (EUR r2 between 0.9 and 1 with rs17518584) with eQTLs of CADM2 in numerous tissues (Lung, Adipose, Skeletal Muscle, Heart-Left Ventricle), though notably not in brain (The GTEx Consortium 2015). That said, several SNPs with regulatory potential (RegulomeDB scores of 3a or higher) are also strongly tagged by our high-scoring haplotype (EUR r2 between 0.9 and 1.0 with rs17518584), which include regions of open chromatin in Cerebellum and other cell types (Boyle etal. 2012). Several SNPs on this haplotype, particularly rs1449378 and rs1449379, fall in enhancers in several brain tissues, including the hippocampus (Boyle etal. 2012; Ernst and Kellis 2012). Taken collectively, these data suggest that our haplotype tags a region of regulatory potential that may influence the expression of CADM2, and potentially implicates cognitive or neuronal phenotypes in the selective pressure at this site.

### A Signature of Balancing Selection near the Diabetes Associated Locus, WFS1

We identified a novel region of interest within the intron of WFS1, a transmembrane glycoprotein localized primarily to the endoplasmic reticulum (ER). WFS1 functions in protein assembly (Takei etal. 2006) and is an important regulator of the unfolded protein and ER Stress Response pathways (Fonseca etal. 2005). A haplotype in this region (∼3.5 kb) contains ∼26 variants, 3 of which are in high-quality windows and are high-scoring β in all populations (fig. 5). The haplotype was also in the top 1 percentile in our folded β scan in 21 populations. In the remaining five populations, this haplotype was at frequency 0.82 or higher, where the folded version of β has significantly lower power than the unfolded version.

Open in a separate window

Fig. 5.

Signal of balancing selection at the WFS1 gene. Top: rs4458523 is the lead GWAS SNP for diabetes, and is marked by the brown vertical dashed line. The purple dashed line marks five regulatory variants found on the balanced haplotype. In addition, we show the allele frequencies of the GWAS and a top-scoring β SNP in each representative population. Bottom: Approximate haplotype spans for each population.

Our identified high-scoring haplotype tags several functional and phenotypic variant associations. First, one variant that perfectly tags our signature (EUR r2 = 1.0), rs4458523, has been previously associated with type 2 diabetes (Voight etal. 2010; Mahajan etal. 2014). Second, multiple variants in this region are associated with expression-level changes of WFS1 in numerous tissues (The GTEx Consortium 2015); these variants are strongly tagged by our high-scoring haplotype (EUR r2 between 0.85 and 0.9 with rs4458523). Finally, several SNPs with regulatory potential (RegulomeDB scores of 2 b or higher) are also strongly tagged by our high-scoring haplotype (EUR r2 between 0.9 and 1.0 with rs4458523). Taken collectively, these data suggest that our haplotype tags a region of strong regulatory potential that is likely to influence the expression of WFS1.

### Discussion

Informed by previous theory on allelic-class build-up (Hey 1991; Hudson 1991; Charlesworth 2006), we developed a novel summary statistic to detect the signature of balancing selection, and measured efficacy and robustness of our approach using simulations. Although our method does not require knowledge of ancestral states for each variant from outgroup sequences, this information can improve power at extreme equilibrium frequencies.

Although our method outperforms existing summary statistic methods, it is not as powerful as the computationally intensive approach of T2, which uses simulations to calculate likelihoods of observed data (DeGiorgio etal. 2014). To improve power, we considered utilizing information on rates of substitutions, but this did not substantially improve discriminatory power (see supplementary methods, Supplementary Material online). Alternative possibilities could include the following: 1) consideration of the region past the ancestral region surrounding the balanced variant, or 2) deviations in the frequency spectrum beyond just nearly identical frequencies to the balanced SNP. As expected from theory, we also note that models of population structure can also produce our haplotype signature, emphasizing the requirement to perform scans on individual populations.

Balancing selection can cause a similar signature in self-fertilizing species, though we focused on out-crossed species in this report. Previous work has shown that given the same selection coefficient, the signature of balancing selection can be wider in self-fertilizing species due to a lower effective recombination rate (Nordborg etal. 1996). However, lower recombination rate also means that background selection leaves a wider footprint on the genome in these species, which can reduce levels of polymorphism (Agrawal and Hartfield 2016). Furthermore, a decrease in the frequency of heterozygotes, owing to selfing, can reduce or eliminate the effects of heterozygote advantage. Instead, modes of balancing selection like frequency, temporally or spatially dependent selection may be more significant.

We have also assumed a single causal variant throughout. However, there may be more than one variant at a locus experiencing balancing selection. This situation is thought to occur throughout the HLA region (Hedrick 1998). Assuming the maintenance of multiple variants, this scenario would also increase the regional TMRCA, leading to allele class build-up, spanning perhaps a larger window than our single-variant models (Lenz etal. 2016). The dynamics of this type of situations could be the focus of future work.

Although it is impossible to know the true selective pressure underlying our highlighted loci, our results suggest that balancing selection could contribute to the genetic architecture of complex traits in human populations. At the CADM2 locus, functional genomics data suggests that our haplotype signature may connect to brain-related biology. Intriguingly, a recent report also noted a strong signature of selection at this locus in canine (Freedman etal. 2016), suggesting a possibility of convergent evolution. That said, the phenotypes that have resulted in a historical fitness trade-off at this locus are far from obvious.

Similarly, speculation on the potential phenotypes subject to balancing selection at WFS1 should also be interpreted cautiously. It is known that autosomal recessive, loss of function mutations in this gene cause Wolfram Syndrome. This gene is a component of the unfolded protein response (Fonseca etal. 2005) and is involved with ER maintenance in pancreatic β-cells. Furthermore, deficiency of WFS1 results in increased ER stress, impairment of cell cycle, and ultimately increased apoptosis of β-cells (Yamada etal. 2006). These data would suggest that reduced expression of WFS1 would be diabetes risk increasing; however, eQTLs that colocalized with the diabetes risk-increasing allele elevate expression, at least in nonpancreas tissue, suggesting perhaps a more complex functional mechanism. Furthermore, how the unfolded protein response could connect to historical balancing selection is also not immediately obvious. One possibility derives from recent work suggesting that these pathways respond not only to stimulus from nutrients or ER stress, but also to pathogens (Nakamura etal. 2010). This could suggest the possibility that expression of WFS1 is optimized in part to respond to pathogen exposure at a population level.

β is powered to detect balancing selection when outgroup sequences are not available and can do so quickly and easily. Given the increasing ease of collecting population genetic data from non-model organisms, our approach is in a unique position to characterize balancing selection in these populations.

An implementation of both the folded and unfolded versions of β is available for download at https://github.com/ksiewert/BetaScan.

### Simulations

Simulations were performed using the forward genetic simulation software SLiM 2.0 (Haller and Messer 2017). In our simulations, neutral mutations and recombination events occur at a predefined rate throughout the entire length of the simulation. A burn-in time of 100,000 generations was first simulated to achieve equilibrium levels of variation. Then, two populations representing humans and chimpanzees split from this original population, and were simulated for 250,000 additional generations. We then sampled 100 chromosomes from the human population, and 1 chromosome from the chimpanzee population. We first simulated these scenarios under parameters suitable for human populations, with mutation and recombination rates of πr = 2.5x10−8 and Ne = 10,000.

We generated two sets of simulations: one without a balanced variant (the set we refer to as our neutral simulations) and one with a balanced variant (balanced simulations). In the second set, a single balanced variant was introduced at the center of the simulated region in the human population, either at the time of speciation (250,000 generations prior to simulation ending), or 150,000 generations after speciation (100,000 generations prior to simulation ending). The simulations then continued as normal, conditional on maintenance of the balanced SNP in the population. If this balanced variant was lost, the simulation restarted at the generation in which the balanced variant was introduced. In the second (neutral) set, no balanced variant was introduced, so all variants are selectively neutral.

Each balanced SNP had an overdominance coefficient h and selection coefficient s. The fitness of the heterozygote is then 1 + hs, and the fitness of the ancestral and derived homozygotes are 1 and 1 + s, respectively. We simulated two different s values: 10−2 (our default) and 10−4. We simulated six different equilibrium frequencies: 0.17, 0.25, 0.5, 0.75, 0.83, which correspond to h = −0.25, −0.5, 100, 1.5, 1.25. Negative h values were paired with negative s values.

After simulation completion, the frequency of each variant in the sampled individuals was calculated. Substitutions were defined as any variant in which the allele from the chimpanzee chromosome was not found in the sampled human individuals. For each set of balanced simulations, we define the core SNP as the variant under balancing selection. For each set of balanced simulations, we then found a corresponding set of core SNPs in our neutral simulations which were within 10% of the equilibrium frequency of the balanced variants. We then calculated the score for each statistic on these core variants. In this way, we have statistic scores for the balanced variant from each balanced simulation replicate, and a score for a neutral variant matched for similar frequency. For more details on how each statistic was calculated, see supplementary methods in the Supplementary Material online.

To increase simulation speed, we rescaled our simulations by a factor of 10 for specified power analyses in the supporting information (Hoggart etal. 2007); results presented in the main text were not rescaled. A minimum of 1,500 simulation replicates were performed for each parameter set. We simulated 10-kb regions for each simulation replicate, with the exception of the analysis of optimal windows size, in which case a 100-kb region was simulated.

### Empirical Site Analysis

To apply our method to 1000 Genomes data, we first downloaded data for each of the 26 populations in phase 3 of the project (obtained May 2, 2013). We then calculated allele frequencies separately for each population, and calculated β in 1-kb-sized windows centered around each SNP for each population.

Because poorly sequenced regions can artificially inflate the number of SNPs in a region, we then filtered out regions that contained one or more base pairs that were ruled as poor quality in the 1000 Genomes phase 3 strict mask file. For further confirmation that the signal was not a result of poor mapping quality, we overlapped SNPs of interest with hg19 human RepeatMasker regions, downloaded from the UCSC Table Browser on February 9, 2017. We then removed all core SNPs from consideration that were found within a repeat, similar to Bubb etal. (2006). We further removed SNPs that were not of common frequency (at or above a folded frequency of 15%) in at least one population. After filtering, there were 1,803,299 SNPs that remained. We then found the top 1% of these high-quality SNPs in each population in our β scan.

Unknown paralogs or other technical artifacts could inflate the number of intermediate frequency alleles. Although the 1000 Genomes data provides strict quality filter masks, we wanted to further verify that our haplotypes of interest in WFS1 and CADM2 were not the result of obvious technical artifacts. In order to do this, we used the –hardy flag in vcftools (Danecek etal. 2011), and investigated both the one-tailed P value for an excess of heterozygotes, and the two-tailed P value, in our four representative populations (YRI, CEU, CDX, and PJL). All variants on these haplotypes had P values above 1 × 10−3.

The lowest autosomal significance cut-off of any population, ASW, corresponds to a β score of 47.49. This score is in the top 0.05 percentile of core SNPs in neutral simulations corresponding to an equilibrium frequency of 0.5 (supplementary fig. S3, Supplementary Material online).

To find top-scoring sites that are also GWAS hits, we obtained LD proxies in European populations for our top-scoring SNPs, using a cut-off of r2 of 0.9, a maximum distance of 50 kb and a minimum minor allele frequency of 5%. We then overlapped these LD proxies with GWAS hits obtained from the GWAS Catalog to get our final list of putatively balanced GWAS hits (Welter etal. 2014). Gene names and locations were downloaded from Ensembl BioMart on November 26, 2016.

For our trSNP comparison, we used the Human/Chimp shared haplotypes from Leffler etal. (2013). Using logistic regression, we then modeled the outcome of a SNP being part of a trHap as dependent on the β Score and distance to nearest gene.

### Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

### Acknowledgments

We would like to acknowledge the members of the Voight Lab for their helpful suggestions and support, Philipp Messer for his comments on this manuscript, and comments from two anonymous reviewers. This work was supported through grants from the National Institutes of Health (NIDDK R01DK101478 to B.F.V. and T32HG000046-17 to K.M.S.) and the Alfred P. Sloan Foundation (BR2012-087 to B.F.V.).

### References

• Agrawal AF, Hartfield M.. 2016. Coalescence with background and balancing selection in systems with bi- and uniparental reproduction: contrasting partial asexuality and selfing. Genetics2021: 313–326. [PMC free article] [PubMed] [Google Scholar]
• Aidoo M, Terlouw DJ, Kolczak MS, McElroy PD, ter Kuile FO, Kariuki S, Nahlen BL, Lal AA, Udhayakumar V.. 2002. Protective effects of the sickle cell gene against malaria morbidity and mortality. The Lancet3599314: 1311–1312. [PubMed] [Google Scholar]
• Andrés AM, Hubisz MJ, Indap A, Torgerson DG, Degenhardt JD, Boyko AR, Gutenkunst RN, White TJ, Green ED, Bustamante CD, et al.2009. Targets of balancing selection in the human genome. Mol Biol Evol. 26:2755–2764. [PMC free article] [PubMed] [Google Scholar]
• Boyle AP, Hong EL, Hariharan M, Cheng Y, Schaub MA, Kasowski M, Karczewski KJ, Park J, Hitz BC, Weng S, et al.2012. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 229: 1790–1797. [PMC free article] [PubMed] [Google Scholar]
• Bubb KL, Bovee D, Buckley D, Haugen E, Kibukawa M, Paddock M, Palmieri A, Subramanian S, Zhou Y, Kaul R, et al.2006. Scan of human genome reveals no new Loci under ancient balancing selection. Genetics1734: 2165–2177. [PMC free article] [PubMed] [Google Scholar]
• Charlesworth D. 2006. Balancing selection and its effects on sequences in nearby genome regions. PLoS Genet. 24: 379–384. [PMC free article] [PubMed] [Google Scholar]
• Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al.2011. The variant call format and VCFtools. Bioinformatics2715: 2156–2158. [PMC free article] [PubMed] [Google Scholar]
• Davies G, Armstrong N, Bis JC, Bressler J, Chouraki V, Giddaluru S, Hofer E, Ibrahim-Verbaas CA, Kirin M, Lahti J, et al.2015. Genetic contributions to variation in general cognitive function: a meta-analysis of genome-wide association studies in the CHARGE consortium (N = 53,949). Mol Psychiatry202: 183–192. [PMC free article] [PubMed] [Google Scholar]
• DeGiorgio M, Lohmueller KE, Nielsen R.. 2014. A model-based approach for identifying signatures of ancient balancing selection in genetic data. PLoS Genet. 108: e1004561. [PMC free article] [PubMed] [Google Scholar]
• Ernst J, Kellis M.. 2012. ChromHMM: automating chromatin-state discovery and characterization. Nat Methods93: 215–216. [PMC free article] [PubMed] [Google Scholar]
• Ewens WJ, Thomson G.. 1970. Heterozygote selective advantage. Ann Hum Genet. 334: 365–376. [Google Scholar]
• Fay JC, Wu CI.. 2000. Hitchhiking under positive Darwinian selection. Genetics1553: 1405–1413. [PMC free article] [PubMed] [Google Scholar]
• Fonseca SG, Fukuma M, Lipson KL, Nguyen LX, Allen JR, Oka Y, Urano F.. 2005. WFS1 is a novel component of the unfolded protein response and maintains homeostasis of the endoplasmic reticulum in pancreatic beta-cells. J Biol Chem. 28047: 39609–39615. [PubMed] [Google Scholar]
• Freedman AH, Schweizer RM, Ortega-Del Vecchyo D, Han E, Davis BW, Gronau I, Silva PM, Galaverni M, Fan Z, Marx P, et al.2016. Demographically-based evaluation of genomic regions under selection in domestic dogs. PLoS Genet. 123: e1005851.. [PMC free article] [PubMed] [Google Scholar]
• Fu Y. 1995. Statistical properties of segregating sites. Theor Popul Biol. 482: 172–197. [PubMed] [Google Scholar]
• Gao Z, Przeworski M, Sella G.. 2015. Footprints of ancient-balanced polymorphisms in genetic variation data from closely related species. Evolution692: 431–446. [PMC free article] [PubMed] [Google Scholar]
• Gottlieb DJ, O’Connor GT, Wilk JB.. 2007. Genome-wide association of sleep and circadian phenotypes. BMC Med Genet. 8(Suppl 1): S9. [PMC free article] [PubMed] [Google Scholar]
• Haller BC, Messer PW.. 2017. SLiM 2: flexible, interactive forward genetic simulations. Mol Biol Evol. 341: 230–240. [PubMed] [Google Scholar]
• Hedrick PW. 1998. Balancing selection and MHC. Genetica1043: 207–214. [PubMed] [Google Scholar]
• Hey J. 1991. A multi-dimensional coalescent process applied to multi-allelic selection models and migration models. Theor Popul Biol. 391: 30–48. [PubMed] [Google Scholar]
• Hoggart CJ, Chadeau-Hyam M, Clark TG, Lampariello R, Whittaker JC, De Iorio M, Balding DJ.. 2007. Sequence-level population simulations over large genomic regions. Genetics1773: [PMC free article] [PubMed] [Google Scholar]
• Hudson RR. 1991. Oxford surveys in evolutionary biology. 7th ed New York: Oxford University Press. [Google Scholar]
• Hudson RR, Kreitman M, Aguadé M.. 1987. A test of neutral molecular evolution based on nucleotide data. Genetics1161: 153–159. [PMC free article] [PubMed] [Google Scholar]
• Hughes AL, Nei M.. 1988. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature3356186: 167–170. [PubMed] [Google Scholar]
• Ibrahim-Verbaas CA, Bressler J, Debette S, Schuur M, Smith AV, Bis JC, Davies G, Trompet S, Smith JA, Wolf C, et al.2016. GWAS for executive function and processing speed suggests involvement of the CADM2 gene. Mol Psychiatry212: 189–197. [PMC free article] [PubMed] [Google Scholar]
• Ingvarsson PK. 2004. Population subdivision and the Hudson-Kreitman-Aguade test: testing for deviations from the neutral model in organelle genomes. Genet Res. 831: 31–39. [PubMed] [Google Scholar]
• Jiang D-K, Ma X-P, Yu H, Cao G, Ding D-L, Chen H, Huang H-X, Gao Y-Z, Wu X-P, Long X-D, et al.2015. Genetic variants in five novel loci including CFB and CD40 predispose to chronic hepatitis B. Hepatology621: 118–128. [PubMed] [Google Scholar]
• Leffler EM, Gao Z, Pfeifer S, Ségurel L, Auton A, Venn O, Bowden R, Bontrop R, Wall JD, Sella G, et al.2013. Multiple instances of ancient balancing selection shared between humans and chimpanzees. Science3396127: 1578–1582. [PMC free article] [PubMed] [Google Scholar]
• Lenz TL, Spirin V, Jordan DM, Sunyaev SR.. 2016. Excess of deleterious mutations around HLA genes reveals evolutionary cost of balancing selection. Mol Biol Evol. 3310: 2555–2564. [PMC free article] [PubMed] [Google Scholar]
• Lopes MC, Hysi PG, Verhoeven VJM, Macgregor S, Hewitt AW, Montgomery GW, Cumberland P, Vingerling JR, Young TL, van Duijn CM, et al.2013. Identification of a candidate gene for astigmatism. Invest Opthalmol Vis Sci. 542: 1260. [PMC free article] [PubMed] [Google Scholar]
• Luzzatto L. 2012. Sickle cell anaemia and malaria. Mediterr J Hematol Infect Dis. 41: e2012065.. [PMC free article] [PubMed] [Google Scholar]
• Mahajan A, Go MJ, Zhang W, Below JE, Gaulton KJ, Ferreira T, Horikoshi M, Johnson AD, Ng MCY, Prokopenko I, et al.2014. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat Genet. 463: 234–244. [PMC free article] [PubMed] [Google Scholar]
• Nakamura T, Furuhashi M, Li P, Cao H, Tuncman G, Sonenberg N, Gorgun CZ, Hotamisligil GS.. 2010. Double-stranded RNA-dependent protein kinase links pathogen sensing with stress and metabolic homeostasis. Cell1403: 338–348. [PMC free article] [PubMed] [Google Scholar]
• Nordborg M, Charlesworth B, Charlesworth D.. 1996. Increased levels of polymorphism surrounding selectively maintained sites in highly selfing species. Biol Sci. 263123671373: 1033–1039. [Google Scholar]
• Rasmussen MD, Hubisz MJ, Gronau I, Siepel A.. 2014. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 105: e1004342.. [PMC free article] [PubMed] [Google Scholar]
• Schierup MH, Vekemans X, Charlesworth D.. 2000. The effect of subdivision on variation at multi-allelic loci under balancing selection. Genet Res. 761: 51–62. [PubMed] [Google Scholar]
• Singh ND, Jensen JD, Clark AG, Aquadro CF.. 2012. Inferences of demography and selection in an African population of Drosophila melanogaster. Genetics1931: [PMC free article] [PubMed] [Google Scholar]
• Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics1233: 585.. [PMC free article] [PubMed] [Google Scholar]
• Takei D, Ishihara H, Yamaguchi S, Yamada T, Tamura A, Katagiri H, Maruyama Y, Oka Y.. 2006. WFS1 protein modulates the free Ca2+ concentration in the endoplasmic reticulum. FEBS Lett. 58024: 5635–5640. [PubMed] [Google Scholar]
• Teixeira JC, de Filippo C, Weihmann A, Meneu JR, Racimo F, Dannemann M, Nickel B, Fischer A, Halbwax M, Andre C, et al.2015. Long-term balancing selection in LAD1 maintains a missense trans-species polymorphism in humans, chimpanzees, and bonobos. Mol Biol Evol. 325:1186–1196. [PubMed] [Google Scholar]
• The 1000 Genomes Consortium 2015. A global reference for human genetic variation. Nature5267571: 68–74. [PMC free article] [PubMed] [Google Scholar]
• The GTEx Consortium 2015. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science3486235: 648–660. [PMC free article] [PubMed] [Google Scholar]
• Thursz MR, Thomas HC, Greenwood BM, Hill AV.. 1997. Heterozygote advantage for HLA class-II type in hepatitis B virus infection. Nat Genet. 171: 11–12. [PubMed] [Google Scholar]
• Vernot B, Akey JM.. 2015. Complex history of admixture between modern humans and Neandertals. Am J Hum Genet. 963: 448–453. [PMC free article] [PubMed] [Google Scholar]
• Vitti JJ, Grossman SR, Sabeti PC.. 2013. Detecting natural selection in genomic data. Annu Rev Genet. 471: 97–120. [PubMed] [Google Scholar]
• Voight BF, Scott LJ, Steinthorsdottir V, Morris AP, Dina C, Welch RP, Zeggini E, Huth C, Aulchenko YS, Thorleifsson G, et al.2010. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet. 427: 579–589. [PMC free article] [PubMed] [Google Scholar]
• Watterson G. 1975. On the number of segregating sites in genetical models without recombination. Theor Popul Biol. 72: 256–276. [PubMed] [Google Scholar]
• Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, Parkinson H.. 2014. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42(Database issue): D1001–D1006. [PMC free article] [PubMed] [Google Scholar]
• Xu L, Bickhart DM, Cole JB, Schroeder SG, Song J, Tassell CPV, Sonstegard TS, Liu GE.. 2015. Genomic signatures reveal new evidences for selection of important traits in domestic cattle. Mol Biol Evol. 323: 711–725. [PMC free article] [PubMed] [Google Scholar]
• Yamada T, Ishihara H, Tamura A, Takahashi R, Yamaguchi S, Takei D, Tokita A, Satake C, Tashiro F, Katagiri H, et al.2006. WFS1-deficiency increases endoplasmic reticulum stress, impairs cell cycle progression and triggers the apoptotic pathway specifically in pancreatic cells. Hum Mol Genet. 1510: 1600–1609. [PubMed] [Google Scholar]

Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press

Sours: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5850717/

Balancing selection refers to forms of natural selection which work to maintain genetic polymorphisms (or multiple alleles) within a population. Balancing selection is in contrast to directional selection which favor a single allele. A balanced polymorphism is a situation in which balancing selection within a population is able to maintain stable frequencies of two or more phenotypic forms. Note balancing selection will not always result in an observable phenotypic difference because the genotype may not be one-to-one with the phenotype.

There are several major mechanisms (which are not exclusive within any given population) by which natural selection preserves this variation and consequently may produce a balanced polymorphism. The two most well studied are heterozygote advantage (overdominance) and frequency dependent selection. A less well studied alternative is environmental heterogeneity.

In heterozygote advantage, an individual who is heterozygous at a particular genelocus has a greater fitness than a homozygous individual. A well-studied case of heterozygote advantage is that of sickle cell anemia. This can be seen in human populations with the locus for a certain protein present in hemoglobin (an important component in blood). Individuals who are homozygous for the recessive allele at this locus are inflicted with sickle-cell disease,a disease in which blood cells are grossly misshapen and which often results in a reduced lifespan.

An individual heterozygous at this locus will not suffer from sickle-cell disease but because of slightly irregularly shaped blood cells they are resistant to malaria. This resistance is favored by natural selection in tropical regions where malaria (a common and deadly sickness caused by the protozoanparasitePlasmodium falciparum) is present and so the heterozygote has an evolutionary edge. It is in this way that natural selection preserves stable frequencies of both the heterozygote and the homozygote dominant phenotypes.

The second important mechanism by which natural selection can preserve two or more phenotypic forms is known as frequency-dependent selection. Frequency-dependent selection is a form of selection in which the relative fitness of a specific phenotype declines if the frequency of that phenotype becomes too high. An example of this type of selection is between parasites and their hosts. An example follows: suppose that a certain parasite can recognize one of two receptors in its host, receptor or receptor , if many parasites with receptor exist then hosts with receptor will be selected for, and this will subsequently increase the selective pressure on parasites which use receptor and this relationship will continue rocking back and forth.

Frequency-dependent selection has been observed experimentally in the banding and colour polymorphism in European land snails, Cepaea nemoralis, by Sheppard and Kwok. Frequency-dependent selection often appears in the form of mate preference, a type of sexual selection.

In the case of environmental heterogeneity when the environment conditions fluctuate, it may give the normally selected-against organism some form of advantage. An example would be the Biston betulariapeppered moth, which has both dark and white polymorphic states. During snowfall, when the fields are covered with snow, it is more likely that the white forms are selectively favoured. The balance is tilted in the other direction when the snow disappears.

### References[]

• Campbell, Neil A. & Reece, Jane B. (2002). Biology (6th ed.). Benjamin Cummings. ISBN 0-8053-6624-5.

Sours: https://psychology.wikia.org/wiki/Balancing_selection
Natural selection: Directional, Disruptive, Balancing, and Stabilizing Selection

### Abstract

Genetic drift is expected to remove polymorphism from populations over long periods of time, with the rate of polymorphism loss being accelerated when species experience strong reductions in population size. Adaptive forces that maintain genetic variation in populations, or balancing selection, might counteract this process. To understand the extent to which natural selection can drive the retention of genetic diversity, we document genomic variability after two parallel species-wide bottlenecks in the genus Capsella. We find that ancestral variation preferentially persists at immunity related loci, and that the same collection of alleles has been maintained in different lineages that have been separated for several million years. By reconstructing the evolution of the disease-related locus MLO2b, we find that divergence between ancient haplotypes can be obscured by referenced based re-sequencing methods, and that trans-specific alleles can encode substantially diverged protein sequences. Our data point to long-term balancing selection as an important factor shaping the genetics of immune systems in plants and as the predominant driver of genomic variability after a population bottleneck.

https://doi.org/10.7554/eLife.43606.001

### eLife digest

Capsella rubella is a small plant that is found in southern and western Europe. This plant is young in evolutionary terms: it is thought to have emerged less than 200,000 years ago from a small group of plants belonging to an older species known as Capsella grandiflora.

Individuals of the same species may carry alternative versions of the same genes – known as alleles – and the total number of alleles present in a population is referred to as genetic diversity. When a few individuals form a new species, the gene pool and the genetic diversity in the new species is initially much lower than in the ancestral species, which may make the new species less robust to fluctuations in the environment. For example, alternative versions of a gene might be preferable in hot or cold climates, and loss of one of these versions would limit the species’ ability to survive in both climates.

A mechanism known as balancing selection can maintain various alleles in a species, even if the population is very small. However, it was not clear how common long-lasting balancing selection was after a species had split. To address this question, Koenig et al. assembled collections of wild C. rubella and C. grandiflora plants and sequenced their genomes in search of alleles that were shared between individuals of the two species.

The analysis found not just a few, but thousands of examples where the same genetic differences had been maintained in both C. rubella and C. grandiflora. Some of these allele pairs were also shared with individuals of a third species of Capsella that had split from C. rubella and C. grandiflora over a million years ago. The shared alleles did not occur randomly in the genome; genes involved in immune responses were far more likely to be targets of balancing selection than other types of genes.

These findings indicate that there is strong balancing selection to maintain different alleles of immunity genes in wild populations of plants, and that some of this diversity can be maintained over hundreds of thousands, if not millions of years. The strategy developed by Koenig et al. may help to identify new versions of immunity genes from wild relatives of crop plants that could be used to combat crop diseases.

https://doi.org/10.7554/eLife.43606.002

### Introduction

Balancing selection describes the suite of adaptive forces that maintain genetic variation for longer than expected by random chance. It can have many causes, including heterozygous advantage, negative frequency-dependent selection, and environmental heterogeneity in space and time. The unifying characteristic of these situations is that the turnover of alleles is slowed, resulting in increased diversity at linked sites (Charlesworth, 2006). In principle, it should be simple to detect the resulting footprints of increased coalescence times surrounding balanced sites (Tellier et al., 2014), and many candidates have been identified using diverse methodology (Fijarczyk and Babik, 2015). However, balanced alleles will be stochastically lost over long time spans, suggesting that most balanced polymorphism is short lived (Fijarczyk and Babik, 2015).

The strongest evidence for balancing selection comes from systems in which alleles are maintained in lineages that are reproductively isolated and that have separated millions of years ago, resulting in trans-specific alleles with diagnostic trans-specific single nucleotide polymorphisms (tsSNPs). A few, well known genes fit this paradigm: the self-incompatibility loci of plants (Vekemans and Slatkin, 1994), mating-type loci of fungi (Wu et al., 1998), and the major histocompatibility complex (MHC) and ABO blood group loci in vertebrates (McConnell et al., 1988; Mayer et al., 1988; Lawlor et al., 1988; Watkins et al., 1990; Ségurel et al., 2012). Additional candidates have been proposed by comparing genome sequences from populations of humans and chimpanzees, and from populations of multiple Arabidopsis species. These efforts have revealed six loci in primates (Leffler et al., 2013b; Teixeira et al., 2015) and up to 129 loci, that were identified by at least two shared SNPs each, in Arabidopsis (Novikova et al., 2016; Bechsgaard et al., 2017), as potential targets of long-term balancing selection and/or introgression. In both systems, genes involved in host–pathogen interactions were enriched, which in Arabidopsis is consistent with previous findings that several disease resistance loci appear to be under balancing selection in this species, based on the analysis of individual genes (Huard-Chauveau et al., 2013; Botella, 1998; Caicedo et al., 1999; Noel, 1999; Stahl et al., 1999; Tian et al., 2002; Bakker, 2006; Rose et al., 2004; Todesco et al., 2010). However, even with the ability to conduct whole-genome scans for balancing selection in A. thaliana, the total number of examples with robust evidence across species remains small (Cao et al., 2011; 1001 Genomes Consortium, 2016).

One explanation for this paucity of evidence for pervasive and stable balancing selection is that cases of long-term maintenance of alleles are rare. However, there are good reasons to believe that many studies lacked the power to detect the expected effects (Fijarczyk and Babik, 2015; DeGiorgio et al., 2014). If one requires that alleles have been maintained in species separated by millions of years, then only targets of outstandingly strong selective pressures that remain the same over many millennia can be identified. Furthermore, recombination between deeply coalescing alleles will typically reduce the size of the genomic footprint to very short sequence stretches, thus limiting the opportunity for distinguishing old alleles from recurrent mutations.

We hypothesised that self-fertilizing species provide increased sensitivity to detect balancing selection based on two observations (Wiuf et al., 2004; Wright et al., 2008). First, self fertilisation greatly reduces the effective rate of recombination, thus potentially expanding the footprint of balancing selection. In addition, the transition to self fertilisation is generally associated with dramatic genome-wide reductions in polymorphism, potentially making it easier to detect outlier loci that retain variation from the outcrossing, more polymorphic ancestor. In this study we sought to assess how strongly selection acts to maintain genetic diversity in the context of repeated transitions to self fertilisation in the flowering plant genus Capsella. Like many plant lineages, the ancestral state of Capsella is outcrossing (found in the extant diploid species C. grandiflora), but selfing has evolved independently in two diploid species, C. rubella and C. orientalis (Figure 1A)(Foxe et al., 2009; Guo et al., 2009; Bachmann et al., 2018). The genomes of both species exhibit the drastic loss of genetic diversity typical for many selfers (Figure 1B–C) (Guo et al., 2009; Foxe et al., 2009; St Onge et al., 2011; Slotte et al., 2013; Brandvain et al., 2013; Slotte et al., 2012). In the younger species, C. rubella, loss of genetic diversity was initially thought to have occurred uniformly throughout the entire genome (Foxe et al., 2009; Guo et al., 2009), but subsequent reports already hinted at some loci having increased diversity (Gos et al., 2012; Brandvain et al., 2013), motivating the present study.

### Polymorphism discovery in C. grandiflora and C. rubella

The species Capsella rubella is young, only 30,000 to 200,000 years old, and was apparently founded when a small number of C. grandiflora individuals became self-compatible (Foxe et al., 2009; Guo et al., 2009). Previous studies had hinted at unequal retention of C. grandiflora alleles across the C. rubella genome (Gos et al., 2012; Brandvain et al., 2013), leading us to analyse this phenomenon systematically by comparing the genomes of 50 C. rubella and 13 C. grandiflora accessions from throughout each species’ range (Figure 1—figure supplement 1 and Figure 1—source data 1). Because the calling of trans-specific SNPs (tsSNPs) is particularly sensitive to mismapping errors in repetitive sequences, we applied a set of stringent filters, resulting in 74% of the C. rubella reference genome remaining accessible to base calling in both species, with almost half (47%) of the masked sites in the repeat rich pericentromeric regions. After filtering, there were 5,784,607 SNPs and 883,837 indels. Unless otherwise stated, all subsequent analyses were performed using SNPs. Of these, only 27,852 were fixed between the two species, whereas 824,540 were found in both species (tsCgCrSNPs), consistent with the expected sharing of variation between the two species. In addition, 4,291,959 SNPs segregated only in C. grandiflora (species-specific SNPs; ssCgSNPs), and 640,256 only in C. rubella (ssCrSNPs). Sample rarefaction by subsampling our sequenced accessions indicated that common ssCrSNP and tsCgCrSNP discovery was near saturation in our experiment, though additional sampling will continue to uncover rare alleles (Figure 1D).

The consequences of selfing are easily seen as a dramatic reduction in genetic diversity in C. rubella (Figure 1—source data 2), consistent with the previously suggested genetic bottleneck (Foxe et al., 2009; Guo et al., 2009). As expected from a predominantly selfing species, SNPs segregating in C. rubella were much less likely to be heterozygous than those segregating in C. grandiflora, though evidence for occasional outcrossing in C. rubella is observed in the form of a variable number of heterozygous calls (Figure 1E). Selfing is also expected to reduce the effective rate of recombination between segregating polymorphisms. Linkage disequilibrium (LD) decayed, on average, to 0.1 within 5 kb in C. grandiflora, while it only reached this value at distances greater than 20 kb in C. rubella (Figure 1F). Though C. rubella is a relatively young species, it exhibits characteristics typical of a predominantly (but not exclusively) self-fertilising species: reduced genetic diversity, reduced observed heterozygosity, and reduced effective recombination rate. This last effect could potentially increase the visibility of signals for balancing selection from linked sites (Wiuf et al., 2004).

### Capsella rubella demography

The degree of trans-specific allele sharing depends upon the level of gene flow between species, the age of the speciation event, and the demographic history of each resultant species. We first sought to understand how these neutral processes have affected extant polymorphism in C. grandiflora and C. rubella. We searched for evidence of population structure in our dataset by fitting individual ancestries to different numbers of genetic clusters with ADMIXTURE (Alexander et al., 2009) (Figure 2A and Figure 2—figure supplement 1A-B; k-values from 1 to 6). The best fit as determined by the minimum cross-validation error was three clusters, with one including all C. grandiflora individuals, and C. rubella samples split into two clusters. Principal component (PC) analysis (Price et al., 2006) of genetic variation revealed a similar picture, with PC1 separating the two species and PC2 separating the C. rubella samples (Figure 2A).

C. rubella population structure was strongly associated with geography. Samples from western Europe and southeastern Greece were unambiguously assigned to separate groups, while samples from northern and western Greece, near the presumed site of speciation in the current range of C. grandiflora (Hurka and Neuffer, 1997), showed mixed ancestry (or intermediate assignment to these groups, Figure 2A–B). A single C. rubella sample from western Europe showed some mixed ancestry. This sample was collected near Gargano National Park on the eastern coast of Italy. The source of its mixed ancestry is unclear, but its proximity to Greece suggests that it may result from ongoing migration across the Adriatic Sea. The general pattern of population structure is consistent with the centre of diversity for C. rubella being in northern Greece and a more recent rapid expansion into Western Europe, and agrees with predictions made based on previous, smaller datasets (Brandvain et al., 2013). The observed structure is principally organised by a major geographic barrier, the Adriatic Sea. We therefore separated our samples into into two distinct groups to the west (W) and east (E) of the Adriatic Sea for subsequent analyses.

Because their current ranges overlap, ongoing gene flow between sympatric C. rubella and C. grandiflora could be a potentially important source of allele sharing between the two species. While a previous study had not found any evidence for such a scenario (Brandvain et al., 2013), one of our C. grandiflora samples was assigned partial ancestry to the otherwise C. rubella-specific clusters, and resided at an intermediate position along PC1 (Figure 2A). Furthermore, eastern C. rubella individuals, many of which grew in sympatry with C. grandiflora, were less differentiated from C. grandiflora compared to western C. rubella samples along PC1 (Figure 2A and Figure 2—figure supplement 1C-D). Gene flow between eastern C. rubella and C. grandiflora was supported by significant genome-wide D-statistics for C. rubella samples from the C. grandiflora range (ABBA-BABA test; comparing each E individual with the W population) (Green et al., 2010; Durand et al., 2011), with D decreasing as a function of distance from the centre of C. grandiflora’s range (Figure 2—figure supplement 1 and Figure 2—source data 1). Because D statistics can be sensitive to ancient population structure (Durand et al., 2011), we further relied on identity-by-descent (IBD) segments as detected by BEAGLE (Browning and Browning, 2013) to identify genomic regions of more recent co-ancestry across these species. The proportion of the genome shared in IBD segments between C. rubella and C. grandiflora also decreased as a function of distance between samples, and the strongest evidence for recent ancestry was found between C. grandiflora individuals and sympatric northern Greek C. rubella lines (Figure 2C–D). These results indicate that gene flow is ongoing between the species, consistent with interspecific crosses often producing fertile offspring, specifically with C. rubella as the paternal parent (Sicard et al., 2011; Rebernig et al., 2015).

To estimate the magnitude and direction of gene flow and other demographic events that have shaped genetic variation in the two species we used fastsimcoal2 (Excoffier et al., 2013) to compare the likelihood of a large number of demographic models given the observed joint site frequency spectrum (Figure 2E, Figure 2—figure supplement 2 and Figure 2—source data 2). The best fitting model estimated the split between C. rubella and C. grandiflora to have occurred 170,000 generations ago, associated with a strong reduction in C. rubella population size (to only 2–14 effective chromosomes, or 1–7 individuals). Bidirectional gene flow at a relatively low rate apparently occurred until just over 10,000 generations ago, when C. rubella split into the W and E populations, after which gene flow continued only from E C. rubella to C. grandiflora (Figure 2E).

The close timing of the end of gene flow into C. rubella and the split into two populations suggests that westward expansion of the C. rubella range reduced the opportunity for gene flow from C. grandiflora, with potential genetic reinforcement by the development of hybrid incompatibilities (Sicard et al., 2015). If we assume an average of 1.3 years per generation as found in the close relative, A. thaliana (Falahati-Anbaran et al., 2014), which has similar life history and ecology, the population split and the end of introgression from C. grandiflora occurred around 13,500 years ago. This date is similar to the spread of agriculture and the end of the last glaciation in Europe (Walker et al., 2009), suggesting that C. rubella’s success might have been facilitated by one or both of these events.

### Non-random polymorphism sharing after a genetic bottleneck

Our analyses provide dates for the bottleneck and rapid colonisation events that have led to dramatically reduced genetic variation in C. rubella. Yet, over half of the segregating variants in C. rubella were also found in C. grandiflora (Figure 1D). Such tsCgCrSNPs could originate from independent mutation in each species (identity by state, IBS). Alternatively, they could be the result of introgression after speciation or they could reflect retention of the same alleles since the species split (identity by descent, IBD). Older retained alleles are expected to be found at elevated frequencies relative to the genome-wide average, while younger, recurrent mutations are expected to be rare. We therefore identified ancestral and derived alleles by comparison with the related genus Arabidopsis, and then compared the derived allele frequency spectra of tsCgCrSNPs and ssSNPs in Capsella as a proxy for allele age. We found that tsCgCrSNPs are strongly enriched among high-frequency alleles in both Capsella species (Figure 3A, p-value << 0.0001 in C. grandiflora and C. rubella, Mann-Whitney U-test). At allele frequencies greater than 0.25 in C. rubella, tsCgCrSNPs accounted for more than 80% of all variation. These results indicate that tsCgCrSNPs are predominantly older alleles that were already present in the common ancestral population of C. rubella and C. grandiflora or that were introgressed from C. grandiflora to C. rubella prior to its expansion into western Europe.

The distribution of tsCgCrSNPs was uneven across the genome. When compared to ssCrSNPs drawn from the same allele frequency distribution, tsCgCrSNPs were less likely to result in nonsynonymous changes (Figure 3B, p-value < 0.001, from 1000 jackknife resamples from the same allele frequency distribution), but they were more likely to be in genes (Figure 3C). As expected for transpecific haplotype sharing, eighty-three percent of all tsCgCrSNPs were in complete LD with at least one other tsCgCrSNP in C. rubella, and the density of tsSNPs along the genome was highly variable (Figure 3D–G). tsCgCrSNP density was positively correlated with local genetic diversity in C. rubella (and less strongly so with genetic diversity in C. grandiflora; Figure 3F–I and Figure 3—figure supplements 2–5), and negatively correlated with differentiation between the species as measured by Fst (Figure 3J and Figure 3—figure supplements 2–5). The uneven pattern of diversity was similar in each C. rubella subpopulation (Figure 3—figure supplements 6–9), indicating that most of the retained polymorphism already segregated prior to colonisation. Thus, most common genetic variation in C. rubella is also retained in its outcrossing ancestor, and the rate of retention varies dramatically between genomic regions.

### High density of tsSNPs around immunity-related loci

The observed heterogeneity in shared diversity across the C. rubella genome could be a simple consequence of a bottleneck during the transition to selfing. In the simplest scenario, C. rubella was founded by a small number of closely related individuals, and stochastic processes during subsequent inbreeding caused random losses of population heterozygosity. A study of genetic variation in bottlenecked populations of the Catalina fox found this exact pattern (Robinson et al., 2016). Alternatively, there may be selective maintenance of diversity in specific regions of the genome due to balanced polymorphisms, with contrasting activities of the different alleles. To explore this latter possibility, we tested whether the likelihood of allele sharing was dependent on annotated function of the affected genes. We found that tsCgCrSNPs were strongly biased towards genes involved in plant biotic interactions, including defense and immune responses, and also toward pollen-pistil interactions, though less strongly (Supplementary file 1, Figure 4A). Amongst the top ten enriched Gene Ontology (GO) categories for biological processes were apoptotic process, defense response, innate immune response, programmed cell death, and defensive secondary metabolite production (specifically associated with terpenoids). Of genes annotated with apoptotic process, 87% were homologs of A. thaliana NLR genes, a class of genes best known for its involvement in perception and response to pathogen attack (Jones and Dangl, 2006). An even higher enrichment for tsCgCrSNPs was found when testing this class of genes specifically, with tsCgCrSNPs falling in NLR genes being more likely than those in other types of genes to result in nonsynonymous changes (Figure 4A–B). These results indicate that despite a severe global loss of genetic diversity, genes involved in plant-pathogen interactions have maintained high levels of genetic variation in C. rubella.

While the high density of tsCgCrSNPs near immunity genes was intriguing, NLR genes frequently occur in complex clusters, which could elevate error rates during SNP calling and thus potentially influence our analyses. Of particular concern is that sequencing reads derived from paralogs not found in the reference, but present in some accessions, could be mismapped against the reference, leading to false positive tsCgCrSNPs calls. We therefore examined whether tsCgCrSNPs showed more evidence of such errors than other SNPs. Mismapping should increase coverage and reduce concordance (the fraction of reads supporting a particular call) at a site. That the distributions of these two metrics were nearly identical at tsCgCrSNPs and ssSNP sites indicates, however, that mismapping is unlikely to have affected our SNP calls (Figure 4—figure supplement 1). Mismapping is also expected to cause pseudo-heterozygous calls, due to reads from different positions in the focal genome being mapped to the same target in the reference genome. However, tsCgCrSNPs were not more likely to be found in the heterozygous state as compared to ssSNPs (Figure 4—figure supplement 1). In addition, we asked whether the signal of increased tsCgCrSNPs density extended into sequences adjacent to NLRs and is detectable even when masking the NLR clusters themselves. For this purpose, we collapsed NLR genes within 10 kb of one another into a single region, and calculated tsCgCrSNPs rates and genetic diversity as a function of distance from these collapsed regions, ignoring SNPs within the focal cluster. We found that elevated tsCgCrSNPs sharing and genetic diversity extended over 100 kb from NLR genes. Thus, increased sharing is not an artefact of the internal structure of NLR clusters (Figure 4B–C).

Increased retention of genetic diversity near immunity loci suggests that these genes might be the targets of balancing selection in either C. rubella, C. grandiflora, or both species. However, neutral processes including random introgression and stochastic allele fixation can give rise to uneven distributions of genetic variation across the genome after genetic bottlenecks (Robinson et al., 2016). We sought to identify regions that showed a pattern of allele sharing that was unlikely to have occured neutrally, as indicated by low values of the fixation index Fst, which quantifies genetic differentiation between populations. We compared the observed values of Fst between C. rubella and C. grandiflora to a distribution calculated from simulated sequences under our previously inferred neutral demographic model, which included gene flow between C. rubella and C. grandiflora. We simulated one million 20 kb DNA segments, or just over 7,000 C. rubella genome equivalents, under the neutral model and calculated the expected distribution of Fst values. Using this distribution, we assigned the probability of observing the Fst value for each non-overlapping 20 kb window throughout the genome. After Bonferroni correction and joining of adjacent significant segments, we identified 21 genomic regions that we designated as candidate targets of balancing selection (Bal, Figure 4D and Figure 4—source data 1). Bal regions showed several classical indications of balancing selection including substantially higher Tajima’s D and within-C. Rubella genetic diversity relative to the remainder of the genome (Figure 4E–F; p<<0.001 Mann-Whitney U-test for both statistics). tsCgCrSNPs in Bal regions were also less likely to have been lost during colonisation of Western Europe than ssCrSNPs or tsCgCrSNPs in other parts of the genome, and allele frequencies in Bal regions showed elevated correlation across populations (Figure 4—source data 2). Like tsSNPs in general, Bal regions did not show evidence for increased heterozygosity that might indicate increased error rates in SNP calling (Median Observed - Expected Heterozygosity in 20 kb windows was −0.021 inside of Bal regions, and −0.020 outside of these regions).

Estimates of Fst were reduced in large segments surrounding NLR and other immune gene candidate clusters (Figure 4G), consistent with allele sharing being the result of linkage to a nearby balanced polymorphism. Of the 21 candidate regions, nine overlapped with clusters of NLR genes, and five with clusters of RLK/RLP or CRK genes, two classes of genes with broad roles in innate immunity (Yeh et al., 2015; Zipfel, 2008). Many of the specific regions we identified in Capsella have been directly demonstrated to function in disease resistance in A. thaliana (McDowell et al., 2000; McDowell, 1998; Goritschnig et al., 2012; Holub, 1994; Gassmann et al., 1999; Zhang et al., 2014; Zhang et al., 2013; Xu, 2006; Yeh et al., 2015). RPP1 and RPP8 have been previously suggested as candidate targets of balancing selection, and trans-specific polymorphism has been reported at the RPP8 locus in the genus Arabidopsis (Bergelson et al., 2001; Wang et al., 2011). It should be noted, however, that these genes are often members of larger linked NLR gene superclusters, with some of the regions our approach identified being sizeable and thus making it difficult to pinpoint a single focal gene. Indeed the strong signal found in these regions could result from multiple linked balanced sites. Furthermore, the strongest signals of balancing selection are mostly derived from linked sites, rather than the clusters themselves, because confident SNP calling is very difficult, if not impossible, with short reads in the most complex genomic regions (Figure 4G).

It is formally possible that the unusual pattern of diversity that we observe near Bal loci could result from historical balancing selection in the outcrossing ancestor C. grandiflora rather than ongoing selection in the selfing C. rubella. Population genetic indices such as Fst, nucleotide diversity pi, Tajima’s D, and allele sharing are not fully independent, and elevated diversity in the C. rubella founding population, driven by historical balancing selection, could also generate the observed patterns. Genetic diversity was only modestly elevated in these regions in C. grandiflora (p<<0.001 Mann-Whitney U-test, Figure 4E), and Tajima’s D was not significantly different from other windows (Figure 4F), suggesting that this is not very likely. If balancing selection is acting at these loci in the outcrosser, it is clear that its genomic footprint is small, perhaps due to the rapid decay of LD in this species relative to the selfing C. rubella. Still, it is possible that even a small elevation of genetic diversity in Bal regions in the founding populations might have considerable impact on subsequent C. rubella diversity. We approximated this situation using our simulated genetic data. We subsetted simulations by the level of genetic diversity in C. grandiflora, choosing the top 1% of simulated values. Even in the case of elevated founder diversity in these data, the observed Fst values in Bal regions remain exceptionally unlikely (p<0.0001). These observations point to ongoing balancing selection within C. rubella maintaining diversity in Bal regions.

Adaptive retention of C. grandiflora diversity in Bal regions could be explained by two non-exclusive models. Allelic variation might have been present in the C. rubella founding population and maintained by balancing selection until the present. Alternatively, beneficial alleles may have been introgressed from C. grandiflora after the evolution of selfing, and retained by balancing selection. We searched for evidence of recent ancestry between the two species in Bal regions. A larger fraction of Bal region sequence was found to be IBD when compared to the genome-wide average (Figure 4—figure supplement 2), consistent with elevated retention of introgressed alleles in these regions. Shared segments in Bal regions were on average shorter than those found in other parts of the genome, suggesting that they are older and have been subjected to longer periods of recombination since the introgression event (median within 3,503 bp, median outside 6,661 bp, Wilcoxon-rank sum test, p=1e-54), although we cannot exclude the influence of differing patterns of recombination in these regions as a contributing factor to this observation.

Elevated IBD rates in Bal regions might result from gene flow between the species in either direction, and our previous results suggested that most modern gene flow occurs through introgression of C. rubella alleles into C. grandiflora. We explored the geographic pattern of IBD between C. rubella and C. grandiflora in Bal regions to determine whether it differs from that of the genome-wide pattern. Within the East population, IBD decayed as a function of distance from the C. grandiflora range in a manner comparable to the observed genome-wide pattern, albeit with a more shallow slope (Figure 4—figure supplement 2). In contrast to the genome-wide pattern, high levels of IBD were observed between C. grandiflora and West population accessions. Thus, we find evidence for neutral gene flow throughout the genome, perhaps dominated by C. rubella to C. grandiflora introgression, as indicated by our demographic simulations. However, allele sharing appears to be older in Bal regions and introgressed alleles have been retained for longer periods even after colonisation of Western Europe. This latter observation is consistent with the hypothesis that alleles were introgressed prior to the most recent range expansions in C. rubella, and that variation was subsequently maintained by selection in Bal regions.

### Balancing selection over millions of years

Although evidence for balancing selection at immunity-related loci in C. rubella is much stronger than in C. grandiflora, it is difficult to completely exclude the effect of founder diversity at these loci on the observed patterns. We therefore sought to validate our findings in a related species that has been separated from C. grandiflora and C. rubella for a long time. The genus Capsella offers a unique opportunity to test the longevity of balancing selection, because selfing has evolved independently in C. orientalis, which diverged from C. grandiflora and C. rubella more than one million years ago and whose modern range no longer overlaps with the two other species, preventing ongoing introgression (Hurka et al., 2012; Douglas et al., 2015). We expected the evolution of selfing to have generated a similar bottleneck as in C. rubella (Douglas et al., 2015; Bachmann et al., 2018), and we therefore resequenced 16 C. orientalis genomes, to test whether there is evidence of balancing selection at similar types of loci.

After alignment, SNP calling, and filtering, we identified a mere 71,454 segregating SNPs in C. orientalis. This is a surprisingly small amount of variation, corresponding to an almost 50-fold reduction in diversity relative to the outcrossing C. grandiflora (Figure 5—source data 1). Using our divergence and diversity measures, we estimated that C. orientalis diverged from C. grandiflora over 1.8 million generations ago (calculated as in ref. Brandvain et al., 2013). The combination of long divergence times and low variability in C. orientalis makes it unlikely that alleles will have been maintained by random chance. Using estimates of Ne from nucleotide diversity at four-fold degenerate sites (C. orientalis [14,643] and C. grandiflora [694,643]), the divergence time above, and the genome assembly size of 134.8 Mb, the probability of finding a single tsSNP is <4×10−19 using the methodology of Leffler and colleagues and Wiuf and colleagues (Leffler et al., 2013a; Wiuf et al., 2004), which assumes constant population size. It was therefore surprising that 8,408 C. orientalis variants were shared with either C. rubella or C. grandiflora (ts2-waySNPs), and 3992 with both (ts3-waySNPs, Figure 5A–B). In each of the three species, ts3-waySNPs were enriched at higher derived allele frequencies relative to ssSNPs and ts2-waySNPs, suggesting that they are on average the oldest SNPs (Figure 5C).

Because this large amount of trans-specific polymorphism was unexpected, we wanted to ensure that this was not due to more error-prone read mapping to a distant reference. We therefore also used an additional set of more stringent filters to identify high confidence ts3-waySNPs (ts3-wayhqSNPs; see Materials and methods). Importantly, we required ts3-wayhqSNPs to be in LD with at least one other ts3-wayhqSNP in all three species (r2 >0.2 in the same phase), to provide evidence that they represented the same ancestral haplotype. The aim was to improve the likelihood that such SNPs were true examples of identity by descent. Furthermore, we generated a draft assembly of the C. orientalis genome using Pacific Biosciences SMRT cell technology, and re-called ts3-waySNP sites. We identified 812 high quality transpecific SNPs segregating in all three species (ts3-wayhqSNPs). The distributions of coverage and concordance values in this dataset were similar between ts3-waySNP sites and other C. orientalis sites, further supporting their authenticity (Figure 5—figure supplement 1).

As discussed earlier, the presence of trans-specific polymorphism in diverged species could be driven by stable balancing selection or it could result from gene flow between the species. While C. grandiflora and C. rubella occur around the Mediterranean, C. orientalis is restricted to Central Asia (Hurka et al., 2012) and its current distribution is far from that of C. grandiflora and C. rubella. Modern gene flow between the C. orientalis and C. rubella/C. grandiflora lineages is therefore unlikely, but it is possible that the ranges of these species overlapped in the past. If alleles have been maintained since the split between the lineages, then the divergence between maintained alleles should meet or exceed the divergence between the species. On the other hand, if ts3-wayhqSNPs are the result of recent gene flow between the lineages, then divergence between species near these SNPs should be reduced compared to the genome-wide average divergence. We examined diversity and divergence at neutral (four-fold degenerate) sites surrounding ts3-wayhqSNPs (Figure 5D). In all three species, diversity was high directly adjacent to ts3-wayhqSNPs, close to average levels for genome-wide divergence between the two Capsella lineages. This footprint of elevated diversity is much more discernible in the two selfing species than in C. grandiflora. No obvious reduction in divergence was observed near ts3-wayhqSNPs (Figure 5D). We conclude that ts3-wayhqSNPs correspond predominantly to long-term maintained alleles that diverged on ancient time scales and that they are not the result of recent introgression.

The finding of tsSNPs shared between two independent lineages, C. grandiflora/C. rubella and C. orientalis, for over a million generations in spite of strong geographic barriers suggests that they are targets of stable long-term balancing selection. If this selection pressure remains constant across species, ancient alleles are expected to evolve towards similar equilibrium intermediate frequencies. In comparison to ts2-waySNPs, the minor alleles at ts3-wayhqSNP sites are closer to intermediate frequencies in all three species (Figure 5—figure supplement 2). Furthermore, ts3-wayhqSNPs segregate at more similar allele frequencies in C. rubella and C. grandiflora than other two-way tsSNPs, as measured by Fst median values: 0.03 for ts3-wayhqSNPs and 0.16 for ts2-waySNPs, p<<0.001 Mann-Whitney test) and correlation of derived allele frequencies (Figure 4—source data 2). These results suggest a conserved equilibrium maintained since the isolation of C. rubella and C. grandiflora over 10,000 generations ago. Derived allele frequencies for ts3-wayhqSNPs are not correlated between the two ancient Capsella lineages (Spearman’s rho −0.08 C. orientalis to C. grandiflora and −0.04 to C. rubella). It is possible that demographic reduction or habitat shift in C. orientalis has disturbed this equilibrium.

Like tsCgCrSNPs, ts3-waySNPs are strongly enriched in GO categories associated with immunity (Supplementary file 2). Our previously identified balanced regions strongly predicted the genomic distribution of ts3-waySNPs; 50% of ts3-wayhqSNPs fell into these regions, even though they encompass fewer than 10% of tsCgCrSNPs and fewer than 3% of all SNPs, resulting in an even more skewed and uneven distribution of genetic diversity along the genome (Figure 5F–G). At least one ts3-wayhqSNP was found in each of 10 of the 21 original candidate regions under balanced selection. Six of these corresponded to NLR clusters, two to RLK/RLP clusters, and one to a TIR-X cluster. Only one region did not contain a clear immunity candidate, with the caveat that this conclusion is based on the single annotated C. rubella reference genome (Slotte et al., 2013). Thus, even in a situation where a recent genetic bottleneck has wiped out almost all genetic diversity, there is very strong selection to maintain allelic diversity at specific immunity-related loci, consistent with these alleles having persisted already for very long evolutionary times.

### Insights into balancing selection from de novo assembly of MLO2

The balanced regions we identified contained very old tsSNPs, yet as mentioned, the immunity genes themselves are often not accessible to variant discovery based on mapping short reads to a single reference genome. Furthermore, it is possible, or even likely, that the strongest evidence for balancing selection comes from loci that include several linked targets of balancing selection. This combination of factors makes it difficult to pinpoint potential functional changes maintained by balancing selection in these regions. To discover functional changes, we therefore focused on ts3-wayhqSNPs that did not fall in our large balanced regions but were clustered in regions of the genome that were likely less complex. We selected genes that were well covered by reads in all three species (>80% sites), contained at least six high quality tsSNPs, at least one non-synonymous ts3-wayhqSNP, were at least 100 kb from any of our candidate balanced regions, and had been functionally characterised in A. thaliana. These filters singled out a homolog of the A. thaliana MLO2 gene as a particularly good candidate for more detailed analysis (Supplementary file 3 and Figure 6).

MLO2 encodes a seven-transmembrane domain protein with a conserved role in plant disease susceptibility (Figure 6A) (Consonni et al., 2006). The C. rubella MLO2 locus has experienced a tandem duplication, resulting in two genes, MLO2a and MLO2b. Although both homologs are sufficiently diverged to be accessible to unambiguous read mapping, all six ts3-wayhqSNPs were in MLO2b (Figure 6B–C). In C. rubella and C. orientalis, the ts3-wayhqSNPs were arranged in five different haplotypes, which we collapsed into three related haplogroups, A, B and C (Figure 6B). The reference haplogroup A was most frequent in both species.

Because several known targets of balancing selection in A. thaliana are the result of structural variation, or lesions larger than 1 kb (Mauricio et al., 2003; Stahl et al., 1999), we examined coverage patterns around the MLO2 locus to identify potential linked indels. We found that haplogroup B in both C. rubella and C. orientalis exhibited similar patterns of low read coverage at the 5’ end of MLO2b, suggesting a possible indel (Figure 6C). To examine the exact sequence of each allele, we took advantage of the homozygous nature of sequence data from these two selfing species and performed local de novo assembly of the MLO2 locus from read pairs mapping to this region. We were able to reconstruct the locus for 15 C. orientalis samples (13 haplogroup A and two haplogroup B) and 43 C. rubella samples (34 A, 2 B, and 7 C). Surprisingly, a comparison of the different haplotypes revealed that the pattern of low coverage observed for haplogroup B was not due to structural variation, but instead to extremely high divergence from the reference haplogroup A (Figure 6—figure supplement 1). Divergence between alleles within species was greater than 0.15 differences per bp, over three times higher than the genome-wide divergence between the species (Figure 6—figure supplement 1 and Figure 5—source data 1). This highly diverged region had therefore been originally inaccessible to reference-based read mapping in haplogroup B samples. De novo assembly allowed us to identify a total of 204 additional tsSNPs, nearly all of which mapped to the 5’ end of MLO2b (Figure 6—figure supplement 1). Neighbour-joining trees revealed the expected clustering of samples by species in regions adjacent to MLO2b, but clear clustering by haplogroup within the 5’ region, a pattern that is reproduced in phylogenetic analysis of the entire CDS (Figure 6C and Figure 6—figure supplement 2). Importantly, divergence within haplogroup across species was greater than, or similar to genome-wide averages for both A and B, demonstrating that recent introgression did not give rise to allele sharing (Figure 6—figure supplement 1).

The high nucleotide divergence between haplogroups A and B translates into numerous amino acid differences in the N terminal half of the encoded proteins. In a 157 amino acid stretch, 31 amino acid differences are found in both species (Figure 6A and Figure 6—figure supplement 3), with an indel polymorphism accounting for another seven amino acid differences. The large number of differences between the two haplogroups makes it difficult to point to any specific change as the target of balancing selection, but it seems likely that the two alleles differ functionally, perhaps reinforced by additional differences in the promoter. In summary, the nucleotide divergence in this region suggests that the MLO2b haplogroups are much older than the split between the two species.

### Discussion

While balancing selection has long been recognised as an important evolutionary force, its relevance as a major factor shaping genomic variation has remained unclear (Charlesworth, 2006; Wiuf et al., 2004; Asthana et al., 2005). We have taken advantage of unique demographic situations in two Capsella lineages to demonstrate not only that there is pervasive balancing selection at immunity-related loci in this genus, but also that the same alleles are maintained in species that are likely experiencing quite different pathogen pressures. We expect that balancing selection plays a similar role in other taxa, but that its effects are masked by a background of higher neutral genetic diversity and more frequent recombination between balanced sites and linked variants (Wiuf et al., 2004; Charlesworth, 2006). In addition, the detection of long-term balancing selection is further compounded by very old alleles being less accessible to short read re-sequencing, the dominant mode of variant discovery today. In the two selfing Capsella species, the footprints of balancing selection extend for tens of kilobases, greatly impacting diversity of many other genes. While this makes it more difficult to pinpoint the actual selected variants, it greatly improves statistical power to identify regions under balancing selection. This is reminiscent of genome-wide association studies, where extended LD improves statistical power to detect causal regions of the genome but reduces the ability to identify the specific causal variants (Atwell et al., 2010).

The nature of balancing selection acting on the regions we have identified remains to be clarified. Stable balancing selection in self fertilising species is unlikely to derive from heterozygous advantage, pointing to negative frequency-dependent selection or fluctuating selection from variable pathogen pressures as possible factors. While the mode of selection cannot be determined from these static data, the strong signal that we observe in highly selfing lineages points to environmental heterogeneity or negative frequency dependent selection over heterozygote advantage. Based on the enrichment of immunity-related genes, it appears that biotic factors are the dominant drivers of long-term maintenance of polymorphism. This observation is consistent with a large body of work on intraspecific variation in A. thaliana. The signal of balancing selection has been observed for specific pairs of disease resistance alleles in A. thaliana (Stahl et al., 1999; Tian et al., 2002; Tian et al., 2003; Mauricio et al., 2003; Bakker, 2006), and in the case of the resistance gene RPS5, alternative alleles have been shown to affect fitness in the field (Karasov et al., 2014). It is possible, or perhaps even likely, that the signal of balancing selection is amplified by the fact that immunity-related loci occur in clusters (Meyers, 2003) and that our strongest signal is the result of simultaneous selection on several genes in these regions in a situation analogous to the MHC in animals (Hedrick, 1998). Thus, biotic factors might not be quite as important as our analyses make them appear. On the other hand, it is also possible that the clustering of disease resistance genes itself is a product of selection, if selection was more effective when acting on groups of genes (Charlesworth and Charlesworth, 1975), or if evolution under a balanced regime was deleterious at other types of loci. Even if we accept that biotic factors predominate, the nature of the potential trade-offs that prevent individual alleles from becoming fixed is still a mystery, but it might involve conflicts between growth and defense (Coley et al., 1985; Walling, 2009; Herms and Mattson, 1992), beneficial and harmful microbe interactions (Walters and Heil, 2007), or defense against different types of pathogens (Kliebenstein and Rowe, 2008). What is clear is that the trade-offs must be stable over very long periods of evolution.

Our findings suggest a model in which the success of self fertilising populations may be buoyed by gene flow from outcrossing relatives in a situation analogous to evolutionary rescue strategies in conservation biology (Whiteley et al., 2015). This model is a variation on the theme of adaptive introgressions, which have recently emerged as a major evolutionary force in a wide range of taxa (Whitney et al., 2006; Castric et al., 2008; Pease et al., 2016; Dasmahapatra et al., 2012; Henning and Meyer, 2014; Hedrick, 2013; Huerta-Sánchez et al., 2014; Racimo et al., 2015; Castric et al., 2008; Pease et al., 2016; Dasmahapatra et al., 2012; Whitney et al., 2006; Huerta-Sánchez et al., 2014; Hedrick, 2013). The unique feature of self-fertilisation in comparison to these examples is that the amplified effects of linked selection and genetic drift lead to a steady loss of genetic variation over time. Constant replenishment via adaptive introgression from an outcrossing relative counters the loss of diversity at immunity-related loci, thereby preventing decreased fitness in competition with pathogens. Whether this model generally applies will require independent study of other lineages of related self-fertilising and outcrossing populations at various stages of speciation.

Finally, we note that maintenance of ancient variants is most easily detectable in a background of low variation. Therefore, it could potentially be used to rapidly identify loci with meaningful functional variation. Typically, agricultural breeding panels seek to maximise surveyed diversity, but our results indicate that identification of useful immunity-related polymorphism with genomic data might be facilitated in otherwise homogeneous wild populations.

### Plant material and DNA extraction

Request a detailed protocol

Seeds were stratified for two weeks at 4°C and germinated in controlled environment chambers. Four to six rosette leaves were collected from each accession and frozen in liquid nitrogen for gDNA extraction. The methods available for extraction and sequencing varied as the project progressed, and 24 of the C. rubella and the 13 C. grandiflora samples were analysed independently in previous studies (Agren et al., 2014; Williamson et al., 2014). See Figure 1—source data 1 for a listing of DNA preparation, library construction, and sequencing technology by sample. In brief, DNA was extracted following an abbreviated nuclei enrichment protocol (Becker et al., 2011) or using the Qiagen Plant DNeasy Extraction kit. The recovered DNA was sheared to the desired length using a Covaris S220 instrument, and Illumina sequencing libraries were prepared using the NEBNext DNA Sample Prep Reagent Set 1 (New England Biolabs) or the Illumina TruSeq DNA Library Preparation Kit and sequenced on the instrument as listed in Figure 1—source data 1. We aimed for a minimum genome coverage of 40x. We mapped reads to the C. rubella reference genome (Slotte et al., 2013) resulting in realised coverages of 30 – 126x.

### Sequence handling and variant calling

Request a detailed protocol

Initial sequence read processing, alignment, and variant calling were carried out using the SHORE (v0.8) software package (Ossowski et al., 2008). Read filtering, de-multiplexing, and trimming were accomplished using the import command discarding reads that had low complexity, contained more than 10% ambiguous bases, or were shorter than 75 bp after trimming. Reads were mapped to the C. rubella reference genome (Phytozome v.1.0) using the GenomeMapper aligner (Schneeberger et al., 2009) with a maximum edit distance (gaps or mismatches) of 10%. Alignments from each sample were then processed to generate raw whole genome reference and variant calls with qualities computed using an empirical scoring matrix approach (Cao et al., 2011) allowing heterozygous positions. Of the initial 53 C. rubella samples, two were removed because of low or uneven coverage, and one was removed as a misidentified C. bursa-pastoris sample (C. rubella and its polyploid relative C. bursa-pastoris are not easily identified phenotypically, but they can be distinguished by the extreme number of pseudo-heterozygous calls in the latter).

The per-sample raw consensus calls produced by SHORE were used to construct a whole genome matrix of finalised genotype calls for each species. Positions were considered only if covered by at least four reads and if overlapping reads mapped uniquely (GenomeMapper applies a ‘best match’ approach, so unique means that only one best match exists) (Schneeberger et al., 2009). We simultaneously considered information from all samples within a species to make base pair calls. If no variant was called in any sample then the site was treated as reference. Individual sample calls were made if four reads supported the reference base, the computed quality was above 24, and at least 80% of reads supported a reference call. A site was excluded if more than 30% of the samples from that species did not meet these criteria.

If at least one sample reported a difference from the reference in the raw consensus, then variant (indel or SNP) or reference calls were considered. The SNP calling parameters were slightly different for the two selfing species as compared to the outcrossing C. grandiflora because variants should only rarely be found in the heterozygous state in the former (and the frequency of heterozygous calls in a selfing species is a powerful filter to detect problems with mismapped reads). The general approach was to require at least one high quality variant call at a site and then to call genotypes in other samples with slightly reduced stringency. If no variant call met the more stringent threshold, then the site was reconsidered using the above reference criteria. Finally, the calls from each of the three species were combined into a master matrix. If a position was not called biallelic or invariant across the compared species, then it was not considered. To facilitate further analyses in PLINK (v1.9) (Chang et al., 2015) and vcftools (v0.1.12a) (Danecek et al., 2011), the genome matrix at biallelic SNP sites was also converted into a minimal vcf format.

### Defining pericentromeric regions

Request a detailed protocol

Regions of high repeat density near the centromeres of all chromosomes as well as two large, repeat-rich regions in chromosomes 1 and 7 were removed from genome scans. Coordinates for these regions are listed in Supplementary file 4.

### Site annotations

Request a detailed protocol

We used the SnpEff (v.3.2a) (Cingolani et al., 2012) software package to annotate variant and invariant sites for the whole genome. The annotation database was built using the C. rubella v1.0 Phytozome gff file. Sites were annotated using the table input function that includes annotation of fold degeneracy for each site in coding regions. Invariant sites were annotated using a table with dummy SNPs at each position. The SnpEff program outputs several annotations for some sites, and a primary annotation was selected by ranking the strength of effect of each annotation and reporting the annotation with the strongest effect (the rankings are listed in Supplementary file 5).

### Ancestral state assignment

Request a detailed protocol

To calculate derived allele frequency spectra we assigned ancestral state to each polymorphic site using three-way whole genome alignments between C. rubella, A. thaliana, and A. lyrata (Slotte et al., 2013). Only biallelic sites identical between A. lyrata and A. thaliana (indels were ignored) were considered. For the two species analysis, only sites also fixed for the ancestral allele in C. orientalis were considered.

### Trans-specific SNP annotation comparisons

Request a detailed protocol

To compare tsSNP and ssSNP annotations from similar allele frequency spectra, we binned 20,000 tsSNPs randomly drawn from throughout the genome by derived allele frequency (10 bins). We then drew an equivalent number of ssSNPs from each allele frequency bin and calculated the fraction of CDS SNPs that caused nonsynonymous changes and the fraction that fell in genes. This process was repeated 1000 times for both species to generate the plots shown in Figure 3B.

### Analysis of population structure and demographic modeling

Request a detailed protocol

Genotypes at four-fold degenerate SNP sites called in C. grandiflora and C. rubella were pruned in PLINK (50 kb windows, 5 kb step, and 0.2 r2 LD threshold) and used as input for ADMIXTURE (v.1.23) (Alexander et al., 2009) and EIGENSTRAT (v6.0 beta) (Price et al., 2006). For demographic modelling in Fastsimcoal (v2.5.2.11) (Excoffier et al., 2013), joint minor allele frequency spectra were generated at four-fold degenerate sites with complete information and ignoring heterozygous calls in selfing lineages (counting only one allele from each individual). Demographic parameters for each tested model were then inferred in 50 runs of Fastsimcoal (parameters: -l40 -L40 -n100000 -N100000 -M0.001 -C5). The global maximum likelihood model was selected after correcting for number of estimated parameters using Akaike Information Criterion. Confidence intervals were set for estimated parameters using 100 bootstraps of identical inference runs on simulated data under the most likely model. To reduce computational times, global maximum likelihoods were calculated for bootstraps after 13 runs rather than 50. The mutation rate assumed for this and other analyses was 7 × 10−9 mutations/generation/ bp based on mutation rate measurements in Arabidopsis thaliana (Ossowski et al., 2010).

### Segments of recent ancestry and interspecific introgression

Request a detailed protocol

Segments of IBD were identified using the phasing and segment identification in Beagle (r1339) (Browning and Browning, 2013). For the analysis presented here, we considered only the first haplotype from each C. rubella sample and both haplotypes from each C. grandiflora sample. Segments were required to be larger than 1 kb to be considered in the analyses. D statistics were calculated as in Green et al. (2010); Patterson et al. (2012); Dasmahapatra et al. (2012) comparing each individual genotype from the eastern C. rubella population to allele frequencies from western C. rubella and C. grandiflora. The outgroup species for these analyses was C. orientalis.

### Sliding window analysis of genetic diversity

Request a detailed protocol

Population genetic diversity statistics for genome scans were calculated for each species by transforming variant calls from the genome matrix into FASTA files and inputting these files into the compute function from the libsequence analysis package (Thornton, 2003). Heterozygous bases were randomly assigned as reference or variant to generate a single haplotype for each sample. Weir and Cockerham’s Fst was calculated using vcftools (v.0.1.12a) on biallelic SNP sites.

### Identification of balanced regions

Request a detailed protocol

To identify regions of the genome with unusually low Fst after speciation, we generated a null distribution of Fst values by simulating one million 20 kb segments under our inferred best demographic model using Fastsimcoal2. The output of each simulation was transformed to vcf format and Fst between C. grandiflora and each C. rubella subpopulation was calculated using vcftools. The probability of a particular Fst value in the observed data was then assigned based on its rank in these simulations (independently for the two subpopulations; one sided test). Multiple testing was accounted for using Bonferroni correction. Significant outlier windows (adjusted p-value<0.05) identified for each subpopulation were collapsed into regions using a two state hidden markov-model as implemented in the Rhmm package. The HMM approach has the advantage of joining windows of high coverage separated by a low coverage window. Only regions significant in both subpopulations were considered for further analysis. Windows overlapping the pericentromeric regions were removed from the analysis.

Request a detailed protocol

LD was calculated in 30 kb windows in C. grandiflora and C. rubella using PLINK (v.1.9). The decay of LD is the mean value at each position up to 30 kb from a focal SNP.

### Gene ontology (GO) enrichment

Request a detailed protocol

Because the C. rubella annotation is sparse, we used annotations from nucleotide blast best hit matches (e < 1e-10) to CDS sequences from its close relative, A. thaliana, for our GO analysis. Enrichment tests were performed with the SNP2GO R library (Szkiba et al., 2014) using tsCgCrSNPs as the test set and all SNP sites called in either C. rubella or C. grandiflora as the background set. We chose this approach because it is less sensitive to gene length (which should similarly affect tsSNP and non-tsSNP distributions across genes). A corresponding analysis was performed in the three-way comparisons using a background set of all SNP sites called in all three species. Significant enrichments were considered at a q-value threshold of q < 0.01 after false discovery correction. A gene was considered as belonging to the NLR family in C. rubella if its best blast hit in A. thaliana was annotated as such (Supplementary file 6).

### Identification of high quality three-way tsSNPs

Request a detailed protocol

To generate a list of high quality ts3-waySNPs, we applied a series of empirical filters. First, all ts3-waySNPs were required to have an r2 >0.2 with another ts3-wayhqSNP in the same phase in all three species. We excluded SNPs overlapping pericentromeric or annotated repeat sequences (Slotte et al., 2013). We also required that the coverage of SNPs was no more than two standard deviations above the mean coverage of all SNPs for that species, to have an average concordance greater than 0.98, and to be identified in more than one individual. These criteria were selected to increase our confidence in identified tsSNPs; it is likely that our inferences are conservative.

To validate our trans-specific SNPs we aligned the C. orientalis samples against the draft C. orientalis assembly using the bwa (v.0.7.12) mem command with default parameters. The output bam format file was sorted using samtools (v.1.6) and multisample variant calls were made with freebayes (v.1.1.0) using the parameter settings -z. 1–0 w. The resulting vcf file was filtered using vcftools (v.0.1.13) using the settings --remove-indels --minQ 50 --max-missing 0.8 --max-alleles two and further filtered to remove sites that were called as heterozygous in more than 5% of the samples. The sites overlapping with the original call set were extracted from this vcf and used for validation.

Coordinate transforms between the two genomes were necessary to validate tsSNPs. The draft assembly of C. orientalis and the C. rubella reference genome were aligned using the LAST (v.923) aligner. The C. rubella reference database was built with the lastdb command with the parameter settings -uMAM8 -cR11, and then the two genomes were aligned with the lastal command with the settings -m50 -E0.05. Equivalent sites were considered if they were present in alignments at least 500 bp long and contained only one C. orientalis and one C. rubella sequence.

### Local de novo assembly and analysis of MLO2

Request a detailed protocol

To reconstruct alleles from the MLO2 locus, we used an iterative assembly approach. Reads were first mapped to the entire reference genome using bwa (v.0.7.8) (Li and Durbin, 2009) using the bwa-mem alignment algorithm for each sample. Reads that mapped to the MLO2 locus were then extracted and assembled de novo using SPAdes (v.3.5.0) (Nurk et al., 2013). Assemblies were filtered to be longer than 2,000 bp with a coverage greater than 5, and then used to create an index for a second round of read mapping. Reads that mapped to the assembly without mismatches were collected together with their mates (regardless of the mate’s mapping quality), and were again de novo assembled. This process was iterated six times until scaffolds covering both coding regions were achieved. Format conversions and file handling made use of the software samtools (v.0.1.19) (Li et al., 2009) and bamutil (v.1.0.13).

Assemblies were filtered for appropriate length, and aligned using MAFFT (Katoh and Standley, 2013). Alignments were visualised using AliView (Larsson, 2014), and manually edited where appropriate. The protein encoded by MLO2b annotated in the C. rubella reference was truncated relative to A. thaliana MLO2. We aligned the genomic and coding regions from both species and found that the premature stop in MLO2b is likely due to a mis-annotated splice junction. The A. thaliana junction is conserved in C. rubella and alternative annotations on phytozome identify the A. thaliana-like splice variant. We therefore used the full-length version derived from manual alignments for our analysis. The phylogeny of Capsella MLO2 CDS sequences was produced using the optim.pml command from the R package phangorn using Jukes-Cantor distances. 1000 bootstrap iterations were run to estimate support for nodes in the tree. To determine where amino acid substitutions had occurred, we aligned the proteins encoded by each allele against the barley mlo protein and annotated domains (UniProtKB P3766).

### Draft assembly of the C. orientalis genome

Request a detailed protocol

The draft genome from the C. orientalis accession 2007–03 (Figure 1—source data 1) was assembled from long reads generated by PacBio single-molecule real-time sequencing. Long reads were assembled with Falcon (Chin et al., 2016) (version 0.5.4, max_diff = 150, max_cov = 150, min_cov = 2). The resulting primary contig set was iteratively polished with Quiver again using long reads (Chin et al., 2013) (version 2.0.0) and with Pilon (Walker et al., 2014) (version 1.16) using short reads from a single Illumina TruSeq DNA PCR-free library. The draft genome of C. orientalis comprises 135 Mb distributed over 423 gap-free contigs and covers 60% of the C. rubella reference with non-ambiguous 1-to-1 whole genome alignments. Its completeness is comparable to that of the C. rubella reference.

### Data availability

All raw sequencing data are depsoited under the accession codes PRJEB6689.

The following data sets were generated

The following previously published data sets were used

### References

Sours: https://elifesciences.org/articles/43606

## Negative Frequency-Dependent Selection Is Frequently Confounding

### Introduction

Natural diversity—the “endless forms most beautiful and most wonderful” (Darwin, 2012)—Is an enduring focus of both evolutionary biologists and nature lovers. The evolutionary processes that have generated or are maintaining many examples of diversity in nature, however, remain obscure and often controversial (Chesson, 2000). The processes that result in persistent polymorphisms within populations demand a special explanation as both directional natural selection and genetic drift should eliminate alleles and thus erode genetic diversity (Lewontin, 1974; Charlesworth and Hughes, 2000; Nielsen, 2005). Nevertheless, many examples of persistent polymorphisms occur in nature (Hedrick, 1986; Mallet and Joron, 1999; Richman, 2000; Carius et al., 2001; Hedrick et al., 2002; Delph and Kelly, 2014). Models of balancing selection—including negative frequency-dependent selection, spatial or temporal habitat heterogeneity, and heterozygote advantage—provide theoretical frameworks describing the processes that can account for persistent polymorphisms within populations. A core tenet of each balancing selection model is that the selective value of an allele—whether it is beneficial or detrimental—is dependent on the environmental context (Dobzhansky, 1982; Clarke et al., 1988). That is, alleles are advantageous and deleterious in different circumstances.

Negative frequency-dependent selection has been called the most powerful selective force maintaining balanced polymorphisms (Ayala and Campbell, 1974; Turelli and Barton, 2004; Fitzpatrick et al., 2007; Kazancioǧlu and Arnqvist, 2014), with some proposing that a large proportion of natural genetic polymorphisms are maintained by selection favoring rare alleles (Kojima and Yarbrough, 1967). Negative frequency-dependent selection occurs when the selective value of a variant (relative to other variants) is a function of its abundance in the population (relative to other variants) such that its relative fitness increases as the relative abundance, or frequency, of the variant decreases (Wright, 1939) (please see Clarke, 1979; Levin, 1988 for foundational mathematical descriptions and assumptions of this process). That is, rare variants have a selective advantage specifically because of their rarity while common variants are disadvantaged because of their commonness. Negative frequency-dependent selection has the potential to maintain polymorphisms within populations because relatively rare variants have a selective advantage over more common variants and thus tend to increase in frequency and avoid local extinction. Negative frequency-dependent selection models are a narrow subset of a broad field of models describing the impact of variant frequency on natural selection; the overwhelming majority of this broad field is beyond the scope of the concepts addressed here. Here, I focus on natural polymorphisms that can be explained by negative frequency-dependent selection, where genetic diversity is maintained when a variant becomes disadvantageous as it becomes more frequent, and polymorphisms that are more accurately explained by other process.

Numerous ecological interactions can result in a selective advantage for relatively rare alleles including sexual selection, parasite or predator preferences, and resource competition. In fact, each of these mechanisms has been shown to create a selective advantage for rare alleles that has resulted in persistent polymorphisms in multiple natural populations (Fisher, 1930; Wright, 1939; Harvey et al., 1975; Gigord et al., 2001; Delph and Kelly, 2014). While ecological context and natural history determine the proximate ecological mechanism affecting the differential survival or reproduction of variants in a population, changes in relative survival or reproduction must be negatively correlated with variant frequency for negative frequency-dependent selection to maintain natural polymorphisms. In a classic example, color polymorphisms are maintained in natural populations of Cepaea nemoralis snails by negative frequency-dependent selection because their predators, the song thrush (Turdus philomelos), form a search image for the most common morph resulting in much greater predation pressure on the common than the rare morph (Harvey et al., 1975; Allen, 1988). The rare morph can increase in frequency due to the relaxed predation pressure until it becomes common, resulting in a search image switch that now targets the new common morph, a process that maintains this polymorphism in C. nemoralis populations. Two luminaries in population genetics—R. Fisher and S. Wright—have also demonstrated the power of negative frequency-dependent selection to maintain diversity in natural systems. Wright famously demonstrated that self-incompatibility alleles, a genetic mechanism in plants to prevent inbreeding, are incredibly diverse because pollen containing a rare allele is more likely to find a receptive mate than pollen containing a common allele (Wright, 1964, 1969; Castric and Vekemans, 2004). Thus, plants with rare alleles have a selective advantage (Figure S1 and Appendix 2). Similarly, Fisher's principle demonstrates that human males and females are equally frequent because, if one sex were more frequent, parents producing the alternate sex would enjoy an advantage resulting in more grandchildren (Fisher, 1930; Edwards, 1998).

The many incontrovertible demonstrations of the power of negative frequency-dependent selection to maintain polymorphisms in nature have led some to suggest that it is a “pervasive” force maintaining natural diversity (Clarke, 1979). The pervasiveness of negative frequency-dependent selection has been further supported by the perception that “nearly every [selective agent] works in a way liable to produce frequency-dependent selection of the kind that favors rare phenotypes and hinders common ones” (Clarke, 1979). Although negative frequency-dependent selection may be a “powerful, perhaps a dominant, factor maintaining genetic diversity” within populations (Clarke, 1979), many natural polymorphisms are maintained by other evolutionary processes (Allison, 1954; Barton, 1986; Smith, 1990; Weatherall, 1997; Schmidt et al., 2000; Brisson and Dykhuizen, 2004). Nevertheless, many natural polymorphisms have been assumed to result from negative frequency-dependent selection even when empirical data from the system are inconsistent with the theoretical framework in which selection favors relatively rare variants. In this essay, I describe several patterns of allele dynamics that are commonly described in the literature as resulting from negative frequency-dependent selection despite data demonstrating that other causative processes. These processes include allelic diversity resulting from directional selection within a changing ecological context, density-dependent population regulation, other models of balancing selection, and aspects of community ecology. I will discuss concepts and experiments that can aid in identifying the processes underlying patterns of allele dynamics and suggest that accurately identifying the evolutionary process underlying natural patterns facilitates the development of hypotheses and future experiments to determine the ecological interactions or molecular mechanisms at the root of the process.

### Directional Selection Described as Negative Frequency-Dependent Selection

Conceptually, negative frequency-dependent selection may be the “most intuitively obvious explanation” of polymorphisms in nature (Trotter and Spencer, 2007). However, the original concept becomes ambiguous, complex, and even controversial as a result of differing definitions and applications in both theoretical and empirical work (Heino et al., 1998). Even some of the greatest thinkers in evolutionary biology have used negative frequency-dependent selection to explain scenarios in which the selective values of alleles are independent of their relative abundance. A prominent example comes from an influential essay by JBS Haldane outlining mechanisms by which infectious diseases drive natural selection in metazoans (Haldane, 1949). Although most of these ideas have been “followed profitably” (very profitably indeed), the negative frequency-dependent selection framework described in this essay appears to be one of the few unsound lines of thought. In this framework, Haldane suggested that a host with a rare defensive phenotype has a selective advantage in the face of highly-adapted pathogens, “For just because of its rarity it will be resistant to diseases which attack the majority of its fellows.” That is, the adapted pathogen has evolved mechanisms to overcome the common defensive phenotypes in host populations but cannot overcome the rare defensive phenotypes. Thus, hosts expressing rare but effective defensive phenotypes, or escape variants, enjoy a selective advantage over hosts expressing common but exploitable defenses.

The scenario described by Haldane, however, confounds natural selection favoring a specific (effective) phenotype in the current environment with a selective advantage resulting from rarity. Haldane's escape variants have a selective advantage because they cannot be subverted by the pathogen, not because they are rare. Although both rarity and novelty can result in a selective advantage, the novel defensive phenotype maintains its efficacy against the pathogen not because it is rare, but because it is novel. This point can be illustrated by extending this line of thought to allow migration of many individuals expressing a novel and effective defensive phenotype. These migrants would enjoy the same selective advantage over the previously common resident phenotype, regardless of frequency of the novel phenotype in the population immediately following the mass-migration event. The evolutionary dynamics occurring in this framework do not occur because of rare advantage and, in most cases, will not result in a balanced polymorphism. These evolutionary dynamics are more likely the result of directional selection in a continuously changing environment (Levins, 1968; Lande and Shannon, 1996; Orr, 2005; Collins et al., 2007; Bell, 2010). These two processes—negative frequency-dependent selection and selection in a changing environment—can potentially be distinguished by artificially manipulating variant frequencies or by introducing a previously common but now extinct variant into a controlled population.

The genetic diversity of haemagglutinin (HA) glycoproteins in the influenza virus is another conspicuous example of selection in a changing environment that is often confounded with negative frequency-dependent selection. The dynamics of HA alleles change over time such that rare alleles enter the population, rise to high population sizes, and subsequently decline toward extinction (Earn et al., 2002; Andreasen, 2003; Lin et al., 2003). The strains expressing a numerically common allele have relatively low fitness and decline in frequency because there are few hosts still susceptible to this strain, as hosts acquire immunity to strains with which they have been previously infected (Pease, 1987; Stegeman et al., 2004; Virseda et al., 2010). By contrast, strains expressing numerically rare alleles have many susceptible hosts available and enjoy high rates of secondary infections per infected host causing a numerical increase (Stegeman et al., 2004; Virseda et al., 2010). While there is undoubtedly strong selection at the HA locus, the selective advantage is derived not from relative rarity but from antigenic novelty (Plotkin and Dushoff, 2003; Nelson and Holmes, 2007; Cherry et al., 2009; Virseda et al., 2010), similar to Haldane's example. The presence or frequency of alternative HA alleles does not affect the fitness (growth rate) or temporal dynamics of the alleles. That is, the population dynamics of a numerically rare allele is the same if the host population is already plagued by other numerically common strains (0.0001% when one novel allele enters a population of 106 infected hosts) and if it enters a host population in which no other influenza strain is circulating (100% when one novel allele enters a previous uninfected host population) (Figure 1). As the selective value of the allele is conditioned on the absolute abundance—but not the relative abundance—of the allele, it is unlikely that negative frequency-dependent selection is the evolutionary process underlying the polymorphism commonly observed at the HA locus. More likely, the common variant is changing its own environment such that there are few susceptible hosts in which new infections can establish, but it is not affecting the environment of alternative variants.

Figure 1. Influenza virus carrying rare HA or NA alleles do not have a selective advantage because they are relatively rare—a necessary condition of negative frequency-dependent selection—but because they are numerically rare compared to the number of susceptible hosts. (A) The population dynamics of two influenza strains (dark and light lines). Both strains increase numerically when they are numerically rare, but not relatively rare, and decrease after they become numerically common. Here, the maximal rate of increase of the first strain occurs prior to the second strain entering the population, despite remaining at the maximum relative abundance (100%). (B) The relative frequencies of the two influenza strains through time. If negative frequency-dependent selection were affecting the relative abundances of these strains, the common strain at time = 0 (dark line) should have lower fitness than the rare strain (light line). However, the numerical growth rate of the common strain remains high until it reduces the number of susceptible hosts, regardless of its frequency. Arrows indicate expected effects of negative frequency-dependent selection on the relative fitness of each strain given its relative abundance. Red arrows indicate the time periods when the expectations of negative frequency-dependent selection are not satisfied; black arrows indicate time periods when negative frequency-dependent selection expectations are satisfied. (C) The numerical growth rate and population dynamics of each strain have the same temporal patterns in the absence of alternative strains. Strain 1 remains at 100% frequencies throughout the time period, suggesting that relative abundance does not underlie changes in relative fitness.

### Density-dependent Fitness Dynamics Described as Negative Frequency-dependent Selection

A preeminent evolutionary biologist, Lewontin suggested that negative frequency-dependent selection should be pervasive because, whenever “a genotype is its own worst enemy, its fitness will decrease as it becomes more common” (Lewontin, 1974). As similar variants occupy similar ecological niches and are commonly their own worst enemy, this logic suggests that negative frequency-dependent selection should indeed be pervasive. However, “common” in this case refers not to relative abundance but absolute abundance. For example, the fitness (growth rate) of individuals within a monomorphic population, one in which the frequency of a genotype is always at 100%, decreases as it “becomes more common” in absolute abundance as it approaches a carrying capacity. Further, relatively rare variants suffer negative fitness effects in proportion to the absolute abundance of their numerically common competitors such that relative rarity may not provide a selective advantage.

There is an extensive literature describing fitness (growth rate) as a function of the absolute abundance of each variant in a population (Birch, 1955; MacArthur, 1962; MacArthur et al., 1967; Roughgarden, 1971; Emlen, 1985). The above scenario can be characterized using classical Logistic growth models that include competition among variants such that “a genotype is its own worst enemy” (Lotka-Volterra models) (Equation 1). The growth rates of the variants in these models are a function of the absolute abundance of each variant—discounted by their competitive abilities (αij)—with respect to the carrying capacity (K), but are not explicitly conditioned on the abundance of the variants relative to each other. An interesting body of literature uses this modeling framework to describe the generation and maintenance of polymorphisms not through negative frequency-dependent selection mechanisms but through disruptive selection conditioned on the strength of competitive interactions and the abundance of each variant (ex Kisdi, 1999).

It is often challenging to distinguish the effect of numerical rarity from relatively rarity on the selective value of an allele through observations of patterns of allelic diversity. Experimental manipulations of the carrying capacity (K), potentially through resource supplementation, can assuage the reductions in relative fitness experienced by common variants that result from high densities without altering relative frequencies. In these experiments, the relative fitness of common variants should increase if the effects are associated with density while the relative fitness of the common and rare variants should not be altered if the allelic diversity is maintained by negative frequency-dependent selection.

### Multiple Niche Polymorphisms Described as Negative Frequency-dependent Selection

In the multiple niche selection model of balancing selection, the selective value of a trait is conditioned on its ability to exploit different environmental features in a heterogeneous habitat (Levene, 1953; Ravigne et al., 2004). Multi-niche selection maintains multiple variants in a population if each variant has a selective advantage in some available habitats while other variants are superior in other habitats. This idea—that environmentally variable selection can result in balanced polymorphisms—has a long history in the literature in which the foundational idea is stated by Dobzhansky (1982). Although incontrovertible examples of multi-niche selection maintaining polymorphism in natural populations are relatively rare, correct inference of the process resulting in balancing selection is necessary to generate hypotheses and design experiments to determine the ecological interactions or molecular mechanisms underlying the process.

The study of pattern, in isolation from the evolutionary processes that generated it, is not likely to advance general theories nor an understanding of specific systems (Cale et al., 1989). However, determining the processes responsible for balanced polymorphism patterns observed in nature is a difficult task (Barrett, 1988; Chaboudez and Burdon, 1995; Laine et al., 2011; Kazancioǧlu and Arnqvist, 2014). The balanced polymorphism at the outer surface protein C (ospC) locus in populations of Borrelia burgdorferi, the cause of human Lyme disease, provides a fitting example. Although the function of OspC remains unclear (Pal et al., 2004; Tilly et al., 2006, 2013; Xu et al., 2007; Onder et al., 2012; Carrasco et al., 2015), the within-population diversity at this locus bears all the hallmarks of balancing selection—large numbers of alleles in all local populations; allele frequencies that are more even than expected at neutrally evolving loci; and genetic evidence of an ancient polymorphism (Charlesworth et al., 1997; Qiu et al., 1997, 2002; May et al., 1999; Wang et al., 1999; Brisson and Dykhuizen, 2004).

Negative frequency-dependent selection and multi-niche selection have both been proposed as processes maintaining ospC polymorphisms, and both frameworks have empirical support (Qiu et al., 1997; Wang et al., 1999; Haven et al., 2011; Brisson et al., 2012; Seifert et al., 2015). The negative frequency-dependent selection model suggests that the polymorphism can be maintained if previously infected hosts are immune to subsequent infections by the same OspC variant but susceptible to novel variants, a molecular mechanism that has been demonstrated in laboratory animals (Gilmore et al., 1996; Probert et al., 1997; but see, Devevey et al., 2015). However, in this scenario the frequency or even presence of alternative OspC variants does not affect the number of susceptible hosts for the common strain, similar to the influenza example, arguing against negative frequency-dependent selection as an evolutionary process maintaining ospC polymorphisms. Further, negative frequency-dependent selection is most effective when few hosts remain susceptible to the common ospC variants, a pattern that is not observed in natural data sets (Brisson and Dykhuizen, 2004; Hanincova et al., 2006; Ogden et al., 2008; States et al., 2014; Vuong et al., 2014). Studies investigating allelic diversity at ospC from natural hosts consistently demonstrate that most natural reservoir hosts, those that are regularly infected with B. burgdorferi, are rarely infected with all of the common ospC variants (Brisson and Dykhuizen, 2004; Hanincova et al., 2006; Vuong et al., 2014; Mechai et al., 2016). Most hosts are, however, infected with a subset of the ospC variants, as expected if each host species represented a different ecological niche (Brisson and Dykhuizen, 2004; Hanincova et al., 2006; Vuong et al., 2014; Mechai et al., 2016). Further, host individuals of the same species, including humans, are often infected by the same subset of ospC variants across both time and geography (Seinost et al., 1999; Brisson and Dykhuizen, 2004; Hanincova et al., 2006; Dykhuizen et al., 2008; Wormser et al., 2008; Vuong et al., 2014; Mechai et al., 2016). The collective evidence suggests that the balanced ospC polymorphisms are more likely maintained by multi-niche selection—with each host species representing multiple niches (Brisson et al., 2011), one for each ospC variant by which it can be infected—than by negative frequency-dependent selection. These results suggest that the mechanisms causing the balanced polymorphism are more likely to involve ospC variant-by-host species interactions than to involve a memory immune response mechanism that is conserved across vertebrate species.

It has been argued that “Selection in multiple niches is not an alternative to [negative] frequency-dependent selection…but a way of generating it” (Clarke, 1979). However, scenarios in which balanced polymorphisms can be maintained without a selective advantage favoring relatively rare variants are not uncommon, suggesting that these are two distinct evolutionary processes in at least some cases. To illustrate this point, image two variants occupying a heterogeneous habitat where each variant has a selective advantage in one niche but is disadvantaged in another, a classical multi-niche selection scenario (Levene, 1953; Ravigne et al., 2004). Here we assume that the carrying capacity in niche A is much lower than the carrying capacity in niche B (KA = 10; KB = 105). In this scenario, variant B—which has a competitive advantage in niche B—can retain a fitness advantage (a greater per capita growth rate) even when it is more common than variant A—which has a competitive advantage in niche A. For example, in a population with 90 variant B individuals and 10 variant A individuals, variant B has a rapid per capita rate of increase while variant A does not increase (Figure 2). Here, the relatively common variant B has a “selective advantage” over the relatively rare variant A due to multi-niche selection, which is independent of negative frequency-dependent selection. Depending on the parameter values in this model, a balanced polymorphism can be maintained in the absence of rare advantage.

Figure 2. Multi-niche selection, an alternative model of balancing selection, does not require the core assumption of negative frequency-dependent selection models that relative fitness is a function of relative frequency in the population. Shown is a simulation where variant A has a selective advantage in niche A while variant B has a selective advantage in niche B (Appendix 1, Supplementary Material). Here, the carrying capacity in niche A is much lower than in niche B (KA = 10, KB= 10,000). At the start of the simulation, there are 10 variant A individuals (10% of the population) and 90 variant B individuals (90% of the population), yet the fitness (growth rate) of variant A individuals is much lower than for variant B individuals. This contradicts the expectations of negative frequency-dependent selection, where the frequency of variant A should increase as it is currently less frequent than variant B. Although the conditions necessary for negative frequency-dependent selection to maintain a stable polymorphism are not satisfied, both variants can be maintained in the population due to the selective advantage each enjoys in their preferred niche. Parameters used in the simulation: growth rate = 0.35, death rate in preferred niche = 0.05, death rate in non-preferred habitat = 0.25, migration among niches = 0.01.

### Community Diversity Described as Negative Frequency-dependent Selection

Prominent population geneticists including Williams and Maynard Smith, among many others, have demonstrated that the efficacy of natural selection decreases at increasing levels of biological organization such that selection among individuals within populations is much more efficient than selection among species within communities (Maynard Smith, 1964, 1976; Williams, 1966). Additionally, selection at higher levels of organization (i.e., among species within communities) “tends to be undermined by natural selection at lower levels” (i.e., among individuals with populations) (Wilson and Wilson, 2007). Nevertheless, several studies have suggested that negative frequency-dependent selection maintains species diversity within ecological communities. There is a rich empirical and theoretical history describing the causes and consequences of species diversity within ecological communities (Connell and Orias, 1964; Schoener, 1974; Lubchenco, 1978; Ricklefs and Schluter, 1993; Chesson, 2000; Wright, 2002). Mechanisms of coexistence function in two major ways: equalizing mechanisms minimize the average fitness differences between species while stabilizing mechanisms increase negative intraspecific interactions relative to negative interspecific interactions (Chesson, 2000). Stabilizing mechanisms promote species coexistence and include mechanisms such as resource partitioning and frequency-dependent predation, as well as mechanisms that depend on spatial or temporal fluctuations in population densities or environmental factors. Equalizing mechanisms contribute to stable coexistence when they reduce large average fitness inequalities which might negate the effects of stabilizing mechanisms (Chesson, 2000). While some natural forces that affect the maintenance of community diversity have frequency-dependent mechanisms, this should not be mistaken for negative frequency-dependent selection which maintains polymorphisms within populations. Applying models of natural selection to levels of biological organization above the population level should be exercised only with the greatest of caution (Williams, 1966).

The Killing the Winner (KtW) hypothesis is a recent endeavor to understand patterns of diversity within communities using a negative frequency-dependent selection framework (Thingstad and Lignell, 1997; Thingstad, 2000). The KtW hypothesis suggests that predators target species that maximize reproductive effort over those that invest heavily in predator defense. Recent extensions of the KtW hypothesis suggest that this predator functional response promotes community diversity through negative frequency dependent selection. However, the functional response in this hypothesis is often not conditioned on the frequency of the prey species but on the presence or absence of character traits in the prey species (Thingstad and Lignell, 1997; Suttle, 2007; Winter et al., 2010; Koskella and Meaden, 2013). The “winner” in this hypothesis refers to species that invest resource into reproduction at the expense of investing in predator defenses, which may or may not correspond to the most frequent species (Winter et al., 2010). In these cases, neither the relative nor the absolute abundance of the prey species affects the functional responses of the predator.

### Concluding Remarks and Future Perspectives

Understanding the processes that produce or maintain diversity in natural populations is a central challenge in evolutionary biology. Negative frequency-dependent selection maintains many noted and striking polymorphisms in nature (Kojima and Tobari, 1969; Gigord et al., 2001; Charlesworth, 2006; Loisel et al., 2006; Fitzpatrick et al., 2007; Mitchell-Olds et al., 2007; Mokkonen et al., 2011), and many polymorphisms exist in the absence of a selective advantage favoring rare variants (Allison, 1954; Barton, 1986; Smith, 1990; Weatherall, 1997; Schmidt et al., 2000; Brisson and Dykhuizen, 2004). Ideally, one could unequivocally determine the causative process through observations of the patterns of variation in nature. Unfortunately, many processes result in identical patterns, especially when those patterns are observed over short time scales. In some cases, long-term observations of allelic dynamics can distinguish polymorphisms caused by mutation-selection balance or selection in a changing environment from a stable polymorphism resulting from balancing selection (Roy, 1998; Schmidt et al., 2000; Schmidt, 2001; Siemens and Roy, 2005; Olendorf et al., 2006; Koskella and Lively, 2009). Evidence suggesting negative frequency-dependent selection—such as allelic cycles where each allele gains a selective advantage as it becomes more rare—may also be observed from long-term observational studies (Gigord et al., 2001; Thrall et al., 2012). More directly, the patterns resulting from specific evolutionary processes can be tested through controlled and natural experiments such as manipulating allele frequencies in sub-populations (Roy, 1998; Schmidt et al., 2000; Olendorf et al., 2006; Koskella and Lively, 2009).

Ecological and molecular mechanisms are rarely deducible from patterns (Kershaw, 1963), but accurate identification of the evolutionary processes causing the pattern can generate hypotheses about these mechanisms. For example, the northern acorn barnacle, Semibalanus balanoides, shows clear evidence of a balanced polymorphism at the mannose-6-phosphate isomerase (mpi) locus (Hoffmann, 1981; McDonald, 1991). The pattern of mpi genotype frequencies among intertidal microhabitats, where one allele is common in high intertidal zones but rare in low intertidal zones, suggests that multi-niche selection maintains this polymorphism (Schmidt and Rand, 1999). Experimental manipulations of genotypes among microhabitats confirmed that multi-niche selection is the process responsible for the allelic variation (Schmidt et al., 2000; Schmidt and Rand, 2001). The molecular mechanism linking mannose utilization with survivorship in high intertidal zones, where temperature and desiccation stress is high, was subsequently elucidated through controlled laboratory experiments (Schmidt, 2001). As this and many other examples demonstrate, the ecological interaction or molecular mechanism underlying an evolutionary process can best be understood when the evolutionary process is accurately determined.

### Author Contributions

The author confirms being the sole contributor of this work and approved it for publication.

### Funding

Burroughs Wellcome Fund (1012376), the National Institutes of Health (NIH AI076342 and AI097137), and National Science Foundation (DEB 1354184).

### Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

### Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fevo.2018.00010/full#supplementary-material

### References

Ayala, F. J., and Campbell, C. A. (1974). Frequency-dependent selection. Annu. Rev. Ecol. Syst. 5, 115–138. doi: 10.1146/annurev.es.05.110174.000555

CrossRef Full Text | Google Scholar

Barrett, J. A. (1988). Frequency-dependent selection in plant fungal interactions. Philos. Trans. R. Soc. B 319, 473–483. doi: 10.1098/rstb.1988.0060

CrossRef Full Text | Google Scholar

Barton, N. (1986). The maintenance of polygenic variation through a balance between mutation and stabilizing selection. Genet. Res. 47, 209–216. doi: 10.1017/S0016672300023156

PubMed Abstract | CrossRef Full Text | Google Scholar

Birch, L. C. (1955). Selection in Drosophila pseudoobscura in relation to crowding. Evolution 9, 389–399. doi: 10.1111/j.1558-5646.1955.tb01549.x

CrossRef Full Text | Google Scholar

Brisson, D., Baxamusa, N., Schwartz, I., and Wormser, G. P. (2011). Biodiversity of Borrelia burgdorferi strains in tissues of Lyme disease patients. PLoS ONE 6:e22926. doi: 10.1371/journal.pone.0022926

PubMed Abstract | CrossRef Full Text | Google Scholar

Brisson, D., Drecktrah, D., Eggers, C. H., and Samuels, D. S. (2012). Genetics of Borrelia burgdorferi. Annu. Rev. Genet. 46, 515–536. doi: 10.1146/annurev-genet-011112-112140

PubMed Abstract | CrossRef Full Text | Google Scholar

Brisson, D., and Dykhuizen, D. E. (2004). ospC diversity in Borrelia burgdorferi: different hosts are different niches. Genetics 168, 713–722. doi: 10.1534/genetics.104.028738

PubMed Abstract | CrossRef Full Text | Google Scholar

Cale, W., Henebry, G., and Yeakley, J. (1989). Inferring process from pattern in natural communities. Bioscience 39, 600–605. doi: 10.2307/1311089

CrossRef Full Text | Google Scholar

Carius, H. J., Little, T. J., and Ebert, D. (2001). Genetic variation in a host-parasite association: potential for coevolution and frequency-dependent selection. Evolution 55, 1136–1145. doi: 10.1111/j.0014-3820.2001.tb00633.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Carrasco, S. E., Troxell, B., Yang, Y., Brandt, S. L., Li, H., Sandusky, G. E., et al. (2015). Outer surface protein OspC is an antiphagocytic factor that protects Borrelia burgdorferi from phagocytosis by macrophages. Infect. Immun. 83, 4848–4860. doi: 10.1128/IAI.01215-15

PubMed Abstract | CrossRef Full Text | Google Scholar

Castric, V., and Vekemans, X. (2004). Plant self-incompatibility in natural populations: a critical assessment of recent theoretical and empirical advances. Mol. Ecol. 13, 2873–2889. doi: 10.1111/j.1365-294X.2004.02267.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Charlesworth, B., and Hughes, K. A. (2000). “The maintenance of genetic variation in life-history traits,” in Evolutionary Genetics: from Molecules to Morphology, eds R. S. Singh and C. B. Krimbas (Cambride, UK: Cambridge University Press), 369–392.

Charlesworth, B., Nordborg, M., and Charlesworth, D. (1997). The effects of local selection, balanced polymorphism and background selection on equilibrium patterns of genetic diversity in subdivided populations. Genet. Res. 70, 155–174. doi: 10.1017/S0016672397002954

PubMed Abstract | CrossRef Full Text | Google Scholar

Cherry, J. L., Lipman, D. J., Nikolskaya, A., and Wolf, Y. (2009). Evolutionary dynamics of N-glycosylation sites of influenza virus hemagglutinin. PLoS Curr. 1:RRN1001. doi: 10.1371/currents.RRN1001

PubMed Abstract | CrossRef Full Text | Google Scholar

Chesson, P. (2000). Mechanisms of maintenance of species diversity. Annu. Rev. Ecol. Syst. 31, 343–366. doi: 10.1146/annurev.ecolsys.31.1.343

CrossRef Full Text | Google Scholar

Clarke, B. C., Shelton, P. R., and Mani, G. S. (1988). Frequency-dependent selection, metrical characters and molecular evolution. Philos. Trans. R. Soc. B 319, 631–640. doi: 10.1098/rstb.1988.0070

PubMed Abstract | CrossRef Full Text | Google Scholar

Connell, J. H., and Orias, E. (1964). The ecological regulation of species diversity. Am. Nat. 98, 399–414. doi: 10.1086/282335

CrossRef Full Text | Google Scholar

Darwin, C. (2012). The Origin of Species, Dover Thrift Edition. New York, NY: Courier Corporation.

Devevey, G., Dang, T., Graves, C. J., Murray, S., and Brisson, D. (2015). First arrived takes all: inhibitory priority effects dominate competition between co-infecting Borrelia burgdorferi strains. BMC Microbiol. 15:61. doi: 10.1186/s12866-015-0381-0

PubMed Abstract | CrossRef Full Text | Google Scholar

Dobzhansky, T. (1982). Genetics and the origin of Species. New York, NY: Columbia University Press.

Dykhuizen, D. E. D. E., Brisson, D., Sandigursky, S., Wormser, G. P. G. P., Nowakowski, J., Nadelman, R. B., et al. (2008). The propensity of different Borrelia burgdorferi sensu stricto genotypes to cause disseminated infections in humans. Am. J. Trop. Med. Hyg. 78, 806–810. doi: 10.4269/ajtmh.2008.78.806

PubMed Abstract | CrossRef Full Text | Google Scholar

Earn, D. J. D., Dushoff, J., and Levin, S. A. (2002). Ecology and evolution of the flu. Trends Ecol. Evol. 17, 334–340. doi: 10.1016/S0169-5347(02)02502-8

CrossRef Full Text | Google Scholar

Edwards, A. W. F. (1998). Natural selection and the sex ratio: Fisher's sources. Am. Nat. 151, 564–569.

Emlen, J. M. (1985). The assessment of frequency- and density-dependent influences on fitness in natural populations. Am. Nat. 125, 507–520. doi: 10.1086/284359

CrossRef Full Text | Google Scholar

Fisher, R. A. (1930). The Genetical Theory of Natural Selection. New York, NY: Oxford University Press.

Fitzpatrick, M. J., Feder, E., Rowe, L., and Sokolowski, M. B. (2007). Maintaining a behaviour polymorphism by frequency-dependent selection on a single gene. Nature 447, 210–212. doi: 10.1038/nature05764

PubMed Abstract | CrossRef Full Text | Google Scholar

Gigord, L. D., MacNair, M. R., and Smithson, A. (2001). Negative frequency-dependent selection maintains a dramatic flower color polymorphism in the rewardless orchid Dactylorhiza sambucina (L.). Soo. Proc. Natl. Acad. Sci. U.S.A. 98, 6253–6255. doi: 10.1073/pnas.111162598

PubMed Abstract | CrossRef Full Text | Google Scholar

Gilmore, R. D., Kappel, K. J., Dolan, M. C., Burkot, T. R., and Johnson, B. J. B. (1996). Outer surface protein C (OspC), but not P39, is a protective immunogen against a tick-transmitted Borrelia burgdorferi challenge: evidence for a conformational protective epitope in OspC. Infect. Immun. 64, 2234–2239.

Haldane, J. B. S. (1949). Disease and evolution. La Ric. Sci. 19(Suppl.), 1–11.

Hanincova, K., Kurtenbach, K., Diuk-Wasser, M., Brei, B., and Fish, D. (2006). Epidemic spread of Lyme borreliosis, northeastern United States. Emerg. Infect. Dis. 12, 604–611. doi: 10.3201/eid1204.051016

PubMed Abstract | CrossRef Full Text | Google Scholar

Harvey, P. H., Birley, N., and Blackstock, T. H. (1975). The effect of experience on the selective behaviour of song thrushes feeding on artificial populations of Cepaea. Genetica 45, 211–216. doi: 10.1007/BF01517197

CrossRef Full Text | Google Scholar

Haven, J., Vargas, L. C., Mongodin, E. F., Xue, V., Hernandez, Y., Pagan, P., et al. (2011). Pervasive recombination and sympatric genome diversification driven by frequency-dependent selection in Borrelia burgdorferi, the Lyme disease bacterium. Genetics 189, 951–966. doi: 10.1534/genetics.111.130773

PubMed Abstract | CrossRef Full Text | Google Scholar

Hedrick, P. W. (1986). Genetic-polymorphism in heterogeneous environments - a decade later. Annu. Rev. Ecol. Syst. 17, 535–566. doi: 10.1146/annurev.es.17.110186.002535

CrossRef Full Text | Google Scholar

Hedrick, P. W., Lee, R. N., and Garrigan, D. (2002). Major histocompatibility complex variation in red wolves: evidence for common ancestry with coyotes and balancing selection. Mol. Ecol. 11, 1905–1913. doi: 10.1046/j.1365-294X.2002.01579.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Hoffmann, R. J. (1981). Evolutionary genetics of Metridium senile. I. Kinetic differences in phosphoglucose isomerase allozymes. Biochem. Genet. 19, 129–144. doi: 10.1007/BF00486143

PubMed Abstract | CrossRef Full Text | Google Scholar

Kazancioǧlu, E., and Arnqvist, G. (2014). The maintenance of mitochondrial genetic variation by negative frequency-dependent selection. Ecol. Lett. 17, 22–27. doi: 10.1111/ele.12195

PubMed Abstract | CrossRef Full Text | Google Scholar

Kojima, K., and Tobari, Y. (1969). Selective modes associated with karyotypes in Drosophila ananassae. II. Heterosis and frequency-dependent selection. Genetics 63, 639–651.

Kojima, K., and Yarbrough, K. M. (1967). Frequency-dependent selection at the esterase-6-locus in Drosophila melanogaster. Proc. Natl. Acad. Sci. U.S.A. 57, 645–649. doi: 10.1073/pnas.57.3.645

PubMed Abstract | CrossRef Full Text | Google Scholar

Koskella, B., and Lively, C. M. (2009). Evidence for negative frequency-dependent selection during experimental coevolution of a freshwater snail and a sterilizing trematode. Evolution 63, 2213–2221. doi: 10.1111/j.1558-5646.2009.00711.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Laine, A. L., Burdon, J. J., Dodds, P. N., and Thrall, P. H. (2011). Spatial variation in disease resistance: from molecules to metapopulations. J. Ecol. 99, 96–112. doi: 10.1111/j.1365-2745.2010.01738.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Lande, R., and Shannon, S. (1996). The role of genetic variation in adaptation and population persistence in a changing environment. Evolution 56, 429.

Levene, H. (1953). Genetic equilibrium when more than one ecological niche is available. Am. Nat. 87, 331–333. doi: 10.1086/281792

CrossRef Full Text | Google Scholar

Levins, R. (1968). Evolution in Changing Environments. Princeton, NJ: Princeton University Press.

Lewontin, R. C. (1974). The Genetic Basis of Evolutionary Change. New York, NY: Columbia University Press.

Lin, J., Andreasen, V., Casagrandi, R., and Levin, S. A. (2003). Traveling waves in a model of influenza A drift. J. Theor. Biol. 222, 437–445. doi: 10.1016/S0022-5193(03)00056-0

PubMed Abstract | CrossRef Full Text | Google Scholar

Loisel, D. A., Rockman, M. V., Wray, G. A., Altmann, J., and Alberts, S. C. (2006). Ancient polymorphism and functional variation in the primate MHC-DQA1 5' cis-regulatory region. Proc. Natl. Acad. Sci. U.S.A. 103, 16331–16336. doi: 10.1073/pnas.0607662103

PubMed Abstract | CrossRef Full Text | Google Scholar

Lubchenco, J. (1978). Plant species diversity in a marine intertidal community; importance of herbivore food preference and algal competetive. Am. Nat. 112, 23–39. doi: 10.1086/283250

CrossRef Full Text | Google Scholar

MacArthur, R. H., Wilson, E. O., and MacArthur, W. (1967). The Theory of Island Biogeography. Princeton, NJ: Princeton University Press.

Mallet, J., and Joron, M. (1999). Evolution of diversity in warning color and mimicry: polymorphisms, shifting balance, and speciation. Annu. Rev. Ecol. Syst. 30, 201–233. doi: 10.1146/annurev.ecolsys.30.1.201

CrossRef Full Text | Google Scholar

May, G., Shaw, F., Badrane, H., and Vekemans, X. (1999). The signature of balancing selection: fungal mating compatibility gene evolution. Proc. Natl. Acad. Sci. U.S.A. 96, 9172–9177. doi: 10.1073/pnas.96.16.9172

PubMed Abstract | CrossRef Full Text | Google Scholar

Maynard Smith, J. (1976). Commentary - group selection. Q. Rev. Biol. 51, 277–283.

McDonald, J. H. (1991). Contrasting amounts of geographical variation as evidence for direct selection: the mpi and pgm loci in eight crustacean species. Heredity 67, 215–219. doi: 10.1038/hdy.1991.82

CrossRef Full Text | Google Scholar

Mechai, S., Margos, G., Feil, E. J., Barairo, N., Lindsay, L. R., Michel, P., et al. (2016). Evidence for host-genotype associations of Borrelia burgdorferi sensu stricto. PLoS ONE 11:e0149345. doi: 10.1371/journal.pone.0149345

PubMed Abstract | CrossRef Full Text | Google Scholar

Mitchell-Olds, T., Willis, J. H., and Goldstein, D. B. D. B. (2007). Which evolutionary processes influence natural genetic variation for phenotypic traits? Nat. Rev. Genet. 8, 845–856. doi: 10.1038/nrg2207

PubMed Abstract | CrossRef Full Text | Google Scholar

Mokkonen, M., Kokko, H., Koskela, E., Lehtonen, J., Mappes, T., Martiskainen, H., et al. (2011). Negative frequency-dependent selection of sexually antagonistic alleles in Myodes glareolus. Science 334, 972–974. doi: 10.1126/science.1208708

PubMed Abstract | CrossRef Full Text | Google Scholar

Ogden, N. H., Lindsay, L. R., Hanincova, K., Barker, I. K., Bigras-Poulin, M., Charron, D. F., et al. (2008). Role of migratory birds in introduction and range expansion of Ixodes scapularis ticks and of Borrelia burgdorferi and Anaplasma phagocytophilum in Canada. Appl. Environ. Microbiol. 74, 1780–1790. doi: 10.1128/AEM.01982-07

PubMed Abstract | CrossRef Full Text | Google Scholar

Olendorf, R., Rodd, F. H., Punzalan, D., Houde, A. E., Hurt, C., Reznick, D. N., et al. (2006). Frequency-dependent survival in natural guppy populations. Nature 441, 633–636. doi: 10.1038/nature04646

PubMed Abstract | CrossRef Full Text | Google Scholar

Onder, O., Humphrey, P. T., McOmber, B., Korobova, F., Francella, N., Greenbaum, D. C., et al. (2012). OspC is potent plasminogen receptor on surface of Borrelia burgdorferi. J. Biol. Chem. 287, 16860–16868. doi: 10.1074/jbc.M111.290775

PubMed Abstract | CrossRef Full Text | Google Scholar

Pal, U., Yang, X., Chen, M., Bockenstedt, L. K., Anderson, J. F., Flavell, R. A., et al. (2004). OspC facilitates Borrelia burgdorferi invasion of Ixodes scapularis salivary glands. J. Clin. Invest. 113, 220–230. doi: 10.1172/JCI200419894

PubMed Abstract | CrossRef Full Text | Google Scholar

Plotkin, J. B., and Dushoff, J. (2003). Codon bias and frequency-dependent selection on the hemagglutinin epitopes of influenza A virus. Proc. Natl. Acad. Sci. U.S.A. 100, 7152–7157. doi: 10.1073/pnas.1132114100

PubMed Abstract | CrossRef Full Text | Google Scholar

Probert, W. S., Crawford, M., Cadiz, R. B., and LeFebvre, R. B. (1997). Immunization with outer surface protein (Osp) A, but not OspC, provides cross-protection of mice challenged with North American isolates of Borrelia burgdorferi. J. Infect. Dis. 175, 400–405. doi: 10.1093/infdis/175.2.400

CrossRef Full Text | Google Scholar

Qiu, W. G., Bosler, E. M., Campbell, J. R., Ugine, G. D., Wang, I. N., Luft, B. J., et al. (1997). A population genetic study of Borrelia burgdorferi sensu stricto from eastern Long Island, New York, suggested frequency-dependent selection, gene flow and host adaptation. Hereditas 127, 203–216. doi: 10.1111/j.1601-5223.1997.00203.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Qiu, W. G., Dykhuizen, D. E., Acosta, M. S., and Luft, B. J. (2002). Geographic uniformity of the Lyme disease spirochete (Borrelia burgdorferi) and its shared history with tick vector (Ixodes scapularis) in the northeastern United States. Genetics 160, 833–849.

Ravigne, V., Olivieri, I., and Dieckmann, U. (2004). Implications of habitat choice for protected polymorphisms. Evol. Ecol. Res. 6, 125–145.

Ricklefs, R. E., and Schluter, D. (1993). Species Diversity in Ecological Communities: Historical and Geographical Perspectives. Chicago, IL: University of Chicago Press.

Roy, B. A. (1998). Differentiating the effects of origin and frequency in reciprocal transplant experiments used to test negative frequency-dependent selection hypotheses. Oecologia 115, 73–83. doi: 10.1007/s004420050493

PubMed Abstract | CrossRef Full Text | Google Scholar

Schmidt, P. S. (2001). The effects of diet and physiological stress on the evolutionary dynamics of an enzyme polymorphism. Proc. R. Soc. B 268, 9–14. doi: 10.1098/rspb.2000.1323

PubMed Abstract | CrossRef Full Text | Google Scholar

Schmidt, P. S., Bertness, M. D., and Rand, D. M. (2000). Environmental heterogeneity and balancing selection in the acorn barnacle, Semibalanus balanoides. Philos. Trans. R. Soc. B 267, 379–384. doi: 10.1098/rspb.2000.1012

PubMed Abstract | CrossRef Full Text | Google Scholar

Schmidt, P. S., and Rand, D. M. (1999). Intertidal microhabitat and selection at Mpi: interlocus contrasts in the northern acorn barnacle, Semibalanus glandula. Evolution 53, 135–146.

Schmidt, P. S., and Rand, D. M. (2001). Adaptive maintenance of genetic polymorphism in an intertidal barnacle: habitat- and life-stage-specific survivorship of Mpi genotypes. Evolution 55, 1336–1344. doi: 10.1111/j.0014-3820.2001.tb00656.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Schoener, T. W. (1974). Resource partitioning in ecological communities research on how similar species divide resources helps reveal the natural regulation of species diversity. Science 185, 27–39. doi: 10.1126/science.185.4145.27

CrossRef Full Text | Google Scholar

Seifert, S. N., Khatchikian, C. E., Zhou, W., and Brisson, D. (2015). Evolution and population genomics of the Lyme borreliosis pathogen, Borrelia burgdorferi. Trends Genet. 31, 201–207. doi: 10.1016/j.tig.2015.02.006

PubMed Abstract | CrossRef Full Text | Google Scholar

Seinost, G., Dykhuizen, D. E., Dattwyler, R. J., Golde, W. T., Dunn, J. J., Wang, I. N., et al. (1999). Four clones of Borrelia burgdorferi sensu stricto cause invasive infection in humans. Infect. Immun. 67, 3518–3524.

Siemens, D. H., and Roy, B. A. (2005). Tests for parasite-mediated frequency-dependent selection in natural populations of an asexual plant species. Evol. Ecol. 19, 321–338. doi: 10.1007/s10682-005-6639-5

CrossRef Full Text | Google Scholar

Smith, T. B. (1990). Natural selection on bill characters in the two bill morphs of the African finch, Pyrenestes ostrinus. Evolution 44, 832–842. doi: 10.1111/j.1558-5646.1990.tb03808.x

PubMed Abstract | CrossRef Full Text | Google Scholar

States, S. L., Brinkerhoff, R. J., Carpi, G., Steeves, T. K., Folsom-O'Keefe, C., DeVeaux, M., et al. (2014). Lyme disease risk not amplified in a species-poor vertebrate community: Similar Borrelia burgdorferi tick infection prevalence and OspC genotype frequencies. Infect. Genet. Evol. 27, 566–575. doi: 10.1016/j.meegid.2014.04.014

CrossRef Full Text | Google Scholar

Stegeman, A., Bouma, A., Elbers, A. R. W., de Jong, M. C. M., Nodelijk, G., de Klerk, F., et al. (2004). Avian influenza A virus (H7N7) epidemic in the Netherlands in 2003: course of the epidemic and effectiveness of control measures. J. Infect. Dis. 190, 2088–2095. doi: 10.1086/425583

PubMed Abstract | CrossRef Full Text | Google Scholar

Thingstad, T. F. (2000). Elements of a theory for the mechanisms controlling abundance, diversity, and biogeochemical role of lytic bacterial viruses in aquatic systems. Limnol. Oceanogr. 45, 1320–1328. doi: 10.4319/lo.2000.45.6.1320

CrossRef Full Text | Google Scholar

Thingstad, T. F., and Lignell, R. (1997). Theoretical models for the control of bacterial growth rate, abundance, diversity and carbon demand. Aquat. Microb. Ecol. 13, 19–27. doi: 10.3354/ame013019

CrossRef Full Text | Google Scholar

Thrall, P. H., Laine, A. L., Ravensdale, M., Nemri, A., Dodds, P. N., Barrett, L. G., et al. (2012). Rapid genetic change underpins antagonistic coevolution in a natural host-pathogen metapopulation. Ecol. Lett. 15, 425–435. doi: 10.1111/j.1461-0248.2012.01749.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Tilly, K., Bestor, A., and Rosa, P. A. (2013). Lipoprotein succession in Borrelia burgdorferi: similar but distinct roles for OspC and VlsE at different stages of mammalian infection. Mol. Microbiol. 89, 216–227. doi: 10.1111/mmi.12271

PubMed Abstract | CrossRef Full Text | Google Scholar

Tilly, K., Krum, J. G., Bestor, A., Jewett, M. W., Grimm, D., Bueschel, D., et al. (2006). Borrelia burgdorferi OspC protein required exclusively in a crucial early stage of mammalian infection. Infect. Immun. 74, 3554–3564. doi: 10.1128/IAI.01950-05

PubMed Abstract | CrossRef Full Text | Google Scholar

Trotter, M. V., and Spencer, H. G. (2007). Frequency-dependent selection and the maintenance of genetic variation: exploring the parameter space of the multiallelic pairwise interaction model. Genetics 176, 1729–1740. doi: 10.1534/genetics.107.073072

PubMed Abstract | CrossRef Full Text | Google Scholar

Turelli, M., and Barton, N. H. (2004). Polygenic variation maintained by balancing selection: pleiotropy, sex-dependent allelic effects and G × E interactions. Genetics 166, 1053–1079. doi: 10.1534/genetics.166.2.1053

PubMed Abstract | CrossRef Full Text | Google Scholar

Virseda, S., Restrepo, M. A., Arranz, E., Magan-Tapia, P., Fernandez-Ruiz, M., de la Camara, A. G., et al. (2010). Seasonal and pandemic A (H1N1) 2009 influenza vaccination coverage and attitudes among health-care workers in a spanish university hospital. Vaccine 28, 4751–4757. doi: 10.1016/j.vaccine.2010.04.101

PubMed Abstract | CrossRef Full Text | Google Scholar

Vuong, H. B., Canham, C. D., Fonseca, D. M., Brisson, D., Morin, P. J., Ostfeld, R. S., et al. (2014). Occurrence and transmission efficiencies of Borrelia burgdorferi ospC types in avian and mammalian wildlife. Infect. Genet. Evol. 27, 594–600. doi: 10.1016/j.meegid.2013.12.011

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, I. N., Dykhuizen, D. E., Qiu, W., Dunn, J. J., Bosler, E. M., and Luft, B. J. (1999). Genetic diversity of ospC in a local population of Borrelia burgdorferi sensu stricto. Genetics 151, 15–30.

Williams, G. C. (1966). Adaptation and Natural Selection. Princeton, NJ: Princeton University Press.

Winter, C., Bouvier, T., Weinbauer, M. G., and Thingstad, T. F. (2010). Trade-offs between competition and defense specialists among unicellular planktonic organisms: the “killing the winner” hypothesis revisited. Microbiol. Mol. Biol. Rev. 74, 42–57. doi: 10.1128/MMBR.00034-09

PubMed Abstract | CrossRef Full Text | Google Scholar

Wormser, G. P., Brisson, D., Liveris, D., Hanincova, K., Sandigursky, S., Nowakowski, J., et al. (2008). Borrelia burgdorferi genotype predicts the capacity for hematogenous dissemination during early Lyme disease. J. Infect. Dis. 198, 1358–1364. doi: 10.1086/592279

PubMed Abstract | CrossRef Full Text | Google Scholar

Wright, S. (1964). The distribution of self-incompatibility alleles in populations. Evolution 18, 609–619. doi: 10.1111/j.1558-5646.1964.tb01675.x

CrossRef Full Text | Google Scholar

Wright, S. (1969). Evolution and the Genetics of Populations, Vol. 2, The Theory of Gene Frequencies. Chicago, IL: University of Chicago Press.

Xu, Q., McShan, K., and Liang, F. T. (2007). Identification of an ospC operator critical for immune evasion of Borrelia burgdorferi. Mol. Microbiol. 64, 220–231. doi: 10.1111/j.1365-2958.2007.05636.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: negative frequency dependent selection, balancing selection, killing the winner hypothesis, multiple niche polymorphism, density dependent selection

Citation: Brisson D (2018) Negative Frequency-Dependent Selection Is Frequently Confounding. Front. Ecol. Evol. 6:10. doi: 10.3389/fevo.2018.00010

Received: 01 December 2017; Accepted: 19 January 2018;
Published: 21 February 2018.

Copyright © 2018 Brisson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Dustin Brisson, [email protected]

Sours: https://www.frontiersin.org/articles/337021
Frequency Dependent Selection

## Balancing selection via life-history trade-offs maintains an inversion polymorphism in a seaweed fly

### Abstract

How natural diversity is maintained is an evolutionary puzzle. Genetic variation can be eroded by drift and directional selection but some polymorphisms persist for long time periods, implicating a role for balancing selection. Here, we investigate the maintenance of a chromosomal inversion polymorphism in the seaweed fly Coelopa frigida. Using experimental evolution and quantifying fitness, we show that the inversion underlies a life-history trade-off, whereby each haplotype has opposing effects on larval survival and adult reproduction. Numerical simulations confirm that such antagonistic pleiotropy can maintain polymorphism. Our results also highlight the importance of sex-specific effects, dominance and environmental heterogeneity, whose interaction enhances the maintenance of polymorphism through antagonistic pleiotropy. Overall, our findings directly demonstrate how overdominance and sexual antagonism can emerge from a life-history trade-off, inviting reconsideration of antagonistic pleiotropy as a key part of multi-headed balancing selection processes that enable the persistence of genetic variation.

### Introduction

The selective mechanisms involved in the maintenance of long-term polymorphism in the face of genetic drift often remain unknown. Early assessments of heritable diversity, as well as recent empirical genomic and theoretical studies, all emphasise the importance of balancing selection in promoting within-species diversity and maintaining intermediate allele frequencies1,2,3,4. The best-documented forms of balancing selection are overdominance, where heterozygotes benefit from a higher survival compared to homozygotes, and negative frequency-dependent selection, where allelic fitness decreases with increasing frequency, resulting in the protection of rare alleles. Balancing selection may also arise from the combination of opposing selection pressures, each favouring different alleles at polymorphic loci5, for example via spatially or temporally varying selection, where the fitness of the different alleles varies among habitats or seasons6,7,8, or sexually antagonistic selection, where allelic fitness varies between sexes9,10,11.

However, the mechanisms underlying balanced polymorphisms are still largely unidentified, partly because the fitness associated with different genotypes is affected by several interacting life-history factors. Individual fitness results from complex trait combinations of survival, longevity, reproductive success and fecundity, which can be under opposed selective pressures. For example, one allele can increase survival but weaken reproductive success, while the alternative allele can confer high fertility at the cost of decreasing survival, creating a life-history trade-off12,13. Such antagonistic pleiotropy increases genetic variation via the maintenance of intra-locus polymorphism14,15. Antagonistic pleiotropy has long been considered as a minor contributor to balancing selection because theoretical studies predicted it enables persistent polymorphism only for a limited range of parameters16,17,18. Nevertheless, recent models suggest that the role of antagonistic pleiotropy has been underestimated19,20,21,22 by showing that several factors, realistic in natural populations, can promote polymorphism persistence. These factors include trait-specific dominance (i.e. when the level of dominance varies between fitness components22,23), sex-specific selection (i.e. selection strength on each fitness component differs between sexes19) and spatially and temporally varying selection20,21.

These recent theoretical developments urge for a re-examination of the mechanisms allowing polymorphism maintenance in natural systems and to disentangle their effects on the different components of fitness. For example, detailed work on phenotypic variation in horn size in Soay sheep (Ovis aries) shows that antagonistic pleiotropy due to a life-history trade-off between survival and reproductive success at a single locus maintains polymorphism by causing an overall net heterozygote advantage13. Horn size is also under sex-specific selection and involves trait-specific dominance, two factors predicted to contribute to the maintenance of genetic variation. Similarly, in the yellow monkeyflower (Mimulus guttatus) variability in flower size is related to antagonistic pleiotropy due to a trade-off between viability and fecundity, with the persistence of the polymorphism being further enhanced by spatial and temporal environmental heterogeneity20,24,25. However, direct empirical evidence of antagonistic pleiotropy enabling long-term polymorphism remains scarce, in part because the different components of fitness are rarely estimated separately.

Here, we focus on the seaweed fly Coelopa frigida, whose natural populations are all polymorphic at a large chromosomal inversion on chromosome I, representing about 10% of the genome, 25 MB and containing one thousand genes26,27,28. Recombination within large inversions is suppressed between the different rearrangements. Consequently, inversions behave as a single locus, with alleles corresponding to the different haplotypic rearrangements29,30,31. In C. frigida, the two alleles, referred to as α and β, differ by more than 2.5% in coding regions and are observed at stable and intermediate frequencies in both European and North American C. frigida populations, suggesting that the haplotypic rearrangements diverged a long time ago and have been maintained ever since by balancing selection26,27,28. The widespread excess of αβ heterozygotes and the higher egg-to-adult survival of heterozygotes compared to αα and ββ homozygotes implicates that this polymorphism is partly maintained by overdominance, possibly due to deleterious alleles captured by the inversion32,33. Moreover, inversion frequencies correlate with environmental factors, suggesting that spatially varying selection combined with migration also contributes to maintain this polymorphism26,27,34. Previous work has also shown that the inversion strongly affects adult size, which is linked to female fertility and male reproductive success35,36,37, as well as development time38 which could modulate egg-to-adult survival. While the phenotypic effect on females is moderate, αα males can be up to three times larger than ββ males, but αα males can also take up twice as long to develop than ββ males26,38. This pattern suggests a trade-off between adult size and egg-to-adult development, which may result in a trade-off between fertility and survival. These findings make the inversion polymorphism in C. frigida a relevant empirical case to investigate the role of antagonistic pleiotropy in promoting polymorphism and to specifically test interactions with other mechanisms favouring the maintenance of variation, such as sex-specific effect, overdominance and spatially varying selection.

We combine experimental evolution and simulations to investigate the mechanisms of balancing selection underlying the maintenance of the inversion polymorphism in C. frigida (Fig. 1a) and to decipher the role of antagonistic pleiotropy in this process by estimating the effect of inversion alleles on different fitness components. First, we use experimental evolution (Fig. 1b) to follow the inversion frequencies over five generations and to estimate the relative fitness associated with each genotype. We determine the inversion genotypes in the eggs to specifically dissect the different components of fitness, such as egg-to-adult survival and reproductive bias. Second, realistic numerical simulations based on these experimentally estimated parameters allow quantifying the contribution of antagonistic pleiotropy in the maintenance of this inversion polymorphism. Both the experiments and simulations support that the life-history trade-off mediates balancing selection maintaining the inversion polymorphism, in interaction with other factors, namely sex-specific selection, trait-specific dominance and environment. Finally, we expand our simulation model to characterize how the combination of different selective mechanisms favours the persistence of polymorphism via antagonistic pleiotropy.

### Inversion dynamics during in vivo experimental evolution

The frequency of the α allele, originally at 27–36% in natural populations, increased significantly between generations 1 and 5 from 28–56% to 58–75% (glm z = 10.6, p < 0.001, Fig. 1c, Supplementary Tables 1 and 2). This was observed in all 16 replicates, with no significant effect of the substrate (glm p = 0.27, Supplementary Table 2). Initial differences in frequency between the two origins were lost at generation 5 (Fig. 1c–f, Supplementary Table 2). The increase in α frequency stemmed from a sharp and significant increase of αα homozygotes from 4–24% to 25–57% (glm z = 8.8, p < 0.001, Fig. 1d) and a reduction in ββ homozygotes from 12–48% to 0–15% between generation 1 and 5 (glm z = −7.9, p < 0.001, Fig. 1f). Proportions of αβ heterozygotes remained stable around 50% [36–73%] between generations 1 and 5 (glm z = −1.4, p = 0.16, Fig. 1e).

To disentangle the fitness components for the three genotypes and explore dominance effects, we genotyped the adults and the eggs at each generation for a subset of four replicates (Supplementary Table 1) and monitored frequency deviations due to biased survival or non-random reproduction. In combination with follow-up experiments and simulations, we evaluated four fitness parameters in the three genotypes: egg-to-adult survival, development time, female fecundity, and male reproductive success.

### The β allele confers a viability advantage to larvae

Egg-to-adult relative survival was significantly affected by genotype (lmm df = 2, F = 4.7, p< 0.01, Fig. 2a, Table 1, Supplementary Table 4, Supplementary Fig. 2). Survival of αα homozygotes was on average 21% lower than αβ heterozygote and l0% lower than ββ homozygote survival. Therefore, the increase of αα frequency during the experiments is a paradox given the lower viability of this genotype. Here, ββ and αβ relative survival rate did not differ significantly (mean difference 2%, t-test p = 0.99), suggesting a dominance effect of the β allele on egg-to-adult survival. This contrasts with the general finding of overdominance in European populations of C. frigida, where αβ heterozygote larvae survive better than both homozygotes32,33,39. Yet, the high αβ relative survival is known to be enhanced by increased competitive conditions33. Therefore, the performance of homozygotes in our data may be explained by the low-density conditions maintained in our experiment. We also calculated relative sex-specific survival rate of the three genotypes (Table 1). Although these estimates should be interpreted cautiously given their large variance, our data were consistent with previous estimates on C. frigida populations raised at low density33: overdominance of αβ was observed for male survival while ββ females tended to outperform αβ females (Table 1, Supplementary Fig. 1, Supplementary Table 3), suggesting that the effects of the inversion on viability as well as the dominance relationship differs between sexes.

The inversion also showed a sex-specific effect on the duration of the larval stage, with no significant difference among female genotypes (on average, αα: 9.0 days, αβ: 8.7 days, ββ: 9.0 days; glmm p = 0.57) but highly significant differences among males (glmm p < 0.001, Fig. 2b Supplementary Table 5, Supplementary Fig. 3). In our experimental conditions, (i.e. 25 °C and low-density) the development time was on average 50% longer for αα males (12.8 days), and 25% longer for αβ males (10.3 days) compared to ββ males (8.8 days). Such ordering between genotypes is consistent with previous studies, although absolute values are larger at higher density38,40 or lower temperature (pers. obs.). Therefore, the α allele is expected to confer an additional mortality risk in the natural environment for males by delaying sexual maturity, although this effect is challenging to quantify in the laboratory where the substrate is not limiting.

### The α allele confers a reproductive advantage

Egg genotype proportions significantly deviated from proportions expected under random mating, i.e. based on the Hardy–Weinberg proportions of the previous generation (combined probabilities of Χ² tests, p = 0.003). The deviation from random mating expectation in the eggs was significantly different between genotypes (lmm df = 2, F = 24.9, p < 0.001; Fig. 2c, Supplementary Table 6, Supplementary Fig. 4). It consistently corresponded to an excess of αα eggs (on average +43%), a slight excess of αβ eggs (+11%) and a strong deficit of ββ eggs (−44%). Those deviations mean that the α allele provided an important advantage for reproduction potentially underlying the increase of α frequency in our experiment.

The excess of the α allele in the eggs was partly explained by a higher fecundity of the females bearing the α allele. In a follow-up experiment on 90 females, αα and αβ females laid significantly more eggs than ββ females in their first clutch (t-test p = 0.03), with 15% more eggs on average (αα: 70 eggs, αβ: 68 eggs, ββ: 61 eggs—Fig. 2d), possibly due to the larger size associated with α also observed in females35. The number of eggs laid by αβ females and αα females did not differ significantly (t-test p = 0.60, Fig. 2d), suggesting that the advantage in female fecundity associated with the α allele may be dominant.

The excess of α in the eggs may also result from non-random mating favouring α males because of their larger body size. Previous experiments in C. frigida with two males and one female demonstrated that the largest male sires a disproportionately higher number of the progeny37. The reasons are two-fold: smaller males lose male–male competition by being dislodged by larger males41; and, even in single-pair experiments, smaller males are more likely to be rejected by females during mating than larger males, with ββ males being 30% and 20% less attractive than αα and αβ males, respectively36,39,42. To estimate male reproductive success in our experiment, we built an individual-based model simulating the evolution of inversion frequencies in silico over five generations with all parameters drawn from the experiment except for the male relative reproductive success (Fig. 3a, Table 1, Supplementary Table 7). A scenario with equal male reproductive success across genotypes did not fit our experimental data (Fig. 3b), meaning that the difference in female fecundity is not sufficient to explain the observed rise in α frequency and the excess of αα in the eggs. The best model fit was achieved by a 10-fold higher mating success in αα compared to ββ males (Tββm = 0.1, Fig. 3b, c, Supplementary Table 8), with co-dominant, intermediate values for αβ. Models involving dominance were explored but generally exhibited a lower model fit (Supplementary Table 8). Models including a negative frequency-dependent effect on mating success were also explored to account for a possible higher male–male competition between large-size males when the proportions of αα increased. Although those models adequately fit the experimental data, the global fit was not significantly better compared to simpler models (Supplementary Fig. 5, Supplementary Table 8).

### Modelling the impact of the trade-off on inversion dynamics

To explore how a life-history trade-off modulates inversion frequencies dynamics in nature and to test whether they can contribute to the maintenance of the polymorphism over time, we expanded the individual-based model to 200 generations and used a larger population size (K = 10,000). Our experiment provided realistic values for the different life-history parameters. Yet, the increase of α frequency to 50–75% in the experiment departs sharply from the lower α frequencies generally observed in natural populations (30–50%26,34). This means that there were key differences between wild and experimental conditions that the model needed to take into account. We identified three main differences: (i) our experimental boxes were likely less densely crowded than natural wrackbeds and density is known to affect relative egg-to-adult survival33, (ii) during the experiment, male–male competition was likely stronger than in nature because of restricted space and synchronized reproduction, which may have favoured large αα males over smaller ββ males, (iii) the experimental substrate was unlimited while, in nature, access to the resource is frequently disrupted by tides or storm-induced waves, which may prevent slow-developing males from reaching adulthood43. We thus explored how genotype proportions are affected by the following three parameters: (i) relative egg-to-adult survival, (ii) male relative reproductive success, and (iii) duration of habitat availability. We then estimated the probability of maintaining polymorphism vs. fixing one allele within both a realistic and theoretical (less constrained) parameter space.

The effect of density on egg-to-adult survival (i) could not account for frequency differences between the wild and the experiment (Fig. 4a). Yet, higher densities shifted the equilibrium proportions towards an excess of heterozygotes as observed in some natural populations (Fig. 4a, d).

Simulations reducing the relative reproductive success of αα males (ii) to about 2-fold (instead of 10-fold in our experiment) recovered an equilibrium of genotypic proportions close to values observed in the wild (Fig. 4b, d), confirming that the increase of α frequency in our experiments could be linked to a large reproductive advantage of males carrying the α allele, possibly intensified by male–male competition due to restricted experimental space and synchronised reproduction. In the wild, ββ males may also have easier access to females by reaching adulthood earlier than αα and αβ males. However, equal male reproductive success between the three genotypes led to higher ββ proportions than what is generally observed in nature and removing the α-female fecundity advantage led to the fixation of β allele. This suggests that the reproductive advantage conferred by the α allele in both males and females may also contribute to the persistence of α/β polymorphism in the wild.

Finally, the duration of habitat availability (iii) appeared to be a prominent factor affecting genotypic proportions and explaining the difference in genotype frequencies between our experiment and wild populations (Fig. 4c). This was mediated by a different balance between survival and reproductive advantage. When the habitat availability was shorter, relative survival of αα males, and to a lesser extent of αβ males, was reduced in comparison to the faster-developing females (all three genotypes) and ββ males (Fig. 5b, Supplementary Fig. 6). It is noteworthy that variation in the duration of habitat availability tended to cover the natural variability in genotypic proportions (Fig. 4c, d), suggesting that spatio-temporal heterogeneity in habitat availability could explain the variable balance of genotypes observed in nature.

### Polymorphism maintenance through antagonistic pleiotropy

Taken together, the antagonistic pleiotropy among inversion-associated fitness components, as revealed by our experimental evolution approach, can explain the maintenance of the inversion polymorphism in nature, albeit considering a different tuning of the trade-off between survival and reproduction. Simulations exploring these different factors simultaneously showed that the polymorphism is maintained for a wide parameter space of male reproductive success and environmental conditions (Fig. 5a, Supplementary Fig. 7), even at low density conditions, where heterozygote survival advantage is very weak (Fig. 5b, Supplementary Fig. 6). Remarkably, when the duration of the habitat availability was more variable, the range of conditions maintaining polymorphism was considerably expanded (Fig. 5c). Also, the frequency of the inversion was buffered around intermediate values (Supplementary Fig. 8), which suggests that spatially varying selection resulting from environmental heterogeneity strongly favours the coexistence of alternative life-history strategies and thus the underlying genetic polymorphism. As expected, in models accounting for strong overdominance in survival (medium density conditions, Supplementary Fig. 9) or negative frequency-dependant effect on male success (Supplementary Fig. 10), most combinations of parameter values led to a protected polymorphism.

Estimating total fitness, defined as the combination of egg-to-adult survival and reproductive success for each genotype and sex, showed that antagonistic pleiotropy translates into two mechanisms of balancing selection: overdominance and sexual antagonism (Fig. 5a). Overdominance naturally emerges from antagonistic pleiotropy, even in the absence of dominance for any given trait, particularly when selection is strong (Fig. 5d). In the case of C. frigida, heterozygotes benefit from the combination of a survival advantage associated with β and increased reproductive success associated with α. Variation in the strength and direction of dominance between fitness components, as observed in our experiment, with for instance overdominance in αβ male survival and dominance of β allele in female larval survival (Fig. 2), considerably expanded the parameter space leading to overdominance and polymorphism maintenance (Fig. 5d). Sexual antagonism emerges from the sex-specific effects of each allele on survival and reproduction (Fig. 5e), and expands the range of conditions allowing polymorphism.

### Discussion

Results from the experimental evolution trial coupled with simulations show that the inversion dynamics in C. frigida can be largely explained by an antagonistic relationship between viability and fecundity. Individual-based simulations revealed that this antagonism is probably stronger in natural conditions and contributes to the widespread maintenance of inversion polymorphism in the wild. Both empirical results and simulations emphasize the importance of accounting for the reproductive phase and sexual selection. In fact, when considering viability alone, the persistence of the α haplotype, and its increase in our experiment, represents a paradox that is only resolved by considering female fecundity and the reproductive bias favouring larger males37,42. The antagonistic pleiotropy between growth and reproduction observed in C. frigida at the inversion also corroborates the evidence for trade-offs mediated by body size44,45,46. Such trade-offs frequently emerge because a bigger size either requires prolonged development time or a faster growth rate, both mechanisms associated with a lower likelihood to reach the reproductive stage. For instance, artificial selection for large body size in the yellow dung fly (Scathophaga stercoraria) increases juvenile mortality, particularly in stressful environments, as well as development time, leading to mortality before reproduction due to winter frost47. For a wide range of species, environmental factors impose such limits to development time. For example, in snow voles (Chionomys nivalis) the benefits of higher body size are counteracted by the need to reach adult size before the first snowfall48,49. Similarly, in many annual plants, delayed flowering allows investment in vegetative growth and a subsequently higher number of seeds, but is constrained by the duration of the reproductive window20,24,50. In C. frigida, this effect is relevant given the instability of the wrackbed habitat due to tide and wind induced waves (here modelled as limited habitat availability). By disproportionately increasing mortality of the genotypes with larger reproductive success, the effect of the environment is expected to exacerbate the trade-offs reported during the experimental trials.

When life-history trade-offs are affected by external factors, environmental heterogeneity in space and time strongly contributes to the maintenance of variation by locally favouring different life-history strategies. For instance, in the yellow monkeyflower Mimulus spp., spatial heterogeneity in wetness and seasonal variation alternatively favours alleles determining early-flowering/low fecundity or late-flowering/high fecundity20,24,25. In C. frigida, wrackbed stability varies between locations and across seasons depending on the exposure of the beach to tidal and storm-induced waves. Such heterogeneity is expected to generate fluctuating selection regimes between a slow-development/high fertility strategy (αα) or a fast-development/low fertility strategy (ββ), leading to variations in inversion frequency, and enhancing the maintenance of polymorphism. This hypothesis, originally proposed by Day, Butlin et al.27,34, is supported by field observations reporting geographic variation in genotypic proportions coinciding with tidal pattern in Scandinavia27 and temporal variation in genotypic proportions in Great Britain, with the α allele increasing in frequency in summer when the wrackbed was less frequently disturbed by storms34. Geographic variation in genotypic proportions in natural populations of C. frigida are also associated with environmental variations, such as air temperature, depth and temperature of the wrackbed and substrate composition26,27,34. Although those factors may correlate with the duration of the wrackbed stability, they are also known to modulate the genotype–phenotype relationships and therefore the associated fitness. Controlled experiments in C. frigida show that substrate composition, temperature and density affect the relationship between genotype and survival, development time, body weight and body size (G × E effect)33,38,51,52. In nature, this translates to geographic variation in male and female sizes, as well as variation in size difference among genotypes and between the sexes26,36. This is non-trivial because adult size, and size difference between sexes or between competitors may modify the reproductive advantage associated with the α allele. For instance, experiments showed that a larger size difference between two competitor males relates to a higher success of the larger male37. In other words, local environmental conditions are expected to affect genotype-associated fitness values both for viability and reproductive success. This may enhance the effect of spatial or temporal heterogeneity by changing the slope of the trade-off between survival and fertility in C. frigida. This feature is likely more general and many life-history trade-offs are probably modulated by environmental effects, since genetic correlations are known to vary in direction and strength between environments53. Spatio-temporal environmental heterogeneity thus appears to have a bivalent effect, by causing not only fluctuations in selection but also in the intensity of the trade-off. Both of these aspects interact and favour the maintenance of genetic variation.

Polymorphism generated by antagonistic pleiotropy is also predicted to be enhanced by different trade-off intensities for each sex19. Sex-specific selection on different fitness components frequently occurs in nature, since optimal age/size at maturity or the physiological state to achieve high fertility often differs between sexes9,54. C. frigida provides an empirical case of antagonistic pleiotropy whereby the strength of selection, but not the direction, differs between sexes for each fitness component. Differences between genotypes in size, development time, fertility and survival are stronger for males than for females33,35,36,52. Within a range of realistic parameters, sex-specific effects, even without antagonism for each fitness components, results in sexual antagonism for total fitness, therefore strongly protecting polymorphism19. Moreover, even without sexual antagonism in total fitness, our simulations suggest that sex-specific selection varying between components of fitness delays the fixation of alleles (Supplementary Fig. 11). While this mechanism cannot on its own protect long-term polymorphism, the slower rate of allelic fixation might allow additional factors, such as environmental heterogeneity, to prevent allele fixation22,55.

Inversions are frequently reported as polymorphic and maintained over long evolutionary timescales by balancing selection30,31,61,62. While one of the reasons could be the genetic load linked to the lack of recombination, we argue that antagonistic pleiotropy may also be a more important feature of the inversion systems than previously acknowledged. In fact, the particular architecture of inversions leads them to behave as a single-locus because of the stark reduction of recombination between rearrangements, but they are composed of dozens to hundreds of genes, sometimes interacting in a so-called “supergene” complex29,63,64, where combinations of alleles at different genes lead to highly differentiated phenotypes. Inversion haplotypic rearrangements can thus have large, pleiotropic effects on complex phenotypes, which will be under various selective pressures, possibly in opposing direction30. For example, in the longwing butterfly Heliconius numata an inversion polymorphism determining wing colour pattern is under opposing pleiotropic selection, with survival under positive frequency-dependent selection and reproduction under negative frequency-dependent selection65. In the ruff Calidris pugnax, a polymorphic inversion that determines male reproductive morphs carries inverted alleles with lethal effect on survival but positive effects on testis size, indicating a possibly higher reproductive success, which is also under frequency-dependent selection66. In the rainbow trout Oncorhynchus mykiss, alternate reproductive tactics are determined by a large polymorphic inversion, which involves a trade-off between reproductive output and longevity67. The accumulating evidence from the literature combined with our results on C. frigida thus indicate that contrasted effects of inversions on different fitness components may represent a non-negligible process involved in the maintenance of inversion polymorphism. Clearly, antagonistic pleiotropy and the effect of a wide spectrum of fitness traits on pleiotropic interactions must be considered more carefully when studying the evolutionary importance of structural genomic variants, which are currently increasingly documented to underlie complex phenotypes30,31,61.

### Experimental evolution in vivo

Wild C. frigida adults were collected at two locations along the Canadian Atlantic coast, in May 2017 at Cap Espoir, Québec (CE: 48.43087, −64.32778) and in June 2017 at Kamouraska, Québec (KA: 47.56294, −69.87375). The two populations (KA and CE) were kept separated and represented two distinct experimental lines. Wild-caught flies (KA: 303 females, 218 males, CE: 396 females, 570 males) constituted the generation 0 of the experimental evolution and were used to initiate 20 replicates of experimental evolution (2 populations × 2 substrates × 5 replicates).

To disentangle the fitness components, the generations were non-overlapping and we explicitly separated the growing phase from the reproductive phase. For reproduction, the pool of adults was left overnight at 25 °C on a layer of seaweeds, either Laminariaceae or Fucaceae. After 16–20 h, adults were preserved in ethanol. Eggs were collected by flooding the seaweed substrate with 3% salt water in each box. Egg density in the salt water solution was counted over a grid in three subsets of 3 mL each and a volume corresponding to an estimated 1000 eggs was filtered over a dark piece of linen. This subset of 1000 eggs was transferred to a controlled mass of seaweeds to start the growing phase of the next generation, and a subset of the remaining eggs was preserved in ethanol or RNAlater for subsequent genotyping. Upon emergence, adults were collected every day by aspiration and kept in conditions favouring high survival but without reproduction (5 °C, dark, with a solution of Mannitol 0.5%). When all adults emerged, we initiated the following generation with the same procedure, allowing free mating between all adults emerging from one replicate.

Each replicate was kept isolated and maintained for five generations under semi-natural conditions, controlling for density, temperature (25 °C), humidity (50%), light duration (12 h) and substrate. Raising boxes were identical plastic boxes measuring 12 × 8 × 18 cm closed with a hermetic lid perforated with four holes filled with sponge to allow air exchange but preventing adult flies to escape. Stones and sand were put in the bottom of the box to facilitate drainage. Half of the replicates were kept on 90% Laminariaceae/10% Fucaceae and the other half on 90% Fucaceae/10% Laminariaceae, the two main substrates on which C. frigida flies are naturally found in this region (Fig. 1b). Seaweeds were collected (either attached to the substrate or floating in the sea) in Métis sur Mer, Québec (48.66408, −68.07221) repeatedly over the course of the experiment. Seaweed were transported to Laval University and frozen for subsequent use as substrate in the experiment. Freezing allows a good conservation of the seaweed and destroys any eggs, parasites or larvae. The 1000 eggs started on 500 g unfrozen and cut seaweed substrate with subsequent addition of 200 g of seaweeds after ~6 days. This procedure ensured a controlled larval density between replicates.

Only 16 replicates (8 replicates per population) were kept until generation 5 due to stochastic crashes in population size experienced by some replicates because, to avoid founder effects, we allowed reproduction and the initiation of a subsequent generation only for those replicates in which more than 100 adults emerged per generation.

The frequency of the inversion and the proportions of the three inversion genotypes (αα, αβ, ββ) were estimated by genotyping adults (40–95 per replicate) and eggs (28–51 per replicate), using a diagnostic SNP assay26 after having extracted genomic DNA for each individual adult/egg with a lysis procedure. Variation in allele or genotype proportions between generations 1 and 5 was analysed with a generalized linear-mixed model for binomial data in the R package lme468 and lmerTest69 taking into account the identity of the replicate as a random factor and substrate or population as covariates.

### Estimates of fitness

Relative survival rate of each genotype (and sex) was calculated by comparing the genotypic proportions in adults to the genotypic proportions observed in the eggs. The ratio of these values is expected to be 1 if all genotypes survive equally. Survival rate was calculated in 32 replicates (16 replicates at generation 1, 4 replicates at generation 2, 3, 4 and 5). Similarly, we calculated the deviation from random mating in the eggs as the ratio of genotypic proportions observed in the eggs over the expected proportions under random mating, i.e. Hardy–Weinberg proportions of the previous generation. This ratio is expected to be around 1 if mating occurs randomly and if all genotypes reproduce equally. Deviation from random mating in the eggs was estimated in 18 replicates (2 populations at generation 1, and 4 replicates at generations 2, 3, 4 and 5). Differences between genotypes and between sexes in survival rate or in the deviation from random mating in the eggs were tested with a linear mixed model, taking into account the identity of the replicate as random factor, and a paired post-hoc t-test corrected following Benjamini and Hochberg70.

Development time was measured at generation 5 on 757 adults as the number of days between the day of oviposition and adult emergence. Differences in development time between genotypes and sex were tested with a generalized linear-mixed model, based on a Poisson distribution, taking into account the identity of the replicate as a random factor. Each factor and interaction was added sequentially and nested models were built with the R package lme468 and compared to each other with a χ²-test to determine whether the additional factor significantly improved the explicative power of the model.

Female fecundity was assessed in a follow-up experiment on a laboratory population established from eggs laid by wild-caught females from Kamouraska, Québec, raised under the same conditions as described for the previous experiment (25 °C, substrate of 90% Laminariaceae–10% Fucaceae, low density). After 7 days, pupae were collected every day and kept isolated in a 2 mL tube with cotton soaked in 0.05% Mannitol. Upon emergence, adults were kept for a few days at 5 °C in the dark. A total of 161 pairs of one male and one female were formed. On the day of the experiment, each pair was transferred into a small box (5 × 5 × 5 cm) with a mix of Laminariaceae and Fucaceae, and left overnight (for 16 h) at 25 °C (identical conditions as used in the previous experiment for mating and egg laying). After 16 h, if a clutch of eggs was observed, it was collected and preserved on a dark-linen in a 1.5 mL tube of RNAlater, and the parents were preserved in ethanol. If no eggs were observed, the pair was kept in the same experimental box and the eggs were checked again after an extra 8 h. We did this to maximize the number of females for which we could count eggs (eggs after 16 h: 69 females, eggs after 24 h: 120 females). We genotyped 90 females for the inversion using the method described above and counted their eggs under a binocular magnifier (Zeiss Stemi 2000C). Variation in the number of eggs per female in relation to female genotype (αα, αβ, ββ) and time of laying (16 h/24 h) was analysed with a linear model and post hoc pairwise t-test (adjusted following Benjamini and Hochberg70). Whether eggs were laid after 16 or 24 h did not significantly affect the number of eggs per female (F1,88 = 0.09, p = 0.76), or the interaction with genotype (F2,84 = 1.96, p = 0.15), and we therefore pooled the data for all subsequent analyses to derive female fitness parameters in the model.

All statistical analyses were performed using the software R 3.5.071.

### Simulating evolution in silico

To evaluate how fitness differences between genotypes based on a different investment in the trade-off between survival and reproduction modulates the evolution of the inversion frequency, we developed an individual-based model inspired by the results of the experimental evolution trials and by the biological characteristics of C. frigida (Fig. 3a).

The model is fully described in Supplementary Methods following the ODD protocol72 and the basic principles of the model are as follows: generations are non-overlapping, the time step is one generation and each generation proceeds in two phases, growth and reproduction. Growth phase: Each egg goes through a growing period, during which its survival is determined by chance following a binomial law with a survival probability determined by the product between global egg-to-adult survival rate (30%) and relative genotype-sex-specific survival (SXX-s) depending on its sex (s = f or m) and genotype (XX = αα or αβ or ββ). The surviving larva then matures into an adult if its development time is shorter than the duration of its habitat. Individual development time (Di), is calculated for each individual larva as the cubic root of three values randomly drawn from a uniform distribution whose mean and range depends on sex and genotype (means are the experimental estimates reported in the result section for each genotype and sex, and range is determined with a variation coefficient of 0.5). We chose this distribution as the one fitting best the measured experimental data of the development time. The duration of habitat availability is randomly drawn for each individual larva from a uniform distribution determined by two parameters: mean duration (Amean) and variability (Avar). Mean duration is shared by all individuals of a given simulation, and thus takes into account the non-independence in the availability of the habitat, i.e. the fact that when the wrackbed is removed by a high tide or a storm, all individuals are affected at the same time. Variability represents individual heterogeneity in the total duration before the wrackbed is removed, which depends for instance, at which date a given egg was laid or whether all patches of wrackbed are deposited or removed at the same time. Reproduction phase: All females reproduce once and lay a number of eggs determined by the product between the number of eggs laid by a female (70 eggs) and genotype-specific female fertility (TXX-f). For each female, a male partner is randomly drawn from the pool of reproductive males. This pool of reproductive males is generated at each reproductive step based on the distribution of adult males {Nαα-m; Nαβ-m; Nββ-m}, corrected by genotype-specific male reproductive success (TXX-m). Note that this procedure allows to account for the fact that each male can mate several times. Additional models taking into account a frequency-dependant effect are presented in Supplementary Information. Egg genotype is determined by Mendelian inheritance from parental genotypes, by randomly drawing one allele from the mother, and one allele from the father, and egg sex is determined by chance with no bias (sex-ratio = 0.5). A subset of K eggs initiates the next generation, mirroring the census made in the experiment or a limited carrying capacity in the wild. At each generation, we record the proportions of the three genotypes in the eggs and the adults. The model was implemented in Rust 1.32.0 (9fda7c223 2019-01-16).

The model outcomes were analysed at three levels.

1. (i)

First, we ran the model for five generations with parameters drawn from the experiment (Table 1, Supplementary Table 7) and no limited duration of habitat availability (Amean = 30 days) while varying the relative male reproductive success (30 replicates per set of parameters). The fit of each simulation to empirical data was quantified by computing the normalized root‐mean‐squared error (nRMSE) for each genotypic proportion from generation 1 to 5. The average nRMSE over the six variables (αα/αβ/ββ proportions in the eggs and the adults), and over the 30 replicates, was taken as an index of fit, with the best predicting scenarios having the smallest values. Difference of mean nRMSE between the best scenarios was tested with a t-test based on the 30 replicates, corrected following70.

2. (ii)

Second, to simulate evolution in a natural population, we then ran the model with larger K (10,000) for 200 generations, with 30 replicates per set of parameters, and included variation in habitat availability (Supplementary Table 7). The equilibrium in genotypic proportions was compared to the frequencies observed in nature and we explored the combination of parameters possibly influencing the frequency of the inversion. The proportion of the three genotypes under each scenario was visualized in ternary plots built with the R package ggtern73. We tested the role of density by exploring several sets of survival values based on previous laboratory studies33 (Supplementary Table 3). For relative male reproductive success, the full range of parameters was explored because this parameter could not be estimated empirically and possibly varies between populations with natural variation in adult size26,51. Habitat availability was set between 7 and 20 days, i.e. the range of development time found in our experimental conditions, to estimate the interplay between those two factors. Yet, in natural conditions, both the range of wrackbed availability and development time are expected to be wider. Development time varies with density and temperature, although the ordering of emergence between the three male genotypes remains unchanged. Wrackbeds are expected to be removed cyclically by spring tides, so the actual availability would be slightly less than 14 or 28 days, but they are sometimes observed to last shorter because of storms43. Finally, for the parameters that are more likely to vary in natural populations (male relative success, duration of habitat availability, density, variability in the duration of habitat availability), we explored which combinations of realistic parameters maintained polymorphism after 200 generations, what was the mean frequency of the inversion at equilibrium after 100 replicates, which portion of the parameter space lead to polymorphism and whether overdominance or sexual antagonism emerged for total fitness.

3. (iii)

Third, the backbone of the model was used to theoretically explore the range of conditions under which antagonistic pleiotropy could maintain polymorphism at evolutionary time-scales when combined with sex-specific effects and dominance, (K = 10,000, 500 generations, 100 replicates per set of parameters, initial proportions were set to Hardy–Weinberg proportions, with the frequency of α being 0.5).These simulations explored the whole theoretical parameter range for survival and reproduction (Table 1, Supplementary Table 7), with either various scenarios of dominance, coded by the parameters Hs/Ht, or sex-specific effects with independent values for sm/tm and sf/tf ranging between 0 and 1. We surveyed the proportions of the simulations that led to maintenance of polymorphism vs. the simulations in which one of the genotypes got fixed, as well as the emerging mechanism at the level of total fitness (overdominance in one/both sex, sexual antagonism).

Total fitness (WXX-s) was calculated as the product of relative survival rate and reproductive success, for each genotype or allele as

$$W_{xx - s} = S_{xx - s} \cdot T_{xx - s}$$

(1)

$$w_{\alpha - s} = w_{\alpha a - s} \, + \, \frac{1}{2} \, \cdot \, w_{\alpha \beta - s}$$

(2)

$$w_{\beta - s} = w_{\beta \beta - s} \, + \, \frac{1}{2} \, \cdot \, w_{\alpha \beta - s}$$

(3)

Overdominance for total fitness corresponded to cases in which

$$w_{a\beta - s} \, > \, w_{\alpha \alpha - s}\,\, \& \,\, w_{a\beta - s} \, > \, w_{\beta \beta - s}$$

(4)

Sexual antagonism for total fitness emerged in cases under which

$$w_{a - m} \, > \, w_{\beta - m}\,\, \& \,\, w_{a - f} \, < \, w_{\beta - f}$$

(5)

or

$$w_{a - m} \, < \, w_{\beta - m}\,\, \& \,\, w_{a - f} \, > \, w_{\beta - f}$$

(6)

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

### Data availability

The authors declare that all data supporting the findings of this study are available within the paper and its supplementary information files. The experimental and simulated data underlying all figures and all supplementary figures are provided as a Source Data file.

### Code availability

Simulation code and parameters files are available at https://github.com/enormandeau/coelopa_fastsim.

### References

1. 1.

Brian, Charlesworth & Hugues, K. A. The maintenance of genetic variation in life-history traits. In Evolutionary Genetics from Molecules to Morphology (eds Singh, R. S. & Krimbas, C. B.) 369–392 (Cambridge University Press, Cambridge, 2000).

2. 2.

Charlesworth, B. Causes of natural variation in fitness: evidence from studies of Drosophila populations. Proc. Natl Acad. Sci. USA112, 1662–1669 (2015).

3. 3.

Llaurens, V., Whibley, A. & Joron, M. Genetic architecture and balancing selection: the life and death of differentiated variants. Mol. Ecol.26, 2430–2448 (2017).

4. 4.

Gloss, A. D. & Whiteman, N. K. Balancing selection: walking a tightrope. Curr. Biol.26, R73–R76 (2016).

5. 5.

Prout, Timothy. How well does opposing selection maintain variation? In Evolutionary Genetics From Molecules to Morphology (eds Singh, R. S. & Krimbas, C. B.) 157–181 (Cambridge University Press, Cambridge, 2000).

6. 6.

Levene, H. Genetic equilibrium when more than one ecological niche is available. Am. Nat.87, 331–333 (1953).

7. 7.

Hedrick, P. W. Genetic polymorphism in heterogeneous environments: the age of genomics. Annu. Rev. Ecol. Evol. Syst.37, 67–93 (2006).

8. 8.

Wittmann, M. J., Bergland, A. O., Feldman, M. W., Schmidt, P. S. & Petrov, D. A. Seasonally fluctuating selection can maintain polymorphism at many loci via segregation lift. Proc. Natl Acad. Sci. USA114, E9932–E9941 (2017).

9. 9.

Barson, N. J. et al. Sex-dependent dominance at a single locus maintains variation in age at maturity in salmon. Nature528, 405 (2015).

10. 10.

Grieshop, K. & Arnqvist, G. Sex-specific dominance reversal of genetic variation for fitness. PLoS Biol.16, e2006810 (2018).

11. 11.

Pearse, D. E. et al. Sex-dependent dominance maintains migration supergene in rainbow trout. Nat. Ecol. Evol. 3, 1731–1742 (2019).

12. 12.

Stearns, S. C. Trade-offs in life-history evolution. Funct. Ecol.3, 259–268 (1989).

13. 13.

Johnston, S. E. et al. Life history trade-offs at a single locus maintain sexually selected genetic variation. Nature502, 93–95 (2013).

14. 14.

Rose, M. R. Antagonistic pleiotropy, dominance, and genetic variation. Heredity48, 63 (1982).

15. 15.

Rose, M. R. Life history evolution with antagonistic pleiotropy and overlapping generations. Theor. Popul. Biol.28, 342–358 (1985).

16. 16.

Hedrick, P. W. Antagonistic pleiotropy and genetic polymorphism: a perspective. Heredity82, 126–133 (1999).

17. 17.

Rose, M. R., Service, P. M. & Hutchinson, E. W. Three approaches to trade-offs in life-history evolution. In Genetic Constraints on Adaptive Evolution (ed Loeschcke, V.) 91–105 (1987).

18. 18.

Curtsinger, J. W., Service, P. M. & Prout, T. Antagonistic pleiotropy, reversal of dominance, and genetic polymorphism. Am. Nat.144, 210–228 (1994).

19. 19.

Zajitschek, F. & Connallon, T. Antagonistic pleiotropy in species with separate sexes, and the maintenance of genetic variation in life‐history traits and fitness. Evolution, 72, 1306–1316 (2018).

20. 20.

Brown, K. & Kelly, J. Antagonistic pleiotropy can maintain fitness variation in annual plants. J. Evol. Biol.31, 46–56 (2018).

21. 21.

Tellier, A., Villaréal, L. M. M. A. & Giraud, T. Antagonistic pleiotropy may help population-level selection in maintaining genetic polymorphism for transmission rate in a model phytopathogenic fungus. Heredity98, 45 (2006).

22. 22.

Connallon, T. & Chenoweth, S. F. Dominance reversals and the maintenance of genetic variation for fitness. PLoS Biol.17, e3000118 (2019).

23. 23.

Dooren, T. J. V. Protected polymorphism and evolutionary stability in pleiotropic models with trait‐specific dominance. Evolution60, 1991–2003 (2006).

24. 24.

Troth, A., Puzey, J. R., Kim, R. S., Willis, J. H. & Kelly, J. K. Selective trade-offs maintain alleles underpinning complex trait variation in plants. Science361, 475–478 (2018).

25. 25.

Mojica, J. P., Lee, Y. W., Willis, J. H. & Kelly, J. K. Spatially and temporally varying selection on intrapopulation quantitative trait loci for a life history trade‐off in Mimulus guttatus. Mol. Ecol.21, 3718–3728 (2012).

26. 26.

Mérot, C. et al. Intercontinental karyotype–environment parallelism supports a role for a chromosomal inversion in local adaptation in a seaweed fly. Proc. R. Soc. B Biol. Sci. 285, 20180519 (2018).

Sours: https://www.nature.com/articles/s41467-020-14479-7

## Balancing selection

Balancing selection refers to a number of selective processes by which multiple alleles (different versions of a gene) are actively maintained in the gene pool of a population at frequencies larger than expected from genetic drift alone. This can happen by various mechanisms, in particular, when the heterozygotes for the alleles under consideration have a higher fitness than the homozygote.[1] In this way genetic polymorphism is conserved.[2]

Evidence for balancing selection can be found in the number of alleles in a population which are maintained above mutation rate frequencies. All modern research has shown that this significant genetic variation is ubiquitous in panmictic populations.

There are several mechanisms (which are not exclusive within any given population) by which balancing selection works to maintain polymorphism. The two major and most studied are heterozygote advantage and frequency-dependent selection.

### Mechanisms

Sickle-shaped red blood cells. This non-lethal condition in heterozygotes is maintained by balancing selection in humans of Africa and India due to its resistance to the malarial parasite.

Malaria versus sickle-cell trait distributions

In heterozygote advantage, or heterotic balancing selection, an individual who is heterozygous at a particular gene locus has a greater fitness than a homozygous individual. Polymorphisms maintained by this mechanism are balanced polymorphisms.[3] Due to unexpected high frequencies of heterozygotes, and an elevated level of heterozygote fitness, heterozygotic advantage may also be called "overdominance" in some literature.

A well-studied case is that of sickle cell anemia in humans, a hereditary disease that damages red blood cells. Sickle cell anemia is caused by the inheritance of an allele (HgbS) of the hemoglobingene from both parents. In such individuals, the hemoglobin in red blood cells is extremely sensitive to oxygen deprivation, which results in shorter life expectancy. A person who inherits the sickle cell gene from one parent and a normal hemoglobin allele (HgbA) from the other, has a normal life expectancy. However, these heterozygote individuals, known as carriers of the sickle cell trait, may suffer problems from time to time.

The heterozygote is resistant to the malarial parasite which kills a large number of people each year. This is an example of balancing selection between the fierce selection against homozygous sickle-cell sufferers, and the selection against the standard HgbA homozygotes by malaria. The heterozygote has a permanent advantage (a higher fitness) wherever malaria exists.[4][5] Maintenance of the HgbS allele through positive selection is supported by significant evidence that heterozygotes have decreased fitness in regions where malaria is not prevalent. In Surinam, for example, the allele is maintained in the gene pools of descendants of African slaves, as the Surinam suffers from perennial malaria outbreaks. Curacao, however, which also has a significant population of individuals descending from African slaves, lacks the presence of widespread malaria, and therefore also lacks the selective pressure to maintain the HgbS allele. In Curacao, the HgbS allele has decreased in frequency over the past 300 years, and will eventually be lost from the gene pool due to heterozygote disadvantage.[6]

### Frequency-dependent selection

Main article: Frequency-dependent selection

Frequency-dependent selection occurs when the fitness of a phenotype is dependent on its frequency relative to other phenotypes in a given population. In positive frequency-dependent selection the fitness of a phenotype increases as it becomes more common. In negative frequency-dependent selection the fitness of a phenotype decreases as it becomes more common. For example, in prey switching, rare morphs of prey are actually fitter due to predators concentrating on the more frequent morphs. As predation drives the demographic frequencies of the common morph of prey down, the once rare morph of prey becomes the more common morph. Thus, the morph of advantage now is the morph of disadvantage. This may lead to boom and bust cycles of prey morphs. Host-parasite interactions may also drive negative frequency-dependent selection, in alignment with the Red Queen hypothesis. For example, parasitism of freshwater New Zealand snail (Potamopyrgus antipodarum) by the trematode Microphallus sp. results in decreasing frequencies of the most commonly hosted genotypes across several generations. The more common a genotype became in a generation, the more vulnerable to parasitism by Microphallus sp. it became.[7] Note that in these examples that no one phenotypic morph, nor one genotype is entirely extinguished from a population, nor is one phenotypic morph nor genotype selected for fixation. Thus, polymorphism is maintained by negative frequency-dependent selection.

### Fitness varies in time and space

The fitness of a genotype may vary greatly between larval and adult stages, or between parts of a habitat range.[8] Variation over time, unlike variation over space, is not in itself enough to maintain multiple types, because in general the type with the highest geometric mean fitness will take over, but there are a number of mechanisms that make stable coexistence possible.[9]

### More complex examples

Species in their natural habitat are often far more complex than the typical textbook examples.

### Grove snail

The grove snail, Cepaea nemoralis, is famous for the rich polymorphism of its shell. The system is controlled by a series of multiple alleles. Unbanded is the top dominant trait, and the forms of banding are controlled by modifier genes (see epistasis).

Grove snail, dark yellow shell with single band.

In England the snail is regularly preyed upon by the song thrushTurdus philomelos, which breaks them open on thrush anvils (large stones). Here fragments accumulate, permitting researchers to analyse the snails taken. The thrushes hunt by sight, and capture selectively those forms which match the habitat least well. Snail colonies are found in woodland, hedgerows and grassland, and the predation determines the proportion of phenotypes (morphs) found in each colony.

A second kind of selection also operates on the snail, whereby certain heterozygotes have a physiological advantage over the homozygotes. Thirdly, apostatic selection is likely, with the birds preferentially taking the most common morph. This is the 'search pattern' effect, where a predominantly visual predator persists in targeting the morph which gave a good result, even though other morphs are available.

The polymorphism survives in almost all habitats, though the proportions of morphs varies considerably. The alleles controlling the polymorphism form a supergene with linkage so close as to be nearly absolute. This control saves the population from a high proportion of undesirable recombinants.

In this species predation by birds appears to be the main (but not the only) selective force driving the polymorphism. The snails live on heterogeneous backgrounds, and thrush are adept at detecting poor matches. The inheritance of physiological and cryptic diversity is preserved also by heterozygous advantage in the supergene.[10][11][12][13][14] Recent work has included the effect of shell colour on thermoregulation,[15] and a wider selection of possible genetic influences is also considered.[16]

### Chromosome polymorphism in Drosophila

In the 1930s Theodosius Dobzhansky and his co-workers collected Drosophila pseudoobscura and D. persimilis from wild populations in California and neighbouring states. Using Painter's technique,[17] they studied the polytene chromosomes and discovered that all the wild populations were polymorphic for chromosomal inversions. All the flies look alike whatever inversions they carry, so this is an example of a cryptic polymorphism. Evidence accumulated to show that natural selection was responsible:

Drosophilapolytene chromosome
1. Values for heterozygote inversions of the third chromosome were often much higher than they should be under the null assumption: if no advantage for any form the number of heterozygotes should conform to Ns (number in sample) = p2+2pq+q2 where 2pq is the number of heterozygotes (see Hardy–Weinberg equilibrium).
2. Using a method invented by L'Heretier and Teissier, Dobzhansky bred populations in population cages, which enabled feeding, breeding and sampling whilst preventing escape. This had the benefit of eliminating migration as a possible explanation of the results. Stocks containing inversions at a known initial frequency can be maintained in controlled conditions. It was found that the various chromosome types do not fluctuate at random, as they would if selectively neutral, but adjust to certain frequencies at which they become stabilised.
3. Different proportions of chromosome morphs were found in different areas. There is, for example, a polymorph-ratio cline in D. robusta along an 18-mile (29 km) transect near Gatlinburg, TN passing from 1,000 feet (300 m) to 4,000 feet.[18] Also, the same areas sampled at different times of year yielded significant differences in the proportions of forms. This indicates a regular cycle of changes which adjust the population to the seasonal conditions. For these results selection is by far the most likely explanation.
4. Lastly, morphs cannot be maintained at the high levels found simply by mutation, nor is drift a possible explanation when population numbers are high.

By 1951 Dobzhansky was persuaded that the chromosome morphs were being maintained in the population by the selective advantage of the heterozygotes, as with most polymorphisms.[19][20][21]

### References

1. ^King, R.C.; Stansfield, W.D.; Mulligan, P.K. (2006). A dictionary of genetics (7th ed.). Oxford: Oxford University Press. p. 44.
2. ^Ford, E.B. (1940). "Polymorphism and taxonomy". In J. Huxley (ed.). The New Systematics. Oxford: Clarendon Press. pp. 493–513.
3. ^Heredity. 2009. Encyclopædia Britannica. Chicago.
4. ^Allison A.C. 1956. The sickle-cell and Haemoglobin C genes in some African populations. Ann. Human Genet.21, 67-89.
5. ^Sickle cell anemia. 2009. Encyclopædia Britannica. Chicago.
6. ^David Wool. 2006. The Driving Forces of Evolution: Genetic Processes in Populations. 80-82.
7. ^Koskella, B. and Lively, C. M. (2009), EVIDENCE FOR NEGATIVE FREQUENCY-DEPENDENT SELECTION DURING EXPERIMENTAL COEVOLUTION OF A FRESHWATER SNAIL AND A STERILIZING TREMATODE. Evolution, 63: 2213–2221. doi:10.1111/j.1558-5646.2009.00711.x
8. ^Ford E.B. 1965. Genetic polymorphism, p26, Heterozygous advantage. MIT Press 1965.
9. ^Bertram, Jason; Masel, Joanna (20 March 2019). "Different mechanisms drive the maintenance of polymorphism at loci subject to strong versus weak fluctuating selection". Evolution. 73 (5): 883–896. doi:10.1111/evo.13719. hdl:10150/632441. PMID 30883731.
10. ^Cain A.J. and Currey J.D. Area effects in Cepaea. Phil. Trans. R. Soc. B246: 1-81.
11. ^Cain A.J. and Currey J.D. 1968. Climate and selection of banding morphs in Cepaea from the climate optimum to the present day. Phil. Trans. R. Soc. B253: 483-98.
12. ^Cain A.J. and Sheppard P.M. 1950. Selection in the polymorphic land snail Cepaea nemoralis (L). Heredity4:275-94.
13. ^Cain A.J. and Sheppard P.M. 1954. Natural selection in Cepaea. Genetics 39: 89-116.
14. ^Ford E.B. 1975. Ecological genetics, 4th ed. Chapman & Hall, London
15. ^Jones J.S., Leith B.N. & Rawlings P. 1977. Polymorphism in Cepaea: a problem with too many solutions. Annual Review of Ecology and Systematics8, 109-143.
16. ^Cook L.M. 1998. A two-stage model for Cepaea polymorphism. Phil. Trans. R. Soc. B353, 1577-1593.
17. ^Painter T.S. 1933. "A new method for the study of chromosome rearrangements and the plotting of chromosome maps". Science78: 585–586.
18. ^Stalker H.D and Carson H.L. 1948. "An altitudinal transect of Drosophila robusta". Evolution1, 237–48.
19. ^Dobzhansky T. 1970. Genetics of the evolutionary process. Columbia University Press N.Y.
20. ^[Dobzhansky T.] 1981. Dobzhansky's genetics of natural populations. eds Lewontin RC, Moore JA, Provine WB and Wallace B. Columbia University Press N.Y.
21. ^Ford E.B. 1975. Ecological genetics. 4th ed. Chapman & Hall, London.
Sours: https://en.wikipedia.org/wiki/Balancing_selection

715 716 717 718 719