Variation in mutation biases can obscure signals of selection on codon usage
Ahmed Abdellatif
Department of Genetics, Rutgers University, Piscataway, New Jersey
Corresponding Author: a.abdellatif@rutgers.edu
Introduction:
The genetic code contains 61 sense codons (as well as 3 stop codons) used to create the 20 amino acids needed to make proteins. The lack of a one-to-one relationship has led many researchers to describe the genetic code as redundant. Of these 20 amino acids, 18 are coded for by more than one codon; the non-uniform usage of synonymous codons is common. This is called codon usage bias (CUB). Mutational biases are well-known to shape CUB; however, it is also thought that natural selection drives non-synonymous codon usage. Early work on CUB noted that codon use correlates with tRNA abundances, and this correlation is strongest in the most highly expressed genes (1, 2). Importantly, the tRNA pool is not uniform: some codons have more corresponding tRNA than their synonyms. Thus, it is hypothesized that selection for translation efficiency is the major driver of adaptive CUB. Other selective forces are also thought to shape CUB. For example, amino acids conserved in orthologous proteins show more biased codon usage, leading to the hypothesis that translation accuracy also drives selection on codon usage (3, 4).
Due to the relationship between codon usage, the tRNA pool, and gene expression, multiple metrics have been developed to quantify the CUB of a gene—which serves as a reliable predictor of gene expression. Such metrics include the Codon Adaptation Index (CAI) (5) and the tRNA Adaptation Index (tAI) (6). CAI compares the similarity of codon frequencies of a gene and a set of highly expressed genes; tAI estimates the degree of adaptation of a gene to the tRNA pool. A significant correlation between a species’ CUB metrics and empirical estimates of gene expression obtained via high-throughput transcriptomic and proteomic measurements often suggests that selection for translation efficiency is sufficient to shape codon usage patterns in that species. Notably, such correlations are typically strongest in species with large effective population sizes (7). Species with smaller effective population sizes, such as humans, typically show very little adaptive CUB; this is because genetic drift in species with low effective population sizes dominates the selection for translation efficiency, which is very weak overall.
The classic population geneticist model of CUB proposed by Bulmer assumes that, under weak selection on translation efficiency, codon usage frequencies will be at a selection-mutation-drift balance (8). More recent models have attempted to model the evolution of CUB more explicitly by combining mechanistic models with population genetics theory (9–12).
A recent study by LaBella et al. (13) quantified the variability in codon usage across 327 yeast species within the Saccharomycontina subphylum. Codon usage was quantified by comparing the relative synonymous codon usage (RSCU) across species (5). To estimate genome-wide selection for translation efficiency, Labella et. al. used a regression-based approach that combined the the tRNA Adaptation Index with another CUB metric, the effective number of codons (Nc) (14). LaBella et al. found that 264 genomes exhibited moderate to significant translational selection; this was a genome-wide parameter, not applicable to specific codons. LaBella et al., however, did not separate the effects of mutation and selection when quantifying codon usage, and their phylogenetic analysis of codon usage was limited to detecting phylogenetic signals.
Recent work found that signals of selection and mutation bias can be confounded by a large introgression, as is the case in the yeast L. kluyveri (15). Introgression is the cross-species transfer of genetic material between the two parent species of an interspecies hybrid via repeated backcrossing of the hybrid with one of the parent species. By analyzing the CUB of ancestral and introgressed genes separately, the ability to explain codon usage patterns and predict gene expression improved significantly. This suggests that intragenomic variation in the evolutionary histories of genes can confound signals of selection on codon usage.
This project has two parts. First, we would like to see how allowing for intragenomic variation in the evolutionary parameters shaping CUB impacts our ability to predict gene expression across the 49 yeast species. For most species, we expect that the differences in evolutionary histories will be primarily shaped by differences in the mutation biases, such as those caused by GC-biased gene conversion (16). For most genes, we expect the selective environment determined by the tRNA pool to be the same for most genes. Only in the rare situation of recent introgressions or horizontal gene transfer events do we expect the selective environment to differ, and in order for such events to have a considerable effect on our results, they would need to make up a considerable portion of a species’ genome.
Second, we would like to examine the shifts in selection and mutational biases that are shaping codon usage bias across the yeast subphylum. Although Labella et. al. touched on some of these concepts, the methods they relied on do not reflect the underlying evolutionary processes that shape CUB. Using a model rooted in population genetics theory, we will do this by estimating codon-specific selection coefficients and mutation bias parameters and comparing these parameters across species using phylogenetic comparative methods (PCM) (17, 18). We will systematically examine if shifts in selection for particular codons can be attributed to shifts in the tRNA pool, which Labella et. al. did not investigate.
Methods:
Analyzing Codon Usage Patterns Across 49 yeast sub-phylum:
Protein-coding sequences and protein amino acid sequences FASTA files were downloaded from the repository provided by Labella et. al.. Mitochondrial genes were identified via a BLAST search against known mitochondrial genes and were removed. Analysis of protein-coding sequences will be performed with ROC-SEMPPR using the AnaCoDa R package (19) of 49 Saccharomycotina species.
Identification of genes with different evolutionary histories:
Protein-coding sequences and protein amino acid sequences FASTA files were downloaded from the repository provided by Labella et. al.. Mitochondrial genes were identified via a BLAST search against known mitochondrial genes and were removed. Analysis of protein-coding sequences will be performed with ROC-SEMPPR using the AnaCoDa R package (19) of 49 Saccharomycotina species.
Ortholog identification and function predictions:
Orthogroups (orthologs and paralogs) will be identified across all 49 yeast species using the software OrthoFinder (22). Functions and domain predictions will be predicted using eggNog-Mapper (23).
Empirical estimates of gene expression:
Empirical gene expression estimates will be obtained for as many species as possible in the form of RNA-seq data, which will be downloaded from the Sequence Read Archive (SRA). Using Bash scripts, the data will then be pre-processed to trim adapter sequences on the ends; this will be performed using the fastp, a FASTQ processing tool. Then, gene expression will be estimated using kallisto, a program used for quantifying abundances of target sequences (e.g. transcripts from RNA-Seq data) using high-throughput sequencing reads. These empirical estimates can then be compared with the predicted estimates from ROC-SEMPPR to see whether ROC-SEMPPR shows better agreement with empirical data when it is assumed there is no intragenomic variation in the evolutionary histories of genes, or when variation is allowed.
Results:
Relationship between predicted and observed gene expression in closely related specie varies:
The relationship between predicted gene expression and empirical gene expression data varies greatly, even between closely related species. ROC-SEMPPR used on the 49 species of Saccharomycotina yeast with empirical estimates of mRNA abundances available (see Methods). Some species showed high correlation between empirical and predicted expression, while the others showed a weak or even negative relationship. There are three possible causes: either A.) the expression data is of poor quality, B.) closely related species experienced a sudden change in the strength of selection, or C.) our predictions of gene expression were inaccurate.
The expression data used between related species were similar in depth and are moderately correlated. This is taken as evidence that variation in data quality is not the sufficient to cause the large changes in correlation between predicted and empirical expression across the closely related species. We compared the relative synonymous codon usage (RSCU) between genes with high and low expression levels (top 5% and bottom 5%, respectively). RSCU values of the highly expressed genes are highly correlated, indicating that the overall direction of selection on synonymous codon usage is similar between species and is therefore not the cause of the discrepancies.
Intragenomic variation in mutational biases affects estimates of translational selection:
To account for local variation, genes were assigned to evolve in one of two mutational regimes by algorithmic k-medoid CLARA clustering. The two clusters created were a higher GC3% cluster and a lower GC3% cluster. Genes were classified by their average GC3% content, and the two distinct clusters suggest some genes experience stronger GC mutational biases.
ROC-SEMPPR was refit to all species allowing for mutation bias parameters to vary across the two clusters (except for selection parameters, which evolve under the same tRNA pool.) When intragenomic mutational bias variation was accounted for, species with a significant between GC3% clusters produced a significant improvement in the correlation between predicted and empirical gene expression. Those with little differences between clusters often resulted in poorer correlation when mutational bias was allowed to vary. This suggests that, in these cases, clustering diluted signals of selection on codon usage. By contrast, when the 5 species whose predicted expression estimates benefitted greatly from variable mutational biases were modelled with constant mutational bias, the model fit essentially captured the lower GC3% and higher GC3% clusters. This suggests that if mutational bias variation is strong enough, it can be mistaken for natural selection.
Genes in a mutational cluster are physically clustered on chromosomes:
Of note, the genes in a mutational cluster were also physically clustered on chromosomes. In order to better understand the masking of signals of natural selection observed in higher-GC3%-variability species, we sought to determine the molecular basis behind the clustering of genes. When genes in individual clusters were checked for proximity along chromosomes, large physical clusters of genes were found—each corresponding to a specific mutational cluster. This is a possible indicator of biased gene conversion being the cause of intrachromosomal differential mutational rates, rather than a novel mechanism.
Effects of Variable Mutation on Estimates of Natural Selection:
Allowing for intragenomic variation in mutational biases in gene expression modes improves detection of natural selection on codon usage; identifying the “optimal” or “preferred” codon “selected” for is a key aspect of studying CUB. Allowing for intragenomic variation in mutational biases in the gene expression model was found to improve the model’s ability to identify optimal codons across the previously mentioned 5 species which benefited from variable mutational bias; the model with constant mutational bias failed to identify the optimal codon for the majority of AAs. In addition, whenever the same codon was identified as optimal by both models, the models showed significant differences in selection coefficient estimates, which indicates that the estimation of the strength of the selection force also improves when intragenomic mutation bias variation is taken into account.
Discussion:
Being able to quantify the relative contributions of non-adaptive (e.g. mutation bias) and adaptive evolutionary forces shaping codon usage bias (CUB) is a key aspect of studying CUB. Although studies often assume mutation bias is constant, this is not always the case (due to mechanisms such as biased gene conversion). By examining 49 Saccharomycontina yeasts, we determined that failing to account for potential intragenomic variation in mutation bias obfuscates signals of natural selection on codon usage in various phylogenetically diverse species; it also diminishes the ability to identify the direction and strength of said selection (i.e. misidentification of optimal codons).
Unsupervised clustering of genes by GC3% content unmasks the genes evolving under different mutational biases. Interestingly, the unsupervised clustering is parroted by physical clustering along chromosomes, with genes belonging to the same mutational cluster often being grouped together physically the genome. This suggests a molecular foundation to the clusters formed by the unsupervised clustering. Indeed, the physical clustering of genes closely matching the GC3% clustering is consistent with GC-biased gene conversion (gBGC) being at work.
Genes were put into clusters by k-medioids CLARA clustering algorithm, with k = 2. It is possible further clustering could further refine and improve the ability to detect signals of natural selection on codon usage; however, because under strong translational selection, gene expression is a major factor shaping variation in codon frequencies, over-clustering of genes could dilute signals of selection, with highly expressed genes and altered codon frequencies being erroneously identified as evolving under a different mutational bias.
Further work is needed to understand the causes of intragenomic variation in mutation biases across the Saccharomycotina species. Tests of selection on codon usage are often based on comparing the genes in question with a set of reference set of genes, created with highly expressed genes, with a bias towards specific codons. This reference set must be chosen care when dealing with variable mutation biases to avoid under- or overestimating selection on codon usage. Analysis of SNPs detected in the species being studied can be used to account for these mutational bias variations; however, polymorphism and recombination data for non-model organisms is scant, which limits this approach to a select few species.
We estimated selection on codon usage under variable mutational biases by combining unsupervised machine-learning with an explicit populations genetics model. Having models that can explicitly incorporate evolutionary forces shaping codon usage are crucial as we continue to study codon usage.
References:
- Ikemura T. Correlation between the Abundance of Escherichia coli Transfer RNAs and the Occurrence of the Respective Codons in its Protein Genes: A Proposal for a Synonymous Codon Choice that is Optimal for the E. coli Translational System. J Mol Biol. 1981;151:389–409.
- Gouy M, Gautier C. Codon usage in bacteria: Correlation with gene expressivity. Nucleic Acids Res [Internet]. 1982 Nov 25 [cited 2021 Apr 7];10(22):7055–74. Available from: https://academic.oup.com/nar/article/10/22/7055/1027194
- Akashi H. Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics. 1994;136(3).
- Drummond DA, Wilke CO. Mistranslation-Induced Protein Misfolding as a Dominant Constraint on Coding-Sequence Evolution. Cell. 2008;134:341–52.
- Sharp PM, Li W. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucl Acids Res. 1987;15(3):1281–95.
- dos Reis M, Savva R, Wernisch L. Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res. 2004;32(17):5036–44.
- Charlesworth B. Effective population size and patterns of molecular evolution and variation. Nat Rev Genet. 2009;10:195–205.
- Bulmer M. The selection-mutation-drift theory of synonymous codon usage. Genetics. 1991;129(3).
- Gilchrist MA. Combining Models of Protein Translation and Population Genetics to Predict Protein Production Rates from Codon Usage Patterns. Mol Biol Evol. 2007;24(11):2362–72.
- Shah P, Gilchrist M. Explaining complex codon usage patterns with selection for translational efficiency, mutation bias, and genetic drift. PNAS. 2011;108(25):10231–6.
- Wallace EWJ, Airoldi EM, Drummond DA. Estimating Selection on Synonymus Codon Usage from Noisy Experimental Data. Mol Biol Evol. 2013;30(6):1438–53.
- Gilchrist MA, Chen WC, Shah P, Landerer CL, Zaretzki R. Estimating Gene Expression and Codon-Specific Translational Efficiencies, Mutation Biases, and Selection Coefficients from Genomic Data Alone. Genome Biol Evol. 2015;7:1559–79.
- Labella AL, Opulente DA, Steenwyk JL, Hittinger CT, Rokas A. Variation and selection on codon usage bias across an entire subphylum. PLoS Genet [Internet]. 2019 Jul 1 [cited 2020 Oct 19];15(7):e1008304. Available from: https://doi.org/10.1371/journal.pgen.1008304
- Wright F. The “effective number of codons” used in a gene. Gene. 1990;87(1):23–9.
- Landerer C, O’Meara BC, Zaretzki R, Gilchrist MA. Unlocking a signal of introgression from codons in Lachancea kluyveri using a mutation-selection model. BMC Evol Biol [Internet]. 2020 Aug 26 [cited 2020 Oct 19];20(1):109. Available from: https://bmcevolbiol.biomedcentral.com/articles/10.1186/s12862-020-01649-w
- Lesecque Y, Mouchiroud D, Duret L. GC-biased gene conversion in yeast is specifically associated with crossovers: Molecular mechanisms and evolutionary significance. Mol Biol Evol [Internet]. 2013 Jun 1 [cited 2020 Oct 22];30(6):1409–19. Available from: https://academic.oup.com/mbe/article/30/6/1409/1145455
- Felsenstein J. Phylogenies and the Comparative Method on JSTOR. Am Nat [Internet]. 1985 [cited 2018 Nov 29];125(1):1–15. Available from: https://www.jstor.org/stable/2461605?seq=1#metadata_info_tab_contents
- Hansen TF. STABILIZING SELECTION AND THE COMPARATIVE ANALYSIS OF ADAPTATION. Evolution (N Y) [Internet]. 1997 Oct 1 [cited 2018 Dec 4];51(5):1341–51. Available from: http://doi.wiley.com/10.1111/j.1558-5646.1997.tb01457.x
- Landerer C, Cope A, Zaretzki R, Gilchrist MA. AnaCoDa: analyzing codon data with Bayesian mixture models. Bioinformatics [Internet]. 2018;bty138. Available from: http://dx.doi.org/10.1093/bioinformatics/bty138
- dos Reis M, Wernisch L, Savva R. Unexpected correlations between gene expression and codon usage bias from microarray data for the whole \textit{Escherichia coli} K-12 genome. Nucleic Acids Res. 2003;31(23):6976–85.
- Cope AL, Hettich RL, Gilchrist MA. Quantifying codon usage in signal peptides: Gene expression and amino acid usage explain apparent selection for inefficient codons. Biochim Biophys Acta - Biomembr. 2018;1860(12).
- Emms DM, Kelly S. OrthoFinder: Phylogenetic orthology inference for comparative genomics. Genome Biol [Internet]. 2019 Nov 14 [cited 2020 Nov 25];20(1):238. Available from: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1832-y
- Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, Von Mering C, et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol Biol Evol. 2017 Aug 1;34(8):2115–22.
- Chen S, Zhou Y, Chen Y, Gu J. Fastp: An ultra-fast all-in-one FASTQ preprocessor. In: Bioinformatics [Internet]. Oxford University Press; 2018 [cited 2021 May 5]. p. i884–90. Available from: https://github.com/OpenGene/fastp.
- Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol [Internet]. 2016 May 1 [cited 2021 May 5];34(5):525–7. Available from: http://www.nature.com/