- Open Access
Optimization of genomic selection training populations with a genetic algorithm
© Akdemir et al.; licensee BioMed Central. 2015
- Received: 29 May 2014
- Accepted: 30 March 2015
- Published: 6 May 2015
In this article, we imagine a breeding scenario with a population of individuals that have been genotyped but not phenotyped. We derived a computationally efficient statistic that uses this genetic information to measure the reliability of genomic estimated breeding values (GEBV) for a given set of individuals (test set) based on a training set of individuals. We used this reliability measure with a genetic algorithm scheme to find an optimized training set from a larger set of candidate individuals. This subset was phenotyped to create the training set that was used in a genomic selection model to estimate GEBV in the test set. Our results show that, compared to a random sample of the same size, the use of a set of individuals selected by our method improved accuracies. We implemented the proposed training selection methodology on four sets of data on Arabidopsis, wheat, rice and maize. This dynamic model building process that takes genotypes of the individuals in the test sample into account while selecting the training individuals improves the performance of genomic selection models.
- Genomic Selection
- Genomic Estimate Breeding Value
- Training Population
- Prediction Error Variance
- Training Sample Size
Genomic selection (GS) in animal or plant breeding is based on genomic estimated breeding values (GEBV). Prediction of GEBV involves a whole-genome regression model in which the known phenotypes are regressed on the markers. For breeding programs with limited phenotyping resources, the genotypic information can be used to select a ‘good’ training set of individuals to be phenotyped. Once the phenotypes are measured, a regression model can be trained to predict GEBV of individuals that have not been phenotyped. Since phenotyping is a time-consuming and costly process, selecting a ‘good’ training population is key to the success of GS.
In this article, we concentrate our efforts on the design of a training population to maximize the accuracy of the GS models. We have imagined a scenario in which two sets of individuals with genotypic information are available. The first set includes candidate individuals from which individuals for the training set are selected for phenotyping. The prediction model will be trained on this set. The second set is a test set on which the prediction model is validated or within which selection is applied to move breeding forward to the next cycle. We assume that genotyping information of genome-wide markers is available for all individuals. This scenario is very similar to that of breeding programs that use GS and calculate GEBV from a limited number of phenotypic observations. This scenario is especially useful when phenotyping is expensive and genotypic information is relatively affordable. In this work, we show that a model building process which takes genotypes of the individuals in the test sample into account while selecting the training individuals improves the performance of prediction models built on a random sample of the same size.
Note that our method of selection does not require any phenotypic information. It only requires genotypic information about the individuals in a candidate set and a test set (genome-wide markers or the relationship information given by a pedigree). Based on genotyping information, our method selects an optimized training population, which will be phenotyped after selection. In our study, the phenotypes are needed only to validate the benefits of the proposed method and to compare it to the random sample (correlation coefficients between GEBV and the observed phenotypes are calculated).
Various regression models have been successfully used to predict the breeding values in plants and animals [1,2]. In both simulation studies and empirical studies of dairy cattle, mice and in bi-parental populations of maize, barley and Arabidopsis, marker-based GEBV are quite accurate. However, as training and testing populations diverge, the accuracies of GEBV decreases [3,4]. Because breeding populations tend to change over time, accuracies of GEBV obtained from the training population decrease over time. Similarly, in the presence of strong population structure, GEBV obtained by using sub-populations are usually not accurate for individuals in other sub-populations.
where K 21 is the matrix of genomic relationships between individuals in the test set with each of the individuals in the training set, K 11 measures the genomic relationships in the training set and parameter δ is related to the heritability (h 2) of the trait as follows δ=(1−h 2)/h 2. This reliability measure is related to Henderson’s prediction error variance (PEV)  and the more recent coefficient of determination (CD) of Laloë et al. . These measures were used in  to investigate the issue on training population design.
Computational cost of calculating the reliability measure in Equation (1) and the related PEV and CD increase with sample size. Finding an optimal training population involves evaluating these measures many times which is not computationally feasible for large sample sizes . Therefore, it is important to be able to estimate reliability efficiently. In the next section, we derive a computationally efficient approximation to the PEV based on the principal components of the genotypes and use this measure for training population design. Other efficient methods for the calculation of these statistics have been discussed in the literature [7,10,11].
Another major originality of our method compared to the optimization schemes recommended in [5,6] is that we calculate the PEV for the individuals in the test set instead of the candidate set, i.e., we use domain information about the test data while building the estimation model by choosing individuals for the training set such that they minimize the PEV in the test set. The methods developed here can be used for dynamic model building, in other words, for choosing different training sets to be phenotyped, and hence fitting different estimation models, as a function of the test set.
In order to estimate the parameters of this model, we acquire observations on n Train individuals from the larger candidate set. The model is used to generate predictions on a fixed set of n Test individuals.
for a choice of λ>0. In order to obtain minimum variance for our predictions in the test data set, we considered minimizing the scalar measure t r(P E V R i d g e (M Test )) with respect to M Train when selecting individuals for the training set.
Therefore, maximizing average reliability is equivalent to minimizing the total P E V R i d g e in Equation (3).
This approximation involves the inversion of a k+1 dimensional matrix and is computationally efficient compared to the measures in Equations (3) and (1), which involve the inversion of m+1 and n Train dimensional matrices.
Since many candidate training sets need to be evaluated in the course of optimization, we preferred the computationally efficient approximation in Equation (4) over the exact P E V R i d g e (M Test ). The scalar measure obtained by taking the trace of Equation (4) was used to evaluate training populations subsequently.
Numerous algorithms have been proposed for optimal design. Most of these approaches combine heuristics with an exchange algorithm [5,18,19]. The training set design is a combinatorial optimization problem for which genetic algorithms [20-22] are particularly suitable. Genetic algorithms use a population of candidate solutions that are represented as binary strings of 0s and 1s, this population evolving toward better solutions. At each iteration of the algorithm, a fitness function is used to evaluate and select the elite individuals and subsequently the next population is formed from the elites by genetically motivated operations such as crossover and mutation. Since genetic algorithms are particularly suitable for optimization of combinatorial problems, we have used one here. It should be noted that the solutions obtained by a genetic algorithm are usually sub-optimal and different solutions can be obtained given a different starting population of candidate solutions.
In the following section, we evaluate our training population design scheme by fitting a semi-parametric mixed model (SPMM) [23,24] using the genotypes and phenotypes in the training set and calculating the correlation of the test set phenotypes to their predictions based on this model. In these mixed models, genetic information, either pedigree- or marker-based, is used to construct an additive relationship matrix. These models have been successfully used to predict breeding values in plants and animals.
An additive relationship matrix can be calculated from the centered scaled marker genotype matrix M as K=M M ′/m. Given a similarity matrix K, the principal components used in our algorithm can be calculated from this matrix. Therefore, the statistic in Equation (4) can also be used in cases where only a similarity matrix is available.
For the examples in the next section, we fixed λ at 1/m. Although this choice is somewhat arbitrary (corresponds to a heritability value of 1/2), our method is robust to the choice of this parameter. Forty principal components were used for the approximation.
A training set of size n Train in a candidate set of size n can be identified with a n−vector of 0’s and 1’s, where a 1 at a locus means that the corresponding individual in the candidate set is in the training set. Therefore, all candidate solutions to the optimization problem are vectors of length n with a total of n Train 1’s. The genetic algorithm that we have applied starts with a random set of such solutions and generates new solutions based on one locus crossover event between two randomly selected parent solutions followed by a single random mutation event (which replaces a 1 to 0 and a 0 to 1) that occurs with probability 0.5. We used 300 iterations of the genetic algorithm with population size 800 (which amounts to evaluating 300∗800 solutions) and selection intensity 5/800 at each iteration. The training set with the best PEV measure was taken as the optimized training population at the last iteration. We decided to stop the iterations at 300 because no improvement in the criterion was observed after about 200 iterations. The solutions from the genetic algorithm may be suboptimal. To overcome this, in practice, the algorithm can be run many times, and the individuals that have been most often included can be used as the training set.
To illustrate our method, we used several datasets of different origins. The Arabidopsis dataset was published by Atwell et al.  and is available at cynin.gmi.oeaw.ac.at/home/resources/atpolydb. The wheat data was downloaded from www.triticeaetoolbox.org. The rice data was published in  and was downloaded from www.ricediversity.org/data. The maize data set was obtained from .
Accuracies were obtained by calculating the Pearson’s correlation coefficient (r) between the raw phenotypic values and the GEBV for the individuals in the test data set. The accuracies were not adjusted for trait heritabilities.
In the first two examples, the test individuals were sampled from the same population as the candidate set of individuals. Remaining examples deal with cases in which the distributions of the individuals in the test set and candidate set were not the same.
The Arabidopsis data set consisted of individuals from 199 inbred lines along with observations on 107 traits. Here, we report results for 49 of these traits. The genotype data set included 216 130 genome-wide markers.
The rice diversity panel consisted of 400 diverse accessions of inbred lines of rice (O. Sativa) from 82 countries, including many landraces, representing all the major rice growing regions of the world. This panel was genotyped with a 44-K SNP chip. Two years (2006 and 2007) and two replicates were used to evaluate each line for important agronomic traits. This data was first presented in  and was also analyzed in . A more detailed description of the accessions and geographical distribution of the rice germplasm is in . We selected six of these traits for our analysis, namely florets per panicle (FP), panicle fertility (PF), seed length (SL), seed weight (SW), seed surface area (SSA) and straighthead susceptibility (SHS) and used the phenotypic means of each inbred line across years and replicates.
The left plot in Figure 3 represents the relationships between the individuals in the test dataset and the individuals that are most frequently selected in the training set by the first two principal components of the marker dataset. The right graph is a neighbor-joining tree based on the genotypic distance matrix, where each node represents a genotype and the distances between the nodes are indicators of the dissimilarities of the genotypes. On this neighbor-joining tree, we highlighted the lines that were most frequently selected in training sets.
As shown in Figure 3, the optimization algorithm tends to select individuals that are on average similar to individuals in the test data set (which amounts to selecting individuals that are close to the overall mean or sub-population means), but at the same time it reaches some degree of diversity in the training set.
The accuracies of the genomic selection models tended to decrease as the training and test populations diverged. In each of the examples below, accuracies were better with the optimized samples for test sets of individuals which were not random samples from the same population from which the candidate sets were selected.
The results on accuracies were similar to those obtained in the previous examples: models from optimized samples outperformed the models from random samples of the same size, but this difference decreased as the training sample size increased. The plots in Figure S3 (see Additional file 2: Figure S3) show that the selection scheme prefentially selected different individuals when the test set was varied.
In this example, we evaluated the ability of estimating GEBV across clusters in a highly structured Maize data set. This data is described in  and was also analyzed in . The data set consists of 68 120 markers on 2 279 USA national inbred maize lines and their phenotypic means for degree days to silking. First, we divided the data into five clusters using the Euclidean distance matrix and the Ward’s criterion for hierarchical clustering. The numbers of individuals in the resulting clusters were 1317, 184, 552, 95 and 131 in the first, second, third, fourth and fifth clusters, respectively.
Although the results obtained in the examples described here suggest that accuracies can be improved by our method, not all of these improvements were statistically significant. Results should be interpreted as suggestive trends.
In this article, we address the training design problem and based on examples, and we show that incorporating genetic information in the test set when available can improve the accuracies of prediction models and that our method is computationally efficient.
Our results show that the accuracy of the prediction models can be improved if the individuals in the training population are selected using our scheme, especially when the required training sample size is small. Models built based on optimized samples are usually more accurate compared to models built based on random samples of larger sizes. Larger training sample sizes tend to increase accuracy, but simulations suggest that, in some cases, small training sample sizes can be just as accurate . The conditions under which small training sizes retain full accuracy or possibly increase it, were not explored here since our main purpose was to select a sample size that was dictated by a given phenotyping budget. Another related scenario that was outside the context of this article involves identifying individuals that should be further phenotyped when there is already some phenotypic information about these individuals. We intend to address these questions in future studies.
In our examples, we selected the training populations separately for each trait mainly because a different subset of individuals was observed for different traits in the data sets. In practice, it would be better to select a single training population for all the traits with similar heritabilities because in real conditions phenotyping comes after this step and the procedure is robust to the choice of the shrinkage parameter λ, which is a function of heritability. The robustness was verified by trying different λ values in the algorithm and comparing the resulting training sets and it was found to be in line with the conclusions in .
Our method is useful when a breeding program can only phenotype a subset of the available genotyped individuals, but aims at evaluating the breeding value of a (possibly much larger) group of genotypes. Genomic selection allows to estimate the breeding value of plants or animals using genotypic and phenotypic information from a training population. By replacing random sampling with our optimized selection scheme while selecting the training set, the breeding values in the test set can be estimated with higher accuracies. If the candidate and the test sets are both randomly selected from the same population, selecting an optimized training sample from the candidates with our method improves the accuracies of GEBV for this population. However, the use of our method is also limited since it requires that all the genotypes are known in advance and that the individuals that are selected in the training set are available for phenotyping.
We have discussed the training population design problem in the context of the regression of continuous traits on the genotypes based on SPMM. However, our proposed approach can be used to obtain more accurate prediction models in the general statistical learning domain. The bibliography on optimal experiment design in statistical learning is extensive [19,33,34] and review articles [18,35] provide a good survey of the area. The methods described here can be used to find optimal experiment designs for high-dimensional prediction problems where cost per individual of measuring or analyzing the response variable is too high, and therefore, a small number of training examples is required. They are also particularly useful when the candidate set from which the training set must be chosen is not representative of the test data set.
Our results indicate that the genetic algorithm scheme adopted here is very efficient at finding a good solution in the training population design problem. However, there is no guarantee that the solutions found by this algorithm are globally optimal solutions. Since the purpose of the article was to evaluate the overall improvement over many replications of the same experiments, we could not afford to start the genetic algorithm at several different starting points but it would be safer to do so. After several runs of the algorithm, we recommend picking the solution that led to the best value of the utility function or the individuals that were most frequently selected in the training set for the final training set.
A dynamic model building approach might be more suitable when the individuals in the test set are structured. An optimized set of individuals has a better chance of representing the sub-populations compared to a random sample of the same size. Accuracies can be improved by using different models for different parts of the test set, built on the basis of a subset of individuals that are chosen from the candidate set by the training population design algorithm. We intend to explore this and some related issues in a follow-up article.
This research was supported by the USDA-NIFA-AFRI Triticeae Coordinated Agricultural Project, award number 2011-68002-30029.
- Muir WM. Comparison of genomic and traditional blup-estimated breeding value accuracy and selection response under alternative trait and genomic parameters. J Anim Breed Genet. 2007; 124:342–55.View ArticlePubMedGoogle Scholar
- Heslot N, Yang HP, Sorrells ME, Jannink JL. Genomic selection in plant breeding: a comparison of models. Crop Sci. 2012; 52:146–60.View ArticleGoogle Scholar
- Windhausen VS, Atlin GN, Hickey JM, Crossa J, Jannink JL, Sorrells M, et al. Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3 (Bethesda). 2012; 2:1427–36.View ArticleGoogle Scholar
- Crossa J, Pérez P, Hickey JM, Burguenó J, Ornella L, Cerón-Rojas J, et al. Genomic prediction in CIMMYT maize and wheat breeding programs. Heredity. 2014; 112:48–60.View ArticlePubMed CentralPubMedGoogle Scholar
- Rincent R, Laloë D, Nicolas S, Altmann T, Brunel D, Revilla P, et al. Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: Comparison of methods in two diverse groups of maize inbreds (zea mays l.)Genetics. 2012; 192:715–28.View ArticlePubMed CentralPubMedGoogle Scholar
- Elsen JM, de Givry S, Katsirelos G, Shumbusho F. Optimizing the reference population in a genomic selection design. Proceedings of WCB13 Workshop on Constraint Based Methods for Bioinformatics 37.Google Scholar
- VanRaden P. Efficient methods to compute genomic predictions. J Dairy Sci. 2008; 91:4414–23.View ArticlePubMedGoogle Scholar
- Henderson CR. Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975; 31:423–47.View ArticlePubMedGoogle Scholar
- Laloë D, Phocas F, Ménissier F. Considerations on measures of precision and connectedness in mixed linear models of genetic evaluation. Genet Sel Evol. 1996; 28:359–78.View ArticlePubMed CentralGoogle Scholar
- Misztal I, Wiggans G. Approximation of prediction error variance in large-scale animal models. J Dairy Sci. 1988; 71:27–32.View ArticleGoogle Scholar
- Hickey JM, Veerkamp RF, Calus M, Mulder HA, Thompson R. Estimation of prediction error variances via monte carlo sampling methods using different formulations of the prediction error variance. Genet Sel Evol. 2009; 41:1–9.View ArticleGoogle Scholar
- Resende MF, Munóz P, Resende MD, Garrick DJ, Fernando RL, Davis JM, et al. Accuracy of genomic selection methods in a standard data set of loblolly pine (pinus taeda l.)Genetics. 2012; 190:1503–10.View ArticlePubMed CentralPubMedGoogle Scholar
- Ogutu JO, Schulz-Streeck T, Piepho HP. Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions. BMC Proc. 2012; 6:S10.View ArticlePubMed CentralPubMedGoogle Scholar
- Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970; 12:55–67.View ArticleGoogle Scholar
- Piepho HP. Ridge regression and extensions for genomewide selection in maize. Crop Sci. 2009; 49:1165–76.View ArticleGoogle Scholar
- Endelman JB. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome. 2011; 4:250–5.View ArticleGoogle Scholar
- Petersen KB, Pedersen M S. The Matrix Cookbook. 2008. http://matrixcookbook.com.
- Pronzato L, Muller WG. Design of computer experiments: space filling and beyond. Stat Comput. 2012; 22:681–701.View ArticleGoogle Scholar
- Fedorov VV. Theory of Optimal Experiments. New York: Academic Press Inc.; 1972.Google Scholar
- Holland JH. Genetic algorithms and the optimal allocation of trials. SIAM J Comput. 1973; 2:88–105.View ArticleGoogle Scholar
- Davis LD. Handbook of Genetic Algorithms. vol. 115. New York: Van Nostrand Reinhold; 1991.Google Scholar
- Goldberg DE. Genetic algorithms in search, optimization and machine learning. Delhi: Pearson Education; 2006.Google Scholar
- de Los Campos G, Gianola D, Rosa GJ, Weigel KA, Crossa J. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel hilbert spaces methods. Genet Res (Camb). 2010; 92:295–308.View ArticleGoogle Scholar
- Gianola D, van Kaam JB. Reproducing kernel hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics. 2008; 178:2289–303.View ArticlePubMed CentralPubMedGoogle Scholar
- Akdemir D, Godfrey OU. EMMREML: Fitting mixed models with known covariance structures. 2014. R package version 2.0. http://CRAN.R-project.org/package=EMMREML.
- Akdemir D. STPGA: Selection of training populations by genetic algorithm. 2014. R package version 1.0. http://CRAN.R-project.org/package=STPGA.
- R Core Team. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. 2013. R Foundation for Statistical Computing. http://www.R-project.org/.
- Atwell S, Huang YS, Vilhialmsson BJ, Willems G, Horton M, Li Y, et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature. 2010; 465:627–31.View ArticlePubMed CentralPubMedGoogle Scholar
- Zhao K, Tung CW, Eizenga GC, Wright MH, Ali ML, Price AH, et al. Genome-wide association mapping reveals a rich genetic architecture of complex traits in oryza sativa. Nat Commun. 2011; 2:467.View ArticlePubMed CentralPubMedGoogle Scholar
- Romay MC, Millard MJ, Glaubitz JC, Peiffer JA, Swarts KL, Casstevens TM, et al. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol. 2013; 14:R55.View ArticlePubMed CentralPubMedGoogle Scholar
- Wimmer V, Lehermeier C, Albrecht T, Auinger HJ, Wang Y, Schon CC. Genome-wide prediction of traits with different genetic architecture through efficient variable selection. Genetics. 2013; 195:573–87.View ArticlePubMed CentralPubMedGoogle Scholar
- Habier D, Fernando RL, Dekkers JC. Genomic selection using low-density marker panels. Genetics. 2009; 182:343–53.View ArticlePubMed CentralPubMedGoogle Scholar
- Pukelsheim F. Optimal design of experiments. vol 50. Philadelphia: SIAM; 2006.View ArticleGoogle Scholar
- Melas VB. Functional approach to optimal experimental design. vol 184. New York: Springer; 2006.Google Scholar
- Atkinson AC, Bailey R. One hundred years of the design of experiments on and off the pages of Biometrika. Biometrika. 2001; 88:53–97.View ArticleGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.