 Research Article
 Open Access
 Published:
Accuracies of univariate and multivariate genomic prediction models in African cassava
Genetics Selection Evolution volume 49, Article number: 88 (2017)
Abstract
Background
Genomic selection (GS) promises to accelerate genetic gain in plant breeding programs especially for crop species such as cassava that have long breeding cycles. Practically, to implement GS in cassava breeding, it is necessary to evaluate different GS models and to develop suitable models for an optimized breeding pipeline. In this paper, we compared (1) prediction accuracies from a singletrait (uT) and a multitrait (MT) mixed model for a singleenvironment genetic evaluation (Scenario 1), and (2) accuracies from a compound symmetric multienvironment model (uE) parameterized as a univariate multikernel model to a multivariate (ME) multienvironment mixed model that accounts for genotypebyenvironment interaction for multienvironment genetic evaluation (Scenario 2). For these analyses, we used 16 years of public cassava breeding data for six target cassava traits and a fivefold crossvalidation scheme with 10repeat cycles to assess model prediction accuracies.
Results
In Scenario 1, the MT models had higher prediction accuracies than the uT models for all traits and locations analyzed, which amounted to on average a 40% improved prediction accuracy. For Scenario 2, we observed that the ME model had on average (across all locations and traits) a 12% improved prediction accuracy compared to the uE model.
Conclusions
We recommend the use of multivariate mixed models (MT and ME) for cassava genetic evaluation. These models may be useful for other plant species.
Background
Cassava (Manihot esculenta Crantz) [1] is a staple food for over 700 million people in Africa, South America and Asia [2]. Cassava also has immense industrial potential. White cassava starch is easy to extract and contains low levels of fat (~ 1.5%), protein (~ 0.6%) and phosphorus (~ 4%), which are desirable features for the food industry [3, 4]. Given the issues of climate change and fastgrowing populations in countries that rely heavily on cassava, the need for rapid genetic improvement of cassava is critical. To enable this, genetic evaluation protocols based on best linear unbiased prediction (BLUP) analysis [5, 6] and selection on a merit index [7, 8] have been recommended to maximize gain from selection [9].
Genomic selection (GS) [10] offers crop species such as cassava a tremendous opportunity for accelerated genetic gains [11] by using wholegenome single nucleotide polymorphisms (SNPs) scored with methods such as genotypingbysequencing (GBS) [12]. These wholegenome SNPs could be sufficiently dense to be in linkage disequilibrium with most quantitative trait loci (QTL) that affect traits of interest. Using GS, selection is imposed at these QTL without actually identifying the QTL or the functional polymorphisms [10]. In addition, these SNPs will help to better track relatedness due to Mendelian sampling [13], which leads to improved selection accuracies especially when pedigree records are not available [14].
GS models for plant genetic evaluation
Genetic evaluation [9] starts by accurately estimating the genetic value of an individual for a wide range of traits using its own performance records, progeny performance records, records from relatives, or a combination of the three [15]. Usually this estimation is carried out by using the univariate single environment onestep (uT) BLUP model [16] to obtain estimated breeding values (EBV) for one trait at a time. In plant and animal breeding, breeders usually select multiple traits at the same time that are often genetically correlated, with correlations that can range from weak to strong. The uT model for traits that are measured in a single environment assumes zero genetic and residual covariances between these traits such that information from other traits is not used when obtaining EBV of the evaluated individuals for the traits in the analysis. However, the optimal estimation procedure is to combine information from multiple trait records and obtain EBV using the multitrait single environment onestep BLUP model (MT) [17, 18]. The MT model does not assume zero genetic and residual covariances but provides an estimate for these and also uses this information when obtaining individual EBV for the traits in the analysis. The MT model has several advantages over the uT model including:

1.
Higher prediction accuracies for individual traits in the model because of more information (direct or indirect) and better connectedness of the data [19], especially when traits with varying heritabilities are analyzed jointly. This is true if the genetic correlations in the model are significant or substantial with low error correlations.

2.
Simplified index selection because optimal weight factors for the total merit index are the economic weights [19].

3.
Procedures for obtaining genetic and residual covariances and incorporating these in EBV estimates for acrosslocation, country or region evaluations [20,21,22].
While MT models have clear advantages over uT models, they require the estimation of additional parameters (i.e., genetic and error covariances), which will affect accuracies of EBV. The number of additional parameters increases as the number of traits increases. For large models, many additional parameters can lead to convergence problems in the analysis. Lastly, an appreciable amount of data is required to get good estimates of these additional parameters.
In most plant breeding programs, genotypes are evaluated in multienvironment trials (MET) usually at advanced stages of breeding. The goal is to sample the influence on selection candidates of the range of environments for which varieties will be targeted. Addressing the problem of the analysis of MET brings into focus another potential use for MT models [23]. Here, phenotypes of the same trait, but measured at different locations are parameterized as different traits in the MT model [24], producing what we call a multivariate singletrait multienvironment BLUP (ME) model. Like the MT model, the ME model estimates genetic covariances of the same trait measured in multiple environments, which may lead to more accurate estimates of individual EBV for the trait in all the environments in which data are recorded. For the ME models used for modelling MET data, residual covariances are set to zero reflecting the assumption that no mechanism generates error covariances of a trait measured in different environments [20]. In contrast, the typical univariate BLUP model for modelling MET data, termed the univariate multienvironment onestep model (uE), fits a multikernel mixed model with the genotypic effect as one kernel and the genotypebyenvironment (GxE) effect as the second kernel and maybe environment as the third kernel [25]. This model yields a GxE variance for a MET and individuals can be ranked on their performance in different locations. Different variants of the ME model have been used for modeling environment covariance structures in plant [26,27,28,29] and in animal breeding [30, 31]. Genetic covariances from the ME model offer a convenient tool to assess the impact of GxE on a trait and relate directly to the extent of GxE at all locations in the analysis. A low genetic correlation between EBV for a trait at different locations from the ME model indicates a high GxE impact on that trait [9, 32,33,34,35].
To select the GS model for a practical cassava breeding program, it is necessary to compare models that will be efficient at various stages of cassava breeding with MET data. Finally, fitting multivariate BLUP models is not trivial. Even with software that, in principle, can fit these models, model convergence is not guaranteed and may require several attempts [36,37,38] and univariate models may be more practical if the benefits of the multivariate models are not substantial.
The objectives of this paper are to (1) compare multitrait (MT) and single trait (uT) mixed models for single environment data using crossvalidation, and (2) compare the multivariate multienvironment (ME) model to a singletrait multienvironment (uE) model using crossvalidation and assessing the GxE impact on the traits analyzed via genetic covariances from the ME model fit.
Methods
Cassava phenotype data
We used historical phenotype data from different trials that were conducted for the cassava breeding program at the International Institute of Tropical Agriculture (IITA), Ibadan, Nigeria. The genetic gain population represents a collection of clones selected from the 1970s to 2007 within this program [39, 40]. Some of these clones are West African landraces and some are of East African origin. Clones in the genetic gain population have gone through advanced stages of the cassava breeding process up to onfarm variety testing trials. The data used in our analysis include data that were collected on clonal evaluation trials (CET), which are augmented design trials with typically two known checks and unreplicated plots with five plants. These data were collected from three target locations in Nigeria: Ibadan (7.40°N, 3.90°E), Mokwa (9.3°N, 5.0°E), and Ubiaja (6.66°N, 6.38°E). These locations represent regions, which encompass about 35% of the cassava production in Nigeria. Datasets were collected from 2000 to 2015 and included trials with most of the 739 clones of the genetic gain population. Six target agronomic traits were used in the analysis including seedling vigor (VIGOR), number of storage roots per plot at harvest (RTNO), fresh weight of harvested roots expressed in tons per hectare (T/ha) (FYLD), percent dry matter (DM) of storage roots, which measures root dry weight as the % of the root fresh weight, plot mean cassava mosaic disease severity (MCMDS), which is rated on a scale from 1 (no symptoms) to 5 (extremely severe), and plot mean cassava green mite (MCGM) severity, which is rated on a scale from 1 (no symptoms) to 5 (extremely severe). Cassava mosaic disease is caused by a Begomovirus that belongs to the Geminiviridae family, and is carried and transmitted by the whitefly Bemisia tabaci. The cassava green mite is Mononychellus tanajoa [41]. These traits are target traits used in the selection index for selection decisions in the IITA cassava breeding program. Phenotype data metrics are in Table 1. All trait records were plot averages for both clonal accessions and checks. All checks were included in the analysis.
Cassava genotyping data
DNA from 739 clones from the 2013 genetic gain trial at IITA was extracted by using DNeasy Plant Mini kits (Qiagen) and was quantified using PicoGreen. Genotypingbysequencing (GBS) was used for genotyping [12] these clones. Six 95plex and one 75plex ApeKI libraries were constructed and sequenced on Illumina HiSeq, one lane per library. SNPs were called from the sequence data using the TASSEL pipeline version 4.0 [42], using an alignment to the Manihot esculenta version 6 reference genome [43]. Average sequencing depth for polymorphic loci was 5×. Individuals with more than 80% and SNPs with more than 60% missing calls were removed. SNP genotyping data were converted to numeric genotypes (0, 1, 2) and missing genotyping data were imputed using a LASSO regression method (Ariel Chan, personal communication, 2014) that was implemented using the R glmnet package [44]. The resulting dataset was rounded to obtain numerical genotypes (0, 1, 2) and consisted of 183,201 SNPs scored in 739 clones.
Statistical analysis
We structured the cassava phenotype data described above into two types of data that are commonly used in most plant breeding programs. The first set was achieved by pooling data from multiple years at specific locations (multiyear trials data). We termed this scenario the singleenvironment genetic evaluation (Scenario 1). The resulting prediction accuracies from this dataset were assessed for the three locations. The second scenario was achieved by using data from multiple locations and years (MET) but, in this case, locationspecific information was extracted by modeling GxE interactions. We termed this scenario the multienvironment genetic evaluation (Scenario 2) and its goal was to assess the impact of GxE and determine the best way to fit it while also using information from correlated environments.
Pseudotrue genetic values for prediction accuracy computations
To validate the models in this study, first we defined a univariate singletrait mixed model for each trait at each location separately (to preserve the variation embedded in each location) using an identity covariance matrix among clone effects, which assumes no relationship among the clones. The univariate mixed model was as follows:
with \( {\mathbf{u}} \sim N\left( {0,\upsigma_{\text{u}}^{2} {\mathbf{I}}} \right) \) and \( {\mathbf{e}} \sim N\left( {0,\upsigma^{2} {\mathbf{I}}} \right) \), where \( {\mathbf{y}} \) is a vector of observations, \( {\mathbf{b}} \) is a vector of fixed effects with the design matrix \( {\mathbf{X}} \) (relating observations to fixed effects, in this case including grand mean and a nested effect of trialwithinyear and the ratio of plants harvested to number planted); \( {\mathbf{u}} \) is a vector of clonal genetic effects with the design matrix \( {\mathbf{Z}} \) (relating observations to clones). This model was fit using the lmer function in the R lme4 package [45] and resulting BLUP values (\( \hat{u} \)), which we refer to as estimated genotypic values (EGV), were used as pseudotrue genetic effects to compute prediction accuracy, as commonly used in the plant breeding literature [41, 46, 47].
GS models for Scenario 1
We defined two mixed models that were fitted as follows.
The singletrait mixed model (uT)
with \( {\mathbf{u}} \sim N\left( {0,\upsigma_{\text{u}}^{2} {\mathbf{K}}} \right) \) and \( {\mathbf{e}} \sim N\left( {0,\upsigma^{2} {\mathbf{I}}} \right) \), where \( {\mathbf{y}} \) is the response vector of a trait for a given location, \( {\varvec{\upbeta}} \) is the vector of fixed effects with the design matrix \( {\mathbf{X}} \) (with components as in Model 1 above for each location and trait); \( {\mathbf{u}} \) is the vector of random additive genomic effects with the design matrix \( {\mathbf{Z}} \) (relating trait values to clones) and \( {\mathbf{K}} \) is the additive genomic relationship matrix generated from SNPs as in method 1 of VanRaden [48] implemented in preGSf90 [49].
The multitrait mixed model
where \( {\mathbf{y}} = ({\mathbf{y}}_{1}^{{\prime }} ,{\mathbf{y}}_{2}^{{\prime }} ,{\mathbf{y}}_{3}^{{\prime }} , \ldots , {\mathbf{y}}_{\text{d}}^{{\prime }} )^{ '} \), \( {\mathbf{u}} = ({\mathbf{u}}_{1}^{{\prime }} ,{\mathbf{u}}_{2}^{{\prime }} , \ldots , {\mathbf{u}}_{\text{d}}^{{\prime }} )^{ '} \) and \( {\mathbf{e}} = ({\mathbf{e}}_{1}^{{\prime }} ,{\mathbf{e}}_{2}^{{\prime }} ,{\mathbf{e}}_{3}^{{\prime }} , \ldots , {\mathbf{e}}_{\text{d}}^{{\prime }} )^{ '} \), and \( {\mathbf{y}} \) is the response vector of \( d \) traits (six core traits described above), \( {\mathbf{X}} \) is a design matrix for fixed effects \( {\varvec{\upbeta}} \) (with components as in Model 1 above for each location and trait) and \( {\mathbf{Z}} \) is a design matrix for random genetic effects \( {\mathbf{u}} \). Following a multivariate normal distribution (\( N_{m} \)), the marginal density of \( {\mathbf{y}} \) is:
and \( {\mathbf{V}} = {\mathbf{Z}}\left( {{\mathbf{G}} \otimes {\mathbf{K}}} \right){\mathbf{Z}}^{\text{T}} + {\mathbf{R}} \otimes {\mathbf{I}}) \).
The matrices \( {\mathbf{G}} \) and \( {\mathbf{R}} \) are \( d \times d \) symmetric unstructured genomic and error covariance matrices respectively, \( {\mathbf{K}} \) remains the additive genomic relationship matrix for \( n \) clones generated from SNPs as above, \( {\mathbf{I}} \) is an identity matrix.
Models (2) and (3) were fitted separately for each location Ubiaja, Mokwa and Ibadan, respectively, allowing the error (co)variances associated with these locations to be distinct. Note also that, in these models, genotypebylocation effects are confounded with the main genotype effects such that variance components may change between locations. The effects of years and trials were fixed because emphasis was on location effects since these locations represented different production regions and we sought to capture consistent effects of these locations. In contrast, year effects are variable and by definition not consistent. Also following practice in cassava breeding [46, 47], multiple observations of one clone were not considered as repeated measures. Although these subjects were clones, data were collected from distinct individuals and thus they are independent. Hence, these measurements were treated as samples of clones and should lead to better precision in the prediction of breeding values.
GS models for Scenario 2
We also defined two mixed models with the aim of modeling genotypebyenvironment interaction effects as follows.
The compound symmetric multienvironment model (uE)
Here, we describe the uE model, first how it is fit and then we show its compound symmetry structure. The model is as follows:
with \( {\mathbf{y}} = ({\mathbf{y^{\prime}}}_{\text{a}} ,{\mathbf{y^{\prime}}}_{\text{b}} ,{\mathbf{y^{\prime}}}_{\text{c}} )^{ '} \), \( {\mathbf{e}} = ({\mathbf{e^{\prime}}}_{\text{a}} ,{\mathbf{e^{\prime}}}_{\text{b}} ,{\mathbf{e^{\prime}}}_{\text{c}} )^{ '} \),
where \( {\mathbf{y}} \) is the vector of a trait at locations \( {\text{a}} \), \( {\text{b}} \) and \( {\text{c}} \) (corresponding to Ubiaja, Mokwa and Ibadan), \( {\varvec{\upbeta}} \) is the vector of fixed effects with the design matrix \( {\mathbf{X}} \) (relating observations to fixed effects as in Model 1); \( {\mathbf{u}} \) is the vector of random additive genomic effects with the design matrix \( {\mathbf{Z}}_{1} \) (relating trait values to clones), \( {\mathbf{w}} \) is the vector of random clonebylocation interaction effects with the design matrix \( {\mathbf{Z}}_{2} \), which is \( diag\left( {{\mathbf{Z}}_{\text{a}} ,\varvec{ }{\mathbf{Z}}_{\text{b}} ,\varvec{ }{\mathbf{Z}}_{\text{c}} } \right) \) that relates records to clones in locations \( {\text{a}} \), \( {\text{b}} \) and \( {\text{c}} \), respectively. For the \( {\text{c}} \)th location, a column of \( {\mathbf{Z}}_{\text{c}} \) may be all 0s if the clone represented by the column was not evaluated in that location. \( \upsigma_{\text{u}}^{2} {\mathbf{K}} \) is the additive genomic relationship matrix generated from SNPs as above, \( {\mathbf{I}} \) is an identity matrix and \( {\mathbf{I}}_{3} \) is a \( 3 \times 3\varvec{ } \) identity matrix. In this model, the genomic value of a clone for the \( {\text{c}} \)th location was estimated as \( {\hat{\mathbf{u}}} + \widehat{{{\mathbf{w}}_{\text{c}} }} \). A more complete account of the error terms would have included clonebyyear and clonebylocationbyyear terms in the model. While such a model would have characterized the error in more detail, we believe that the obtained improvement of withinlocation estimation would have been marginal. Model 5 implies a compound symmetry structure [50] as described below.
The uE model defined as a compound symmetry (CS) covariance structure model
Using the same symbols as above, we defined the uE model with a CS covariance structure as:
with \( {\mathbf{y}} = ({\mathbf{y^{\prime}}}_{\text{a}} ,{\mathbf{y^{\prime}}}_{\text{b}} ,{\mathbf{y^{\prime}}}_{\text{c}} )^{ '} \) and \( {\mathbf{e}} = ({\mathbf{e^{\prime}}}_{\text{a}} ,{\mathbf{e^{\prime}}}_{\text{b}} ,{\mathbf{e^{\prime}}}_{\text{c}} )^{ '} \),
The genomic effect from this CS model for the \( {\text{c}} \)th location \( \widehat{{{\mathbf{w}}_{c} }} \) is equal to \( {\mathbf{\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{u} }} + \widehat{{{\mathbf{w}}_{c} }} \) from the uE model. The \( {\mathbf{Z}}_{2} \) matrix is the same as in the uE model. Compared to the ME model described below, which replaces \( {\varvec{\Phi}} \) with an unstructured covariance matrix with nine parameters (six for genetic and three for error (co)variances, respectively), the CS model has three parameters including \( \upsigma_{{{\text{u}} + {\text{w}}}}^{2} \) (equivalent to \( \upsigma_{\text{u}}^{2} +\upsigma_{\text{w}}^{2} \) in the uE model), \( \upsigma_{\text{e}}^{2} \) and \( \uprho \). For any trait for which the CS covariance structure best fits the data, it is expected that model uE will provide more accurate GEBV than the ME model, which will overfit the data. Furthermore, the uE model defined here assumes a homogeneous variance across locations \( {\text{a}} \), \( {\text{b}} \) and \( {\text{c}} \). Although a CS model with heterogeneous variances can be fit, this was not the case for the uE model. This assumption is incorrect if there are significant heterogeneous variances across these locations. In such a case, the ME model should provide more accurate breeding values.
The multivariate multienvironment (ME) model
We fit the ME model in a singlestep procedure using the following model:
with \( {\mathbf{y}} = ({\mathbf{y^{\prime}}}_{\text{a}} ,{\mathbf{y^{\prime}}}_{\text{b}} ,{\mathbf{y^{\prime}}}_{\text{c}} )^{ '} \), \( {\mathbf{u}} = ({\mathbf{u^{\prime}}}_{\text{a}} ,{\mathbf{u^{\prime}}}_{\text{b}} , {\mathbf{u^{\prime}}}_{\text{c}} )^{ '} \) and \( {\mathbf{e}} = ({\mathbf{e^{\prime}}}_{\text{a}} ,{\mathbf{e^{\prime}}}_{\text{b}} ,{\mathbf{e^{\prime}}}_{\text{c}} )^{ '} \), where \( {\mathbf{y}} \) is the vector of a trait in locations \( {\text{a}} \), \( {\text{b}} \) and \( {\text{c}} \) (corresponding to Ubiaja, Mokwa and Ibadan) recorded for \( n \) clones, the \( {\mathbf{X}} \) and \( {\mathbf{Z}} \) design matrices are block diagonal matrices represented as \( diag\left( {{\mathbf{X}}_{\text{a}} ,\varvec{ }{\mathbf{X}}_{\text{b}} ,\varvec{ }{\mathbf{X}}_{\text{c}} } \right) \) and \( diag\left( {{\mathbf{Z}}_{\text{a}} ,\varvec{ }{\mathbf{Z}}_{\text{b}} ,\varvec{ }{\mathbf{Z}}_{\text{c}} } \right) \), respectively allowing for missing clones and observations. \( {\mathbf{X}} \) is a design matrix for fixed effects \( {\varvec{\upbeta}} \) (with components as in Model 1) and \( {\mathbf{Z}} \) is a design matrix for random genomic effects \( {\mathbf{u}} \). Following a multivariate normal distribution (\( N_{m} \)), the marginal density of \( {\mathbf{y}} \) is:
and \( {\mathbf{V}} = {\mathbf{Z}}\left( {{\mathbf{G}} \otimes {\mathbf{K}}} \right){\mathbf{Z}}^{\text{T}} + {\mathbf{R}} \otimes {\mathbf{I}}) \).
Given that \( d \) is the number of locations being analyzed, \( {\mathbf{G}} \) is a \( d \times d \) symmetric and unstructured genomic covariance matrix, while \( {\mathbf{R}} \) is a \( d \)dimensional diagonal error covariance matrix, \( {\mathbf{K}} \) remains the additive genomic relationship matrix for \( n \) clones generated from SNPs as above, and \( {\mathbf{I}} \) is an identity matrix. In this model, the error covariance matrix \( {\mathbf{R}} \) is diagonal, thus allowing heterogeneous variances for a trait at different locations but the covariances are fixed to zero following the assumption that no mechanism generates error covariances of a trait measured in multiple environments.
Estimation of the parameters in Models (2), (3), (5) and (6) were performed using the average information (AI) REML procedure implemented in the airemlf90 program [49] from which the best linear unbiased estimator (BLUE) of fixed effects and the BLUP of random effects were obtained by solving the mixed model equations (MME) [5, 6]. Custom Rscripts were used for crossvalidation.
Comparison of prediction accuracies
We used a fivefold crossvalidation scheme with 10 repeats for comparisons between the univariate and multivariate models. We used the same folds for the models in each scenario. Hereafter, we refer to predicted BLUP or genomic effects from these models as genomic EBV (GEBV). Prediction accuracies were calculated as a correlation of the validation fold GEBV to their corresponding EGV.
Results
Scenario 1: MT versus uT model
In Scenario 1, we observed that the prediction accuracies of the MT model were higher than those of the uT models for most traits and locations in our analysis (Table 2). On average (across traits and locations), the MT model yielded prediction accuracies that were 59% higher for VIGOR, 43% for RTNO, 27% for DM, 40% for MCMDS, 55% for FYLD and 18% for MCGM compared to the uT model. Averaged across traits and locations, the MT models were 40% more accurate than the uT models.
Scenario 2: ME versus uE model
In Scenario 2, we observed different patterns of prediction accuracies of the uE and ME models. The ME model yielded higher prediction accuracies for DM and MCMDS at all locations. On average (across locations), the uE model resulted in prediction accuracies that were 2% better for VIGOR and 1% for RTNO, while the ME model resulted in prediction accuracies that were 32% better for DM, 24% for MCMDS, 5% for FYLD, and 4% for MCGM. Prediction accuracy of the ME model was 12% higher than that of the uE model averaged across all traits and locations in the model. Trait correlations from the ME model representing the expected correlated responses to selection ranged from 0.21 to 0.66 for VIGOR, 0.36 to 0.54 for RTNO, 0.57 to 0.81 for DM, 0.68 to 0.87 for MCMDS, 0.31 to 0.52 for FYLD and 0.24 to 0.53 for MCGM. Thus, genetic effects for MCMDS and DM were more consistent across locations than those for the other traits.
Discussion
Scenario 1: MT versus uT model
Some studies reported comparisons between MT and uT genomic prediction models using simulated data or real datasets [51,52,53]. Based on simulated datasets, Guo et al. [53] and Calus et al. [52] reported similar accuracies between MT and uT models with accuracies of the MT models for lowly heritable traits being slightly higher when the genomic correlations between the traits increased. Using Holstein and Jersey breed datasets from the US Dairy National evaluation program, VanRaden et al. [51] also reported similar accuracies between MT and uT models for all the traits analyzed. However, for several traits, they obtained accuracies that were slightly higher with the uT model than with the MT model. For highly heritable traits and especially if complete phenotypic data are available for these, accuracies obtained by the MT model are not clearly better than those obtained by the uT model [53]. Improvement in prediction accuracies with the MT model is accrued mostly for lowly heritable traits when they are analyzed jointly with highly heritable traits that have medium to high genetic correlations and low residual correlations [52, 53]. Our results were consistent with those of other studies [52, 53] since, in our analysis, our MT model yielded higher accuracies for most traits and locations and resulted from the joint analysis of low and high heritability traits. Most of the genetic correlations between traits at all locations in the MT models were significant (substantial) with low error correlations (not shown). These contributed to the increased prediction accuracies observed for the MT models compared to those of the uT models. Substantial increases in prediction accuracies of the MT models were observed for VIGOR, RTNO and FYLD, which had mostly moderate to high genetic correlations with other traits at all locations although their heritabilities were mostly low.
For parental selections in specific locations, we recommend the use of MT models but to confirm this conclusion, further studies on the selection gains obtained by applying these models are necessary.
Scenario 2: ME versus uE model
Comparisons between different ME and uE genomic prediction models have been reported in plant breeding literature [54,55,56]. Burgueno et al. [29] conducted extensive modeling for multienvironment trials using pedigree and genomic markers and incorporated many covariance structures including diagonal, factor analytic (FA), identity and unstructured covariances for both the genomic and error components in their models. They observed that prediction accuracies of a genomic ME model with a diagonal genomic covariance structure and a diagonal error covariance structure (ME_{D–D}) were higher than that of a genomic ME model with a FA genomic covariance structure and diagonal error covariance (ME_{FAD}) for most of the locations measured in their analysis based on a crossvalidation scheme (CV1) [29]. This ME_{D–D} is a univariate model with fewer parameters but it can be compared to our uE model. Although the uE model in our study assumed identical genomic and error variances for all locations analyzed, total phenotypic variance was partitioned into direct clonal genomic, clonebylocation interaction and error variance components. Hence, effects due to clones and clonebylocation interaction were combined to generate locationspecific GEBV, which can be compared to locationspecific GEBV obtained in the ME_{D–D} model. Our results were in line with this study for the traits VIGOR and RTNO at all locations with the uE model yielding higher prediction accuracies than the ME models and differed for the traits DM and MCMDS at all locations with the ME models yielding higher prediction accuracies. However, on average across locations and traits, prediction accuracies of the ME models were higher.
To further understand the strength of the impact of GxE interaction on the cassava core traits analyzed in our study, we used information from the proportion of total variation explained by clone and clonebylocation effects from the uE model (Table 3). From the total variation explained by SNPs, the effect of clonebylocation interaction was approximately 30% for VIGOR, 48% for RTNO, 12% for DM, 15% for MCMDS, 56% for FYLD and 46% for MCGM. These proportions show strong clonebylocation interactions for FYLD, RTNO, MCGM and VIGOR but weak interactions for DM and MCMDS. In addition, the genetic correlations between the three locations for DM and MCMDS were relatively high ranging from 57 to 81 and 68 to 87%, respectively (Table 4), supporting our findings. These high correlations revealed that cassava DM and MCMDS were repeatable across the locations in our study, which suggests that genotypes selected for these traits will perform comparably across locations. From the genetic correlations in Table 4, improvement for RTNO and FYLD at Ubiaja will result in a correlated response of about 50% for these traits at Mokwa and about 35% at Ibadan. The low predicted correlated responses confirm that the environment had a higher impact on RTNO, FYLD, VIGOR and MCGM, thus improving these traits is more challenging. This makes a case for decentralized breeding especially for yield component traits. Breeding for good varieties that combine these core traits may be targeted towards specific locations or groups of locations with specific genotypes selected for these locations.
The ME model exploits the positive genomic correlations captured in its \( {\mathbf{G}} \) matrix for prediction. The major difference between the prediction accuracies obtained by the ME and uE models were mainly due to the fact that the ME model accounted for genetic covariances when generating GEBV since genetic variances from both models were similar. Genetic covariances from the ME models are a reflection of the GxE interactions for the trait of interest and ME breeding values capture both additive genotypic and additive genotypebyenvironment effects. However, the lack of information from betweentrait correlations (which are captured by MT models) in ME breeding values represents a challenge when selection decisions based on information from the interconnection between multiple trait and multiple location data are required. The interconnection between these data may be useful for understanding GxE and for selection on traits that are highly influenced by the environment. Therefore, there is an opportunity for interconnection between information from a valuable single environment and MET data which are readily available in plant breeding programs.
Another potential use of the ME models is for clustering environments into target populations of environments (TPE). If correlated responses to selection of target traits are similar for certain locations based on genetic correlations from the ME model, then these locations can be grouped into a TPE. Regional breeding can begin within this TPE and all multilocation trials are carried out within this TPE. For example, for the traits VIGOR, DM and MCMDS that have correlated responses to selection ranging from 66 to 87% (Table 4), Ubiaja and Ibadan can belong to same TPE.
Parameter estimates and implications for cassava breeding
The estimates of genomic correlations and heritabilities in Table 5 have interesting implications for cassava genetic improvement. Genetic correlations between RTNO and FYLD estimated with the MT model were high and positive for all locations (ranging from 0.65 to 0.8), whereas those between RTNO and DM and between FYLD and DM were close to zero (ranging from − 0.02 to 0.20). The genetic correlations for these core production traits (DM, RTNO and FYLD) indicate that concurrent improvement of these traits is achievable. However, more replication in trials targeting these production traits will help reduce error variances and improve the accuracy of parental selections given the low heritabilities for FYLD and RTNO. VIGOR can also be improved concurrently with these production traits since it is mostly positively correlated with these (Table 5). The disease trait (MCMDS) showed moderate to strong negative genomic correlations with VIGOR and the production traits, which is favorable for cassava breeding in Africa especially where the cassava mosaic disease (CMD) pressure is high. Consequently, cassava breeders have selected for CMD resistance genes over time [57, 58]. With the favorable genetic correlations between these target traits in mind, the merit index from MT breeding values should be efficient since it takes genetic correlations into account.
We would like to make it clear here that fitting MT and ME models are computationally expensive since, in our case, they required the estimation of 90 and 36 additional covariance parameters for the MT and ME models, respectively, compared to the uT and uE models. We had a few thousand records to estimate these parameters accurately for our target traits as shown by the standard errors of these estimates in Tables 4 and 5. When these correlations are not significant, breeding values from univariate models are sufficient because MT models are not expected to result in improved prediction accuracies [59].
Conclusions
The effectiveness of a breeding program is evaluated by its ability to provide adapted and productive varieties to the farming community in the target environments it serves. To achieve this goal for the cassava breeding program at IITA, we recommend a decentralized breeding strategy for the different agroecological zones in Nigeria using total merit indices based on MT breeding values. Further studies should be conducted to understand how much selection gain can be achieved by using this strategy. ME models provided less improvement in prediction accuracy but were useful for understanding GxE interactions.
References
 1.
The International Plant Names Index (IPNI). http://www.ipni.org/ipni/idPlantNameSearch.do?id=3517901. Accessed 31 Oct 2015.
 2.
Taylor N, Chavarriaga P, Raemakers K, Siritunga D, Zhang P. Development and application of transgenic technologies in cassava. Plant Mol Biol. 2004;56:671–88.
 3.
Moorthy SN. Physicochemical and functional properties of tropical tuber starches: a review. Starch. 2002;54:559–92.
 4.
Balagopalan C. Cassava utilization in food, feed and industry. In: Hillocks RJ, Thresh JM, Bellotti AC, editors. Cassava: biology, production and utilization. Wallingford: CAB International; 2002. p. 301–18.
 5.
Henderson CR. Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975;31:423–47.
 6.
Henderson CR. Estimation of variances and covariances under multiple trait models. J Dairy Sci. 1984;67:1581–9.
 7.
Smith HF. A discriminant function of plant selection. Ann Hum Genet. 1936;7:240–50.
 8.
Hazel LN. The genetic basis for constructing selection indexes. Genetics. 1943;28:476–90.
 9.
Ducrocq V, Wiggans G. Genetic improvement of cattle. In: Garrick DJ, Ruvinsky A, editors. The genetics of cattle. 2nd ed. Wallingford: CABI; 2015. p. 371–96.
 10.
Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome wide dense marker maps. Genetics. 2001;157:1819–29.
 11.
Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME. Invited review: genomic selection in dairy cattle: progress and challenges. J Dairy Sci. 2009;92:433–43.
 12.
Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, et al. A robust, simple genotypingbysequencing (GBS) approach for high diversity species. PLoS One. 2011;6:e19379.
 13.
Segelke D, Reinhardt F, Liu Z, Thaller G. Prediction of expected genetic variation within groups of offspring for innovative mating schemes. Genet Sel Evol. 2014;46:42.
 14.
Habier D, Fernando RL, Dekkers JC. The impact of genetic relationship information on genomeassisted breeding values. Genetics. 2008;177:2389–97.
 15.
VanRaden PM, Wiggans GR. Derivation, calculation, and use of national animal model information. J Dairy Sci. 1991;74:2737–46.
 16.
Goddard M. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009;136:245–57.
 17.
van der Werf J, van Arendonk JAM, De Vries AG. Improving selection of pigs using correlated characters. In: Proceedings of the 43rd annual meeting of the European federation of animal science (EAAP), 14–17 September 1992, Madrid. 1992.
 18.
Ducrocq V. Multiple trait prediction: principles and problems. In: Proceedings of the 5th world congress on genetics applied to livestock production, 7–12 August 1994, Guelph. 1994.
 19.
Colleau JJ, Ducrocq V, Boichard D, Larroque H. Approximate multitrait BLUP evaluation to combine functional traits information. Interbull Bull. 1999;23:1–9.
 20.
Schaeffer LR. Multiplecountry comparison of dairy sires. J Dairy Sci. 1994;77:2671–8.
 21.
Schaeffer LR. Multiple trait international bull comparisons. Livest Prod Sci. 2001;69:145–53.
 22.
Thompson R, Meyer K. A review of theoretical aspects in the estimation of breeding values for multitrait selection. Livest Prod Sci. 1986;15:299–313.
 23.
Smith A, Cullis B, Thompson R. Analyzing variety by environment data using multiplicative mixed models and adjustments for spatial field trend. Biometrics. 2001;57:1138–47.
 24.
Falconer DS. The problem of environment and selection. Am Nat. 1952;86:293–8.
 25.
Malosetti M, Ribaut JM, van Eeuwijk FA. The statistical analysis of multienvironment data: modelling genotypebyenvironment interaction and its genetic basis. In: Monneveux P, Ribaut JM, Okono A, editors. Drought phenotyping in crops: from theory to practice, vol 4. Lausanne, Switzerland: Frontiers Media, SA; 2014. p. 53–69.
 26.
Crossa J, Burgueño J, Cornelius PL, McLaren G, Trethowan R, Krishnamachari A. Modeling genotype × environment interaction using additive genetic covariances of relatives for predicting breeding values of wheat genotypes. Crop Sci. 2006;46:1722–33.
 27.
Burgueño J, Crossa J, Cornelius PL, Yang RC. Using factor analytic models for joining environments and genotypes without crossover genotype × environment interaction. Crop Sci. 2008;48:1291–305.
 28.
Burgueño J, Crossa J, Miguel Cotes J, San Vicente F, Das B. Prediction assessment of linear mixed models for multienvironment trials. Crop Sci. 2011;51:944–54.
 29.
Burgueño J, de los Campos G, Weigel K, Crossa J. Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci. 2012;52:707–19.
 30.
de los Campos G, Gianola D. Factor analysis models for structuring covariance matrices of additive genetic effects: a Bayesian implementation. Genet Sel Evol. 2007;39:481–94.
 31.
Meyer K. Factoranalytic models for genotype x environment type problems and structured covariance matrices. Genet Sel Evol. 2009;41:21.
 32.
Tsuruta S, Lourenco DAL, Misztal I, Lawlor TJ. Genotype by environment interactions on culling rates and 305day milk yield of Holstein cows in 3 US regions. J Dairy Sci. 2015;98:5796–805.
 33.
Kolmodin R, Strandberg F, Madsen P, Jensen J, Jorjiani H. Genotype by environment interaction in Nordic dairy cattle studied using reaction norms. Acta Agric Scand Anim Sci. 2002;52:11–24.
 34.
Windig JJ, Calus MPL, Veerkamp RF. Influence of herd environment on health and fertility and their relationship with milk production. J Dairy Sci. 2005;88:335–47.
 35.
Cooper M, DeLacy IH. Relationships among analytical methods used to study genotypic variation and genotypebyenvironment interaction in plant breeding multienvironment experiments. Theor Appl Genet. 1994;88:561–72.
 36.
Ducrocq V, Boichard D, Barbat A, Larroque H. Implementation of an approximate multitrait BLUP evaluation to combine production traits and functional traits into a total merit index. In: Proceedings of the 52nd annual meeting of the European association for animal production, 26–29 August 2001, Budapest; 2001.
 37.
Lassen J, Sørensen MK, Madsen P, Ducrocq V. A stochastic simulation study on validation of an approximate multitrait model using preadjusted data for prediction of breeding values. J Dairy Sci. 2007;90:3002–11.
 38.
Lassen J, Sørensen MK, Madsen P, Ducrocq V. An approximate multitrait model for genetic evaluation in dairy cattle with a robust estimation of genetic trends. Genet Sel Evol. 2007;39:353–67.
 39.
MaziyaDixon B, Dixon AGO, Adebowale ARA. Targeting different end uses of cassava: genotypic variations for cyanogenic potentials and pasting properties. Int J Food Sci Technol. 2007;42:969–76.
 40.
Okechukwu RU, Dixon AGO. Genetic gains from 30 years of cassava breeding in Nigeria for storage root yield and disease resistance in elite cassava genotypes. J Crop Improv. 2008;22:181–208.
 41.
Ly D, Hamblin M, Rabbi I, Melaku G, Bakare M, Gauch HG Jr, et al. Relatedness and genotype × environment interaction affect prediction accuracies in genomic selection: a study in cassava. Crop Sci. 2013;53:1312–25.
 42.
Glaubitz J, Casstevens T, Elshire R, Harriman J, Buckler ES. TASSEL 3.0 genotyping by sequencing (GBS) pipeline documentation. Edward S. Buckler. Ithaca: USDAARS. 2012. http://www.maizegenetics.net/tassel/docs/TasselPipelineGBS.pdf. Accessed 3 Jan 2014.
 43.
Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012;40:D1178–86.
 44.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
 45.
Bates D, Maechler M, Bolker B, Walker S. lme4: linear mixedeffects models using Eigen and S4. R package version 1.05. http://CRAN.Rproject.org/package=lme4. 2013.
 46.
Rutkoski J, Benson J, Jia Y, BrownGuedira G, Jannink JL, Sorrells M. Evaluation of genomic prediction methods for Fusarium head blight resistance in wheat. Plant Genome. 2012;5:51–61.
 47.
Wolfe MD, Kulakow P, Rabbi IY, Jannink JL. Markerbased estimates reveal significant nonadditive effects in clonally propagated cassava (Manihot esculenta): implications for the prediction of total genetic value and the selection of varieties. G3 (Bethesda). 2016;3:3497–506.
 48.
VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–23.
 49.
Misztal I, Tsuruta S, Strabel T, Auvray B, Druet T, Lee D. BLUPF90 and related programs (BGF90). In: Proceedings of the 7th world congress on genetics applied to livestock production, 19–23 August 2002, Montpellier. CDROM Communication 28:07; 2002.
 50.
Piepho HP, Pillen K. Mixed modelling for QTL × environment interaction analysis. Euphytica. 2004;137:147–53.
 51.
VanRaden PM, Tooker ME, Wright JR, Sun C, Hutchison JL. Comparison of singletrait to multitrait national evaluations for yield, health, and fertility. J Dairy Sci. 2014;97:7952–62.
 52.
Calus MP, Veerkamp RF. Accuracy of multitrait genomic selection using different methods. Genet Sel Evol. 2011;43:26.
 53.
Guo G, Zhao F, Wang Y, Zhang Y, Du L, Su G. Comparison of singletrait and multipletrait genomic prediction models. BMC Genet. 2014;15:30.
 54.
Oakey H, Cullis B, Thompson R, Comadran J, Halpin C, Waugh R. Genomic selection in multienvironment crop trials. G3 (Bethesda). 2016;6:1313–26.
 55.
Cuevas J, Crossa J, Soberanis V, PérezElizalde S, PérezRodríguez P, de los Campos G, et al. Genomic prediction of genotype × environment interaction kernel regression models. Plant Genome. 2016;9:3.
 56.
Cuevas J, Crossa J, MontesinosLópez OA, Burgueño J, PérezRodríguez P, de los Campos G. Bayesian genomic prediction with genotype × environment interaction kernel models. G3 (Bethesda). 2017;7:41–53.
 57.
Jennings DL. Breeding for resistance to African cassava mosaic disease: progress and prospects. In: Nestel, editor. African cassava mosaic. Report of an interdisciplinary workshop, 19–22 February 1976, Muguga; 1976.
 58.
Wolfe MD, Rabbi IY, Egesi C, Hamblin M, Kawuki R, Kulakow P, et al. Genomewide association and prediction reveals genetic architecture of Cassava mosaic disease resistance and prospects for rapid genetic improvement. Plant Genome. 2016;9:2.
 59.
Schaeffer LR. Sire and cow evaluation under multiple trait models. J Dairy Sci. 1984;67:1567–80.
Authors’ contributions
UGO designed, carried out study and drafted manuscript, DA provided statistical assistance and advice, JLJ supervised the study and revised the manuscript, IR and PK supervised data generation for this study and revised the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We acknowledge the Bill & Melinda Gates Foundation and UKaid (Grant 1048542; http://www.gatesfoundation.org) and support from the CGIAR Research Program on Roots, Tubers and Bananas (http://www.rtb.cgiar.org). We give special thanks to AGO Dixon for his development of many of the breeding lines and historical data we analyzed. Thanks also to AI Smith and technical teams at IITA for collection of other phenotypic data and to A Agbona and M Wolfe for data curation. We give special thanks to the editors and the anonymous reviewers who helped improve the quality of this paper.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
Phenotypic and genotypic data and analyses scripts used in this study can be downloaded via ftp at (ftp://ftp.cassavabase.org/manuscripts/).
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Affiliations
Corresponding author
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Okeke, U.G., Akdemir, D., Rabbi, I. et al. Accuracies of univariate and multivariate genomic prediction models in African cassava. Genet Sel Evol 49, 88 (2017). https://doi.org/10.1186/s127110170361y
Received:
Accepted:
Published:
Keywords
 Cassava Breeding
 Fivefold Crossvalidation Scheme
 Best Linear Unbiased Prediction (BLUP)
 BLUP Model
 International Institute Of Tropical Agriculture (IITA)