Use and optimization of different sources of information for genomic prediction

Ilska, Joanna J.; Meuwissen, Theo H. E.; Kranis, Andreas; Woolliams, John A.

doi:10.1186/s12711-017-0365-7

Research Article
Open access
Published: 11 December 2017

Genetics Selection Evolution volume 49, Article number: 90 (2017)

Abstract

Background

Molecular data is now commonly used to predict breeding values (BV). Various methods to calculate genomic relationship matrices (GRM) have been developed, with some studies proposing regression of coefficients back to the reference matrix of pedigree-based relationship coefficients (A). The objective was to compare the utility of two GRM: a matrix based on linkage analysis (LA) and anchored to the pedigree, i.e. ${\mathbf{G}}_{{{\mathbf{LA}}}}$, and a matrix based on linkage disequilibrium (LD), i.e. ${\mathbf{G}}_{{{\mathbf{LD}}}}$, using genomic and phenotypic data collected on 5416 broiler chickens. Furthermore, the effects of regressing the coefficients of ${\mathbf{G}}_{{{\mathbf{LD}}}}$ back to A (LDA) and to ${\mathbf{G}}_{{{\mathbf{LA}}}}$ (LDLA) were evaluated, using a range of weighting factors. The performance of the matrices and their composite products was assessed by the fit of the models to the data, and the empirical accuracy and bias of the BV that they predicted. The sensitivity to marker choice was examined by using two chips of equal density but including different single nucleotide polymorphisms (SNPs).

Results

The likelihood of models using GRM and composite matrices exceeded the likelihood of models based on pedigree alone and was highest with intermediate weighting factors for both the LDA and LDLA approaches. For these data, empirical accuracies were not strongly affected by the weighting factors, although they were highest when different sources of information were combined. The optimum weighting factors depended on the type of matrices used, as well as on the choice of SNPs from which the GRM were constructed. Prediction bias was strongly affected by the chip used and less by the form of the GRM.

Conclusions

Our findings provide an empirical comparison of the efficacy of pedigree and genomic predictions in broiler chickens and examine the effects of fitting GRM with coefficients regressed back to a reference anchored to the pedigree, either A or ${\mathbf{G}}_{{{\mathbf{LA}}}}$. For the analysed dataset, the best results were obtained when ${\mathbf{G}}_{{{\mathbf{LD}}}}$ was combined with relationships in A or ${\mathbf{G}}_{{{\mathbf{LA}}}}$, with optimum weighting factors that depended on the choice of SNPs used. The optimum weighting factor for broiler body weight differed from weighting factors that were based on the density of SNPs and theoretically derived using generalised assumptions.

Background

Thanks to recent advances in genomic technologies, increasing numbers of genotypes are being generated worldwide for many livestock species. A central use of these data in animal breeding is the calculation of estimated breeding values (EBV). As genomic data accumulate, these estimates are expected to become more accurate than those obtained using traditional methods based on best linear unbiased prediction (BLUP) [1], which use phenotype and pedigree information only [2].

In the pedigree-based BLUP methodology, the genetic (co)variances of the breeding values (BV) of individuals in the population are modelled by the numerator relationship matrix (${\mathbf{A}}$) scaled by the additive genetic variance. For genomic predictions, it is common to infer genomic relationships by using information on linkage disequilibrium (LD) from the identity-by-state (IBS) among individuals at marker loci [3]. The matrix of these observed relationships (${\mathbf{G}}$) offers more informed estimates of relationships among individuals than pedigree alone, with the added benefit of accounting for the different Mendelian sampling among siblings. To obtain genomic EBV, the expected relationship values of ${\mathbf{A}}$ are replaced by those of ${\mathbf{G}}$ in the mixed model equations, which is referred to as genomic BLUP (GBLUP).

Underlying GBLUP is the idea of an equivalent ridge regression model on the allele counts to exploit LD between markers and causative quantitative trait loci (QTL). A genomic relationship matrix that is fully based on LD (${\mathbf{G}}_{{{\mathbf{LD}}}}$) removes the assumption of an unrelated base population that is made when constructing ${\mathbf{A}}$ and implicitly traces relationships that precede those contained in the pedigree [4], and makes it feasible to obtain EBV without knowledge of the pedigree. However, unless the dataset is very large, the accuracy of the EBV obtained with the LD approach can deteriorate over relatively few generations, since LD is broken down by recombination during meiosis [5]. Since the underlying methodology of the LD approach is based on the association between markers and phenotypes, the choice of SNPs used as markers and their location may influence the results and efficacy of this approach.

A drawback of the LD-based approach is the imperfect linkage between markers and QTL, which can result in over-estimation of marker effects and sampling errors in the genomic relationship coefficients [6]. As such, it has been proposed that bias in relationship estimates may be alleviated by regressing the relationship coefficients of ${\mathbf{G}}$ towards the reference values in ${\mathbf{A}}$. VanRaden [3] proposed a deterministic way of deriving the optimum regression coefficient based on the number of markers available and suggested that, given a large enough number of markers, the optimum regression coefficient may be as large as 0.95, which represents only a small change to the values of ${\mathbf{G}}$.

Irrespective of their indirect effect on the trait, markers can provide invaluable information on the inheritance of chromosome segments, tracked from the base population down the pedigree, which can be used to form an identity-by-descent (IBD) matrix [7]. A linkage analysis (LA) approach combines the theoretical assumptions of ${\mathbf{A}}$ that individuals in the base population are unrelated with observed sharing of marker alleles among genotyped individuals. Therefore, a genomic relationship matrix constructed by using the LA-based approach (${\mathbf{G}}_{{{\mathbf{LA}}}}$) has a structure that is defined by families and assumes that genetic variants that are present in the base population are distinct, in spite of being IBS. Since the method uses markers to track recombinations in the genome, rather than associations with phenotypes, the choice of SNPs may not influence predictions from the LA approach to the same degree as those from the LD approach.

From the assumptions of these three approaches (pedigree, LD, and LA), it follows that the relationships among individuals in the base generation of the pedigree are the same for the pedigree- and LA-based approaches, while they have values that are estimated directly from the genotype data in the LD approach. Since each of these methods represents a different source of information for predicting EBV, a flexible approach that combines their contributions could provide for optimal use of genotypes and pedigree. Therefore, the objective of this study was to evaluate the performance of the ${\mathbf{A}}$, ${\mathbf{G}}_{{{\mathbf{LD}}}}$ and ${\mathbf{G}}_{{{\mathbf{LA}}}}$ matrices, as well as their composites, when fitted to (G)BLUP models for EBV of broiler chickens. The fit of the models to data was assessed by their likelihood, while the efficacy of predicting BV of selection candidates was evaluated using empirical accuracy and bias estimates. To assess the possible effect of the choice of SNPs on the performance of the tested methods, matrices were calculated on two different in silico chips.

Methods

Data

The dataset used in the analysis was provided by Aviagen Ltd and consisted of data on 5416 broiler chickens, 1089 males and 4327 females, over six generations. All animals came from a commercial pedigreed female-parent line of broiler breeders that had been closed for 30 years. As described elsewhere [8], the breeding objective was broadly defined and balanced across growth, efficiency, reproductive performance, welfare and health-related traits. A detailed description of the housing and husbandry conditions under which these animals were reared is given in [9]. The pedigree had a base population of 288 individuals and a total of 320 sires and 1132 dams, with an average number of offspring of 16 and 5 per sire and dam, respectively. The animals were assigned to contemporary groups of 193 hatch weeks (HW), with on average 26 individuals per HW. Of these, 1446 animals (sires and grand-sires) were genotyped at high density using the Affymetrix Axiom 600k chip [10], while their offspring were genotyped at low density (3k Illumina chip [11]) and imputed up to the 600k chip using AlphaImpute [12]. The accuracy of imputation was validated independently of this study, and was found to be greater than 0.97 (unpublished data). The phenotype used was juvenile body weight (BWT), which was recorded at 35 days of age on all animals of both sexes.

The population was split into a training population (TRN) and testing population (TST), consisting of 3146 and 2270 individuals, respectively. The TST individuals were offspring and siblings of individuals in the TRN and none of them had offspring with records included in the TRN. Phenotypes of TST individuals were masked when estimating variances and predicting breeding values and were later used to evaluate empirical accuracy and bias of predictions.

Quality control and choice of SNPs

The genotypes for the Affymetrix chip were assessed using quality control (QC) procedures within PLINK [13]. After QC, 431k SNPs (69% of total) with known chromosomal locations (based on the chicken genome assembly version 4, i.e. GalGal4) remained and were distributed across 27 chromosomes, including all macro-chromosomes (chromosomes 1 to 5), intermediate chromosomes (6 to 10), and 17 of the 28 micro-chromosomes. Table 1 summarises the SNPs that failed particular screening criteria.

Table 1 Quality control criteria and number of markers failing each criterion expressed as a percentage of the total 625,995 SNPs

From the 431k SNPs that passed QC, two in-silico chips were created, each with ~ 27k SNPs (1000 per chromosome): (1) a panel with near evenly spaced markers, i.e. the ESM chip, and (2) a panel with markers selected for their effect on the trait derived using genome-wide association (GWA), i.e. the GWAM chip, as described below. Since the number of SNPs per chromosome was kept constant in spite of large differences in map length, the density of SNPs on the micro-chromosomes was higher than on macro- and intermediate chromosomes.

ESM chip

SNPs on each chromosome were selected according to their linkage map spacing. The linkage map used was assembled from the accumulated Aviagen data. When multiple SNPs were available, those with a high minor allele frequency (MAF) were preferred. For micro-chromosomes 25, 26, 27 and 28, the number of SNPs selected was 888, 998, 998 and 991, respectively, and 1000 SNPs were selected on each of the other 23 chromosomes, resulting in 26,875 SNPs.

GWAM chip

SNPs were selected based on a GWA analysis of BWT conducted using only the TRN set, and carried out in PLINK [13]. SNPs were ranked for each chromosome according to their P value. The top 1000 SNPs on each of the 27 chromosomes were selected (resulting in 27,000 SNPs on the chip), irrespective of the threshold for genome-wide significance.
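The per-chromosome ranking described above can be sketched in a few lines of Python. This is only an illustration of the selection rule (smallest P values first, a fixed number per chromosome, no significance threshold); the function name `select_top_snps`, the tuple layout of the input and the toy P values are assumptions made for the example and do not reflect the PLINK output format or the authors' scripts.

```python
from collections import defaultdict


def select_top_snps(gwa_results, n_per_chromosome=1000):
    """Pick the SNPs with the smallest P values on each chromosome.

    gwa_results: iterable of (snp_id, chromosome, p_value) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_chrom = defaultdict(list)
    for snp_id, chrom, p_value in gwa_results:
        by_chrom[chrom].append((p_value, snp_id))

    selected = {}
    for chrom, hits in by_chrom.items():
        hits.sort()  # ascending P value; ties are broken by SNP id
        selected[chrom] = [snp_id for _, snp_id in hits[:n_per_chromosome]]
    return selected


# Toy GWA results (invented numbers), keeping two SNPs per chromosome.
toy = [("s1", 1, 0.20), ("s2", 1, 0.01), ("s3", 1, 0.05),
       ("s4", 2, 0.90), ("s5", 2, 0.30)]
top2 = select_top_snps(toy, n_per_chromosome=2)
```

Because no genome-wide threshold is applied, a chromosome with weak associations still contributes its full quota, as happens for chromosome 2 in the toy data.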

These two chips differed in the average MAF of the SNPs selected, as shown in Fig. 1, with the ESM chip favouring SNPs with a higher MAF and the GWAM chip favouring SNPs with a lower MAF. The distribution of the inter-marker intervals also differed between chips (Fig. 2).

Calculation of relationship matrices

Different relationship matrices were calculated for individuals in the total population (TST plus TRN). The numerator relationship matrix ${\mathbf{A}}$ was calculated using ASReml procedures [14]. A relationship matrix constructed using linkage analysis (${\mathbf{G}}_{{{\mathbf{LA}}}}$) was calculated using the linkage disequilibrium multi-locus iterative peeling method (LDMIP) described by Meuwissen et al. [7], with the elements of ${\mathbf{G}}_{{{\mathbf{LA}}}}$ obtained by averaging the relationships calculated for each locus. A relationship matrix based on linkage disequilibrium (${\mathbf{G}}_{{{\mathbf{LD}}}}$) was constructed using the GCTA software package [15] following Method 2 of VanRaden [3].
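To make the construction of ${\mathbf{G}}_{{{\mathbf{LD}}}}$ more concrete, the sketch below follows the general form of Method 2 of VanRaden [3], in which each SNP is centred on twice its allele frequency, scaled by its expected variance $2p(1-p)$, and the cross-products are averaged over SNPs. It is not the software used in the study: the function name `grm_vanraden_method2`, the toy genotypes and the use of allele frequencies estimated from the sample itself (rather than from a base population) are assumptions made for illustration.

```python
import numpy as np


def grm_vanraden_method2(genotypes):
    """LD-based relationship matrix in the style of VanRaden's Method 2.

    genotypes: (n_individuals, n_snps) array of allele counts coded 0/1/2.
    Allele frequencies are taken from the sample (an assumption of this sketch).
    """
    X = np.asarray(genotypes, dtype=float)
    p = X.mean(axis=0) / 2.0               # allele frequency per SNP
    keep = (p > 0.0) & (p < 1.0)           # drop monomorphic SNPs (zero variance)
    X, p = X[:, keep], p[keep]
    Z = X - 2.0 * p                        # centre each SNP on its expected count
    W = Z / np.sqrt(2.0 * p * (1.0 - p))   # scale by the SNP's expected SD
    return W @ W.T / X.shape[1]            # average cross-products over SNPs


# Toy genotypes: 4 individuals by 3 SNPs (invented values).
toy_genotypes = np.array([[0, 1, 2],
                          [1, 1, 0],
                          [2, 0, 1],
                          [1, 2, 1]])
G_ld = grm_vanraden_method2(toy_genotypes)
```

Scaling each SNP individually gives rare alleles more weight than a single overall scaling would, which is one reason the MAF profile of a chip can influence the resulting matrix.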

Composite relationship matrices

As each of the above matrices uses a different source of information, integrating such information may maximize the benefit of using SNP genotypes [3, 6, 14]. Integration of relationships was done by weighting the relationship coefficients from two matrices ${\mathbf{M}}_{1}$ and ${\mathbf{M}}_{2}$ according to:

$${\mathbf{M}} =\uplambda{\mathbf{M}}_{1} + \left( {1 -\uplambda} \right){\mathbf{M}}_{2} .$$

Two types of integration were considered: LDA, where ${\mathbf{M}}_{1} = {\mathbf{G}}_{{{\mathbf{LD}}}}$ and ${\mathbf{M}}_{2} = {\mathbf{A}}$, and LDLA where ${\mathbf{M}}_{1} = {\mathbf{G}}_{{{\mathbf{LD}}}}$ and ${\mathbf{M}}_{2} = {\mathbf{G}}_{{{\mathbf{LA}}}}$. The optimum weighting factor was found by incrementing $\uplambda$ from 0 to 1, in 0.1 steps. Options with $\uplambda = 1$ always represented information obtained only from ${\mathbf{G}}_{{{\mathbf{LD}}}}$, while $\uplambda = 0$ sourced all information from ${\mathbf{A}}$ in LDA or from ${\mathbf{G}}_{{{\mathbf{LA}}}}$ in LDLA.
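The weighting scheme is simple enough to sketch directly. The snippet below builds the sequence of composite matrices ${\mathbf{M}} = \uplambda {\mathbf{M}}_{1} + (1 - \uplambda){\mathbf{M}}_{2}$ on a grid of 11 weights from 0 to 1; the function name `composite_matrices` and the 2 x 2 toy matrices are hypothetical and stand in for ${\mathbf{G}}_{{{\mathbf{LD}}}}$ and either ${\mathbf{A}}$ or ${\mathbf{G}}_{{{\mathbf{LA}}}}$.

```python
import numpy as np


def composite_matrices(M1, M2, n_steps=10):
    """Return [(lambda, lambda*M1 + (1-lambda)*M2)] for lambda = 0, 1/n, ..., 1."""
    out = []
    for k in range(n_steps + 1):
        lam = k / n_steps
        out.append((lam, lam * M1 + (1.0 - lam) * M2))
    return out


# Toy stand-ins: M1 plays the role of G_LD, M2 the role of A (invented values).
M1 = np.array([[1.02, 0.48], [0.48, 0.97]])
M2 = np.array([[1.00, 0.50], [0.50, 1.00]])
sequence = composite_matrices(M1, M2)
```

The first element of the sequence ($\uplambda = 0$) reproduces ${\mathbf{M}}_{2}$ and the last ($\uplambda = 1$) reproduces ${\mathbf{M}}_{1}$, matching the end points described above.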

Prediction of breeding values

Linear mixed models were fitted to the TRN data using all relationship matrices described above: 11 in the LDA sequence and 11 in the LDLA sequence, with ${\mathbf{G}}_{{{\mathbf{LD}}}}$ common to both sequences. The mixed linear models (MLM) were fitted using ASReml [14] as follows:

$${\mathbf{y}} = {\mathbf{X}}{\varvec{\uptau}} + {\mathbf{Zu}} + {\mathbf{e}},$$

where ${\mathbf{y}}$ denotes the vector of observations, ${\varvec{\uptau}}$ the vector of estimates for the fixed effects of hatch week and sex, with the design matrix ${\mathbf{X}}$; ${\mathbf{u}}$ is the vector of breeding values with the design matrix ${\mathbf{Z}}$; and ${\mathbf{e}}$ is the vector of residual environmental effects. The breeding values ${\mathbf{u}}$ were assumed to be random and distributed as $MVN\left( {0,V_{A} {\mathbf{M}}} \right)$, where $V_{A}$ is the additive genetic variance and ${\mathbf{M}}$ is a relationship matrix as described above. The residual effects were assumed to be random and distributed $MVN\left( {0,V_{E} {\mathbf{I}}} \right)$, where $V_{E}$ is the residual variance and ${\mathbf{I}}$ is an identity matrix.
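For readers less familiar with this model, the following sketch solves Henderson's mixed model equations for the model above when the variances are treated as known. This is a simplification: in the study the variances were estimated with ASReml, whereas here `var_a` and `var_e` are simply supplied. The function name `solve_mme` and the three-animal example (two unrelated parents with records and one offspring whose phenotype is masked) are assumptions for illustration.

```python
import numpy as np


def solve_mme(y, X, Z, M, var_a, var_e):
    """Solve Henderson's mixed model equations for y = X tau + Z u + e.

    u ~ MVN(0, var_a * M), e ~ MVN(0, var_e * I); variances are assumed known.
    Returns (tau_hat, u_hat).
    """
    k = var_e / var_a                       # variance ratio V_E / V_A
    M_inv = np.linalg.inv(M)
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + k * M_inv]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    n_fixed = X.shape[1]
    return sol[:n_fixed], sol[n_fixed:]


# Animals 1 and 2 are unrelated parents with records; animal 3 is their
# offspring and has no record (its row is absent from Z).
M = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
y = np.array([10.0, 12.0])
X = np.ones((2, 1))                         # overall mean as the only fixed effect
Z = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
tau_hat, u_hat = solve_mme(y, X, Z, M, var_a=1.0, var_e=2.0)
```

In this toy case the masked offspring receives the average of its parents' EBV, which shows how individuals without phenotypes obtain predictions purely through the relationship matrix.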

Log-likelihoods were used to compare the fit of the models to the TRN data through the log-likelihood ratio test. The log-likelihood profile was calculated for both LDA and LDLA as a function of $\uplambda$. From these profiles, the value of $\uplambda$ at the peak was obtained, together with the corresponding 95% support intervals. The latter were calculated as the interval for which twice the drop in log-likelihood from the peak value was less than 3.84, i.e. twice the difference in log-likelihood was smaller than the critical value of a chi-squared distribution with 1 degree of freedom.
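The support interval rule can be illustrated on a grid of $\uplambda$ values. The sketch below keeps every grid point whose log-likelihood lies within 3.84 / 2 = 1.92 units of the peak; the function name `support_interval` and the log-likelihood values are invented for the example and are not results from the study. It works on the grid only, without interpolating between grid points, and assumes a single-peaked profile.

```python
def support_interval(lambdas, loglik, critical_value=3.84):
    """Peak of a log-likelihood profile and its approximate 95% support interval.

    A grid point is inside the interval when twice its drop in log-likelihood
    from the peak is below the chi-squared critical value (1 df, 5% level).
    """
    peak = max(loglik)
    peak_lambda = lambdas[loglik.index(peak)]
    inside = [lam for lam, ll in zip(lambdas, loglik)
              if 2.0 * (peak - ll) < critical_value]
    return peak_lambda, min(inside), max(inside)


# Invented profile on the grid 0, 0.1, ..., 1 (not data from the paper).
grid = [k / 10 for k in range(11)]
profile = [-105.0, -103.0, -101.5, -100.6, -100.0, -100.2,
           -100.9, -101.8, -103.1, -104.9, -107.0]
peak_lambda, lower, upper = support_interval(grid, profile)
```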

Empirical accuracy and bias

The practical application of genomic evaluation depends on the empirical accuracy of BV prediction and its bias. These were assessed using the TST set, with phenotypic records masked in BV prediction and revealed for the calculation of empirical accuracy and bias. The empirical accuracy was calculated from the residual correlations of the predicted breeding values (EBV) with the phenotypes, after fitting a linear model to both EBV and phenotype to account for the fixed effects of hatch week and sex. This linear model was fitted separately to the EBV and phenotype in the TST data only, using the GenStat software [16]. To approximate the empirical accuracy of prediction for the breeding value, the residual correlation was divided by the square root of the estimate of heritability ($h^{2}$) for BWT. The same value of $h^{2}$ = 0.35 was assumed throughout and was obtained by using ${\mathbf{A}}$ with the TRN set. This estimate is consistent with published estimates for BWT [17].
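A minimal sketch of this calculation is given below, assuming the fixed effects are supplied as a design matrix. The EBV and the phenotype are each residualised on the same design by ordinary least squares, the residuals are correlated, and the correlation is divided by $\sqrt{h^{2}}$. The function names, the toy data and the use of NumPy rather than GenStat are assumptions for illustration; only the default $h^{2}$ = 0.35 is taken from the text.

```python
import numpy as np


def residualise(values, design):
    """Residuals of an ordinary least-squares fit of values on design."""
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def empirical_accuracy(ebv, phenotype, design, h2=0.35):
    """Residual correlation of EBV and phenotype, divided by sqrt(h2)."""
    r = np.corrcoef(residualise(ebv, design),
                    residualise(phenotype, design))[0, 1]
    return r / np.sqrt(h2)


# Toy validation set (invented values): intercept plus a sex indicator.
sex = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0])
design = np.column_stack([np.ones(6), sex])
ebv = np.array([-1.0, 0.5, 0.2, -0.4, 1.1, 0.3])
bwt = np.array([1.9, 2.3, 2.6, 2.2, 2.4, 2.9])
accuracy = empirical_accuracy(ebv, bwt, design)
```

Dividing by $\sqrt{h^{2}}$ rescales the correlation with the phenotype to an approximate correlation with the true breeding value, because the phenotype is only a noisy measure of the BV.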

The bias was estimated by the regression coefficient of BWT in the TST set on EBV in a fixed linear model that included the fixed effects of sex and hatch week. A regression coefficient of 1 is consistent with no bias, since a difference in EBV between two individuals is then an unbiased prediction of the difference in their true BV, and consequently in their phenotypes (given the basic assumption that the phenotype is the sum of the BV and other terms independent of the BV). Regression coefficients greater than 1 and less than 1 are indicative of under- and over-prediction of differences in BV, respectively.
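The bias statistic can be sketched as the coefficient on EBV in an ordinary least-squares fit that also contains the fixed effects. The function name `bias_slope` and the toy data are assumptions for illustration; the phenotypes below are constructed without noise so that the slope is 0.8 by design, which corresponds to over-prediction of differences in BV.

```python
import numpy as np


def bias_slope(ebv, phenotype, fixed_design):
    """Regression coefficient of phenotype on EBV, adjusted for fixed effects."""
    design = np.column_stack([fixed_design, ebv])   # EBV is the last column
    coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return coef[-1]


# Toy validation set (invented values): intercept plus a sex indicator.
sex = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0])
fixed = np.column_stack([np.ones(6), sex])
ebv = np.array([-1.0, 0.5, 0.2, -0.4, 1.1, 0.3])
# Noise-free phenotypes built with a slope of 0.8 on EBV.
bwt = 3.0 + 2.0 * sex + 0.8 * ebv
slope = bias_slope(ebv, bwt, fixed)
```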

Results

Although the two types of chips used for creating the relationship matrices between individuals consisted of approximately the same number of SNPs, the choice of the SNPs changed the profile of the likelihood for models fitted to the data. To facilitate presentation of results, the findings from analyses carried out using the ESM chip are presented first, followed by results obtained from analyses using the GWAM chip. The comparison between the chips is provided at the end of this section.