- Open Access
Enlarging a training set for genomic selection by imputation of un-genotyped animals in populations of varying genetic architecture
© Pimentel et al.; licensee BioMed Central Ltd. 2013
- Received: 7 September 2012
- Accepted: 24 March 2013
- Published: 26 April 2013
The most common application of imputation is to infer genotypes of a high-density panel of markers on animals that are genotyped for a low-density panel. However, the increase in accuracy of genomic predictions resulting from an increase in the number of markers tends to reach a plateau beyond a certain density. Another application of imputation is to increase the size of the training set with un-genotyped animals. This strategy can be particularly successful when a set of closely related individuals is genotyped.
Imputation on completely un-genotyped dams was performed using known genotypes from the sire of each dam, one offspring and the offspring’s sire. Two methods were applied based on either allele or haplotype frequencies to infer genotypes at ambiguous loci. Results of these methods and of two available software packages were compared. Quality of imputation under different population structures was assessed. The impact of using imputed dams to enlarge training sets on the accuracy of genomic predictions was evaluated for different populations, heritabilities and sizes of training sets.
Imputation accuracy ranged from 0.52 to 0.93 depending on the population structure and the method used. The method that used allele frequencies performed better than the method based on haplotype frequencies. Accuracy of imputation was higher for populations with higher levels of linkage disequilibrium and with larger proportions of markers with more extreme allele frequencies. Inclusion of imputed dams in the training set increased the accuracy of genomic predictions. Gains in accuracy ranged from close to zero to 37.14%, depending on the simulated scenario. Generally, the larger the accuracy already obtained with the genotyped training set, the lower the increase in accuracy achieved by adding imputed dams.
Whenever a reference population resembling the family configuration considered here is available, imputation can be used to achieve an extra increase in accuracy of genomic predictions by enlarging the training set with completely un-genotyped dams. This strategy was shown to be particularly useful for populations with lower levels of linkage disequilibrium, for genomic selection on traits with low heritability, and for species or breeds for which the size of the reference population is limited.
- Linkage Disequilibrium
- Genomic Selection
- Genomic Prediction
- Estimate Breeding Value
- Imputation Accuracy
Prediction of breeding values of animals using genomic information was proposed by Meuwissen et al. and since then the way breeding programs of livestock are conducted has changed considerably. Due to recent advances in genotyping technologies, the amount of genomic information available for genomic selection (GS) has increased from a few thousand to 50k and 800k single nucleotide polymorphism (SNP) markers and today tends towards whole-genome sequence. The population structures observed in many livestock species are often characterized by large full- and half-sib families, and by the presence of animals (especially males) with a very large number of progeny. These conditions make it possible to infer the genotype of an un-genotyped individual using genomic information from its family members, which is usually referred to as pedigree-based imputation. High levels and extents of linkage disequilibrium (LD) have been reported in livestock populations, such as cattle, sheep, chickens, pigs and horses. The presence of high LD between markers can be used to infer the genotype at an un-genotyped locus based on available genotypes at neighbouring markers, which is usually referred to as population-based imputation. Such features make it possible to impute genotypes at untyped markers in a larger panel of markers from genotypes obtained with a smaller panel. In order to reduce genotyping costs, much effort has been put into developing methods and software to impute genotypes at high-density chips from animals genotyped at low-density chips [11–15].
Accuracy of imputation may vary depending on the source of information being used to infer the genotypes and also on population structures. Hayes et al. investigated the success of imputation from 5 k to 50 k genotypes in four sheep breeds and reported accuracies ranging from 71 to 80% depending on the breed. Erbe et al. used the software BEAGLE without pedigree information to impute genotypes at 800 k SNPs from dairy bulls genotyped at 50 k and reported accuracies of imputation (defined as the proportion of correctly imputed genotypes) of 0.96 in Jersey and 0.98 in Holstein cattle. Meuwissen and Goddard applied a method for imputing whole-sequence genotypes on individuals genotyped at a low density panel and reported that 10% of the missing genotypes were erroneously imputed.
In principle, an increase in marker density should result in higher LD between the markers and the quantitative trait loci underlying a given trait, and consequently in more accurate genomic predictions. However, the advantage of using a high-density panel for GS compared to a low-density panel depends on which markers are included in the low-density panel. Such a formulation can be interpreted in terms of variable selection in a linear model, which has been a topic of frequent research aiming at reducing over-parameterisation in statistical models for GS [19, 20], as well as making the implementation of a genomic breeding program more cost-effective . Based on simulation analyses, Habier et al.  showed that low-density marker panels could be used in GS with a limited loss in accuracy compared to that achieved with high-density panels.
According to a study using dairy cattle data by Weigel et al. , moving from a set of 300 markers to a set of 2000 markers represented a gain in accuracy of ~30% or ~113%, depending on how the subsets of markers were selected (with largest effects or equally spaced). When moving from 2000 to 32 518 markers, gains in accuracy were only ~8% or ~13%. There is further empirical evidence that the relationship between gain in accuracy and increase in marker density tends to reach a plateau. VanRaden et al.  reported an average difference in accuracy of only 0.4% between predictions from a 50 k and a high-density (777 k) chip. As suggested by the results from VanRaden et al. , an increase in the number of animals in the training set should be more effective for improving the accuracy of genomic predictions than increasing the number of markers, especially when there is evidence that the benefit from increasing density tends to reach a plateau.
Many of the studies done with imputation so far have focused on the increase in density of marker panels through imputation and its impact on accuracy of genomic predictions. Results from Weigel et al. in Jersey cattle indicated that if a suitable reference population genotyped with a 50 k chip is available, genotyping selection candidates with a 3 k instead of a 50 k chip and then imputing the remaining genotypes would result in a loss of predictive ability of only 5%. Dassonneville et al. also studied the effect of genotyping selection candidates either with a 50 k or with a 3 k chip followed by imputation and reported losses in reliability ranging from 0.02 to 0.06 in Holstein cattle. Erbe et al. used dairy cattle data to investigate the impact on the accuracy of genomic predictions of an increase in marker density from 50 k to 800 k through imputation, and reported an average gain in accuracy of 0.01 in Holsteins and 0.03 in Jersey cattle.
Imputation can be used to increase the number of markers. However, the benefit is expected to reach a plateau beyond a certain density. Imputation can also be used to increase the size of the training set with animals that were not genotyped at all. Cleveland et al. investigated the impact of imputation on genomic predictions, and compared a training set of fully genotyped males and females with a training set in which only males were genotyped and females were imputed. An alternative and interesting analysis would be to compare the accuracy achieved in a training set with only genotyped males to that achieved with a training set containing the imputed females as well. A situation somewhat similar to that was investigated by Pszczola et al., who compared a training set of genotyped bulls with a training set enlarged by imputed bulls. They used an additive relationship matrix relating genotyped to un-genotyped bulls to perform the imputation and reported an accuracy of imputation of 0.59, but the inclusion of the imputed bulls in training did not increase the accuracy of genomic prediction. This may be explained by the fact that the un-genotyped (imputed) bulls in their population had no offspring and the highest degree of relationship between them and the genotyped bulls was half-sib or parent-offspring. Imputation may be improved if the un-genotyped individuals to be imputed are defined in a specific design such that genotypes can be inferred with higher probabilities. For instance, imputation is likely to be more accurate when genotyped close relatives are available. In some applications of GS, this may occur naturally, for example when a training set is created for GS on traits that are expressed only in females. This is the case for new traits in dairy cattle for which the cows’ phenotypes are difficult to measure and/or for which accurate conventionally estimated breeding values of bulls are not yet available as an alternative response variable.
In most livestock species, the number of males used for breeding is usually limited, thus when a reference population of males is in an advanced stage (as in the dairy industry, for example) most of the intensively used breeding males have probably been already genotyped. We consider a situation where a reference population of females is created and all or most of their sires and maternal grandsires have been already genotyped. In such cases, there is a considerable amount of family information available that can be used to try to infer the genotypes of the dams of these females. The configuration of the genotyped family members in this specific design should allow a much better quality of imputation of the un-genotyped dams than when performing imputation in a general framework on subjects from a pedigree with variable levels of relationships to the genotyped individuals. Although specific, this design is relevant since it will naturally arise in all future applications of genomic selection for new phenotypes.
The objectives of this work were: (1) to investigate the performance of two imputation methods for a completely un-genotyped dam, using the information on its genotyped family members and the mating partner plus the estimates of either allele or haplotype frequencies; (2) to investigate the effects of different population structures, levels of LD and distribution of allele frequencies on the success of imputation; and (3) to evaluate the impact of enlarging a training set with imputed dams on the accuracy of genomic predictions for different populations, levels of heritability (h2) of the trait under selection and sizes of training sets already available.
The second imputation procedure is done in two stages and therefore will be referred to as the Two_Step method. In a first step, only the Dam genotypes that can be inferred with probability 1 are assigned (see Additional file 1: Table S1) and whenever the probability is lower than 1, the Dam genotype is set to missing. In a second step, the genotyping data from the Dam, containing assigned and missing genotypes, are combined with all available genotyping data from the MGS, Offspring and Sire, and missing genotypes are filled in using LD information. The second step was carried out using the software fastPHASE  for haplotype reconstruction and inference of missing genotypes.
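The first step can be illustrated by enumerating, for a single locus, the Dam genotypes that are compatible with Mendelian inheritance given the genotypes of the MGS, Sire and Offspring. The sketch below is an assumption about how such a rule table could be generated and does not reproduce Additional file 1: Table S1. Genotypes are coded as the number of copies of allele 2, the un-genotyped maternal granddam is allowed to transmit either allele, and the function names are hypothetical.

```python
def gametes(genotype):
    """Alleles an animal can transmit: 0 stands for allele 1, 1 for allele 2."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def consistent_dam_genotypes(mgs, sire, offspring):
    """Return the set of Dam genotypes compatible with the three relatives."""
    possible = set()
    for from_mgs in gametes(mgs):          # allele the Dam inherited from the MGS
        for from_granddam in (0, 1):       # un-genotyped granddam: either allele
            dam = from_mgs + from_granddam
            for dam_gamete in gametes(dam):
                for sire_gamete in gametes(sire):
                    if dam_gamete + sire_gamete == offspring:
                        possible.add(dam)
    return possible


def first_step_genotype(mgs, sire, offspring):
    """Assign the Dam genotype only if it is the single compatible candidate.

    Returns None (missing) when several candidates remain or when the
    observed genotypes are not compatible with Mendelian inheritance.
    """
    candidates = consistent_dam_genotypes(mgs, sire, offspring)
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


if __name__ == "__main__":
    # MGS homozygous for allele 1, Offspring homozygous for allele 2:
    # the Dam must carry one copy of each allele.
    print(first_step_genotype(mgs=0, sire=1, offspring=2))
    # All three relatives homozygous for allele 1: the Dam stays missing.
    print(first_step_genotype(mgs=0, sire=0, offspring=0))
```

A genotype is assigned only when exactly one candidate remains, which corresponds to a probability of 1 as long as both alleles segregate in the population; every other locus is left missing for the second step.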
To assess the efficiency of the two methods described above, imputation of Dam genotypes was also performed using two currently available imputation programs: findhap.f90 Version 2  and AlphaImpute Beta 1.16 .
For the comparison of imputation methods, genomic data were simulated using the software QMSim . The simulated genome consisted of one chromosome of 100 cM, on which 2000 bi-allelic markers (coded as alleles 1 and 2) were randomly allocated. Marker allele frequencies in the first historical generation were set equal to 0.5 and the mutation rate was set to 2.5e-5. In order to generate different genomic structures that may influence the success of imputation, four populations were simulated, which differed in the level of LD and the presence or absence of selection. The increase in the level of LD desired for two of the populations was induced by simulating a bottleneck in the historical population. Therefore, the four scenarios were created as follows: no bottleneck and no selection (LowLD_NoSel), no bottleneck and selection (LowLD_Sel), bottleneck and no selection (HighLD_NoSel), and bottleneck and selection (HighLD_Sel). For each of the four scenarios, 10 replicates were simulated.
To generate a minimum level of LD for the two scenarios without bottleneck, a historical population of 4000 animals was mated at random for 1600 discrete generations, without selection, without migration and with an equal number of animals from both genders. Then the population size was increased to 4040 in the following 20 generations and kept at a constant size for 20 additional generations. For the two scenarios with bottleneck, the historical population was initially set to 2000 animals and mated at random for 2500 generations. After this, a bottleneck was simulated by gradually decreasing the population size to 200 animals over the following 70 generations; these 200 animals were further mated at random for 10 generations. The population size was then gradually expanded from 200 to 4040 animals within the next 20 generations, and remained at a size of 4040 for 20 additional generations. In all four scenarios, population size was 4040 in the last historical generation, which included 40 males.
Starting with the 4000 female and 40 male founders from the last historical generation, 10 additional generations were simulated to form the recent population. In the recent population, the proportion of male offspring was 0.5, litter size was 1, a random mating design was applied and replacement ratios for sires and dams were 0.5 and 0.25, respectively. These parameters were common to all four scenarios. For the two scenarios without selection, a random selection design was used and the culling design was based on the age of the animal. For the two scenarios with selection, both selection and culling designs were based on estimated breeding values (EBV). These EBV were obtained by solving Henderson’s mixed model equations  using pedigree information and phenotypic records from a trait with h2 = 0.20. Since the proportions of female and male offspring were identical, the last generation of the recent population contained 2000 female offspring. Genotype imputation was then performed on the dams of these 2000 female offspring from the last generation.
To investigate the impact of imputation on the accuracy of genomic predictions, the size of the training set used for SNP effect estimation is a relevant parameter. For that purpose, the same simulation procedures described above for the four scenarios were applied again in another simulation, in which a larger population was generated at the end. Instead of using a size of 4040 for the last historical generation, the number of female founders was set to 32 000 so that 16 000 female offspring in the last generation were available for the imputation of their dams. As above, 10 replicates of each scenario were simulated for the larger populations.
Assessment of LD in the simulated populations
Outputs from QMSim included information about the paternal and maternal alleles of each locus, which allowed the determination of linkage phase and the calculation of haplotype frequencies. The level of LD in the four simulated scenarios could then be assessed by calculating the squared correlation coefficient (r2) between each pair of markers in the last generation. To minimize the influence of the minor allele frequency (MAF) on the measure of LD, r2 values were computed only for pairs of markers with a MAF greater than 0.05. The decay of LD with increasing inter-marker distances was also assessed by calculating the mean r2 within bins of inter-marker distances.
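The LD assessment described above can be sketched as follows. This is a minimal illustration and not the authors' code: haplotypes are assumed to be available as 0/1 indicators of allele 2 per marker, positions are assumed to be in the same unit as the bin width, and all function names are hypothetical.

```python
def minor_allele_frequency(column):
    """MAF of one marker from its 0/1 haplotype alleles."""
    p = sum(column) / len(column)
    return min(p, 1.0 - p)


def r_squared(hap_a, hap_b):
    """Squared correlation between the alleles of two markers across haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def mean_r2_by_distance(markers, positions, bin_width, min_maf=0.05):
    """Mean r2 per distance bin, using only markers with MAF above min_maf.

    markers[i] holds the alleles of marker i on every haplotype;
    bin k covers distances in [k * bin_width, (k + 1) * bin_width).
    """
    kept = [(pos, col) for pos, col in zip(positions, markers)
            if minor_allele_frequency(col) > min_maf]
    sums, counts = {}, {}
    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            k = int(abs(kept[j][0] - kept[i][0]) // bin_width)
            sums[k] = sums.get(k, 0.0) + r_squared(kept[i][1], kept[j][1])
            counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}
```

The MAF filter also guarantees that the denominator of r2 is never zero, because monomorphic markers are dropped before any pair is evaluated.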
Prediction of genomic breeding values
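The symbol definitions in the next paragraph refer to a system of equations that is not reproduced in this text. A ridge-regression (SNP-BLUP) system that is consistent with those definitions is given here as an assumption, not as the authors' exact display:

```latex
\begin{bmatrix}
\boldsymbol{\iota}'\boldsymbol{\iota} & \boldsymbol{\iota}'\mathbf{X} \\
\mathbf{X}'\boldsymbol{\iota} & \mathbf{X}'\mathbf{X} + \mathbf{I}\phi
\end{bmatrix}
\begin{bmatrix}
\hat{\mu} \\
\hat{\boldsymbol{\alpha}}
\end{bmatrix}
=
\begin{bmatrix}
\boldsymbol{\iota}'\mathbf{y} \\
\mathbf{X}'\mathbf{y}
\end{bmatrix}
```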
where μ is an overall mean; α is the vector of allele substitution effects; ι is a vector of ones, of order equal to the number of animals in the training set; X is the matrix of SNP genotypes, coded as the number of copies (or dosage) of allele 2, of the animals in the training set; y is the vector of phenotypes; I is an identity matrix of order equal to the number of markers and ϕ is an assumed ratio of residual to marker variances. This ratio of variances was calculated using the simulated h2 values and assuming a marker variance equal to the additive variance divided by the number of markers. For each scenario and replicate, only markers with a MAF greater than 0.05 were used in the estimation of SNP effects. Genomic breeding values (GEBV) were then predicted as Zα̂, where α̂ is the vector of estimated allele substitution effects and Z is the matrix of SNP genotypes, coded as the number of copies of allele 2, of the animals in the validation set. Accuracy of genomic evaluation was calculated as the correlation between GEBV and the simulated true breeding values of the animals in the validation set.
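The variance ratio, the prediction step and the accuracy measure reduce to short calculations. The sketch below assumes that the phenotypic variance is the sum of additive and residual variances, so that the ratio of residual to additive variance equals (1 − h2)/h2 and ϕ becomes the number of markers times that ratio. Function names and example inputs are illustrative and are not taken from the authors' software.

```python
from math import sqrt


def variance_ratio(h2, n_markers):
    """phi = residual variance / marker variance.

    Marker variance is taken as additive variance / n_markers, hence
    phi = n_markers * (1 - h2) / h2.
    """
    if not 0.0 < h2 < 1.0:
        raise ValueError("h2 must lie strictly between 0 and 1")
    return n_markers * (1.0 - h2) / h2


def predict_gebv(z, alpha_hat):
    """GEBV of validation animals: genotype dosages times estimated SNP effects."""
    return [sum(dose * effect for dose, effect in zip(row, alpha_hat)) for row in z]


def accuracy(gebv, true_bv):
    """Pearson correlation between GEBV and true breeding values."""
    n = len(gebv)
    mean_g = sum(gebv) / n
    mean_t = sum(true_bv) / n
    cov = sum((g - mean_g) * (t - mean_t) for g, t in zip(gebv, true_bv))
    var_g = sum((g - mean_g) ** 2 for g in gebv)
    var_t = sum((t - mean_t) ** 2 for t in true_bv)
    return cov / sqrt(var_g * var_t)


if __name__ == "__main__":
    # Illustrative inputs only: h2 = 0.05 and 2000 markers.
    print(variance_ratio(0.05, 2000))
```

Lower heritability gives a larger ϕ and therefore stronger shrinkage of the estimated SNP effects.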
LD and distribution of allele frequencies in the simulated populations
Table: Mean linkage disequilibrium (r2) within different inter-marker distances (kb) in the simulated populations used for the comparison of imputation methods (table values not available in this text).
The different population structures simulated in the four scenarios not only affected the pattern of LD, but also caused different shapes of the distribution of allele frequencies. Histograms of the frequencies of allele 2 for all replicates of the four scenarios are provided in Additional file 2: Figure S2. In the LowLD_NoSel scenario (the one with the lowest level of LD), the distribution of allele frequencies was bell-shaped, with a much higher frequency of markers with intermediate allele frequencies compared to markers with extreme allele frequencies. In the HighLD_NoSel scenario, the distribution was more uniform, with a slightly higher frequency of markers with extreme allele frequencies. Selection caused a higher frequency of markers with extreme allele frequencies, especially in the scenario HighLD_Sel. Variability in the distributions across replicates was large in the scenarios LowLD_Sel and HighLD_Sel, whilst a very uniform pattern was observed in the LowLD_NoSel and HighLD_NoSel scenarios.
The level of LD directly affects the performance of the Two_Step method, since information on haplotype frequencies is used by fastPHASE to impute the missing genotypes. The Single_Step method does not use LD information but its performance will be affected by the different shapes of the distribution of allele frequencies, since the genotypes of markers with more extreme allele frequencies are easier to impute.
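The effect of allele frequency on ambiguous loci can be made concrete with a small probability calculation. The sketch below is not the authors' Single_Step implementation. It assumes Mendelian transmission with probability 0.5 per parental allele and treats the allele coming from the un-genotyped maternal granddam as a random draw with the population frequency of allele 2; the function names are hypothetical.

```python
def transmission_probs(genotype):
    """Probability of transmitting allele 1 (key 0) or allele 2 (key 1)."""
    return {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}, 2: {1: 1.0}}[genotype]


def dam_posterior(mgs, sire, offspring, freq_allele2):
    """Posterior probability of each Dam genotype (0, 1 or 2 copies of allele 2)."""
    joint = {0: 0.0, 1: 0.0, 2: 0.0}
    granddam = ((0, 1.0 - freq_allele2), (1, freq_allele2))
    for from_mgs, p_mgs in transmission_probs(mgs).items():
        for from_granddam, p_granddam in granddam:
            dam = from_mgs + from_granddam
            for dam_gamete, p_dam in transmission_probs(dam).items():
                for sire_gamete, p_sire in transmission_probs(sire).items():
                    if dam_gamete + sire_gamete == offspring:
                        joint[dam] += p_mgs * p_granddam * p_dam * p_sire
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("genotypes are not compatible with Mendelian inheritance")
    return {genotype: value / total for genotype, value in joint.items()}


if __name__ == "__main__":
    # Same ambiguous family configuration, two allele frequencies.
    print(dam_posterior(mgs=0, sire=0, offspring=0, freq_allele2=0.5))
    print(dam_posterior(mgs=0, sire=0, offspring=0, freq_allele2=0.1))
```

In this configuration the most probable Dam genotype has a posterior probability of 2/3 at an allele frequency of 0.5 and of about 0.95 at a frequency of 0.1, which illustrates why markers with more extreme allele frequencies are easier to impute.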
Quality of imputation between scenarios
Proportion of correctly imputed genotypes of the Dams for two imputation methods
Method         LowLD_NoSel     LowLD_Sel       HighLD_NoSel    HighLD_Sel
Single_Step    0.70 ± 0.003    0.77 ± 0.045    0.81 ± 0.005    0.85 ± 0.019
Two_Step       0.60 ± 0.004    0.71 ± 0.056    0.75 ± 0.005    0.80 ± 0.021
The average success rate for both methods of imputation for Dams that had more than 300 unambiguously inferred genotypes in the HighLD_Sel scenario was ~0.92 (Figure 2). This proportion is similar to what one would expect to achieve when moving from a low density to a higher density panel of markers e.g. [35, 36]. In both approaches described here, this level of success rate could be achieved for completely un-genotyped Dams.
The Two_Step method could be compared to imputation from low to high density (e.g., 3k to 50k), in which first a ‘low density chip’ is built based on the unambiguous cases and then the rest is filled in with LD information. However, three main differences must be pointed out: (i) the Two_Step method starts from completely un-genotyped animals; (ii) after the first step, Dams have genotypes for a ‘low density chip’ but with a different chip for each Dam and not a set of evenly spaced markers common to all Dams; and (iii) information on the genotyped relatives is used only in the first step, which means that after the ‘low density chip’ is built the only information available for imputation is LD, whereas in a low to high density approach, one would still have the possibility of using family information. Obviously, if a low density panel of SNP was also available for these Dams, the average success rate would be even greater, but at the cost of genotyping the Dams for the low density chip. Inspection of the number of genotypes which can be imputed unambiguously may provide an approximate estimate of the expected success rate that may be achieved by imputation. Such an estimate could then be used as an aid to choose the Dams to be genotyped with a low density chip. In the case of a group of Dams, for which say 10 or 15% of the loci can be unambiguously inferred from family information alone, one could choose to leave them completely un-genotyped and do imputation with the Single_Step method. Knowledge about the population structure under consideration (e.g., level of LD and distribution of allele frequencies) would also be required in such a decision process. In order to account for that, simple experiments (e.g., genotyping and imputing a small number of Dams) could be conducted to empirically estimate the expected success rate for Dams with a given number of loci inferred with a probability of one.
One aspect of the imputation procedures proposed here is that genotypic information is assumed to be available on a specific set of animals (Figure 1), including one offspring. These methods can, however, be extended to situations in which a number of genotyped offspring are available, which should considerably improve the quality of imputation. Such improvement would be expected for both methods, since a larger number of offspring would most likely result in a larger number of unambiguous cases.
Comparison with available software
Correlation between true and imputed genotypes from different imputation methods and programs
Method          LowLD_NoSel     LowLD_Sel       HighLD_NoSel    HighLD_Sel
Single_Step     0.76 ± 0.003    0.83 ± 0.038    0.88 ± 0.004    0.90 ± 0.013
Single_Step*    0.81 ± 0.003    0.86 ± 0.028    0.90 ± 0.003    0.93 ± 0.009
Two_Step        0.57 ± 0.008    0.74 ± 0.066    0.80 ± 0.006    0.85 ± 0.021
findhap.f90     0.52 ± 0.006    0.69 ± 0.065    0.74 ± 0.006    0.82 ± 0.030
AlphaImpute     0.83 ± 0.003    0.87 ± 0.024    0.86 ± 0.004    0.89 ± 0.010
Imputation accuracies from findhap.f90 were lower than accuracies from Single_Step* and Two_Step. The algorithm implemented in findhap.f90 is a combination of pedigree haplotyping and population haplotyping. Our results indicate that the amount of genotyping information available in the situation considered here (i.e., MGS, Sire and Offspring) seemed to be insufficient for the pedigree haplotyping algorithm to satisfactorily impute a completely un-genotyped Dam. Many other studies reporting performance results from findhap.f90 applied the program with the main purpose of imputing genotypes from low to high density chips [15, 35, 36]. In such cases, findhap.f90 can take more advantage of the population haplotyping algorithm because of the observed genotypes from the low density chip and may perform imputation with an accuracy greater than 0.95. To resemble an application with a small chip, we performed another series of imputation runs with findhap.f90, in which Dams had genotypes for 125 evenly spaced markers. Average imputation accuracies from findhap.f90 when moving from the sparse (125) to the dense (2000) set of markers were 0.96 (LowLD_NoSel), 0.98 (LowLD_Sel), 0.97 (HighLD_NoSel) and 0.98 (HighLD_Sel). These numbers are not comparable to the results in Table 3. They are rather used to illustrate the magnitude of accuracy expected when imputation is applied to move from low to high density chips, which also indicates a strong dependency of the performance of findhap.f90 on the number of unambiguously imputed loci.
Accuracies of imputation from AlphaImpute were higher than from the Two_Step method, especially in the LowLD scenarios. In some cases, although the complete genotypes cannot be inferred unambiguously, one can at least be sure about the presence of one of the alleles. This piece of information is neglected by the Two_Step method, since when moving from the first to the second step, the only information available for haplotype reconstruction are the unambiguous genotypes. An improvement in imputation accuracy from the Two_Step method could be achieved if known alleles were also taken into account in the haplotyping step. This information seems to be more efficiently used by the algorithm implemented in AlphaImpute, which is a combination of long-range phasing and haplotype library imputation. Results from AlphaImpute were similar to results obtained with Single_Step*. In the LowLD scenarios, AlphaImpute performed better and in the HighLD scenarios, results from Single_Step* were better. The strength of AlphaImpute is its flexibility, since it can handle different levels of relationship between the surrogate and the genotyped animals. The strength of the Single_Step* method is its simplicity and ease of programming, which enables very fast imputation. Since the difference in performance was smaller in the LowLD than in the HighLD scenarios and the intended application was for the specific situation considered here, Single_Step* was the method of choice to investigate the impact of imputation on accuracy of genomic predictions.
Impact on the accuracy of genomic breeding values
Table: Mean linkage disequilibrium (r2) within different inter-marker distances (kb) in the simulated populations used for the comparison of the accuracy of genomic predictions with and without imputation (table values not available in this text).
An increase in the accuracy of GEBV was observed when using TSA instead of TS, which demonstrates that enlarging a training set with imputed Dams represents an advantage. The extent of this advantage differed between the different population structures simulated. In the LowLD_NoSel scenario, the gain in accuracy, expressed as percentage of the accuracy with TS, ranged from 3.7% to 37.1%. The benefit of incorporating imputed Dams in the training set was overall larger for this scenario, despite the fact that with this scenario genotype imputation was performed with the poorest quality. In the other three scenarios, the maximum gains were 11.1% (LowLD_Sel), 15.3% (HighLD_NoSel) and 11.9% (HighLD_Sel), and the minimum gains were close to zero. Because imputation is not perfect, the increase in accuracy obtained with TSA was generally lower than what could be achieved by enlarging TS with another set of genotyped offspring. For each of the four scenarios, we compared the increase in accuracy obtained when: (1) enlarging TS by doubling the number of genotyped offspring; or (2) enlarging TS with imputed Dams. For example, in the LowLD_NoSel scenario with an h2 of 0.05, moving from a TS of 1800 to a TS of 3600 offspring gave a gain in accuracy of 32% (from 0.31 to 0.41). Adding the 2000 imputed Dams to a TS of 1800 offspring (i.e., a TSA of 3800 animals) gave a gain in accuracy of 23% (from 0.31 to 0.38), which is 72% of the gain in the first case and reflects the fact that the proportion of correctly imputed genotypes of Dams is lower than 1. On average, across all h2 and numbers of offspring, the gain in accuracy in the second case was 93% (LowLD_NoSel), 62% (LowLD_Sel), 78% (HighLD_NoSel) and 63% (HighLD_Sel) of the gain in accuracy obtained in the first case. The first case would require doubling the costs by genotyping another set of offspring, whereas in the second case, no additional costs for genotyping are needed. 
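The percentages in the LowLD_NoSel example follow from a simple relative-gain calculation, sketched below with the accuracies quoted in the text (0.31, 0.41 and 0.38). The function name is illustrative.

```python
def relative_gain(acc_before, acc_after):
    """Gain in accuracy expressed as a fraction of the starting accuracy."""
    return (acc_after - acc_before) / acc_before


# Case 1: TS doubled with genotyped offspring (0.31 to 0.41).
gain_genotyped = relative_gain(0.31, 0.41)
# Case 2: TS enlarged with imputed Dams (0.31 to 0.38).
gain_imputed = relative_gain(0.31, 0.38)
# Share of the case 1 gain that is recovered in case 2.
share = gain_imputed / gain_genotyped

print(f"{gain_genotyped:.1%} {gain_imputed:.1%} {share:.1%}")
```

With the accuracies rounded to two decimals, the two gains come out at about 32% and 23%, and their ratio at about 70%. The 72% quoted in the text is presumably based on less rounded values.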
If there is funding available for genotyping more animals, then increasing the size of the training set with genotyped animals should improve the accuracy of genomic predictions more. Different strategies can be used to genotype more animals, e.g. genotyping for a low density chip the Dams with very few loci for which imputation can be unambiguously made, as pointed out in the previous section. Nevertheless, according to our results, even if all available funding for genotyping has been spent, there is still room for an additional improvement in genomic predictions by enlarging TS with imputed Dams.
The magnitude of the gain in accuracy when moving from TS to TSA varied not only between scenarios but also for different values of h2 and numbers of offspring already available in TS. The effects of h2, number of offspring and simulated scenario on the difference between accuracies obtained with TS and TSA were all significant (P < 0.001). Pszczola et al.  added 1000 imputed bulls to a training set of 1000 genotyped bulls and did not find any significant increase in accuracy of genomic predictions. The authors attributed their finding to the low accuracy of imputation in their study (0.58). Nevertheless, Pszczola et al.  reported a trend of increasing difference in accuracy with decreasing h2, which is consistent with our results. The population of Pszczola et al.  was simulated to resemble a dairy cattle population with a considerably high level of LD (average r2 of 0.41 between adjacent markers, which were on average 0.13 cM apart). This level of LD is higher than that observed in our scenario with the highest LD (HighLD_Sel), in which the increase in accuracy of genomic predictions was overall the lowest in our study. This agrees with our indication that the impact of enlarging a reference population with imputed individuals in terms of accuracy of genomic prediction depends on the population structure under consideration.
Generally, the larger the accuracy already obtained with TS, the lower is the increase in accuracy achieved with TSA. Regression analyses of the percentage increase in accuracy obtained with TSA against the accuracy already obtained with TS across all h2 and numbers of offspring for the four scenarios were performed. Results fitted a negative linear relationship well, with coefficients of determination of 0.80, 0.88, 0.68 and 0.85 for scenarios LowLD_NoSel, LowLD_Sel, HighLD_NoSel and HighLD_Sel (Additional file 2: Figure S4). This pattern was not only observed when moving from TS to TSA, but also when moving from a smaller to a larger TS. This can also be seen in the shapes of the surfaces presented in Figure 3, in which the increase in accuracy resulting from an increase in either h2 or the number of offspring tends to reach a plateau.
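A regression of this kind, with its coefficient of determination, can be sketched in a few lines. The example inputs are hypothetical values chosen only to show the call and are not results from this study; the function name is also illustrative.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y on x.

    Returns (slope, intercept, r2), where r2 is the coefficient of
    determination of the simple linear regression.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


if __name__ == "__main__":
    # Hypothetical accuracies with TS and percentage gains with TSA.
    accuracy_ts = [0.30, 0.45, 0.60, 0.75]
    gain_percent = [30.0, 18.0, 9.0, 3.0]
    slope, intercept, r2 = fit_line(accuracy_ts, gain_percent)
    print(slope, intercept, r2)
```

A negative slope together with a high coefficient of determination corresponds to the pattern described above, in which the gain shrinks as the starting accuracy increases.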
Genotypes of a dam’s sire, one offspring and this offspring’s sire, as well as estimates of marker allele frequencies were used to impute genotypes of dams with an accuracy, i.e. the correlation between observed and imputed genotypes, ranging from 0.81 to 0.93. Accuracy of imputation was higher in populations with higher levels of LD and with distributions of allele frequencies containing a larger proportion of markers with more extreme allele frequencies.
Overall, inclusion of imputed dams in the training set increased the accuracy of genomic predictions by up to 37%. The impact of enlarging the training set with imputed dams on the accuracy of genomic predictions depends on the heritability of the trait, on the number of animals in the already available training set, and on the population structure.
Besides being useful for reducing costs of genotyping by imputing high-density panels on animals genotyped with low-density panels, imputation can also be used to achieve an extra increase in accuracy of genomic predictions by enlarging the training set with completely un-genotyped dams. This strategy is particularly useful for populations with low levels of LD, for genomic selection on traits with low h2, and for species or breeds for which the reference population size is limited.
The authors thank John Hickey for kindly providing the software AlphaImpute, Paul VanRaden for making findhap.f90 freely available online and two anonymous reviewers for fruitful comments and suggestions.
- Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.
- Sargolzaei M, Schenkel FS, Jansen GB, Schaeffer LR: Extent of linkage disequilibrium in Holstein cattle in North America. J Dairy Sci. 2008, 91: 2106-2117. 10.3168/jds.2007-0553.
- Pimentel ECG, Erbe M, König S, Simianer H: Genome partitioning of genetic variation for milk production and composition traits in Holstein cattle. Front Genet. 2011, 2: 19.
- Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, Mason BA, Goddard ME: Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci. 2012, 95: 4114-4129. 10.3168/jds.2011-5019.
- Ober U, Ayroles JF, Stone EA, Richards S, Zhu D, Gibbs RA, Stricker C, Gianola D, Schlather M, Mackay TFC, Simianer H: Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genet. 2012, 8: e1002685. 10.1371/journal.pgen.1002685.
- Farnir F, Coppieters W, Arranz JJ, Berzi P, Cambisano N, Grisart B, Karim L, Marcq F, Moreau L, Mni M, Nezer C, Simon P, Vanmanshoven P, Wagenaar D, Georges M: Extensive genome-wide linkage disequilibrium in cattle. Genome Res. 2000, 10: 220-227. 10.1101/gr.10.2.220.
- McRae AF, McEwan JC, Dodds KG, Wilson T, Crawford AM, Slate J: Linkage disequilibrium in domestic sheep. Genetics. 2002, 160: 1113-1122.
- Heifetz EM, Fulton JE, O’Sullivan N, Zhao H, Dekkers JCM, Soller M: Extent and consistency across generations of linkage disequilibrium in commercial layer chicken breeding populations. Genetics. 2005, 171: 1173-1181. 10.1534/genetics.105.040782.
- Amaral AJ, Megens HJ, Crooijmans RPMA, Heuven HCM, Groenen MAM: Linkage disequilibrium decay and haplotype block structure in the pig. Genetics. 2008, 179: 569-579. 10.1534/genetics.107.084277.
- Corbin LJ, Blott SC, Swinburne JE, Vaudin M, Bishop SC, Woolliams JA: Linkage disequilibrium and historical effective population size in the Thoroughbred horse. Anim Genet. 2010, 41 (Suppl. 2): 8-15.
- Druet T, Georges M: A hidden Markov model combining linkage and linkage disequilibrium information for haplotype reconstruction and quantitative trait locus fine mapping. Genetics. 2010, 184: 789-798. 10.1534/genetics.109.108431.
- Daetwyler HD, Wiggans GR, Hayes BJ, Woolliams JA, Goddard ME: Imputation of missing genotypes from sparse to high density using long-range phasing. Genetics. 2011, 189: 317-327. 10.1534/genetics.111.128082.
- Hickey JM, Kinghorn BP, Tier B, Wilson JF, Dunstan N, van der Werf JHJ: A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol. 2011, 43: 12. 10.1186/1297-9686-43-12.
- Sargolzaei M, Chesnais JP, Schenkel FS: FImpute - An efficient imputation algorithm for dairy cattle populations. J Dairy Sci. 2011, 94 (1): 421.
- VanRaden PM, O’Connell JR, Wiggans GR, Weigel KA: Genomic evaluations with many more genotypes. Genet Sel Evol. 2011, 43: 10. 10.1186/1297-9686-43-10.
- Hayes BJ, Bowman PJ, Daetwyler HD, Kijas JW, van der Werf JHJ: Accuracy of genotype imputation in sheep breeds. Anim Genet. 2012, 43: 72-80.
- Browning BL, Browning SR: A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009, 84: 210-223. 10.1016/j.ajhg.2009.01.005.
- Meuwissen THE, Goddard ME: The use of family relationships and linkage disequilibrium to impute phase and missing genotypes in up to whole-genome sequence density genotypic data. Genetics. 2010, 185: 1441-1449. 10.1534/genetics.110.113936.
- Verbyla KL, Hayes BJ, Bowman PJ, Goddard ME: Accuracy of genomic selection using stochastic search variable selection in Australian Holstein Friesian dairy cattle. Genet Res. 2009, 91: 307-311. 10.1017/S0016672309990243.
- Long N, Gianola D, Rosa GJ, Weigel KA: Dimension reduction and variable selection for genomic selection: application to predicting milk yield in Holsteins. J Anim Breed Genet. 2011, 128: 247-257. 10.1111/j.1439-0388.2011.00917.x.
- Weigel KA, de los Campos G, González-Recio O, Naya H, Wu XL, Long N, Rosa GJM, Gianola D: Predictive ability of direct genomic values for lifetime net merit of Holstein sires using selected subsets of single nucleotide polymorphism markers. J Dairy Sci. 2009, 92: 5248-5257. 10.3168/jds.2009-2092.
- Habier D, Fernando RL, Dekkers JCM: Genomic selection using low-density marker panels. Genetics. 2009, 182: 343-353. 10.1534/genetics.108.100289.
- VanRaden PM, Olson KM, Null DJ, Sargolzaei M, Winters M, van Kaam JBCHM: Reliability increases from combining 50,000- and 777,000-marker genotypes from four countries. Interbull Bull. 2012, 46: 75-79.
- VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, Schenkel FS: Invited review: reliability of genomic predictions for North American Holstein bulls. J Dairy Sci. 2009, 92: 16-24. 10.3168/jds.2008-1514.
- Weigel KA, de los Campos G, Vazquez AI, Rosa GJM, Gianola D, Van Tassell CP: Accuracy of direct genomic values derived from imputed single nucleotide polymorphism genotypes in Jersey cattle. J Dairy Sci. 2010, 93: 5423-5435. 10.3168/jds.2010-3149.
- Dassonneville R, Brøndum RF, Druet T, Fritz S, Guillaume F, Guldbrandtsen B, Lund MS, Ducrocq V, Su G: Effect of imputing markers from a low-density chip on the reliability of genomic breeding values in Holstein populations. J Dairy Sci. 2011, 94: 3679-3686. 10.3168/jds.2011-4299.
- Cleveland MA, Hickey JM, Kinghorn BP: Genotype imputation for the prediction of genomic breeding values in non-genotyped and low density genotyped individuals. BMC Proc. 2011, 5: S6.
- Pszczola M, Mulder HA, Calus MPL: Effect of enlarging the reference population with (un)genotyped animals on the accuracy of genomic selection in dairy cattle. J Dairy Sci. 2011, 94: 431-441. 10.3168/jds.2009-2840.
- Hickey JM, Kinghorn BP, Tier B, van der Werf JHJ, Cleveland MA: A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation. Genet Sel Evol. 2012, 44: 9. 10.1186/1297-9686-44-9.
- Berry DP, Bastiaansen JWM, Veerkamp RF, Wijga S, Wall E, Berglund B, Calus MPL: Genome-wide associations for fertility traits in Holstein–Friesian dairy cows using data from experimental research herds in four European countries. Animal. 2012, 6: 1206-1215. 10.1017/S1751731112000067.
- Scheet P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006, 78: 629-644. 10.1086/502802.
- Sargolzaei M, Schenkel FS: QMSim: a large-scale genome simulator for livestock. Bioinformatics. 2009, 25: 680-681. 10.1093/bioinformatics/btp045.
- Henderson CR: Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975, 31: 423-447. 10.2307/2529430.
- Weigel KA, Van Tassell CP, O’Connell JR, VanRaden PM, Wiggans GR: Prediction of unobserved single nucleotide polymorphism genotypes of Jersey cattle using reference panels and population-based imputation algorithms. J Dairy Sci. 2010, 93: 2229-2238. 10.3168/jds.2009-2849.
- Johnston J, Kistemaker G, Sullivan PG: Comparison of different imputation methods. Interbull Bull. 2011, 44: 25-33.
- Gredler B, Seefried FR, Schuler U, Bapst B, Schnyder U, Hickey JM: Imputation in Swiss cattle breeds. Interbull Bull. 2011, 44: 8-11.
- Hickey JM, Crossa J, Babu R, de los Campos G: Factors affecting the accuracy of genotype imputation in populations from several maize breeding programs. Crop Sci. 2012, 52: 654-663. 10.2135/cropsci2011.07.0358.
- Daetwyler HD, Pong-Wong R, Villanueva B, Woolliams JA: The impact of genetic architecture on genome-wide evaluation methods. Genetics. 2010, 185: 1021-1031. 10.1534/genetics.110.116855.
- Goddard ME: Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009, 136: 245-257. 10.1007/s10709-008-9308-0.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.