A hybrid method for the imputation of genomic data in livestock populations
 Roberto Antolín^{1},
 Carl Nettelblad^{2},
 Gregor Gorjanc^{1},
 Daniel Money^{1} and
 John M. Hickey^{1}Email author
DOI: 10.1186/s127110170300y
© The Author(s) 2017
Received: 5 September 2016
Accepted: 13 February 2017
Published: 3 March 2017
Abstract
Background
This paper describes a combined heuristic and hidden Markov model (HMM) method to accurately impute missing genotypes in livestock datasets. Genomic selection in breeding programs requires highdensity genotyping of many individuals, making algorithms that economically generate this information crucial. There are two common classes of imputation methods, heuristic methods and probabilistic methods, the latter being largely based on hidden Markov models. Heuristic methods are robust, but fail to impute markers in regions where the thresholds of heuristic rules are not met, or the pedigree is inconsistent. Hidden Markov models are probabilistic methods which typically do not require specific family structures or pedigree information, making them very flexible, but they are computationally expensive and, in some cases, less accurate.
Results
We implemented a new hybrid imputation method that combined heuristic and HMM methods, AlphaImpute and MaCH, and compared the computation time and imputation accuracy of the three methods. AlphaImpute was the fastest, followed by the hybrid method and then the HMM. The computation time of the hybrid method and the HMM increased linearly with the number of iterations used in the hidden Markov model, however, the computation time of the hybrid method increased almost linearly and that of the HMM quadratically with the number of template haplotypes. The hybrid method was the most accurate imputation method for lowdensity panels when pedigree information was missing, especially if minor allele frequency was also low. The accuracy of the hybrid method and the HMM increased with the number of template haplotypes. The imputation accuracy of all three methods increased with the marker density of the lowdensity panels. Excluding the pedigree information reduced imputation accuracy for the hybrid method and AlphaImpute. Finally, the imputation accuracy of the three methods decreased with decreasing minor allele frequency.
Conclusions
The hybrid heuristic and probabilistic imputation method is able to impute all markers for all individuals in a population, as the HMM. The hybrid method is usually more accurate and never significantly less accurate than a purely heuristic method or a purely probabilistic method and is faster than a standard probabilistic method.
Background
This paper describes a combined heuristic and hidden Markov model (HMM) method to accurately impute missing genotypes in livestock datasets. Methods for imputing genotypes are essential for modern livestock breeding because they help to facilitate genomic selection, which has become the dominant method for genetic evaluation of livestock. Imputation can costeffectively generate the highdensity genotypes of many individuals required for genomic selection [1, 2]. Typically, the genotyping strategies used in livestock breeding involve genotyping a small number of individuals with expensive highdensity marker panels and large numbers with cheaper lowdensity panels, then using imputation to infer the untyped highdensity markers in the individuals genotyped at lowdensity. Imputation methods work by identifying haplotypes shared between individuals. The methods used generally fall into two broad categories: (1) heuristic methods that are designed to identify and propagate linkage information about long haplotypes (e.g., >10 cM), which is typically shared between closely related individuals; and (2) probabilistic methods that are designed to identify and propagate linkage disequilibrium information about short haplotypes (e.g., <1 cM), which is typically shared between distantly related individuals.
Heuristic methods use the basic principles of inheritance and are fast and accurate in many of the circumstances that are common to livestock applications [3]. Heuristic methods make explicit use of pedigree information and make inferences from information on closely related individuals from large families and the large portions of the genome shared between pairs of related individuals. However, heuristic methods do not impute alleles if such data is lacking or unreliable. The AlphaImpute program [3] is an example that combines several heuristic methods, such as basic rules of Mendelian inheritance, longrange phasing, and haplotype library imputation algorithm [4]. Other examples that are based on heuristic methods include Findhap [5] and FImpute [6].
Probabilistic methods mainly use HMM approaches to model genotype variation along chromosomes and the sharing of genomic segments between nominally unrelated individuals. HMMbased imputation methods are computationally more demanding, slower, and inherently less accurate than heuristic methods when they do not take pedigree information into account. HMM methods commonly used in livestock applications were primarily developed for application in human populations where pedigree information is typically lacking and pedigree structures, such as small family sizes, are not wellsuited for exploiting heuristic algorithms.
HMM methods are used to describe the variation of an observable variable in a sequence, as a function of an underlying sequence of hidden variables that each have a set of \(K\) distinct states [7]. When HMM methods are applied to genotype imputation, the observable variable is a marker genotype, the sequence is a set of \(M\) markers along the chromosome, and the hidden variable represents the possible haplotypes that underlie the genotype. Given the number of markers, \(M\), and the number of hidden states, \(K\), the computational time of hidden Markov models scale as \(O\left( {M \times K^{2} } \right),\) which limits the effectiveness of genomic applications with large numbers of markers and many possible haplotypes.
Distinct HMM algorithms with different representation of hidden states and computational considerations have been developed to alleviate the computational burden when analysing dense genomic data, such as, PHASE [8]; fastPHASE [9]; Beagle [10]; SHAPEIT [11]; Impute2 [12]; MaCH [13]; MERLIN [14]; cnF2freq [15]. PHASE uses a Markov chain Monte Carlo algorithm to estimate the actual pair of hidden gametes of each individual as a mosaic of haplotypes given the observed genotypes and the underlying recombination rates [8]. PHASE is very accurate but computationally intractable for large datasets. fastPHASE, Beagle, SHAPEIT, Impute2, and MaCH are computationally tractable HMM methods for phasing and imputation. fastPHASE uses an expectation–maximisation approach and infers the most likely hidden states by clustering similar haplotypes [9]. Albeit faster, fastPHASE is still computationally expensive and its expectation–maximisation algorithm can get stuck in a local maximum. Beagle relies on a similar concept as fastPHASE, but clusters haplotypes locally [10]. SHAPEIT follows the HMM of PHASE but collapses all the haplotypes into a graph structure and uses this structure to divide the haplotypes into disjoint segments of J distinct haplotypes used as hidden states [11]. SHAPEIT samples pairs of haplotypes with a Markov chain Monte Carlo algorithm that is linear in the number of the distinct haplotypes, J, to reduce computational intensity even further. SHAPEIT is also able to integrate pedigree information for disjoint sets of duos and trios directly in the model, as well as adding a proofreading step based on the separate local duoHMM model [16]. Impute2 approximates PHASE but, instead of conditioning on all haplotypes of all individuals, it restricts the number of haplotypes to the effective population size to decrease computational intensity [12]. Impute2 selects haplotypes that are similar to the haplotypes of the individual being imputed as the hidden states, which can lead to local minima [17].
MaCH is close to PHASE in the use of a mosaic of haplotypes to explain the observed genotypes. However, MaCH uses a model parameterised by recombination and mutation rates to iteratively improve the phasing of each individual in a Markov chain Monte Carlo framework [13]. In this regard, MaCH is similar to Impute2, but the method selects a user specified number of template haplotypes at random instead of selecting those that are expected to be similar to haplotypes carried by the individual being imputed as in the case of Impute2. To reduce computational intensity, MaCH limits the template haplotypes, to a number specified by the user. MERLIN models the state space as the combination of haplotypes of all individuals that are included in the same pedigree [14]. This is successful for small nuclear families, but it is not feasible for livestock applications where the number of hidden states increases exponentially with the number of individuals. However, in the cases it can handle, the optimum solution achieved is the globally preferable solution based on the modelling assumptions, while most other HMM approaches only provide approximations. cnF2freq models the state of multiple individuals at once, but maintains several separate local pedigrees of a single individual and its immediate ancestors, making the problem tractable, but again introducing the risk of getting stuck in a local optimum [15].
Heuristic and HMM imputation methods have different strengths and weaknesses. Combining the two approaches in a single algorithm may improve performance. Errors and missing genotypes generated by heuristics can be resolved by exploiting the probabilistic nature of Markov models. For example, if one parent of an individual is known, genotyped, and can be fully phased using heuristics, while the other parent is unknown, not genotyped, or cannot be phased using heuristics, a heuristic method can be used to impute the alleles on the gamete inherited from the first parent, while a HMM can be used to impute the alleles on the gamete inherited from the other parent.
Heuristic methods can also be used to increase the computational efficiency of HMM methods [18]. The phased information from the heuristic approach, which can be obtained with a very limited computational burden, can be used to supply the HMM with an accurate, relevant and relatively small set of reference haplotypes. This set of reference haplotypes can reduce the computational demand of the HMM permitting it to handle large populations better. Similar ideas are used in minimac, which achieves very fast computation by using prephased data [19].
In this study, we present a combined method for genotype imputation that takes advantage of the accuracy and speed provided by heuristic methods and the robustness of HMM methods. In particular, the method improves the heuristic method of AlphaImpute [3] with the HMM implemented in MaCH [13], and has been released as the version v1.5 of AlphaImpute. Performance of the combined method using real and simulated data is compared with results obtained separately from heuristic and HMM methods.
Methods
The imputation method presented in this study is a combined heuristic and HMM method to impute genotypes of all individuals in a population for all markers. The method incorporates the heuristic imputation approach of AlphaImpute and the HMM method of MaCH. AlphaImpute implements the heuristic method explained in Hickey et al. [3] and MaCH implements the HMM explained in Li et al. [13]. All three methods are briefly described in the following sections.
Heuristic method of AlphaImpute
AlphaImpute combines: (1) basic rules of Mendelian inheritance; (2) segregation analysis; (3) longrange phasing; and (4) haplotype library imputation in order to phase and impute genotype data of all individuals in a population [3]. The program iterates across these four sets of actions multiple times to accumulate information and determine the haplotype that each individual carries at each position along the genome.
The basic rules of Mendelian inheritance and the segregation analysis are used in conjunction with all pedigree and genotype information to derive phase for as many alleles as possible under the assumption that each locus is inherited independently of its neighbours at this step.
Longrange phasing and haplotype library imputation, which are both implemented in the AlphaPhase software and described in full detail in Hickey et al. [4], are used to derive the haplotypes that are carried by the individuals that are genotyped at highdensity. Both longrange phasing and haplotype library imputation work by dividing the genome into genome regions, referred to as cores, and resolving the haplotypes within the cores for the individuals concerned. Cores of different lengths are used in several runs to phase each locus as part of overlapping cores and to facilitate the identification of phasing errors. These phasing steps generate a library of haplotypes for each core that are used later.
Missing alleles are then imputed by matching haplotypes that are obtained during the longrange phasing to alleles that are imputed and phased by the basic rules of Mendelian inheritance and by the segregation analysis. All haplotypes stored in the haplotype libraries are considered candidates of the true haplotype of the proband (i.e., the individual being phased) for each core of each phasing round. Alleles that are imputed and phased by the basic rules of Mendelian inheritance and by the segregation analysis are compared to corresponding alleles in each of the haplotypes in the library and haplotypes that are consistent with the alleles of a proband are retained as candidate haplotypes. This is repeated for each core. For a given marker position, individual alleles are imputed such that all remaining haplotypes across all of the cores that span this marker are in agreement. To impute from parental haplotypes this process is also repeated with a restriction that the haplotypes retained in the haplotype library comprise only those haplotypes that are carried by the parents. Libraries are updated with any new haplotype found. The matching process is iterated a defined number of times and at the end of each iteration each chromosome of each individual is traversed in each direction to detect recombination locations and to model the imputation of alleles in the regions of uncertainty that are adjacent to these recombinations as a weighted average of the two gametes carried by the relevant parent [20]. At the end of the final iteration, the segregation analysis is repeated and used to fill in alleles that remain unimputed.
Hidden Markov model algorithm of MaCH
In its diploid form, MaCH implements a HMM that characterises the unphased genotypes, \(\varvec{G},\) as a mosaic of pairs of haplotypes taken from a set of template haplotypes, \(\varvec{H}\) [13]. The mosaic of pairs of haplotypes represents the hidden sequence of states. At a locus, \(i = 1, \ldots ,M,\) a hidden state is represented as \(S_{i} = \left( {h_{i} ,k_{i} } \right),\) where h _{i} corresponds to the haplotype on the first gamete, and \(k_{i}\) corresponds to the haplotype on the second gamete. The total number of possible states is \(\left \varvec{H} \right^{2} ,\) corresponding to all the combinations of pairs of haplotypes, \(\varvec{S} = \{ \left( {h, k} \right) \left {h, k \in H\} } \right.,\) where \(h\) and \(k\) are two haplotypes of \(\varvec{H}.\) The objective is to deduce the hidden sequence of states that best fits the data, i.e., to estimate the best pair of haplotypes that explain the unphased observed genotypes. This information in turn enables imputation at the unobserved loci.
 1.
the prior probability is defined by assuming that every state is equally likely at the first marker,
 2.
the probability of transition from one state to another is given as a function of \(m  1\) crossover parameters, \(\theta_{i} ,\) which reflect the recombination rates, and
 3.
the probability of observing a genotype at each locus given a particular state is defined as a function of the m error parameters, \(\varepsilon_{i} ,\) which indicate the effect of genotype errors and mutations.
The updating process is repeated an arbitrary number of iterations over the whole population. Each iteration involves an update of the model parameters (mutation and recombination rates). The crossovers and the errors are set to 0.01 at first, and are reestimated at each iteration. For this purpose, the algorithm stores the number and position of recombinations, and the number of times the inferred genotype agrees with the observed one. The new estimates of the parameters improve the model and, thus, the haplotype sampling. The final pair of haplotypes for each individual is the consensus pair that minimises the total proportion of switches in haplotypes (i.e., switch error) when compared to the haplotypes sampled at each round.
MaCH also implements a haploid version of the model described above to improve the model when phase data is available. For phased individuals, the Baum’s forward–backward algorithm is used twice to independently sample the two haplotypes from the template haplotypes.
The computational complexity of analysing a single individual in MaCH scales as \(O\left( {M \times \left \varvec{H} \right^{2} } \right),\) where M is the number of markers and \(\left \varvec{H} \right\) is the number of template haplotypes. The complexity of the forward–backward algorithm is \(O\left( {M \times \left \varvec{S} \right^{2} } \right),\) where \(\left \varvec{S} \right\) is the number of hidden states. Therefore, the diploid HMM of the MaCH algorithm scales as \(O\left( {M \times \left \varvec{S} \right^{2} } \right) = O\left( {M \times \varvec{H}^{4} } \right),\) thus the number of template haplotypes is usually limited to reduce computational load. However, it is possible to further reduce the computational requirements of the algorithm to \(O\left( {M \times \left \varvec{H} \right^{2} } \right)\) by taking advantage of regular patterns within the transition probability matrix.
Motivation for an algorithm that combines AlphaImpute and a HMM
AlphaImpute is computationally feasible for large datasets and is highly accurate for most of the genome of most individuals in typical livestock populations, but it has some weaknesses. For example, an individual that is to have its highdensity genotypes imputed is genotyped at lowdensity, its sire and maternal grandsire are both genotyped at highdensity and there is sufficient highdensity information on their relatives to enable their genotype data to be completely and accurately phased. However, no genotype information is available on the dam and the maternal granddam. Thus, this data would enable all of the gametes the individual inherited from its sire to be accurately imputed, with the exception of the few small regions that are adjacent to recombination events. The imputation of the alleles on the gamete inherited from the individual’s dam is more complex. While the portion of this gamete that derives from the maternal grandsire would be imputed with a relatively high level of accuracy, the lack of information available to impute the portion of this gamete that derives from the maternal granddam makes its imputation more difficult. These situations are common in livestock populations, where genotyping strategies often involve only genotyping male ancestors at highdensity.
Another common situation is that the genotyping strategy may involve genotyping all parents of selection candidates at highdensity and the selection candidates themselves at lowdensity. The quality control checks of the genotype data may show that one of the parents of a selection candidate is incorrectly identified in the pedigree records and must be set to missing. Thus, AlphaImpute would be able to accurately impute the gamete that the selection candidate inherited from one of its parents, but would have limited ability to perform imputation of the gamete inherited from the parent that has been set to missing. Also an individual that is part of the population being imputed by AlphaImpute may have unknown pedigree information for both its sire and its dam, due to any number of reasons, while most of the rest of the population has complete or partially complete pedigree information. In all three of these examples, a HMM could impute the segments that AlphaImpute is not able to impute well.
HMM methods are computationally intensive and typically less accurate than AlphaImpute in circumstances in which the pedigree information and population structure allow AlphaImpute to perform well. Much of the computational requirements of a HMM, and imputation errors that result from it, derive from establishing the template haplotypes and estimating the recombination rates between markers. AlphaImpute can accurately resolve haplotypes with computational efficiency.
Based on the different strengths, the two methods could be used to augment each other in several ways. For instance, accurate haplotypes that are resolved with heuristic methods could be fed as template haplotypes to a HMM and thus be used to increase its imputation accuracy and computational efficiency. Template haplotypes could be chosen for a particular genome segment with or without regard to any available pedigree information. A haploid version of the HMM could be used for individuals for which AlphaImpute could perform imputation for one gamete but not the other or not for part of one of the gametes while a diploid version of the same model could be used for individuals for which AlphaImpute could not perform imputation for either gamete or for the same region of both gametes. In addition, in some situations the heuristic rules of AlphaImpute may be able to partially impute some of the alleles on a gamete or segment of a gamete. This would effectively increase marker density and could increase the accuracy of the subsequent imputation by the HMM.
Hybrid algorithm
The method we propose (hybrid method) combines the different components of AlphaImpute with the HMM that underlies MaCH into a single framework. The hybrid method begins by applying the imputation method of AlphaImpute: basic rules of Mendelian inheritance, segregation analysis, longrange phasing, and haplotype library imputation. This gives accurate imputation for all individuals for which enough information is available. It also gives a large haplotype library that includes some wholechromosome phased gametes. The HMM is then applied to impute the remaining missing alleles of any individual that are not imputed by AlphaImpute. The haplotype library is used to sample the template haplotypes used in the HMM. The haplotype update is run through several iterations so the parameters of the model converge. The solutions of the first iterations are disregarded as burnin. Allele dosages and the final imputation are calculated using the remaining iterations. Finally, haplotypes are constructed from the most frequent alleles for each locus, and allele probabilities are assessed as the average across runs.

Training mode 1 Use the highdensity genotypes in the diploid HMM to estimate the model parameters;

Training mode 2 Sample a set of template haplotypes from a haplotype library and use them in the haploid HMM to estimate the model parameters;

Training mode 3 Use both of the data types and versions of the HMM from training modes 1 and 2 jointly while ensuring that genomic data for any individual enters the training set only once, with the haplotype information taking precedence;

Imputation mode 1 Use the haploid HMM to impute segments of a gamete that have not been imputed by AlphaImpute;

Imputation mode 2 Use the diploid HMM to impute segments of both gametes that have not been imputed by AlphaImpute.
To determine which imputation mode should be used for a given segment of a gamete or pairs of individuals’ gametes, the proportions of alleles imputed for the individual by AlphaImpute are examined. When the number of imputed alleles of a gamete from AlphaImpute is above a userspecific threshold Imputation mode 1 is used. When the number of imputed alleles of both individuals’ gametes from AlphaImpute is above the userspecific threshold Imputation mode 2 is used.
The hybrid algorithm is designed to work for biallelic markers. Alleles are coded as 0, 1 or 9, where 9 is a missing allele; allele probabilities range from 0 to 1. Genotypes are determined as the sum of the allele codes, and allele dosages are assessed as the sum of the allele codes weighted by the allele probabilities. Therefore, genotypes are coded as 0, 1, 2 or 9, where 0 is the reference homozygote, 1 is the heterozygote, 2 is the alternative homozygote, and 9 is a missing genotype. Allele dosages range continuously from 0 to 2.
Datasets
Performance of the hybrid method was tested using both a real pig dataset [courtesy of The Pig Improvement Company (PIC)] and simulated data.
Real data
In the real pig dataset, genotype information was spread sparsely across multiple generations. The pedigree consisted of 6473 animals, including 3213 genotyped at highdensity with the Illumina PorcineSNP60 BeadChip. Some animals had multiple generations of highdensity genotyped ancestors available, while others had no, or very few, ancestors genotyped at highdensity. This population came from a single PIC breeding line and therefore, animals were moderately to highly related to each other as is typical in most livestock breeding programs.
The genotyped animals were divided into training and testing sets. The testing set comprised the 509 most recently born animals genotyped at highdensity. We used a single chromosome with 3129 qualitycontrolled highdensity markers. To explore the effect of different genotyping strategies, animals in the testing set had different number of lowdensity markers selected from highdensity, which were used to impute the remaining markers. We used 15, 30, 300, 600, or 2000 lowdensity markers on this chromosome, which were selected at random. These numbers are roughly equivalent to 300, 600, 6000, 12,000 and 40,000 markers per 20 chromosome genome. The training set comprised the remaining 2704 animals genotyped at highdensity for all markers.
To explore the effect of having genotype data for the ancestors, animals in the testing set were grouped into six categories. These categories represented patterns of relationship between the animals in the population and their most recent highdensity ancestors. The categories were: both parents genotyped (Both); sire and maternal grandsire genotyped (SireMGS); dam and paternal grandsire genotyped (DamPGS); sire genotyped (Sire); dam genotyped (Dam); and other relatives genotyped (Other).
Simulated data
Two sets of data were also simulated. For the first set, the pedigree of the real pig dataset was used and genotype data were simulated for different high and lowdensity panels. For the second set, a fivegeneration pedigree was simulated by mating 25 sires with 500 dams to produce 1000 progeny per generation. Genotype data were simulated for high and lowdensity panels. To explore the effect of imputation in the absence of pedigree data, another pedigree was created by randomly removing the sire and dam links for 500 individuals in the last generation of the simulated pedigree. These individuals without pedigree information were treated as unrelated individuals and the remaining pedigree information was used for imputing other animals and for resolving haplotypes of the training set. The parent’s genotypes were retained in the training set.
 1.
Generate wholegenome sequence data,
 2.
Generate the marker genotypes, and
 3.
Mask the genotype information for markers that are not in the lowdensity panels.
Chromosomes of individuals in the first generation were sampled from the 1000 simulated base haplotypes and those in the following generations were sampled from the chromosomes of their parents involving recombination. Crossovers occurred with 1% probability per cM and were uniformly distributed along the chromosomes.
For each chromosome, highdensity panels were created by randomly selecting 2000 (H2k) and 10,000 (H10k) of the segregating sites. These numbers are roughly equivalent to 60,000 and 300,000 markers per 30chromosome genome.
Description of the simulated marker panels
Panel code  Panel design  Number of markers per chromosome  Number of markers across the genome 

H10k  Highdensity  10,000  300,000 
H2k  Highdensity  2000  60,000 
L2k  80%  2000  60,000 
L600^{a}  94/70%  600  18,000 
L300^{a}  97/85%  300  9000 
L30^{a}  99.7/98.3%  30  900 
L15^{a}  99.8/99.2%  15  450 
Comparison
The hybrid method was compared to the AlphaImpute and MaCH imputation methods using simulated and real data. The comparison was made in terms of computation time and imputation accuracy for different imputation strategies. Final results for the comparison of the simulated data are presented as the mean of five replicates.
The computation time was measured as the CPU time that the three methods used to run the simulated data. The CPU time is the total amount of time the CPU spent executing instructions. If the software were run in serial mode, the CPU time would be comparable to the real time the software takes to run. However, AlphaImpute and the hybrid method allow parallelization of some calculations. For instance, the phasing runs of the longrange phasing step of AlphaImpute and the hybrid method can be run simultaneously. Also the updating process of an individual’s haplotypes of the hybrid method is independent for each animal and can be run in parallel.
Number of animals per category
Category  Count 

Both  46 
SireMGS  63 
DamPGS  21 
Sire  36 
Dam  19 
Other  324 
Total  509 
The imputation accuracy was also computed as the average of markerwise Pearson correlations between the true and imputed allele dosages. Averages were computed among genotypes categorized into intervals of minor allele frequencies: [0.0, 0.025], [0.025, 0.05], [0.05, 0.075], [0.075, 0.10], [0.10, 0.15], [0.15, 0.20], [0.20, 0.25], [0.25, 0.30], [0.30, 0.35], [0.35, 0.40], [0.40, 0.45], and [0.45, 0.50].
For the simulated data, several imputation strategies using different combinations of the number of the template haplotypes and the number of iterations of the HMM parameters were set. The number of template haplotypes ranged from 100 to 300, with increments of 50 haplotypes. The number of iterations ranged from 20 to 50 with increments of 10, with 5 burnin iterations.
Other parameters for both the heuristic and the HMM were set by default and are described in the following section.
Parameter settings
For the heuristic methods, a default set of 10 core lengths ranging from 500 to 9000 highdensity markers was used for the various iterations of the longrange phasing and haplotype library imputation process. The set of cores was chosen to minimize both the computational intensity of the phasing processes and to maximize the number of times an allele was phased as a part of cores spanning different markers. The number of processors available was set to 20 in order to run simultaneously all the phasing runs of the longrange phasing step of the hybrid method and AlphaImpute. The number 20 comes from two times for each of the 10 core lengths, as described in the original publication of AlphaImpute [3]. The number of iterations of the heuristic rules was set by default to 5 for both the hybrid method and AlphaImpute.
For the HMM, the thresholds that control the training and imputation modes of the hybrid method and the number of processors available were set by default. A threshold of 50% was chosen for the training mode. Thus, if more than 50% of the individuals were phased by the previous heuristic step, their phased gametes were used to populate the template haplotypes in the HMM. An individual is considered to be phased if 99% or more of its markers have been phased. A threshold of 90% was chosen for the imputation mode. Thus, if more than 90% of the markers of an individual were imputed by the previous heuristic step, then the haploid version of the HMM was used for that individual, otherwise the diploid version was used. The number of processors available for the HMM was set to 8. This is arbitrary and depends on the number of processors that are available. The number of burnin iterations was set to 5.
Results
We compared the computation time and imputation accuracy, of the hybrid method, AlphaImpute and MaCH. AlphaImpute had the fastest computation time, followed by the hybrid method and then MaCH. The hybrid method was the most accurate and its accuracy increased with the number of template haplotypes and with the marker density of the lowdensity panel. Removing pedigree information reduced imputation accuracy for the hybrid method and AlphaImpute, but the hybrid method remained the most accurate. The imputation accuracy of all three methods decreased with the minor allele frequency and the hybrid method was the most accurate across minor allele frequencies.
Computation time
AlphaImpute always required the same CPU time under the parameter settings considered, whereas the CPU time required by the two HMM increased with the number of template haplotypes and with the number of iterations. AlphaImpute was the fastest, and the hybrid method was faster than MaCH regardless of the number of template haplotypes or the number of iterations.
Computation times were measured using an Intel Xeon E52630 v3 (2.4 GHz) 16cores processor based cluster running on a 64bit Linux (Scientific Linux 7).
Imputation accuracy
Simulated data
 1.
When pedigree information was fully available, the hybrid method was comparable to AlphaImpute and better than MaCH.
 2.
When pedigree information was not available, the hybrid method was about twice as accurate as AlphaImpute and MaCH for the lowdensity panels.
 3.
When pedigree information was not available, the imputation accuracy of the hybrid method and MaCH increased with the number of template haplotypes.
 4.
The imputation accuracy of the hybrid method increased with the marker density of the lowdensity panels and was higher than the accuracy of AlphaImpute and MaCH across all the lowdensity panels.
 5.
The imputation accuracy of the hybrid method was the highest and remained stable for six categories of individuals depending on which of their immediate ancestor were genotyped at highdensity.
Figure 2 also shows that without pedigree information, the imputation accuracy of AlphaImpute and MaCH was about 55% (H10k) and 47% (H2k) worse than that of the hybrid method, which increased with the number of template haplotypes. When pedigree information was not used, the imputation accuracy of the hybrid method ranged between 0.50 and 0.65, while imputation accuracy of MaCH ranged from 0.20 to 0.35, depending on the number of template haplotypes. The imputation accuracy of AlphaImpute is independent of the number of template haplotypes and its imputation accuracy was 0.31 and 0.34 for imputing from L30 to H2k and H10k highdensity panels. For both highdensity panels, the accuracies of AlphaImpute and MaCH were comparable if MaCH used more than 250 (H10k) and more than 200 (H2k) template haplotypes and were about half the accuracy of the hybrid method (>0.50).
Figure 2 shows that the imputation accuracies for imputing to the H10k highdensity panel were always slightly less accurate than imputation to the L2k panel. For this reason, we consider only the H10k highdensity panel hereinafter.
Next, we compared the imputation accuracies of the hybrid method and MaCH using an imputation of 200 template haplotypes over 20 iterations, which represents a good compromise between time and accuracy (see Additional file 2: Table S1).
Reducing the amount of genotype information available on immediate ancestors affected the imputation accuracy. Panels (a)–(f) of Fig. 4 show imputation accuracies of all the three methods for the categories where both parents (Both), sire and maternal grandsire (SireMGS), dam and paternal grandsire (DamPGS), sire only (Sire), dam only (Dam), and other relatives (Other) were genotyped at highdensity, respectively. The imputation accuracy of the hybrid and AlphaImpute was higher than 0.90 across all the categories for the two lowest lowdensity panels. The imputation accuracy of MaCH decreased to less than 0.73 (L15) when at least one of the parents was genotyped at highdensity and dropped to 0.33 (L15) when none of the parents were genotyped at highdensity (Other).
Real data
This section shows a comparison of the imputation accuracies of all three methods when real data was used (see Additional file 4: Table S2). For the comparison of the hybrid method and MaCH, we considered 200 template haplotypes and 20 estimation iterations as parameters of the HMM. The comparison provided two main results regarding the lowdensity panels and the genotype information available on immediate ancestors.
Reducing the amount of genotype information of both parents decreased the imputation accuracy. Panels (a)–(f) of Fig. 5 show imputation accuracies of all three methods for the categories where both parents (Both), sire and maternal grandsire (SireMGS), dam and paternal grandsire (DamPGS), sire only (Sire), dam only (Dam), and other relatives (Other) were genotyped at highdensity, respectively. The imputation accuracies of the three methods were above 0.95 for the highest lowdensity panels and slightly decreased when none of the parents was genotyped at highdensity (Other). For lowdensity panels with the lowest marker density, the imputation accuracies decreased from ~0.8 across all categories to ~0.6 for the hybrid method and AlphaImpute, and to less than 0.4 for MaCH when none of the parents were genotyped at highdensity (Other).
Minor allele frequency
 1.
the imputation accuracy of the three methods increased with increases in the minor allele frequency,
 2.
the imputation accuracy of the three methods increased as the genotype information of the lowdensity panels increased, and
 3.
the hybrid method always had the greatest accuracy when pedigree information was missing.
Imputation accuracy in relation to minor allele frequency
The hybrid method and AlphaImpute were the most accurate and their imputation accuracy remained above 0.94 across all minor allele frequencies when the pedigree was available. Figure 6 also shows that when the pedigree information was removed, the imputation accuracy of the three methods decreased with decreases in the minor allele frequency. In this case, the hybrid method was the most accurate and its accuracy increased gradually as the minor allele frequency increased. Its accuracy was always above 0.60 except for minor allele frequencies that were lower than 0.05 for which the accuracy dropped below 0.50. In the absence of pedigree information, AlphaImpute was more accurate than MaCH except for rare alleles with minor allele frequencies higher than 0.1. However, even at the highest minor allele frequencies MaCH never exceeded the accuracy attained by the hybrid method at the lowest minor allele frequencies.
For the lowest marker density of the lowdensity panels (L15 and L30), the imputation accuracy of all three methods was substantially lower and more sensitive to allele frequency. The hybrid method was the most accurate for all values of minor allele frequency. In the absence of pedigree information, AlphaImpute was more accurate than MaCH for higher values of minor allele frequency, and slightly worse for very low values.
Discussion
 1.
What are the advantages of using the hybrid method versus using a faster and nearly as accurate imputation method like AlphaImpute?
 2.
How does the computation time of the hybrid method benefit from the combination of heuristic and probabilistic approaches?
 3.
What is the best way to measure imputation accuracy of genotypes with rare alleles?
 4.
How does the hybrid method benefit from the combination of heuristic and probabilistic approaches to impute genotypes with rare alleles?
Imputation methods usually comprise a phasing step that resolves the haplotypes of individuals genotyped at highdensity, and an imputation step that identifies which combinations of these haplotypes are carried by the individuals genotyped at lowdensity. The hybrid method creates a reference set of extremely accurate haplotypes by applying a longrange phasing and haplotype library heuristic imputation algorithm, AlphaPhase [4]. The advantage of AlphaPhase is that the phasing algorithm can use the genotype information of surrogate parents (i.e., individuals that share a haplotype with the proband and that do not have any opposing homozygous genotypes with the proband). This makes it unnecessary to genotype multiple generations of ancestors at highdensity in order to phase genotypes. The hybrid method then uses these wellphased haplotypes to impute the missing genotypes in a combined heuristic and probabilistic approach.
In the imputation step, the hybrid method accurately imputes genotypes by identifying the exact pair of haplotypes for single markers or groups of consecutive markers with basic heuristic pedigreebased phasing rules [3]. However, the heuristic rules are not sufficient to model the recombination of those haplotypes. Failures are particularly pronounced for individuals that are genotyped at very lowdensity because it is very likely that recombination occurs at markers for which genotype information cannot be inferred, leading to haplotype switch errors. To avoid haplotype switch errors, the hybrid method uses the HMM of MaCH [13] to estimate the recombination rates between markers and the most likely position for recombination in a given haplotype.
The accurate prephasing helps to reduce the computation time of the hybrid method without loss in imputation accuracy by decreasing the number of iterations and the number of template haplotypes. This makes the hybrid method faster than similar HMM methods such as MaCH. Unlike MaCH, the hybrid method takes advantage of the pedigree information to prephase and unambiguously impute genotypes. Prephasing genotype data provides very accurate haplotypes to be used by the HMM. If the number of wellphased haplotypes is high, the faster haploid version of the HMM can be used in most cases instead of the slower diploid version. In fact, our results show how the computation time of the hybrid method scales almost linearly with the number of template haplotypes, whereas the computation time of MaCH scales quadratically. Moreover, seeding a probabilistic method with wellphased haplotypes means that the model parameters are estimated with fewer iterations. Also, the hybrid method is more robust than MaCH in terms of the number of template haplotypes especially when pedigree information is available (see Additional file 7: Figure S5, Additional file 8: Figure S6). By speeding up the HMM via integration into the hybrid method, it can be applied to larger datasets and its own inherent accuracy can be higher since a user can use more template haplotypes for a fixed amount of computation time. More template haplotypes make the HMM more accurate.
Correct quantification of imputation accuracy of genotypes with rare alleles is especially important for sequence data for which there are many alleles with low frequencies, and the markerwise correlation between the true genotypes and allele dosages is the best way to measure this imputation accuracy. Imputation accuracy computed in this way measures how much is gained by a sophisticated imputation method in contrast to naïve procedures of imputing two times the allele frequency or the most frequent genotype. The correlation between true genotypes and genotype dosages is a better estimator of imputation accuracy than imputation error rates, particularly for the imputation of genotypes with very rare alleles (<0.1). Imputation error rate, computed as the percentage of genotypes that are imputed incorrectly is strongly affected by the minor allele frequency [24, 25]. When the minor allele frequency is very low, missing genotypes are almost certain to be homozygous for the common allele and naïve imputation methods yield an imputation error rate of almost 0%. The prior uncertainty increases with minor allele frequency, and thus imputation error rate increases as the minor allele frequency increases [25]. In contrast, the correlation coefficient between true genotypes and the allele dosages is an unbiased estimator of the imputation accuracy [24]. The Pearson correlation coefficient (used to calculate imputation accuracy in this paper) assumes that the two variables are normally distributed and requires that the mean and variance be standardized for both true genotypes and genotype dosages in order to calculate the correlation across individuals for a single locus. Thus, true genotypes and dosages were normalized by the mean and standard deviation of the true values. As a consequence, loci with a low minor allele frequency, for which genotypes are difficult to impute, are given more importance.
Markerwise imputation accuracies of the hybrid method show that the combination of heuristic methods with a HMM helps to impute genotypes with rare alleles. Our results demonstrate that probabilistic methods based on linkage disequilibrium such as the HMM in MaCH are less accurate than heuristic methods based on linkage such as AlphaImpute, which agrees with previous studies [13, 24, 25]. However, the hybrid method was more accurate than both AlphaImpute and MaCH across minor allele frequencies and across genotype densities especially when pedigree information was missing (see Additional file 5: Figure S3). A possible explanation for this is that rare alleles become more frequent in the template haplotypes used by the HMM because heuristic methods are able to impute some of these rare alleles without ambiguity.
Conclusions
The hybrid heuristic and probabilistic imputation method proposed in this paper imputes accurate genotypes quickly for large populations and large numbers of markers even when limited pedigree information is available. It is usually more accurate and never significantly less accurate than a purely heuristic method and is faster than a standard probabilistic method.
Declarations
Authors’ contributions
JMH and RA conceived the original ideas, designed the algorithm, and wrote the first version of the program with input from CN, GG and DM. RA performed the analysis and wrote the first draft of the paper. All authors read and approved the final manuscript.
Acknowledgements
The authors acknowledge the financial support from the BBSRC ISPG to The Roslin Institute “BB/J004235/1”, from Genus PLC and from Grant Ns. “BB/M009254/1”, “BB/L020726/1”, “BB/N004736/1”, “BB/N004728/1”, “BB/L020467/1”, “BB/N006178/1” and Medical Research Council (MRC) Grant No. “MR/M000370/1”. This work has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF) (http://www.ecdf.ed.ac.uk/). The authors thank Dr. Susan Cleveland (WI, USA) and Dr. Andrew Derrington (Scotland, UK) for assistance in refining the manuscript.
Competing interests
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Goddard ME. The use of high density genotyping in animal health. In: Pinard MH, Gay C, Pastoret PP, Dodet B, editors. Developments in biologicals. Basel: KARGER; 2008. p. 383–9. http://www.karger.com/doi/10.1159/000317189. Accessed 1 Sept 2016.
 Habier D, Fernando RL, Dekkers JCM. Genomic selection using lowdensity marker panels. Genetics. 2009;182:343–53.View ArticlePubMedPubMed CentralGoogle Scholar
 Hickey JM, Kinghorn BP, Tier B, van der Werf JH, Cleveland MA. A phasing and imputation method for pedigreed populations that results in a singlestage genomic evaluation. Genet Sel Evol. 2012;44:9.View ArticlePubMedPubMed CentralGoogle Scholar
 Hickey JM, Kinghorn BP, Tier B, Wilson JF, Dunstan N, van der Werf JH. A combined longrange phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol. 2011;43:12.View ArticlePubMedPubMed CentralGoogle Scholar
 VanRaden PM, Null DJ, Sargolzaei M, Wiggans GR, Tooker ME, Cole JB, et al. Genomic imputation and evaluation using highdensity Holstein genotypes. J Dairy Sci. 2013;96:668–78.View ArticlePubMedGoogle Scholar
 Sargolzaei M, Chesnais JP, Schenkel FS. A new approach for efficient genotype imputation using information from relatives. BMC Genomics. 2014;15:478.View ArticlePubMedPubMed CentralGoogle Scholar
 Rabiner L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989;77:257–86.View ArticleGoogle Scholar
 Stephens M, Smith NJ, Donnelly P. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet. 2001;68:978–89.View ArticlePubMedPubMed CentralGoogle Scholar
 Scheet P, Stephens M. A fast and flexible statistical model for largescale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78:629–44.View ArticlePubMedPubMed CentralGoogle Scholar
 Browning SR, Browning BL. Rapid and accurate haplotype phasing and missingdata inference for wholegenome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007;81:1084–97.View ArticlePubMedPubMed CentralGoogle Scholar
 Delaneau O, Marchini J, Zagury JF. A linear complexity phasing method for thousands of genomes. Nat Methods. 2012;9:179–81.View ArticleGoogle Scholar
 Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genomewide association studies. PLoS Genet. 2009;5:e1000529.View ArticlePubMedPubMed CentralGoogle Scholar
 Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34:816–34.View ArticlePubMedPubMed CentralGoogle Scholar
 Abecasis GR, Cherny SS, Cookson WO, Cardon LR. Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002;30:97–101.View ArticlePubMedGoogle Scholar
 Nettelblad C, Holmgren S, Crooks L, Carlborg Ö. cnF2freq: efficient determination of genotype and haplotype probabilities in outbred populations using Markov models. In: Rajasekaran S, editor. Bioinformatics and computational biology. Berlin: Springer; 2009. p. 307–19. doi:10.1007/9783642007279_29.View ArticleGoogle Scholar
 O’Connell J, Gurdasani D, Delaneau O, Pirastu N, Ulivi S, Cocca M, et al. A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet. 2014;10:e1004234.View ArticlePubMedPubMed CentralGoogle Scholar
 Nettelblad C. Breakdown of methods for phasing and imputation in the presence of double genotype sharing. PLoS One. 2013;8:e60354.View ArticlePubMedPubMed CentralGoogle Scholar
 Druet T, Georges M. A hidden Markov model combining linkage and linkage disequilibrium information for haplotype reconstruction and quantitative trait locus fine mapping. Genetics. 2010;184:789–98.View ArticlePubMedPubMed CentralGoogle Scholar
 Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and accurate genotype imputation in genomewide association studies through prephasing. Nat Genet. 2012;44:955–9.View ArticlePubMedPubMed CentralGoogle Scholar
 Hickey JM, Kranis A. Extending longrange phasing and haplotype library imputation methods to impute genotypes on sex chromosomes. Genet Sel Evol. 2013;45:10.View ArticlePubMedPubMed CentralGoogle Scholar
 Chen GK, Marjoram P, Wall JD. Fast and flexible simulation of DNA sequence data. Genome Res. 2009;19:136–42.View ArticlePubMedPubMed CentralGoogle Scholar
 Hickey JM, Gorjanc G. Simulated data for genomic selection and genomewide association studies using a combination of coalescent and gene drop methods. G3 Genes Genomes Genet. 2012;2:425–7.Google Scholar
 VillaAngulo R, Matukumalli LK, Gill CA, Choi J, Van Tassell CP, Grefenstette JJ. Highresolution haplotype block structure in the cattle genome. BMC Genet. 2009;10:19.View ArticlePubMedPubMed CentralGoogle Scholar
 Calus MPL, Bouwman AC, Hickey JM, Veerkamp RF, Mulder HA. Evaluation of measures of correctness of genotype imputation in the context of genomic prediction: a review of livestock applications. Animal. 2014;8:1743–53.View ArticlePubMedGoogle Scholar
 Hickey JM, Crossa J, Babu R, de los Campos G. Factors Affecting the accuracy of genotype imputation in populations from several maize breeding programs. Crop Sci. 2012;52:654–63.View ArticleGoogle Scholar