There are three primary steps in AlphaImpute: (1) segregation analysis to calculate allele probabilities for each locus of each animal in the pedigree; (2) long-range phasing and haplotype library imputation (LRPHLI) to phase all high-density genotyped individuals and create a haplotype library for each genomic region of the dataset; and (3) imputation of missing alleles by matching the allelic probabilities to haplotypes in the haplotype library.
This algorithm is designed to work for biallelic SNP. SNP genotypes are coded as 0, 1, 2, or 3, where 0 is one homozygote, 1 is the heterozygote, 2 is the alternative homozygote, and 3 is a missing genotype. SNP alleles are coded as 0 or 1.
Step 1. Using the algorithm of Kerr and Kinghorn (1996), calculate allele probabilities for each locus of each individual in the pedigree, using all pedigree and genotype information (both high and low density).
Step 2. Using the LRPHLI algorithm of Hickey et al. (2011) as implemented in AlphaPhase1.1, phase the individuals genotyped at high density a number of times and place the identified haplotypes in a library. LRPHLI divides chromosomes into cores of specified length (e.g. 100 SNP). By running LRPHLI a number of times, overlaps between cores are created and each locus is phased as part of different cores. This facilitates the identification of phasing error.
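To illustrate how repeated runs can place the same locus in different cores, the following minimal sketch tiles a chromosome into cores and shifts the core boundaries between runs. The function name `core_boundaries` and the `offset` parameter are hypothetical; the text does not state how AlphaPhase1.1 varies the cores between runs, so the shifted start is an assumption made purely for illustration.

```python
def core_boundaries(n_snp, core_length, offset=0):
    """Tile loci 0..n_snp-1 into cores; returns (start, end) pairs, end exclusive.

    `offset` is a hypothetical parameter that shifts the cut points so that
    a second run divides the chromosome at different positions.
    """
    if core_length <= 0 or not 0 <= offset < core_length:
        raise ValueError("need core_length > 0 and 0 <= offset < core_length")
    # Collect every cut point, always including both chromosome ends.
    cuts = sorted({0, n_snp, *range(offset, n_snp, core_length)})
    return list(zip(cuts[:-1], cuts[1:]))


run_1 = core_boundaries(250, 100)
run_2 = core_boundaries(250, 100, offset=50)
```

With these two runs, locus 120 falls in the core starting at 100 in the first run and in the core starting at 50 in the second, so it is phased in two different contexts.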
Step 3. Impute missing alleles by matching alleles imputed at Step 1 to haplotypes phased at Step 2. This involves several sub-steps, which are divided into major and minor sub-steps. The major sub-steps are run in sequence, and after each major sub-step all of the minor sub-steps are run in sequence. The minor sub-steps are described first, followed by the major sub-steps.
Minor sub-step 1.
Parent homozygous fill in. Fill in the allele of an offspring of a parent that has both its alleles filled in and has a resulting genotype that is homozygous.
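A minimal sketch of this rule is given here, under an assumed data layout that is not taken from AlphaImpute itself: `phase` maps each individual to a pair of allele lists (paternal gamete first, maternal gamete second), `pedigree` maps each individual to a (sire, dam) pair, and a missing allele is represented by `None`.

```python
def parent_homozygous_fill_in(phase, pedigree):
    """Fill offspring alleles at loci where the transmitting parent is phased homozygous.

    phase:    {individual: [paternal_alleles, maternal_alleles]} (assumed layout)
    pedigree: {individual: (sire, dam)}; an unknown parent is None
    """
    for individual, parents in pedigree.items():
        # gamete 0 is inherited from the sire, gamete 1 from the dam
        for gamete, parent in enumerate(parents):
            if parent is None or parent not in phase:
                continue
            parent_pat, parent_mat = phase[parent]
            for locus, (a, b) in enumerate(zip(parent_pat, parent_mat)):
                homozygous = a is not None and a == b
                if homozygous and phase[individual][gamete][locus] is None:
                    phase[individual][gamete][locus] = a
    return phase


phase = {
    "sire": [[0, 1, None], [0, 0, 1]],
    "calf": [[None, None, None], [None, None, None]],
}
parent_homozygous_fill_in(phase, {"calf": ("sire", None)})
```

In this example only the first locus of the calf's paternal gamete is filled, because that is the only locus where both of the sire's alleles are known and equal.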
Minor sub-step 2.
Phase complement. If the genotype at a locus for an individual is known and one of its alleles has been determined, then impute the missing allele as the complement of the genotype and the known phased allele.
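Under the genotype coding given above (0, 1, 2, with 3 as missing), the complement is simply the genotype minus the known allele. The sketch below assumes missing alleles are represented by `None`; the function name and the choice to leave inconsistent cases untouched are illustrative assumptions.

```python
MISSING_GENOTYPE = 3  # genotype code for "missing", as stated in the text


def phase_complement(genotype, paternal, maternal):
    """Return (paternal, maternal) with a single missing allele filled in if possible."""
    exactly_one_known = (paternal is None) != (maternal is None)
    if genotype == MISSING_GENOTYPE or not exactly_one_known:
        return paternal, maternal
    known = paternal if paternal is not None else maternal
    other = genotype - known
    if other not in (0, 1):
        # Genotype and known allele disagree; leave the allele missing (assumed behaviour).
        return paternal, maternal
    return (paternal, other) if maternal is None else (other, maternal)
```

For example, `phase_complement(1, 0, None)` gives `(0, 1)`: a heterozygote with a paternal 0 must carry a maternal 1.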
Minor sub-step 3.
Impute parents from progeny complement. If one of the parental alleles is known and the other is missing, then fill in the missing allele in the parent if at least one of its offspring is known to carry an allele that does not match the known allele in the parent (e.g. if a sire has a 0 as one of its alleles and the other allele is missing, but one of its offspring carries a 1 in the gamete received from the sire, then the sire’s missing allele must be 1).
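The rule can be sketched for a single locus as follows. The function name and argument layout are hypothetical: `parent_alleles` is the parent's pair of alleles at the locus, and `offspring_gamete_alleles` lists the allele each offspring carries on the gamete it received from this parent, with `None` for missing alleles.

```python
def impute_parent_from_progeny(parent_alleles, offspring_gamete_alleles):
    """Fill a parent's single missing allele from a non-matching offspring allele."""
    a, b = parent_alleles
    if (a is None) == (b is None):
        return parent_alleles  # both known or both missing: nothing to do
    known = a if a is not None else b
    for allele in offspring_gamete_alleles:
        if allele is not None and allele != known:
            # The offspring cannot have inherited this allele from the known
            # parental allele, so it must come from the missing one.
            return (a, allele) if b is None else (allele, b)
    return parent_alleles
```

Offspring that carry the same allele as the known parental allele are uninformative, so a sire with alleles `(0, None)` stays unchanged until an offspring carrying a 1 on its paternal gamete is seen.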
Minor sub-step 4.
Make genotype. Any individual that has a missing genotype but both alleles known has its genotype filled in as the sum of the two alleles.
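Because alleles are coded 0 or 1 and genotypes 0, 1, or 2, the sum of the two alleles is itself a valid genotype code. A minimal sketch, assuming `None` marks a missing allele (the function name is illustrative):

```python
def make_genotype(genotype, paternal, maternal):
    """Replace a missing genotype (code 3) with the sum of two known alleles."""
    if genotype == 3 and paternal is not None and maternal is not None:
        return paternal + maternal
    return genotype
```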
Major sub-step 1.
Convert allele probabilities to phase. An allele is imputed as 0 or 1 when the probability of it being that allele is greater than 0.99.
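As a small illustration, suppose the segregation analysis yields, for each allele, the probability that it is a 1 (the name `prob_one` is an assumption). The allele is then called only when one of the two outcomes is nearly certain:

```python
def allele_from_probability(prob_one, threshold=0.99):
    """Call an allele as 0 or 1 only if its probability exceeds the threshold."""
    if prob_one > threshold:
        return 1
    if 1.0 - prob_one > threshold:
        return 0
    return None  # not certain enough; leave the allele missing
```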
Major sub-step 2.
Fill in base animals. If a base animal has high-density genotype information, it is filled in by arbitrarily assigning one of its haplotypes for one of its cores as coming from its paternal gamete and the other haplotype as coming from its maternal gamete. Haplotypes at other cores are appended to the central haplotypes where overlapping information can be used to determine which haplotype at an adjacent partially overlapping core matches the arbitrarily labeled paternal (maternal) haplotype.
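One way to carry the arbitrary paternal/maternal labels from one core to a partially overlapping neighbour is to compare alleles in the overlapping region. The sketch below is an assumption about how such a comparison could work, not a description of the AlphaImpute code: it counts mismatches between the already labelled paternal haplotype and each of the two haplotypes of the adjacent core, restricted to the overlap.

```python
def orient_adjacent_core(paternal_overlap, hap_a_overlap, hap_b_overlap):
    """Decide which adjacent-core haplotype continues the labelled paternal one.

    Returns (paternal_label, maternal_label) as "a"/"b", or None when the
    overlap does not distinguish the two haplotypes.
    """
    def mismatches(x, y):
        return sum(1 for p, q in zip(x, y)
                   if p is not None and q is not None and p != q)

    miss_a = mismatches(paternal_overlap, hap_a_overlap)
    miss_b = mismatches(paternal_overlap, hap_b_overlap)
    if miss_a == miss_b:
        return None  # uninformative overlap (e.g. all loci homozygous)
    return ("a", "b") if miss_a < miss_b else ("b", "a")
```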
Major sub-step 3.
Candidate haplotype library imputation of alleles. For each core of each round of the LRPHLI algorithm, all haplotypes that have been found and stored in the haplotype library are initially considered to be candidates for the true haplotype that an individual carries on its gametes. Within the core, all alleles that are known are compared to corresponding alleles in each of the haplotypes in the library. Haplotypes that have a number of disagreements greater than a small error threshold have their candidacy rejected. At the end of this loop, the surviving candidate haplotypes are checked for locations that have unanimous agreement about a particular allele. For alleles with complete agreement, a count of the suggested allele is incremented. Alleles are imputed if, at the end of passing across each core and each round of the LRPHLI algorithm, the count of whether the alleles are 0 or 1 is above a threshold in one direction and below a threshold in the other. This helps to prevent the use of phasing errors that originate from LRPHLI.
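The candidate rejection, unanimity check, and vote counting described above can be sketched as follows. All function and parameter names (`max_disagreements`, `support_threshold`, `conflict_threshold`) are hypothetical, the threshold values are not given in the text, and `None` is assumed to mark a missing allele. `counts` holds, for every locus on the chromosome, a pair `[votes for 0, votes for 1]` accumulated over cores and rounds.

```python
def surviving_candidates(known, library, max_disagreements):
    """Keep library haplotypes that disagree with the known alleles at most max_disagreements times."""
    survivors = []
    for hap in library:
        disagreements = sum(1 for k, h in zip(known, hap)
                            if k is not None and k != h)
        if disagreements <= max_disagreements:
            survivors.append(hap)
    return survivors


def vote_core(known, library, max_disagreements, counts, start=0):
    """Add one vote per locus where all surviving candidates carry the same allele."""
    survivors = surviving_candidates(known, library, max_disagreements)
    if not survivors:
        return
    for i in range(len(known)):
        alleles = {hap[i] for hap in survivors}
        if len(alleles) == 1:
            counts[start + i][alleles.pop()] += 1


def call_alleles(counts, support_threshold, conflict_threshold):
    """Impute an allele only if its votes are above one threshold and the opposing votes below another."""
    calls = []
    for votes_0, votes_1 in counts:
        if votes_1 > support_threshold and votes_0 < conflict_threshold:
            calls.append(1)
        elif votes_0 > support_threshold and votes_1 < conflict_threshold:
            calls.append(0)
        else:
            calls.append(None)
    return calls


library = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 0, 1, 1]]
known = [0, None, None, 0]
counts = [[0, 0] for _ in known]
for _round in range(2):  # two rounds that happen to use the same core
    vote_core(known, library, 0, counts)
calls = call_alleles(counts, support_threshold=1, conflict_threshold=1)
```

In this toy example the third library haplotype is rejected, the two survivors agree everywhere except the third locus, and that locus is therefore left missing.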
Major sub-step 4.
Imputation based on parental phase. This is similar to Major sub-step 3, with the exception that the process is restricted to individuals that have parents with high-density genotype information and the candidate haplotypes for each individual’s gametes are restricted to the two haplotypes that have been identified for each of its parents by the LRPHLI algorithm. Errors in phasing are accounted for in the same way as in Major sub-step 3.
Major sub-step 5.
Individual phase imputation. This is similar to Major sub-step 3, with the exception that the process is restricted to individuals that have high-density genotype information and the candidate haplotypes are restricted to the two haplotypes that have been identified for the individual by the LRPHLI algorithm. Effectively, it determines the parental origin of each of these haplotypes. Errors in phasing are accounted for in the same way as in Major sub-step 3.
Major sub-step 6.
Internal candidate haplotype library imputation of alleles. This step is similar to Major sub-step 3, with the exception that haplotype libraries are internally built using the information that has been previously imputed. Several different core lengths are used to define the length of the haplotypes and to ensure that errors can be accounted for in the same way as in Major sub-step 3.
Major sub-step 7.
Internal imputation based on parental phase. This step is similar to Major sub-step 4, with the exception that it is attempted for all animals in the pedigree on the basis that their parents may now have imputed high-density information. In the same way as Major sub-step 6, several core lengths are used to define the length of the haplotypes and to ensure that errors can be accounted for in the same way as in Major sub-step 3.
Major sub-step 8.
Imputation based on identifying where recombination occurs during inheritance from parent to offspring. Each gamete of an individual is examined from the beginning to the end and from the end to the beginning of the chromosome. In each direction, at loci where both the individual and its parent are heterozygous and have phase information resolved, this information is used to determine which of the parental gametes the individual received. Loci for which this cannot be determined, but which lie between two loci that (a) can be determined and (b) come from the same parental gamete, are assumed to come from this gamete (i.e. no double recombination event in between). Alleles are imputed in the individual when analysis in both directions of the chromosome has identified the same inherited gamete and when the parent is phased for this locus in the suggested gamete, subject to the restrictions that the number of recombination events for the individual is less than a threshold and that the region in which two recombination events occurred exceeds a threshold length. Major sub-step 8 is iterated a number of times with increasingly relaxed restrictions. After each iteration, the minor sub-steps are also carried out.
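The two-directional scan can be sketched for one gamete of one individual as below. This is a simplified illustration under stated assumptions: missing alleles are `None`, a locus is treated as informative when the parent is phased heterozygous and the individual's allele on this gamete is known, and the restrictions on the number of recombination events and on region length are omitted. The function names are hypothetical.

```python
def carry_forward(values):
    """Propagate the last non-missing value along the list."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def inherited_origin(child_gamete, parent_phase):
    """Per locus: 0/1 for the parental gamete inherited, or None if undetermined."""
    parent_pat, parent_mat = parent_phase
    informative = []
    for c, p, m in zip(child_gamete, parent_pat, parent_mat):
        if c is None or p is None or m is None or p == m:
            informative.append(None)  # cannot tell the parental gametes apart here
        else:
            informative.append(0 if c == p else 1)
    forward = carry_forward(informative)
    backward = carry_forward(informative[::-1])[::-1]
    # Accept an origin only where both scan directions agree.
    return [f if f is not None and f == b else None
            for f, b in zip(forward, backward)]


def impute_from_parent(child_gamete, parent_phase):
    """Fill missing alleles where the inherited parental gamete is known and phased."""
    origin = inherited_origin(child_gamete, parent_phase)
    result = list(child_gamete)
    for i, o in enumerate(origin):
        if result[i] is None and o is not None and parent_phase[o][i] is not None:
            result[i] = parent_phase[o][i]
    return result


parent = ([0, 1, 0, 1, 1, 0], [1, 1, 1, 0, 0, 1])
no_recombination = impute_from_parent([0, None, None, 1, None, None], parent)
with_recombination = impute_from_parent([0, None, None, 0, None, None], parent)
```

In the first case the two informative loci both point to the parent's first gamete, so the loci between them are filled from it, while loci beyond the last informative locus stay missing. In the second case the informative loci point to different parental gametes, so nothing between them is imputed.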
Major sub-step 9.
Recalculate genotype probabilities. Using the imputed genotype information, the allele probabilities and genotype probabilities are recalculated as in Step 1. Alleles that are still missing are imputed as the recalculated allele probability. Missing genotypes are imputed from the allelic probabilities when both alleles are still missing, in advance of the recalculation of allele probabilities, or from the imputed allele and the allele probability when only one allele was not imputed.
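One plausible reading of this final fill-in is that each allele contributes either its imputed value or, if still missing, its recalculated probability of being a 1, and that the genotype is the sum of the two contributions. The sketch below follows that reading; the summation and all names are assumptions rather than a statement of the AlphaImpute implementation.

```python
def genotype_dosage(paternal, maternal, paternal_prob_one, maternal_prob_one):
    """Genotype as the sum of imputed alleles, using probabilities for missing alleles."""
    pat = paternal if paternal is not None else paternal_prob_one
    mat = maternal if maternal is not None else maternal_prob_one
    return pat + mat
```

For instance, an individual with an imputed paternal 1 and a missing maternal allele with probability 0.25 of being a 1 would receive a genotype of 1.25.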
Other miscellaneous steps
Steps are also included to divide the data into the high and low density sets of animals, to edit the SNP data, check for Mendelian inconsistencies, and identify base parents.
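As an example of one such check, a parent and its offspring cannot be opposing homozygotes at the same locus, because the offspring must carry one of the parent's alleles. The sketch below flags such loci using the genotype coding given earlier (3 = missing); the function name is hypothetical and the text does not specify how AlphaImpute itself performs or acts on this check.

```python
def mendelian_inconsistencies(parent_genotypes, offspring_genotypes):
    """Return the loci at which parent and offspring are opposing homozygotes (0 vs 2)."""
    return [locus
            for locus, (p, o) in enumerate(zip(parent_genotypes, offspring_genotypes))
            if {p, o} == {0, 2}]
```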