Extending long-range phasing and haplotype library imputation algorithms to very large and heterogeneous datasets

Background This paper describes the latest improvements to the long-range phasing and haplotype library imputation algorithms that enable them to successfully phase both datasets with one million individuals and datasets genotyped using different sets of single nucleotide polymorphisms (SNPs). Previous publicly available implementations of long-range phasing could not phase large datasets due to the computational cost of defining surrogate parents by exhaustive all-against-all searches. Further, both long-range phasing and haplotype library imputation were not designed to deal with large amounts of missing data, which is inherent when using multiple SNP arrays. Methods Here, we developed methods which avoid the need for all-against-all searches by performing long-range phasing on subsets of individuals and then combing results. We also extended long-range phasing and haplotype library imputation algorithms to enable them to use different sets of markers, including missing values, when determining surrogate parents and identifying haplotypes. We implemented and tested these extensions in an updated version of our phasing software AlphaPhase. Results A simulated dataset with one million individuals genotyped with the same set of 6,711 SNP for a single chromosome took two days to phase. A larger dataset with one million individuals genotyped with 49,579 SNP for a single chromosome took 14 days to phase. The percentage of correctly phased alleles at heterozygous loci was respectively 90.5% and 90.0% for the two datasets, which is comparable to the accuracy achieved with previous versions of AlphaPhase on smaller datasets. The phasing accuracy for datasets with different sets of markers was generally lower than that for datasets with one set of markers. For a simulated dataset with three sets of markers 2.8% of alleles at heterozygous positions were phased incorrectly whereas the equivalent figure with one set of markers was 0.6%. Conclusions The improved long-range phasing and haplotype library imputation algorithms enable AlphaPhase to quickly and accurately phase very large and heterogeneous datasets. This will enable more powerful breeding and genetics research and application.

designed to deal with large amounts of missing data, which is inherent when using 23 multiple SNP arrays. 24

25
Here, we developed methods which avoid the need for all-against-all searches 26 by performing long-range phasing on subsets of individuals and then combing results. 27 We also extended long-range phasing and haplotype library imputation algorithms to 28 enable them to use different sets of markers, including missing values, when 29 determining surrogate parents and identifying haplotypes. We implemented and tested 30 these extensions in an updated version of our phasing software AlphaPhase. 31

32
A simulated dataset with one million individuals genotyped with the same set 33 of 6,711 SNP for a single chromosome took two days to phase. A larger dataset with 34 one million individuals genotyped with 49,579 SNP for a single chromosome took 14 35 days to phase. The percentage of correctly phased alleles at heterozygous loci was 36 Background 50 51 Here we describe the latest improvements to the Long-Range Phasing (LRP) and 52 Haplotype Library Imputation (HLI) algorithms to phase genotypes for hundreds of 53 thousands of individuals that have been genotyped on different platforms. Phasing 54 genotypes is the process of inferring the parental origin of individual's alleles. This 55 process resolves the inheritance of chromosome segments in a population and is as 56 such a cornerstone technique in genetics. For example, it is useful for making 57 genotype calls, imputing genotypes, detecting phenotype-genotype associations in the 58 presence of effects such as allele-specific expression, and inferring recombination 59 points and demographic history [1]. 60 The size of genomic datasets has grown rapidly in recent years as genotype data is 61 collected on an increasing number of individuals. In agriculture this growth has been 62 driven by the increased value of genomic selection [2-4], whereas in human genetics 63 it has been driven by the increased power of genome-wide association studies [5-7] 64 and genomic prediction in human medicine [8]. Examples of such large datasets 65 include the UK Biobank [9], which has recently released genotype data from 66 approximately half a million people [10], and the US Dairy Cattle and Irish Cattle 67 Breeding Federation Databases that each host well over a million of genotyped 68 animals [4,11,12]. 69 In many cases these datasets have been collected over several years and have been 70 genotyped using different single nucleotide polymorphism (SNP) arrays [4,12]. 71 Methods such as SNPchiMp [13] have been developed to allow the manipulation of 72 different sets of markers from multiple SNP arrays, but their main aim is to ensure 73

116
Both LRP and HLI operate on genome regions called cores. A core is a set of 117 consecutive SNPs for which phasing is being attempted. For further details see Hickey 118 et al. [16]. 119 LRP infers the phase of an individual by using other individuals known to share a 120 haplotype with the individual. Individuals sharing a haplotype are called "surrogate 121 parents" (shortened here to "surrogates") and are identified by finding no opposing 122 homozygote markers at any position within a core. These surrogates are then 123 partitioned into either paternal or maternal surrogates of the individual using pedigree 124 information, if it is available. If pedigree information is not available, this assignment 125 is arbitrary. 126 If a surrogate is homozygous at a position then it enables phasing of the individual at 127 that position. If no homozygous surrogate is found, then it may be possible to phase 128 the individual by using surrogates of surrogates. This process can be continued to an 129 arbitrary depth. In practice, the consensus of several homozygous surrogates is taken 130 to allow for error in determining surrogates or genotype data. 131 HLI infers phase by matching partially phased haplotypes to a library of known 132 haplotypes. In the existing algorithm the initial library is constructed from the fully 133

154
The LRP algorithm was modified to enable the identification of surrogates in the 155 presence of missing data. Missing data hinders the identification of opposing 156 homozygotes and so has the potential to wrongly identify surrogates. To alleviate this 157 problem we introduced an additional parameter that defines the required number of 158 shared markers by two individuals before surrogacy is tested. 159 The HLI algorithm required more complex modifications. In a multiple SNP array 160 setting it is likely that most, or even all, individuals will have been genotyped with 161 one array, so will not have a data for all array markers. Consequently, LRP cannot 162 -9 -infer parental origin of alleles at missing markers. We developed methods that 163 allowed partially inferred haplotypes to be included in the haplotype library and to be 164 used to infer other haplotypes. 165 Allowing for partially inferred haplotypes in the haplotype library makes matching a 166 new partially inferred haplotype to a library haplotype much more difficult. It is 167 necessary to ensure that the two haplotypes have enough markers with non-missing 168 information to be confident they are the same haplotype. Thus, we added a parameter 169 to the HLI algorithm that specifies the required number of shared alleles to match two 170 haplotypes ( Figure 1a). 171 In some cases it is possible that the new haplotype matches more than one library 172 haplotype. In these cases it is possible that the new haplotype identifies duplicated 173 library haplotypes (Figure 1b) AlphaPhase was modified to store haplotypes and genotypes as bits and to use bit 183 operations to operate on multiple SNPs at once wherever possible. It was also 184 parallelised in several places to exploit high performance computing clusters. 185 sampled SNPs from the unused set. From these new sets we then created a second HD 233 and second MD array. We generated the third and fourth HD and MD arrays from the 234 previous arrays in the same way. Rather than removing the exact number of SNPs that 235 were to be added (or removed) we sampled based on a probability that resulted in the 236 expected number of SNPs being added (or removed). Because of this sampling the 237 resulting arrays were of slightly different in size. 238 We created the fifth HD and MD arrays to represent arrays from a different family of 239 arrays, perhaps from a different manufacturer. These arrays were simulated in a 240 similar way to the other arrays but by removing and replacing fifty percent of the 241 SNPs from the first HD and MD arrays. 242 For all scenarios we simulated individuals as being genotyped on different arrays by 243 assigning them to arrays in proportions we might expect to see in real datasets (Table  244 1). 245

Phasing parameters used for AlphaPhase 246
AlphaPhase has several parameters that control phasing of alleles. Two of these were 247 expected to have a significant effect on the performance of AlphaPhase -the existing 248 parameter controlling core length (defined as the number of SNP in each core) and a 249 new parameter that controlled the size of phasing subsets to speedup phasing of a 250 large dataset. 251 The length of the cores can have a significant effect on phasing accuracy [16]. To find 252 the best core length for both of the MD and HD scenarios we tested different core 253 lengths. For the Illumina 50Kv2 scenarios we tested core lengths in the same range as 254 those tested in Hickey et al. [16] for a similar size array: 50, 100, 200, 500, and 1,000 255 SNPs. For the Illumina HD scenario we tested core lengths of 500, 1,000, 2,000, 256 5,000, and 10,000 SNPs because the Illumina HD array contains approximately ten 257 times as many SNPs as the Illumina50Kv2 array. 258 We tested different sizes of the phasing subsets as this was expected to have an effect 259 on phasing accuracy. Tested values were 500, 1,000, 2,000, 5,000, and 10,000 260 individuals. For the Illumina 50Kv2 scenario we tested all combinations of core 261 length and subset size. We only report subset size results for a fixed core length of 262 500 SNPs since the interaction between core length and subset size was minimal (data 263 not shown). For the Illumina HD scenario we set the core length to 5,000 SNPs when 264 testing subset size. 265 For evaluating the Heterogeneous Array scenarios we set the core length to 500 SNPs 266 when the dataset consisted of only MD arrays since this value gave good performance 267 in Homogeneous MD Array scenarios. Similarly, for datasets containing HD arrays 268 we set core length to 5,000 SNPs. In both scenarios we set subset size to 5,000 SNPs. 269 AlphaPhase had several other parameters for which fixed default values were used. 270 Specifically, we fixed the maximum number of surrogates used to ten and allowed 271 10% of the marker genotypes to disagree between pairs of surrogates. We also set the 272 number of allele mismatches for clustering pairs of nearly identical library haplotypes 273 to be zero. 274 When phasing multiple arrays we added an additional parameter to AlphaPhase. This 275 parameter governs the minimum required number of matching alleles before two 276 haplotypes can be identified as the same haplotype. If all SNPs were independent of 277 each other, i.e., there was no linkage between them, we would expect the optimal 278 value of this parameter to remain unchanged irrespective of SNP density. Our results 279 (data not shown) suggest that the presence of linkage does not have a significant 280 effect for the SNP densities considered here and that requiring a match of 200 alleles 281 between two haplotypes is an appropriate value for this parameter. If SNP arrays with 282 greater density are considered then the value of this parameter may need to be revised. 283

Performance Testing 284
To test the performance of the new improvements to the LRP and HLI algorithms on 285 large datasets we used the data from the Homogeneous Array scenarios for both the 286 100k and one million datasets. To test the scenario where parents are known and 287 genotype information is available for them we evaluated phasing accuracy within each 288 of the ten breeds individually using data from all generations. Similarly, to test the 289 scenario where no parentage information is available we evaluated phasing accuracy 290 for each of the ten generations individually ( Figure 2). We report average results 291 across either all ten families or all ten generations. 292 To test the speed and memory usage of AlphaPhase on large datasets we tested 293 multiple combinations of number of generation and families from both the 1000k and 294 the one million datasets using Homogeneous Array scenarios. To test the performance 295 of the new improvements to the LRP and HLI algorithms on heterogeneous datasets 296 we used the data from the Heterogeneous Array scenarios on the 100k dataset. 297 We report three phasing statistics -percentage of correctly phased alleles, percentage 298 of unphased alleles, and percentage of incorrectly phased alleles. Unless explicitly 299 stated otherwise, we report these statistics for heterozygous loci only. We also report 300 on memory usage and runtimes. Runs were performed on computers with an Intel 301 Xeon Processor E5-2630 v3 (2.4 GHz) and between 64 and 256GB of RAM. 302

Core Length 305
To determine the accuracy of our new sub-setting method we first determined the 306 optimal core length for each of the Illumina 50Kv2 and Illumina HD scenarios. Figure  307 3 and Tables S1-S2 show the accuracy on the Illumina 50Kv2 scenario for a variety of 308 core lengths. Figure 3a shows the percentage of correctly phased heterozygous loci for 309 the Illumina 50Kv2 array per family scenario. The percentage of correctly phased 310 alleles increased as the core length increased, although the difference in accuracy 311 between a core length of 500 SNPs (92.7%) and 1,000 SNPs (93.3%) was small. For 312 the per generation scenario the percentage of correctly phased alleles peaked at a core 313 length of 500 SNPs (92.3%) before dropping significantly for a core length of 1,000 314 SNPs (89.8%). The pattern for the number of incorrectly phased alleles for the 315 Illumina 50Kv2 (Figure 3b) was less clear although there was a significant increase in 316 the number of incorrectly phased alleles for a core length of 1,000 SNPs. Using a core 317 length of 500 SNP the percentage of alleles incorrectly phased was 0.6% (per family 318 scenario) and 0.9% (per generation scenario). 319 Figure 3 and Tables S3-S4 show that for the Illumina HD scenario the percentage of 320 correctly phased alleles at heterozygous loci peaked at either a core length of 2,000 321 SNPs (per generation) or 5,000 SNPs (per family). For this scenario the number of 322 incorrectly phased alleles was minimised at a core length of 1,000 SNPs (0.3% per 323 family; 0.5% per generation), a shorter core length than that which maximised the 324 number of correctly phased markers. Using a core length of 5,000 SNPs 93.5% (per 325 family) or 92.1% (per generation) of alleles were phased correctly, while 0.8% (per 326 family) or 1.3% (per generation) were phased incorrectly. 327 In all scenarios runtime was inversely proportional to core length (Tables S1-S4). We 328 chose to study core lengths of 500 SNPs (for MD scenarios) and 5,000 SNPs (for HD 329 scenarios) as a reasonable trade-off between accuracy and runtime. For these core 330 lengths runtime was around three minutes for both arrays and for both the per 331 generation and per family scenarios. Memory usage was 2.5GB for the Illumina 332 50Kv2 array and 5.3GB for the Illumina HD array (Tables S1-S4). 333

Subset Size 334
Subset size can be expected to have a significant effect on the accuracy of phasing as 335 it will directly influence the number of surrogates that are found. To test this we 336 evaluated subset sizes of between 500 and 10,000 individuals ( In nearly all scenarios runtime was proportional to subset size (Tables S5-S8). The 347 exception was for the Illumina 50Kv2 per family scenario in which runtime for a 348 subset size of 10,000 individuals was noticeably shorter than runtime for a subset size 349 -17 -of 5,000 individuals. This was likely due to nearly all animals having had both parents 350 included in the subset when the subset size was 10,000, which can significantly 351 decrease the time required to partition surrogates into maternal and paternal 352 surrogates. Memory usage also increases as subset size grows. We chose to use a 353 subset size of 5,000 for the remainder of this study as a reasonable trade-off between

Accuracy, Runtime, and Memory Usage on Different Dataset Sizes 361
To test the performance of our new improvements to the LRP and HLI algorithms on 362 datasets of different sizes we created multiple different sized scenarios from the 100k 363 and the one million datasets. Phasing accuracy was broadly comparable to the phasing 364 accuracy observed when investigating optimal core length and subset size (Tables S9  365 and S10). Figure 5 shows that runtimes scaled approximately linearly with the number 366 of individuals in a dataset. For the Illumina 50Kv2 dataset memory usage varied 367 between 0.6GB for the smallest dataset of 1,000 individuals to 76GB for a dataset of 368 one million individuals ( Figure 6 and Table S19). Comparable figures for the Illumina 369 HD dataset were 0.9GB and 325GB ( Figure 6 and We also tested a mixture of one MD array and one HD array with nine individuals 380 genotyped on the MD array for every individual genotyped on the HD array (Mixed 381 MD/HD). As expected the percentage of correctly phased alleles at heterozygous loci 382 was lower than in other scenarios, but was still 93.4%. Runtime was around six hours 383 and memory usage was 5.3GB. For the Ten Array scenario the percentage of correctly 384 phased alleles was slightly higher at 95.4%, although memory usage was also higher 385 at 7.6GB. 386 Table 3 shows results for the per generation scenarios. Results were broadly 387 comparable to the per family results. Across the scenarios containing only MD arrays 388 the percentage of correctly phased alleles at heterozygous loci was between 93.1% 389 and 95.6% with between 3.1% and 3.6% incorrectly phased. Runtime was similar to 390 the per family scenarios taking around three minutes for the two array scenarios and 391 two minutes for the Three MD scenarios. Memory usage was the same as that of the 392 per family scenarios. As expected we found that core length has a significant effect on phasing accuracy. 418 Short cores showed similar levels of phasing accuracy. Accuracy started to deteriorate 419 notably as cores get longer. We also found that phasing accuracy is a function of the 420 length of the core as a proportion of the length of the chromosome, rather than the 421 number of SNPs it contains, and that the Illumina 50Kv2 and Illumina HD arrays 422 show a very similar pattern of results when core length is expressed as a proportion of 423 chromosome length. This is to be expected, as phasing accuracy is likely to be highly 424 affected by the presence of recombinations within a core. The latter can be reliably 425 measured by the relative size of a core versus the whole chromosome, and less so by 426 the number of SNP array markers in a core. The reduction in accuracy observed as the 427 core length increased is likely due to the increased chance of a core containing a 428 recent recombination. This would reduce the number of surrogates and thus, reduce 429 the information available for phasing. increased. This decrease in phasing performance was very likely due to the reduction 447 in the number of surrogates that would be expected in a smaller subset which, in turn, 448 led to less information with which to accurately phase. 449 The increase in phasing accuracy that resulted from increasing subset size had a 450 significant cost in terms of runtime. We found an approximately linear relationship 451 between the size of the subset and runtime with the used runtime parameters. 452 Consequently, there was a trade-off between runtime and accuracy. As the size of 453 subsets got larger the number of incorrectly phased alleles appeared to begin to 454 plateau for a subset size of 5,000 or greater and runtime started to increase 455 significantly. For the datasets we tested the subset size for which this value occurred 456 seemed to be largely invariant to total dataset size and so we suggest that a subset size 457 of 5,000 is an appropriate compromise. The optimal size for datasets with a different 458 structure, such as a human populations in which individuals are likely to be less 459 related than the ones considered here, warrants further investigation. 460 Phasing of large datasets is likely to be computationally expensive for any phasing 461 techniques due to the large number of individuals involved. Our results suggest that 462 there is sufficient information in small subsets from a larger dataset to allow a 463 significant number of alleles to be phased accurately. This suggests that for other 464 phasing methods, such as those based on probabilistic models, it could also be 465 datasets. In general, accuracy was slightly worse than for the single array scenarios 480 that were tested, although in many scenarios the Heterogeneous Arrays phased 481 slightly more alleles correctly. However, this increase in percentage of correctly 482 phased alleles came at the cost of phasing more alleles incorrectly as well. 483 AlphaPhase can now also accurately phase individuals genotyped on a mixture of MD 484 and HD SNP arrays. The phasing of such datasets is likely to become increasingly 485 common as it is desirable to continue to use the data already collected using MD 486 arrays even as the use of HD arrays grows. Although the phasing accuracy for 487 heterogeneous datasets was often lower than when individuals were genotyped on a 488 single SNP array, the percentage of correctly phased alleles was still over 93% in all 489 scenarios tested other than the Mixed MD/HD per generation scenario. In this 490 scenario many individuals had high amounts of missing data due to them being 491 genotyped on the MD rather than HD array and so we would expect phasing to be 492 more difficult. 493  Figure 1 -New improvements to the LRP and HLI algorithms for dealing with 587 library haplotypes with missing data. a) In this example we have generated 588 haplotypes using two different SNP arrays indicated by green and blue haplotypes. If 589 the shared markers between two haplotypes are identical (shown in red) then the two 590 haplotypes can be merged into one haplotype. To ensure the two haplotypes are the 591 same haplotype we set a minimum number of alleles that must be shared. Note that 592 in reality blue and green markers will both be distributed along the length of the 593 haplotype. b) In this example we have generated haplotypes using three different 594 SNP arrays. Finding the new purple haplotype allows us to recognise that the green, 595 purple, and blue haplotype are actually the same haplotype.   Tables 636 Table 1 -The different genotyping scenarios tested   637   S  c  e  n  a  r  i  o  D  e  s  c  r  i  p  t  i  o  n  I  l  l  u  m  i  n  a  5  0  K  v  2  I  l  l  u  m  i  n  a  5  0  K  v  2  (  a  l  l  )  I  l  l  u  m  i  n  a  H  D  I  l  l  u  m  i  n  a  H  D  (  a  l  l  )  T  w  o  I  l  l  u  m  i  n  a  I  l  l  u  m  i  n  a  5  0  K  v  1  a  n  d  I  l  l  u  m  i  n  a  5  0  K  v  2  i  n  a  1  :  1  r  a  t  i  o  T  w  o  M  i  x  e  d  I  l  l  u  m  i  n  a  5  0  K  v  2  a  n  d  I  D  B  v  3  i  n  a  1  :  1  r  a  t  i  o  T  h  r  e  e  M  D  I  l  l  u  m  i  n  a  5  0  K  v  2  ,  G  S  e  e  k  H  D  a  n  d  I  D  B  v  3  i  n  a  1  :  1  :  1  r  a  t  i  o  M  i  x  e  d  M  D  /  H  D  I  l  l  u  m  i  n  a  5  0  K  v  2  a  n  d  I  l  l  u  m  i  n  a  H  D  i  n  a     -41 -  -42 -    -45 -