Comparison of linkage disequilibrium estimated from genotypes versus haplotypes for crossbred populations

Background Linkage disequilibrium (LD) is commonly measured based on the squared coefficient of correlation \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left({r}^{2}\right)$$\end{document}r2 between the alleles at two loci that are carried by haplotypes. LD can also be estimated as the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}^{2}$$\end{document}r2 between unphased genotype dosage at two loci when the allele frequencies and inbreeding coefficients at both loci are identical for the parental lines. Here, we investigated whether \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}^{2}$$\end{document}r2 for a crossbred population (F1) can be estimated using genotype data. The parental lines of the crossbred (F1) can be purebred or crossbred. Methods We approached this by first showing that inbreeding coefficients for an F1 crossbred population are negative, and typically differ in size between loci. Then, we proved that the expected \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}^{2}$$\end{document}r2 computed from unphased genotype data is expected to be identical to the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}^{2}$$\end{document}r2 computed from haplotype data for an F1 crossbred population, regardless of the inbreeding coefficients at the two loci. Finally, we investigated the bias and precision of the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}^{2}$$\end{document}r2 estimated using unphased genotype versus haplotype data in stochastic simulation. Results Our findings show that estimates of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}^{2}$$\end{document}r2 based on haplotype and unphased genotype data are both unbiased for different combinations of allele frequencies, sample sizes (900, 1800, and 2700), and levels of LD. In general, for any allele frequency combination and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}^{2}$$\end{document}r2 value scenarios considered, and for both methods to estimate \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}^{2}$$\end{document}r2, the precision of the estimates increased, and the bias of the estimates decreased as sample size increased, indicating that both estimators are consistent. For a given scenario, the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}^{2}$$\end{document}r2 estimates using haplotype data were more precise and less biased using haplotype data than using unphased genotype data. As sample size increased, the difference in precision and biasedness between the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}^{2}$$\end{document}r2 estimates using haplotype data and unphased genotype data decreased. Conclusions Our theoretical derivations showed that estimates of LD between loci based on unphased genotypes and haplotypes in F1 crossbreds have identical expectations. Based on our simulation results, we conclude that the LD for an F1 crossbred population can be accurately estimated from unphased genotype data. The results also apply for other crosses (F2, F3, Fn, BC1, BC2, and BCn), as long as (selected) individuals from the two parental lines mate randomly.

LD is traditionally measured based on the comparison of the observed haplotype frequencies with the expected haplotype frequencies under linkage equilibrium. A common statistical measure of LD is the covariance between loci, D , which is equal to the excess of coupling phase haplotypes, D ij = P ij − P i P j , where P ij refers to the frequency of gametes (haplotypes) that carry the pair of alleles i and j at the two loci, P i and P j refer to the frequency at locus i and locus j , respectively, and P i P j is the expected frequency of this haplotype under linkage equilibrium [6]. Another common measure is the squared coefficient of correlation ( r 2 ) between the alleles at the two loci within haplotypes, [7].
To calculate D and r 2 using the expressions given above, the haplotypes carried by the individuals must be known. However, Rogers and Huff [8] showed that LD can also be estimated by correlating unphased genotype dosages at the two loci, which makes the computation simple and fast. They demonstrated that LD estimated from unphased genotypes yields very similar results to LD estimated from haplotypes. In their derivation, however, they assumed equal inbreeding coefficients for the two loci and equal allele frequencies for the paternal and maternal gametes that created the population. In this context, the inbreeding coefficient measures the departure from Hardy-Weinberg equilibrium and, thus, can take positive or negative values. However, for crossbred individuals inbreeding coefficients can differ between the two loci, and paternal and maternal allele frequencies can differ because the two parents come from different lines.
Here, we investigated whether LD in crossbred populations can be estimated using unphased genotype data. We assumed that sires and dams of the crossbreds originate from two distinct lines but are otherwise mated to each other at random. We address this question in three steps. First, we derive the inbreeding coefficients of crossbreds, showing that they take negative values that typically differ between loci. As a result, the derivation of Rogers and Huff [8] cannot be used to demonstrate the equivalence of genotype-based LD to haplotype-based LD for a crossbred population. Second, we show theoretically that LD computed from genotype frequencies has the same expected value for a given dataset as LD computed from haplotype frequencies, even for a crossbred population. Finally, we investigate the precision and potential bias of LD estimated from unphased genotype data versus haplotype data, using stochastic simulation.

Inbreeding coefficients for a crossbred population
Consider two outbred lines, A and B . We want to investigate the inbreeding coefficients for two bi-allelic loci, M and N , in the F1 crossbred offspring that result from the crossing of random individuals from two parental lines. With alleles denoted 0 and 1, p AM is the frequency of allele 1 at locus M in line A , and p BM is the frequency of allele 1 at locus M in line B . The expected frequency of allele 1 at locus M in the crossbreds then is With random mating between individuals from the two parental lines, the frequency of genotype 11 in the crossbreds is p AM p BM . The deviation of this frequency from Hardy-Weinberg equilibrium follows from [6,9].
where f M is the inbreeding coefficient at locus M in the crossbreds.
The inbreeding coefficient follows from solving this expression for f M , substituting p M = p AM +p BM 2 , and simplifying the expression, giving: Note that the numerators of f M and f N are always negative, except when p AM = p BM and p AN = p BN , while the denominators are always positive. This shows that the inbreeding coefficients of crossbreds are negative, meaning that heterozygosity is greater than would be expected under Hardy-Weinberg equilibrium (for example p AM = 0.05 , p BM = 0.09 , p AN = 0.25, and p BN = 0.29 yields f M = −0.0056 and f N = −0.0015).
We investigated under which conditions the inbreeding coefficients at the two loci are equal by solving the expression f N = f M for the allele frequencies, using Wolfram Mathematica (www. wolfr am. com). Apart from the trivial solutions of p = 0, p = 1, and equal allele .
Marginal frequency r + t s + u r + s + t + u = 1 frequencies at both loci, we found only three solutions (see Appendix 1). Hence, this result demonstrates that the inbreeding coefficients at two arbitrary loci in a crossbred population will usually be different. This implies that the derivation of Rogers and Huff [8] cannot be used to demonstrate the equivalence of genotype-based LD to haplotype-based LD for a crossbred population.

Haplotype-based linkage disequilibrium
In this section, we show that the expected LD based on r 2 computed from the genotype frequencies of the crossbred population is identical to the true r 2 based on haplotype frequencies, even when the inbreeding coefficients differ between the two loci. Note that we consider the true (i.e., population) value of r 2 here, rather than an estimate from a sample. As we consider bi-allelic loci, we have four haplotype frequencies for each line, denoted r , s , t , and u for line A , and using ′ to refer to frequencies for line B , we have haplotype frequencies r ′ , s ′ , t ′ , and u ′ for line B . Table 1 shows expressions for the marginal frequency for each of the alleles. Although the expressions for the marginal frequencies in Table 1 can be simplified by formulating them in terms of allele frequencies, we stick to the haplotype frequencies to facilitate comparison with results for the genotype-based r 2 . Crossbred genotypes consist of two sets of haplotypes, one from each parental line, which may have a different r 2 . By definition, the r 2 in the crossbreds depends on the (co)variances between loci in the crossbred population, so we cannot simply average the r 2 of the two parental lines. From the definitions of correlation, variance, and covariance, it follows that the r 2 for the crossbred population equals the square of the average of the covariances between haplotypes for each of the two lines, divided by the product of the average variance across the two lines at each locus. For line A , the covariance between haplotypes (i.e. D) follows from Table 1 as u − (t + u)(s + u) , where u is the expectation of the cross product of the allele frequencies at each locus, while (t + u)(s + u) is the cross product of the expectations of these allele frequencies (expected haplotype frequency in line A under linkage equilibrum). Hence, this result follows immediately from the definition of a covariance. The covariance ( D ) for line B is analogous, using symbols denoted by ′ . The variance in allele count follows from the binomial distribution with n = 1 for haplotypes and are thus equal to p(1 − p) , p denoting the allele frequency. For line A the variance equals (s + u)(r + t) for locus N , and (t + u)(r + s) for locus M , with analogous equations for line B . Using these values in the haplotype-based r 2 for the crossbred population yields the following true r 2 in the crossbred population: where the numerator is the square of the average of the covariances for the two parental lines, while the denominator is the product of the average of the variances. Note that the constant 2 2 in the numerator of Eq. (1) and 2 2 in the demoninator of Eq. (1) (2 for each variance) cancelled out in the derivation of the equation.

Genotype-based squared correlation
The following inputs are required to derive the genotypebased r 2 in crossbreds: genotype frequencies and the expectations of squares and cross products of genotype dosage, 0, 1, and 2, in crossbreds. Using the haplotype frequencies in Table 1 and the assumption that individuals of line A mate at random to individuals of line B , we find the genotype frequencies in the crossbred population as shown in Table 2. Next, using these genotype frequencies, Table 3 shows the expectations of squares and cross products of genotype dosages. Computations of the expectations of combinations of genotypic values are in Appendix 1.
Using the values in Table 3, the covariance of genotype dosage at the two loci follows from cov M g N g = E M g N g − E M g E N g , where M g and N g are the genotype dosages at loci M and N , and the variances of genotype dosage follow from var M g = E M 2 g −E 2 M g and the corresponding expression for locus N . Substituting the resulting expressions into the (1)  This expression is identical to the expression for the true haplotype-based r 2 (Eq. (1)). Thus, when two lines (the lines can be pure or crossbred) are crossed but individuals from the two lines are mated at random to each other, expectations of the genotype-based and the haplotype-based r 2 in the crossbreds (F1, F2, Fn) and in other cross types (BC1, BC2, BCn) are identical, irrespective of differences in the inbreeding coefficients at the two loci. Note that our derivation also applies to other measures of LD, i.e. D and D ′ . For example, measures of D based on genotypes and haplotypes are the numerators of Eqs. (1) and (2), which are identical. Furthermore, using Eqs. (1) and (2), the r in the crossbred population can be predicted if the haplotype and genotype frequencies of the two parental lines are known.
Note that Eq. (2) refers to the expected r 2 between the genotype dosage at the two loci, not to an estimate thereof. Hence, although the expected values of r 2 geno and r 2 hap are identical, their estimates for a given data set may differ depending on sampling bias and the sampling errors of the estimates. This will be investigated using a simulation study in the next section. (2) .

Simulation
The objective of the simulation was to investigate and compare the bias and precision of the genotype-based and haplotype-based estimates of r 2 for a crossbred population. We investigated the bias and the precision for different sets of allele frequencies, levels of LD as measured by r 2 , and sample sizes. To limit computation time, we directly sampled haplotypes according to their probability distribution, rather than simulating a population of individuals. The haplotype probability distribution follows from the allele frequencies at the two loci and the level of LD. Using the haplotype frequencies and sample size, haplotypes were sampled from a multinomial distribution for each of the two parental lines. The genotypes of the crossbred individuals were obtained by random sampling of one haplotype from each line. Next, the genotype-based and haplotype-based estimates of r 2 were computed from the genotypes and haplotypes, respectively, of the crossbred offspring. The parameter values (allele frequencies, r 2 for each line, and sample size) that were used for simulation were used to compute the true r 2 in the crossbreds, using Eq. (1), which was used as a benchmark to evaluate the precision and bias of the two estimates of r 2 . Thus, there were three measures of r 2 : Table 3 Unordered genotypes, their genotype dosages, frequencies, and expectations of genotype dosages, and squares and cross products of genotype dosages, for locus M and N, in the crossbred offspring from random mating between lines A and B a E M 2 g refers to the expectation of the squared genotype dosage for locusM , and similar definitions apply for theE M g ,E N 2 g , E N g ,E M g N g . The derivation of the expectations of genotype dosages is given in Appendix 2

Frequency Expectations
the true r 2 calculated from the parameter values used for simulation, the haplotype-based estimate of r 2 , and the genotype-based estimate of r 2 . For each set of parameters, results were based on 1000 replicates. We used the R software [10] to simulate the data and analyse the results. The source code for the simulation is available at the following GitHub repository. https:// github. com/ seteg nworku/ Simul ation-code-for_ LD_ cross bred_ pop.

Scenarios investigated
We considered only biallelic loci at two loci in crossbreds resulting from the random mating of two outbred lines ( A and B ). We varied three parameters: (i) allele frequencies and (ii) r 2 in the parental lines, and (iii) the sample size. For the allele frequencies, we considered a range from 0.05 to 0.45, incremented by 0.10, for both lines. To limit the number of scenarios, we used equal allele frequencies at the two loci for most scenarios.
Note that there is no true difference between the major and the minor allele, e.g., p A = 0.05 is equivalent to p A = 0.95, such that results for allele frequencies ranging from 0.55 to 0.95 are identical to those for 0.05 to 0.45. For r 2 in the parental lines, we considered values of 0.2, 0.4, 0.6, and 0.8. To reduce the number of scenarios, r 2 was the same in both lines. We considered sample sizes of 900, 1800, and 2700. This resulted in a total of 180 scenarios with equal allele frequencies at the two loci within each line, of which 120 had different allele frequencies between the two lines, and all had equal r 2 in the two lines (Table 4). In addition to those 180 scenarios, we investigated a few scenarios where allele frequencies differed between loci within the parental lines and for which r 2 differed between the parental lines.

Results and discussion
The full results for all 180 simulated scenarios, including bias, ratio of precision (ratio of standard deviation for the r 2 estimates using unphased genotype and haplotype data), correlation of the standard deviation, of the r 2 estimate using unphased genotype and haplotype data is given in the following R shiny App (https:// seteg nmaths. shiny apps. io/ LD_ App/). The source code for the Shiny App is available in the following github repository:  Fig. 1 Comparison of estimates of linkage disequilibrium (r 2 ± SD) based on unphased genotype and haplotype data for scenarios where allele frequencies differed between loci and between lines, with r 2 = 0.2 for line A and r 2 = 0.4 for line B . Sample size was 900 (1000 replicates) https:// github. com/ seteg nworku/ Linka ge_ diseq uilib rium_ cross bred_ Shiny App. Results showed that the estimates of r 2 for 180 scenarios were unbiased, both for the haplotype-based and the unphased genotype-based estimates of r 2 . Moreover, simulation results also confirmed our theoretical finding that unphased genotype-based and haplotype-based r 2 on average are the same for a given dataset, irrespective of differences in inbreeding coefficients between the two loci ( Fig. 1).
As shown in Fig. 1, r 2 for a given dataset was unbiased for scenarios where allele frequencies differed between loci (i.e., inbreeding coefficients differed between the two loci) and between lines, and when the r 2 differed between the lines (0.2 and 0.4). We also tested the bias of LD estimates using unphased genotype and haplotype data for different sample sizes (Fig. 2). As shown in Fig. 2, for all scenarios, both estimators were unbiased for a sample size above 300. However, with sample size of 300 or less (100, 200, and 300), we found a small downward bias for both the unphased genotype-and haplotype-based estimates (the independent sample t-test showed the bias was significant for some of the scenarios for both the unphases genotype-and haplotype-based estimates). It is well known that the estimator of the correlation coefficient is known to be biased, and more so for smaller samples [11], which may explain the bias we found in small samples.

Bias
For all scenarios (180), the estimates of the r 2 using unphased genotype and haplotype data were both unbiased. We ran an independent sample t-test to test the bias of the estimates of r 2 using unphased genotype and haplotype data from the true r 2 . For all 180 scenarios, the bias of the estimates was not significantly different from zero for both methods (p value > 0.05). The average absolute bias across 180 scenarios was 0.0004 when using unphased genotype data, and 0.0003 when using haplotype data ( Table 5). The maximum absolute bias across the 180 scenarios was 0.003 when using unphased genotype data and 0.002 when using haplotype data. As expected, the bias decreased as sample size increased. For example, with unphased genotype data, the average absolute bias was 0.0005 for a sample size of 900 and 0.0001 for a sample size of 2700. Corresponding values for haplotype data were 0.0003 and 0.0001. These results show that the estimators of r 2 are consistent for both unphased genotype data and haplotype data, because the bias of the r 2 estimates decreased as sample size increased.

Precision
For all scenarios, estimates of LD based on haplotype data were more precise than estimates based on unphased genotype data, although the differences were small. For example, the mean standard deviation of the estimates of r 2 across all scenarios was 0.023 when using unphased genotype data and 0.021 when using haplotype data. The maximum standard deviation for estimates of r 2 across all scenarios was 0.057 using unphased genotype data and 0.055 using haplotype data. The precision of the estimates of r 2 increased as sample size increased, both with unphased genotype and with haplotype data. For example, the average standard deviation across all scenarios with a sample size of 900 was 0.031 with unphased genotype data and 0.027 with haplotype data.
The corresponding values for a sample size of 2700 were 0.018 and 0.016. This result was as expected because the standard error of the estimate of a correlation coefficient decreases as sample size increases [12]. Thus, with  a sufficient sample size, r 2 in crossbreds can be estimated accurately based on unphased genotype data. We further investigated in which scenarios the difference in precision for the estimates of r 2 using unphased genotype versus haplotype data was the largest. We investigated this by computing the ratio of the standard deviations of the estimates of r 2 using haplotype data and unphased genotype data. Thus, smaller values of this ratio indicate a greater superiority of estimates based on haplotypes. As shown in Fig. 3, the ratio of precision was less than 1 for all scenarios, indicating that the estimate based on haplotype data was more precise than that based on unphased genotype data. The ratio of the precision increased as the level of LD increased. For example, for an r 2 of 0.2, the ratio of precision ranged from 0.75 to 0.9, while with an r 2 of 0.8, the ratio ranged from 0.92 to 0.98. The difference between the estimates of r 2 based on unphased genotype vs. haplotype data originates solely from the double heterozygotes (00/11 for coupling phase, or 01/10 for repulsion phase). As r 2 increases, the frequencies of the coupling phase haplotypes 00 and 11 or of the repulsion phase haplotypes 01 and 10, increase, which reduces the opportunity for the haplotype method to provide extra information by distinguishing between them. As a result, at larger r 2 , the precision of the estimates of r 2 using unphased genotype and haplotype data are expected to be closer to each other. On the other hand, at low r 2 , all haplotypes (00, 01, 10, 11) are possible and the haplotype-based method provides additional information. For this reason, the estimate of r 2 based on haplotype data is more precise than the estimate based on unphased genotype data, in particular when the true r 2 is small.
The ratio of precision decreased when the minor allele frequencies for the two loci increased (Figs. 3 and 4). the minor allele frequencies at the two loci decrease, which reduces the extra information provided by the haplotype-based method. This is in agreement with [13]. There was also an interaction between the level of LD and the minor allele frequency, with the ratio of precision increasing when the level of LD increased but this increase was larger for higher values of the minor allele frequency (Fig. 4). The ratio of precision at allele frequencies of 0.05 and 0.05 was 0.91 when r 2 was 0. In real applications, the true r 2 is unknown and the r 2 computed using haplotype data would serve as the reference value. In that case, the comparison would be between the r 2 computed using unphased genotype data relative to the estimate based on haplotype data. In this case, the average absolute bias across the 180 scenarios using unphased genotype was very close to zero (0.00017) and the average standard deviation of estimates based on unphased genotype data across all scenarios relative to haplotype data was 0.0026. In addition, the haplotypebased method assumes that the haplotype can be determined without error for each individual, which means that in reality the absolute bias may be lower than the above value of 0.00017, depending on the error of haplotype estimation. Thus, estimates of r 2 computed using unphased genotype and haplotype data are indistinguishable in terms of both bias and precision in practice, particularly with sufficient sample size.
This paper extends the work of Rogers and Huff [8] and Weir [14], who showed that LD can be estimated from unphased genotype data when the allele frequency in line A and line B is the same, and when the inbreeding coefficient is identical for the two loci. Here, we showed that LD can also be estimated using unphased genotype data when the allele frequencies differ between lines A and B and the inbreeding coefficients differ between the two loci. This is particularly relevant for hybrids in plant breeding [15] and for crossbreds in animal breeding [16,17].

Conclusions
This work shows that the expectation of estimates of linkage disequilibrum (LD) between loci based on unphased genotypes and haplotypes in F1 crossbreds are identical. Estimates of LD, i.e. r 2 , are more precise and less biased when based on haplotype data compared to unphased genotype data. For both unphased genotype and haplotype data, the precision of r 2 increases and the bias of the estimates decreases as sample size increases. More importantly, the difference in precision and bias between estimates of r 2 using haplotype and unphased genotype data decreases as sample size increases. Thus, LD in a crossbred population can be estimated using unphased genotyped data with little bias and good precision, particularly with sufficient sample size.
This appendix shows the computation of the expectation for different linear combination of genotypic values M g and N g at loci M and N , respectively, as indicated in Table 3.

Computation of E M g N g
Simplifying this yields: