Estimating autozygosity from high-throughput information: effects of SNP density and genotyping errors

Background Runs of homozygosity are long, uninterrupted stretches of homozygous genotypes that enable reliable estimation of levels of inbreeding (i.e., autozygosity) based on high-throughput, chip-based single nucleotide polymorphism (SNP) genotypes. While the theoretical definition of runs of homozygosity is straightforward, their empirical identification depends on the type of SNP chip used to obtain the data and on a number of factors, including the number of heterozygous calls allowed to account for genotyping errors. We analyzed how SNP chip density and genotyping errors affect estimates of autozygosity based on runs of homozygosity in three cattle populations, using genotype data from an SNP chip with 777 972 SNPs and a 50 k chip. Results Data from the 50 k chip led to overestimation of the number of runs of homozygosity that are shorter than 4 Mb, since the analysis could not identify heterozygous SNPs that were present on the denser chip. Conversely, data from the denser chip led to underestimation of the number of runs of homozygosity that were longer than 8 Mb, unless the presence of a small number of heterozygous SNP genotypes was allowed within a run of homozygosity. Conclusions We have shown that SNP chip density and genotyping errors introduce patterns of bias in the estimation of autozygosity based on runs of homozygosity. SNP chips with 50 000 to 60 000 markers are frequently available for livestock species and their information leads to a conservative prediction of autozygosity from runs of homozygosity longer than 4 Mb. Not allowing heterozygous SNP genotypes to be present in a homozygosity run, as has been advocated for human populations, is not adequate for livestock populations because they have much higher levels of autozygosity and therefore longer runs of homozygosity. When allowing a small number of heterozygous calls, current software does not differentiate between situations where these calls are adjacent and therefore indicative of an actual break of the run versus those where they are scattered across the length of the homozygous segment. Simple graphical tests that are used in this paper are a current, yet tedious solution.


Conclusions:
We have shown that SNP chip density and genotyping errors introduce patterns of bias in the estimation of autozygosity based on runs of homozygosity. SNP chips with 50 000 to 60 000 markers are frequently available for livestock species and their information leads to a conservative prediction of autozygosity from runs of homozygosity longer than 4 Mb. Not allowing heterozygous SNP genotypes to be present in a homozygosity run, as has been advocated for human populations, is not adequate for livestock populations because they have much higher levels of autozygosity and therefore longer runs of homozygosity. When allowing a small number of heterozygous calls, current software does not differentiate between situations where these calls are adjacent and therefore indicative of an actual break of the run versus those where they are scattered across the length of the homozygous segment. Simple graphical tests that are used in this paper are a current, yet tedious solution.

Background
Runs of homozygosity (ROH) are continuous stretches of homozygous genotypes without heterozygosity in the diploid state. Although ROH can arise by different mechanisms [1], the primary cause is believed to be inbreeding [2]. Long ROH are most likely the result of recent inbreeding, where recombination events do not shorten identical haplotypes inherited from the common ancestor. Short ROH, in contrast, suggest more ancient inbreeding. The ability of ROH to reveal information about ancient and recent genetic events makes them useful tools to analyze population history [3], inbreeding levels [4] and effects of inbreeding on complex traits and congenital disorders [5].
While ROH from high-throughput genotyping analyses have been studied extensively in humans, such analyses are rare in cattle and other livestock species [6][7][8][9][10]. The lack of standards for ROH definition and identification may introduce bias in ROH-based estimates of autozygosity. Howrigan et al. [11] found that the numbers and sizes of ROH that are identified in genotyping data can strongly depend on certain parameters and thresholds imposed during sequence analysis. In addition, pruning single nucleotide polymorphisms (SNPs) that show low minor allele frequency (MAF), that deviate from Hardy-Weinberg equilibrium (HWE), or that show high linkage disequilibrium (LD), can affect the results [12,13].
The density of the SNP chip used to generate the data for ROH identification is another factor that strongly affects autozygosity estimates. Purfield et al. [6] compared estimates obtained using the two SNP chips most frequently used in cattle: the Illumina BovineSNP50 Genotyping BeadChip with 54 001 SNPs (50 k) and the Illumina BovineHD Genotyping BeadChip with 777 972 SNPs (HD). They concluded that the 50 k chip is appropriate only for identifying ROH longer than 5 Mb. Indeed, analyses based on lower-density chips can fail to detect heterozygous SNP genotypes that are present in observed ROH.
The frequency of SNP genotyping errors is another factor that can affect ROH-based estimates of autozygosity. Since this frequency usually varies between 0.2% and 1.0% [11,14], it may affect identification of very long ROH that contain numerous SNPs. In fact, any genotyping error, whether homozygote to heterozygote or vice versa, can affect the determination of ROH. A potential solution is to allow a certain number of SNPs to be heterozygous [1], but whether this compromises the reliability of ROH analyses has not been systematically analyzed.
The aim of this study was to analyze the identification of ROH of different length categories and the estimation of genomic inbreeding coefficients based on ROH in three cattle breeds (Brown Swiss, Pinzgauer, Tyrol Grey). Our study focused on the effects of chip density (777 972 versus 54 001 SNPs) and genotyping errors. Results demonstrate, both graphically and statistically, that density of SNP chips affects ROH detection and subsequent estimation of inbreeding levels. The optimal number of heterozygous SNPs allowed during ROH analysis was found to depend on chip density and ROH length.

Genotype data and quality control
The semen samples of the animals included in this study used for DNA extraction and genotyping, were obtained from AI centers through their routine practice in the framework of breeding programs. Therefore, no ethical approval was required for sampling of biological material. DNA samples were obtained from 277 bulls of three breeds: Brown Swiss, 46; Pinzgauer, 118; and Tyrol Grey, 113. Mean pedigree-based inbreeding coefficients (and ranges) were as follows: Brown Swiss, 0.033 (0.009-0.096); Pinzgauer, 0.019 (0-0.088); and Tyrol Grey, 0.022 (0-0.169). The mean complete generation equivalent (see e.g., [15] was highest for Brown Swiss (7.32 generations) and lowest for Pinzgauer (5.32 generations). DNA samples were genotyped using the BovineHD Bead Chip (Illumina Inc., San Diego, CA), which contains 777 972 SNPs; this data set is referred to hereafter as the highdensity (HD) panel. For comparison, we extracted and retained SNPs from this panel that were common to both the HD panel and the bovine SNP50 Beadchip v1 (Illumina Inc., San Diego, CA), which contains 54 001 SNPs and which will be referred to in the remainder as the 50 k panel.
Data extraction and quality control were performed separately for each breed. We excluded all SNPs that had not been assigned to a chromosome or that had been assigned to chromosomes X or Y or to the mitochondrial genome. We also excluded SNPs for which more than 10% of genotypes were missing and SNPs with an Illumina GenCall score ≤ 0.7 or an Illumina GenTrain score ≤ 0.4. Two Tyrol Grey bulls were excluded from further analysis because more than 5% of their genotypes were missing. In doing this, our objective was to exclude poorly performing loci and minimize risk of genotyping errors. After quality control, the numbers of SNPs in the HD and 50 k panels were as follows for each breed: Brown Swiss, 615 618 and 38 710; Pinzgauer, 606 120 and 38 198; and Tyrol Grey, 684 172 and 42 997.
Although it is customary in genome-wide association studies and ROH analyses to exclude SNPs with low MAF or high LD with neighboring SNPs or that deviate from HWE, we did not apply such exclusion criteria in our study. Instead we relied on Illumina quality scores (GenCall, GenTrain) to reduce genotyping problems. We also defined the minimum ROH length as 1 Mb to exclude short, common ROH arising from LD [3,6].

ROH calling options
ROH were identified in every individual using the SNP & Variation Suite (v7.6.8 Win64; Golden Helix, Bozeman, MT, USA www.goldenhelix.com). This algorithm is designed to find stretches of consecutive homozygous SNPs; it works continuously across an entire chromosome, examining every possible run that matches the user-specified parameters. We chose this software instead of the PLINK ROH algorithm [16], which uses a sliding window that may introduce artificial runs and fail to identify segments longer than the window.
ROH exceeding the allowed number of heterozygotes or missing SNPs were checked automatically to determine whether they should be removed based on their length, SNP density, and user-specified parameters. ROH were called if 15 or more consecutive homozygous SNPs [17] were present at a density of at least 1 SNP every 100 kb, with gaps of no more than 1000 kb between them. These density and gap thresholds were applied to SNPs in both the HD and 50 k panels to ensure comparability of the results.
Five categories of ROH length (in Mb) were defined: [1,2], (2,4], (4,8], (8,16], and >16. The number of heterozygous SNPs allowed was set to different values for different length categories. First, we called ROH without allowing any heterozygous calls, and we obtained the average numbers of SNPs in each length category ( Table 1). We then assumed a genotype error rate of 0.25%, recalculated the numbers of heterozygote calls allowed, and rounded the number of heterozygous SNPs allowed to the nearest whole number. This approach led to the following numbers of heterozygous SNPs allowed for each length category (in Mb) in the HD panel: [1,2], one heterozygeous SNP; (2,4], two heterozygous SNPs; (4,8], four heterozygous SNPs; (8,16], eight heterozygous SNPs; and >16, 16 heterozygous SNPs ( Table 2, class C). In the case of the 50 k panel, we allowed one heterozygous SNP for length category >16, and no heterozygous SNPs for the other categories ( Table 2, class A).
Like the number of heterozygous SNPs, we set the number of missing SNPs allowed to different values for different length categories. First, we determined ROH allowing any number of missing SNPs and then used the results to set limits. This approach led to the following limits for missing SNPs for each ROH length category (in Mb) in the HD and 50 k panels, respectively: [1,2], four or no missing SNPs; (2,4], eight or no missing SNPs; (4,8], 16 or one missing SNP; (8,16], 32 or two missing SNPs; and > 16, 64 or four missing SNPs.

Calculating inbreeding coefficients from runs of homozygosity (F ROH )
Statistically F ROH is defined as the length of the autosomal genome present in ROH, divided by the overall length of the autosomal genome covered by the SNPs [18]. For each bull, we calculated F ROH>1 Mb , F ROH>2 Mb , F ROH>4 Mb , F ROH>8 Mb and F ROH>16 Mb based on ROH of different minimum lengths (> 1, > 2, > 4, > 8 or > 16 Mb). F ROH was calculated for different minimum ROH lengths because lengths of autozygous segments in a genome are predicted to show an exponential distribution, with a mean length equal to 1/2 g Morgan, where g is the number of generations since the common ancestor (e.g. [11]). If the genome of an individual contains segments as short as 1 Mb, we can conclude that the individual's autozygosity originated from common ancestors up to 50 generations in the past. Based on the F ROH values across all ROH lengths, detected with both 50 k and HD panel, correlations with pedigree inbreeding coefficients were calculated in order to investigate their relationships.

Identifying significant differences in autozygosity estimates based on the number of heterozygous calls allowed
Mean values of F ROH were calculated within classes (scenarios) in which different numbers of heterozygous SNPs were allowed in each ROH length category. Eight classes (A to H) were defined, two (A and B) for the 50 k panel and six (C-H) for the HD panel ( Table 2). Numbers of heterozygous SNPs allowed within a class were based on the average numbers of SNPs in a length category and an assumed genotyping error rate of 0.25% for classes A and C. The other classes were formed by successively halving the allowed number of heterozygotes and only considering longer segments (see Table 2).
Mean F ROH values obtained when allowing different numbers of heterozygous SNPs were compared within the same length category using paired t-tests. In addition, F ROH values were compared between the 50 k and HD panels.
Dots indicate that the value of 0 (no heterozygous allowed) for the given length category was reached in a previous class of the same panel and information is not repeated.
The SAS 9.3 [19] procedure TTEST with the PAIRED statement was used to generate p values. The step-down Bonferroni method of Holm [20] using the MULTTEST procedure and the HOLM statement was used to adjust the p values of the 186 comparisons.

Results and discussion
Impact of SNP chip density on ROH identification Across all three cattle breeds, we identified 19 392 ROH segments using the 50 k panel and 14 148 ROH segments using the HD panel (Table 3). For all three breeds, analysis with the 50 k panel identified more ROH > 1 Mb than the HD panel. The two panels gave similar numbers of ROH > 4 Mb. As ROH length increased, the HD panel yielded a higher number of ROH than the 50 k panel (Figure 1). The 50 k panel revealed an abundance of small segments and overestimated the numbers of segments 1-4 Mb long, suggesting that it is not sensitive enough for the precise determination of small segments.
The 50 k panel did, however, prove suitable for detecting segments longer than 4 Mb. This finding is consistent with that of Purfield et al. [6], who concluded that the 50 k panel recognizes only segments longer than 5 Mb as well as the HD panel does.
The 50 k and HD panels gave noticeably different distributions and mean values of ROH length within each length category (Figure 2). Differences were greatest for the [1,2] length category, and then gradually disappeared as ROH length increased. These findings provide further evidence that data from the 50 k panel lead to imprecise determination of short ROH and overestimation of F ROH .

Impact of genotyping errors on autozygosity estimates
To our knowledge, a simulation study by Howrigan et al. [11] is the only source of recommendations on the number of heterozygous calls allowed in ROH. They suggested allowing no heterozygous calls. However, since genotyping errors in SNP chip data do occur, it seems more reasonable to allow some heterozygous calls, particularly for ROH > 8 Mb on dense SNP chips. These long segments are much more frequent in cattle populations than in human populations, even for population isolates (e.g. [21]). We determined the numbers of SNPs in ROH of specific lengths and assumed a 0.25% rate of genotyping errors in order to define the number of heterozygous genotypes allowed separately for each ROH length category. Then, we determined mean F ROH values for the classes defined in Table 2 for different allowed numbers of heterozygous calls. Paired t-tests were conducted within the eight classes (A-H) within the same length category and within each cattle breed (  (Table 4). In fact, inbreeding coefficients derived from ROH > 16 Mb with no allowance for heterozygous calls were lower than inbreeding coefficients estimated from pedigrees. These findings suggest that for such long    Definition of Class is according to the number of heterozygous SNPs allowed within ROH length categories (see Table 2). F ROH values were obtained by allowing different numbers of heterozygous SNPs in each ROH length category; different letters indicate statistical significance within the same column and breed (P < 0.05, paired t-test); P values were corrected for multiple tests using the step-down Bonferroni method of Holm [18]. ROH, which can have more than 5000 to 6000 SNPs, some heterozygous calls must be allowed due to the possibility of genotyping errors. At the same time, the number of allowable heterozygous calls should be limited. On the one hand, SNP data from chromosome 20 in the 46 Brown Swiss cattle (Figure 3) shows clearly that single, potentially miscalled heterozygous SNPs would interrupt ROH segments if such SNPs were not allowed. On the other hand, the figure also shows that allowing certain minimum numbers of heterozygous SNPs leads to inaccurate ROH calling at the ends of ROH. Such inaccurate calling is also likely to be a problem in individual ROH, since we sometimes observed multiple heterozygous SNPs close together within a ROH, not only when using the SNP & Variation software suite but also when using other programs (PLINK; [16]; cgaTOH; [22]; data not shown). In any event, ROH identification software should be improved so that instances of multiple heterozygous SNPs very close to one another should automatically lead the program to define separate ROH. Until such an improvement is made, we recommend careful visual analysis of ROH segment structure in order to exclude spurious ROH.

Inbreeding coefficients estimated from ROH and ROH distribution
The HD panel gave the following mean F ROH values across all ROH lengths: Brown Swiss, 0.151; Pinzgauer, 0.062; and Tyrol Grey, 0.066. Short ROH, i.e. 1 to 2 Mb long, covered an average of 36.7 Mb of the 2.3 Gb of the autosomal cattle genome covered with SNPs ( Figure 2), with the highest short-ROH coverage observed in Brown Swiss and the lowest in Pinzgauer, the total genome length covered by all ROH > 1 Mb was 24.5% for one Brown Swiss bull and 23.0% for one Tyrol Grey bull. ROH > 16 Mb covered an average of 66.1 Mb of genome, although this number varied widely from animal to animal and between breeds. The highest long ROH coverage was observed in Brown Swiss and the lowest in Pinzgauer cattle. Some animals lacked such long ROH, whereas others showed a few that covered more than 300 Mb. The greatest genome coverage by long ROH was observed in a Tyrol Grey bull, in which 12 long ROH segments covered 368. 6 Mb, corresponding to an average segment length of ≈30 Mb. The length of an autozygous segment indicates its age; since haplotypes are broken up by meiotic recombination, a short autozygous region is likely to have an ancient origin, while a long one probably arose recently [2,4]. These findings suggest that the Brown Swiss breed experienced both recent and ancient inbreeding events to a higher degree than the two other breeds.
Correlations of F ROH values across all ROH lengths with pedigree inbreeding coefficients were similar to those previously reported by Ferenčaković et al. [8]. Correlations for the 50 k panel were 0.62, 0.65 and 0.77 for Brown Swiss, Pinzgauer and Tyrol Grey, respectively, and corresponding values were 0.61, 0.62 and 0.75 for the HD panel. Differences in correlations between panels within breeds were not statistically significant. Variation of these values is most likely due to the fact that pedigree-based inbreeding coefficients do not account for variation in meiosis, inheritance of segments of chromosomes and LD.
The genomic distribution of ROH based on the HD panel data shows that 99.98% of SNPs occurred within an ROH of at least one individual. However, the frequency with which different SNPs occurred within ROH was not uniform across the genome, revealing genomic regions with abundant ROH, called ROH hotspots, which are also often detected in human populations [23,24]. Several ROH hotspots were common to all three breeds. For example, two hotspots were identified on chromosome 6 in all three breeds: one at 5.3-6.3 Mb and another at 38.4-39.5 Mb (Figure 4). Why these hotspots occur, and how they compare among cattle breeds and with other animal species, are questions currently under investigation.

Conclusions
ROH identification in cattle is usually performed with the Illumina BovineSNP50 Genotyping BeadChip (50 k panel) or the Illumina BovineHD Genotyping BeadChip (HD panel). Here, we report that data from the 50 k panel do not represent the true state of autozygosity well for short ROH segments, while it is as reliable as the HD panel data for ROH > 4 Mb. When shorter segments are included with the 50 k panel, F ROH is systematically overestimated. The bias due to potential genotyping errors depends on the allowance of heterozygous genotypes in a ROH calling software. While not allowing for heterozygous calls often just splits a very long ROH in two shorter ones that are still recognized and therefore the level of autozygosity of an individual is virtually unaffected, there are many cases where the shorter part of the split does not reach the minimum size of a ROH and the level of autozygosity of an individual is underestimated. Allowing many heterozygous calls in an ROH adds many short segments that are most likely not autozygous to the terminal regions of ROH. Our aim was to provide guidelines to identify ROH from high-throughput SNP genotype data. First, quality control should be performed by removing SNPs based on strict limits on genotype quality scores provided to reduce genotyping errors. Second, the number of heterozygous SNPs allowed should be determined separately for each ROH length of interest and for each SNP density, as suggested here. Third, if multiple heterozygous SNPs are allowed within the same ROH, adjacent heterozygous SNPs should be treated differently from heterozygous SNPs that are further apart. Because no current ROH identification software takes care of adjacent heterozygous SNPs, careful visual inspection of ROH segments should be applied to exclude spurious ROH called by the software.