- Research Article
- Open Access
Assigning breed origin to alleles in crossbred animals
© The Author(s) 2016
- Received: 21 December 2015
- Accepted: 10 August 2016
- Published: 22 August 2016
For some species, animal production systems are based on the use of crossbreeding to take advantage of the increased performance of crossbred compared to purebred animals. Effects of single nucleotide polymorphisms (SNPs) may differ between purebred and crossbred animals for several reasons: (1) differences in linkage disequilibrium between SNP alleles and a quantitative trait locus; (2) differences in genetic backgrounds (e.g., dominance and epistatic interactions); and (3) differences in environmental conditions, which result in genotype-by-environment interactions. Thus, SNP effects may be breed-specific, which has led to the development of genomic evaluations for crossbred performance that take such effects into account. However, to estimate breed-specific effects, it is necessary to know breed origin of alleles in crossbred animals. Therefore, our aim was to develop an approach for assigning breed origin to alleles of crossbred animals (termed BOA) without information on pedigree and to study its accuracy by considering various factors, including distance between breeds.
The BOA approach consists of: (1) phasing genotypes of purebred and crossbred animals; (2) assigning breed origin to phased haplotypes; and (3) assigning breed origin to alleles of crossbred animals based on a library of assigned haplotypes, the breed composition of crossbred animals, and their SNP genotypes. The accuracy of allele assignments was determined for simulated datasets that include crosses between closely-related, distantly-related and unrelated breeds. Across these scenarios, the percentage of alleles of a crossbred animal that were correctly assigned to their breed origin was greater than 90 %, and increased with increasing distance between breeds, while the percentage of incorrectly assigned alleles was always less than 2 %. For the remaining alleles, i.e. 0 to 10 % of all alleles of a crossbred animal, breed origin could not be assigned.
The BOA approach accurately assigns breed origin to alleles of crossbred animals, even if their pedigree is not recorded.
- Minor Allele Frequency
- Tail Length
- Additional Rule
- Heterozygous Genotype
- Incorrect Assignment
Several production systems, including those for pigs and chickens, are based on crossbreeding (e.g., [1–3]) to take advantage of the increased performance of crossbred compared to purebred animals. One limitation of these breeding programs is that selection is performed on purebred animals, although the aim is to improve crossbred performance. Besides the genetic differences between purebred and crossbred animals, purebred animals are mainly housed in nucleus farms with high-health conditions, while crossbred animals are housed under field conditions.
With the advent of genomic selection, several authors have proposed genomic evaluation methods that use phenotypic records on crossbred animals to increase response to selection for crossbred performance (e.g., [2, 4, 5]). These approaches compute estimated breeding values for crossbred performance using many single nucleotide polymorphisms (SNPs). Several factors have an impact on the effect that can be measured for a SNP. First, the effect of the same allele, but of different breed origin, in a crossbred animal may differ because of different levels of linkage disequilibrium (LD) between the SNP and a quantitative trait locus (QTL) in the purebred populations. Second, different genetic backgrounds, e.g., dominance or epistatic interactions, can explain why the same allele has different effects in purebred and crossbred animals. Third, the environmental conditions under which purebred and crossbred animals are raised may vary, which can result in genotype-by-environment interactions. Thus, SNP effects may be breed-specific, which has led to the implementation of genomic selection of purebred animals for crossbred performance that takes breed-specific effects of SNP alleles into account [3, 5]. However, these methods assume that breed origin of alleles in crossbred animals is known. Results from simulations showed that models that consider breed-specific effects can outperform the current genomic models that assume that the SNP effect is the same across breeds, at least under some conditions [2, 5]. Although breed-specific effect models appear promising based on these simulation studies, the question whether they will outperform other models remains open. To apply a model that considers breed-specific effects on real field data, accurate estimates of local ancestry for the SNP alleles of crossbred animals are needed. In this context, local ancestry refers to the breed origin of each SNP allele for each locus for each crossbred animal.
Several approaches (e.g., [6–9]) have been proposed to estimate local ancestry in admixed populations. These approaches can be an essential step in the mapping of disease genes, in the control of population structure for genome-wide association studies (GWAS), or even in the study of population genetic processes that involve admixed populations [12–14]. Some of these approaches specifically focus on local ancestry inference in admixed populations that originate from two or more populations a few generations back. However, these approaches may be less applicable in our context for several reasons. One reason is that they do not consider that each crossbred animal originates from a well-defined crossbreeding scheme, in which the purebred populations, i.e. the ancestral populations, are at most the second ancestral generation. Also, these methods implicitly assume genetically-diverged populations [7, 8], which is generally not the case for purebred pig or chicken populations, which may include different lines of the same breed, or a cross of several breeds (i.e., a synthetic breed). Therefore, the aim of our study was to develop an approach for assigning breed origin to alleles (termed BOA) of animals that come from specific crossbreeding schemes. Furthermore, we determined the accuracy of allele assignments by using simulated datasets that involved crosses between closely-related, distantly-related or unrelated breeds. The BOA approach requires several phasing analyses of the genotypes of purebred and crossbred animals. The effects of different phasing parameters and several nuisance factors in the data, such as the presence of a haplotype in another pure breed that would preclude the assignment to the first pure breed, were also tested. In addition, the developed method was applied to real pig genotype data to investigate whether the results were consistent with those obtained from simulated data.
The data used in this study were collected as part of routine data recording in a commercial breeding program. Samples collected for DNA extraction were only used for routine diagnostic purposes of the breeding program. Data recording and sample collection were conducted strictly in line with the Dutch law on the protection of animals (Gezondheids- en welzijnswet voor dieren).
To test the accuracy of an approach aimed at assigning breed origin to alleles, the true origin of each allele of crossbred animals must be known. This was achieved by simulating historic and breed populations using the QMSim software, and then simulating a three-way crossbreeding program with five generations of random selection using a custom Fortran program. For the historic population, 1000 discrete random mating generations with a constant size of 1000 individuals were simulated, followed by 50 generations in which the effective population size was reduced to 100 individuals. The next eight generations were simulated to expand the population size to 810. For the first 1050 simulated generations, half of the simulated animals were males and the other half were females. In the next eight generations, 60 males and 750 females were simulated. Matings for all generations were based on the random union of gametes, which were randomly sampled from the pools of male and female gametes. To simulate the three breed populations (hereafter referred to as breeds A, B, and C), three random samples were drawn from the last generation of the historic population (i.e., generation 1058), each including 20 males and 250 females. Subsequently, within each breed, 5, 20, or 50 generations of random mating were simulated before starting the three-way crossbreeding scheme, which will be referred to as scenarios with closely-related breeds, distantly-related breeds, and unrelated breeds, respectively. For the simulated 5, 20, and 50 generations of pseudo-random mating, one litter with two individuals per female (i.e. one male and one female) was assumed.
In the second step, a three-way crossbreeding program with five generations of random selection was simulated. Purebred (i.e., A, B, and C) animals that were used to start the crossbreeding program were from generations 1063, 1078, and 1108 for the closely-related, distantly-related and unrelated breeds, respectively. During the crossbreeding program, and for each breed, A, B, and C, purebred animals were randomly selected and mated to simulate the next generation by maintaining a constant size of 20 males and 250 females. From each of the five generations, B and C purebred animals were randomly crossed to produce five generations of 10 BC crossbred males and 100 BC crossbred females. These BC crossbred animals were then randomly mated to males from breed A to produce five generations of A(BC) crossbred animals. For each generation, 110 A(BC) animals were simulated. Purebred animals that were used as parents of crossbred animals could also be parents of purebred animals in the next generation.
For the three scenarios, the genome consisted of two chromosomes, i.e. a 3.20 Morgan long chromosome (chr1) with 6700 SNPs and a 0.61 Morgan long chromosome (chr2) with 1353 SNPs. These two chromosomes were designed to resemble Sus scrofa chromosomes (SSC) 1 and SSC18, respectively, with a SNP density that was comparable to that of a 60 k SNP chip. The SNP positions were randomized across the genome and a recurrent mutation rate of 2.5 × 10⁻⁵ was assumed. All SNPs that segregated in the last historical generation (i.e., generation 1058) and with a minor allele frequency (MAF) higher than or equal to 0.10 were selected and used to simulate the genotypes of the purebred and crossbred animals, as well as for all subsequent analyses. Breed origin of each allele was recorded for each crossbred animal.
To compose the datasets of genotypes, 75 % of purebred (A, B, and C) and crossbred [BC, and A(BC)] males and females that were produced during the three-way crossbreeding program were randomly selected. Random selection of purebred and crossbred animals led to datasets of genotypes that did not include all parents of the crossbred animals and for which not all purebred animals had crossbred offspring. It was assumed that pedigree information was not available for any animal.
Table 1 Descriptive statistics for three simulated scenarios (10 replicates; SD within brackets) and for the real data. Columns: number of animals, number of SNPs, and FST.
For the three scenarios, i.e. closely-related breeds, distantly-related breeds, and unrelated breeds, the level of genetic differentiation between the three breeds was measured using the global Wright’s FST statistic, as implemented in the software Genepop (4.2) [18, 19]. Genotypes for all selected SNPs and for all purebred animals, from all five purebred generations simulated for the three-way crossbreeding program, were used to estimate FST. The same statistics were computed for the real dataset by considering all selected SNPs on SSC2 and 18 for all available purebred animals.
Assignment of allele origin
The BOA approach that we developed to assign breed origin to alleles of crossbred animals, consisted of three steps: (1) phasing the genotypes of both purebred and crossbred animals, (2) assigning breed origin to the phased haplotypes, and (3) assigning breed origin to alleles of crossbred animals based on the library of assigned haplotypes, the breed composition of the crossbred animals and the zygosity (i.e., homozygosity or heterozygosity) of their genotypes.
AlphaPhase1.1 (version 1) software was chosen for phasing available genotypes. AlphaPhase1.1 implements a long-range phasing (LRP) and haplotype library imputation algorithm (LRPHLI) and resolves phase without depending on family structure or pedigree information. The LRPHLI uses long haplotypes and the principle of surrogate parents, which are individuals that share a haplotype with the individual being phased. They are identified by having no opposing homozygote genotypes with this individual within a string of consecutive SNPs that includes a core and adjacent tails (hereafter called “core and tail length”, in terms of numbers of SNPs). A core is a string of consecutive SNPs for which phasing is being determined, and the adjacent tails are strings of consecutive SNPs that are adjacent to either end of a core.
A total of n phasing analyses with different core and tail lengths were performed, such that each SNP was phased many times as a part of cores that span different SNP windows. Using different lengths of consecutive SNPs addresses the fact that the expected size of shared haplotypes is larger for more closely-related individuals than for less related individuals. When analysing the simulated data, nine different core and tail lengths were considered. Applied combinations of core and tail lengths (core length, tail length) were (150, 200), (200, 200), (250, 100), (250, 200), (300, 100), (300, 200), (350, 50), (350, 100), and (350, 200). All phasing analyses were performed twice considering either offset or non-offset analyses, which resulted in 18 phasing analyses per simulation replicate. Offset analyses were designed to create 50 % overlap between cores of the offset and non-offset analyses, by moving the beginning of each core to halfway along the first core of the non-offset analyses. Because offset and non-offset analyses were always performed together for a specific combination of core and tail lengths, the term “phasing analysis” will hereafter refer to both analyses. Different core lengths combined with offset and non-offset analyses help to remove phasing errors that AlphaPhase1.1 may introduce. For all phasing analyses, 1 % genotype errors and 1 % disagreement between genotypes and haplotypes were allowed. Both the number of surrogate parents across which information pertaining to a phase must be accumulated before this phase can be declared, and the maximum percentage of surrogate disagreements that still allow phase declaration were set to 10. The same settings were used for the real data.
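The layout of cores in the offset and non-offset analyses can be illustrated with a minimal Python sketch. It assumes that cores tile the chromosome consecutively and that an offset analysis starts its first full core halfway along the first non-offset core; the function name `core_windows` and the handling of the leading partial core are illustrative assumptions, not a description of how AlphaPhase1.1 is implemented.

```python
# Core and tail lengths (in SNPs) used for the simulated data.
CORE_TAIL_LENGTHS = [
    (150, 200), (200, 200), (250, 100), (250, 200), (300, 100),
    (300, 200), (350, 50), (350, 100), (350, 200),
]


def core_windows(n_snps, core_len, offset=False):
    """Return (start, end) SNP index pairs (end exclusive) for consecutive cores.

    Assumption: an offset analysis shifts every core by half a core length,
    which gives 50 % overlap with the cores of the non-offset analysis.
    """
    start = core_len // 2 if offset else 0
    windows = []
    if offset and start > 0:
        # Hypothetical choice: SNPs before the first shifted core form a short core.
        windows.append((0, min(start, n_snps)))
    while start < n_snps:
        windows.append((start, min(start + core_len, n_snps)))
        start += core_len
    return windows


# Each combination is run once without and once with offset.
analyses = [(core, tail, off) for core, tail in CORE_TAIL_LENGTHS for off in (False, True)]
```

With nine combinations of core and tail lengths, `analyses` holds the 18 phasing analyses per replicate mentioned above.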
Assigning breed origin to haplotypes
A specific haplotype detected in a crossbred animal is fully informative if this haplotype occurs in only one of the purebred populations. Therefore, after each phasing analysis, the next step involved listing all haplotypes that were phased in the purebred populations. Subsequently, haplotypes that were phased within only one purebred population were identified and their origin was assigned to this breed, and added to the library associated with this breed origin. Thus, a library of haplotypes assigned to a specific breed origin included all assigned haplotypes that were derived from all the phasing analyses (i.e., across the different core and tail lengths, as well as across the offset and non-offset analyses).
Allocation of a haplotype to a unique purebred population may not always be possible, especially for closely-related populations, which can share large haplotypes. To allow assignment of breed origin for most haplotypes, a ‘relaxation factor’ (fr) was applied. Using this fr, a haplotype was assigned to a purebred population, if less than fr % of all copies of that haplotype were observed in the other purebred populations. Haplotypes that did not fulfil this condition remained unassigned. Relaxation factors fr of 0, 10 and 20 % were considered in this study.
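The relaxation rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: haplotypes of one core from one phasing analysis are given as (breed, haplotype string) pairs, the candidate breed is the one that carries most copies, and the function name `assign_haplotype_origins` is hypothetical. With fr = 0 %, only haplotypes that are exclusive to one breed are assigned.

```python
from collections import Counter, defaultdict


def assign_haplotype_origins(purebred_haplotypes, fr=0.0):
    """Build a library {haplotype: breed} for one core of one phasing analysis.

    purebred_haplotypes: iterable of (breed, haplotype) pairs, one per phased copy.
    fr: relaxation factor in percent (0, 10 and 20 were considered in the study).
    """
    copies = defaultdict(Counter)
    for breed, haplotype in purebred_haplotypes:
        copies[haplotype][breed] += 1

    library = {}
    for haplotype, by_breed in copies.items():
        total = sum(by_breed.values())
        breed, n_top = by_breed.most_common(1)[0]
        pct_other = 100.0 * (total - n_top) / total
        # Exclusive haplotypes are always assigned; otherwise strictly less than
        # fr % of all copies may occur in the other purebred populations.
        if pct_other == 0 or pct_other < fr:
            library[haplotype] = breed
    return library


# Toy data: "0101" has 9 copies in breed A and 1 in breed B (10 % elsewhere).
toy = [("A", "0101")] * 9 + [("B", "0101")] + [("B", "1100")] * 3 \
    + [("A", "0011"), ("C", "0011")]
strict = assign_haplotype_origins(toy, fr=0)
relaxed = assign_haplotype_origins(toy, fr=20)
```

In this toy example, the shared haplotype "0101" stays unassigned with fr = 0 % but is assigned to breed A with fr = 20 %, whereas "0011", which is split evenly between two breeds, remains unassigned in both cases.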
Assigning breed origin to alleles of crossbred animals
Assignment of breed origin to each SNP allele of a crossbred animal was based on (1) the library of assigned haplotypes, (2) breed composition of the crossbred animal [e.g., BC or A(BC)], and (3) zygosity of the SNP genotype of the crossbred animal (i.e., homozygosity or heterozygosity) to which the considered allele contributes. Breed composition and zygosity of the SNP genotypes were assumed to be correct. A pseudo-code for assigning breed origin to alleles is described in the “Appendix”.
Haplotypes in crossbred animals were traced back in the library of assigned haplotypes, which provided breed origin for each allele of the haplotypes in crossbred animals. Because n offset and non-offset phasing analyses were performed (e.g., n = 9 and 2n = 18 for this study), each allele could receive a maximum of 2n possible breed origins. A smaller number of breed origin assignments was also possible for a specific allele if the haplotype of this allele was not phased or not assigned a breed origin in some analyses. If breed origin assignments were not the same for an allele across the different phasing analyses, but were in agreement with the breed composition and zygosity of the SNP genotype of the crossbred animal, the most frequent breed assignment was considered as the breed origin.
As mentioned previously, the BOA approach takes breed composition of a crossbred animal into account to assign breed origin. This knowledge helps to assign breed origin to an allele that is present in several haplotypes that are assigned to different breed origins. For the two-way and three-way crossbred animals, one of the two alleles of each SNP must originate from the paternal breed. Assigning the paternal allele first reduces the possibilities of assignment for the maternal allele. For example, for a homozygous SNP for an A(BC) animal, its breed composition (i.e., its paternal breed is A) helps to assign one allele to breed A, even if different haplotypes that contain this allele were assigned different breed origins.
Zygosity of the SNP genotype was also taken into account to avoid disagreement between genotypes based on the input data and based on the two phased haplotypes. Such issues can arise because AlphaPhase1.1 allows disagreements between genotypes and haplotypes. Thus, an allele that is present in a haplotype may differ from the allele observed in the genotype, which results in a heterozygous genotype based on the input data and a homozygous genotype based on the two phased haplotypes (or vice versa). The BOA approach considers as correct the zygosity of the SNP genotype based on the input data.
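The voting step for a single allele can be sketched as follows. This is a simplified Python illustration and not the pseudo-code of the “Appendix”: it assumes that the candidate origins from the up to 2n phasing analyses are already collected in a list, that votes incompatible with the breed composition are discarded, and that a tie leaves the allele unassigned. The function name `assign_allele_origin` is hypothetical, and the zygosity check is left out.

```python
from collections import Counter


def assign_allele_origin(votes, allowed_breeds):
    """Return the most frequent compatible breed origin, or None if unknown.

    votes: breed labels (or None when the haplotype was not phased or not
           assigned) collected across the offset and non-offset analyses.
    allowed_breeds: breeds this allele can originate from given the breed
           composition, e.g. {"A"} for the paternal allele of an A(BC) animal.
    """
    tally = Counter(v for v in votes if v is not None and v in allowed_breeds)
    if not tally:
        return None
    ranked = tally.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: no single most frequent origin, so unassigned
    return ranked[0][0]


# Maternal allele of an A(BC) animal: only breeds B and C are compatible.
origin = assign_allele_origin(["B", "B", "C", None, "A"], {"B", "C"})
```

Restricting `allowed_breeds` is one way to encode the use of breed composition described above: once the paternal allele is tied to breed A, the maternal allele can only come from B or C.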
Accuracy of allele origin assignment and effects of different settings using simulated data
For each breeding scenario (i.e., closely-related, distantly-related, or unrelated breeds) combined with each value of fr, i.e. 0, 10 or 20 %, accuracy of assignment of allele origin was computed for chromosomes chr1 and chr2 separately on a per animal basis. Breed origin assignment was assessed for each BC and A(BC) crossbred animal. The minimum, average, and maximum percentage of alleles of an animal that were assigned a correct or incorrect breed origin (%correct or %incorrect) or that were unassigned (%unknown) were computed. All scenarios were replicated 10 times and %correct, %incorrect, and %unknown were averaged across animals and replicates.
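For one animal and one chromosome, the three percentages can be computed as in the following Python sketch. It assumes that the true and assigned origins are stored as equally long lists with `None` for unassigned alleles; the function name `assignment_rates` and this data layout are illustrative assumptions.

```python
def assignment_rates(true_origin, assigned_origin):
    """Return %correct, %incorrect and %unknown for the alleles of one animal."""
    n = len(true_origin)
    correct = sum(a is not None and a == t
                  for t, a in zip(true_origin, assigned_origin))
    unknown = sum(a is None for a in assigned_origin)
    incorrect = n - correct - unknown
    return {
        "correct": 100.0 * correct / n,
        "incorrect": 100.0 * incorrect / n,
        "unknown": 100.0 * unknown / n,
    }


# Toy example with ten alleles of a BC animal.
truth = list("BBCCBCBCBB")
assigned = ["B", "B", "C", None, "B", "B", "B", "C", None, "B"]
rates = assignment_rates(truth, assigned)
```

The minimum, average and maximum across animals, and the averages across replicates, then follow from applying this function to every BC and A(BC) animal.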
Effects of the number of core and tail lengths and of the number of offset and non-offset phasing analyses considered for assignment of allele origin were studied through forward selection, with the aim to identify useful sets of settings to be used for phasing. Starting with no phasing analysis, addition of each offset and non-offset phasing analysis was tested using the average %correct for A(BC) animals as criterion. Then, the offset and non-offset phasing analysis that improved average %correct most was added. This process was repeated until all offset and non-offset phasing analyses were added. The forward selection was performed for all scenarios and values of fr. The order, in which the different phasing analyses were added, was studied to evaluate which (combination of) settings yielded the highest average % of correctly assigned alleles.
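The forward selection can be sketched in Python as below. The `score` callable stands for the average %correct of A(BC) animals obtained with a given set of phasing analyses; it is a hypothetical placeholder, and the toy score used here is made up purely to show the mechanics.

```python
def forward_selection(candidates, score):
    """Greedily add the phasing analysis that improves the score most.

    candidates: identifiers of phasing analyses (offset and non-offset together).
    score: callable taking a list of selected analyses and returning a number,
           e.g. the average %correct for A(BC) animals (placeholder here).
    Returns a list of (analysis, score after adding it) in order of addition.
    """
    selected, history = [], []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda c: score(selected + [c]))
        selected.append(best)
        remaining.remove(best)
        history.append((best, score(selected)))
    return history


# Toy score with diminishing returns; the values are invented for illustration.
hit = {"(150, 200)": 0.5, "(250, 100)": 0.8, "(350, 50)": 0.2}


def toy_score(selected):
    missed = 1.0
    for analysis in selected:
        missed *= 1.0 - hit[analysis]
    return 100.0 * (1.0 - missed)


order = [analysis for analysis, _ in forward_selection(hit, toy_score)]
```

The resulting order of addition is what was compared across scenarios and values of fr.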
Assignment of allele origin using real data
Assignment of allele origin was performed for all EF and D(EF) crossbred animals by considering the nine offset and non-offset phasing analyses (i.e., a total of 18 phasing analyses) for SSC2 and 18. For each relaxation factor (i.e., 0, 10, and 20 %), the average, minimum and maximum percentages of assigned alleles (%assigned) for each EF and D(EF) animal were computed for each chromosome separately. The percentage of animals with at least 80 % assigned alleles was also computed, as an arbitrary measure to evaluate the number of genotypes that would be useful for subsequent analysis, as well as the average %assigned for each of the breed origins that contributed to the EF or D(EF) animals.
Characteristics of simulated data
For each replicate of the simulated data, about 1000 purebred animals and 420 crossbred animals were randomly selected from the three-way crossbreeding program to assign breed origin to alleles (Table 1). The two simulated chromosomes had on average 15 SNPs per cM across all replicates and scenarios, i.e., 4811 SNPs for chr1 and 926 SNPs for chr2 (Table 1). MAF of the SNPs in the purebred animals for chr1 and chr2, averaged across SNPs and replicates, ranged from 0.27 to 0.30 for closely-related breeds, from 0.21 to 0.28 for distantly-related breeds, and from 0.15 to 0.25 for unrelated breeds. Divergence between the simulated breeds was quantified with the global Wright’s FST, i.e., the average inbreeding rate of the sub-population relative to the whole population. The estimated FST was equal to 0.04 (±0.00) for the closely-related breeds, 0.13 (±0.01) for the distantly-related breeds, and 0.28 (±0.02) for the unrelated breeds (Table 1).
Percentage of assigned alleles
Table 2 Percentages of alleles correctly (%correct) or incorrectly (%incorrect) assigned a breed origin or unassigned (%unknown) for a crossbred animal, and percentages of crossbred animals having at least 80 % assigned alleles using simulated data. Results are reported per relaxation factor (fr) for chromosome 1 of BC animals, chromosome 2 of BC animals, chromosome 1 of A(BC) animals, and chromosome 2 of A(BC) animals.
While most of the animals had only a few unassigned alleles, %unknown reached high values for some animals, especially for chr2. For example, the maximum %unknown for chr2 of a BC animal from closely-related breeds was equal to 67.0 % (±15.5 %) (Table 2; Additional file 1: Tables S1, S2, S3, S4). Therefore, for some BC or A(BC) animals, breed origin was not assigned to many of their alleles.
Accuracy of allele assignment
Across all analyses, the %incorrect averaged across animals and across replicates was at most equal to 1.99 % (±0.17 %) for chr2 of A(BC) animals from closely-related breeds with an fr of 20 %. The %incorrect decreased slightly as the distance between breeds increased, or fr decreased (Table 2; Additional file 1: Tables S1, S2, S3, S4). Characteristics of the chromosome, such as length and number of SNPs, also influenced %incorrect. For all scenarios and values of fr, %incorrect was always higher for chr2 than for chr1 but it was not possible to determine if this was due to the length of the chromosome or the number of SNPs. In addition to the average %incorrect, knowing the maximum %incorrect for an animal may be important. For both BC and A(BC) animals, the highest %incorrect was obtained for chr2 for the closely-related breed scenario (Table 2; Additional file 1: Tables S1, S2, S3, S4). The highest maximum %incorrect (averaged across all replicates) reached 10.1 % for a BC animal (fr = 10 %) and 27.6 % for an A(BC) animal (fr = 20 %).
Regardless of the scenario, the average %incorrect was similar and low (i.e., always less than 2.0 % for all scenarios, fr values, both chromosomes, and all animals). Since the average %incorrect remained relatively constant, the effect of the different factors on the average %correct was the inverse of that on the average %unknown. The average %correct was affected by the characteristics of the chromosome and increased as the distance between breeds or fr increased. For all scenarios, fr values and both chromosomes, the average %correct ranged from 90.7 to 98.6 % for BC animals and from 87.4 to 97.0 % for A(BC) animals. Some BC and A(BC) animals had (close to) 100 % of alleles with correctly assigned breed origins (Table 2; Additional file 1: Tables S1, S2, S3, S4). Comparing the results for %incorrect and %unknown showed that the BOA approach was more likely to consider the origin of an allele as unknown than to assign an incorrect breed origin.
Impact of distance between breeds
A greater distance between breeds had a favourable effect on the percentage and accuracy of breed origin assignment, although this relationship appeared to reach a plateau at distances greater than 20 generations. Results for distantly-related and unrelated breeds were similar regardless of the chromosome, fr, or type of crossbred animals. Increasing the distance between breeds from 20 (FST = 0.13; Table 1) to 50 generations (FST = 0.28; Table 1) had less impact on allele assignment than increasing it from 5 (FST = 0.05; Table 1) to 20 generations (i.e., between closely- and distantly-related breeds).
Impact of the relaxation factor
The relaxation factor fr was introduced because many haplotypes can be present in more than one purebred population, especially for closely-related populations that can share long haplotypes. Indeed, the impact of fr was greater for closely-related breeds than for distantly-related breeds. For example, the largest increase of the average %correct due to increasing fr from 0 to 20 % was equal to 4.27 %, for chr2 of A(BC) animals for closely-related breeds (Table 2). Increasing fr from 0 to 20 % also increased the percentages of BC and A(BC) animals having at least 80 % of alleles assigned. The largest increase was observed for chr2 for both BC animals (i.e., an increase of 8.3 %) and A(BC) animals (i.e., an increase of 11.6 %) from closely-related breeds (Table 2; Additional file 1: Tables S1, S2, S3, S4). Increasing fr did not or only slightly affect the average %incorrect; the largest increase, 0.16 %, was observed for chr2 of A(BC) animals from closely-related breeds (Table 2). Given that the average %incorrect remained almost constant, the effect of increasing fr mainly resulted in a greater percentage of correctly assigned alleles that previously fell in the unknown origin category.
Impact of core and tail lengths
Spearman rank correlations between the order of the phasing analyses obtained from forward selection and a predefined order for simulated data
Characteristics of the real data
Genotypes for SSC2 and 18 of about 950 D purebred animals and of at least 1800 E and F purebred animals were available. Genotypes for 324 EF animals and for 241 D(EF) animals were also available (Table 1). SSC2 and 18 included 2496 and 1129 SNPs, respectively. The estimated global FST was equal to 0.15 (Table 1).
Percentage of assigned alleles
Percentages of assigned alleles on SSC2 and SSC18 for an EF or a D(EF) animal and percentages of EF and D(EF) animals with at least 80 % assigned alleles. Results are reported per relaxation factor (fr), together with the percentage of animals with more than 80 % assigned alleles.
Breed origin of alleles
Average (SD) percentages of alleles on SSC2 and SSC18 assigned to each parental breed for an EF or a D(EF) animal
Some maternal chromosomes of D(EF) animals were (mainly) assigned to one of the two maternal breed origins (e.g., animals 5, 9, or 19; Fig. 5). These maternal chromosomes show a limited number of recombinations, as expected, and the percentages of breed origin for individual chromosomes can deviate considerably from their expectation (i.e., from 25 %). In addition, we found that recombinations occurred more frequently towards the end of the chromosomes and less in the middle based on the physical map, which is consistent with the genetic map length and recombination rate being higher in the more distal part of the chromosome.
Impact of core and tail lengths
The objectives of this study were (1) to develop an approach (BOA) for assigning breed origin to alleles of crossbred animals, and (2) to study its accuracy as a function of different factors. The results obtained from simulated and real data showed that the BOA approach accurately assigns breed origin to alleles of crossbred animals, and that its accuracy depends on various factors, such as the distance between the parental breeds.
Distance between breeds
The global Wright’s FST statistic measures the average inbreeding in a sub-population relative to the whole population and takes the effect of population subdivision into account. For example, for the distantly-related breeds, the estimated global FST was equal to 0.13, which indicated that about 13 % of the genetic variance in the combined population can be attributed to differentiation of the breeds (Table 1) [22, 24]. Based on the estimated global FST, the distances among the three breeds included in the real dataset were similar to those between the distantly-related breeds included in the simulated data. Comparison of results for assignment of breed origin showed that the %assigned was slightly lower for the real data than expected based on results obtained with the simulated data for distantly-related breeds. This may be explained by breed composition errors, genotype errors, or the structure of the purebred populations that were included in the real dataset.
The BOA approach and additional rules
The BOA approach assigned a breed origin to each allele of each SNP, based on a library of assigned haplotypes, the zygosity of the SNP genotype to which the considered allele contributes, and breed composition of the crossbred animal, if a most frequent breed assignment was observed. The BOA approach was specifically designed to determine breed origin of alleles in crossbred animals from well-defined crossbreeding schemes, and for which the purebred populations are up to two generations back. Breed composition of the crossbred animals, or at least its expectation, is expected to be known. In addition, the BOA approach was able to deal with scenarios that involved closely-related breeds. These characteristics can be overlooked by software tools that were developed to infer local ancestry in (recently) admixed populations [6–9], in which admixed individuals are mated to produce the next generation; these tools may, therefore, not be adequate for the crossbreeding situation. Like the BOA approach, most of these tools require phased genotypes for the ancestral populations (e.g., [7–9]), and inference of local ancestry is mainly realized through a Markov process (e.g., [6–9]). These methods also use allele frequencies, levels of LD between subsets of SNPs in the ancestral populations, pedigree information, and/or recombination rates. While such information is not (directly) used by the BOA approach, it could be useful to increase %assigned. Some additional rules to the BOA approach, e.g. based on allele frequencies, and their effects on allele assignments, are discussed below. Nevertheless, while these software tools may not be adequate for the typical crossbreeding programs for pigs or chickens, they may be useful for other livestock production systems, such as those for cattle, for which crossbreeding schemes are more complex. It should be possible, however, to adapt the BOA approach for these more complex scenarios.
Across the simulated scenarios, the percentage of alleles in a three-way crossbred animal that were correctly assigned to breed origin was higher than 90 %, and the percentage of incorrectly assigned alleles was always lower than 2 %. The remaining alleles, i.e. between 0 and 10 % of all alleles in a three-way crossbred animal, had no breed origin assigned. Additional rules to the BOA approach, which could be applied post-processing, could increase the %assigned. For example, assignment of the other allele at the SNP and assignments of alleles at other SNPs near the unassigned allele were not considered by the BOA approach. If one allele at a SNP was not present in at least one assigned haplotype, breed origin was not assigned to this allele, even if breed origin was assigned to the other allele at this SNP. This explains why for some SNP genotypes, breed origin was assigned for one but not the other allele, even for crossbred animals that originated from only two breeds (e.g., animal 1 in Fig. 4). The reason why breed origin of an allele was not assigned based on the assignment of the other allele was to avoid adding incorrect assignments in case the first allele was incorrectly assigned, which could increase the average %assigned but also the %incorrect. For the same reason, breed origins of assigned alleles in the neighbourhood of unassigned alleles were not used to assign breed origin to these unassigned alleles.
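The three percentages reported above can be computed for one crossbred animal as in the following minimal sketch. It assumes that the true breed origin of every allele is known, as in simulated data, and that unassigned alleles are coded as `None`. The function name `assignment_percentages` and the data layout are hypothetical.

```python
def assignment_percentages(true_origin, assigned_origin):
    """Return %correct, %incorrect and %unknown for one animal.

    true_origin: list of true breed labels, one per allele.
    assigned_origin: list of assigned breed labels, or None if unassigned.
    """
    if len(true_origin) != len(assigned_origin):
        raise ValueError("both lists must describe the same alleles")
    n = len(true_origin)
    if n == 0:
        raise ValueError("at least one allele is required")
    unknown = sum(1 for a in assigned_origin if a is None)
    correct = sum(
        1 for t, a in zip(true_origin, assigned_origin)
        if a is not None and a == t
    )
    incorrect = n - correct - unknown
    return {
        "correct": 100.0 * correct / n,
        "incorrect": 100.0 * incorrect / n,
        "unknown": 100.0 * unknown / n,
    }


# Four alleles: two correct, one incorrect, one without assignment
result = assignment_percentages(["A", "B", "C", "A"], ["A", "B", None, "C"])
```

In this layout, %assigned is the sum of %correct and %incorrect, so the three values always add up to 100 %.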
To test the accuracy of allele assignments by using information of assigned alleles, additional rules were added as a post-processing step of the BOA approach in order to assign breed origin to (1) the paternal allele if the maternal allele was already assigned, (2) the maternal allele if the paternal allele was assigned and if the considered animal originated from only two breeds, and (3) alleles if they were surrounded by alleles that were assigned the same breed origin, and if these two assigned alleles were present in the same haplotype that had the smallest core length and that was assigned this breed origin. Surrounding assigned alleles may be separated by several unassigned alleles from a considered unassigned allele. Pseudo-code for these additional rules can be found in the “Appendix”. The additional rules were applied to both simulated and real data (results not shown), and increased the number of assigned alleles by increasing both the %correct and %incorrect. For example, for chr2 of A(BC) animals from closely-related breeds (fr = 0 % and nine phasing analyses), the average %correct increased by 3.7 % and the average %incorrect increased by 0.2 %. For BC animals of the same scenarios, the average %correct and %incorrect increased by 7.6 and 0.4 %, respectively. The average %correct for BC animals was therefore close to 100 %. Detailed results for chr2 with nine phasing analyses are in Additional file 1: Table S8. While the additional rules increased the number of incorrect assignments (slightly), the impact on average %incorrect, relative to not using the additional rules, decreased as the distance between breeds increased. For the real data, the additional rules assigned breed origin to 93.2 % (i.e., an increase of 2.7 %) of the alleles for D(EF) animals and to 98.4 % (i.e., an increase of 9.5 %) of the alleles on SSC2 for the EF animals, with fr = 0 %.
Additional file 2: Figures S4 and S5 show breed origins assigned to alleles along SSC2 for 20 randomly-selected EF and D(EF) animals, respectively. The additional rules were especially beneficial for two-way crossbred animals, which was expected because both paternal and maternal alleles of two-way crossbred animals can be assigned by these rules, but only the paternal alleles of three-way crossbred animals. Furthermore, the greater %assigned was mainly due to the assignment of unassigned second alleles, which was also as expected because haplotypes with a small core length can potentially be shared by several breeds, which limits their assignment of breed origin, and therefore, the increase in %assigned. Because the increase in incorrect assignments was limited, the additional rules should be used in order to maximize the number of alleles for which breed origin is assigned.
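The first two additional rules described above can be sketched as follows. This is a simplified illustration and not the pseudo-code from the Appendix: it assumes that the sire breed is not among the dam breeds, that unassigned alleles are coded as `None`, and it leaves out rule (3) on surrounding alleles. The function name `apply_pairing_rules` and the data layout are hypothetical.

```python
def apply_pairing_rules(allele_origins, sire_breed, dam_breeds):
    """Fill in a missing breed origin from the other allele at the same SNP.

    allele_origins: list of (origin_1, origin_2) tuples, one per SNP.
    sire_breed: breed of the purebred sire (e.g. "A" for an A(BC) animal).
    dam_breeds: breeds contributing to the dam (e.g. {"B", "C"}, or {"C"}
        for a two-way BC animal with sire breed "B").
    """
    dam_breeds = set(dam_breeds)
    updated = []
    for first, second in allele_origins:
        pair = [first, second]
        for i in (0, 1):
            other = pair[1 - i]
            if pair[i] is not None or other is None:
                continue  # nothing to fill in, or nothing to infer from
            if other in dam_breeds:
                # Rule (1): the other allele is maternal, so this one is paternal
                pair[i] = sire_breed
            elif other == sire_breed and len(dam_breeds) == 1:
                # Rule (2): only valid when the dam side has a single breed
                pair[i] = next(iter(dam_breeds))
        updated.append(tuple(pair))
    return updated


three_way = apply_pairing_rules(
    [(None, "B"), ("A", None), (None, None), ("A", "C")], "A", {"B", "C"}
)
two_way = apply_pairing_rules([("B", None), (None, "C")], "B", {"C"})
```

For the three-way animal, the maternal allele in the second SNP stays unassigned because it could come from either dam breed, which matches the observation that the rules help two-way crossbred animals most.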
Percentages (averaged across SNPs) of correctly assigned heterozygous genotypes (%correct) for chromosome 1 for BC animals, and differences (diff) between observed and expected %correct for the simulated data
It is also worth noting that the BOA approach only considers two- and three-way crossbreeding schemes. An extension to a four-way crossbreeding scheme is straightforward by modifying BOA for the paternal allele by applying rules similar to those for the maternal allele. Lower rates of assignment could be expected for four-way crossbred animals, especially because the additional rules that are proposed above cannot be applied for both their paternal and maternal alleles.
The number of haplotypes that segregate only within one of the parental purebred populations may be limited, especially for closely-related breeds which can share many haplotypes. Also, some alleles may be incorrectly phased or not phased at all. For this reason, fr was introduced to allow haplotypes to be assigned, even if a percentage of their copies was observed in other parental purebred populations. Values of fr higher than those used here should be avoided because they may increase the %incorrect considerably. For example, fr = 50 % would allow the assignment of breed origin to a haplotype even if 50 % of its copies were observed in the other parental breeds. For the simulated data of unrelated breeds, varying fr had no or only a very slight effect on the %correct, %incorrect and %unknown, as expected. The main effect of fr was observed for scenarios with closely-related breeds, for which sharing of haplotypes between breeds is more common. Increasing fr mostly allowed a higher percentage of alleles that were previously considered as having an unknown origin to be correctly assigned. However, the impact of increasing fr from 0 to 10 % on %correct was greater than that of increasing it from 10 to 20 %. This was also observed for the real data, for which the increase of %correct was less than 1 % when fr increased from 10 to 20 % compared to more than 2 % when fr increased from 0 to 10 %. Based on these results, fr values greater than 0 % were useful and allowed assignment (correctly or incorrectly) of on average more than 90 % of the alleles of a crossbred animal, without (or only slightly) increasing the rate of incorrect assignments (as observed based on simulated data).
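The role of fr can be illustrated with a minimal sketch that assigns a breed origin to one haplotype from the number of copies observed in each purebred population. The exact comparison (here: at most fr % of the copies in other breeds) and the treatment of ties (here: no assignment) are assumptions made for this example. The function name `assign_haplotype_breed` is hypothetical.

```python
def assign_haplotype_breed(copies_per_breed, fr=0.0):
    """Assign a breed origin to a haplotype, or None if it is too widely shared.

    copies_per_breed: dict mapping breed label to the number of copies of the
        haplotype observed in purebred animals of that breed.
    fr: maximum percentage of copies tolerated in the other breeds.
    """
    total = sum(copies_per_breed.values())
    if total == 0:
        return None  # haplotype not observed in any purebred population
    top_count = max(copies_per_breed.values())
    top_breeds = [b for b, c in copies_per_breed.items() if c == top_count]
    if len(top_breeds) > 1:
        return None  # no single most frequent breed
    pct_other = 100.0 * (total - top_count) / total
    return top_breeds[0] if pct_other <= fr else None


strict = assign_haplotype_breed({"A": 9, "B": 1}, fr=0.0)    # shared, not assigned
relaxed = assign_haplotype_breed({"A": 9, "B": 1}, fr=10.0)  # assigned to "A"
```

With fr = 0 %, only haplotypes seen in a single breed are assigned, which is why raising fr mainly matters for closely-related breeds that share many haplotypes.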
The phasing method
Several phasing methods exist, such as pedigree-based phasing methods, LD-based phasing methods (e.g., Beagle [26], SHAPE-IT [27]), and LRP methods [20, 21]. Pedigree-based methods were not considered in this study because the pedigree of crossbred animals is not available in many crossbreeding programs, or their direct parents may not be genotyped. Thus, in real data, purebred and crossbred genotyped animals may be distant relatives that are separated by several generations, the parents of crossbred animals may not be included in the genotype dataset, or the pedigree of crossbred animals could be incomplete. LD-based phasing methods were considered to be suboptimal for crossbred populations because they rely on short haplotypes that may be common to several breeds, as detailed by Hidalgo et al. [28] and Amaral et al. [29] for pig breeds, and by Villa-Angulo et al. [30] for cattle breeds. Both these issues are avoided with LRP methods that aim at identifying and using distant relatives. LRP methods overcome the issue of common LD between breeds because long-range haplotypes are longer than one LD block but are still shared between purebreds and their close crossbred relatives. Also, LRP does not require knowledge of pedigree. AlphaPhase1.1 (version 1) software [20] that implements LRPHLI without pedigree was therefore chosen for this study.
Based on their experience, Hickey et al. [20] recommended the use of core and tail lengths of 300 to 500 SNPs (with a core length of 100 SNPs) for a 60k SNP panel. Consistent with Hickey et al. [20], the longest core and tail lengths required the shortest computational times. Furthermore, computational times increased with increasing distance between breeds for the same core and tail lengths. Distances between breeds were created with the simulation of 5, 20, or 50 generations of random mating before starting the three-way crossbreeding scheme, leading to higher inbreeding levels with increasing distances between breeds. As detailed by Hickey et al. [20], increasing inbreeding levels increases the number of surrogate parents in a dataset, which increases the computational requirements. This is in agreement with the estimated global F ST (Table 1). Based on simulated data, the longest core and tail lengths appeared to be more suitable when the breeds were more closely related. Thus, a general recommendation is to increase core and tail lengths as F ST decreases, in addition to taking the characteristics of the genome under study into account, such as chromosome lengths and the number of SNPs per chromosome.
Increasing the size of the datasets (results not shown) increases computational time. Hickey et al. [20] suggested that the computationally intensive phasing analyses could be performed on a random or selected subset of a large dataset of purebred and crossbred animals. A haplotype library can be built on a subset of data and then used to phase the crossbred animals that were not included in the phased subset. This haplotype library could also be used to phase crossbred individuals that are added to the dataset later on.
The analyses in this study were performed without pedigree for both purebred and crossbred animals. However, in real field data, pedigree may be known for at least some animals, e.g., for the purebred animals, and this could be considered for the phasing analyses to reduce computation time for the larger datasets. Although not tested in our study, inclusion of pedigree is not expected to improve the percentages of phased and assigned alleles because Hickey et al. [20] reported negligible effects on the phasing performance when pedigree information was ignored.
In the context of genomic selection for crossbred performance, the BOA approach can be used to determine the breed origin of alleles at genotyped SNPs for crossbred animals, which is required for models that take breed-specific effects into account (e.g., [3, 5, 31]). The BOA approach could also be useful to perform GWAS based on crossbred performance and taking into consideration that the effects of causative mutations on phenotypes may depend on breed origin. However, future studies are required to evaluate the effects of the low percentages of unknown and incorrect allele assignments on accuracy and bias of genomic predictions (or GWAS).
It should also be noted that some animals had a high percentage of unassigned alleles, which makes them not useful for subsequent analyses. These genotypes could be discarded, or breed origin could be assigned to their alleles based on, e.g. allele frequencies, as proposed above. This latter option should be applied with care, and after exploring the possible reasons for the high percentage of unassigned alleles (e.g., low percentage of phased alleles, breed composition errors, genotype errors).
Some studies have suggested that the use of haplotypes could lead to higher prediction accuracies than using SNP genotypes (e.g., [32, 33]). Potential reasons are that haplotypes may be in higher LD with causative mutations than individual SNPs and, therefore, capture more variation than SNPs. Thus, assigned haplotypes could be used in a haplotype-based genomic model that takes their breed origin into account. Using assigned haplotypes instead of assigned alleles could potentially reduce effects of incorrect allelic assignments.
The BOA approach accurately assigns breed origin to alleles of crossbred animals in a two- or three-way crossbreeding program. This procedure requires no prior knowledge of pedigree and no close relationships between crossbred and purebred animals, since it relies on long-range phasing.
JWMB derived the approach. JV conceived the study, wrote the program that implements the approach, performed the analyses, and drafted the manuscript. JJW wrote the simulation program. CAS provided real data. All authors provided valuable insights throughout the analysis and writing process. All authors read and approved the final manuscript.
JV, MPLC, JJW and JWMB acknowledge financial support from the Dutch Ministry of Economic Affairs, Agriculture, and Innovation (Public–private partnership “Breed4Food” code BO-22.04-011-001-ASG-LR-3). CAS acknowledges financial support from the Netherlands Organisation for Scientific Research (NWO) through the LocalPork project W 08.250.102 in the Food and Business Global Challenges Program. Topigs-Norsvin is gratefully acknowledged for making the genotype data available.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Wei M, van der Werf JHJ. Maximizing genetic response in crossbreds using both purebred and crossbred information. Anim Prod. 1994;59:401–13.
- Toosi A, Fernando RL, Dekkers JCM. Genomic selection in admixed and crossbred populations. J Anim Sci. 2010;88:32–46.
- Christensen OF, Madsen P, Nielsen B, Su G. Genomic evaluation of both purebred and crossbred performances. Genet Sel Evol. 2014;46:23.
- Dekkers JCM. Marker-assisted selection for commercial crossbred performance. J Anim Sci. 2007;85:2104–14.
- Ibáñez-Escriche N, Fernando RL, Toosi A, Dekkers JC. Genomic selection of purebreds for crossbred performance. Genet Sel Evol. 2009;41:12.
- Sankararaman S, Sridhar S, Kimmel G, Halperin E. Estimating local ancestry in admixed populations. Am J Hum Genet. 2008;82:290–303.
- Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, et al. Fast and accurate inference of local ancestry in Latino populations. Bioinformatics. 2012;28:1359–67.
- Churchhouse C, Marchini J. Multiway admixture deconvolution using phased or unphased ancestral panels. Genet Epidemiol. 2013;37:1–12.
- Hellenthal G, Busby GBJ, Band G, Wilson JF, Capelli C, Falush D, et al. A genetic atlas of human admixture history. Science. 2014;343:747–51.
- Zhu X, Cooper RS. Admixture mapping provides evidence of association of the VNN1 gene with hypertension. PLoS One. 2007;2:e1244.
- Pasaniuc B, Zaitlen N, Lettre G, Chen GK, Tandon A, Kao WHL, et al. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a breast cancer consortium. PLoS Genet. 2011;7:e1001371.
- Gautier M, Naves M. Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Mol Ecol. 2011;20:3128–43.
- Kim ES, Rothschild MF. Genomic adaptation of admixed dairy cattle in East Africa. Front Genet. 2014;5:443.
- Khayatzadeh N, Meszaros G, Gredler B, Schnyder U, Curik I, Solkner J. Prediction of global and local Simmental and Red Holstein Friesian admixture levels in Swiss Fleckvieh cattle. Poljoprivreda. 2015;21:63–7.
- Sargolzaei M, Schenkel FS. QMSim: a large-scale genome simulator for livestock. Bioinformatics. 2009;25:680–1.
- Ramos AM, Crooijmans RPMA, Affara NA, Amaral AJ, Archibald AL, Beever JE, et al. Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS One. 2009;4:e6524.
- Wright S. The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution. 1965;19:395–420.
- Raymond M, Rousset F. GENEPOP (Version 1.2): population genetics software for exact tests and ecumenicism. J Hered. 1995;86:248–9.
- Rousset F. genepop’007: a complete re-implementation of the genepop software for Windows and Linux. Mol Ecol Resour. 2008;8:103–6.
- Hickey JM, Kinghorn BP, Tier B, Wilson JF, Dunstan N, van der Werf JH. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol. 2011;43:12.
- Kong A, Masson G, Frigge ML, Gylfason A, Zusmanovich P, Thorleifsson G, et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat Genet. 2008;40:1068–75.
- Falconer DS, Mackay TFC. Introduction to quantitative genetics. 4th ed. Harlow: Pearson Education Limited; 1996.
- Tortereau F, Servin B, Frantz L, Megens HJ, Milan D, Rohrer G, et al. A high density recombination map of the pig reveals a correlation between sex-specific recombination and GC content. BMC Genomics. 2012;13:586.
- Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating and interpreting F ST. Nat Rev Genet. 2009;10:639–50.
- Uricchio LH, Chong JX, Ross KD, Ober C, Nicolae DL. Accurate imputation of rare and common variants in a founder population from a small number of sequenced individuals. Genet Epidemiol. 2012;36:312–9.
- Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009;84:210–23.
- Delaneau O, Marchini J, Zagury JF. A linear complexity phasing method for thousands of genomes. Nat Methods. 2011;9:179–81.
- Hidalgo AM, Bastiaansen JW, Harlizius B, Megens HJ, Madsen O, Crooijmans RP, et al. On the relationship between an Asian haplotype on chromosome 6 that reduces androstenone levels in boars and the differential expression of SULT2A1 in the testis. BMC Genet. 2014;15:4.
- Amaral AJ, Megens HJ, Crooijmans RPMA, Heuven HCM, Groenen MAM. Linkage disequilibrium decay and haplotype block structure in the pig. Genetics. 2008;179:569–79.
- Villa-Angulo R, Matukumalli LK, Gill CA, Choi J, Tassell CPV, Grefenstette JJ. High-resolution haplotype block structure in the cattle genome. BMC Genet. 2009;10:19.
- Zeng J, Toosi A, Fernando RL, Dekkers JC, Garrick DJ. Genomic selection of purebred animals for crossbred performance in the presence of dominant gene action. Genet Sel Evol. 2013;45:11.
- Calus MPL, Meuwissen THE, de Roos APW, Veerkamp RF. Accuracy of genomic selection using different methods to define haplotypes. Genetics. 2008;178:553–61.
- Cuyabano BCD, Su G, Rosa GJM, Lund MS, Gianola D. Bootstrap study of genome-enabled prediction reliabilities using haplotype blocks across Nordic Red cattle breeds. J Dairy Sci. 2015;98:7351–63.