Mapping genomic regions affecting milk traits in Sarda sheep by using the OvineSNP50 Beadchip and principal components to perform combined linkage and linkage disequilibrium analysis

Background The detection of regions that affect quantitative traits (QTL), to implement selection assisted by molecular information, remains of particular interest in dairy sheep for which genetic gain is constrained by the high costs of large-scale phenotype and pedigree recording. QTL detection based on the combination of linkage disequilibrium and linkage analysis (LDLA) is the most suitable approach in family-structured populations. The main issue in performing LDLA mapping is the handling of the identity-by-descent (IBD) probability matrix. Here, we propose the use of principal component analysis (PCA) to perform LDLA mapping for milk traits in Sarda dairy sheep. Methods A resource population of 3731 ewes belonging to 161 sire families and genotyped with the OvineSNP50 Beadchip was used to map genomic regions that affect five milk traits. The paternally and maternally inherited gametes of genotyped individuals were reconstructed and IBD probabilities between them were defined both at each SNP position and at the genome level. A QTL detection model fitting fixed effects of principal components that summarize IBD probabilities was tested at each SNP position. Genome-wide (GW) significance thresholds were determined by within-trait permutations. Results PCA resulted in substantial dimensionality reduction, in fact 137 and 32 (on average) principal components were able to capture 99% of the IBD variation at the locus and genome levels, respectively. Overall, 2563 positions exceeded the 0.05 GW significance threshold for at least one trait, which clustered into 75 QTL regions most of which affected more than one trait. The strongest signal was obtained for protein content on Ovis aries (OAR) chromosome 6 and overlapped with the region that harbours the casein gene cluster. Additional interesting positions were identified on OAR4 for fat content and on OAR11 for the three yield traits. Conclusions PCA is a good strategy to summarize IBD probabilities. A large number of regions associated to milk traits were identified. The outputs provided by the proposed method are useful for the selection of candidate genes, which need to be further investigated to identify causative mutations or markers in strong LD with them for application in selection programs assisted by molecular information.

Background The identification of genomic regions that affect traits of interest and the application of marker-or gene-assisted selection [1] in livestock are crucial to speed up genetic improvement. This is especially valid in species for which genetic gain is hampered by the relatively high costs of large-scale phenotyping and the logistic constraints of artificial insemination [2]. However, application of selection assisted by molecular information for traits that are influenced by numerous loci, each one explaining a small portion of the trait variance, is limited by the lack of power of experiments based on low-density marker maps [3]. In sheep, attempts to identify regions that affect quantitative traits (QTL) were performed first by using microsatellite maps [4][5][6][7][8][9][10][11][12][13]. Most of the identified QTL showed low significance levels and large confidence intervals, and thus their use in selection programs was not possible [2].
Nevertheless, the discovery of thousands of singlenucleotide markers (SNPs) and cost-effective tools (SNP arrays) to genotype them on a large number of animals as well as the recent availability of affordable whole-genome sequencing techniques, has opened new perspectives. It is expected that polymorphisms with small effects that collectively explain an increasing amount of the genetic variance may be gradually identified by using denser molecular marker maps on larger structured resource populations [14]. Thus, individuals that belong to preexisting and new dairy sheep experimental populations have been accurately recorded for several traits and genotyped with 50 K and/or 600 K SNP chips. This is the case of an experimental flock of Sarda ewes, which has been set up since 2000 as a resource population to implement selection assisted by molecular information in the breeding scheme of this Italian dairy breed. QTL detection studies based on SNP arrays and the availability of increasingly accurate genome asec15nnotation data allow the listing of potential candidate genes [15][16][17][18][19][20]. Moreover, whole-genome re-sequencing of target animals has been used to restrict the number of candidate polymorphisms. Thus, putative causative mutations that affect traits of economic interest have been proposed [21][22][23][24].
Among the available QTL detection approaches, those based on the combined use of linkage disequilibrium (LD) and linkage analysis (LA) information (LDLA) have been indicated as the most powerful, robust and precise in populations that are structured in families [16,25,26]. The main reason is that they account for both recombination events that occurred within genotyped generations and historical recombination events that occurred in generations prior to genotyping [25]. Several approaches have been proposed to combine LD and LA information [25,[27][28][29][30][31]. The classical LDLA method [25,32] performs a variance component analysis at each putative QTL position by using identity-by-descent (IBD) probabilities between haplotypes. The main issue is that the IBD probability matrix is often dense, non-positive definite and computationally demanding for its inversion [29,33]. Thus, either strategies that perform the hierarchical clustering of haplotypes based on IBD probability [33][34][35] or the approximation of the IBD based on the extent of the identity-by-state status between haplotypes [36] have been used. However, these approximations inevitably result in a loss of information [29].
An alternative way to process IBD information is principal component analysis (PCA), which has the desirable feature of collapsing information that is contained in a set of correlated variables by a smaller set of orthogonal variables. As such, PCA has been proposed as a technique to reduce the dimensionality of predictors in genomic selection [37,38].
The aim of this study was to detect genomic regions that affect milk traits in the resource population of Sarda sheep by applying an LDLA approach combined with PCA to overcome computational issues of the IBD matrix and minimize the loss of information.

Resource population
The generation of the resource population (RP) started in 1999 when 10 Lacaune × Sarda F1 sires were mated to Sarda ewes to produce 928 back-cross female lambs in the framework of an European project aimed at detecting QTL in the main European sheep breeds (QLK5-CT-2000-00656; "genesheepsafety"). Subsequently, we focused on the detection of QTL segregating in the pure Sarda breed, and since 2002 we used exclusively Sarda rams (SA) to produce the yearly replacement of RP. Until 2009, the average size of the sire families was 43 daughters whereas, from 2010 onward the average size of families decreased to nine daughters, in order to increase the number of Sarda bloodlines represented in the RP. Sarda sires were always chosen based on their genetic impact on the registered population among rams belonging to the artificial insemination centre of the breed.
In total, 3949 ewes from 161 rams (10 F1 and 151 SA) were generated until 2015. Ewes of RP were kept until the 4th (occasionally the 5th) lactation on an experimental farm. The farming system was similar to that commonly applied in Sardinia with most of the adult ewes lambing in autumn and yearlings lambing between January and March. The ewes were milked twice a day by machine from weaning (3-4 weeks from lambing) until the end of July. The feeding regime was based on controlled grazing, supplemented by hay and concentrates in winter and late spring.

Genotypes and phenotypes
All the ewes of RP and their sires as well as the 10 Lacaune sires of F1 and 11 Sarda sires of SA were genotyped with the Illumina Inc. OvineSNP50 Beadchip. SNP editing was performed using call rate and MAF thresholds of 95% and 1%, respectively. The ovine genome assembly v4.0 and the software SNPchimMpv.3 [39] were used to construct the genetic map by assuming 1 Mb = 1 cM. Unmapped SNPs and SNPs on sex chromosomes were not included in the study.
A large range of traits of economic interest was measured in the RP. In the current study, we focused on milk traits: milk yield (MY); fat yield (FY); protein yield (PY); fat content (FP) and protein content (PP). MY, FP and PP were measured twice a month during the milking period at the am and pm milking. Lactation records were computed by the Fleischmann method, using records from the milking period only (in agreement with ICAR recommendations), by considering an initial suckling period of 30 days. Finally, 13,059 lactations of 3731 ewes recorded from 2000 to 2017 were retained. The average number of records per ewe was 3.5 ± 1.02, ranging from 1 (5% of animals) to 5 (9.3% of animals); most of the ewes (55.5%) had four records.
First, in order to adjust for the main environmental effects, raw lactation records were analysed with singletrait repeatability animal models using the ASReml 4.1 software [40]. Genetic relationships between animals were taken into account by calculating the genomic relationship matrix [41] between 4513 animals, including F1 and SA sires and their genotyped ancestors. The animal model included as fixed effects the year-management-group interaction (37 levels), the year-month of lambing-parity-age class interaction (230 levels) and the milking length within age class (adult and primiparous) as a covariate. The average performance deviation (APD) of each ewe was calculated as the average of lactation records adjusted for fixed effects. The APD used in this study as pseudo-phenotypes for QTL detection differ from the yield deviations [42] that were used in similar studies in that the performances are not adjusted for permanent environmental effects in order to prevent inaccurate estimations due to the likely confounding between permanent environment and additive genetic effects. Indeed, Pearson's correlations between additive genetic and permanent environmental effects from the repeatability animal model ranged, for the five analyzed traits, from 0.46 for FY to 0.50 for PP. Moreover, although in this study we shall investigate only additive effects, ADP include dominance and epistatic genetic effects when they exist and are suitable pseudo-phenotypes for testing such effects in further analyses. Finally, 3731 APD from as many ewes were available. To verify the suitability of the applied animal model, the ratio between the estimated genomic and total variance was compared with the heritabilities reported in the literature. In the same way, the correlations between APD and GEBV of different traits were compared with phenotypic and genetic correlations estimated in other studies.

Classification of gametes and reconstruction of gametic phases
By "gamete", we refer to the whole haploid set of autosomes that are inherited by an individual from one of the two parents. Moreover, we classified the gametes of the population as base haplotypes (BH) when inherited from an ungenotyped parent and replicated haplotypes (RH) when inherited from a genotyped parent. The pool of BH comprised both gametes of F1 rams and of 63 SA rams, the maternal or paternal gametes of the 35 ewes with an unknown sire or dam, respectively, and the maternal gametes of the 928 back-cross ewes and of 85 SA rams for which the sire was genotyped. Finally, 1207 gametes were classified as BH, i.e. the 10 F1 sires paternal Lacaune gametes ( BH L ) and 1197 Sarda origin gametes ( BH S ). Then, all 7462 gametes ( n RH ) carried by the 3731 ewes with production records ( n P ) were considered as replicates (RH) of the 1207 BH ( n BH ). An example of how gametes were classified is given in Appendix.
The paternal and maternal inherited gametes of all the genotyped individuals were reconstructed by using a procedure based on the linkage disequilibrium multilocus iterative peeling method proposed by Meuwissen and Goddard [43]. In this method, the parental origin of the alleles carried by an individual is iteratively inferred on the genotypes of parents and offspring at a given locus if they are already phased or at the neighbouring phased loci if the phase at the given locus is unknown. Here, the LD at the population level was ignored, since the population structure was expected to allow a high level of precision using family relationships only. For individuals with both parents without a genotype, the paternal or maternal origin was arbitrary assigned. Genotypes for which the parental origin of alleles was assigned with a probability lower than 0.99 were assumed missing.

Calculation of IBD probabilities
The marked familial structure of the RP led us to exploit the information from the within-family linkage analysis (LA) in addition to that from the population-wide linkage disequilibrium (LD) to estimate IBD probabilities.
First, IBD probabilities between BH and RH were calculated by LA ( IBD LA ) given the known gametic phases and the pedigree information. The grand-parental origin of each RH was estimated at each SNP position with certainty when the genotype at a given position was not missing and the parent transmitting RH was heterozygous. When these conditions were not fulfilled, the probability of a grand-parental origin at a given locus was determined based on information from the closest neighbouring informative loci [44,45]. Then, transmission from BH to RH was traced through generations following Fernando and Grossman [46]. At each SNP position l , IBD LA probabilities were stored in a matrix H l with size n RH × n BH . Moreover, the number of replicates of a given BH in RP at each SNP position l ( f l ) was calculated as f l = H ′ l 1 , where 1 is a vector of n RH ones (see Appendix). Secondly, IBD between BH were estimated by LD analysis ( IBD LD ) at each SNP position following Meuwissen and Goddard [47]. The IBD LD probability was conditioned to the identity-by-state (IBS) status of neighbouring SNPs using windows of 21 SNPs (10 upstream and 10 downstream resulting in an average window length of 1 Mb) and to the within-breed (Lacaune or Sarda) expected homozygosity. The IBD LD between BH S and BH L were assumed to be null. At each SNP position l , IBD LD probabilities were stored in a matrix U l with size n BH × n BH .
Once we had precisely estimated the covariances between BH and between BH and RH as well as the BH number of replicates at each locus, the average values across loci were also calculated and stored in U g ( n BH × n BH ), H g ( n RH × n BH ) and f g ( n BH ), respectively. These matrices will be used later to estimate genomewide IBD probabilities between gametes to adjust for the polygenic effect of background genes.

Principal component analysis and QTL detection model
Hereafter we describe a novel LDLA approach for QTL mapping that, similarly to the basic LDLA model proposed by Meuwissen et al. [25], relies on the IBD information at the locus level and takes the effect of background genes into account. In their study, Meuwissen et al. [25] modelled the phenotypic records by the random effects of all the inherited gametes ( h , twice the number of individuals) at a given position l and the random polygenic effects ( u , i.e. the combined effect of background genes) of all the individuals. In the original model [25]: y = 1µ + Zh + u + e , the covariance matrix of gametic effects G l included IBD probabilities between founder gametes that were obtained by LD analysis ( IBD LD ) [47] and IBD probabilities between founder and non-founder gametes and between nonfounders gametes that were obtained by combining the corresponding transmission probabilities ( IBD LA ) with IBD LD , using the algorithm described by Fernando and Grossman [46]. The additive polygenic covariance between individuals was considered through the numerator relationship matrix A based on pedigree information. In the basic method [25], the maximum likelihood estimates of the variance components were calculated at each putative QTL position l . The application of this method implies some relevant issues related to the nature of the G l matrix, which is usually dense and may turn out to be non-positive definite, and the computational needs in applying the variance component analysis at each investigated position. To overcome these issues, we propose a novel approach which uses the principal component analysis (PCA) to handle G l and exploits the dimensional reduction of the model equations that may be achieved by PCA to estimate both QTL and polygenic effects. First, PCA is used to capture the IBD information at the locus level with the aim of overcoming the difficulties in inverting the G l matrix, in a different way from previous strategies having the same purpose [33][34][35][36]48] which frequently result in a loss of information [29]. Second, PCA is applied to the matrix of the genome-wide IBD probabilities between gametes (i.e. the average across loci of IBD probabilities locus-wide) which is used instead of the classical numerator relationship matrix ( A ) to take into account the polygenic effects. A similar approach was used by Rothammer et al. [49,50], which applied PCA to reduce the dimension of the relationship matrix and used explanatory PC as fixed effects in their QTL detection model. Third, PC that explain most of the variability of both the locus-level and genome-wide IBD probability matrices are included in the model as fixed effects to estimate both QTL and polygenic effects by performing a least square analysis instead of the more computationally demanding variance component one.
Thus, at each SNP position l the model is the following: where y is a vector of APD of n p ewes for MY, PY, FY, PP and FP; µ is the overall mean; β l is a vector of the fixed effects of the n PC l principal components that explain more than 99% of the within breed variation ( PC l ) of the IBD probability matrix G l , i.e. β l summarizes the effects of haplotypes at the QTL position l ; α l is a vector of the fixed effects of the n PC g principal components that explain more than 99% of the variation ( PC g ) of the genome-wide IBD probability matrix, i.e. α l summarizes the polygenic effects of the gametes; 1 is a vector of n p ones; Z is a n p × n RH incidence matrix relating phenotypes with RH; V l is a n RH × n PC l matrix including the PC l scores of RH, V g is a n RH × n PC g matrix including the PC g scores of RH; ε is a vector of n p residuals assuming that ε ∼ N 0, σ 2 ε R −1 with R a diagonal matrix with the APD's reliability ( r ) as diagonal element. For each investigated trait (MY, PY, FY, PP and FP), reliabilities were calculated as r i = 1 − se â i 2 /σ 2 a , from a repeatability linear model y ij = a i + e ij , where y ij is the performance deviation j (i.e. the lactation record j adjusted for the fixed effects estimated with the full animal model) of ewe i , a i is the (1) y = 1µ + ZV l β l + ZV g α l + ε random ewe effect assuming that a ∼ N 0, σ 2 a I and e ij is the corresponding error, assuming that e ∼ N 0, σ 2 e I . Below, we explain how the PC scores of the V l and V g matrices were calculated. In addition, a numerical example is given in Appendix.
As far as the V l elements are concerned, in order to limit the computational requirements to extract PC directly from the large ( n RH × n RH ) G l matrix, the PCA was performed on a n BH × n BH matrix denoted as U w l , where the IBD LD probabilities between BH, stored in U l , were weighted for the IBD LA probabilities between BH and RH by condensing H l information in a n BH × n BH diagonal matrix F l , in which the diagonal elements are the number of RH of each BH (stored in f l , where f l = H ′ l 1). The U w l matrix is defined as: PCA was carried out on U w l by using the Jacobi algorithm. Eigenvectors ( V w l ) relating to the largest principal components that together explain more than 99% of the within-breed variation ( PC l ) were retained. Finally, IBD LA probabilities between BH and RH ( H l ) were combined with V w l to define the PC l scores of RH ( V l ) by: Note that when IBD LA between BH and RH are estimated with certainty and, thus, H l only contains 0 and 1, then: H l U l H ′ l = G l and F l = H ′ l H l ; eigenvalues from G l correspond to eigenvalues from U w l for the explanatory principal components ( PC l ) and PC l scores from G l correspond to V l (see Appendix). When IBD LA between BH and RH are estimated without certainty and H l contains intermediate values between 0 and 1, PC l scores from G l and V l do not correspond perfectly and differences tend to increase as the uncertainty of IBD LA probabilities increases. This is because U w l only considers the covariance that derives from IBD LD excluding the covariance between RH pairs generated by imprecise estimation of transmissions of BH. This effect is negligible in our experiment because most transmission probabilities are estimated with certainty.
Since the IBD LD between BH S and BH L was set to 0 and two sets of breed-specific PC were obtained, the matrix V l can be detailed as V S l V L l . where V S l and V L l are the PC l summarising IBD probabilities between the gametes of Sarda and Lacaune origin, respectively. In Eq. (1) V l elements, which are related by the incidence matrix Z to phenotypes, are used as covariates on the investigated traits to estimate QTL effects at locus l ( β l ).
As far as the V g elements are concerned, the PCA performed directly on the weighted genome-wide IBD LD matrix, ( U w g ) computed as in Eq. (2) resulted in 1022 PC that were needed to capture 99% of the total variation. This limited dimensional reduction is due to the moderate genome-wide IBD LD . probabilities between BH (on average around 0.1) and the small number of replicates of some BH on RP. Thus, in order to not over-parameterize the model, we considered the BH with the highest impact on RP ( BH h ). Then, to recover information from BH with few RH, a matrix of coefficients W relating BH h to all the BH was calculated as: where U g_h is the section of U g including IBD LD probabilities between all the BH with BH h and U −1 g_hh is the inverse of the section of U g including IBD LD probabilities between BH h pairs. The average number of replicates per BH h was then updated as f g_h = W ′ f g . The BH h set was iteratively selected as the smallest group of BH that satisfied the condition that f g_h /n RH is higher than 0.99. According to the analysis at the locus level, the Jacobi algorithm was performed on the matrix U w g_hh computed as: where F g_h is a diagonal matrix, with its diagonal elements being the number of replicates stored in f g_h . Eigenvectors ( V w g_hh ) of the largest principal components that together explain more than 99% of the total variation of U w g_hh ( PC g ) were retained. Finally, genome-wide IBD LA probabilities between BH and RH ( H g ) were combined with V w g_hh to define PC g scores of RH ( V g ) by: In Eq. (1), V g , scores which are related by the incidence matrix Z to phenotypes, are used as covariates on the investigated traits to estimate polygenic effects ( α l ).
Note that β l and α l vectors in Eq. (1) are both fixed effects that are depicted separately to highlight that the model aims at estimating QTL effects ( β l ) while adjusting for the background of polygenes ( α l ). Moreover, covariates related to β l (ZV l ) are locus-specific whereas covariates related to α l ZV g remain constant throughout the genome.
In addition, in accordance with V l , β ′ l can be detailed as β l and β L l summarize the effects of the Sarda and Lacaune gametes, respectively.
The model was tested at each SNP position by F-tests. Three null hypotheses were tested, H 0 : β l = 0; H 0 : β S l = 0 and H 0 : β L l = 0. In the current study, only BH S tests, corresponding to H 0 : β S l = 0, will be analysed and discussed.
Genome-wide (GW) significance thresholds were determined by 2000 within-trait permutations of the residuals ( ε ) of the reduce model y = 1µ + ZV g α + ε , where only the polygenic effects were considered. In order to break free from differences in the number of degrees of freedom at each SNP position, genomewide maxima of the negative logarithms of p-values [− log 10 (p-value)] from each permutation were used to construct the null distribution.

Production data and phenotypes for QTL detection
The ratios between genomic and total variance estimated by a single-trait animal model (Table 1) are consistent with the estimates of heritabilities for dairy traits in the literature [2]. Content traits were more heritable than yield traits. The most and the less heritable traits were PP and PY, respectively. APD and GEBV correlations between traits showed similar values and were consistent with phenotypic and genetic correlations reported in previous studies on sheep [2]. Strong correlations (from 0.88 to 0.95) were observed between yield traits, and moderate positive correlations were observed between content traits (0.58 and 0.62). MY was negatively correlated with both content traits, while the correlations of the other two yield traits with content traits were low.

Reconstruction of gametic phases
Preliminary editing of data led to remove SNPs with more than 5% missing genotypes and with a minor allele frequency (MAF) lower than 0.01. Only SNPs located on the 26 autosomes were retained.
The phasing procedure allowed the reconstruction of the sequence of the alleles carried by the investigated BH and RH for more than 99.5% of the SNP positions. After phasing, another 120 SNPs were excluded, because their genotypes were inconsistent with the phase estimated from neighbouring SNPs. Finally, 43,390 SNPs were retained for further analyses. The explored genomic portion was 2437 Mb long and the average distance between SNPs was 56 ± 49 kb with a maximum gap of 2.377 Mb on Ovis aries (OAR) chromosome 21.

IBD probability calculation
On average, the maximum locus-wide IBD LA probability between each RH and all the BH was 0.99, which indicates the precise reconstruction of the meioses occurred across RP generations.
The distribution of the genome-wide number of replicates of the 1207 BH in RP ( f g ) are depicted in Fig. 1. The impact of BH on RP was extremely variable, in fact the average number of replicates per BH ranged from 1 to 202. The locus-wide IBD LD probability between BH pairs was zero in 68% of cases. The distribution of non-zero IBD LD probabilities is shown in Fig. 2. The 13% and 6% of the locus-wide IBD LD were lower than 0.05 and higher than 0.95, respectively, which suggests that, for a large proportion of BH pairs, it would have been possible to approximate the IBD status to 0 or 1. However, another 13% of locus-wide IBD LD showed intermediate values, for which the approximation to 0 or 1 would have been less accurate.
Most of the BH pairs showed a genome-wide IBD LD probability equal to 0.10, with 95% having values ranging from 0.07 to 0.16 (Fig. 2). This result confirms that the original pool of Sarda gametes as well as the rams used to generate the yearly replacement of RP show a rather large genetic variability.

Principal component analysis and QTL detection
The distribution of the number of PC needed to capture 99% of the locus-wide variability is shown in Fig. 3. The average number of PC l was 32.3 ± 6.4 with a maximum of 74 and a minimum of 9. As far as the breed of origin is concerned, the number of PC l explaining 99% of variation due to BH S (see Additional file 1) and BH L were 24.1 ± 6.0 and 8.2 ± 1.2, respectively.
Concerning the genome-wide analysis, 139 BH with the highest impact on RP ( BH h ) were selected on the basis of f g and the genome-wide IBD LD probabilities between BH pairs. The BH h set included all 10 Lacaune gametes and 129 Sarda gametes. The sum of the original number of replicates ( f g ) of BH h was 0.69 (i.e. 69% of the RH were replicates of BH h ). The remaining 30% of the RH variation was accounted for through the coefficients included in the W matrix and derived from IBD LD probabilities between BH h and other BH. At the genome-wide level, the number of PC g needed to capture 99% of the variation was 137.
The distributions of the genome-wide maxima of − log 10 (p-value) , corresponding to the null hypothesis Table 1 Ratios between genomic and total variance (diagonal, standard errors of estimates in brackets) and correlations between APD (above the diagonal) and between GEBV (below the diagonal) MY  H 0 : β S l = 0, obtained by 2000 within-trait permutations, did not show relevant differences across traits (see Additional file 2). The 5% threshold ranged from 5.59 to 5.69. Thus, the most conservative value, 5.69, was retained as the common GW threshold for all the analysed traits.
Overall, 2563 positions exceeded the 0.05 GW significance threshold for at least one trait (Fig. 4) (see Additional file 3). There were 200, 108, 122, 918 and 1927 SNP positions significantly associated with MY, FY, PY, FP and PP, respectively. The number of significant positions affecting simultaneously one, two, three and four traits was 1943, 546, 56 and 18, respectively. Several significant positions were adjacent, which may be due to linkage disequilibrium between locations. In order to account for such dependency, significant positions were clustered into QTL regions (QTLR). The correlations between ZV l β l (corresponding to the second term of the model Eq. 1) were calculated for all pairs of significant SNPs on the same chromosome. Then, the strongest signal at the chromosome level was retained as the peak of the first QTLR. The peaks of further QTLR along the chromosome were iteratively identified among the significant locations that had correlations lower than 0.15 with the already defined QTLR peaks. Finally, the remaining significant positions were assigned to the QTLR with which they had the highest correlation. When QTLR for different traits overlapped, we considered them as a unique QTLR. This procedure may underestimate the true number of QTL if more than one gene affecting the trait(s) is located in the same genome region.
Details of the 75 defined QTLR are in Table 2. QTLR were detected across all 26 autosomes except OAR26. The largest number of QTLR (10) was detected on OAR1. Overall, 12, 11, 10, 46 and 43 QTLR significantly affected MY, FY, PY, FP and PP, respectively. Among these 75 QTLR, 46 were significant for one trait only: two for MY, two for FY, two for PY, 23 for FP and 17 for PP; 20 were significant for two traits: one for MY and FY, one for MY and PP and 18 for FP and PP; two QTLR were significant for three traits: one for MY and the two content traits and one for the three yield traits; five QTLR were significant for four traits: one for the three yield traits and FP, three for the three yield traits; and PP, and one for FY, PY, FP and PP; and finally two QTLR significant for all five investigated traits.
The strongest signal was obtained for PP on OAR6 at 85.34 Mb where a nominal p-value of 11.12*10 −67 was observed. The corresponding QTLR harboured significant positions also for FP and MY. The most significant position for FP (p-value = 1.26*10 −15 ) was observed at 12.34 Mb on OAR4. This QTLR affected also PP. The most significant results for the three yield traits (p-value = 2.41*10 −12 , 5.89*10 −10 and 2.20*10 −11 for MY, FY and PY, respectively) were detected at 55.43 Mb on OAR11 where a significant peak for FP was also identified ( Table 2).

Discussion
Power, precision, robustness of QTL mapping experiments in complex populations may be affected by several issues (size of the experiment, number and frequency of base haplotypes, density of marker maps). As described above, the resource population investigated here is constituted by families based on male ancestors. LDLA mapping approaches are expected to be more suitable than linkage and genome-wide association analyses to fine map QTL regions in such populations. In fact, the LDLA analysis combines both the within-family linkage and population-wide linkage disequilibrium information to estimate IBD probabilities between haplotypes [51].
The proposed approach allowed us to solve the model by the least square method, which avoid a computational expensive variance component analysis. The advantages of approaches based on LDLA regression versus those on variance components, in term of ease of use and computing time, are well known [29] and have been clearly demonstrated by Roldand et al. [52].
In our study, LDLA mapping greatly benefits from the structure of the population, in which the ewes born after the first generation have both parents genotyped, which allows a precise reconstruction of the base gametes of the population and their transmission through generations. IBD information can be efficiently captured with PCA and the computational constraints due to the multi-collinearity generated by the high IBD LD probabilities that may occur between pairs of BH at the locus level can be overcome. The use of PCA avoided the implementation of prior clustering of BH or approximations in the IBD probability estimation [33,34]. Moreover the strategy used here to collapse IBD information into principal components (i.e. the use of U w l instead of G l ) is computationally efficient. It relies on the high precision of LA for defining the ancestral origin of each gamete, which is possible for populations with a strong familial structure. The effectiveness of the method in populations with weaker familial structure should be investigated.
Depending on the eigenvalues threshold used for PCs selection, the proposed approach allows to capture most of the IBD variation with a dramatic decrease in the number of effects to estimate. Although the direct solutions of the analysis are the effects of explanatory PC, the effect corresponding to each BH at position l can be easily calculated ( β BH l = F −1/2 l V w l β l ). Effects and frequencies of BH may be used as basic information to identify the BH that contribute most to the significance of a given locus. In fact, in their study on the identification of putative causative mutations that affect the  protein content, Casu et al. [22] used such information to select individuals for whole-genome re-sequencing. The QTL detection model proposed here included a fixed factor to adjust for the polygenic background of each individual based on the effect of PC that summarize the genome-wide BH variation. However, at the genomewide level IBD LD probabilities between BH had moderate values when averaged across the genome. Indeed, most PC showed small eigenvalues and many of them were necessary to explain most of the variability. Thus, we applied an approach that aimed at reducing the number of BH to be included in the PCA to those that had the highest impact on RP ( BH h ) by taking their probability to be carried by an individual with a record into account.
During the development of our method, we applied it to the dataset that was simulated for the XVI QTLMAS meeting [53]. The results of QTL detection were very close to those reported by Garzia Gamez et al. [53] who used a more classical LDLA method [33,53], which implemented variance component analysis and accounted for an individual random polygenic effect (see Additional file 4).
Overall, a large number of genomic regions that significantly affected milk traits were detected in this study. The number of detected regions largely exceeded those obtained by other LDLA studies in dairy sheep for milk production traits [16]. This larger number of detected QTLR is probably due to the larger size of the analyzed population. Indeed, Garzia-Gamez et al. [16] performed a LDLA mapping on a population of about 1700 Churra ewes that were organized in 16 half-sib families and they detected 34 genome-wide significant regions.  Consistent with the estimates of heritabilities, the number of QTLR that affected content traits was larger than that for yield traits. Several positions suggested pleiotropic effects. Twenty-nine QTLR affected more than one trait: nine affected at least two yield traits and frequently one or both of the content traits, four affected MY and both content traits, and 18 were significant for both content traits.
A very long list of positional candidate genes was obtained by overlapping the sheep genome reference (Oar_v4.0) with each QTLR. Overall, 745 annotated genes were detected but only a few of these were cited as dairy-related in previous studies on cattle [54] and sheep [55]. Among these, the most interesting genes were those in the casein cluster (CSN1S1, CSN1S2, CSN2 and CSN3), which is mapped to OAR6 within the 85.00-85.23 Mb interval. This interval overlaps with the position of the strongest signal for PP found in this study. At almost the same position, a QTL for PP was detected in Churra sheep by GWA [21]. A deeper investigation of this region is ongoing by whole-genome re-sequencing of individuals that carry BH with large effects at the significant location. The aim is to list the candidate causative mutations by performing specific association studies of all the polymorphisms included in the genomic region [22].
The QTLR that affects PP and FP on OAR3 at 137.3 Mb overlaps with the α-lactalbumin gene (LALBA). This gene was previously reported as a strong candidate for PP and has been deeply investigated in the Churra breed [16,21]. Two other interesting candidate genes are the growth hormone receptor (GHR) and transcription factor AP-2 gamma (TFAP2C) genes. GHR is located on OAR16 within the 31.83-32.00 Mb interval, where a QTLR that affects yield traits was detected. Previous studies in dairy cattle and sheep have shown that GHR affects milk production [54,56]. TFAP2C, which is involved in the development, differentiation, and oncogenesis of the mammary gland [55], overlaps with a QTLR that is significant for MY, FP and PP on OAR13 at 58.6 Mb.

Conclusions
We present a simple least square model to map QTL. It combines linkage disequilibrium and linkage analysis information and accounts for the polygenic effects of base gametes. The use of principal component analysis was found to be a good strategy to reduce the computational burden. A large number of regions associated to the variability of milk traits were identified. The outputs provided by this method are useful for the selection of individuals and genes that need to be further investigated for identifying causative mutations or markers in strong linkage disequilibrium with causative variants and for implementing them in genomic selection programs.
Additional file 1. Number of principal components (N.PC l ) needed to explain more than 99% of the variability due to the Sarda base gametes (BH S ) at each locus (43,390 SNPs). OAR (x-axis) Ovis aries autosomes. Additional file 4. Application of the proposed method to the XVI QTLMAS simulated population data and comparison to LDLA results presented by Garcia-Gamez et al. [53]. Manhattan plots showing − log 10 (nominal p-values) corresponding to the null hypothesis that the effects of principal components that explain 99% of the variability due to base gametes of the XVI QTLMAS simulated population at each locus (10,000 SNPs) are zero. The dashed black lines indicate the 0.05 genomewide significance threshold determined by Bonferroni correction for all the tests (10,000). Orange diamonds and grey vertical lines indicate the location of true simulated QTL [53]. Green triangles indicate QTL that were detected by variance component based LDLA mapping [33].

Authors' contributions
MGU developed the statistical methodology for QTL analyses, wrote the Fortran programs and drafted the manuscript. SC contributed to the overall design and the development of the methods, carried out the phenotypic analysis, participated in data interpretation and helped to draft the manuscript. TS, with the collaboration of PC and SM, performed the molecular analyses. SLS and SS participated in the data analyses and interpretation of results. AC conceived the overall design, undertook the project management, contributed to the development of the methods, interpretation of results and critically revised the manuscript. All authors read and approved the final manuscript.

Funding
This study was part of the MIGLIOVIGENSAR project funded by Centro Regionale di Programmazione (CRP), Regione Autonoma della Sardegna (LR n.7/2007 R.A.S.)

Availability of data and materials
The data that support the findings of this study are available from Centro Regionale di Programmazione (CRP), Regione Autonoma della Sardegna but restrictions apply to the availability of these data, which were used under license for the current study, and thus are not publicly available. However, data are available from the authors upon reasonable request and with permission of Centro Regionale di Programmazione (CRP), Regione Autonoma della Sardegna.
In this Appendix, we use a simple numerical example with few animals to explain how principal component analysis (PCA) is performed on IBD probability matrices. The example considers two males and 10 females with records and assumes that all the individuals are genotyped (Table 3).
Thus, in this example there are 6 BH ( n BH ) and 20 RH ( n RH , Table 4).
We assumed that the gametic phases of both BH and RH, for a map of m markers, had been reconstructed by the linkage disequilibrium multilocus iterative peeling method [43] and that alleles assigned with a probability lower than 0.99 were set as missing.

Analysis at the locus level
Let l be one of the m loci of the marker map. At this locus l , we assume that the grand-parental origin of each RH has been estimated with certainty by linkage analysis (LA), given the known gametic phases and the pedigree information (Table 5).
Thus, the n RH × n BH matrix of LA-based identity-bydescent ( IBD LA ) probabilities between RH and BH ( H l ) calculated by the Fernando and Grossman procedure [46] from the grand-parental origin probabilities depicted in Table 5 is:   Let us assume that the n BH × n BH matrix of LD-based identity-by-descent ( IBD LD ) probabilities between BH pairs, U l is: The n BH × n BH diagonal matrix F 1/2 l (in which the diagonal elements are the square root of the number of replicates stored in f l ) is: Then, the n BH × n BH IBD LD probabilities matrix weighted for the IBD LA probabilities ( U w l ), calculated as U w l = F 1/2 (2)), is: Eigenvalues of U w l , defined by using the Jacobi algorithm, are in Table 6. The first four principal components capture more than 99% of the variability of U w l and, thus, are retained as explanatory ( PC l ).
The n BH × n PC l (6 × 4) matrix of eigenvectors extracted from U w l and relating BH with the four PC l , V w l is: Finally, the n RH × n PC l (20 × 4) matrix ( V l ) that allocates the n PC l scores of RH, calculated as V l = H l F

Principal component analysis on G l
This section of the Appendix aims at demonstrating the equivalence between PCA that are carried out on U w l and PCA that are directly carried out on the LDLA IBD matrix G l that was built as described by Meuwissen et al. [25]. A n RH × n RH G l matrix can be calculated in our example by using IBD LD probabilities between BH pairs stored in U l and grand-parental origin probabilities (Table 7). It allocates IBD probabilities between RH pairs obtained by combining LD and LA information. The G l in our example is: Eigenvalues of G l , calculated by using the Jacobi algorithm, are depicted in Table 7. The eigenvalues of the first five PC (i.e. those not equal to zero) overlap precisely with those obtained with U w l (Table 6). Thus, even in this case the first four PC are retained as explanatory ( PC l ).
The PC l score of RH extracted from G l and stored in a n RH × n PC l matrix (say V G l ) are: Even in this case, V G l precisely overlap V l . demonstrating that in terms of results the PCA carried out on U w l is equivalent to the PCA carried out directly on G l . Nevertheless, PCA on U w l is much faster and less computationally demanding since it refers to a much smaller matrix G l .

Analysis at the genome-wide level
The n RH × n BH genome-wide IBD LA probabilities matrix H g , should be calculated, in a map with m markers, as H g = (H 1 + H 2 + H 3 + · · · + H m )/m . In this example, we assume that H g has the following final configuration: , and the vector of n BH genome-wide average number of replicates ( f g ) of BH over RH (i.e. the average number of replicates of each BH across-genome), which should be calculated as f g = (f 1 + f 2 + f 3 + · · · + f m )/m or, more simply, as f g = H ′ g 1 (where 1 is a vector of n RH ones), is: The n BH × n BH genome-wide IBD LD probabilities U g , which in real data is calculated as U g = (U 1 + U 2 + U 3 + · · · + U m )/m , is assumed here to have the following final configuration: Note that, we assumed that genome-wide (and thus all the locus-wide) IBD LD probabilities between the BH 2p and all other BH are zero (third row and third column) in order to mimic a gamete from another breed. Moreover, the off-diagonal elements in U g have more moderate values than those which may arise in U l . Because of this, PCA performed on a matrix U w g (calculated as described above for U w l ) does not result in a dimensional reduction as it was observed at the locus level (Table 6).
Let us assume that 1p, 1m, 2p, 2m have been selected as the base gametes with the highest impact ( BH h ) on the population (the demonstration of how BH h are selected in the real data would require a population much larger than that simulated here). The n BH × n BH h section of U g corresponding to the genome-wide IBD LD probabilities between all the BH with BH h , and denoted as U g_h , is: and the n BH h × n BH h matrix U g_hh corresponding to the portion of U g . including IBD LD . probabilities between BH h pairs is:  The n BH × n BH h matrix W , including coefficients relating BH h to all the BH and calculated as: W = U g_h U −1 g_hh (Eq (4)), is thus: The vector of n BH h updated number of replicates of BH h ( f g_h ) calculated as f g_h = W ′ f g is: Note that f g_h /n RH = 16.723/20 = 0.83 (this is the best value that can be obtained in this simulation but it is far away from the 0.99 obtained with real data).
The n BH h × n BH h diagonal matrix F 1/2 g_h (in which the diagonal elements are the root square of the average number of replicates stored in f g_h ) is: Finally, the n BH h × n BH h matrix U w g_hh calculated as U w g_hh = F  Eigenvalues of U w g_hh , which are calculated by using the Jacobi algorithm, are in Table 8. In this example, all four principal components are needed to capture more than 99% of the variability of U w g_hh and thus are retained as explanatory ( PC g ).
The n BH h × n PC g (4 × 4) matrix of eigenvectors extracted from U w g_hh , and relating BH h with the four PC g , V w g_hh is: Finally, the n RH h × n PC g (20 × 4) matrix ( V g ) allocating the PC l scores of RH, calculated as V g = H g WF