- Research
- Open Access
- Published:

# Compatibility of pedigree-based and marker-based relationship matrices for single-step genetic evaluation

*Genetics Selection Evolution***volume 44**, Article number: 37 (2012)

The Erratum to this article has been published in Genetics Selection Evolution 2014 46:20

## Abstract

### Background

Single-step methods provide a coherent and conceptually simple approach to incorporate genomic information into genetic evaluations. An issue with single-step methods is compatibility between the marker-based relationship matrix for genotyped animals and the pedigree-based relationship matrix. Therefore, it is necessary to adjust the marker-based relationship matrix to the pedigree-based relationship matrix. Moreover, with data from routine evaluations, this adjustment should in principle be based on both observed marker genotypes and observed phenotypes, but until now this has been overlooked. In this paper, I propose a new method to address this issue by 1) adjusting the pedigree-based relationship matrix to be compatible with the marker-based relationship matrix instead of the reverse and 2) extending the single-step genetic evaluation using a joint likelihood of observed phenotypes and observed marker genotypes. The performance of this method is then evaluated using two simulated datasets.

### Results

The method derived here is a single-step method in which the marker-based relationship matrix is constructed assuming all allele frequencies equal to 0.5 and the pedigree-based relationship matrix is constructed using the unusual assumption that animals in the base population are related and inbred with a relationship coefficient *γ* and an inbreeding coefficient *γ*/2. Taken together, this *γ* parameter and a parameter that scales the marker-based relationship matrix can handle the issue of compatibility between marker-based and pedigree-based relationship matrices. The full log-likelihood function used for parameter inference contains two terms. The first term is the REML-log-likelihood for the phenotypes conditional on the observed marker genotypes, whereas the second term is the log-likelihood for the observed marker genotypes. Analyses of the two simulated datasets with this new method showed that 1) the parameters involved in adjusting marker-based and pedigree-based relationship matrices can depend on both observed phenotypes and observed marker genotypes and 2) a strong association between these two parameters exists. Finally, this method performed at least as well as a method based on adjusting the marker-based relationship matrix.

### Conclusions

Using the full log-likelihood and adjusting the pedigree-based relationship matrix to be compatible with the marker-based relationship matrix provides a new and interesting approach to handle the issue of compatibility between the two matrices in single-step genetic evaluation.

## Introduction

Single-step methods for genetic evaluation [1–3] have recently become popular because they provide an approach to incorporate genomic information into genetic evaluations that is both coherent and conceptually simple. A single-step method extends the usual pedigree-based method by replacing the additive relationship matrix constructed from pedigree by an additive relationship matrix that combines the marker-based relationship matrix for genotyped animals with the pedigree-based relationship matrix.

An issue with a single-step method is compatibility between the marker-based relationship matrix for genotyped animals and the pedigree-based relationship matrix [4–6]. To handle this problem, it is necessary to determine which allele frequencies should be used in the marker-based relationship matrix and to adjust this matrix to the pedigree-based relationship matrix. In theory, one should use the allele frequencies in the founder population of the pedigree (base animals) for the marker-based and pedigree-based relationships to be compatible, but these allele frequencies are rarely available in practice since base animals are not genotyped. Chen et al. [5] and Forni et al. [7] concluded that using the observed allele frequencies improved accuracy of prediction compared to using allele frequencies equal to 0.5. Studies on how to adjust the marker-based relationship matrix to be compatible with the submatrix of the pedigree-based relationship matrix for genotyped animals have been reported [4–6]. These adjustments consisted of scaling and adding a number to all elements in the marker-based relationship matrix based on equating means of diagonal and off-diagonal elements in the two matrices, and it was demonstrated that accuracy of prediction increased and bias decreased. In relation to this problem of compatibility between marker-based and pedigree-based relationship matrices, with data from routine evaluations, selection affects the allele frequencies over time, and in principle both observed marker genotypes and observed phenotypes contain information about allele frequencies in the base population. Therefore, studies on the adjustment of the marker-based relationship matrix to the pedigree-based relationship matrix have overlooked the fact that it should in principle incorporate information on observed phenotypes.

This work explores two possibilities to solve the problem, i.e. 1) in which the pedigree-based relationship matrix is adjusted to the marker-based relationship matrix and 2) in which the single-step genetic evaluation is extended by using a joint likelihood of observed phenotypes and observed marker genotypes. This results in a single-step method in which the marker-based relationship matrix is constructed assuming all allele frequencies are equal to 0.5 and the pedigree-based relationship matrix is constructed using the unusual assumption that animals in the base population are related and inbreed with a relationship coefficient *γ* and an inbreeding coefficient *γ* / 2. The log-likelihood function used for parameter inference contains two terms. The first term is the REML-log-likelihood for the phenotypes conditional on the observed marker genotypes and the second term is the log-likelihood for the observed marker genotypes. The performance of the proposed method was evaluated using two simulated datasets.

## Methods

This section presents, first, the statistical model on which the single-step methods are based along the lines of Christensen and Lund [3], then the proposal on how to adjust the pedigree-based relationship matrix and finally the adjustment of the marker-based relationship matrix as previously used.

### Single-step model

The marker genotypes are summarised into a marker matrix **m**, where *m*_{
ij
} = −1, 0 or 1 if SNP *j* of individual *i* is 11, 12, or 22, respectively. In the following, capital **M** and lowercase **m** indicate whether marker genotypes are considered as random variables or as non-random variables (observed variables or integration variables), respectively.

Let us consider the simple model

where **y** is the vector of phenotypes, *μ* is the general mean, **1** is a vector of ones, **a** is the vector of breeding values, **Z** is an incidence matrix, and **e** is the vector of residual errors. The breeding value may be decomposed into **a** = **g** + **a**_{
r
}, where **g** is the vector of genomic effects and **a**_{
r
} = (**a** − **g**) is the vector of residual polygenic effects. The residual polygenic effects ${\mathbf{a}}_{r}\sim N(\mathbf{0},{\sigma}_{a}^{2}\omega \mathbf{A})$, where matrix **A** is the pedigree-based additive relationship matrix and *ω* ∈[0;1] is the relative weight on the residual polygenic effect. The genomic values $[\mathbf{g}\mid \mathbf{M}]\sim N(\mathbf{0},{\sigma}_{a}^{2}(1-\omega \left)\mathbf{G}\right(\mathbf{M}\left)\right)$, with $\left(\right)close="">\mathbf{G}\left(\mathbf{M}\right)=(\mathbf{M}-(2\mathit{\rho}-1){\mathbf{1}}^{\mathrm{T}}){(\mathbf{M}-(2\mathit{\rho}-1){\mathbf{1}}^{\mathrm{T}})}^{\mathrm{T}}/s$, where ** ρ** = (

*ρ*

_{1}, … ,

*ρ*

_{ p }) with

*ρ*

_{ j }being the allele frequency for the

*j*th marker and $s=s\left(\mathbf{v}\right)=\sum _{j}{v}_{j}$ with

*v*

_{ j }being defined below. The model for the marker genotypes is that

**M**is a multivariate Gaussian distribution with

where *v*_{
j
} = *v*_{j,j} is a parameter and ${v}_{j{j}^{\prime}}=0$ for *j* ≠ *j*^{′}. The crude assumption that the multivariate distribution is Gaussian is crucial in the derivation, whereas the unrealistic assumption that ${v}_{j{j}^{\prime}}=0$ for *j* ≠ *j*^{′} is made for simplicity. Dividing marker genotypes **m** into observed marker genotypes **m**^{o} and un-observed marker genotypes **m**^{u}, the joint marginal density of observed phenotypes **y** and observed marker genotypes **m**^{o} is

where

By rearranging terms and using that $\int f(\mathbf{g}\mid {\mathbf{m}}^{o},{\mathbf{m}}^{u})f({\mathbf{m}}^{u}\mid {\mathbf{m}}^{o})d{\mathbf{m}}^{u}=f(\mathbf{g}\mid {\mathbf{m}}^{o})$, this becomes

where *f*(**y**∣**m**^{o}) is defined implicitly. The first term *f*(**y**∣**m**^{o}) is the density for the phenotypes given the observed marker genotypes, whereas the second term *f*(**m**^{o}) is the multivariate Gaussian density for the observed marker genotypes.

Single-step methods are based on the density *f*(**y**∣**m**^{o}), which may be written (see [3]) as

where the vector **a** has a mean zero and a variance-covariance matrix **H**_{
ω
} with inverse

with **G**_{
ω
} = (1 − *ω*)**G**(**m**^{o}) + *ω* **A**_{11}. In the above formula, the sub-division is made according to genotyped and non-genotyped animals. The matrices $\left(\right)close="">\mathbf{G}\left({\mathbf{m}}^{o}\right)=({\mathbf{m}}^{o}\phantom{\rule{0.3em}{0ex}}-\phantom{\rule{0.3em}{0ex}}(2\mathit{\rho}\phantom{\rule{0.3em}{0ex}}-\phantom{\rule{0.3em}{0ex}}1){\mathbf{1}}^{\mathrm{T}}){({\mathbf{m}}^{o}\phantom{\rule{0.3em}{0ex}}-\phantom{\rule{0.3em}{0ex}}(2\mathit{\rho}\phantom{\rule{0.3em}{0ex}}-\phantom{\rule{0.3em}{0ex}}1){\mathbf{1}}^{\mathrm{T}})}^{\mathrm{T}}/s$ and **A**_{11} are the marker-based relationship matrix for the genotyped animals and the submatrix of the pedigree-based relationship matrix corresponding to genotyped animals, respectively. The sparse structure of the matrix in equation (1) is the corner-stone for efficient computing using a single-step method. Assuming that **a**∣**m**^{o} is Gaussian distributed, mixed model equations can be solved for BLUP predictions and AI-REML [8] provides REML parameter estimates.

Issues raised by the single-step method are 1) how are the allele frequencies ** ρ** that are used in

**G**(

**m**

^{o}) obtained and 2) how is the required compatibility between

**G**(

**m**

^{o}) and

**A**

_{11}that is evident in equation (1) reached. To investigate these issues, the joint density of observed phenotypes and marker genotypes,

*f*(

**y**,

**m**

^{o}) =

*f*(

**y**∣

**m**

^{o})

*f*(

**m**

^{o}), is taken as the starting point. From this, the full marginal log-likelihood becomes

where ${\ell}_{\mathbf{y}\mid {\mathbf{m}}^{o}}({\sigma}_{a}^{2},{\sigma}_{e}^{2},\omega ,\mathit{\rho},s(\mathbf{v}\left)\right)$ for fixed *ω*, ** ρ** and $s\left(\mathbf{v}\right)=\sum _{j}{v}_{j}$ is the single-step REML-log-likelihood for the phenotypes conditional on the observed marker genotypes, and

with *n*_{1} denoting the number of genotyped animals, is the log-likelihood of the observed marker genotypes. Thus, the parameter ** ρ** enters into both terms of the full log-likelihood in equation (2), and this implies that estimation of allele frequencies

**should be in principle based on this log-likelihood. In particular, if selection has been performed the phenotypes will contain information about the allele frequencies which may cause bias if ignored. However, estimating**

*ρ***and**

*ρ**s*=

*s*(

**v**) by maximising $\left(\right)close="">\ell ({\sigma}_{a}^{2},\omega ,{\sigma}_{e}^{2},\mathit{\rho},\mathbf{v})$ for all parameters jointly is not feasible in practice since

**is a very high-dimensional parameter. Instead, in general the observed marker genotypes are used to estimate**

*ρ***and then they are plugged into $\left(\right)close="">{\ell}_{\mathbf{y}\mid {\mathbf{m}}^{o}}$, and this log-likelihood is used to estimate the remaining parameters. Estimation of**

*ρ***based on observed marker genotypes may consist of simply using the observed allele frequencies [5, 7], or may be done by maximising**

*ρ**ℓ*

_{ mark }(

*ρ***v**) as a function of

**(this is essentially the method of Gengler et al. [9], although, for computational reasons, that method adds a small residual error to the distribution of**

*ρ***M**). The high-dimensional parameter

**v**also enters into both terms of the full log-likelihood, although it only enters into $\left(\right)close="">{\ell}_{\mathbf{y}\mid {\mathbf{m}}^{o}}$ via the scaling parameter

*s*=

*s*(

**v**), and therefore estimation of

**v**should also be in principle based on maximising the full log-likelihood. Usually, estimating $s=\sum _{j}{v}_{j}$ is based on the observed marker genotypes, either using ${v}_{j}=2{\widehat{\rho}}_{j}(1-{\widehat{\rho}}_{j})$, implying $s=\sum _{j}2{\widehat{\rho}}_{j}(1-{\widehat{\rho}}_{j})$ as in [2, 3], or using

*s*such that the average diagonal of

**G**equals the average diagonal of

**A**

_{11}as in [7]. Furthermore, it has been demonstrated that the accuracy of prediction is improved and bias is reduced by adjusting the marker-based relationship matrix as

**G**(

**m**

^{o})

*β*+

*α*where

*α*and

*β*are based on the elements in

**G**(

**m**

^{o}) and

**A**

_{11}; further details are given below. It should be noted that such an adjustment is based on observed marker genotypes only, and lacks a theoretical justification within the framework considered here. To summarise, the allele frequencies in

**G**(

**m**

^{o}) and the adjustments necessary for the compatibility of

**G**(

**m**

^{o}) and

**A**

_{11}should be in principle derived based on both observed marker genotypes and observed phenotypes.

Furthermore, when computing BLUP breeding values $\widehat{\mathbf{a}}$ with plugged-in parameter estimates, the uncertainty in these parameters estimates is ignored, which is an important issue in the case of the high-dimensional parameter ** ρ**. Alternatively, uncertainty in parameter estimates may be incorporated into the predictions by using a Bayesian approach. Demonstration that a Bayesian approach with prior distributions on

**and**

*ρ***v**results in an method in which the pedigree-based relationship matrix is adjusted to a marker-based relationship matrix, is presented below.

### Adjusting the pedigree-based relationship matrix

The derivation of the proposed adjustment of the pedigree-based relationship matrix is based on assigning priors on the high-dimensional parameters ** ρ** and

**v**in the previously described single-step model, and then considers the first and second order moments of the marginal distribution of

**M**(integrating

**and**

*ρ***v**). Appendix A shows that the resulting marker distribution satisfies

with ${\stackrel{~}{v}}_{j}={\stackrel{~}{v}}_{\mathrm{jj}}=\stackrel{~}{s}/p$, *j* = 1, … , *p*, $\stackrel{~}{s}$ being a parameter, and ${\stackrel{~}{v}}_{j{j}^{\prime}}=0$ when *j* ≠ *j*^{′}. Matrix $\stackrel{~}{\mathbf{A}}\left(\gamma \right)$ is an additive relationship matrix that satisfies the usual recursions but with the peculiar feature that base animals in the pedigree are related and inbred. Within the population of base animals, the relationship coefficient is *γ* and the inbreeding coefficient is *γ* / 2. Table 1 contains data for a small pedigree used to derive matrix $\stackrel{~}{\mathbf{A}}$ as shown in Table 2. The mean and variance-covariance structure of **M** shown above is of the form in [3], with scaling parameter $\stackrel{~}{s}=\sum _{j}{\stackrel{~}{v}}_{j}$, and hence the breeding values **g** have a combined relationship matrix ${\mathbf{H}}_{\gamma ,\stackrel{~}{s},\omega}$, where the inverse is

with ${\mathbf{G}}_{\gamma ,\stackrel{~}{s},\omega}=(1-\omega ){\mathbf{m}}^{o}{\left({\mathbf{m}}^{o}\right)}^{\mathrm{T}}/\stackrel{~}{s}+\omega \stackrel{~}{\mathbf{A}}{\left(\gamma \right)}_{11}$. This construction is a single-step method for which the individuals in the base population are related and inbred, and the marker-based relationship matrix, ${\mathbf{m}}^{o}{\left({\mathbf{m}}^{o}\right)}^{\mathrm{T}}/\stackrel{~}{s}$, has allele frequencies equal to 0.5.

Based on the derivation above (see also Appendix A), the full marginal log-likelihood function for parameter inference becomes

where ${\stackrel{~}{\ell}}_{\mathbf{y}\mid {\mathbf{m}}^{o}}({\sigma}_{a}^{2},{\sigma}_{e}^{2},\omega ,\gamma ,\stackrel{~}{s})$ for fixed *ω*, *γ* and $\stackrel{~}{s}$ is the single-step REML-log-likelihood for the phenotypes conditional on the observed marker genotypes, with marker-based relationship matrix ${\mathbf{m}}^{o}{\left({\mathbf{m}}^{o}\right)}^{\mathrm{T}}/\stackrel{~}{s}$ and pedigree-based relationship matrix $\stackrel{~}{\mathbf{A}}\left(\gamma \right)$, and

is the log-likelihood of the observed marker genotypes. Using the log-likelihood in equation (4) instead of the log-likelihood in equation (2) makes the estimation of parameters feasible in practice by numeric maximization methods since the high-dimensional parameters ** ρ** and

**v**are replaced by two parameters, a parameter

*γ*that determines the relationship and inbreeding of individuals in the base population and a parameter $\stackrel{~}{s}$ that scales the marker-based relationship matrix.

The computations require algorithms that compute **A**(*γ*)^{−1} and **A**_{11}(*γ*) efficiently. Computations of inbreeding coefficients with related base animals have been considered by [10, 11] in the context of incomplete pedigrees. In Appendix B, algorithms are presented that extend the approaches of Quaas [12] for computing **A**^{−1} and of Colleau [13] for computing **A**_{11} to the case where base animals are related and inbred.

Maximisation of the full log-likelihood in equation (4) is done by first specifying a discrete three-dimensional grid of values for parameters $\omega ,\gamma ,\stackrel{~}{s}$ and then, for each value of $(\omega ,\gamma ,\stackrel{~}{s})$, computing the maximum values of the log-likelihood ${\stackrel{~}{\ell}}_{\mathbf{y}\mid {\mathbf{m}}^{o}}$ and the log-likelihood of observed marker genotypes ${\stackrel{~}{\ell}}_{\mathit{\text{mark}}}$. This provides a three-dimensional profile log-likelihood $\widehat{\stackrel{~}{\ell}}(\omega ,\gamma ,\stackrel{~}{s})$, which can then be assessed to find the maximum.

For faster computing, an alternative to using the full log-likelihood is to determine parameters *γ* and $\stackrel{~}{s}$ based on observed marker genotypes only, i.e. by maximising ${\stackrel{~}{\ell}}_{\mathit{\text{mark}}}(\gamma ,\stackrel{~}{s})$, and to estimate the remaining parameter based on ${\stackrel{~}{\ell}}_{\mathbf{y}\mid {\mathbf{m}}^{o}}({\sigma}_{a}^{2},{\sigma}_{e}^{2},\omega ,\gamma ,\stackrel{~}{s})$ for a grid of values for *ω*, with estimates of *γ* and $\stackrel{~}{s}$ plugged in. Setting the derivative of ${\stackrel{~}{\ell}}_{\mathit{\text{mark}}}(\gamma ,\stackrel{~}{s})$ with respect to $\stackrel{~}{s}$ equal to zero gives

which has the solution

Substituting $\widehat{\stackrel{~}{s}}\left(\gamma \right)$ into equation (5), we obtain

which has to be maximised numerically to estimate *γ*.

### Adjusting the marker-based relationship matrix (G-adjust)

Alternatively, an adjustment of the form **G**_{
a
} = **G** *β* + *α* is used, where **G** is the marker-based relationship matrix with allele frequencies $\widehat{\mathit{\rho}}$ equal to the observed ones and scaling parameter $\u015d=\sum _{j}2{\widehat{\rho}}_{j}(1-{\widehat{\rho}}_{j})$, and parameters *α* and *β* are determined by fitting **G**_{
a
} to **A**_{11}. This adjustment was used by VanRaden [14], Christensen et al. [6] and Gao et al. [15], with the first paper suggesting that *α* and *β* should be estimated by least square estimation, i.e. by minimizing the sum of squares of **G** *β* + *α* − **A**_{11} and the other two papers suggesting that they should be to estimated by equating means of diagonal elements and all elements in the two matrices. Here, the later is applied and *α* and *β* are estimated by solving the two equations

for *α* and *β*, where $\stackrel{\u0304}{G}$ and ${\u0100}_{11}$ denote means of all elements in the two matrices, and *dG* and *dA* denote means of diagonal elements in the two matrices.

### Simulated example 1

This example is deliberately simple and very extreme, and constructed for the purpose of showing that parameter estimates of *γ* and $\stackrel{~}{s}$ can depend on the observed phenotypes.

The base population consists of two individuals, one sire and one dam with 15 bi-allelic markers, and their genotypes are simulated assuming independence between markers and with equal allele frequencies. Furthermore, it is assumed that the markers are independently inherited, are all QTL with allele substitution effect equal to 1, and heritability of the phenotype is equal to 1.

The two base individuals produce 100 offspring (generation 1) that all have observed phenotypes. The two individuals in generation 1 with the largest own phenotype value are selected as parents and produce 100 offspring (generation 2) that are all genotyped.

Two different approaches to estimate parameters are compared. In the first approach, all parameters are estimated using the full log-likelihood in equation (4). In the second approach, *γ* and $\stackrel{~}{s}$ are estimated based only on the log-likelihood of the observed marker genotypes in equation (5), and the remaining parameters are estimated based on the REML-log-likelihood ${\stackrel{~}{\ell}}_{\mathbf{y}\mid {\mathbf{m}}^{o}}({\sigma}_{a}^{2},{\sigma}_{e}^{2},\omega ,\gamma ,\stackrel{~}{s})$, with estimates of *γ* and $\stackrel{~}{s}$ plugged-in.

### Simulated example 2

This example is inspired by a pig nucleus breeding scheme and consists of five generations in which all animals have recorded phenotypes. In each generation, 150 boars are mated to 1 500 sows to produce 15 000 offspring (50/50 males/females). For the next generation, boars with a high own phenotype value are chosen and sows are selected at random. The last three generations of selected boars are genotyped and a sixth generation of 300 candidate boars are also genotyped. The breeding value is the sum of 500 independent QTL effects simulated from a Gamma(5.4,0.42) distribution, and the heritability of the phenotype is 0.22. This dataset is described in more detail in Christensen and Lund [3].

Three different approaches are compared. In the first approach, all parameters are estimated using the full log-likelihood in equation (4). In the second approach, parameters *γ* and $\stackrel{~}{s}$ are estimated based only on the log-likelihood of the observed marker genotypes in equation (5), and the remaining parameters are estimated based on the REML-log-likelihood ${\stackrel{~}{\ell}}_{\mathbf{y}\mid {\mathbf{m}}^{o}}({\sigma}_{a}^{2},{\sigma}_{e}^{2},\omega ,\gamma ,\stackrel{~}{s})$, with estimates of *γ* and $\stackrel{~}{s}$ plugged in. Finally, the G-adjust approach is used for which *α* and *β* are estimated using equation (6) and the remaining parameters are estimated by $\left(\right)close="">{\ell}_{\mathbf{y}\mid {\mathbf{m}}^{o}}$. For all three approaches, the correlation between predicted breeding value $\widehat{\mathbf{a}}$ and true breeding value **a** for the candidate boars is reported as well as the estimated regression coefficient (reg) for the regression of **a** on $\widehat{\mathbf{a}}$, where deviation from one indicates bias.

## Results

### Simulated example 1

Table 3 shows that both *γ* and $\stackrel{~}{s}$ were smaller when estimated with the full log-likelihood than with the log-likelihood of the observed marker genotypes.

### Simulated example 2

Table 4 shows that *γ* and $\stackrel{~}{s}$ were slightly smaller when estimated with the full log-likelihood than with the log-likelihood of the observed marker genotypes. Parameter *ω* was about 0.375 whether estimated with the full log-likelihood or with the log-likelihood ${\stackrel{~}{\ell}}_{\mathbf{y}\mid {\mathbf{m}}^{o}}$, with parameter estimates of *γ* and $\stackrel{~}{s}$ from ${\stackrel{~}{\ell}}_{\mathit{\text{mark}}}$ plugged-in. When presented in three dimensions, the profile log-likelihood showed a very weak association between $(\gamma ,\stackrel{~}{s})$ and *ω*. A contour plot of the profile log-likelihood surface for *γ* and $\stackrel{~}{s}$ is shown in Figure 1 and the profile log-likelihood function for *ω* is shown in Figure 2.

Estimated parameters obtained with the G-adjust approach are also shown in Table 4, where it should be noted that $\widehat{\omega}=0.6$.

In terms of prediction performance, the correlation between predicted and true breeding value and the regression coefficient were 0.536 and 1.17, respectively, in both cases when the pedigree-based matrix was adjusted, but 0.493 and 1.34, respectively, when the G-adjust approach was used.

## Discussion

This paper provides a coherent approach to handle the issue of compatibility between pedigree-based and marker-based relationship matrices in single-step genetic evaluation. The approach is computationally fast and feasible for large datasets, as is the case for previously developed single-step methods. Parameter *γ* can adjust the pedigree-based relationship matrix to the marker-based relationship matrix that is scaled by parameter $\stackrel{~}{s}$, and these two parameters should in principle be estimated using the full log-likelihood for observed phenotypes and observed marker genotypes. This is computationally feasible by computing the full log-likelihood values for parameters *γ*, $\stackrel{~}{s}$ and *ω* in a three-dimensional grid. However, in practice this can be computationally burdensome and an appealing alternative is to estimate *γ* and $\stackrel{~}{s}$ based on the observed marker genotypes only. Analysis of simulated datasets, shows that the estimates of parameters *γ* and $\stackrel{~}{s}$ depend on the observed phenotypes as well as on the observed marker genotypes, although this dependence is not too large. A conjecture is that in a scenario with a small number of genotyped animals it is more important to use the full log-likelihood than in a scenario with a large number of genotyped animals. Based on these two simulated datasets only, no general conclusion can be made and further studies are needed to determine in which scenarios it would be safe to base inference on a two-step procedure in which *γ* and $\stackrel{~}{s}$ are estimated based on the observed marker genotypes only, and the remaining parameters are then estimated by the log-likelihood of the phenotypes conditional on the observed marker genotypes.

Using the approach developed in this paper may also provide insight on performance of other approaches for making marker-based and pedigree-based relationship matrices compatible in single-step genetic evaluation. Examples of such approaches are the adjustments of the marker-based relationship matrix reported in [4–6, 14, 15] and also investigated here, and the approach by Meuwissen et al. [16]. In [16], first, an average of position specific identical by descent matrices based on linkage information [17] (hereafter denoted **G**_{
FG
}) were used instead of the combined relationship matrix (1), and second, the final method consisted of replacing the pedigree-based relationship matrix **A** by the matrix **G**_{
FG
} in matrix (1) and then adjusting the marker-based relationship matrix to the **G**_{
FG
} matrix. Based on simple simulated datasets, without selection, the method resulted in high accuracy and low bias. The computation of **G**_{
FG
} is computationally burdensome and therefore this approach was not considered here. It should be noted that for these other approaches, the combined relationship matrix is constructed based on observed marker genotypes only, and the inference on remaining parameters is based on the log-likelihood conditional on the observed marker genotypes. The approach in this paper may provide guidelines on when it is safe to base inference on such procedures in which, first the observed marker genotypes are used to construct a combined relationship matrix, and second the remaining inference is based on the log-likelihood for the phenotypes conditional on observed marker genotypes.

Evaluation of accuracy and bias of prediction with the simulated dataset 2 shows that the approaches based on adjusting the pedigree-based relationship matrix, where $\text{Cor}(\widehat{\mathbf{a}},\mathbf{a})=0.536$ and reg = 1.17, seem to perform better than the G-adjust approach, where $\text{Cor}(\widehat{\mathbf{a}},\mathbf{a})=0.493$ and reg = 1.34. However, $\widehat{\omega}$ is also larger in the second case (0.60) than in the first case (0.375). If *ω* is set at 0.375 in the second case then $\text{Cor}(\widehat{\mathbf{a}},\mathbf{a})=0.534$, which is close to the value in the first case, but reg is somewhat larger, i.e. 1.30. If *ω* is set at 0.1, then $\text{Cor}(\widehat{\mathbf{a}},\mathbf{a})=0.545$ and reg = 1.00 in the first case, and $\text{Cor}(\widehat{\mathbf{a}},\mathbf{a})=0.550$ and reg = 1.12 in the second case, and if *ω* is set at 0.01, then $\text{Cor}(\widehat{\mathbf{a}},\mathbf{a})=0.548$ and reg = 1.05 in the second case. Thus, the first two approaches seem to perform better in terms of estimating a more proper *ω* and have somewhat less bias, while accuracy is similar for all three approaches.

This paper suggests that the pedigree-based relationship matrix should be adjusted instead of the marker-based relationship matrix. On the one hand, from a practical point of view this would be rather inconvenient since standard software used for REML estimation and BLUP would have to be modified, but on the other hand, it is conceptually simpler and may be easier to extend. For example, when the population consists of a mixture of breeds, it may be simpler to extend the approach in this paper and specify a parametric structure on the relationships of the animals in the base population and estimate those parameters, instead of developing an appropriate way of adjusting the marker-based relationship matrix of the genotyped animals across breeds. An additional issue is the interpretation of the genetic variance parameter ${\sigma}_{a}^{2}$. The parameter estimates in Table 1 are ${\widehat{\sigma}}_{a}^{2}=6.507$ when $\widehat{\gamma}=0.4605$, ${\widehat{\sigma}}_{a}^{2}=6.512$ when $\widehat{\gamma}=0.4615$, and ${\widehat{\sigma}}_{a}^{2}=5.084$ for G-adjust, where base animals are unrelated, i.e. *γ* = 0. This may seem counter-intuitive since the variance of breeding values for a base animal is ${\sigma}_{a}^{2}(1+\gamma /2)$. However, when studying the inverse relationship matrix $\stackrel{~}{\mathbf{A}}{\left(\gamma \right)}^{-1}$, then the averages of diagonal elements are 3.807, 3.809 and 2.93 for the three cases, respectively, and since 3.807/6.507, 3.809/6.512 and 2.93/5.084 are roughly the same, the parameter estimates actually make good sense, although the interpretation of ${\sigma}_{a}^{2}$ is unclear. Finally, it should be noted that parameter *γ* could influence to some extent accuracy and bias of prediction, and neither *γ* = 0 nor *γ* estimated to make marker-based and pedigree-based relationships compatible, would be optimal for that purpose. Adjusting the pedigree-based relationship instead of the marked-based relationship matrix to make the two matrices compatible in a single-step method is an interesting alternative.

For the simulated dataset 1, the estimate of the relative polygenic weight *ω* was about 0, as was expected since in the simulation the markers were assumed to capture all the genetic variation. However, for the simulated dataset 2, $\widehat{\omega}$ was large when adjusting the pedigree-based relationship matrix, i.e. 0.375, and even larger, i.e. 0.60, when the G-adjust approach was used. For the second dataset, the prediction biases were reduced when *ω* was set at 0.1. This poses the question whether it is actually possible to estimate *ω* at a reasonable value from data, or whether *ω* should be determined manually to control the bias. It should be noted that confidence intervals of this parameter are large in these examples and further studies are needed.

## Conclusions

Using the full log-likelihood and adjusting the pedigree-based relationship matrix to be compatible with the marker-based relationship matrix provides a new and interesting approach to handle the issue of compatibility between the two matrices in single-step genetic evaluation.

## Appendix A

The derivation of the proposed adjustment of the pedigree-based relationship matrix is based on assigning priors on the high-dimensional parameters ** ρ** and

**v**in the previously described single-step model. This derivation is presented here.

The model for the marker genotypes, **M**, given the allele frequencies ** ρ** and variances

**v**, is as in [3],

where *v*_{
j
} = *v*_{j,j} is a parameter and ${v}_{j{j}^{\prime}}=0$ for *j* ≠ *j*^{′}. Prior distributions for ** ρ** and

**v**are not specified explicitly, only the necessary assumptions are stated. Assuming that the alleles are randomly labeled 1 and 2, then a priori the expected allele frequency is E[

*ρ*

_{ j }] = 0.5. Additional assumptions are that

*η*

_{1}= Var[

*ρ*

_{ j }] and

*η*

_{2}= E[

*v*

_{ j }] do not depend on

*j*, and that $\text{Cov}[{\rho}_{j},{\rho}_{{j}^{\prime}}]=0$ for

*j*≠

*j*

^{′}.

With these prior distributions, a marginal distribution of **M** may be obtained by integrating ** ρ** and

**v**. Using well-known formulas for conditional expectations and covariances, the mean and covariances become

Defining ${\stackrel{~}{v}}_{j}=2{\eta}_{1}+{\eta}_{2}$, and *γ* = 4*η*_{1} / (2*η*_{1} + *η*_{2}), we obtain

with $\xc3{\left(\gamma \right)}_{i{i}^{\prime}}=(1+\gamma /2){A}_{i{i}^{\prime}}+\gamma $. It is not difficult to see that $\stackrel{~}{\mathbf{A}}\left(\gamma \right)$ satisfies the usual recursions for an additive relationship matrix. Requiring that 0 ≤ *γ* < 1 is equivalent to making a further assumption that 2*η*_{1} < *η*_{2} (this is for example satisfied when *ρ*_{
j
}∼*U*] 0;1 and [*v*_{
j
} = 2*ρ*_{
j
}(1 − *ρ*_{
j
}), where *η*_{1} = 1/12 and *η*_{2} = 1/3). The first and second order moments of **M** are therefore of the form considered in [3], with pedigree-based relationship matrix $\stackrel{~}{\mathbf{A}}\left(\gamma \right)$, allele frequencies ${\stackrel{~}{\rho}}_{j}=0.5$, scaling parameter $\stackrel{~}{s}=\sum _{j}{\stackrel{~}{v}}_{j}$ with ${\stackrel{~}{v}}_{j}$ not depending on *j*, and marker-based relationship matrix ${\mathbf{m}}^{o}{\left({\mathbf{m}}^{o}\right)}^{\mathrm{T}}/\stackrel{~}{s}$.

Note that the random labelling of alleles as 1 or 2 is not important because the resulting expression does not depend on how alleles are labelled.

Using ${\stackrel{~}{v}}_{j}=\stackrel{~}{s}/p$, the log-likelihood for the observed marker genotypes becomes

This log-likelihood may also be viewed as the log-likelihood for $\left(\right)close="">{\mathbf{M}}^{o}{\left({\mathbf{M}}^{o}\right)}^{\mathrm{T}}$ being Wishart distributed ${W}_{{n}_{1}}\left(\stackrel{~}{s}{\stackrel{~}{\mathbf{A}}}_{11}\right(\gamma )/p,p)$.

## Appendix B

The aim here is to present algorithms for computing ${\left(\stackrel{~}{\mathbf{A}}\right(\gamma \left)\right)}^{-1}$ and $\stackrel{~}{\mathbf{A}}{\left(\gamma \right)}_{11}$ by recursions. Computations of inbreeding coefficients with related based animals were considered by [10, 11]. For simplicity, let’s assume that, either both parents are known or both parents are unknown. This is not an important restriction in practice, since the case with one unknown parent can be handled by assigning an artificial animal-id to this unknown parent and by letting both its parents be unknown.

Let us partition matrix $\stackrel{~}{\mathbf{A}}$ (skipping the dependence on *γ* in the notation from now and onwards) according to whether the animals are base animals (both parents unknown) or not

Similar to the usual **A** matrix (see [18]) a decomposition exists

where **T** is a lower triangular matrix with entries *T*_{
ii
} = 1, *T*_{
ik
} = (*T*_{f(i)k} + *T*_{m(i)k}) / 2 when *i* has known parents *f*(*i*) and *m*(*i*) and *i* > *k*, and *T*_{
ik
} = 0 otherwise (*k* > *i* or *i* has unknown parents), and $\stackrel{~}{\mathbf{D}}$ is a diagonal matrix containing the variance of the Mendelian sampling part for matrix ${\stackrel{~}{\mathbf{A}}}^{\mathit{\text{nonbas}}}$, i.e. ${\stackrel{~}{D}}_{\mathrm{ii}}=1-({\xc3}_{f\left(i\right)f\left(i\right)}+{\xc3}_{m\left(i\right)m\left(i\right)})/4$. Comparing this decomposition to the decomposition for the usual **A** matrix, then matrix ${\stackrel{~}{\mathbf{A}}}^{\mathit{\text{bas}}}$ has replaced an identity matrix and $\stackrel{~}{\mathbf{D}}$ has replaced a diagonal matrix containing the Mendelian sampling terms related to **A** for the animals with known parents.

The inverse matrix becomes

Here **T**^{−1} is a lower triangular matrix with ones on the diagonal and the only non-zero elements being −1 / 2 for offspring-parent entries, and matrix ${\left({\stackrel{~}{\mathbf{A}}}^{\mathit{\text{bas}}}\right)}^{-1}$ has diagonal elements equal to (1 + (*n*_{
bas
}−3/2)*γ*) / ((1−*γ*/2)(1 + (*n*_{
bas
}−1/2)*γ*)) and off-diagonal elements equal to − *γ* / ((1 − *γ* / 2)(1 + (*n*_{
bas
} − 1 / 2)*γ*)), where *n*_{
bas
} is the dimension of ${\stackrel{~}{\mathbf{A}}}^{\mathit{\text{bas}}}$.

A procedure to obtain the diagonal of $\stackrel{~}{\mathbf{A}}$ and the inverse ${\left(\stackrel{~}{\mathbf{A}}\right)}^{-1}$ can be constructed similar to the algorithm by Quaas [12], which utilises the form $\stackrel{~}{\mathbf{A}}=\mathbf{L}{\mathbf{L}}^{T}$ where

Matrix $\sqrt{{\stackrel{~}{\mathbf{A}}}_{\mathit{\text{bas}}}}$ is a lower triangular matrix such that ${\stackrel{~}{\mathbf{A}}}_{\mathit{\text{bas}}}=\sqrt{{\stackrel{~}{\mathbf{A}}}_{\mathit{\text{bas}}}}{\left(\sqrt{{\stackrel{~}{\mathbf{A}}}_{\mathit{\text{bas}}}}\right)}^{\mathrm{T}}$. First, matrix $\sqrt{{\stackrel{~}{\mathbf{A}}}_{\mathit{\text{bas}}}}$ is computed and ${\xc3}_{\mathrm{ii}}$ is set to 1 + *γ* / 2 for the base animals. Second, the remaining part of the algorithm is the same as in Quaas [12], where recursively, rows in *L* are computed using ${\stackrel{~}{D}}_{\mathrm{ii}}=1-({\xc3}_{f\left(i\right)f\left(i\right)}+{\xc3}_{m\left(i\right)m\left(i\right)})/4$ and ${\xc3}_{\mathrm{ii}}=\sum _{k=1}^{i}{L}_{\mathrm{ik}}^{2}$. Obtaining ${\left(\stackrel{~}{\mathbf{A}}\right)}^{-1}$ at the same time is done by first setting elements in ${\left(\stackrel{~}{\mathbf{A}}\right)}^{-1}$ for base animals equal to the elements in ${\left({\stackrel{~}{\mathbf{A}}}^{\mathit{\text{bas}}}\right)}^{-1}$, and then adding elements to ${\left(\stackrel{~}{\mathbf{A}}\right)}^{-1}$ in the usual way (assuming a half-stored format, add $1/{\stackrel{~}{D}}_{\mathrm{ii}}$ to the (*i*,*i*) element, add $-1/\left(2{\stackrel{~}{D}}_{\mathrm{ii}}\right)$ to the (*i*,*f*(*i*)) and (*i*,*m*(*i*)) elements, and add $1/\left(4{\stackrel{~}{D}}_{\mathrm{ii}}\right)$ to the (*f*(*i*),*f*(*i*)), (*m*(*i*),*m*(*i*)) and (*m*(*i*),*f*(*i*)) elements).

Matrix ${\stackrel{~}{\mathbf{A}}}_{11}$ can be obtained by a modification of the Colleau algorithm [13, 19, 20] by using an algorithm to compute $\stackrel{~}{\mathbf{A}}\mathbf{x}$, where **x** is a vector. The product of matrix $\stackrel{~}{\mathbf{A}}$ and vector **x** is

Therefore, an algorithm (with notation similar to that in Appendix A in Misztal et al. [19]) consists of first computing **r** = **T**^{T}**x** by solving the sparse system $\left(\right)close="">{\left({\mathbf{T}}^{-1}\right)}^{\mathrm{T}}\mathbf{r}=\mathbf{x}$ for **r**, then computing

and finally computing $\stackrel{~}{\mathbf{A}}\mathbf{x}={\mathbf{T}}^{\mathrm{T}}\mathbf{t}$ by solving the sparse system ${\left({\mathbf{T}}^{-1}\right)}^{\mathrm{T}}\left(\stackrel{~}{\mathbf{A}}\mathbf{x}\right)=\mathbf{r}$ for $\stackrel{~}{\mathbf{A}}\mathbf{x}$.

## References

- 1.
Legarra A, Aguilar I, Misztal I: A relationship matrix including full pedigree and genomic information. J Dairy Sci. 2009, 92: 4656-4663. 10.3168/jds.2009-2061.

- 2.
Aguilar I, Misztal I, Johnson DL, Legarra A, Tsuruta S, Lawlor TJ: Hot topic: A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluations of Holstein final score. J Dairy Sci. 2010, 93: 743-752. 10.3168/jds.2009-2730.

- 3.
Christensen OF, Lund MS: Genomic prediction when some animals are not genotyped. Genet Sel Evol. 2010, 42: 2-10.1186/1297-9686-42-2.

- 4.
Vitezica ZG, Aguilar I, Misztal I, Legarra A: Bias in genomic predictions for populations under selection. Genet Res. 2011, 93: 357-366. 10.1017/S001667231100022X.

- 5.
Chen CY, Misztal I, Aguilar I, Legarra A, Muir WM: Effect of different genomic relationship matrices on accuracy and scale. J Anim Sci. 2011, 89: 2673-2679. 10.2527/jas.2010-3555.

- 6.
Christensen OF, Madsen P, Nielsen B, Ostersen T, Su G: Single-step methods for genomic evaluation in pigs. Animal. 2012, 6: 1565-1571. 10.1017/S1751731112000742.

- 7.
Forni S, Aguilar I, Misztal I: Different genomic relationship matrices for single-step analysis using phenotypic, pedigree and genomic information. Genet Sel Evol. 2011, 43: 1-10.1186/1297-9686-43-1.

- 8.
Gilmour AR, Thompson R, Cullis BR: Average information REML: an efficient algorithm for parameter estimation in linear mixed models. Biometrics. 1995, 51: 1440-1450. 10.2307/2533274.

- 9.
Gengler N, Mayeres P, Szydlowski M: A simple method to approximate gene content in large pedigree populations: application to the myostation gene in dual-purpose Belgian Blue cattle. Animal. 2007, 1: 21-28. 10.1017/S1751731107392628.

- 10.
VanRaden PM: Accounting for inbreeding and crossbreeding in genetic evaluation of large populations. J Dairy Sci. 1992, 75: 3136-3144. 10.3168/jds.S0022-0302(92)78077-1.

- 11.
Colleau JJ, Sargolzaei M: MIM: an indirect method to assess inbreeding and coancestry in large incomplete pedigrees of selected dairy cattle. J Anim Breed Genet. 2011, 128: 163-173. 10.1111/j.1439-0388.2010.00899.x.

- 12.
Quaas RL: Computing the diagonal elements and inverse of a large numerator relationship matrix. Biometrics. 1976, 32: 949-953. 10.2307/2529279.

- 13.
Colleau JJ: An indirect approach to the extensive calculation of relationship coefficients. Genet Sel Evol. 2002, 34: 409-421. 10.1186/1297-9686-34-4-409.

- 14.
VanRaden PM: Efficient methods to compute genomic predictions. J Dairy Sci. 2008, 91: 4414-4423. 10.3168/jds.2007-0980.

- 15.
Gao H, Christensen OF, Madsen P, Nielsen US, Zhang Y, Lund MS, Su G: Comparison on genomic predictions using GBLUP models and two single-step blending methods with different relationship matrices in the Nordic Holstein population. Genet Sel Evol. 2012, 44: 8-10.1186/1297-9686-44-8.

- 16.
Meuwissen THE, Luan T, Woolliams JA: The unified approach to the use of genomic and pedigree information in genomic evaluations revisited. J Anim Breed Genet. 2011, 128: 429-439. 10.1111/j.1439-0388.2011.00966.x.

- 17.
Fernando RL, Grossman M: Marker assisted selection using best linear unbiased prediction. Genet Sel Evol. 1989, 21: 467-477. 10.1186/1297-9686-21-4-467.

- 18.
Mrode RA: Linear models for the prediction of animal breeding values. 2005, Wallingford: CABI Publishing

- 19.
Misztal I, Legarra A, Aguilar I: Computing procedures for genetic evaluation including phenotypic, full pedigree and genomic information. J Dairy Sci. 2009, 92: 4648-4655. 10.3168/jds.2009-2064.

- 20.
Aguilar I, Misztal I, Legarra A, Tsuruta S: Efficient computation of the genomic relationship matrix and other matrices used in single-step evaluation. J Anim Breed Genet. 2011, 128: 422-428. 10.1111/j.1439-0388.2010.00912.x.

## Acknowledgements

The work was performed in a project funded through the Green Development and Demonstration Programme (grant no. 3405-11-0279) by the Danish Ministry of Food, Agriculture and Fisheries, the Pig Research Centre and Aarhus University. The author thanks two anonymous reviewers, an associate editor and editor Jack Dekkers for their comments, and in particular editor Hélène Hayes for her careful and competent language revisions.

## Author information

## Additional information

### Competing interests

The author declares that he has no competing interests.

### Authors’ contributions

The author has read and approved the final manuscript.

An erratum to this article is available at http://dx.doi.org/10.1186/1297-9686-46-20.

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

#### Received

#### Accepted

#### Published

#### DOI

### Keywords

- Simulated Dataset
- Relationship Matrix
- Marker Genotype
- Genotyped Animal
- Base Animal