We assume that markers are summarised into a gene content matrix, m (m
ij
= -1, when the SNP j of individual i is 11, m
ij
= 0 for 12, and m
ij
= 1 for 22), and we use capital letters M
ij
to denote when the markers are random variables. For the genomic relationship matrix (1), the matrix p is the expectation of M, i.e. the entries in column j are p
j
= 2(ρ
j
- 1/2) with ρ
j
being the allele frequency of the second allele at loci j, and h is a diagonal matrix chosen such that E[G(M)] = A, the usual pedigree derived additive relationship matrix. In VanRaden [3] three different genomic relationship matrices are presented, where the first two are on the form in (1), and here, we focus on the first one
with s = ∑
j
2ρ
j
(1 - ρ
j
).
The model is as follows
where y is phenotype, X and Z are incidence matrices, β denotes fixed effects, e is error,
is the polygenic genetic effect, and
is the genomic genetic effect. Here A is the usual pedigree derived additive relationship matrix, and G*(mobs) is the extension of (2) to be derived in the following section.
In the following sections, first, we derive the extension of the marker based relationship matrix, G*(mobs), and second, we study the variance-covariance matrix of the combined genetic effect g + a. Then procedures for parameter estimation using AI-REML, and breeding value estimation are presented. Finally, a simulation data set is described.
Genomic relationship matrix with a relationship of markers
Gengler et al. [9] suggested that missing genotypes could be modelled using the usual mixed model methodology with relationship matrix A. We now combine that idea with the genomic relationship matrix on the form (1). For simplicity, the derivation is made for the form (2), but it is straight-forward to generalise to (1) also.
The model for the genomic genetic effect is as follows
where M is the gene content matrix. We assume that E[M
j
] = 1p
j
, Var(M
j
) = v
j
A, with A the usual relationship matrix, v
j
= 2ρ
j
(1 - ρ
j
), and s = ∑
j
v
j
. The covariances of M
j
, and M
j'
for two different loci j ≠ j' are on the form Cov(M
j
, M
j'
) = vj,j'A where the vj,j's are unspecified since they are cancelling in the derivations that follow.
We split M into two sub-matrices containing the animals with observed genotypes and those without, respectively,
and in the following we distinguish between small letter mobs (observed realisation of random variables Mobs) and capital letter Mmiss(unobserved markers are random variables). In Appendix A, the mean vector and variance-covariance matrix of the conditional distribution [g|mobs] (with Mmissmarginalised out) are shown to be
Where
When all animals have been genotyped, G*(mobs) = G(mobs), and when no animals have been genotyped, G*(mobs) = A, which makes the extension in (4) rather elegant. We assume that the distribution of [g|mobs] is multivariate normal, which for the non-genotyped animals is not strictly true, but an approximation.
The inverse of the genomic relationship matrix may be obtained from the inverse of A,
Using some algebra, the inverse of the genomic relationship matrix becomes
Considering the terms in (6), because of the low dimension of G(mobs) and A11 a direct inversion of these matrices should be possible for practical computations, and A-1 is a sparse matrix which can be computed directly without constructing A itself and using standard techniques. To compute A11 there might be cases where most of the A matrix has to be computed, potentially causing a memory storage problem.
Alternatively, A11 = ((A-1)-1)11 may be computed using the formula (5) on A-1 and using sparse matrix computation. The formula (6) requires that G(mobs) is invertible which may not actually be the case. In the next section this problem is automatically solved by combining the genomic genetic effect g with the polygenic effect a.
We also note that the determinant equals
where A22 - A21
A12 is easily obtained from A-1, and the determinant can be computed using sparse matrix computation.
The combined genetic effect
The combined genetic effect is the sum of the genomic genetic effect and the polygenic effect,
= g + a, and using this notation the model (3) may now be written as
where
. Introducing the notation
and
, then
with
= (1 - w)G*(mobs) + wA. Substituting (4) and rearranging the terms, we obtain
where
The parameter w is interpreted as the relative weight on the polygenic effect, and it may be estimated from data as shown in the next section or be chosen to equal a small value.
Similar to the previous section the inverse equals
and here G
w
is necessarily invertible when w > 0 (even when G(mobs) is singular).
Variance component estimation
Here we consider parameter estimation using average information (AI)-REML based on the mixed model equations [10, 11]
where
. We will not enter into details, but just note that the sparse structure of the left hand side matrix in (9) is the cornerstone for the fast computation of the AI-matrix used in the numerical maximisation of the REML likelihood. Considering the terms in this matrix, then ZTZ is a sparse matrix, and from (4) we see that
has some sparse structure, although
is a dense matrix. Depending on the proportion of animals genotyped it may in some cases not be necessarily advantageous to compute the AI-matrix using (9), but instead an AI-REML algorithm based on the inverse phenotypic variance-covariance matrix,
, could be used, see [12]. Here, we assume that the majority of animals are not genotyped and use the sparse structure of G*(mobs)-1 for AI-REML based on the mixed model equations.
The AI-REML method based on the mixed model equations is implemented in software DMU [13] and requires input in the form of the vector of phenotypes, the nonzero entries of
and the log-determinant log(det(
)) = log(det(G
w
)) + log(det(A22 - A21
A12)). For a given w the software provides estimates of
and
, values of the REML log-likelihood at the maximum and (when required) BLUE solution
and BLUP solution
. Here, the parameter w is estimated by using a grid of values, i.e. w = 0.01, 0.03, ..., 0.19, and computing the REML log-likelihood for each value. The resulting profile likelihood curve, log
, has a peak at the estimate
, and a measure of the associated uncertainty is the interval {w|log
> log
- 3.84} where 3.84 is the 95% quantile of a χ2(1)-distribution.
Breeding value estimation
Here we consider estimation (prediction) of breeding values. For animals included in the parameter estimation (animals with phenotypes, and some additional animals whose markers provide information about the unknown markers for non-genotyped animals with phenotypes), the GEBVs are the solution vector
to (9) with the parameter values being the estimated ones from the previous section. The software DMU provides these GEBVs and their precision.
For animals not included in the parameter estimation, then denoting this subset of animals by index 3 the GEBVs
are obtained by solving
where
, Z
all
and
now contain all animals. Again software DMU provides these GEBVs and their precision.
For a scenario with a large number of genotyped animals whose marker information does not provide information for the parameter estimation, Appendix B presents a method for breeding value estimation where only part of the
needs to be computed.
A simulated data set
The simulated data set is inspired by a pig nucleus breeding program, but is formulated in a simplified form. We assume, 10 chromosomes each 160 cM long, and a panel of p = 5000 equidistant SNP markers is used. It is assumed that 500 QTLs affect the phenotype, and the size of these effects is simulated from a Gamma(5.4, 0.42)-distribution. First, a base population consisting of 150 boars and 1500 sows is generated by assuming random mating for 50 generations in a population with an effective population size of 100. Then the following mating and selection scheme is followed for five generations. In each generation, 150 boars are mated with 1500 sows to produce 15000 offspring (half of them males). For the next generation, the 150 boars with the highest value of their own phenotype are selected, and 1500 sows are selected randomly. It is assumed that family records are available for all five generations, phenotypes of all boars available for all five generations (35000 records), and the selected boars in the last three generations are genotyped (450 animals). In addition, to estimate the allele frequencies required for the method, the 150 boars in the base population are genotyped (and the allele frequencies used are the estimated frequencies from these 150 boars). For prediction, it is assumed that 300 selection candidates (without phenotypes) for generation 6 are genotyped.
To evaluate the method advocated in this paper (one-step), two other methods are investigated. The first method (ped) computes traditional EBVs using the pedigree based relationship matrix (without using markers). The second method (two-step) is a two-step procedure similar to methods used in practical genomic selection [14, 15] and is based on genotyped animals only using the model
where y
EBV
is the vector of traditional EBVs, and
with G
w
= 0.99G(mobs) + 0.01A11.
For the one-step method, the genotypes of the selection candidates provide information about the genotypes of their (non-genotyped) mothers and hence information about other non-genotyped animals further back in the pedigree. Therefore they also provide some information about the genotypes of the boars without offspring, and since these boars have phenotypes but not genotypes then the selection candidates should be included in the parameter estimation. However, to investigate how important it is to include these animals, a second analysis (one-step-2) is also performed where they are not included. Finally, to investigate the importance of obtaining the allele frequencies in the base population, the scenario where the boars in the base population have not been genotyped is also studied. The use of three different allele frequencies are compared: 1) true allele frequencies (obtained from the 150 boars in the base population), 2) estimated allele frequencies for boars used in generation 3, 3) allele frequencies estimated using the approach by Gengler et al. [9].