The relationship between the animals in the test and reference data sets has an effect on the accuracy of genomic predictions. Close relationships between the two data sets result in the highest accuracy for GEBV. Similar results were predicted by Hayes et al. and observed by Habier et al. [3, 4] for populations that share a close relationship. However, breeding values that are predicted for closely related animals using the traditional pedigree-based BLUP approach also achieve high accuracy. The current study has shown that when there is a distant relationship between the animals in the test and reference data sets, gBLUP is still able to predict an animal's breeding value with some accuracy. Furthermore, when the animals are unrelated by pedigree or when the pedigree relationships are low, gBLUP can use information from distant relatives to retain part of the accuracy of the GEBV.
The information gathered from only distantly related animals enabled an estimate of breeding value to be made with some accuracy. However, when relatives were included in the reference data set, the importance of information on distantly related animals may be reduced. Selection index theory shows that when information on closely related animals is available, more weight is placed on this information and therefore information from distantly related animals becomes less important. Although the importance of information from distant relatives is reduced, this extra information, which is not used in pedigree-based methods, enables gBLUP to achieve a higher accuracy of the EBV. The inclusion of information on relatives improves the accuracy of the predicted breeding values.
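The selection index argument can be illustrated with a minimal sketch. It assumes, purely for illustration, that each information source is a record x_i = a + e_i on the breeding value a, with var(a) = 1 and mutually independent errors; real sources of relatives' information are correlated, so the numbers are indicative only. The function name `index_weights` and the stand-alone accuracy of 0.60 for close relatives are hypothetical; 0.34 is the baseline accuracy reported for unrelated individuals.

```python
import math


def index_weights(standalone_accuracies):
    """Selection index weights and combined accuracy for sources x_i = a + e_i.

    Assumes var(a) = 1 and independent errors, so a source with stand-alone
    accuracy r_i has error variance v_i = 1 / r_i**2 - 1. Solving P b = g with
    P = diag(v) + 11' and g = 1 gives b_i = (1 / v_i) / (1 + sum_j 1 / v_j).
    """
    precisions = [r * r / (1.0 - r * r) for r in standalone_accuracies]  # 1 / v_i
    total = sum(precisions)
    weights = [p / (1.0 + total) for p in precisions]
    combined_accuracy = math.sqrt(total / (1.0 + total))
    return weights, combined_accuracy


# Distant relatives only (baseline accuracy of 0.34 from the text).
w_distant, acc_distant = index_weights([0.34])

# Distant relatives plus a hypothetical close-relative source (accuracy 0.60).
w_both, acc_both = index_weights([0.34, 0.60])

print(f"weight on distant source alone:    {w_distant[0]:.3f}")
print(f"weight on distant source combined: {w_both[0]:.3f}")
print(f"combined accuracy:                 {acc_both:.2f}")
```

Under these assumptions the weight on the distant source falls once the close-relative source is added, while the combined accuracy (about 0.64) still exceeds that of the close-relative source alone.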
If there are no close relationships between animals in the reference and test data sets, the accuracy of the GEBV is driven by distant relationships, which will be more useful when there is more LD in the population. The accuracy obtained for these animals can be called the 'baseline accuracy', which is the accuracy that may be expected for a member of the population that does not have any close relatives in the reference data set. Goddard [6] and Daetwyler et al. [15] proposed predictive formulae for the accuracy of genomic predictions. These methods depend on the size of the reference data set, the effective population size of the breed ($N_e$), the heritability of the trait and the length of the genome [6]. The overall $N_e$ will govern the effective number and size of chromosome segments ($M_e$) that are segregating in the population. If the effective population size is small, it is expected that animals will share larger chromosome segments and the genomic predictions will be more accurate [5]. The accuracy ($r$) for an individual with no phenotype, as described by Goddard [6], is then predicted as:

$$r = \sqrt{1 - \frac{\lambda}{2N\sqrt{a}} \log \rho}$$

where $\rho = (1 + a + 2\sqrt{a})/(1 + a - 2\sqrt{a})$, with $a = 1 + 2\lambda/N$ and $N$ the number of animals in the reference data set; $\lambda = \sigma_e^2/\sigma_u^2$, where $\sigma_e^2$ is the residual variance and $\sigma_u^2$ is the genetic variance at a single locus, estimated by $\sigma_u^2 = h^2/(M_e k)$; $M_e = 2N_eL$ is the effective number of chromosome segments, $L$ is the length of the genome, $h^2$ is the heritability and $k = 1/\log(2N_e)$. For the simulation example, $N = 1750$, $N_e = 100$, $h^2 = 0.3$ and $L = 30$. Then $k = 0.189$, $\lambda = 3773.8$, $a = 5.31$ and consequently the accuracy for an individual with no phenotype was equal to 0.36. Similarly, the alternative method described by Daetwyler et al. [15] results in a predicted accuracy of 0.28 (details not shown). The predicted accuracies resulting from either method were similar to the baseline accuracy achieved in our study by gBLUP in unrelated individuals (0.34). In the theoretical prediction methods, there is some ambiguity about the approximation of $M_e$ [5, 23], with proposed values equal to: a) $2N_eL/\log(4N_eL)$; b) $4N_eL$ and c) $2N_eL$. Using the formula above for each of these values results in predicted accuracies of a) 0.74, b) 0.27 and c) 0.36. Consequently, $2N_eL$ appears to be the most appropriate approximation of $M_e$ for predicting baseline accuracy in our simulation example. For the Merino sheep data, with an estimated $N_e$ of approximately 1,000, the expected accuracy was 0.15, lower than that achieved by gBLUP for EMD (0.28) and for SC_WT (0.18). The higher accuracy of gBLUP in the real data is possibly due to extra information from animals that shared a genomic relationship that was not recorded in the pedigree; alternatively, the estimate of $N_e$ may have been affected by heterogeneity of the breed, which in fact consists of several sub-populations.
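The arithmetic of the simulation example can be retraced with a short sketch. It assumes that the formula takes the form r = sqrt(1 - λ/(2N√a) · log ρ) with ρ = (1 + a + 2√a)/(1 + a - 2√a), that all logarithms are natural, that the residual variance is scaled to 1 so that λ = M_e·k/h² (an inference from the quoted value of λ, not something the text states), and that option a) for M_e is 2N_eL/log(4N_eL). The function and variable names are illustrative.

```python
import math


def goddard_accuracy(n_ref, m_e, n_e, h2):
    """Predicted accuracy for an individual with no phenotype.

    Assumes lambda = M_e * k / h2 (residual variance scaled to 1)
    and natural logarithms throughout.
    """
    k = 1.0 / math.log(2.0 * n_e)
    lam = m_e * k / h2
    a = 1.0 + 2.0 * lam / n_ref
    root_a = math.sqrt(a)
    rho = (1.0 + a + 2.0 * root_a) / (1.0 + a - 2.0 * root_a)
    r_squared = 1.0 - lam / (2.0 * n_ref * root_a) * math.log(rho)
    return math.sqrt(r_squared)


# Parameters of the simulation example as given in the text.
N_REF, N_E, H2, GENOME_LEN = 1750, 100, 0.3, 30

# The three proposed approximations of M_e.
m_e_options = {
    "a": 2 * N_E * GENOME_LEN / math.log(4 * N_E * GENOME_LEN),
    "b": 4 * N_E * GENOME_LEN,
    "c": 2 * N_E * GENOME_LEN,
}

predicted = {
    label: goddard_accuracy(N_REF, m_e, N_E, H2)
    for label, m_e in m_e_options.items()
}

for label, r in predicted.items():
    print(f"option {label}: M_e = {m_e_options[label]:.0f}, r = {r:.2f}")
```

Under these assumptions the three options give accuracies of about 0.74, 0.27 and 0.36, in line with the values quoted above. The intermediate λ for option c) comes out at roughly 3775, marginally different from the 3773.8 quoted, which does not change the rounded accuracy.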
Accuracy estimated using the prediction error variance of the mixed model equations (r(PEV)) was shown to be a good approximation of empirical accuracy for the simulation example. Estimated and empirical accuracies were also very similar when using gBLUP for the EMD example. However, some differences between r(PEV) and empirical accuracy were observed for both BLUP-D and gBLUP in the real data in the case of SC_WT. In the simulation example, the empirical accuracy was the correlation between the TBV and EBV (or GEBV), whereas in the Merino data example, the empirical accuracy was the correlation between the ASBV and EBV (or GEBV). The ASBVs are progeny test estimates and have some prediction error associated with them. The empirical accuracy was also likely to be affected by sampling because of the small size of each test data set (50 to 60 animals). Furthermore, unlike the simulation data, where all animals were linked by a true pedigree, many Merino animals in the unrelated test set had no direct pedigree relationships with the reference data set, and therefore breeding values of zero were estimated for these animals. In contrast, in the case of missing pedigree, gBLUP could use genomic relationship information, and a more accurate breeding value was estimated for all animals in the test set.
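The two kinds of accuracy compared here can be sketched as follows. The text does not spell out how r(PEV) was computed; the sketch assumes the common definition r = sqrt(1 - PEV/σ²_a), ignoring inbreeding, and uses the large-sample approximation SE(r) ≈ (1 - r²)/sqrt(n - 1) to indicate how much an empirical correlation can vary in a small test set. All function names are illustrative.

```python
import math


def accuracy_from_pev(pev, sigma2_a):
    """Accuracy implied by a prediction error variance (assumed definition)."""
    return math.sqrt(1.0 - pev / sigma2_a)


def empirical_accuracy(reference_values, predictions):
    """Pearson correlation between reference values (TBV or ASBV) and (G)EBV."""
    n = len(reference_values)
    mean_x = sum(reference_values) / n
    mean_y = sum(predictions) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(reference_values, predictions))
    sxx = sum((x - mean_x) ** 2 for x in reference_values)
    syy = sum((y - mean_y) ** 2 for y in predictions)
    return sxy / math.sqrt(sxx * syy)


def approx_se_correlation(r, n):
    """Large-sample approximation to the standard error of a correlation."""
    return (1.0 - r * r) / math.sqrt(n - 1)


# Sampling error for a correlation of 0.28 (gBLUP, EMD) in a test set of 55 animals.
se_emd = approx_se_correlation(0.28, 55)
print(f"approximate SE of r = 0.28 with n = 55: {se_emd:.2f}")
```

With only 50 to 60 animals per test set, this approximation gives a standard error above 0.1, so differences of that order between r(PEV) and the empirical correlation are within what sampling alone could produce.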
Another complexity in our real data example is the heterogeneity of the Merino sheep population, which consists of many sub-populations. In routine ASBV analyses, this population structure is accounted for using pedigree information and genetic groups based on individual flock data. When correlating GEBV and ASBV, we accounted for sub-population effects by assigning sires to groups of "fine wool", "medium wool" and "strong wool". Empirical accuracies for SC_WT were clearly affected by correcting for the sub-population structure, which may explain why there are some differences between r(PEV) and r(cor) for this trait. The corrections had little to no effect on empirical accuracy for EMD. Note that EMD was corrected for SC_WT, and this may have removed some of the sub-population effects on EMD.
The makeup of reference data sets is an important factor in the design of genomic evaluation systems that aim to deliver additional genetic gain from genomic selection at the lowest cost. This is especially true for beef cattle and sheep breeding programs that do not have a distinct nucleus tier. We have shown that genomic predictions are more accurate when animals are related to the reference data set; however, a substantial baseline accuracy can be achieved for all animals in the population. To achieve this, the reference data set will need to include a large number of animals that cover the genetic diversity of the given population (breed). It may be important to include animals that are expected to contribute more to the future gene pool of that breed, but these contributions need to be balanced against contributions to genetic diversity.
The optimal size of the reference data set will depend on the $N_e$ of the given population; populations with higher $N_e$ may need a larger reference data set so that suitable baseline accuracies can be achieved. If the baseline accuracy is low (large $N_e$ and small reference data set size), the contribution of relatives' information will be larger; however, this information is limited to closely related individuals and will not last over many generations.
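The dependence of reference size on effective population size can be explored with a minimal sketch based on the Goddard-type prediction. It assumes M_e = 2N_eL, natural logarithms, λ = M_e·k/h² with the residual variance scaled to 1, and the heritability and genome length of the simulation example. The target accuracy of 0.5, the second N_e value and the function names are hypothetical choices for illustration, not values from the study.

```python
import math


def baseline_accuracy(n_ref, n_e, h2, genome_len):
    """Goddard-type predicted accuracy, assuming M_e = 2 * N_e * L."""
    m_e = 2.0 * n_e * genome_len
    k = 1.0 / math.log(2.0 * n_e)
    lam = m_e * k / h2
    a = 1.0 + 2.0 * lam / n_ref
    root_a = math.sqrt(a)
    # log(rho) written as 2 * log((sqrt(a) + 1) / (sqrt(a) - 1)) for stability.
    log_rho = 2.0 * math.log((root_a + 1.0) / (root_a - 1.0))
    r_squared = 1.0 - lam / (2.0 * n_ref * root_a) * log_rho
    return math.sqrt(max(r_squared, 0.0))


def reference_size_for(target, n_e, h2, genome_len):
    """Smallest reference size (found by bisection) reaching the target accuracy."""
    if baseline_accuracy(1, n_e, h2, genome_len) >= target:
        return 1
    low, high = 1, 2
    while baseline_accuracy(high, n_e, h2, genome_len) < target:
        low, high = high, high * 2
    while high - low > 1:
        mid = (low + high) // 2
        if baseline_accuracy(mid, n_e, h2, genome_len) >= target:
            high = mid
        else:
            low = mid
    return high


TARGET = 0.5  # hypothetical target baseline accuracy
sizes = {n_e: reference_size_for(TARGET, n_e, 0.3, 30) for n_e in (100, 1000)}
for n_e, n_ref in sizes.items():
    print(f"N_e = {n_e}: reference size of about {n_ref} animals")
```

Because the prediction depends on the reference size only through the ratio λ/N, the required reference size under these assumptions scales in proportion to λ, and therefore rises steeply with $N_e$.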