The analysed data comprised SNP summary statistics from GWAS performed on 5062 Danish Holstein bulls, 924 Danish Red Dairy Cattle bulls, and 2122 Finnish Red Dairy Cattle bulls [4]. The association was calculated for 25.4 million variants that were imputed with Minimac2 [5] from 630,000 SNPs using the 1000 Bull Genomes reference population from Run4, consisting of 1147 individuals. SNP additive effects were estimated for deregressed estimated breeding values (EBV) for stature. These were used as pseudophenotypes, separately for each breed, with a single SNP mixed linear model including an additive polygenic effect with a covariance described by a genomic relationship matrix. The model was implemented via the EMMAX software [6].
Based on their ID number, SNPs were annotated to genes corresponding to the ARS-UCD1.2 reference genome using the Bioconductor BioMart tool version 3.14 [7] and then genes were annotated to KEGG reference pathways (map) using the David software version 6.8 [8]. The effects of KEGG pathways on stature were estimated separately for each breed using the following mixed linear model that accounted for the similarity between pathways:
$$\mathbf{y}=\mathbf{1}\upmu +\mathbf{Z}\mathbf{t}+\mathbf{e},$$
(1)
where \(\mathbf{y}\) is the vector of absolute values of SNP additive effects on stature that are estimated separately for each breed in the GWAS of Bouwman et al. [4], \(\mathbf{1}\) is a vector of ones, \(\upmu\) represents the general mean, \(\mathbf{t}\) is the vector of the random effects of KEGG pathways with a preimposed normal distribution defined by \(\mathrm{N}\left(0,{\mathbf{V}\upsigma }_{\mathrm{t}}^{2}\right)\), \(\mathbf{e}\) is the vector of residuals distributed as \(\mathrm{N}\left(0,{\mathbf{I}\upsigma }_{\mathrm{e}}^{2}\right)\), \(\mathbf{Z}\) is the incidence matrix for \(\mathbf{t}\). Note that if multiple SNPs were identified within a gene, only one SNP with the strongest effect was included in \(\mathbf{y}\), so that each gene is represented by a single variant. The similarity between KEGG pathways \(\mathrm{i}\) and \(\mathrm{j}\), was introduced into the model by incorporating a non-diagonal KEGG covariance matrix \(\mathbf{V}\). This covariance was expressed by the Jaccard similarity coefficient:
$$\mathrm{J}\left(\mathrm{i},\mathrm{j}\right)=\frac{\mathrm{M}}{\mathrm{N}},$$
(2)
where \(\mathrm{M}\) represents the number of genes shared between KEGG pathways \(\mathrm{i}\) and \(\mathrm{j}\), while \(\mathrm{N}\) represents the total number of genes involved in KEGG pathways \(\mathrm{i}\) and \(\mathrm{j}\). Variance components were assumed to be known: \({\upsigma }_{\mathrm{t}}^{2}=0.3{\upsigma }_{\mathrm{y}}^{2}\) and \({\upsigma }_{\mathrm{e}}^{2}=0.7{\upsigma }_{\mathrm{y}}^{2}.\)
The mixed model equations [9] were used to obtain solutions for \(\upmu\) and \(\mathbf{t}\):
$$\left[\begin{array}{c}\widehat{{\varvec{\upmu}}}\\ \widehat{\mathbf{t}}\end{array}\right]={\left[\begin{array}{cc}{\mathbf{1}}^{\mathrm{T}}{\mathbf{R}}^{\mathbf{-1}}{\mathbf{1}}& {\mathbf{1}}^{\mathrm{T}}{\mathbf{R}}^{\mathbf{-1}}\mathbf{Z}\\ {\mathbf{Z}}^{\mathrm{T}}{\mathbf{R}}^{\mathbf{-1}}{\mathbf{1}}& {\mathbf{Z}}^{\mathrm{T}}{\mathbf{R}}^{\mathbf{-1}}\mathbf{Z}+{\mathbf{G}}^{\mathbf{-1}}\end{array}\right]}^{\mathbf{-1}}\left[\begin{array}{c}{\mathbf{1}}^{\mathrm{T}}{\mathbf{R}}^{\mathbf{-1}}\mathbf{y}\\ {\mathbf{Z}}^{\mathrm{T}}{\mathbf{R}}^{\mathbf{-1}}\mathbf{y}\end{array}\right],$$
(3)
where \(\mathbf{R}=\mathbf{I}{\widehat{\upsigma }}_{\mathrm{e}}^{2}\) and \(\mathbf{G}=\mathbf{V}{\widehat{\upsigma }}_{\mathrm{t}}^{2}\).
To maximise the computational performance of the estimation/prediction process, a custom Python program implementing the NumPy 1.19.5 library [10] was used. Since all calculations were carried out on a high-performance server, the NumPy library was also used to set the array indexing and order, which further improved the computing time compared to a native Python application. Each element of \(\widehat{\mathbf{t}}\) was assessed for significance (\({\mathrm{H}}_{0}:{\widehat{\mathrm{t}}}_{\mathrm{i}}\le 0\) vs. \({\mathrm{H}}_{1}:{\widehat{\mathrm{t}}}_{\mathrm{i}}>0\)) by calculating the probability of obtaining a more extreme value from the \(\mathrm{N}\left(0,{\upsigma }_{\mathrm{t}}^{2}\right)\) density function. Since NumPy and SciPy application programming interfaces (API) are implemented with LAPACK and BLAS, which require Fortran memory layout, all input matrices were transformed to Fortran to avoid costly transposing. In comparison to a fixed matrix input, this approach results in a ten times faster estimation process.